{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Polars Features in Data-Wrangler\n", "\n", "This tutorial explores advanced Polars capabilities that you can leverage through data-wrangler's backend system. We'll cover lazy evaluation, advanced data type handling, cross-backend workflows, and optimization techniques.\n", "\n", "## Overview\n", "\n", "Polars offers several advanced features that make it ideal for sophisticated data processing:\n", "\n", "- **Lazy Evaluation**: Query optimization and efficient execution\n", "- **Columnar Storage**: Memory-efficient data representation\n", "- **Parallel Processing**: Built-in multi-threading\n", "- **Type System**: Rich data type support\n", "- **Interoperability**: Seamless conversion with other formats\n", "\n", "Let's explore these features in the context of data-wrangler!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datawrangler as dw\n", "import numpy as np\n", "import pandas as pd\n", "import polars as pl\n", "import time\n", "from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend\n", "from datawrangler.zoo.polars_dataframe import polars_to_pandas, pandas_to_polars\n", "import matplotlib.pyplot as plt\n", "\n", "print(\"🚀 Advanced Polars tutorial environment loaded!\")\n", "print(f\"Current backend: {get_dataframe_backend()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Working with Polars LazyFrames\n", "\n", "A Polars LazyFrame records operations as a query plan and runs nothing until `.collect()` is called, which lets the optimizer push filters and column selections down before execution. For multi-step queries this can improve performance significantly."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample data for lazy evaluation demonstration\n", "large_array = np.random.rand(50000, 10)\n", "print(f\"Created test array: {large_array.shape[0]:,} rows x {large_array.shape[1]} columns\")\n", "\n", "# Convert to Polars DataFrame\n", "polars_df = dw.wrangle(large_array, backend='polars')\n", "print(f\"Polars DataFrame type: {type(polars_df)}\")\n", "print(f\"Shape: {polars_df.shape}\")\n", "\n", "# Convert to LazyFrame for optimization\n", "lazy_df = polars_df.lazy()\n", "print(f\"\\nLazyFrame type: {type(lazy_df)}\")\n", "print(\"LazyFrame created - no computation performed yet!\")\n", "\n", "# Build a complex query\n", "lazy_result = (\n", "    lazy_df\n", "    .filter(pl.col(\"column_0\") > 0.5)\n", "    .with_columns([\n", "        (pl.col(\"column_1\") * 2).alias(\"doubled_col1\"),\n", "        (pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_col23\")\n", "    ])\n", "    .group_by(pl.col(\"doubled_col1\").round(1))\n", "    .agg([\n", "        pl.col(\"sum_col23\").mean().alias(\"avg_sum\"),\n", "        pl.col(\"column_0\").count().alias(\"count\")\n", "    ])\n", ")\n", "\n", "print(\"\\n📋 Query plan created (still lazy):\")\n", "print(lazy_result.explain())\n", "\n", "# Execute the query\n", "print(\"\\n⚡ Executing optimized query...\")\n", "start_time = time.time()\n", "result = lazy_result.collect()\n", "execution_time = time.time() - start_time\n", "\n", "print(f\"✅ Query executed in {execution_time:.4f} seconds\")\n", "print(f\"Result shape: {result.shape}\")\n", "print(\"\\nFirst few rows:\")\n", "print(result.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Advanced Data Type Handling\n", "\n", "Polars has a rich type system that can be leveraged for efficient data processing."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create diverse data types for testing\n", "mixed_data = {\n", "    'integers': np.random.randint(0, 100, 1000),\n", "    'floats': np.random.rand(1000),\n", "    'strings': [f\"item_{i}\" for i in range(1000)],\n", "    'booleans': np.random.choice([True, False], 1000),\n", "    'dates': pd.date_range('2023-01-01', periods=1000, freq='h')\n", "}\n", "\n", "# Create pandas DataFrame first\n", "pandas_df = pd.DataFrame(mixed_data)\n", "print(\"📊 Original pandas DataFrame info:\")\n", "print(pandas_df.dtypes)\n", "print(f\"Memory usage: {pandas_df.memory_usage(deep=True).sum() / 1024:.1f} KB\")\n", "\n", "# Convert to Polars and examine types\n", "polars_mixed = dw.wrangle(pandas_df, backend='polars')\n", "print(\"\\n🚀 Polars DataFrame schema:\")\n", "print(polars_mixed.schema)\n", "print(f\"Memory usage: {polars_mixed.estimated_size() / 1024:.1f} KB\")\n", "\n", "# Demonstrate type-specific operations\n", "print(\"\\n🔧 Type-specific operations:\")\n", "\n", "# String operations\n", "string_ops = polars_mixed.select([\n", "    pl.col(\"strings\").str.len_chars().alias(\"string_lengths\"),\n", "    pl.col(\"strings\").str.to_uppercase().alias(\"uppercase\"),\n", "    pl.col(\"strings\").str.contains(\"item_1\").alias(\"contains_item1\")\n", "])\n", "print(\"String operations result:\")\n", "print(string_ops.head())\n", "\n", "# Date operations\n", "date_ops = polars_mixed.select([\n", "    pl.col(\"dates\").dt.year().alias(\"year\"),\n", "    pl.col(\"dates\").dt.month().alias(\"month\"),\n", "    pl.col(\"dates\").dt.weekday().alias(\"weekday\")\n", "])\n", "print(\"\\nDate operations result:\")\n", "print(date_ops.head())\n", "\n", "# Numerical operations with conditions\n", "numerical_ops = polars_mixed.select([\n", "    pl.when(pl.col(\"integers\") > 50)\n", "      .then(pl.col(\"floats\") * 2)\n", "      .otherwise(pl.col(\"floats\"))\n", "      .alias(\"conditional_floats\"),\n",
"    pl.col(\"integers\").cast(pl.Float64).alias(\"integers_as_float\")\n", "])\n", "print(\"\\nNumerical operations result:\")\n", "print(numerical_ops.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Cross-Backend Workflows\n", "\n", "One of data-wrangler's strengths is the ability to seamlessly switch between backends within the same workflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a complex workflow that leverages both backends\n", "print(\"🔄 CROSS-BACKEND WORKFLOW DEMONSTRATION\")\n", "print(\"=\" * 50)\n", "\n", "# Step 1: Start with raw data\n", "raw_data = np.random.rand(5000, 8)\n", "print(f\"1. Raw data: {raw_data.shape}\")\n", "\n", "# Step 2: Initial processing with Polars (fast)\n", "print(\"2. Initial processing with Polars backend...\")\n", "polars_processed = dw.wrangle(raw_data, backend='polars')\n", "print(f\" Result type: {type(polars_processed)}\")\n", "\n", "# Step 3: Add computed columns using Polars operations\n", "polars_enhanced = polars_processed.with_columns([\n", "    (pl.col(\"column_0\") + pl.col(\"column_1\")).alias(\"sum_01\"),\n", "    (pl.col(\"column_2\") * pl.col(\"column_3\")).alias(\"product_23\"),\n", "    pl.col(\"column_4\").rolling_mean(window_size=10).alias(\"rolling_mean_4\")\n", "])\n", "print(f\"3. Enhanced with computed columns: {polars_enhanced.shape}\")\n", "\n", "# Step 4: Convert to pandas for specialized analysis\n", "print(\"4. Converting to pandas for specialized operations...\")\n", "pandas_df = polars_to_pandas(polars_enhanced)\n", "print(f\" Converted type: {type(pandas_df)}\")\n", "\n", "# Step 5: Use pandas-specific functionality (e.g., a correlation matrix)\n", "correlation_matrix = pandas_df.corr()\n", "print(f\"5. Computed correlation matrix: {correlation_matrix.shape}\")\n", "\n", "# Step 6: Convert back to Polars for final processing\n",
"print(\"6. Converting back to Polars for final aggregations...\")\n", "final_polars = pandas_to_polars(pandas_df)\n", "\n", "# Step 7: Final aggregations with Polars\n", "summary_stats = final_polars.select([\n", "    pl.all().mean().name.suffix(\"_mean\"),\n", "    pl.all().std().name.suffix(\"_std\"),\n", "    pl.all().min().name.suffix(\"_min\"),\n", "    pl.all().max().name.suffix(\"_max\")\n", "])\n", "\n", "print(\"7. Final summary statistics computed!\")\n", "print(f\" Summary shape: {summary_stats.shape}\")\n", "\n", "print(\"\\n✅ Cross-backend workflow completed successfully!\")\n", "print(\" Benefits: Polars speed + Pandas ecosystem compatibility\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Parallel Processing and Performance Optimization\n", "\n", "Polars automatically leverages multiple CPU cores. Let's demonstrate this capability." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import multiprocessing\n", "\n", "print(f\"🖥️ Available CPU cores: {multiprocessing.cpu_count()}\")\n", "print(\"Polars will automatically use multiple cores for operations!\")\n", "\n", "# Create large dataset for parallel processing demonstration\n", "large_dataset = np.random.rand(100000, 20)\n", "print(f\"\\n📊 Large dataset: {large_dataset.shape[0]:,} rows x {large_dataset.shape[1]} columns\")\n", "\n", "# Convert to Polars DataFrame\n", "polars_large = dw.wrangle(large_dataset, backend='polars')\n", "\n", "# Demonstrate parallel operations\n", "print(\"\\n⚡ Performing parallel operations...\")\n", "\n", "# Complex aggregation that benefits from parallelization\n", "start_time = time.time()\n", "parallel_result = polars_large.select([\n", "    # Multiple statistical operations across all columns\n", "    pl.all().mean().name.suffix(\"_mean\"),\n", "    pl.all().std().name.suffix(\"_std\"),\n", "    pl.all().quantile(0.25).name.suffix(\"_q25\"),\n", "    pl.all().quantile(0.75).name.suffix(\"_q75\"),\n", "    pl.all().skew().name.suffix(\"_skew\")\n", "])\n",
"parallel_time = time.time() - start_time\n", "\n", "print(f\"✅ Parallel operations completed in {parallel_time:.4f} seconds\")\n", "print(f\"Result shape: {parallel_result.shape}\")\n", "\n", "# Demonstrate lazy evaluation with parallel execution\n", "print(\"\\n🔄 Lazy evaluation with parallel execution...\")\n", "start_time = time.time()\n", "\n", "lazy_parallel = (\n", "    polars_large.lazy()\n", "    .filter(pl.col(\"column_0\") > 0.3)\n", "    .group_by((pl.col(\"column_1\") * 10).round().cast(pl.Int64))\n", "    .agg([\n", "        pl.col(\"column_2\").sum().alias(\"sum_col2\"),\n", "        pl.col(\"column_3\").mean().alias(\"mean_col3\"),\n", "        pl.col(\"column_4\").count().alias(\"count\")\n", "    ])\n", "    .sort(\"count\", descending=True)\n", "    .collect()  # Execute with optimization\n", ")\n", "\n", "lazy_parallel_time = time.time() - start_time\n", "print(f\"✅ Lazy parallel operations completed in {lazy_parallel_time:.4f} seconds\")\n", "print(f\"Result shape: {lazy_parallel.shape}\")\n", "\n", "# Show memory efficiency\n", "print(\"\\n💾 Memory efficiency demonstration:\")\n", "memory_efficient = (\n", "    polars_large.lazy()\n", "    .select([\"column_0\", \"column_1\", \"column_2\"])  # Select only needed columns\n", "    .filter(pl.col(\"column_0\") > 0.5)  # Filter early\n", "    .with_columns([\n", "        (pl.col(\"column_1\") * pl.col(\"column_2\")).alias(\"product\")\n", "    ])\n", "    .collect()\n", ")\n", "print(f\"Memory-efficient query result: {memory_efficient.shape}\")\n", "print(\"Benefits: Early filtering, column selection, and optimized execution\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Advanced Text Processing with Polars\n", "\n", "Polars excels at string operations and can significantly speed up text processing workflows."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample text data for advanced processing\n", "sample_texts = [\n", "    \"Machine learning algorithms process vast amounts of data efficiently.\",\n", "    \"Data science combines statistical methods with computational techniques.\",\n", "    \"Artificial intelligence enables automated decision-making systems.\",\n", "    \"Deep learning neural networks learn complex patterns automatically.\",\n", "    \"Natural language processing understands human communication patterns.\",\n", "    \"Computer vision analyzes visual information from digital images.\",\n", "    \"Big data analytics extracts insights from massive datasets.\",\n", "    \"Cloud computing provides scalable infrastructure for applications.\"\n", "] * 1000  # Scale up for performance testing\n", "\n", "print(f\"📝 Text dataset: {len(sample_texts):,} documents\")\n", "\n", "# Process with Polars backend for embeddings\n", "print(\"\\n🚀 Processing text with Polars backend...\")\n", "start_time = time.time()\n", "text_embeddings = dw.wrangle(sample_texts, backend='polars')\n", "text_processing_time = time.time() - start_time\n", "\n", "print(f\"✅ Text processing completed in {text_processing_time:.4f} seconds\")\n", "print(f\"Embeddings shape: {text_embeddings.shape}\")\n", "print(f\"Embeddings type: {type(text_embeddings)}\")\n", "\n", "# Demonstrate advanced string operations on original text\n", "print(\"\\n🔧 Advanced string operations with Polars...\")\n", "\n", "# Create a DataFrame with the original texts\n", "text_df = pl.DataFrame({\"text\": sample_texts[:100]})  # Use subset for demonstration\n", "\n", "# Perform various string operations\n", "text_analysis = text_df.with_columns([\n", "    pl.col(\"text\").str.len_chars().alias(\"char_count\"),\n", "    pl.col(\"text\").str.len_bytes().alias(\"byte_count\"),\n", "    pl.col(\"text\").str.len_chars().alias(\"n_chars\"),\n",
"    pl.col(\"text\").str.split(\" \").list.len().alias(\"word_count\"),\n", "    pl.col(\"text\").str.to_lowercase().alias(\"lowercase\"),\n", "    pl.col(\"text\").str.contains(\"data\").alias(\"contains_data\"),\n", "    pl.col(\"text\").str.extract(r\"(\\w+ing)\", 1).alias(\"words_ending_ing\"),\n", "    pl.col(\"text\").str.replace_all(\"data\", \"information\").alias(\"replaced_text\")\n", "])\n", "\n", "print(\"String analysis results (first 3 rows):\")\n", "print(text_analysis.head(3))\n", "\n", "# Text statistics\n", "text_stats = text_analysis.select([\n", "    pl.col(\"char_count\").mean().alias(\"avg_chars\"),\n", "    pl.col(\"word_count\").mean().alias(\"avg_words\"),\n", "    pl.col(\"contains_data\").sum().alias(\"docs_with_data\"),\n", "    pl.col(\"words_ending_ing\").filter(pl.col(\"words_ending_ing\").is_not_null()).count().alias(\"docs_with_ing_words\")\n", "])\n", "\n", "print(\"\\n📊 Text statistics:\")\n", "print(text_stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Global Backend Management\n", "\n", "Data-wrangler provides flexible backend management for different parts of your workflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"⚙️ GLOBAL BACKEND MANAGEMENT\")\n", "print(\"=\" * 40)\n", "\n", "# Check current backend\n", "current_backend = get_dataframe_backend()\n", "print(f\"Current default backend: {current_backend}\")\n", "\n", "# Demonstrate backend switching workflow\n", "test_data = np.random.rand(1000, 5)\n", "\n", "print(\"\\n1. Using current backend (no explicit specification):\")\n", "result1 = dw.wrangle(test_data)\n", "print(f\" Result type: {type(result1)}\")\n", "\n", "print(\"\\n2. Explicitly using Polars backend:\")\n", "result2 = dw.wrangle(test_data, backend='polars')\n", "print(f\" Result type: {type(result2)}\")\n", "\n",
"print(\"\\n3. Explicitly using pandas backend:\")\n", "result3 = dw.wrangle(test_data, backend='pandas')\n", "print(f\" Result type: {type(result3)}\")\n", "\n", "# Change global backend\n", "print(\"\\n4. Changing global backend to Polars:\")\n", "set_dataframe_backend('polars')\n", "print(f\" New default backend: {get_dataframe_backend()}\")\n", "\n", "result4 = dw.wrangle(test_data)  # Now uses Polars by default\n", "print(f\" Result type: {type(result4)}\")\n", "\n", "# Reset to pandas\n", "print(\"\\n5. Resetting to pandas backend:\")\n", "set_dataframe_backend('pandas')\n", "print(f\" Reset to: {get_dataframe_backend()}\")\n", "\n", "# Demonstrate context-aware processing\n", "print(\"\\n6. Context-aware processing example:\")\n", "\n", "def smart_processing(data, size_threshold=10000):\n", "    \"\"\"Automatically choose backend based on data size.\"\"\"\n", "    data_size = np.prod(data.shape) if hasattr(data, 'shape') else len(data)\n", "\n", "    if data_size > size_threshold:\n", "        print(f\" Large dataset detected ({data_size:,} elements) - using Polars\")\n", "        return dw.wrangle(data, backend='polars')\n", "    else:\n", "        print(f\" Small dataset detected ({data_size:,} elements) - using pandas\")\n", "        return dw.wrangle(data, backend='pandas')\n", "\n", "# Test with different sizes\n", "small_data = np.random.rand(50, 10)  # 500 elements\n", "large_data = np.random.rand(200, 100)  # 20,000 elements\n", "\n", "small_result = smart_processing(small_data)\n", "large_result = smart_processing(large_data)\n", "\n", "print(f\" Small result type: {type(small_result)}\")\n", "print(f\" Large result type: {type(large_result)}\")\n", "\n", "print(\"\\n✅ Backend management demonstration complete!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Performance Monitoring and Profiling\n", "\n", "Let's create tools to monitor and profile Polars performance in your workflows."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import functools\n", "import time\n", "from typing import Any, Callable\n", "\n", "class PerformanceProfiler:\n", "    \"\"\"Simple performance profiler for data-wrangler operations.\"\"\"\n", "\n", "    def __init__(self):\n", "        self.operations = []\n", "\n", "    def profile_operation(self, name: str):\n", "        \"\"\"Decorator to profile operations.\"\"\"\n", "        def decorator(func: Callable) -> Callable:\n", "            @functools.wraps(func)\n", "            def wrapper(*args, **kwargs) -> Any:\n", "                start_time = time.time()\n", "                result = func(*args, **kwargs)\n", "                end_time = time.time()\n", "\n", "                self.operations.append({\n", "                    'name': name,\n", "                    'duration': end_time - start_time,\n", "                    'result_type': type(result).__name__\n", "                })\n", "\n", "                return result\n", "            return wrapper\n", "        return decorator\n", "\n", "    def get_summary(self):\n", "        \"\"\"Get performance summary.\"\"\"\n", "        if not self.operations:\n", "            return \"No operations recorded\"\n", "\n", "        total_time = sum(op['duration'] for op in self.operations)\n", "        avg_time = total_time / len(self.operations)\n", "\n", "        summary = f\"📊 Performance Summary:\\n\"\n", "        summary += f\"Total operations: {len(self.operations)}\\n\"\n", "        summary += f\"Total time: {total_time:.4f}s\\n\"\n", "        summary += f\"Average time: {avg_time:.4f}s\\n\\n\"\n", "\n", "        summary += \"Operation details:\\n\"\n", "        for i, op in enumerate(self.operations, 1):\n",
"            summary += f\"{i}. {op['name']}: {op['duration']:.4f}s ({op['result_type']})\\n\"\n", "\n", "        return summary\n", "\n", "# Create profiler instance\n", "profiler = PerformanceProfiler()\n", "\n", "# Define profiled operations\n", "@profiler.profile_operation(\"Array to Polars\")\n", "def array_to_polars(data):\n", "    return dw.wrangle(data, backend='polars')\n", "\n", "@profiler.profile_operation(\"Array to Pandas\")\n", "def array_to_pandas(data):\n", "    return dw.wrangle(data, backend='pandas')\n", "\n", "@profiler.profile_operation(\"Text to Polars\")\n", "def text_to_polars(data):\n", "    return dw.wrangle(data, backend='polars')\n", "\n", "@profiler.profile_operation(\"Cross-backend conversion\")\n", "def cross_backend_conversion(data):\n", "    polars_df = dw.wrangle(data, backend='polars')\n", "    pandas_df = polars_to_pandas(polars_df)\n", "    return pandas_to_polars(pandas_df)\n", "\n", "# Run performance tests\n", "print(\"🏃‍♂️ Running performance profiling tests...\")\n", "\n", "test_array = np.random.rand(5000, 10)\n", "test_texts = sample_texts[:100]\n", "\n", "# Profile different operations\n", "result1 = array_to_polars(test_array)\n", "result2 = array_to_pandas(test_array)\n", "result3 = text_to_polars(test_texts)\n", "result4 = cross_backend_conversion(test_array)\n", "\n", "print(\"\\n\" + profiler.get_summary())\n", "\n", "# Create performance visualization\n", "fig, ax = plt.subplots(figsize=(10, 6))\n", "operations = [op['name'] for op in profiler.operations]\n", "durations = [op['duration'] for op in profiler.operations]\n", "colors = ['#ff7f0e', '#1f77b4', '#ff7f0e', '#2ca02c']\n", "\n", "bars = ax.bar(operations, durations, color=colors)\n", "ax.set_title('Operation Performance Comparison')\n", "ax.set_ylabel('Time (seconds)')\n", "ax.tick_params(axis='x', rotation=45)\n", "\n", "# Add value labels on bars\n", "for bar, duration in zip(bars, durations):\n", "    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,\n",
"            f'{duration:.4f}s', ha='center', va='bottom')\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"✅ Performance profiling complete!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best Practices and Optimization Tips\n", "\n", "Here are key recommendations for getting the most out of Polars in data-wrangler:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"💡 POLARS OPTIMIZATION BEST PRACTICES\")\n", "print(\"=\" * 50)\n", "\n", "best_practices = {\n", "    \"1. Use Lazy Evaluation\": {\n", "        \"description\": \"Convert to LazyFrame for complex operations\",\n", "        \"example\": \"df.lazy().filter(...).group_by(...).collect()\",\n", "        \"benefit\": \"Query optimization and better performance\"\n", "    },\n", "    \"2. Filter Early\": {\n", "        \"description\": \"Apply filters before expensive operations\",\n", "        \"example\": \"df.filter(condition).expensive_operation()\",\n", "        \"benefit\": \"Reduces data size for subsequent operations\"\n", "    },\n", "    \"3. Select Columns Wisely\": {\n", "        \"description\": \"Only select needed columns\",\n", "        \"example\": \"df.select(['col1', 'col2']).process()\",\n", "        \"benefit\": \"Lower memory usage and faster processing\"\n", "    },\n", "    \"4. Leverage Parallel Processing\": {\n", "        \"description\": \"Polars automatically uses multiple cores\",\n", "        \"example\": \"Large aggregations benefit automatically\",\n", "        \"benefit\": \"Faster execution on multi-core systems\"\n", "    },\n", "    \"5. Use Appropriate Data Types\": {\n", "        \"description\": \"Cast to optimal types for your data\",\n", "        \"example\": \"df.cast({'col': pl.Int32})\",\n", "        \"benefit\": \"Reduced memory usage and faster operations\"\n", "    },\n",
"    \"6. Batch Operations\": {\n", "        \"description\": \"Combine multiple operations in single call\",\n", "        \"example\": \"df.with_columns([op1, op2, op3])\",\n", "        \"benefit\": \"Reduced overhead and better optimization\"\n", "    }\n", "}\n", "\n", "for i, (practice, details) in enumerate(best_practices.items(), 1):\n", "    print(f\"\\n{practice}:\")\n", "    print(f\" 📋 {details['description']}\")\n", "    print(f\" 💻 Example: {details['example']}\")\n", "    print(f\" ✅ Benefit: {details['benefit']}\")\n", "\n", "# Demonstrate good vs bad practices\n", "print(\"\\n\\n🚨 PERFORMANCE COMPARISON: Good vs Bad Practices\")\n", "print(\"=\" * 60)\n", "\n", "large_test_data = np.random.rand(50000, 15)\n", "polars_df = dw.wrangle(large_test_data, backend='polars')\n", "\n", "# Bad practice: Multiple separate operations\n", "print(\"❌ Bad practice (multiple separate operations):\")\n", "start = time.time()\n", "bad_result = polars_df.filter(pl.col(\"column_0\") > 0.5)\n", "bad_result = bad_result.with_columns([(pl.col(\"column_1\") * 2).alias(\"doubled\")])\n", "bad_result = bad_result.with_columns([(pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_23\")])\n", "bad_result = bad_result.group_by(\"doubled\").agg([pl.col(\"sum_23\").mean()])\n", "bad_time = time.time() - start\n", "print(f\" Time: {bad_time:.4f}s\")\n", "\n", "# Good practice: Lazy evaluation with batched operations\n", "print(\"\\n✅ Good practice (lazy evaluation + batched operations):\")\n", "start = time.time()\n", "good_result = (\n", "    polars_df.lazy()\n", "    .filter(pl.col(\"column_0\") > 0.5)\n", "    .with_columns([\n", "        (pl.col(\"column_1\") * 2).alias(\"doubled\"),\n", "        (pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_23\")\n", "    ])\n", "    .group_by(\"doubled\")\n", "    .agg([pl.col(\"sum_23\").mean()])\n", "    .collect()\n", ")\n", "good_time = time.time() - start\n", "print(f\" Time: {good_time:.4f}s\")\n", "\n", "# A single timed run is noisy; a negative value means the eager version was faster this time\n", "improvement = (bad_time - good_time) / bad_time * 100\n",
"print(f\"\\n🚀 Lazy vs. eager time difference: {improvement:+.1f}% (positive means the lazy version was faster)\")\n", "\n", "print(\"\\n✅ Advanced Polars tutorial complete!\")\n", "print(\"You're now ready to leverage Polars' full power in data-wrangler! 🎉\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered advanced Polars features in data-wrangler:\n", "\n", "### 🚀 Key Capabilities Explored\n", "\n", "1. **Lazy Evaluation**: Query optimization for complex operations\n", "2. **Advanced Data Types**: Rich type system with specialized operations\n", "3. **Cross-Backend Workflows**: Seamless switching between pandas and Polars\n", "4. **Parallel Processing**: Automatic multi-core utilization\n", "5. **Text Processing**: High-performance string operations\n", "6. **Backend Management**: Flexible global and per-operation control\n", "7. **Performance Profiling**: Tools for monitoring and optimization\n", "\n", "### 💡 Best Practices Learned\n", "\n", "- Use lazy evaluation for complex queries\n", "- Filter early and select only needed columns\n", "- Batch operations for better performance\n", "- Leverage automatic parallel processing\n", "- Choose appropriate data types\n", "- Mix backends as needed for optimal workflows\n", "\n", "### 🎯 When to Use Advanced Features\n", "\n", "- **Large datasets** (>10,000 rows): Use lazy evaluation\n", "- **Complex transformations**: Batch operations and optimize queries\n", "- **Memory constraints**: Leverage columnar storage and early filtering\n", "- **Mixed workflows**: Combine pandas ecosystem with Polars performance\n", "\n",
"With these advanced techniques, you can build highly efficient data processing pipelines that scale from prototyping to production! 🌟" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }