{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Polars Features in Data-Wrangler\n",
    "\n",
    "This tutorial explores advanced Polars capabilities that you can leverage through data-wrangler's backend system. We'll cover lazy evaluation, advanced data type handling, cross-backend workflows, and optimization techniques.\n",
    "\n",
    "## Overview\n",
    "\n",
    "Polars offers several advanced features that make it ideal for sophisticated data processing:\n",
    "\n",
    "- **Lazy Evaluation**: Query optimization and efficient execution\n",
    "- **Columnar Storage**: Memory-efficient data representation\n",
    "- **Parallel Processing**: Built-in multi-threading\n",
    "- **Type System**: Rich data type support\n",
    "- **Interoperability**: Seamless conversion with other formats\n",
    "\n",
    "Let's explore these features in the context of data-wrangler!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datawrangler as dw\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import polars as pl\n",
    "import time\n",
    "from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend\n",
    "from datawrangler.zoo.polars_dataframe import polars_to_pandas, pandas_to_polars\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "print(\"🚀 Advanced Polars tutorial environment loaded!\")\n",
    "print(f\"Current backend: {get_dataframe_backend()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Working with Polars LazyFrames\n",
    "\n",
    "Polars LazyFrames enable query optimization and can provide significant performance improvements for complex operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create sample data for lazy evaluation demonstration\n",
    "large_array = np.random.rand(50000, 10)\n",
    "print(f\"Created test array: {large_array.shape[0]:,} rows x {large_array.shape[1]} columns\")\n",
    "\n",
    "# Convert to Polars DataFrame\n",
    "polars_df = dw.wrangle(large_array, backend='polars')\n",
    "print(f\"Polars DataFrame type: {type(polars_df)}\")\n",
    "print(f\"Shape: {polars_df.shape}\")\n",
    "\n",
    "# Convert to LazyFrame for optimization\n",
    "lazy_df = polars_df.lazy()\n",
    "print(f\"\\nLazyFrame type: {type(lazy_df)}\")\n",
    "print(\"LazyFrame created - no computation performed yet!\")\n",
    "\n",
    "# Build a complex query\n",
    "lazy_result = (\n",
    "    lazy_df\n",
    "    .filter(pl.col(\"column_0\") > 0.5)\n",
    "    .with_columns([\n",
    "        (pl.col(\"column_1\") * 2).alias(\"doubled_col1\"),\n",
    "        (pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_col23\")\n",
    "    ])\n",
    "    .group_by(pl.col(\"doubled_col1\").round(1))\n",
    "    .agg([\n",
    "        pl.col(\"sum_col23\").mean().alias(\"avg_sum\"),\n",
    "        pl.col(\"column_0\").count().alias(\"count\")\n",
    "    ])\n",
    ")\n",
    "\n",
    "print(\"\\n📋 Query plan created (still lazy):\")\n",
    "print(lazy_result.explain())\n",
    "\n",
    "# Execute the query\n",
    "print(\"\\n⚡ Executing optimized query...\")\n",
    "start_time = time.time()\n",
    "result = lazy_result.collect()\n",
    "execution_time = time.time() - start_time\n",
    "\n",
    "print(f\"✅ Query executed in {execution_time:.4f} seconds\")\n",
    "print(f\"Result shape: {result.shape}\")\n",
    "print(\"\\nFirst few rows:\")\n",
    "print(result.head())"
   ]
  },
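  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the optimizer changes, it can help to compare the unoptimized and optimized plans for the same query. The cell below is a minimal sketch on a small hand-built LazyFrame (the `demo_*` names and the data are made up for illustration and are not part of data-wrangler). The exact plan text depends on your Polars version, so it is printed rather than described."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "# Hypothetical toy data, used only to illustrate plan inspection\n",
    "demo_lazy = pl.LazyFrame({\n",
    "    \"a\": [1, 2, 3, 4, 5, 6],\n",
    "    \"b\": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],\n",
    "    \"c\": [\"x\", \"y\", \"x\", \"y\", \"x\", \"y\"]\n",
    "})\n",
    "\n",
    "# Select first, filter second: the optimizer is free to reorder these steps\n",
    "demo_query = demo_lazy.select([\"a\", \"b\"]).filter(pl.col(\"a\") > 3)\n",
    "\n",
    "# explain() returns the plan as a string; optimized=False shows it as written\n",
    "demo_plan_raw = demo_query.explain(optimized=False)\n",
    "demo_plan_opt = demo_query.explain()  # optimizations are enabled by default\n",
    "\n",
    "print(\"Plan as written:\")\n",
    "print(demo_plan_raw)\n",
    "print(\"\\nPlan after optimization:\")\n",
    "print(demo_plan_opt)\n",
    "\n",
    "# Both plans describe the same result; only the execution strategy differs\n",
    "demo_result = demo_query.collect()\n",
    "print(demo_result)"
   ]
  },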
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Advanced Data Type Handling\n",
    "\n",
    "Polars has a rich type system that can be leveraged for efficient data processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create diverse data types for testing\n",
    "mixed_data = {\n",
    "    'integers': np.random.randint(0, 100, 1000),\n",
    "    'floats': np.random.rand(1000),\n",
    "    'strings': [f\"item_{i}\" for i in range(1000)],\n",
    "    'booleans': np.random.choice([True, False], 1000),\n",
    "    'dates': pd.date_range('2023-01-01', periods=1000, freq='H')\n",
    "}\n",
    "\n",
    "# Create pandas DataFrame first\n",
    "pandas_df = pd.DataFrame(mixed_data)\n",
    "print(\"📊 Original pandas DataFrame info:\")\n",
    "print(pandas_df.dtypes)\n",
    "print(f\"Memory usage: {pandas_df.memory_usage(deep=True).sum() / 1024:.1f} KB\")\n",
    "\n",
    "# Convert to Polars and examine types\n",
    "polars_mixed = dw.wrangle(pandas_df, backend='polars')\n",
    "print(\"\\n🚀 Polars DataFrame schema:\")\n",
    "print(polars_mixed.schema)\n",
    "print(f\"Memory usage: {polars_mixed.estimated_size() / 1024:.1f} KB\")\n",
    "\n",
    "# Demonstrate type-specific operations\n",
    "print(\"\\n🔧 Type-specific operations:\")\n",
    "\n",
    "# String operations\n",
    "string_ops = polars_mixed.select([\n",
    "    pl.col(\"strings\").str.lengths().alias(\"string_lengths\"),\n",
    "    pl.col(\"strings\").str.to_uppercase().alias(\"uppercase\"),\n",
    "    pl.col(\"strings\").str.contains(\"item_1\").alias(\"contains_item1\")\n",
    "])\n",
    "print(\"String operations result:\")\n",
    "print(string_ops.head())\n",
    "\n",
    "# Date operations\n",
    "date_ops = polars_mixed.select([\n",
    "    pl.col(\"dates\").dt.year().alias(\"year\"),\n",
    "    pl.col(\"dates\").dt.month().alias(\"month\"),\n",
    "    pl.col(\"dates\").dt.weekday().alias(\"weekday\")\n",
    "])\n",
    "print(\"\\nDate operations result:\")\n",
    "print(date_ops.head())\n",
    "\n",
    "# Numerical operations with conditions\n",
    "numerical_ops = polars_mixed.select([\n",
    "    pl.when(pl.col(\"integers\") > 50)\n",
    "      .then(pl.col(\"floats\") * 2)\n",
    "      .otherwise(pl.col(\"floats\"))\n",
    "      .alias(\"conditional_floats\"),\n",
    "    pl.col(\"integers\").cast(pl.Float64).alias(\"integers_as_float\")\n",
    "])\n",
    "print(\"\\nNumerical operations result:\")\n",
    "print(numerical_ops.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Cross-Backend Workflows\n",
    "\n",
    "One of data-wrangler's strengths is the ability to seamlessly switch between backends within the same workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a complex workflow that leverages both backends\n",
    "print(\"🔄 CROSS-BACKEND WORKFLOW DEMONSTRATION\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "# Step 1: Start with raw data\n",
    "raw_data = np.random.rand(5000, 8)\n",
    "print(f\"1. Raw data: {raw_data.shape}\")\n",
    "\n",
    "# Step 2: Initial processing with Polars (fast)\n",
    "print(\"2. Initial processing with Polars backend...\")\n",
    "polars_processed = dw.wrangle(raw_data, backend='polars')\n",
    "print(f\"   Result type: {type(polars_processed)}\")\n",
    "\n",
    "# Step 3: Add computed columns using Polars operations\n",
    "polars_enhanced = polars_processed.with_columns([\n",
    "    (pl.col(\"column_0\") + pl.col(\"column_1\")).alias(\"sum_01\"),\n",
    "    (pl.col(\"column_2\") * pl.col(\"column_3\")).alias(\"product_23\"),\n",
    "    pl.col(\"column_4\").rolling_mean(window_size=10).alias(\"rolling_mean_4\")\n",
    "])\n",
    "print(f\"3. Enhanced with computed columns: {polars_enhanced.shape}\")\n",
    "\n",
    "# Step 4: Convert to pandas for specialized analysis\n",
    "print(\"4. Converting to pandas for specialized operations...\")\n",
    "pandas_df = polars_to_pandas(polars_enhanced)\n",
    "print(f\"   Converted type: {type(pandas_df)}\")\n",
    "\n",
    "# Step 5: Use pandas-specific functionality (e.g., complex plotting)\n",
    "correlation_matrix = pandas_df.corr()\n",
    "print(f\"5. Computed correlation matrix: {correlation_matrix.shape}\")\n",
    "\n",
    "# Step 6: Convert back to Polars for final processing\n",
    "print(\"6. Converting back to Polars for final aggregations...\")\n",
    "final_polars = pandas_to_polars(pandas_df)\n",
    "\n",
    "# Step 7: Final aggregations with Polars\n",
    "summary_stats = final_polars.select([\n",
    "    pl.all().mean().suffix(\"_mean\"),\n",
    "    pl.all().std().suffix(\"_std\"),\n",
    "    pl.all().min().suffix(\"_min\"),\n",
    "    pl.all().max().suffix(\"_max\")\n",
    "])\n",
    "\n",
    "print(\"7. Final summary statistics computed!\")\n",
    "print(f\"   Summary shape: {summary_stats.shape}\")\n",
    "\n",
    "print(\"\\n✅ Cross-backend workflow completed successfully!\")\n",
    "print(\"   Benefits: Polars speed + Pandas ecosystem compatibility\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Parallel Processing and Performance Optimization\n",
    "\n",
    "Polars automatically leverages multiple CPU cores. Let's demonstrate this capability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import multiprocessing\n",
    "\n",
    "print(f\"🖥️  Available CPU cores: {multiprocessing.cpu_count()}\")\n",
    "print(\"Polars will automatically use multiple cores for operations!\")\n",
    "\n",
    "# Create large dataset for parallel processing demonstration\n",
    "large_dataset = np.random.rand(100000, 20)\n",
    "print(f\"\\n📊 Large dataset: {large_dataset.shape[0]:,} rows x {large_dataset.shape[1]} columns\")\n",
    "\n",
    "# Convert to Polars DataFrame\n",
    "polars_large = dw.wrangle(large_dataset, backend='polars')\n",
    "\n",
    "# Demonstrate parallel operations\n",
    "print(\"\\n⚡ Performing parallel operations...\")\n",
    "\n",
    "# Complex aggregation that benefits from parallelization\n",
    "start_time = time.time()\n",
    "parallel_result = polars_large.select([\n",
    "    # Multiple statistical operations across all columns\n",
    "    pl.all().mean().suffix(\"_mean\"),\n",
    "    pl.all().std().suffix(\"_std\"),\n",
    "    pl.all().quantile(0.25).suffix(\"_q25\"),\n",
    "    pl.all().quantile(0.75).suffix(\"_q75\"),\n",
    "    pl.all().skew().suffix(\"_skew\")\n",
    "])\n",
    "parallel_time = time.time() - start_time\n",
    "\n",
    "print(f\"✅ Parallel operations completed in {parallel_time:.4f} seconds\")\n",
    "print(f\"Result shape: {parallel_result.shape}\")\n",
    "\n",
    "# Demonstrate lazy evaluation with parallel execution\n",
    "print(\"\\n🔄 Lazy evaluation with parallel execution...\")\n",
    "start_time = time.time()\n",
    "\n",
    "lazy_parallel = (\n",
    "    polars_large.lazy()\n",
    "    .filter(pl.col(\"column_0\") > 0.3)\n",
    "    .group_by((pl.col(\"column_1\") * 10).round().cast(pl.Int64))\n",
    "    .agg([\n",
    "        pl.col(\"column_2\").sum().alias(\"sum_col2\"),\n",
    "        pl.col(\"column_3\").mean().alias(\"mean_col3\"),\n",
    "        pl.col(\"column_4\").count().alias(\"count\")\n",
    "    ])\n",
    "    .sort(\"count\", descending=True)\n",
    "    .collect()  # Execute with optimization\n",
    ")\n",
    "\n",
    "lazy_parallel_time = time.time() - start_time\n",
    "print(f\"✅ Lazy parallel operations completed in {lazy_parallel_time:.4f} seconds\")\n",
    "print(f\"Result shape: {lazy_parallel.shape}\")\n",
    "\n",
    "# Show memory efficiency\n",
    "print(\"\\n💾 Memory efficiency demonstration:\")\n",
    "memory_efficient = (\n",
    "    polars_large.lazy()\n",
    "    .select([\"column_0\", \"column_1\", \"column_2\"])  # Select only needed columns\n",
    "    .filter(pl.col(\"column_0\") > 0.5)              # Filter early\n",
    "    .with_columns([\n",
    "        (pl.col(\"column_1\") * pl.col(\"column_2\")).alias(\"product\")\n",
    "    ])\n",
    "    .collect()\n",
    ")\n",
    "print(f\"Memory-efficient query result: {memory_efficient.shape}\")\n",
    "print(\"Benefits: Early filtering, column selection, and optimized execution\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Advanced Text Processing with Polars\n",
    "\n",
    "Polars excels at string operations and can significantly speed up text processing workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create sample text data for advanced processing\n",
    "sample_texts = [\n",
    "    \"Machine learning algorithms process vast amounts of data efficiently.\",\n",
    "    \"Data science combines statistical methods with computational techniques.\",\n",
    "    \"Artificial intelligence enables automated decision-making systems.\",\n",
    "    \"Deep learning neural networks learn complex patterns automatically.\",\n",
    "    \"Natural language processing understands human communication patterns.\",\n",
    "    \"Computer vision analyzes visual information from digital images.\",\n",
    "    \"Big data analytics extracts insights from massive datasets.\",\n",
    "    \"Cloud computing provides scalable infrastructure for applications.\"\n",
    "] * 1000  # Scale up for performance testing\n",
    "\n",
    "print(f\"📝 Text dataset: {len(sample_texts):,} documents\")\n",
    "\n",
    "# Process with Polars backend for embeddings\n",
    "print(\"\\n🚀 Processing text with Polars backend...\")\n",
    "start_time = time.time()\n",
    "text_embeddings = dw.wrangle(sample_texts, backend='polars')\n",
    "text_processing_time = time.time() - start_time\n",
    "\n",
    "print(f\"✅ Text processing completed in {text_processing_time:.4f} seconds\")\n",
    "print(f\"Embeddings shape: {text_embeddings.shape}\")\n",
    "print(f\"Embeddings type: {type(text_embeddings)}\")\n",
    "\n",
    "# Demonstrate advanced string operations on original text\n",
    "print(\"\\n🔧 Advanced string operations with Polars...\")\n",
    "\n",
    "# Create a DataFrame with the original texts\n",
    "text_df = pl.DataFrame({\"text\": sample_texts[:100]})  # Use subset for demonstration\n",
    "\n",
    "# Perform various string operations\n",
    "text_analysis = text_df.with_columns([\n",
    "    pl.col(\"text\").str.len_chars().alias(\"char_count\"),\n",
    "    pl.col(\"text\").str.len_bytes().alias(\"byte_count\"),\n",
    "    pl.col(\"text\").str.n_chars().alias(\"n_chars\"),\n",
    "    pl.col(\"text\").str.split(\" \").list.len().alias(\"word_count\"),\n",
    "    pl.col(\"text\").str.to_lowercase().alias(\"lowercase\"),\n",
    "    pl.col(\"text\").str.contains(\"data\").alias(\"contains_data\"),\n",
    "    pl.col(\"text\").str.extract(r\"(\\w+ing)\", 1).alias(\"words_ending_ing\"),\n",
    "    pl.col(\"text\").str.replace_all(\"data\", \"information\").alias(\"replaced_text\")\n",
    "])\n",
    "\n",
    "print(\"String analysis results (first 3 rows):\")\n",
    "print(text_analysis.head(3))\n",
    "\n",
    "# Text statistics\n",
    "text_stats = text_analysis.select([\n",
    "    pl.col(\"char_count\").mean().alias(\"avg_chars\"),\n",
    "    pl.col(\"word_count\").mean().alias(\"avg_words\"),\n",
    "    pl.col(\"contains_data\").sum().alias(\"docs_with_data\"),\n",
    "    pl.col(\"words_ending_ing\").filter(pl.col(\"words_ending_ing\").is_not_null()).count().alias(\"docs_with_ing_words\")\n",
    "])\n",
    "\n",
    "print(\"\\n📊 Text statistics:\")\n",
    "print(text_stats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Global Backend Management\n",
    "\n",
    "Data-wrangler provides flexible backend management for different parts of your workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"⚙️  GLOBAL BACKEND MANAGEMENT\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "# Check current backend\n",
    "current_backend = get_dataframe_backend()\n",
    "print(f\"Current default backend: {current_backend}\")\n",
    "\n",
    "# Demonstrate backend switching workflow\n",
    "test_data = np.random.rand(1000, 5)\n",
    "\n",
    "print(\"\\n1. Using current backend (no explicit specification):\")\n",
    "result1 = dw.wrangle(test_data)\n",
    "print(f\"   Result type: {type(result1)}\")\n",
    "\n",
    "print(\"\\n2. Explicitly using Polars backend:\")\n",
    "result2 = dw.wrangle(test_data, backend='polars')\n",
    "print(f\"   Result type: {type(result2)}\")\n",
    "\n",
    "print(\"\\n3. Explicitly using pandas backend:\")\n",
    "result3 = dw.wrangle(test_data, backend='pandas')\n",
    "print(f\"   Result type: {type(result3)}\")\n",
    "\n",
    "# Change global backend\n",
    "print(\"\\n4. Changing global backend to Polars:\")\n",
    "set_dataframe_backend('polars')\n",
    "print(f\"   New default backend: {get_dataframe_backend()}\")\n",
    "\n",
    "result4 = dw.wrangle(test_data)  # Now uses Polars by default\n",
    "print(f\"   Result type: {type(result4)}\")\n",
    "\n",
    "# Reset to pandas\n",
    "print(\"\\n5. Resetting to pandas backend:\")\n",
    "set_dataframe_backend('pandas')\n",
    "print(f\"   Reset to: {get_dataframe_backend()}\")\n",
    "\n",
    "# Demonstrate context-aware processing\n",
    "print(\"\\n6. Context-aware processing example:\")\n",
    "\n",
    "def smart_processing(data, size_threshold=10000):\n",
    "    \"\"\"Automatically choose backend based on data size.\"\"\"\n",
    "    data_size = np.prod(data.shape) if hasattr(data, 'shape') else len(data)\n",
    "    \n",
    "    if data_size > size_threshold:\n",
    "        print(f\"   Large dataset detected ({data_size:,} elements) - using Polars\")\n",
    "        return dw.wrangle(data, backend='polars')\n",
    "    else:\n",
    "        print(f\"   Small dataset detected ({data_size:,} elements) - using pandas\")\n",
    "        return dw.wrangle(data, backend='pandas')\n",
    "\n",
    "# Test with different sizes\n",
    "small_data = np.random.rand(50, 10)   # 500 elements\n",
    "large_data = np.random.rand(200, 100) # 20,000 elements\n",
    "\n",
    "small_result = smart_processing(small_data)\n",
    "large_result = smart_processing(large_data)\n",
    "\n",
    "print(f\"   Small result type: {type(small_result)}\")\n",
    "print(f\"   Large result type: {type(large_result)}\")\n",
    "\n",
    "print(\"\\n✅ Backend management demonstration complete!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Performance Monitoring and Profiling\n",
    "\n",
    "Let's create tools to monitor and profile Polars performance in your workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import functools\n",
    "import time\n",
    "from typing import Any, Callable\n",
    "\n",
    "class PerformanceProfiler:\n",
    "    \"\"\"Simple performance profiler for data-wrangler operations.\"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.operations = []\n",
    "    \n",
    "    def profile_operation(self, name: str):\n",
    "        \"\"\"Decorator to profile operations.\"\"\"\n",
    "        def decorator(func: Callable) -> Callable:\n",
    "            @functools.wraps(func)\n",
    "            def wrapper(*args, **kwargs) -> Any:\n",
    "                start_time = time.time()\n",
    "                result = func(*args, **kwargs)\n",
    "                end_time = time.time()\n",
    "                \n",
    "                self.operations.append({\n",
    "                    'name': name,\n",
    "                    'duration': end_time - start_time,\n",
    "                    'result_type': type(result).__name__\n",
    "                })\n",
    "                \n",
    "                return result\n",
    "            return wrapper\n",
    "        return decorator\n",
    "    \n",
    "    def get_summary(self):\n",
    "        \"\"\"Get performance summary.\"\"\"\n",
    "        if not self.operations:\n",
    "            return \"No operations recorded\"\n",
    "        \n",
    "        total_time = sum(op['duration'] for op in self.operations)\n",
    "        avg_time = total_time / len(self.operations)\n",
    "        \n",
    "        summary = f\"📊 Performance Summary:\\n\"\n",
    "        summary += f\"Total operations: {len(self.operations)}\\n\"\n",
    "        summary += f\"Total time: {total_time:.4f}s\\n\"\n",
    "        summary += f\"Average time: {avg_time:.4f}s\\n\\n\"\n",
    "        \n",
    "        summary += \"Operation details:\\n\"\n",
    "        for i, op in enumerate(self.operations, 1):\n",
    "            summary += f\"{i}. {op['name']}: {op['duration']:.4f}s ({op['result_type']})\\n\"\n",
    "        \n",
    "        return summary\n",
    "\n",
    "# Create profiler instance\n",
    "profiler = PerformanceProfiler()\n",
    "\n",
    "# Define profiled operations\n",
    "@profiler.profile_operation(\"Array to Polars\")\n",
    "def array_to_polars(data):\n",
    "    return dw.wrangle(data, backend='polars')\n",
    "\n",
    "@profiler.profile_operation(\"Array to Pandas\")\n",
    "def array_to_pandas(data):\n",
    "    return dw.wrangle(data, backend='pandas')\n",
    "\n",
    "@profiler.profile_operation(\"Text to Polars\")\n",
    "def text_to_polars(data):\n",
    "    return dw.wrangle(data, backend='polars')\n",
    "\n",
    "@profiler.profile_operation(\"Cross-backend conversion\")\n",
    "def cross_backend_conversion(data):\n",
    "    polars_df = dw.wrangle(data, backend='polars')\n",
    "    pandas_df = polars_to_pandas(polars_df)\n",
    "    return pandas_to_polars(pandas_df)\n",
    "\n",
    "# Run performance tests\n",
    "print(\"🏃‍♂️ Running performance profiling tests...\")\n",
    "\n",
    "test_array = np.random.rand(5000, 10)\n",
    "test_texts = sample_texts[:100]\n",
    "\n",
    "# Profile different operations\n",
    "result1 = array_to_polars(test_array)\n",
    "result2 = array_to_pandas(test_array)\n",
    "result3 = text_to_polars(test_texts)\n",
    "result4 = cross_backend_conversion(test_array)\n",
    "\n",
    "print(\"\\n\" + profiler.get_summary())\n",
    "\n",
    "# Create performance visualization\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "operations = [op['name'] for op in profiler.operations]\n",
    "durations = [op['duration'] for op in profiler.operations]\n",
    "colors = ['#ff7f0e', '#1f77b4', '#ff7f0e', '#2ca02c']\n",
    "\n",
    "bars = ax.bar(operations, durations, color=colors)\n",
    "ax.set_title('Operation Performance Comparison')\n",
    "ax.set_ylabel('Time (seconds)')\n",
    "ax.tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Add value labels on bars\n",
    "for bar, duration in zip(bars, durations):\n",
    "    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,\n",
    "            f'{duration:.4f}s', ha='center', va='bottom')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"✅ Performance profiling complete!\")"
   ]
  },
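  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Single wall-clock measurements like the ones above are noisy, especially for operations that finish in a few milliseconds. One common remedy is to repeat the call and keep the fastest run. The helper below is a minimal sketch using only the standard library; `time_best_of` and its parameters are hypothetical names introduced here, not part of data-wrangler or Polars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "def time_best_of(func, *args, repeats=5, **kwargs):\n",
    "    \"\"\"Call func(*args, **kwargs) `repeats` times; return (best_seconds, last_result).\"\"\"\n",
    "    if repeats < 1:\n",
    "        raise ValueError(\"repeats must be at least 1\")\n",
    "    best = float(\"inf\")\n",
    "    result = None\n",
    "    for _ in range(repeats):\n",
    "        start = time.perf_counter()  # monotonic, high-resolution clock\n",
    "        result = func(*args, **kwargs)\n",
    "        best = min(best, time.perf_counter() - start)\n",
    "    return best, result\n",
    "\n",
    "# Usage with a plain Python function so this cell runs on its own;\n",
    "# any callable, such as a function wrapping dw.wrangle, could be passed instead\n",
    "best_seconds, total = time_best_of(sum, range(100_000), repeats=5)\n",
    "print(f\"Best of 5 runs: {best_seconds:.6f}s (result: {total:,})\")"
   ]
  },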
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Best Practices and Optimization Tips\n",
    "\n",
    "Here are key recommendations for getting the most out of Polars in data-wrangler:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"💡 POLARS OPTIMIZATION BEST PRACTICES\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "best_practices = {\n",
    "    \"1. Use Lazy Evaluation\": {\n",
    "        \"description\": \"Convert to LazyFrame for complex operations\",\n",
    "        \"example\": \"df.lazy().filter(...).group_by(...).collect()\",\n",
    "        \"benefit\": \"Query optimization and better performance\"\n",
    "    },\n",
    "    \"2. Filter Early\": {\n",
    "        \"description\": \"Apply filters before expensive operations\", \n",
    "        \"example\": \"df.filter(condition).expensive_operation()\",\n",
    "        \"benefit\": \"Reduces data size for subsequent operations\"\n",
    "    },\n",
    "    \"3. Select Columns Wisely\": {\n",
    "        \"description\": \"Only select needed columns\",\n",
    "        \"example\": \"df.select(['col1', 'col2']).process()\",\n",
    "        \"benefit\": \"Lower memory usage and faster processing\"\n",
    "    },\n",
    "    \"4. Leverage Parallel Processing\": {\n",
    "        \"description\": \"Polars automatically uses multiple cores\",\n",
    "        \"example\": \"Large aggregations benefit automatically\",\n",
    "        \"benefit\": \"Faster execution on multi-core systems\"\n",
    "    },\n",
    "    \"5. Use Appropriate Data Types\": {\n",
    "        \"description\": \"Cast to optimal types for your data\",\n",
    "        \"example\": \"df.cast({'col': pl.Int32})\",\n",
    "        \"benefit\": \"Reduced memory usage and faster operations\"\n",
    "    },\n",
    "    \"6. Batch Operations\": {\n",
    "        \"description\": \"Combine multiple operations in single call\",\n",
    "        \"example\": \"df.with_columns([op1, op2, op3])\",\n",
    "        \"benefit\": \"Reduced overhead and better optimization\"\n",
    "    }\n",
    "}\n",
    "\n",
    "for i, (practice, details) in enumerate(best_practices.items(), 1):\n",
    "    print(f\"\\n{practice}:\")\n",
    "    print(f\"   📋 {details['description']}\")\n",
    "    print(f\"   💻 Example: {details['example']}\")\n",
    "    print(f\"   ✅ Benefit: {details['benefit']}\")\n",
    "\n",
    "# Demonstrate good vs bad practices\n",
    "print(\"\\n\\n🚨 PERFORMANCE COMPARISON: Good vs Bad Practices\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "large_test_data = np.random.rand(50000, 15)\n",
    "polars_df = dw.wrangle(large_test_data, backend='polars')\n",
    "\n",
    "# Bad practice: Multiple separate operations\n",
    "print(\"❌ Bad practice (multiple separate operations):\")\n",
    "start = time.time()\n",
    "bad_result = polars_df.filter(pl.col(\"column_0\") > 0.5)\n",
    "bad_result = bad_result.with_columns([(pl.col(\"column_1\") * 2).alias(\"doubled\")])\n",
    "bad_result = bad_result.with_columns([(pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_23\")])\n",
    "bad_result = bad_result.group_by(\"doubled\").agg([pl.col(\"sum_23\").mean()])\n",
    "bad_time = time.time() - start\n",
    "print(f\"   Time: {bad_time:.4f}s\")\n",
    "\n",
    "# Good practice: Lazy evaluation with batched operations\n",
    "print(\"\\n✅ Good practice (lazy evaluation + batched operations):\")\n",
    "start = time.time()\n",
    "good_result = (\n",
    "    polars_df.lazy()\n",
    "    .filter(pl.col(\"column_0\") > 0.5)\n",
    "    .with_columns([\n",
    "        (pl.col(\"column_1\") * 2).alias(\"doubled\"),\n",
    "        (pl.col(\"column_2\") + pl.col(\"column_3\")).alias(\"sum_23\")\n",
    "    ])\n",
    "    .group_by(\"doubled\")\n",
    "    .agg([pl.col(\"sum_23\").mean()])\n",
    "    .collect()\n",
    ")\n",
    "good_time = time.time() - start\n",
    "print(f\"   Time: {good_time:.4f}s\")\n",
    "\n",
    "improvement = (bad_time - good_time) / bad_time * 100\n",
    "print(f\"\\n🚀 Performance improvement: {improvement:.1f}% faster with best practices!\")\n",
    "\n",
    "print(\"\\n✅ Advanced Polars tutorial complete!\")\n",
    "print(\"You're now ready to leverage Polars' full power in data-wrangler! 🎉\")"
   ]
  },
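  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"Use Appropriate Data Types\" tip can be checked directly by comparing `estimated_size()` before and after a cast. The cell below is a small sketch on made-up data (the `demo_*` names are illustrative only). It assumes every value fits in the narrower type; with Polars' default strict casting, a value that does not fit raises an error instead of being silently truncated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "# Hypothetical toy column: Python ints are inferred as 64-bit integers\n",
    "demo_ints = pl.DataFrame({\"value\": list(range(1000))})\n",
    "print(f\"Original dtype: {demo_ints['value'].dtype}\")\n",
    "\n",
    "# All values lie in 0..999, so they fit comfortably in a 16-bit integer\n",
    "demo_small = demo_ints.with_columns(pl.col(\"value\").cast(pl.Int16))\n",
    "print(f\"Narrowed dtype: {demo_small['value'].dtype}\")\n",
    "\n",
    "size_before = demo_ints.estimated_size()\n",
    "size_after = demo_small.estimated_size()\n",
    "print(f\"Estimated size before cast: {size_before:,} bytes\")\n",
    "print(f\"Estimated size after cast:  {size_after:,} bytes\")"
   ]
  },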
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This tutorial covered advanced Polars features in data-wrangler:\n",
    "\n",
    "### 🚀 Key Capabilities Explored\n",
    "\n",
    "1. **Lazy Evaluation**: Query optimization for complex operations\n",
    "2. **Advanced Data Types**: Rich type system with specialized operations\n",
    "3. **Cross-Backend Workflows**: Seamless switching between pandas and Polars\n",
    "4. **Parallel Processing**: Automatic multi-core utilization\n",
    "5. **Text Processing**: High-performance string operations\n",
    "6. **Backend Management**: Flexible global and per-operation control\n",
    "7. **Performance Profiling**: Tools for monitoring and optimization\n",
    "\n",
    "### 💡 Best Practices Learned\n",
    "\n",
    "- Use lazy evaluation for complex queries\n",
    "- Filter early and select only needed columns\n",
    "- Batch operations for better performance\n",
    "- Leverage automatic parallel processing\n",
    "- Choose appropriate data types\n",
    "- Mix backends as needed for optimal workflows\n",
    "\n",
    "### 🎯 When to Use Advanced Features\n",
    "\n",
    "- **Large datasets** (>10,000 rows): Use lazy evaluation\n",
    "- **Complex transformations**: Batch operations and optimize queries\n",
    "- **Memory constraints**: Leverage columnar storage and early filtering\n",
    "- **Mixed workflows**: Combine pandas ecosystem with Polars performance\n",
    "\n",
    "With these advanced techniques, you can build highly efficient data processing pipelines that scale from prototyping to production! 🌟"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}