Advanced Polars Features in Data-Wrangler
This tutorial explores advanced Polars capabilities that you can leverage through data-wrangler’s backend system. We’ll cover lazy evaluation, advanced data type handling, cross-backend workflows, and optimization techniques.
Overview
Polars offers several advanced features that make it ideal for sophisticated data processing:
Lazy Evaluation: Query optimization and efficient execution
Columnar Storage: Memory-efficient data representation
Parallel Processing: Built-in multi-threading
Type System: Rich data type support
Interoperability: Seamless conversion with other formats
Let’s explore these features in the context of data-wrangler!
[ ]:
import datawrangler as dw
import numpy as np
import pandas as pd
import polars as pl
import time
from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend
from datawrangler.zoo.polars_dataframe import polars_to_pandas, pandas_to_polars
import matplotlib.pyplot as plt
print("🚀 Advanced Polars tutorial environment loaded!")
print(f"Current backend: {get_dataframe_backend()}")
1. Working with Polars LazyFrames
Polars LazyFrames enable query optimization and can provide significant performance improvements for complex operations.
[ ]:
# Create sample data for lazy evaluation demonstration
large_array = np.random.rand(50000, 10)
print(f"Created test array: {large_array.shape[0]:,} rows x {large_array.shape[1]} columns")
# Convert to Polars DataFrame
polars_df = dw.wrangle(large_array, backend='polars')
print(f"Polars DataFrame type: {type(polars_df)}")
print(f"Shape: {polars_df.shape}")
# Convert to LazyFrame for optimization
lazy_df = polars_df.lazy()
print(f"\nLazyFrame type: {type(lazy_df)}")
print("LazyFrame created - no computation performed yet!")
# Build a complex query
lazy_result = (
lazy_df
.filter(pl.col("column_0") > 0.5)
.with_columns([
(pl.col("column_1") * 2).alias("doubled_col1"),
(pl.col("column_2") + pl.col("column_3")).alias("sum_col23")
])
.group_by(pl.col("doubled_col1").round(1))
.agg([
pl.col("sum_col23").mean().alias("avg_sum"),
pl.col("column_0").count().alias("count")
])
)
print("\n📋 Query plan created (still lazy):")
print(lazy_result.explain())
# Execute the query
print("\n⚡ Executing optimized query...")
start_time = time.time()
result = lazy_result.collect()
execution_time = time.time() - start_time
print(f"✅ Query executed in {execution_time:.4f} seconds")
print(f"Result shape: {result.shape}")
print("\nFirst few rows:")
print(result.head())
2. Advanced Data Type Handling
Polars has a rich type system that can be leveraged for efficient data processing.
[ ]:
# Create diverse data types for testing
mixed_data = {
'integers': np.random.randint(0, 100, 1000),
'floats': np.random.rand(1000),
'strings': [f"item_{i}" for i in range(1000)],
'booleans': np.random.choice([True, False], 1000),
'dates': pd.date_range('2023-01-01', periods=1000, freq='H')
}
# Create pandas DataFrame first
pandas_df = pd.DataFrame(mixed_data)
print("📊 Original pandas DataFrame info:")
print(pandas_df.dtypes)
print(f"Memory usage: {pandas_df.memory_usage(deep=True).sum() / 1024:.1f} KB")
# Convert to Polars and examine types
polars_mixed = dw.wrangle(pandas_df, backend='polars')
print("\n🚀 Polars DataFrame schema:")
print(polars_mixed.schema)
print(f"Memory usage: {polars_mixed.estimated_size() / 1024:.1f} KB")
# Demonstrate type-specific operations
print("\n🔧 Type-specific operations:")
# String operations
string_ops = polars_mixed.select([
pl.col("strings").str.lengths().alias("string_lengths"),
pl.col("strings").str.to_uppercase().alias("uppercase"),
pl.col("strings").str.contains("item_1").alias("contains_item1")
])
print("String operations result:")
print(string_ops.head())
# Date operations
date_ops = polars_mixed.select([
pl.col("dates").dt.year().alias("year"),
pl.col("dates").dt.month().alias("month"),
pl.col("dates").dt.weekday().alias("weekday")
])
print("\nDate operations result:")
print(date_ops.head())
# Numerical operations with conditions
numerical_ops = polars_mixed.select([
pl.when(pl.col("integers") > 50)
.then(pl.col("floats") * 2)
.otherwise(pl.col("floats"))
.alias("conditional_floats"),
pl.col("integers").cast(pl.Float64).alias("integers_as_float")
])
print("\nNumerical operations result:")
print(numerical_ops.head())
3. Cross-Backend Workflows
One of data-wrangler’s strengths is the ability to seamlessly switch between backends within the same workflow.
[ ]:
# Create a complex workflow that leverages both backends
print("🔄 CROSS-BACKEND WORKFLOW DEMONSTRATION")
print("=" * 50)
# Step 1: Start with raw data
raw_data = np.random.rand(5000, 8)
print(f"1. Raw data: {raw_data.shape}")
# Step 2: Initial processing with Polars (fast)
print("2. Initial processing with Polars backend...")
polars_processed = dw.wrangle(raw_data, backend='polars')
print(f" Result type: {type(polars_processed)}")
# Step 3: Add computed columns using Polars operations
polars_enhanced = polars_processed.with_columns([
(pl.col("column_0") + pl.col("column_1")).alias("sum_01"),
(pl.col("column_2") * pl.col("column_3")).alias("product_23"),
pl.col("column_4").rolling_mean(window_size=10).alias("rolling_mean_4")
])
print(f"3. Enhanced with computed columns: {polars_enhanced.shape}")
# Step 4: Convert to pandas for specialized analysis
print("4. Converting to pandas for specialized operations...")
pandas_df = polars_to_pandas(polars_enhanced)
print(f" Converted type: {type(pandas_df)}")
# Step 5: Use pandas-specific functionality (e.g., complex plotting)
correlation_matrix = pandas_df.corr()
print(f"5. Computed correlation matrix: {correlation_matrix.shape}")
# Step 6: Convert back to Polars for final processing
print("6. Converting back to Polars for final aggregations...")
final_polars = pandas_to_polars(pandas_df)
# Step 7: Final aggregations with Polars
summary_stats = final_polars.select([
pl.all().mean().suffix("_mean"),
pl.all().std().suffix("_std"),
pl.all().min().suffix("_min"),
pl.all().max().suffix("_max")
])
print("7. Final summary statistics computed!")
print(f" Summary shape: {summary_stats.shape}")
print("\n✅ Cross-backend workflow completed successfully!")
print(" Benefits: Polars speed + Pandas ecosystem compatibility")
4. Parallel Processing and Performance Optimization
Polars automatically leverages multiple CPU cores. Let’s demonstrate this capability.
[ ]:
import multiprocessing
print(f"🖥️ Available CPU cores: {multiprocessing.cpu_count()}")
print("Polars will automatically use multiple cores for operations!")
# Create large dataset for parallel processing demonstration
large_dataset = np.random.rand(100000, 20)
print(f"\n📊 Large dataset: {large_dataset.shape[0]:,} rows x {large_dataset.shape[1]} columns")
# Convert to Polars DataFrame
polars_large = dw.wrangle(large_dataset, backend='polars')
# Demonstrate parallel operations
print("\n⚡ Performing parallel operations...")
# Complex aggregation that benefits from parallelization
start_time = time.time()
parallel_result = polars_large.select([
# Multiple statistical operations across all columns
pl.all().mean().suffix("_mean"),
pl.all().std().suffix("_std"),
pl.all().quantile(0.25).suffix("_q25"),
pl.all().quantile(0.75).suffix("_q75"),
pl.all().skew().suffix("_skew")
])
parallel_time = time.time() - start_time
print(f"✅ Parallel operations completed in {parallel_time:.4f} seconds")
print(f"Result shape: {parallel_result.shape}")
# Demonstrate lazy evaluation with parallel execution
print("\n🔄 Lazy evaluation with parallel execution...")
start_time = time.time()
lazy_parallel = (
polars_large.lazy()
.filter(pl.col("column_0") > 0.3)
.group_by((pl.col("column_1") * 10).round().cast(pl.Int64))
.agg([
pl.col("column_2").sum().alias("sum_col2"),
pl.col("column_3").mean().alias("mean_col3"),
pl.col("column_4").count().alias("count")
])
.sort("count", descending=True)
.collect() # Execute with optimization
)
lazy_parallel_time = time.time() - start_time
print(f"✅ Lazy parallel operations completed in {lazy_parallel_time:.4f} seconds")
print(f"Result shape: {lazy_parallel.shape}")
# Show memory efficiency
print("\n💾 Memory efficiency demonstration:")
memory_efficient = (
polars_large.lazy()
.select(["column_0", "column_1", "column_2"]) # Select only needed columns
.filter(pl.col("column_0") > 0.5) # Filter early
.with_columns([
(pl.col("column_1") * pl.col("column_2")).alias("product")
])
.collect()
)
print(f"Memory-efficient query result: {memory_efficient.shape}")
print("Benefits: Early filtering, column selection, and optimized execution")
5. Advanced Text Processing with Polars
Polars excels at string operations and can significantly speed up text processing workflows.
[ ]:
# Create sample text data for advanced processing
sample_texts = [
"Machine learning algorithms process vast amounts of data efficiently.",
"Data science combines statistical methods with computational techniques.",
"Artificial intelligence enables automated decision-making systems.",
"Deep learning neural networks learn complex patterns automatically.",
"Natural language processing understands human communication patterns.",
"Computer vision analyzes visual information from digital images.",
"Big data analytics extracts insights from massive datasets.",
"Cloud computing provides scalable infrastructure for applications."
] * 1000 # Scale up for performance testing
print(f"📝 Text dataset: {len(sample_texts):,} documents")
# Process with Polars backend for embeddings
print("\n🚀 Processing text with Polars backend...")
start_time = time.time()
text_embeddings = dw.wrangle(sample_texts, backend='polars')
text_processing_time = time.time() - start_time
print(f"✅ Text processing completed in {text_processing_time:.4f} seconds")
print(f"Embeddings shape: {text_embeddings.shape}")
print(f"Embeddings type: {type(text_embeddings)}")
# Demonstrate advanced string operations on original text
print("\n🔧 Advanced string operations with Polars...")
# Create a DataFrame with the original texts
text_df = pl.DataFrame({"text": sample_texts[:100]}) # Use subset for demonstration
# Perform various string operations
text_analysis = text_df.with_columns([
pl.col("text").str.len_chars().alias("char_count"),
pl.col("text").str.len_bytes().alias("byte_count"),
pl.col("text").str.n_chars().alias("n_chars"),
pl.col("text").str.split(" ").list.len().alias("word_count"),
pl.col("text").str.to_lowercase().alias("lowercase"),
pl.col("text").str.contains("data").alias("contains_data"),
pl.col("text").str.extract(r"(\w+ing)", 1).alias("words_ending_ing"),
pl.col("text").str.replace_all("data", "information").alias("replaced_text")
])
print("String analysis results (first 3 rows):")
print(text_analysis.head(3))
# Text statistics
text_stats = text_analysis.select([
pl.col("char_count").mean().alias("avg_chars"),
pl.col("word_count").mean().alias("avg_words"),
pl.col("contains_data").sum().alias("docs_with_data"),
pl.col("words_ending_ing").filter(pl.col("words_ending_ing").is_not_null()).count().alias("docs_with_ing_words")
])
print("\n📊 Text statistics:")
print(text_stats)
6. Global Backend Management
Data-wrangler provides flexible backend management for different parts of your workflow.
[ ]:
print("⚙️ GLOBAL BACKEND MANAGEMENT")
print("=" * 40)
# Check current backend
current_backend = get_dataframe_backend()
print(f"Current default backend: {current_backend}")
# Demonstrate backend switching workflow
test_data = np.random.rand(1000, 5)
print("\n1. Using current backend (no explicit specification):")
result1 = dw.wrangle(test_data)
print(f" Result type: {type(result1)}")
print("\n2. Explicitly using Polars backend:")
result2 = dw.wrangle(test_data, backend='polars')
print(f" Result type: {type(result2)}")
print("\n3. Explicitly using pandas backend:")
result3 = dw.wrangle(test_data, backend='pandas')
print(f" Result type: {type(result3)}")
# Change global backend
print("\n4. Changing global backend to Polars:")
set_dataframe_backend('polars')
print(f" New default backend: {get_dataframe_backend()}")
result4 = dw.wrangle(test_data) # Now uses Polars by default
print(f" Result type: {type(result4)}")
# Reset to pandas
print("\n5. Resetting to pandas backend:")
set_dataframe_backend('pandas')
print(f" Reset to: {get_dataframe_backend()}")
# Demonstrate context-aware processing
print("\n6. Context-aware processing example:")
def smart_processing(data, size_threshold=10000):
"""Automatically choose backend based on data size."""
data_size = np.prod(data.shape) if hasattr(data, 'shape') else len(data)
if data_size > size_threshold:
print(f" Large dataset detected ({data_size:,} elements) - using Polars")
return dw.wrangle(data, backend='polars')
else:
print(f" Small dataset detected ({data_size:,} elements) - using pandas")
return dw.wrangle(data, backend='pandas')
# Test with different sizes
small_data = np.random.rand(50, 10) # 500 elements
large_data = np.random.rand(200, 100) # 20,000 elements
small_result = smart_processing(small_data)
large_result = smart_processing(large_data)
print(f" Small result type: {type(small_result)}")
print(f" Large result type: {type(large_result)}")
print("\n✅ Backend management demonstration complete!")
7. Performance Monitoring and Profiling
Let’s create tools to monitor and profile Polars performance in your workflows.
[ ]:
import functools
import time
from typing import Any, Callable
class PerformanceProfiler:
"""Simple performance profiler for data-wrangler operations."""
def __init__(self):
self.operations = []
def profile_operation(self, name: str):
"""Decorator to profile operations."""
def decorator(func: Callable) -> Callable:
@functools.wraps(func)
def wrapper(*args, **kwargs) -> Any:
start_time = time.time()
result = func(*args, **kwargs)
end_time = time.time()
self.operations.append({
'name': name,
'duration': end_time - start_time,
'result_type': type(result).__name__
})
return result
return wrapper
return decorator
def get_summary(self):
"""Get performance summary."""
if not self.operations:
return "No operations recorded"
total_time = sum(op['duration'] for op in self.operations)
avg_time = total_time / len(self.operations)
summary = f"📊 Performance Summary:\n"
summary += f"Total operations: {len(self.operations)}\n"
summary += f"Total time: {total_time:.4f}s\n"
summary += f"Average time: {avg_time:.4f}s\n\n"
summary += "Operation details:\n"
for i, op in enumerate(self.operations, 1):
summary += f"{i}. {op['name']}: {op['duration']:.4f}s ({op['result_type']})\n"
return summary
# Create profiler instance
profiler = PerformanceProfiler()
# Define profiled operations
@profiler.profile_operation("Array to Polars")
def array_to_polars(data):
return dw.wrangle(data, backend='polars')
@profiler.profile_operation("Array to Pandas")
def array_to_pandas(data):
return dw.wrangle(data, backend='pandas')
@profiler.profile_operation("Text to Polars")
def text_to_polars(data):
return dw.wrangle(data, backend='polars')
@profiler.profile_operation("Cross-backend conversion")
def cross_backend_conversion(data):
polars_df = dw.wrangle(data, backend='polars')
pandas_df = polars_to_pandas(polars_df)
return pandas_to_polars(pandas_df)
# Run performance tests
print("🏃♂️ Running performance profiling tests...")
test_array = np.random.rand(5000, 10)
test_texts = sample_texts[:100]
# Profile different operations
result1 = array_to_polars(test_array)
result2 = array_to_pandas(test_array)
result3 = text_to_polars(test_texts)
result4 = cross_backend_conversion(test_array)
print("\n" + profiler.get_summary())
# Create performance visualization
fig, ax = plt.subplots(figsize=(10, 6))
operations = [op['name'] for op in profiler.operations]
durations = [op['duration'] for op in profiler.operations]
colors = ['#ff7f0e', '#1f77b4', '#ff7f0e', '#2ca02c']
bars = ax.bar(operations, durations, color=colors)
ax.set_title('Operation Performance Comparison')
ax.set_ylabel('Time (seconds)')
ax.tick_params(axis='x', rotation=45)
# Add value labels on bars
for bar, duration in zip(bars, durations):
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
f'{duration:.4f}s', ha='center', va='bottom')
plt.tight_layout()
plt.show()
print("✅ Performance profiling complete!")
Best Practices and Optimization Tips
Here are key recommendations for getting the most out of Polars in data-wrangler:
[ ]:
print("💡 POLARS OPTIMIZATION BEST PRACTICES")
print("=" * 50)
best_practices = {
"1. Use Lazy Evaluation": {
"description": "Convert to LazyFrame for complex operations",
"example": "df.lazy().filter(...).group_by(...).collect()",
"benefit": "Query optimization and better performance"
},
"2. Filter Early": {
"description": "Apply filters before expensive operations",
"example": "df.filter(condition).expensive_operation()",
"benefit": "Reduces data size for subsequent operations"
},
"3. Select Columns Wisely": {
"description": "Only select needed columns",
"example": "df.select(['col1', 'col2']).process()",
"benefit": "Lower memory usage and faster processing"
},
"4. Leverage Parallel Processing": {
"description": "Polars automatically uses multiple cores",
"example": "Large aggregations benefit automatically",
"benefit": "Faster execution on multi-core systems"
},
"5. Use Appropriate Data Types": {
"description": "Cast to optimal types for your data",
"example": "df.cast({'col': pl.Int32})",
"benefit": "Reduced memory usage and faster operations"
},
"6. Batch Operations": {
"description": "Combine multiple operations in single call",
"example": "df.with_columns([op1, op2, op3])",
"benefit": "Reduced overhead and better optimization"
}
}
for i, (practice, details) in enumerate(best_practices.items(), 1):
print(f"\n{practice}:")
print(f" 📋 {details['description']}")
print(f" 💻 Example: {details['example']}")
print(f" ✅ Benefit: {details['benefit']}")
# Demonstrate good vs bad practices
print("\n\n🚨 PERFORMANCE COMPARISON: Good vs Bad Practices")
print("=" * 60)
large_test_data = np.random.rand(50000, 15)
polars_df = dw.wrangle(large_test_data, backend='polars')
# Bad practice: Multiple separate operations
print("❌ Bad practice (multiple separate operations):")
start = time.time()
bad_result = polars_df.filter(pl.col("column_0") > 0.5)
bad_result = bad_result.with_columns([(pl.col("column_1") * 2).alias("doubled")])
bad_result = bad_result.with_columns([(pl.col("column_2") + pl.col("column_3")).alias("sum_23")])
bad_result = bad_result.group_by("doubled").agg([pl.col("sum_23").mean()])
bad_time = time.time() - start
print(f" Time: {bad_time:.4f}s")
# Good practice: Lazy evaluation with batched operations
print("\n✅ Good practice (lazy evaluation + batched operations):")
start = time.time()
good_result = (
polars_df.lazy()
.filter(pl.col("column_0") > 0.5)
.with_columns([
(pl.col("column_1") * 2).alias("doubled"),
(pl.col("column_2") + pl.col("column_3")).alias("sum_23")
])
.group_by("doubled")
.agg([pl.col("sum_23").mean()])
.collect()
)
good_time = time.time() - start
print(f" Time: {good_time:.4f}s")
improvement = (bad_time - good_time) / bad_time * 100
print(f"\n🚀 Performance improvement: {improvement:.1f}% faster with best practices!")
print("\n✅ Advanced Polars tutorial complete!")
print("You're now ready to leverage Polars' full power in data-wrangler! 🎉")
Summary
This tutorial covered advanced Polars features in data-wrangler:
🚀 Key Capabilities Explored
Lazy Evaluation: Query optimization for complex operations
Advanced Data Types: Rich type system with specialized operations
Cross-Backend Workflows: Seamless switching between pandas and Polars
Parallel Processing: Automatic multi-core utilization
Text Processing: High-performance string operations
Backend Management: Flexible global and per-operation control
Performance Profiling: Tools for monitoring and optimization
💡 Best Practices Learned
Use lazy evaluation for complex queries
Filter early and select only needed columns
Batch operations for better performance
Leverage automatic parallel processing
Choose appropriate data types
Mix backends as needed for optimal workflows
🎯 When to Use Advanced Features
Large datasets (>10,000 rows): Use lazy evaluation
Complex transformations: Batch operations and optimize queries
Memory constraints: Leverage columnar storage and early filtering
Mixed workflows: Combine pandas ecosystem with Polars performance
With these advanced techniques, you can build highly efficient data processing pipelines that scale from prototyping to production! 🌟