DataWranglerο
Transform messy data into clean pandas/Polars DataFrames with intelligent automation
DataWrangler is a powerful Python package that automatically converts diverse data formats into clean, analysis-ready DataFrames. Whether youβre working with arrays, text, images, or mixed data types, DataWrangler intelligently detects formats and applies appropriate transformations β all with a simple, unified API.
π New: High-performance Polars backend support for 2-100x faster processing!
Why DataWrangler?ο
- π― Intelligent Automation
No more manual data preprocessing. DataWrangler automatically detects data types and applies appropriate transformations.
- β‘ High Performance
Choose between pandas (familiar) and Polars (fast) backends. Get dramatic speedups with zero code changes.
- π§ Unified API
One simple function handles arrays, text, images, files, URLs, and mixed data types.
- π Research-Ready
Built for data science workflows with advanced text processing, embeddings, and ML preprocessing.
- π‘οΈ Production-Tested
Robust error handling, comprehensive testing, and battle-tested in real research environments.
Quick Start Examplesο
Basic Data Wrangling
import datawrangler as dw
import numpy as np
# Arrays become DataFrames automatically
array_data = np.random.rand(1000, 5)
df = dw.wrangle(array_data)
print(df.head())
High-Performance with Polars
# Same operation, 2-100x faster with Polars backend
fast_df = dw.wrangle(array_data, backend='polars')
# Set global backend preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars') # All operations now use Polars
Advanced Text Processing
# Text documents become embedding vectors
documents = [
"Machine learning transforms data into insights",
"Data science combines statistics with programming",
"AI enables automated decision-making systems"
]
# Automatic text embeddings with state-of-the-art models
text_df = dw.wrangle(documents)
print(f"Embedded {len(documents)} documents into {text_df.shape} DataFrame")
# Use modern transformer models for better quality
sentence_model = {'model': 'all-mpnet-base-v2'}
embeddings = dw.wrangle(documents, text_kwargs={'model': sentence_model})
Mixed Data Types in One Call
# Process multiple data types simultaneously
mixed_data = [
np.random.rand(500, 10), # NumPy array
"path/to/image.jpg", # Image file
documents, # Text documents
"https://api.example.com/data.csv" # Remote CSV
]
results = dw.wrangle(mixed_data, return_dtype=True)
dataframes, detected_types = results
for df, dtype in zip(dataframes, detected_types):
print(f"{dtype}: {df.shape}")
Function Decoration for Seamless Integration
from datawrangler.decorate import funnel
@funnel # Automatically converts inputs to DataFrames
def analyze_data(df):
"""Your function works with any data type now!"""
return df.describe()
# Works with arrays, text, files - anything!
stats = analyze_data(array_data) # NumPy array
text_stats = analyze_data(documents) # Text documents
Common Use Casesο
- π¬ Research & Academia
Literature analysis and text mining
Experimental data processing
Multi-modal data integration
Reproducible research pipelines
- πΌ Business Intelligence
Customer feedback analysis
Sales data aggregation
Performance monitoring dashboards
Cross-platform data integration
- π€ Machine Learning
Feature engineering automation
Text preprocessing for NLP models
Multi-source data fusion
Model input preparation
- π Data Engineering
ETL pipeline simplification
Real-time data processing
Data lake preprocessing
Format standardization
Performance Benefitsο
DataWrangler with Polars backend delivers significant performance improvements:
import time
# Large dataset example
large_array = np.random.rand(100000, 50)
# Pandas backend (traditional)
start = time.time()
pandas_df = dw.wrangle(large_array, backend='pandas')
pandas_time = time.time() - start
# Polars backend (high-performance)
start = time.time()
polars_df = dw.wrangle(large_array, backend='polars')
polars_time = time.time() - start
speedup = pandas_time / polars_time
print(f"Polars is {speedup:.1f}x faster!")
# Typical result: 50-100x speedup for large arrays
Real-world performance gains:
Array processing: 2-100x faster conversion
Text embeddings: 3-10x faster document processing
Aggregations: 5-50x faster group-by operations
Memory usage: 30-70% reduction for large datasets
Getting Startedο
Installation:
pip install pydata-wrangler
Optional high-performance dependencies:
pip install pydata-wrangler[hf] # Adds transformers, sentence-transformers
Start wrangling:
import datawrangler as dw df = dw.wrangle(your_data)
Documentation Contentsο
Getting Started:
User Guide:
Development: