DataWrangler

Transform messy data into clean pandas/Polars DataFrames with intelligent automation

DataWrangler is a powerful Python package that automatically converts diverse data formats into clean, analysis-ready DataFrames. Whether you’re working with arrays, text, images, or mixed data types, DataWrangler intelligently detects formats and applies appropriate transformations – all with a simple, unified API.

🚀 New: High-performance Polars backend support for 2-100x faster processing!

Why DataWrangler?

🎯 Intelligent Automation: No more manual data preprocessing. DataWrangler automatically detects data types and applies appropriate transformations.
⚡ High Performance: Choose between pandas (familiar) and Polars (fast) backends. Switching takes a single backend argument or one global setting; the rest of your code stays the same.
🔧 Unified API: One simple function handles arrays, text, images, files, URLs, and mixed data types.
📊 Research-Ready: Built for data science workflows with advanced text processing, embeddings, and ML preprocessing.
🛡️ Production-Tested: Robust error handling, a comprehensive test suite, and real-world use in research environments.

Quick Start Examples

Basic Data Wrangling

import datawrangler as dw
import numpy as np

# Arrays become DataFrames automatically
array_data = np.random.rand(1000, 5)
df = dw.wrangle(array_data)
print(df.head())

High-Performance with Polars

# Same operation, 2-100x faster with Polars backend
fast_df = dw.wrangle(array_data, backend='polars')

# Set global backend preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Advanced Text Processing

# Text documents become embedding vectors
documents = [
    "Machine learning transforms data into insights",
    "Data science combines statistics with programming",
    "AI enables automated decision-making systems"
]

# Automatic text embeddings with state-of-the-art models
text_df = dw.wrangle(documents)
print(f"Embedded {len(documents)} documents into {text_df.shape} DataFrame")

# Use modern transformer models for better quality
sentence_model = {'model': 'all-mpnet-base-v2'}
embeddings = dw.wrangle(documents, text_kwargs={'model': sentence_model})
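
Once documents are embedded, a common next step is comparing rows with cosine similarity. The sketch below uses only the standard library, and the vectors are made-up stand-ins for rows of an embedding DataFrame, not real model output. The helper name cosine_similarity is illustrative and is not part of DataWrangler.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (illustrative values only)
doc_a = [0.2, 0.8, 0.1]
doc_b = [0.25, 0.75, 0.05]
doc_c = [0.9, 0.1, 0.4]

print(cosine_similarity(doc_a, doc_b))  # close to 1: similar direction
print(cosine_similarity(doc_a, doc_c))  # noticeably lower
```

With real embeddings, each row of the wrangled DataFrame would take the place of the toy lists.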

Mixed Data Types in One Call

# Process multiple data types simultaneously
mixed_data = [
    np.random.rand(500, 10),           # NumPy array
    "path/to/image.jpg",               # Image file
    documents,                         # Text documents
    "https://api.example.com/data.csv" # Remote CSV
]

results = dw.wrangle(mixed_data, return_dtype=True)
dataframes, detected_types = results

for df, dtype in zip(dataframes, detected_types):
    print(f"{dtype}: {df.shape}")
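
To give a feel for what format detection involves, here is a minimal sketch of dispatching on input type. It is not DataWrangler's actual detection logic: the function guess_kind, the labels it returns, and the list of image suffixes are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical set of suffixes treated as images in this sketch
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def guess_kind(obj):
    """Return a coarse, hypothetical label describing obj."""
    if isinstance(obj, str):
        # Strings may be URLs, image paths, or plain text
        if urlparse(obj).scheme in ("http", "https"):
            return "url"
        if Path(obj).suffix.lower() in IMAGE_SUFFIXES:
            return "image"
        return "text"
    if isinstance(obj, (list, tuple)) and obj and all(isinstance(x, str) for x in obj):
        # A non-empty sequence of strings is treated as a document collection
        return "text"
    if hasattr(obj, "shape"):
        # Duck-typed check that matches NumPy arrays and similar objects
        return "array"
    return "unknown"

samples = [
    "https://api.example.com/data.csv",
    "path/to/image.jpg",
    ["first document", "second document"],
    "a plain sentence",
]
for sample in samples:
    print(guess_kind(sample), "<-", sample)
```

A real implementation also has to inspect file contents and handle nested containers; this sketch only looks at the surface of each input.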

Function Decoration for Seamless Integration

from datawrangler.decorate import funnel

@funnel  # Automatically converts inputs to DataFrames
def analyze_data(df):
    """Your function works with any data type now!"""
    return df.describe()

# Works with arrays, text, files - anything!
stats = analyze_data(array_data)      # NumPy array
text_stats = analyze_data(documents)  # Text documents
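
The decorator pattern behind this kind of integration can be sketched in plain Python. The example below is not how funnel is implemented; coerce_first_arg and to_rows are hypothetical names, and the converter simply normalizes input to a list of rows instead of building a DataFrame.

```python
import functools

def coerce_first_arg(converter):
    """Build a decorator that passes the first positional argument through converter."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(data, *args, **kwargs):
            return func(converter(data), *args, **kwargs)
        return wrapper
    return decorator

def to_rows(data):
    """Hypothetical converter: normalize input to a list of rows (lists)."""
    if isinstance(data, str):
        return [[data]]
    return [list(row) if isinstance(row, (list, tuple)) else [row] for row in data]

@coerce_first_arg(to_rows)
def count_cells(rows):
    """Count the total number of cells across all rows."""
    return sum(len(row) for row in rows)

print(count_cells([(1, 2), (3, 4)]))  # 4
print(count_cells("hello"))           # 1
print(count_cells([1, 2, 3]))         # 3
```

The decorated function only ever sees the normalized form, which is the same idea that lets a function written for DataFrames accept arrays or text.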

Common Use Cases

🔬 Research & Academia

Literature analysis and text mining
Experimental data processing
Multi-modal data integration
Reproducible research pipelines

💼 Business Intelligence

Customer feedback analysis
Sales data aggregation
Performance monitoring dashboards
Cross-platform data integration

🤖 Machine Learning

Feature engineering automation
Text preprocessing for NLP models
Multi-source data fusion
Model input preparation

📈 Data Engineering

ETL pipeline simplification
Real-time data processing
Data lake preprocessing
Format standardization

Performance Benefits

DataWrangler with Polars backend delivers significant performance improvements:

import time

import numpy as np
import datawrangler as dw

# Large dataset example
large_array = np.random.rand(100000, 50)

# Pandas backend (traditional)
start = time.perf_counter()  # monotonic, high-resolution timer
pandas_df = dw.wrangle(large_array, backend='pandas')
pandas_time = time.perf_counter() - start

# Polars backend (high-performance)
start = time.perf_counter()
polars_df = dw.wrangle(large_array, backend='polars')
polars_time = time.perf_counter() - start

speedup = pandas_time / polars_time
print(f"Polars speedup: {speedup:.1f}x")
# Typical result: 50-100x speedup for large arrays (varies with hardware and data)
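
A single timed run is sensitive to caching and background load, so it helps to repeat each measurement and keep the fastest run. The helper below is a minimal sketch using only the standard library; best_time is a hypothetical name, and the workloads are stand-ins so the snippet runs without DataWrangler installed.

```python
import time

def best_time(func, *args, repeats=5, **kwargs):
    """Call func `repeats` times and return the fastest wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workloads; swap in the calls you want to compare, for example
# lambda: dw.wrangle(large_array, backend='pandas')
larger = best_time(sum, range(200_000))
smaller = best_time(sum, range(20_000))
print(f"larger input: {larger:.6f}s, smaller input: {smaller:.6f}s")
```

Taking the minimum rather than the mean reduces the influence of one-off slowdowns unrelated to the code being measured.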

Real-world performance gains:

Array processing: 2-100x faster conversion
Text embeddings: 3-10x faster document processing
Aggregations: 5-50x faster group-by operations
Memory usage: 30-70% reduction for large datasets

Getting Started

Installation:
```
pip install pydata-wrangler
```

Optional dependencies for transformer-based text models:

pip install "pydata-wrangler[hf]"  # Adds transformers, sentence-transformers
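
To check whether the optional packages are present before relying on them, something like the following works. It is a small sketch using the standard library; the helper name available is illustrative, and the module names are the import names of the two packages mentioned above.

```python
import importlib.util

def available(modules):
    """Map each module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

status = available(["transformers", "sentence_transformers"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```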

Start wrangling:

import datawrangler as dw
df = dw.wrangle(your_data)
