datawrangler package

Module contents

Data Wrangler: Transform messy data into clean DataFrames (pandas or Polars)

Data Wrangler is a Python package that automatically transforms various data types (arrays, text, files, URLs, etc.) into a clean, consistent DataFrame format using either the pandas or the Polars backend. It specializes in text data processing with modern NLP models, and the Polars backend offers 2-100x speedups on large datasets.

Key Features:

- Dual backend support: pandas (default) or Polars for high performance
- Automatic data type detection and conversion
- Text embedding using sentence-transformers and sklearn models
- Function decorators for seamless DataFrame integration
- Support for files, URLs, and mixed data types
- Configurable processing pipeline
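To make "automatic data type detection" concrete, here is a minimal sketch of how an input could be classified before conversion. It is an illustration only and not the package's actual detection logic: the function name `detect_kind` and its labels are hypothetical, and it uses only the standard library.

```python
import os.path
from pathlib import Path
from urllib.parse import urlparse


def detect_kind(obj):
    """Return a coarse label for obj: 'url', 'file', 'text', 'sequence', or 'other'.

    Hypothetical helper for illustration; not part of datawrangler.
    """
    if isinstance(obj, Path):
        return "file"
    if isinstance(obj, str):
        parsed = urlparse(obj)
        # Treat only http(s) strings with a host as URLs.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "url"
        # A string counts as a file only if it names something on disk.
        if os.path.isfile(obj):
            return "file"
        return "text"
    if isinstance(obj, (list, tuple)):
        return "sequence"
    return "other"


# Mixed inputs can be labeled item by item before each is converted.
mixed = ["https://example.com/data.csv", "Hello world", [1, 2, 3]]
kinds = [detect_kind(item) for item in mixed]
```

A real pipeline would dispatch each label to its own converter; the point here is only that detection can be a small, ordered series of checks.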

Basic Usage:
>>> import datawrangler as dw
>>> df = dw.wrangle(your_data)  # pandas DataFrame (default)
>>> df_fast = dw.wrangle(your_data, backend='polars')  # Polars DataFrame

# With text data using sentence-transformers
>>> text_df = dw.wrangle(["Hello world", "Another text"],
...                      text_kwargs={'model': 'all-MiniLM-L6-v2'},
...                      backend='polars')  # 2-100x faster with Polars

# Using the @funnel decorator (backend-agnostic)
>>> @dw.funnel
... def your_function(df):
...     return df.mean()
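The decorator pattern behind a funnel-style wrapper can be sketched with the standard library alone. This is not the implementation of `dw.funnel`; the names `funnel_sketch` and `rows_to_columns` are hypothetical, and a plain dict of lists stands in for a DataFrame so the example runs without pandas or Polars.

```python
import functools


def funnel_sketch(convert):
    """Build a decorator that passes the first positional argument through convert.

    Hypothetical illustration of the pattern; not datawrangler's API.
    """
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(data, *args, **kwargs):
            return func(convert(data), *args, **kwargs)
        return wrapper
    return decorator


def rows_to_columns(rows):
    """Stand-in for a DataFrame conversion: list of dicts -> dict of lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


@funnel_sketch(rows_to_columns)
def column_means(table):
    """Compute the mean of each column in the converted table."""
    return {name: sum(values) / len(values) for name, values in table.items()}


# The caller passes raw rows; the function body only ever sees the converted form.
means = column_means([{"a": 1, "b": 10}, {"a": 3, "b": 20}])
```

The design benefit is that the decorated function is written once against a single tabular type, while callers remain free to pass whatever raw input the converter understands.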

Backend Differences:

- pandas:

  • Full feature compatibility with named indexes

  • Index names preserved during processing

  • Slower performance on large datasets

  • Extensive ecosystem support

- Polars:

  • High performance (2-100x faster on large datasets)

  • Position-based indexing only (no named indexes)

  • Index names not preserved during backend conversion

  • Limited interpolation support in decorators

  • Growing ecosystem, may have fewer integrations

Choose pandas for: Small datasets, complex index operations, maximum compatibility

Choose Polars for: Large datasets, performance-critical applications, simple workflows
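The guidance above can be written down as a small helper that returns a value for the `backend` argument. This is a sketch under stated assumptions: `choose_backend` is a hypothetical name, and the one-million-row threshold is an arbitrary placeholder for "large", not a figure from the package.

```python
def choose_backend(n_rows, needs_named_index=False, large_threshold=1_000_000):
    """Pick 'pandas' or 'polars' following the rule of thumb above.

    Hypothetical helper; large_threshold is an assumed cutoff, tune it for your data.
    """
    # Named indexes are only supported by the pandas backend.
    if needs_named_index:
        return "pandas"
    # Otherwise prefer Polars once the data is large enough for speed to matter.
    return "polars" if n_rows >= large_threshold else "pandas"


small = choose_backend(5_000)
large = choose_backend(20_000_000)
indexed = choose_backend(20_000_000, needs_named_index=True)
```

The returned string could then be passed as `backend=` to `dw.wrangle`, as in the Basic Usage examples.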

Requirements:

- Python 3.9+
- Optional: Install with [hf] extras for sentence-transformers support

pip install "pydata-wrangler[hf]"
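Because the sentence-transformers support is optional, it can help to check for it before requesting a text model. The sketch below uses only the standard library; `has_module` is a hypothetical helper, and `sentence_transformers` is the import name of the sentence-transformers distribution.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("sentence_transformers"):
    print("sentence-transformers is available")
else:
    print('sentence-transformers not found; try: pip install "pydata-wrangler[hf]"')
```

Using `find_spec` avoids the cost of actually importing a large library just to learn whether it is installed.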

Version: 0.3.0+ (NumPy 2.0+ and pandas 2.0+ compatible)