DataWrangler
============

**Transform messy data into clean pandas/Polars DataFrames with intelligent automation**

DataWrangler is a Python package that converts diverse data formats into clean, analysis-ready DataFrames. Whether you're working with arrays, text, images, or mixed data types, DataWrangler detects the format and applies the appropriate transformation, all through a single, unified API.

🚀 **New**: High-performance Polars backend support, with reported speedups of 2-100x depending on the workload!

Why DataWrangler?
-----------------

**🎯 Intelligent Automation**
  No more manual data preprocessing. DataWrangler automatically detects data types and applies appropriate transformations.

**⚡ High Performance**
  Choose between pandas (familiar) and Polars (fast) backends. Switch with a single ``backend`` argument or a global setting.

**🔧 Unified API**
  One simple function handles arrays, text, images, files, URLs, and mixed data types.

**📊 Research-Ready**
  Built for data science workflows with advanced text processing, embeddings, and ML preprocessing.

**🛡️ Production-Tested**
  Robust error handling, comprehensive tests, and real-world use in research environments.

Quick Start Examples
--------------------

**Basic Data Wrangling**

.. code-block:: python

   import datawrangler as dw
   import numpy as np

   # Arrays become DataFrames automatically
   array_data = np.random.rand(1000, 5)
   df = dw.wrangle(array_data)
   print(df.head())

**High-Performance with Polars**

.. code-block:: python

   # Same operation on the Polars backend (reuses dw and array_data from above)
   fast_df = dw.wrangle(array_data, backend='polars')
   
   # Set global backend preference
   from datawrangler.core.configurator import set_dataframe_backend
   set_dataframe_backend('polars')  # All operations now use Polars

**Advanced Text Processing**

.. code-block:: python

   # Text documents become embedding vectors
   documents = [
       "Machine learning transforms data into insights",
       "Data science combines statistics with programming",
       "AI enables automated decision-making systems"
   ]
   
   # Automatic text embeddings with state-of-the-art models
   text_df = dw.wrangle(documents)
   print(f"Embedded {len(documents)} documents into {text_df.shape} DataFrame")
   
   # Use modern transformer models for better quality
   sentence_model = {'model': 'all-mpnet-base-v2'}
   embeddings = dw.wrangle(documents, text_kwargs={'model': sentence_model})

**Mixed Data Types in One Call**

.. code-block:: python

   # Process multiple data types simultaneously
   mixed_data = [
       np.random.rand(500, 10),           # NumPy array
       "path/to/image.jpg",               # Image file
       documents,                         # Text documents
       "https://api.example.com/data.csv" # Remote CSV
   ]
   
   results = dw.wrangle(mixed_data, return_dtype=True)
   dataframes, detected_types = results
   
   for df, dtype in zip(dataframes, detected_types):
       print(f"{dtype}: {df.shape}")
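
To make the idea of format detection concrete, here is a minimal sketch of how a dispatcher might classify inputs before converting them. It is not DataWrangler's implementation: the function ``guess_kind``, its category labels, and the ``FILE_SUFFIXES`` set are hypothetical, and the sketch uses only the standard library.

.. code-block:: python

   from pathlib import PurePath
   from urllib.parse import urlparse

   # Hypothetical set of extensions treated as files; not DataWrangler's list.
   FILE_SUFFIXES = {".csv", ".json", ".txt", ".jpg", ".png"}


   def guess_kind(obj):
       """Return a coarse label: 'url', 'file', 'text', 'sequence', or 'unknown'."""
       if isinstance(obj, str):
           parsed = urlparse(obj)
           if parsed.scheme in ("http", "https") and parsed.netloc:
               return "url"
           if PurePath(obj).suffix.lower() in FILE_SUFFIXES:
               return "file"
           return "text"
       if isinstance(obj, (list, tuple)):
           return "sequence"
       return "unknown"


   samples = [
       "https://api.example.com/data.csv",
       "path/to/image.jpg",
       "Data science combines statistics with programming",
       [1.0, 2.0],
   ]
   for sample in samples:
       print(guess_kind(sample))

A string-only check like this is deliberately naive. A real detector would likely also need to inspect file contents and array types, which this sketch leaves out.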

**Function Decoration for Seamless Integration**

.. code-block:: python

   from datawrangler.decorate import funnel
   
   @funnel  # Automatically converts inputs to DataFrames
   def analyze_data(df):
       """Your function works with any data type now!"""
       return df.describe()
   
   # Works with arrays, text, files - anything!
   stats = analyze_data(array_data)      # NumPy array
   text_stats = analyze_data(documents)  # Text documents
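
The general pattern behind an input-coercing decorator can be sketched in a few lines. This is an illustration of the technique only, not the source of ``funnel``: the names ``coerce_first_arg``, ``to_rows``, and ``count_cells`` are hypothetical, and plain lists stand in for DataFrames so the example runs without extra dependencies.

.. code-block:: python

   import functools


   def coerce_first_arg(convert):
       """Build a decorator that passes the first positional argument through convert."""
       def decorator(func):
           @functools.wraps(func)  # keep the wrapped function's name and docstring
           def wrapper(data, *args, **kwargs):
               return func(convert(data), *args, **kwargs)
           return wrapper
       return decorator


   def to_rows(data):
       """Hypothetical converter: turn a flat sequence into one-element rows."""
       if data and not isinstance(data[0], (list, tuple)):
           return [[value] for value in data]
       return [list(row) for row in data]


   @coerce_first_arg(to_rows)
   def count_cells(rows):
       """Count the cells in a row-oriented table."""
       return sum(len(row) for row in rows)


   print(count_cells([1, 2, 3]))         # flat input, treated as three rows
   print(count_cells([[1, 2], [3, 4]]))  # already row-oriented

The decorated function only ever sees the converted form, which is why its body can stay free of input-handling code.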

Common Use Cases
----------------

**🔬 Research & Academia**
  * Literature analysis and text mining
  * Experimental data processing
  * Multi-modal data integration
  * Reproducible research pipelines

**💼 Business Intelligence**
  * Customer feedback analysis
  * Sales data aggregation
  * Performance monitoring dashboards
  * Cross-platform data integration

**🤖 Machine Learning**
  * Feature engineering automation
  * Text preprocessing for NLP models
  * Multi-source data fusion
  * Model input preparation

**📈 Data Engineering**
  * ETL pipeline simplification
  * Real-time data processing
  * Data lake preprocessing
  * Format standardization

Performance Benefits
--------------------

With the Polars backend, DataWrangler can be substantially faster than with pandas. The snippet below times the same conversion on both backends:

.. code-block:: python

   import time

   import numpy as np
   import datawrangler as dw

   # Large dataset example
   large_array = np.random.rand(100000, 50)

   # Pandas backend (traditional)
   start = time.perf_counter()
   pandas_df = dw.wrangle(large_array, backend='pandas')
   pandas_time = time.perf_counter() - start

   # Polars backend (high-performance)
   start = time.perf_counter()
   polars_df = dw.wrangle(large_array, backend='polars')
   polars_time = time.perf_counter() - start

   speedup = pandas_time / polars_time
   print(f"Speedup with Polars: {speedup:.1f}x")
   # Reported result: 50-100x speedup for large arrays (varies with hardware and data)

**Reported performance gains** (results vary with data size and hardware):

* **Array processing**: 2-100x faster conversion
* **Text embeddings**: 3-10x faster document processing
* **Aggregations**: 5-50x faster group-by operations
* **Memory usage**: 30-70% reduction for large datasets
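
Single-shot timings are noisy, so it is worth repeating a measurement before comparing it with the figures above. The helper below is a minimal sketch using only the standard library. ``best_of`` is a hypothetical name, and the workload is a stand-in to be replaced with your own call.

.. code-block:: python

   import time


   def best_of(func, repeats=5):
       """Return the shortest wall-clock time, in seconds, over several calls to func."""
       timings = []
       for _ in range(repeats):
           start = time.perf_counter()
           func()
           timings.append(time.perf_counter() - start)
       return min(timings)


   # Stand-in workload; replace the lambda with a call such as
   # lambda: dw.wrangle(large_array, backend='polars')
   elapsed = best_of(lambda: sum(range(100_000)))
   print(f"best of 5: {elapsed:.6f} s")

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the result.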

Getting Started
---------------

1. **Installation**::

    pip install pydata-wrangler

2. **Optional Hugging Face dependencies** (for transformer-based text embeddings)::

    pip install "pydata-wrangler[hf]"  # Adds transformers, sentence-transformers

3. **Start wrangling**::

    import datawrangler as dw
    df = dw.wrangle(your_data)
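
To check which optional packages are present before relying on them, a small standard-library probe is enough. This is a sketch, not part of DataWrangler: ``is_installed`` is a hypothetical helper, and the import names ``polars``, ``transformers``, and ``sentence_transformers`` are assumed from the packages mentioned above.

.. code-block:: python

   from importlib.util import find_spec


   def is_installed(module_name):
       """Return True if module_name can be found in the current environment."""
       return find_spec(module_name) is not None


   # Import names assumed from the packages mentioned above.
   for name in ("polars", "transformers", "sentence_transformers"):
       status = "available" if is_installed(name) else "missing"
       print(f"{name}: {status}")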

Documentation Contents
----------------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started:

   installation
   readme
   migration_guide

.. toctree::
   :maxdepth: 1
   :caption: User Guide:

   tutorials
   api

.. toctree::
   :maxdepth: 1
   :caption: Development:

   contributing
   authors
   history

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`