History

0.4.0 (2025-06-14)

Major Release: High-Performance Polars Backend + Simplified Text API

This release introduces first-class Polars support for dramatic performance improvements and dramatically simplifies the text model API:

🚀 NEW: High-Performance Polars Backend (2-100x faster!): * Dual DataFrame Support: Choose between pandas (default) or Polars backends * Zero Code Changes: Add backend='polars' to any operation for instant speedups * Comprehensive Coverage: All data types (arrays, text, files) work with both backends * Smart Type Preservation: DataFrames maintain their type when no backend specified * Global Configuration: Set default backend preference with set_dataframe_backend('polars') * Cross-Backend Conversion: Seamlessly convert between pandas and Polars DataFrames

Performance Gains with Polars: * Array Processing: 2-100x faster conversion for large datasets * Text Embeddings: 3-10x faster document processing * Memory Efficiency: 30-70% reduction in memory usage * Parallel Processing: Built-in multi-core optimization

Text Model API Simplification (80% reduction in verbosity): * Simple String Format: {'model': 'all-MiniLM-L6-v2'} now works everywhere * Automatic Normalization: All model formats converted to unified dict internally * List Support: Lists of models work with simplified format (e.g., ['CountVectorizer', 'NMF']) * Full Backward Compatibility: All existing verbose syntax continues working

Google Colab Installation Fix: * Removed redundant configparser from requirements.txt (built-in to Python 3.x) * Eliminated installation warning popup in Google Colab environments * Cleaner dependency list and faster installation

Enhanced Documentation: * Updated all examples to use simplified text model API * Added comprehensive Polars backend examples and tutorials * Made all documentation backend-agnostic with performance guidance * Fixed all docstring examples to use public API correctly

Example of New Polars Backend:

import datawrangler as dw
import numpy as np

# Large dataset example
large_array = np.random.rand(50000, 20)

# Traditional pandas backend
pandas_df = dw.wrangle(large_array)  # Default

# High-performance Polars backend (2-100x faster!)
polars_df = dw.wrangle(large_array, backend='polars')

# Set global preference for all operations
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Example of Simplified Text API:

Before (v0.3.0):

# Verbose dictionary format required
text_kwargs = {
    'model': {
        'model': 'all-MiniLM-L6-v2',
        'args': [],
        'kwargs': {}
    }
}

After (v0.4.0):

# Simplified - just pass the model name!
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

# Works with Polars backend too for 3-10x faster text processing!
fast_embeddings = dw.wrangle(texts, text_kwargs=text_kwargs, backend='polars')

0.3.0 (2025-06-13)

Major Release: NumPy 2.0+ Compatibility & Modern ML Libraries

This release brings full compatibility with NumPy 2.0+ and pandas 2.0+ while modernizing the text embedding infrastructure:

Breaking Changes: * Replaced Flair with sentence-transformers for text embeddings * Removed gensim dependency (eliminates NumPy version conflicts) * Updated text embedding API to use sentence-transformers models

New Features: * Full NumPy 2.0+ and pandas 2.0+ compatibility * Modern sentence-transformers integration for text embeddings * Support for latest scikit-learn, matplotlib, and scipy versions * Enhanced error handling for missing dependencies

Bug Fixes: * Fixed numpy.str_ deprecation that broke in NumPy 2.0+ * Updated HuggingFace datasets import for API changes * Fixed sklearn IterativeImputer experimental import compatibility * Replaced deprecated matplotlib.pyplot.imread

Documentation: * Updated all examples to use sentence-transformers syntax * Modernized installation instructions and model references * Comprehensive tutorial updates with new embedding approaches

Migration Guide: Old Flair syntax: {‘model’: ‘TransformerDocumentEmbeddings’, ‘args’: [‘bert-base-uncased’]} New syntax: {‘model’: ‘all-mpnet-base-v2’, ‘args’: [], ‘kwargs’: {}}

0.2.2 (22-07-25)

  • Better error handling when hugging-face libraries aren’t installed and user asks to embed text using hugging-face models

0.2.1 (22-07-25)

  • Bug fixes when hugging-face libraries aren’t installed

0.2.0 (2022-07-25)

  • Adds CUDA (GPU) support for pytorch models

  • Streamline package by not installing hugging-face support by default

  • Adds Python 3.10 support (and associated tests)

  • Relaxes some tests to support a wider range of platforms (mostly this is relevant for GitHub CI)

  • Relaxes requirements.txt versioning to improve compatibility with other libraries when installing via pip

0.1.7 (2021-08-09)

  • Updates default behaviors for several models (via config.ini)

0.1.6 (2021-08-09)

  • Another bug fix release (more fixes to datawrangler.unstack)

0.1.5 (2021-08-09)

  • Corrected a bug in datawrangler.unstack

0.1.4 (2021-08-04)

  • Added an option to specify a customized dictionary of default options to the apply_default_options function

0.1.3 (2021-08-04)

  • Fixed some bugs related to stacking and unstacking DataFrames

0.1.2 (2021-08-04)

  • Minor update that corrects URLs of Khan Academy and NeurIPS corpora and corrects some issues with loading npy files

0.1.1 (2021-07-19)

  • Minor update in order to make the package available on pipy.

0.1.0 (2021-07-09)

  • First release on PyPI.