History

0.5.1 (2026-07-04)

Bug-fix release: Windows import crash + PyPI project page

Fixed import datawrangler crashing on Windows (issue #32). config.ini resolved the home directory via os.getenv('HOME'), which is unset on Windows (it uses USERPROFILE), so the data-cache path evaluated to os.path.join(None, ...) and raised TypeError at import time. The config now uses os.path.expanduser('~'), which resolves correctly on all platforms. Added a regression test that imports datawrangler in a fresh subprocess with HOME / USERPROFILE stripped from the environment.
The PyPI project page now displays the full README instead of a bare link to the documentation (setup.py embeds README.rst as the package’s long_description).

0.5.0 (2026-07-03)

Bug-fix release: robust remote caching + correctness fixes

Fixed: remote file caching (the headline fix):

get_extension now strips URL query strings (?dl=1) and fragments (#...) before detecting the file type. Previously, downloading a URL like https://.../data.npz?dl=1 (Dropbox / Google Drive share links, including the built-in text corpora) cached the file under a polluted name (<hash>.npz?dl=1). That cached copy could not be re-read by extension (ValueError: Unknown datatype: npz?dl=1), so the dataset was effectively re-downloaded / re-cached instead of reused. Remote files now cache under a clean, stable name and re-loads reuse the cached copy with no additional download.

Other correctness fixes:

core.configurator.apply_defaults no longer raises KeyError for functions/classes that have no section in config.ini (this had broken sklearn models such as TSNE, MDS, Isomap).
decorate.apply_stacked no longer crashes with TypeError when the wrapped function returns a (data, model) tuple (return_model=True) on unstacked input.
zoo.is_multiindex_dataframe no longer raises AttributeError on Polars DataFrames/LazyFrames (which have no .index).
io.load now correctly unwraps a single-array .npz file to the underlying array.
Fixed missing f-string prefixes in two zoo.text error messages.

Tests & docs:

Added regression tests for the caching fix and deeper tests for the return_model contract, the Polars helpers, backend configuration, and io.save round-trips.
Corrected stale version strings and API-reference stubs (backend functions, the Polars module).

0.4.0 (2025-06-14)

Major Release: High-Performance Polars Backend + Simplified Text API

This release introduces first-class Polars support for dramatic performance improvements and dramatically simplifies the text model API:

🚀 NEW: High-Performance Polars Backend (2-100x faster!): * Dual DataFrame Support: Choose between pandas (default) or Polars backends * Zero Code Changes: Add backend='polars' to any operation for instant speedups * Comprehensive Coverage: All data types (arrays, text, files) work with both backends * Smart Type Preservation: DataFrames maintain their type when no backend specified * Global Configuration: Set default backend preference with set_dataframe_backend('polars') * Cross-Backend Conversion: Seamlessly convert between pandas and Polars DataFrames

Performance Gains with Polars: * Array Processing: 2-100x faster conversion for large datasets * Text Embeddings: 3-10x faster document processing * Memory Efficiency: 30-70% reduction in memory usage * Parallel Processing: Built-in multi-core optimization

Text Model API Simplification (80% reduction in verbosity): * Simple String Format: {'model': 'all-MiniLM-L6-v2'} now works everywhere * Automatic Normalization: All model formats converted to unified dict internally * List Support: Lists of models work with simplified format (e.g., ['CountVectorizer', 'NMF']) * Full Backward Compatibility: All existing verbose syntax continues working

Google Colab Installation Fix: * Removed redundant configparser from requirements.txt (built-in to Python 3.x) * Eliminated installation warning popup in Google Colab environments * Cleaner dependency list and faster installation

Enhanced Documentation: * Updated all examples to use simplified text model API * Added comprehensive Polars backend examples and tutorials * Made all documentation backend-agnostic with performance guidance * Fixed all docstring examples to use public API correctly

Example of New Polars Backend:

import datawrangler as dw
import numpy as np

# Large dataset example
large_array = np.random.rand(50000, 20)

# Traditional pandas backend
pandas_df = dw.wrangle(large_array)  # Default

# High-performance Polars backend (2-100x faster!)
polars_df = dw.wrangle(large_array, backend='polars')

# Set global preference for all operations
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Example of Simplified Text API:

Before (v0.3.0):

# Verbose dictionary format required
text_kwargs = {
    'model': {
        'model': 'all-MiniLM-L6-v2',
        'args': [],
        'kwargs': {}
    }
}

After (v0.4.0):

# Simplified - just pass the model name!
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

# Works with Polars backend too for 3-10x faster text processing!
fast_embeddings = dw.wrangle(texts, text_kwargs=text_kwargs, backend='polars')

0.3.0 (2025-06-13)

Major Release: NumPy 2.0+ Compatibility & Modern ML Libraries

This release brings full compatibility with NumPy 2.0+ and pandas 2.0+ while modernizing the text embedding infrastructure:

Breaking Changes: * Replaced Flair with sentence-transformers for text embeddings * Removed gensim dependency (eliminates NumPy version conflicts) * Updated text embedding API to use sentence-transformers models

New Features: * Full NumPy 2.0+ and pandas 2.0+ compatibility * Modern sentence-transformers integration for text embeddings * Support for latest scikit-learn, matplotlib, and scipy versions * Enhanced error handling for missing dependencies

Bug Fixes: * Fixed numpy.str_ deprecation that broke in NumPy 2.0+ * Updated HuggingFace datasets import for API changes * Fixed sklearn IterativeImputer experimental import compatibility * Replaced deprecated matplotlib.pyplot.imread

Documentation: * Updated all examples to use sentence-transformers syntax * Modernized installation instructions and model references * Comprehensive tutorial updates with new embedding approaches

Migration Guide: Old Flair syntax: {‘model’: ‘TransformerDocumentEmbeddings’, ‘args’: [‘bert-base-uncased’]} New syntax: {‘model’: ‘all-mpnet-base-v2’, ‘args’: [], ‘kwargs’: {}}

0.2.2 (22-07-25)

Better error handling when hugging-face libraries aren’t installed and user asks to embed text using hugging-face models

0.2.1 (22-07-25)

Bug fixes when hugging-face libraries aren’t installed

0.2.0 (2022-07-25)

Adds CUDA (GPU) support for pytorch models
Streamline package by not installing hugging-face support by default
Adds Python 3.10 support (and associated tests)
Relaxes some tests to support a wider range of platforms (mostly this is relevant for GitHub CI)
Relaxes requirements.txt versioning to improve compatibility with other libraries when installing via pip

0.1.7 (2021-08-09)

Updates default behaviors for several models (via config.ini)

0.1.6 (2021-08-09)

Another bug fix release (more fixes to datawrangler.unstack)

0.1.5 (2021-08-09)

Corrected a bug in datawrangler.unstack

0.1.4 (2021-08-04)

Added an option to specify a customized dictionary of default options to the apply_default_options function

0.1.3 (2021-08-04)

Fixed some bugs related to stacking and unstacking DataFrames

0.1.2 (2021-08-04)

Minor update that corrects URLs of Khan Academy and NeurIPS corpora and corrects some issues with loading npy files

0.1.1 (2021-07-19)

Minor update in order to make the package available on pipy.

0.1.0 (2021-07-09)

First release on PyPI.