Migration Guide: v0.2 → v0.3

This guide helps you migrate from data-wrangler v0.2.x to v0.3.0, which includes significant modernization and breaking changes.

Note

This guide covers the v0.2.x → v0.3.0 migration. Later releases are backward compatible and require no code changes:

v0.4.0 added the optional high-performance Polars backend (dw.wrangle(..., backend='polars')) and a simplified text-model API (e.g. text_kwargs={'model': 'all-MiniLM-L6-v2'}).
v0.5.0 fixed remote-file caching for URLs with query strings (e.g. Dropbox ?dl=1 links); no API changes.

Overview of Changes 

Version 0.3.0 represents a major modernization of data-wrangler with focus on:

Modern Python Support: Requires Python 3.9+ (dropped support for 3.6-3.8)
NumPy 2.0+ Compatibility: Full support for latest NumPy versions
Pandas 2.0+ Compatibility: Updated for modern pandas API
Text Processing Overhaul: Migrated from Flair to sentence-transformers
Dependency Cleanup: Removed conflicting and deprecated libraries

Breaking Changes 

Python Version Requirements 

Before (v0.2.x):

# Supported Python 3.6, 3.7, 3.8, 3.9, 3.10
python_requires=">=3.6"

After (v0.3.0):

# Requires Python 3.9+
python_requires=">=3.9"

Migration: Upgrade to Python 3.9 or later before installing v0.3.0.

Text Embedding Models 

The most significant change is the migration from Flair to sentence-transformers for text embeddings.

Before (v0.2.x):

# Old Flair syntax
flair_model = {
    'model': 'TransformerDocumentEmbeddings',
    'args': ['bert-base-uncased']
}

embeddings = dw.wrangle(texts, text_kwargs={'model': flair_model})

After (v0.3.0):

# New sentence-transformers syntax
sentence_model = {
    'model': 'all-mpnet-base-v2',
    'args': [],
    'kwargs': {}
}

# Or simplified:
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})

Model Name Mappings

Here are recommended migrations for common Flair models:

Flair → Sentence-Transformers Migration
Old Flair Model	New Sentence-Transformers Model
`TransformerDocumentEmbeddings('bert-base-uncased')`	`'all-MiniLM-L6-v2'` (fast) or `'all-mpnet-base-v2'` (quality)
`TransformerDocumentEmbeddings('roberta-base')`	`'all-distilroberta-v1'`
Custom transformer models	Use model name directly from HuggingFace Hub

Installation Changes 

Before (v0.2.x):

pip install data-wrangler

After (v0.3.0):

# Basic installation (sklearn text processing only)
pip install pydata-wrangler

# Full installation with sentence-transformers support
pip install "pydata-wrangler[hf]"

Note: The package name on PyPI is now pydata-wrangler to avoid conflicts.

Removed Dependencies 

The following dependencies were removed in v0.3.0:

flair - Replaced with sentence-transformers
gensim - Caused NumPy version conflicts
konoha - Unused Japanese tokenizer
pytorch-transformers - Renamed to transformers
pytorch-pretrained-bert - Replaced by transformers

If your code relied on these libraries directly, you’ll need to install them separately.

Step-by-Step Migration 

1. Check Python Version 

Ensure you’re using Python 3.9 or later:

python --version
# Should show 3.9.x or higher

2. Update Installation 

Uninstall old version and install new:

pip uninstall data-wrangler
pip install "pydata-wrangler[hf]"

3. Update Text Processing Code 

Replace Flair model specifications:

# OLD - Replace this
old_model = {'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased']}

# NEW - With this
new_model = 'all-mpnet-base-v2'  # or {'model': 'all-mpnet-base-v2', 'args': [], 'kwargs': {}}

4. Test Your Code 

Run your existing code to identify any remaining issues:

python -m pytest tests/  # If you have tests
python your_script.py     # Test your main scripts

Common Migration Issues 

Issue: Import Errors 

Problem:

ImportError: No module named 'flair'

Solution: Remove any direct Flair imports and use data-wrangler’s text processing instead:

# Remove this
from flair.embeddings import TransformerDocumentEmbeddings

# Use this instead
import datawrangler as dw
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})

Issue: Model Configuration Errors 

Problem:

ValueError: Model 'TransformerDocumentEmbeddings' not found

Solution: Update model specifications to use sentence-transformers model names.

Issue: Performance Differences 

Sentence-transformers models may have different performance characteristics than Flair models:

Speed: sentence-transformers is generally faster
Memory: Model memory usage may differ
Accuracy: Results may vary slightly due to different model architectures

Test your specific use case and adjust model choices if needed.

Recommended Model Choices 

For Different Use Cases 

Model Recommendations
Use Case	Fast Option	High Quality Option
General text similarity	`all-MiniLM-L6-v2`	`all-mpnet-base-v2`
Semantic search	`all-MiniLM-L6-v2`	`all-mpnet-base-v2`
Paraphrase detection	`paraphrase-MiniLM-L6-v2`	`paraphrase-mpnet-base-v2`
Multi-language	`paraphrase-multilingual-MiniLM-L12-v2`	`paraphrase-multilingual-mpnet-base-v2`

Performance Comparison 

Approximate performance characteristics:

all-MiniLM-L6-v2: 384 dimensions, ~120MB, fastest
all-mpnet-base-v2: 768 dimensions, ~420MB, highest quality
paraphrase-MiniLM-L6-v2: 384 dimensions, optimized for similarity

☐ Python version is 3.9+ ☐ Installation successful: pip show pydata-wrangler ☐ Basic functionality: import datawrangler as dw; dw.wrangle([1,2,3]) ☐ Text processing: dw.wrangle(["test"], text_kwargs={'model': 'all-MiniLM-L6-v2'}) ☐ Your specific use cases still work correctly ☐ Performance is acceptable for your needs ☐ Results are consistent with your expectations

Sample Test Script 

Use this script to validate your migration:

import datawrangler as dw
import numpy as np

# Test basic functionality
print("Testing basic array wrangling...")
result = dw.wrangle(np.random.randn(5, 3))
print(f"Array result shape: {result.shape}")

# Test text processing
print("\\nTesting text processing...")
texts = ["Hello world", "Data science is great"]
text_result = dw.wrangle(texts, text_kwargs={'model': 'all-MiniLM-L6-v2'})
print(f"Text result shape: {text_result.shape}")

# Test decorator functionality
print("\\nTesting @funnel decorator...")
@dw.funnel
def compute_mean(data):
    return data.mean().mean()

mean_result = compute_mean([1, 2, 3, 4, 5])
print(f"Mean result: {mean_result}")

print("\\nMigration validation complete!")

Getting Help 

If you encounter issues during migration:

Check the documentation: Updated examples in tutorials
Review error messages: Often contain specific guidance
Test with simple examples: Isolate the problem
Compare v0.2 vs v0.3 behavior: Use the examples above

For additional support:

GitHub Issues: https://github.com/ContextLab/data-wrangler/issues
Documentation: https://data-wrangler.readthedocs.io/
Examples: See the tutorials for v0.3.0 patterns

Benefits of v0.3.0 

While migration requires some work, v0.3.0 provides significant benefits:

✅ Better Performance: Modern dependencies and optimizations ✅ Future-Proof: Compatible with latest Python ecosystem ✅ Improved Models: Access to state-of-the-art sentence-transformers ✅ Cleaner Dependencies: Removed conflicts and deprecated packages ✅ Better Maintenance: Built on actively maintained libraries ✅ Enhanced Documentation: Comprehensive tutorials and examples

The migration effort pays off with a more robust, performant, and maintainable codebase.

Migration Guide: v0.2 → v0.3

Overview of Changes 

Breaking Changes 

Python Version Requirements 

Text Embedding Models 

Model Name Mappings

Installation Changes 

Removed Dependencies 

Step-by-Step Migration 

1. Check Python Version 

2. Update Installation 

3. Update Text Processing Code 

4. Test Your Code 

Common Migration Issues 

Issue: Import Errors 

Issue: Model Configuration Errors 

Issue: Performance Differences 

Recommended Model Choices 

For Different Use Cases 

Performance Comparison 

Testing Your Migration 

Validation Checklist 

Sample Test Script 

Getting Help 

Benefits of v0.3.0 