Migration Guide: v0.2 → v0.3
This guide helps you migrate from data-wrangler v0.2.x to v0.3.0, which includes significant modernization and breaking changes.
Overview of Changes
Version 0.3.0 is a major modernization of data-wrangler, with a focus on:
Modern Python Support: Requires Python 3.9+ (dropped support for 3.6-3.8)
NumPy 2.0+ Compatibility: Full support for latest NumPy versions
Pandas 2.0+ Compatibility: Updated for modern pandas API
Text Processing Overhaul: Migrated from Flair to sentence-transformers
Dependency Cleanup: Removed conflicting and deprecated libraries
Breaking Changes
Python Version Requirements
Before (v0.2.x):
# Supported Python 3.6, 3.7, 3.8, 3.9, 3.10
python_requires=">=3.6"
After (v0.3.0):
# Requires Python 3.9+
python_requires=">=3.9"
Migration: Upgrade to Python 3.9 or later before installing v0.3.0.
Text Embedding Models
The most significant change is the migration from Flair to sentence-transformers for text embeddings.
Before (v0.2.x):
# Old Flair syntax
flair_model = {
'model': 'TransformerDocumentEmbeddings',
'args': ['bert-base-uncased']
}
embeddings = dw.wrangle(texts, text_kwargs={'model': flair_model})
After (v0.3.0):
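# New sentence-transformers syntax
sentence_model = {
'model': 'all-mpnet-base-v2',
'args': [],
'kwargs': {}
}
embeddings = dw.wrangle(texts, text_kwargs={'model': sentence_model})
# Or simplified: pass the model name as a string
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})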
Model Name Mappings
Here are recommended migrations for common Flair models:
| Old Flair Model | New Sentence-Transformers Model |
|---|---|
| Custom transformer models | Use model name directly from HuggingFace Hub |
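If you have many old specifications to update, the rewrite can be sketched as a small helper. The names `migrate_model_spec`, `DEFAULT_MODEL`, and `overrides` are hypothetical and not part of data-wrangler. The default of all-mpnet-base-v2 simply mirrors the example used earlier in this guide, and the helper assumes that a plain string is already a sentence-transformers model name.

```python
# Hypothetical helper (not part of data-wrangler): rewrite a v0.2-style
# Flair model spec into a v0.3-style sentence-transformers spec.

DEFAULT_MODEL = "all-mpnet-base-v2"  # the model this guide uses in its examples


def migrate_model_spec(spec, overrides=None):
    """Return a v0.3-style spec dict for an old Flair-style spec.

    `overrides` optionally maps an old model name (the first entry of the
    old 'args' list) to the sentence-transformers model chosen to replace it.
    """
    overrides = overrides or {}

    # Assumption: a plain string is already a sentence-transformers model name.
    if isinstance(spec, str):
        return {"model": spec, "args": [], "kwargs": {}}

    old_args = spec.get("args", [])
    old_name = old_args[0] if old_args else None
    new_name = overrides.get(old_name, DEFAULT_MODEL)
    return {"model": new_name, "args": [], "kwargs": {}}


old_model = {"model": "TransformerDocumentEmbeddings", "args": ["bert-base-uncased"]}
print(migrate_model_spec(old_model))
```

Pass your own `overrides` dictionary when a specific old model should map to something other than the default.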
Installation Changes
Before (v0.2.x):
pip install data-wrangler
After (v0.3.0):
# Basic installation (sklearn text processing only)
pip install pydata-wrangler
# Full installation with sentence-transformers support
pip install "pydata-wrangler[hf]"
Note: The package name on PyPI is now pydata-wrangler to avoid conflicts. The import name used throughout this guide is datawrangler (import datawrangler as dw).
Removed Dependencies
The following dependencies were removed in v0.3.0:
flair: Replaced with sentence-transformers
gensim: Caused NumPy version conflicts
konoha: Unused Japanese tokenizer
pytorch-transformers: Renamed to transformers
pytorch-pretrained-bert: Replaced by transformers
If your code relied on these libraries directly, you’ll need to install them separately.
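To see which of these libraries are still present in your environment, a quick check along the following lines may help. This is a minimal sketch: the import names in `REMOVED_MODULES` are assumptions based on each project's usual module name, and `find_installed` is a hypothetical helper rather than anything data-wrangler provides.

```python
import importlib.util

# Assumed import names for the removed dependencies listed above.
REMOVED_MODULES = [
    "flair",
    "gensim",
    "konoha",
    "pytorch_transformers",
    "pytorch_pretrained_bert",
]


def find_installed(module_names):
    """Return the subset of module_names that can be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is not None
    ]


still_installed = find_installed(REMOVED_MODULES)
print("Removed dependencies still installed:", still_installed or "none")
```

A module that shows up here was installed by something other than data-wrangler v0.3.0, so your own code can keep importing it.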
Step-by-Step Migration
1. Check Python Version
Ensure you’re using Python 3.9 or later:
python --version
# Should show 3.9.x or higher
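The same check can be made from inside Python, which is handy in setup scripts or CI. This is a minimal sketch; `python_is_supported` and `MIN_PYTHON` are illustrative names, not part of data-wrangler.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version required by v0.3.0, per this guide


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python 3.9+ is required; found", sys.version.split()[0])
```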
2. Update Installation
Uninstall the old version and install the new one:
pip uninstall data-wrangler
pip install "pydata-wrangler[hf]"
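To confirm which distribution ended up installed, you can query the package metadata from Python. The sketch below uses the standard-library `importlib.metadata` module; `installed_version` is a hypothetical helper name, and the two distribution names are the ones used in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("data-wrangler", "pydata-wrangler"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

After a successful migration you would expect the old name to be reported as not installed and the new name to report a 0.3.x version.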
3. Update Text Processing Code
Replace Flair model specifications:
# OLD - Replace this
old_model = {'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased']}
# NEW - With this
new_model = 'all-mpnet-base-v2' # or {'model': 'all-mpnet-base-v2', 'args': [], 'kwargs': {}}
4. Test Your Code
Run your existing code to identify any remaining issues:
python -m pytest tests/ # If you have tests
python your_script.py # Test your main scripts
Common Migration Issues
Issue: Import Errors
Problem:
ModuleNotFoundError: No module named 'flair'
Solution: Remove any direct Flair imports and use data-wrangler’s text processing instead:
# Remove this
from flair.embeddings import TransformerDocumentEmbeddings
# Use this instead
import datawrangler as dw
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})
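In a larger codebase it can help to locate such imports automatically before running anything. The following is a sketch using the standard-library `ast` module; `find_removed_imports` and `REMOVED_TOP_LEVEL` are hypothetical names, and the module names in the set are assumed import names for the removed dependencies.

```python
import ast

# Assumed top-level import names of the dependencies removed in v0.3.0.
REMOVED_TOP_LEVEL = {
    "flair",
    "gensim",
    "konoha",
    "pytorch_transformers",
    "pytorch_pretrained_bert",
}


def find_removed_imports(source, removed=REMOVED_TOP_LEVEL):
    """Return sorted (line_number, module_name) pairs for imports of removed packages."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. 'flair' in 'flair.embeddings'.
            if name.split(".")[0] in removed:
                hits.append((node.lineno, name))
    return sorted(hits)


example = "from flair.embeddings import TransformerDocumentEmbeddings\nimport numpy as np\n"
print(find_removed_imports(example))
```

To scan a real file, read its text and pass it in as `source`.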
Issue: Model Configuration Errors
Problem:
ValueError: Model 'TransformerDocumentEmbeddings' not found
Solution: Update model specifications to use sentence-transformers model names.
Issue: Performance Differences
Sentence-transformers models may have different performance characteristics than Flair models:
Speed: sentence-transformers is generally faster
Memory: Model memory usage may differ
Accuracy: Results may vary slightly due to different model architectures
Test your specific use case and adjust model choices if needed.
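One way to compare speed on your own data is a small timing harness. The sketch below is an illustration, not a data-wrangler feature: `time_embedding` is a hypothetical helper, and `fake_embed` is a stand-in so the example runs without downloading a model.

```python
import time


def time_embedding(embed_fn, texts, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to embed_fn(texts)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        embed_fn(texts)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in embedding function (one number per text) used only for this demo.
# In real use, pass something like:
#   lambda t: dw.wrangle(t, text_kwargs={'model': 'all-MiniLM-L6-v2'})
def fake_embed(texts):
    return [[float(len(t))] for t in texts]


elapsed = time_embedding(fake_embed, ["Hello world", "Data science is great"])
print(f"Best of 3 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from one-off costs such as model loading, so time the first call separately if startup cost matters to you.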
Recommended Model Choices
For Different Use Cases
| Use Case | Fast Option | High Quality Option |
|---|---|---|
| General text similarity | | |
| Semantic search | | |
| Paraphrase detection | | |
| Multi-language | | |
Performance Comparison
Approximate performance characteristics:
all-MiniLM-L6-v2: 384 dimensions, ~120MB, fastest
all-mpnet-base-v2: 768 dimensions, ~420MB, highest quality
paraphrase-MiniLM-L6-v2: 384 dimensions, optimized for similarity
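The embedding dimension also determines how much space your results take. As a rough sketch, assuming embeddings are stored as 32-bit floats (an assumption; your pipeline may use another dtype), the size of the output matrix can be estimated as follows. `embedding_megabytes` is a hypothetical helper, and the figure covers the embeddings only, not the model itself.

```python
def embedding_megabytes(n_texts, dimensions, bytes_per_value=4):
    """Estimate storage for an n_texts x dimensions matrix (float32 by default)."""
    return n_texts * dimensions * bytes_per_value / 1_000_000


for model, dims in (("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)):
    size = embedding_megabytes(100_000, dims)
    print(f"{model}: {size:.1f} MB for 100,000 texts")
```

Doubling the dimension doubles the storage, which is worth weighing when you embed large corpora.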
Testing Your Migration
Validation Checklist
After migration, verify:
☐ Python version is 3.9+
☐ Installation successful: pip show pydata-wrangler
☐ Basic functionality: import datawrangler as dw; dw.wrangle([1,2,3])
☐ Text processing: dw.wrangle(["test"], text_kwargs={'model': 'all-MiniLM-L6-v2'})
☐ Your specific use cases still work correctly
☐ Performance is acceptable for your needs
☐ Results are consistent with your expectations
Sample Test Script
Use this script to validate your migration:
import datawrangler as dw
import numpy as np
# Test basic functionality
print("Testing basic array wrangling...")
result = dw.wrangle(np.random.randn(5, 3))
print(f"Array result shape: {result.shape}")
# Test text processing
print("\nTesting text processing...")
texts = ["Hello world", "Data science is great"]
text_result = dw.wrangle(texts, text_kwargs={'model': 'all-MiniLM-L6-v2'})
print(f"Text result shape: {text_result.shape}")
# Test decorator functionality
print("\nTesting @funnel decorator...")
@dw.funnel
def compute_mean(data):
    # Average the per-column means into a single number
    return data.mean().mean()
mean_result = compute_mean([1, 2, 3, 4, 5])
print(f"Mean result: {mean_result}")
print("\nMigration validation complete!")
Getting Help
If you encounter issues during migration:
Check the documentation: Updated examples in tutorials
Review error messages: Often contain specific guidance
Test with simple examples: Isolate the problem
Compare v0.2 vs v0.3 behavior: Use the examples above
For additional support:
GitHub Issues: https://github.com/ContextLab/data-wrangler/issues
Documentation: https://data-wrangler.readthedocs.io/
Examples: See the tutorials for v0.3.0 patterns
Benefits of v0.3.0
While migration requires some work, v0.3.0 provides significant benefits:
✅ Better Performance: Modern dependencies and optimizations
✅ Future-Proof: Compatible with latest Python ecosystem
✅ Improved Models: Access to state-of-the-art sentence-transformers
✅ Cleaner Dependencies: Removed conflicts and deprecated packages
✅ Better Maintenance: Built on actively maintained libraries
✅ Enhanced Documentation: Comprehensive tutorials and examples
The migration effort pays off with a more robust, performant, and maintainable codebase.