Migration Guide: v0.2 → v0.3

This guide helps you migrate from data-wrangler v0.2.x to v0.3.0, which includes significant modernization and breaking changes.

Overview of Changes

Version 0.3.0 represents a major modernization of data-wrangler with focus on:

  • Modern Python Support: Requires Python 3.9+ (dropped support for 3.6-3.8)

  • NumPy 2.0+ Compatibility: Full support for latest NumPy versions

  • Pandas 2.0+ Compatibility: Updated for modern pandas API

  • Text Processing Overhaul: Migrated from Flair to sentence-transformers

  • Dependency Cleanup: Removed conflicting and deprecated libraries

Breaking Changes

Python Version Requirements

Before (v0.2.x):

# Supported Python 3.6, 3.7, 3.8, 3.9, 3.10
python_requires=">=3.6"

After (v0.3.0):

# Requires Python 3.9+
python_requires=">=3.9"

Migration: Upgrade to Python 3.9 or later before installing v0.3.0.

Text Embedding Models

The most significant change is the migration from Flair to sentence-transformers for text embeddings.

Before (v0.2.x):

# Old Flair syntax
flair_model = {
    'model': 'TransformerDocumentEmbeddings',
    'args': ['bert-base-uncased']
}

embeddings = dw.wrangle(texts, text_kwargs={'model': flair_model})

After (v0.3.0):

# New sentence-transformers syntax
sentence_model = {
    'model': 'all-mpnet-base-v2',
    'args': [],
    'kwargs': {}
}

# Or simplified:
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})

Model Name Mappings

Here are recommended migrations for common Flair models:

Flair → Sentence-Transformers Migration

Old Flair Model

New Sentence-Transformers Model

TransformerDocumentEmbeddings('bert-base-uncased')

'all-MiniLM-L6-v2' (fast) or 'all-mpnet-base-v2' (quality)

TransformerDocumentEmbeddings('roberta-base')

'all-distilroberta-v1'

Custom transformer models

Use model name directly from HuggingFace Hub

Installation Changes

Before (v0.2.x):

pip install data-wrangler

After (v0.3.0):

# Basic installation (sklearn text processing only)
pip install pydata-wrangler

# Full installation with sentence-transformers support
pip install "pydata-wrangler[hf]"

Note: The package name on PyPI is now pydata-wrangler to avoid conflicts.

Removed Dependencies

The following dependencies were removed in v0.3.0:

  • flair - Replaced with sentence-transformers

  • gensim - Caused NumPy version conflicts

  • konoha - Unused Japanese tokenizer

  • pytorch-transformers - Renamed to transformers

  • pytorch-pretrained-bert - Replaced by transformers

If your code relied on these libraries directly, you’ll need to install them separately.

Step-by-Step Migration

1. Check Python Version

Ensure you’re using Python 3.9 or later:

python --version
# Should show 3.9.x or higher

2. Update Installation

Uninstall old version and install new:

pip uninstall data-wrangler
pip install "pydata-wrangler[hf]"

3. Update Text Processing Code

Replace Flair model specifications:

# OLD - Replace this
old_model = {'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased']}

# NEW - With this
new_model = 'all-mpnet-base-v2'  # or {'model': 'all-mpnet-base-v2', 'args': [], 'kwargs': {}}

4. Test Your Code

Run your existing code to identify any remaining issues:

python -m pytest tests/  # If you have tests
python your_script.py     # Test your main scripts

Common Migration Issues

Issue: Import Errors

Problem:

ImportError: No module named 'flair'

Solution: Remove any direct Flair imports and use data-wrangler’s text processing instead:

# Remove this
from flair.embeddings import TransformerDocumentEmbeddings

# Use this instead
import datawrangler as dw
embeddings = dw.wrangle(texts, text_kwargs={'model': 'all-mpnet-base-v2'})

Issue: Model Configuration Errors

Problem:

ValueError: Model 'TransformerDocumentEmbeddings' not found

Solution: Update model specifications to use sentence-transformers model names.

Issue: Performance Differences

Sentence-transformers models may have different performance characteristics than Flair models:

  • Speed: sentence-transformers is generally faster

  • Memory: Model memory usage may differ

  • Accuracy: Results may vary slightly due to different model architectures

Test your specific use case and adjust model choices if needed.

Testing Your Migration

Validation Checklist

After migration, verify:

☐ Python version is 3.9+ ☐ Installation successful: pip show pydata-wrangler ☐ Basic functionality: import datawrangler as dw; dw.wrangle([1,2,3]) ☐ Text processing: dw.wrangle(["test"], text_kwargs={'model': 'all-MiniLM-L6-v2'}) ☐ Your specific use cases still work correctly ☐ Performance is acceptable for your needs ☐ Results are consistent with your expectations

Sample Test Script

Use this script to validate your migration:

import datawrangler as dw
import numpy as np

# Test basic functionality
print("Testing basic array wrangling...")
result = dw.wrangle(np.random.randn(5, 3))
print(f"Array result shape: {result.shape}")

# Test text processing
print("\\nTesting text processing...")
texts = ["Hello world", "Data science is great"]
text_result = dw.wrangle(texts, text_kwargs={'model': 'all-MiniLM-L6-v2'})
print(f"Text result shape: {text_result.shape}")

# Test decorator functionality
print("\\nTesting @funnel decorator...")
@dw.funnel
def compute_mean(data):
    return data.mean().mean()

mean_result = compute_mean([1, 2, 3, 4, 5])
print(f"Mean result: {mean_result}")

print("\\nMigration validation complete!")

Getting Help

If you encounter issues during migration:

  1. Check the documentation: Updated examples in tutorials

  2. Review error messages: Often contain specific guidance

  3. Test with simple examples: Isolate the problem

  4. Compare v0.2 vs v0.3 behavior: Use the examples above

For additional support:

Benefits of v0.3.0

While migration requires some work, v0.3.0 provides significant benefits:

Better Performance: Modern dependencies and optimizations ✅ Future-Proof: Compatible with latest Python ecosystem ✅ Improved Models: Access to state-of-the-art sentence-transformers ✅ Cleaner Dependencies: Removed conflicts and deprecated packages ✅ Better Maintenance: Built on actively maintained libraries ✅ Enhanced Documentation: Comprehensive tutorials and examples

The migration effort pays off with a more robust, performant, and maintainable codebase.