Data Wrangler Core Configuration
This tutorial covers the core configuration system in data-wrangler, including how to customize default settings, work with configuration files, and apply custom defaults to functions.
Overview
The datawrangler.core module provides a flexible configuration system that allows you to:
Set default parameters for text processing models
Customize data processing behavior
Apply consistent settings across your project
Override defaults on a per-function basis
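The last two points can be illustrated with a minimal, library-independent sketch. The `DEFAULTS` table, the `with_defaults` decorator, and the `tokenize` function below are hypothetical names invented for this illustration. They are not part of data-wrangler and are not meant to describe how its own `apply_defaults` works. The sketch only shows the general idea: project-wide defaults fill in missing keyword arguments, and anything passed explicitly at the call site wins.

```python
import functools

# Hypothetical project-wide defaults, keyed by function name (illustrative values only).
DEFAULTS = {"tokenize": {"lowercase": True, "min_length": 2}}


def with_defaults(func):
    """Fill in missing keyword arguments from DEFAULTS; explicit kwargs take priority."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Later keys win in a dict merge, so call-site kwargs override the defaults.
        merged = {**DEFAULTS.get(func.__name__, {}), **kwargs}
        return func(*args, **merged)
    return wrapper


@with_defaults
def tokenize(text, lowercase=False, min_length=1):
    """Split text on whitespace, optionally lowercasing and dropping short words."""
    words = text.lower().split() if lowercase else text.split()
    return [w for w in words if len(w) >= min_length]


project_wide = tokenize("A Tiny Example")                 # uses both defaults
overridden = tokenize("A Tiny Example", lowercase=False)  # overrides one default
```

Here `project_wide` is lowercased and filtered by the project defaults, while `overridden` keeps the original capitalization but still inherits `min_length` from the defaults table.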
Let’s explore how to use these features effectively.
[ ]:
import datawrangler as dw
from datawrangler.core import get_default_options, apply_defaults, update_dict
import pandas as pd
import numpy as np
Getting Default Configuration
The configuration system is built around a config.ini file that defines default parameters for all supported models and data types. Let’s examine the current defaults:
[ ]:
# Get all default configuration options
defaults = get_default_options()
# Display the main configuration sections
print("Available configuration sections:")
for section in defaults.keys():
    print(f"- {section}")
print(f"\nSupported data types: {defaults['supported_formats']['types']}")
print(f"Default text model: {defaults['text']['model']}")
print(f"Default text corpus: {defaults['text']['corpus']}")
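To make the relationship between an INI file and the nested dictionary above more concrete, here is a small sketch using Python's standard `configparser` module. The INI text, the `ExampleModel` section, and the `read_sections` helper are assumptions made up for this illustration. The real config.ini shipped with data-wrangler may use different sections and options, and the library may parse it differently.

```python
import configparser

# Hypothetical INI text; the real config.ini in data-wrangler may look different.
EXAMPLE_INI = """
[text]
model = ExampleModel
corpus = example_corpus

[ExampleModel]
max_features = 1000
lowercase = True
"""


def read_sections(ini_text):
    """Parse INI text into a plain {section: {option: value}} dictionary."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    return {name: dict(parser[name]) for name in parser.sections()}


sections = read_sections(EXAMPLE_INI)
text_model = sections["text"]["model"]  # name of the default model
model_options = sections[text_model]    # that model's own section
```

Note that `configparser` returns every value as a string (for example `"1000"` rather than `1000`), so any code built on this approach has to convert values to numbers or booleans itself.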
Model-Specific Configuration
Each model has its own section in the configuration with optimized default parameters. Let’s examine some key model configurations:
[ ]:
# Examine the default settings for individual models
# (two scikit-learn models and one sentence-transformers model)
print("CountVectorizer default settings:")
for key, value in defaults['CountVectorizer'].items():
    print(f" {key}: {value}")
print("\nLatentDirichletAllocation default settings:")
for key, value in defaults['LatentDirichletAllocation'].items():
    print(f" {key}: {value}")
print("\nSentenceTransformer default settings:")
for key, value in defaults['SentenceTransformer'].items():
    print(f" {key}: {value}")
DataFrame Backend Configuration
Now that data-wrangler supports Polars, you can choose the DataFrame backend either for a single call or globally:
[ ]:
# Import backend configuration functions
from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend
# Check current backend
print(f"Current DataFrame backend: {get_dataframe_backend()}")
# Create sample data
sample_data = np.random.rand(100, 5)
# Test with pandas backend (default)
pandas_result = dw.wrangle(sample_data, backend='pandas')
print(f"Pandas result type: {type(pandas_result)}")
# Test with Polars backend
polars_result = dw.wrangle(sample_data, backend='polars')
print(f"Polars result type: {type(polars_result)}")
# Set global backend preference
print("\n📝 Setting global backend to Polars...")
set_dataframe_backend('polars')
print(f"New global backend: {get_dataframe_backend()}")
# Now operations use Polars by default
global_result = dw.wrangle(sample_data) # No backend parameter needed
print(f"Global setting result type: {type(global_result)}")
# Reset to pandas for rest of tutorial
set_dataframe_backend('pandas')
print(f"\n🔄 Reset to pandas: {get_dataframe_backend()}")
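The set-then-reset pattern at the end of the cell above is easy to get wrong if an error occurs between the two calls. One possible way to make it safer is a context manager that always restores the previous value. The sketch below is self-contained and uses stand-in functions (`get_backend`, `set_backend`) and a hypothetical `temporary_backend` helper; none of these are part of data-wrangler. In a real project you could try the same idea with `get_dataframe_backend` and `set_dataframe_backend` in place of the stand-ins.

```python
from contextlib import contextmanager

# Stand-ins for a global backend setting; these are NOT the data-wrangler functions.
_settings = {"backend": "pandas"}


def get_backend():
    """Return the currently selected backend name."""
    return _settings["backend"]


def set_backend(name):
    """Select a backend, rejecting names this sketch does not know about."""
    if name not in ("pandas", "polars"):
        raise ValueError(f"unsupported backend: {name!r}")
    _settings["backend"] = name


@contextmanager
def temporary_backend(name):
    """Switch the backend inside a with-block and restore the old one afterwards."""
    previous = get_backend()
    set_backend(name)
    try:
        yield
    finally:
        # Runs even if the body raises, so the global setting never leaks.
        set_backend(previous)


with temporary_backend("polars"):
    inside = get_backend()  # backend seen inside the block
after = get_backend()       # backend seen once the block has exited
```

Because the restore step sits in a `finally` clause, the earlier backend comes back even when the code inside the `with` block raises an exception.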