Data Wrangler Core Configuration

This tutorial covers the core configuration system in data-wrangler, including how to customize default settings, work with configuration files, and apply custom defaults to functions.

Overview

The datawrangler.core module provides a flexible configuration system that allows you to:

  • Set default parameters for text processing models

  • Customize data processing behavior

  • Apply consistent settings across your project

  • Override defaults on a per-function basis

Let’s explore how to use these features effectively.

[ ]:
import datawrangler as dw
from datawrangler.core import get_default_options, apply_defaults, update_dict
import pandas as pd
import numpy as np
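
Per-function overrides, the last item in the overview, come down to a precedence rule: a value passed at the call site wins over the configured default. The block below is a minimal sketch of that rule, not data-wrangler's implementation. The `with_defaults` decorator, the `DEFAULTS` table, and the `tokenize` function are all hypothetical names invented for illustration.

```python
import functools

# Hypothetical defaults table keyed by function name (not data-wrangler's real config)
DEFAULTS = {"tokenize": {"lowercase": True, "min_length": 2}}


def with_defaults(func):
    """Fill in keyword arguments from DEFAULTS unless the caller supplies them."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Later keys win in a dict merge, so call-site values override defaults
        merged = {**DEFAULTS.get(func.__name__, {}), **kwargs}
        return func(*args, **merged)
    return wrapper


@with_defaults
def tokenize(text, *, lowercase=False, min_length=1):
    # Options are keyword-only so they can never clash with positional arguments
    words = text.lower().split() if lowercase else text.split()
    return [w for w in words if len(w) >= min_length]


print(tokenize("A Tiny Example"))                   # uses both configured defaults
print(tokenize("A Tiny Example", lowercase=False))  # overrides one default for this call
```

Keeping the merge inside the wrapper means the decorated function's own signature defaults only apply when the table has no entry for it.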

Getting Default Configuration

The configuration system is built around a config.ini file that defines default parameters for all supported models and data types. Let’s examine the current defaults:

[ ]:
# Get all default configuration options
defaults = get_default_options()

# Display the main configuration sections
print("Available configuration sections:")
for section in defaults.keys():
    print(f"- {section}")

print(f"\nSupported data types: {defaults['supported_formats']['types']}")
print(f"Default text model: {defaults['text']['model']}")
print(f"Default text corpus: {defaults['text']['corpus']}")
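
To make the section-and-key lookups above more concrete, here is a minimal sketch of how an INI file can turn into a nested dictionary. It assumes the file is read with Python's standard `configparser`, which this tutorial does not confirm, and the INI values shown are placeholders rather than data-wrangler's real defaults.

```python
import configparser

# Hypothetical INI content: the section and key names mirror those printed
# above, but every value here is a made-up placeholder.
INI_TEXT = """
[supported_formats]
types = ['dataframe', 'text', 'array']

[text]
model = ExampleModel
corpus = example-corpus
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # preserve key case (configparser lowercases keys by default)
parser.read_string(INI_TEXT)

# Build a plain nested dict: {section_name: {key: value}}
options = {section: dict(parser[section]) for section in parser.sections()}

print(options["text"]["model"])
print(type(options["supported_formats"]["types"]))  # raw INI values are strings
```

Note that `configparser` returns every value as a string, so a list-like entry such as `types` would need an extra parsing step before it behaves like a Python list.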

Model-Specific Configuration

Each model has its own section in the configuration with optimized default parameters. Let’s examine some key model configurations:

[ ]:
# Examine default settings for the scikit-learn and sentence-transformers models
print("CountVectorizer default settings:")
for key, value in defaults['CountVectorizer'].items():
    print(f"  {key}: {value}")

print("\nLatentDirichletAllocation default settings:")
for key, value in defaults['LatentDirichletAllocation'].items():
    print(f"  {key}: {value}")

print("\nSentenceTransformer default settings:")
for key, value in defaults['SentenceTransformer'].items():
    print(f"  {key}: {value}")

DataFrame Backend Configuration

With Polars support in place, data-wrangler lets you choose the DataFrame backend either per call or globally:

[ ]:
# Import backend configuration functions
from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend

# Check current backend
print(f"Current DataFrame backend: {get_dataframe_backend()}")

# Create sample data
sample_data = np.random.rand(100, 5)

# Test with pandas backend (default)
pandas_result = dw.wrangle(sample_data, backend='pandas')
print(f"Pandas result type: {type(pandas_result)}")

# Test with Polars backend
polars_result = dw.wrangle(sample_data, backend='polars')
print(f"Polars result type: {type(polars_result)}")

# Set global backend preference
print("\n📝 Setting global backend to Polars...")
set_dataframe_backend('polars')
print(f"New global backend: {get_dataframe_backend()}")

# Now operations use Polars by default
global_result = dw.wrangle(sample_data)  # No backend parameter needed
print(f"Global setting result type: {type(global_result)}")

# Reset to pandas for rest of tutorial
set_dataframe_backend('pandas')
print(f"\n🔄 Reset to pandas: {get_dataframe_backend()}")
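
The set-then-reset pattern in the cell above is easy to get wrong if an error occurs between the two calls. One way to make it safer is a small context manager that always restores the previous value. This is a sketch only: `temporary_backend` is a hypothetical helper, not part of data-wrangler, and it takes the getter and setter as arguments so it runs here with stand-in functions.

```python
from contextlib import contextmanager


@contextmanager
def temporary_backend(name, get_backend, set_backend):
    """Switch to `name` inside the with-block, then restore the previous backend."""
    previous = get_backend()
    set_backend(name)
    try:
        yield
    finally:
        set_backend(previous)  # runs even if the with-block raises


# Stand-in getter/setter so the sketch runs without data-wrangler installed
_state = {"backend": "pandas"}


def fake_get():
    return _state["backend"]


def fake_set(name):
    _state["backend"] = name


with temporary_backend("polars", fake_get, fake_set):
    inside = fake_get()   # backend while the block is active
after = fake_get()        # backend once the block has exited

print(inside, after)
```

With data-wrangler installed, the same helper could presumably be called as `temporary_backend('polars', get_dataframe_backend, set_dataframe_backend)`, using the two functions imported in the cell above.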