Data Wrangler Core Configuration

This tutorial covers the core configuration system in data-wrangler, including how to customize default settings, work with configuration files, and apply custom defaults to functions.

Overview

The datawrangler.core module provides a flexible configuration system that allows you to:

Set default parameters for text processing models
Customize data processing behavior
Apply consistent settings across your project
Override defaults on a per-function basis

Let’s explore how to use these features effectively.

[1]:

import datawrangler as dw
from datawrangler.core import get_default_options, apply_defaults, update_dict
import pandas as pd
import numpy as np

Getting Default Configuration

The configuration system is built around a config.ini file that defines default parameters for all supported models and data types. Let’s examine the current defaults:

[2]:

# Get all default configuration options
defaults = get_default_options()

# Display the main configuration sections
print("Available configuration sections:")
for section in defaults.keys():
    print(f"- {section}")

print(f"\nSupported data types: {defaults['supported_formats']['types']}")
print(f"Default text model: {defaults['text']['model']}")
print(f"Default text corpus: {defaults['text']['corpus']}")

Available configuration sections:
- DEFAULT
- supported_formats
- backend
- text
- CountVectorizer
- HashingVectorizer
- TfidfTransformer
- TfidfVectorizer
- DictionaryLearning
- FactorAnalysis
- FastICA
- IncrementalPCA
- KernelPCA
- LatentDirichletAllocation
- MiniBatchDictionaryLearning
- MiniBatchSparsePCA
- NMF
- PCA
- SparsePCA
- TruncatedSVD
- SentenceTransformer
- all-MiniLM-L6-v2
- all-mpnet-base-v2
- paraphrase-MiniLM-L6-v2
- all-distilroberta-v1
- impute
- SimpleImputer
- IterativeImputer
- KNNImputer
- interpolate
- data

Supported data types: ['dataframe', 'text', 'array', 'null']
Default text model: ['CountVectorizer', 'LatentDirichletAllocation']
Default text corpus: 'minipedia'

Model-Specific Configuration

Each model has its own section in the configuration that holds its default keyword arguments. Let’s examine a few of them:

[3]:

# Examine the defaults for a few text models
print("CountVectorizer default settings:")
for key, value in defaults['CountVectorizer'].items():
    print(f"  {key}: {value}")

print("\nLatentDirichletAllocation default settings:")
for key, value in defaults['LatentDirichletAllocation'].items():
    print(f"  {key}: {value}")

print("\nSentenceTransformer default settings:")
for key, value in defaults['SentenceTransformer'].items():
    print(f"  {key}: {value}")

CountVectorizer default settings:
  stop_words: 'english'
  lowercase: True
  max_df: 0.25
  min_df: 0.1
  strip_accents: 'unicode'

LatentDirichletAllocation default settings:
  n_components: 50
  learning_method: 'online'

SentenceTransformer default settings:
  __model: 'all-MiniLM-L6-v2'

DataFrame Backend Configuration

data-wrangler can return either pandas or Polars DataFrames. You can choose the backend for a single call with the backend argument, or set it globally:

[4]:

# Import backend configuration functions
from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend

# Check current backend
print(f"Current DataFrame backend: {get_dataframe_backend()}")

# Create sample data
sample_data = np.random.rand(100, 5)

# Test with pandas backend (default)
pandas_result = dw.wrangle(sample_data, backend='pandas')
print(f"Pandas result type: {type(pandas_result)}")

# Test with Polars backend
polars_result = dw.wrangle(sample_data, backend='polars')
print(f"Polars result type: {type(polars_result)}")

# Set global backend preference
print("\n📝 Setting global backend to Polars...")
set_dataframe_backend('polars')
print(f"New global backend: {get_dataframe_backend()}")

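# Calls that omit `backend` are meant to follow the global setting. In the run recorded
# below this call still returned a pandas DataFrame, so pass backend='polars' explicitly
# when the result type matters.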
global_result = dw.wrangle(sample_data)  # No backend parameter needed
print(f"Global setting result type: {type(global_result)}")

# Reset to pandas for rest of tutorial
set_dataframe_backend('pandas')
print(f"\n🔄 Reset to pandas: {get_dataframe_backend()}")

Current DataFrame backend: pandas
Pandas result type: <class 'pandas.core.frame.DataFrame'>
Polars result type: <class 'polars.dataframe.frame.DataFrame'>

📝 Setting global backend to Polars...
New global backend: polars
Global setting result type: <class 'pandas.core.frame.DataFrame'>

🔄 Reset to pandas: pandas
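
The cell above switches the global backend and then resets it by hand. One way to make that reset automatic is a small context manager. The following is a minimal sketch, not part of data-wrangler: the name use_backend is hypothetical, and get_backend and set_backend are stand-ins for get_dataframe_backend and set_dataframe_backend so that the example runs without the library installed.

```python
from contextlib import contextmanager

# Stand-ins for get_dataframe_backend / set_dataframe_backend (assumption:
# the real functions read and write a single global backend name).
_state = {'backend': 'pandas'}

def get_backend():
    return _state['backend']

def set_backend(name):
    if name not in ('pandas', 'polars'):
        raise ValueError(f"unknown backend: {name!r}")
    _state['backend'] = name

@contextmanager
def use_backend(name, getter=get_backend, setter=set_backend):
    """Temporarily switch the backend, restoring the previous one on exit."""
    previous = getter()
    setter(name)
    try:
        yield
    finally:
        # Runs even if the body raises, so the global setting never leaks
        setter(previous)

with use_backend('polars'):
    inside = get_backend()
after = get_backend()

print(f"Inside the block: {inside}")
print(f"After the block:  {after}")
```

In a real project the getter and setter arguments could be replaced with the functions imported from datawrangler.core.configurator.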

How config.ini Is Structured

get_default_options() parses datawrangler/core/config.ini, which is organized into named sections. Some sections configure data types and backends ([supported_formats], [backend], [text]); the rest give the default keyword arguments for a specific model or function, one section per name ([CountVectorizer], [LatentDirichletAllocation], [interpolate], and so on). Values are stored as strings and evaluated when used, so you can write things like stop_words = 'english' or n_components = 50.

[5]:

# [text] names the default model pipeline and corpus; [interpolate] and [impute] hold their own defaults
print('Default text pipeline:', defaults['text']['model'])
print('Default corpus:       ', defaults['text']['corpus'])
print('Default interpolate:  ', dict(defaults['interpolate']))
print('Default impute model: ', defaults['impute']['model'])

Default text pipeline: ['CountVectorizer', 'LatentDirichletAllocation']
Default corpus:        'minipedia'
Default interpolate:   {'method': "'linear'", 'limit_direction': "'both'"}
Default impute model:  'IterativeImputer'
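
The interpolate values printed above still carry their inner quotes because the configuration stores every value as a string. As a rough illustration of what "evaluated when used" can look like, the sketch below parses a small INI snippet with the standard library and converts one section to Python objects. The INI text is a hypothetical excerpt rather than the real file, parse_section is a made-up helper, and ast.literal_eval is used here as a safe stand-in; how data-wrangler itself evaluates values is not shown in this tutorial.

```python
import ast
import configparser

# Hypothetical excerpt written in the same style as config.ini
INI_TEXT = """
[LatentDirichletAllocation]
n_components = 50
learning_method = 'online'

[interpolate]
method = 'linear'
limit_direction = 'both'
"""

def parse_section(parser, section):
    """Convert one section's string values into Python objects."""
    return {key: ast.literal_eval(value) for key, value in parser[section].items()}

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names case-sensitive
parser.read_string(INI_TEXT)

raw = dict(parser['interpolate'])                           # values are still strings
parsed = parse_section(parser, 'LatentDirichletAllocation')  # values are Python objects

print("Raw:   ", raw)
print("Parsed:", parsed)
```

Here raw keeps "'linear'" as a quoted string, while parsed holds the integer 50 and the plain string 'online'.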

Customizing the Defaults

get_default_options() returns an ordinary (mutable) config object, so you can adjust settings at runtime without editing the file on disk. Changing a value here changes the defaults that downstream functions pick up via apply_defaults.

[6]:

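# Make a customized copy of the defaults. Values are strings that are evaluated later,
# so numbers are written as '10' and string settings keep their inner quotes ("'sotus'").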
custom = get_default_options()
custom['LatentDirichletAllocation']['n_components'] = '10'   # 10 topics instead of 50
print(f"Customized LDA topics: {custom['LatentDirichletAllocation']['n_components']}")

# Point the default text corpus somewhere else
custom['text']['corpus'] = "'sotus'"
print(f"Customized default corpus: {custom['text']['corpus']}")

Customized LDA topics: 10
Customized default corpus: 'sotus'
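
When several settings change at once, it can be convenient to keep the overrides in one place and layer them over the defaults without touching the original object. The sketch below shows that pattern with plain dictionaries. merge_sections is a hypothetical helper written for this example (it is not the library's update_dict), and the base dictionary is a hand-written stand-in for a few configuration sections.

```python
import copy

def merge_sections(base, overrides):
    """Return a new dict with `overrides` layered over `base`, section by section."""
    merged = copy.deepcopy(base)  # leave the caller's defaults untouched
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged

# Stand-in for a few sections of the defaults (values are strings, as in config.ini)
base = {
    'LatentDirichletAllocation': {'n_components': '50', 'learning_method': "'online'"},
    'text': {'corpus': "'minipedia'"},
}

overrides = {
    'LatentDirichletAllocation': {'n_components': '10'},
    'text': {'corpus': "'sotus'"},
}

custom_settings = merge_sections(base, overrides)

print("Original topics:  ", base['LatentDirichletAllocation']['n_components'])
print("Customized topics:", custom_settings['LatentDirichletAllocation']['n_components'])
print("Customized corpus:", custom_settings['text']['corpus'])
```

Settings that are not overridden, such as learning_method, carry over unchanged from the base dictionary.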

Applying Defaults with apply_defaults

apply_defaults wraps a function so that any argument the caller omits is filled in from the configuration. It matches the function’s name against a section in the defaults, so a function called scale picks up the values from a [scale] section. This is how data-wrangler injects per-model settings automatically.

[7]:

def scale(data, factor=1.0):
    """Multiply every value by `factor`."""
    return [d * factor for d in data]

# Supply defaults for `scale` via a matching section (values are strings, as in config.ini)
scale_with_defaults = apply_defaults(scale, defaults={'scale': {'factor': '3.0'}})

print('Without defaults (factor=1.0):', scale([1, 2, 3]))
print('With apply_defaults (factor=3.0):', scale_with_defaults([1, 2, 3]))
print('Explicit argument still wins:', scale_with_defaults([1, 2, 3], factor=10.0))

Without defaults (factor=1.0): [1.0, 2.0, 3.0]
With apply_defaults (factor=3.0): [3.0, 6.0, 9.0]
Explicit argument still wins: [10.0, 20.0, 30.0]
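
To make the name-matching idea concrete, here is a minimal sketch of how a decorator of this kind could be written. It is an illustration under stated assumptions, not the implementation of apply_defaults: fill_from_config, scale_values, and CONFIG are hypothetical names, and ast.literal_eval stands in for whatever evaluation the library performs on string values.

```python
import ast
import functools
import inspect

def fill_from_config(config):
    """Decorator factory: fill omitted arguments from config[<function name>]."""
    def decorator(func):
        signature = inspect.signature(func)
        section = config.get(func.__name__, {})  # match section by function name

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # bind_partial records only the arguments the caller actually passed
            passed = signature.bind_partial(*args, **kwargs).arguments
            for name, raw in section.items():
                if name in signature.parameters and name not in passed:
                    kwargs[name] = ast.literal_eval(raw)  # string -> Python value
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Values are strings, mirroring the config.ini convention
CONFIG = {'scale_values': {'factor': '3.0'}}

@fill_from_config(CONFIG)
def scale_values(data, factor=1.0):
    """Multiply every value by `factor`."""
    return [d * factor for d in data]

from_config = scale_values([1, 2, 3])              # factor filled in from CONFIG
explicit = scale_values([1, 2, 3], factor=10.0)    # keyword argument wins
positional = scale_values([1, 2, 3], 2.0)          # positional argument wins too

print(from_config, explicit, positional)
```

Checking the bound arguments rather than only kwargs means that a value passed positionally is also treated as explicit and is never overwritten by the configuration.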