{ "cells": [ { "cell_type": "markdown", "source": "# Data Wrangler Core Configuration\n\nThis tutorial covers the core configuration system in data-wrangler, including how to customize default settings, work with configuration files, and apply custom defaults to functions.\n\n## Overview\n\nThe `datawrangler.core` module provides a flexible configuration system that allows you to:\n\n- Set default parameters for text processing models\n- Customize data processing behavior\n- Apply consistent settings across your project\n- Override defaults on a per-function basis\n\nLet's explore how to use these features effectively.", "metadata": {} }, { "cell_type": "code", "source": "import datawrangler as dw\nfrom datawrangler.core import get_default_options, apply_defaults, update_dict\nimport pandas as pd\nimport numpy as np", "metadata": {}, "outputs": [], "execution_count": null }, { "cell_type": "markdown", "source": "## Getting Default Configuration\n\nThe configuration system is built around a `config.ini` file that defines default parameters for all supported models and data types. Let's examine the current defaults:", "metadata": {} }, { "cell_type": "code", "source": "# Get all default configuration options\ndefaults = get_default_options()\n\n# Display the main configuration sections\nprint(\"Available configuration sections:\")\nfor section in defaults.keys():\n print(f\"- {section}\")\n\nprint(f\"\\nSupported data types: {defaults['supported_formats']['types']}\")\nprint(f\"Default text model: {defaults['text']['model']}\")\nprint(f\"Default text corpus: {defaults['text']['corpus']}\")", "metadata": {}, "outputs": [], "execution_count": null }, { "cell_type": "markdown", "source": "## Model-Specific Configuration\n\nEach model has its own section in the configuration with optimized default parameters. Let's examine some key model configurations:", "metadata": {} }, { "cell_type": "markdown", "source": "## DataFrame Backend Configuration\n\nWith the introduction of Polars support, data-wrangler now supports configuring the DataFrame backend globally:", "metadata": {} }, { "cell_type": "code", "source": "# Examine sklearn model defaults\nprint(\"CountVectorizer default settings:\")\nfor key, value in defaults['CountVectorizer'].items():\n print(f\" {key}: {value}\")\n\nprint(\"\\nLatentDirichletAllocation default settings:\")\nfor key, value in defaults['LatentDirichletAllocation'].items():\n print(f\" {key}: {value}\")\n\nprint(\"\\nSentenceTransformer default settings:\")\nfor key, value in defaults['SentenceTransformer'].items():\n print(f\" {key}: {value}\")", "metadata": {}, "outputs": [], "execution_count": null }, { "cell_type": "code", "source": "# Import backend configuration functions\nfrom datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend\n\n# Check current backend\nprint(f\"Current DataFrame backend: {get_dataframe_backend()}\")\n\n# Create sample data\nsample_data = np.random.rand(100, 5)\n\n# Test with pandas backend (default)\npandas_result = dw.wrangle(sample_data, backend='pandas')\nprint(f\"Pandas result type: {type(pandas_result)}\")\n\n# Test with Polars backend\npolars_result = dw.wrangle(sample_data, backend='polars')\nprint(f\"Polars result type: {type(polars_result)}\")\n\n# Set global backend preference\nprint(\"\\n\ud83d\udcdd Setting global backend to Polars...\")\nset_dataframe_backend('polars')\nprint(f\"New global backend: {get_dataframe_backend()}\")\n\n# Now operations use Polars by default\nglobal_result = dw.wrangle(sample_data) # No backend parameter needed\nprint(f\"Global setting result type: {type(global_result)}\")\n\n# Reset to pandas for rest of tutorial\nset_dataframe_backend('pandas')\nprint(f\"\\n\ud83d\udd04 Reset to pandas: {get_dataframe_backend()}\")", "metadata": {}, "outputs": [], "execution_count": null } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [ "\n", "\n", "\n" ], "metadata": { "collapsed": false } } } }, "nbformat": 4, "nbformat_minor": 0 }