Data wrangling basics: supported filetypes

[1]:
import datawrangler as dw
import numpy as np
import os
import pandas as pd
from matplotlib import pyplot as plt
from tutorial_helpers import data_file, image_file, text_file

Let’s load in some sample data:

[2]:
dataframe = dw.io.load(data_file, index_col=0)
array = dataframe.values
image = dw.io.load(image_file)
text = dw.io.load(text_file)

Sample DataFrame:

[3]:
dataframe
[3]:
FirstDim SecondDim ThirdDim FourthDim FifthDim
ByTwos
0 1 2 3 4 5
2 2 4 6 8 10
4 3 6 9 12 15
5 4 8 12 16 20
6 5 10 15 20 25
8 6 12 18 24 30
10 7 14 21 28 35

Sample array:

[4]:
array
[4]:
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25],
       [ 6, 12, 18, 24, 30],
       [ 7, 14, 21, 28, 35]])

Sample image:

[5]:
plt.imshow(image)
[5]:
<matplotlib.image.AxesImage at 0x7fb72ab06cd0>
../_images/tutorials_wrangling_basics_9_1.png

Sample text:

[6]:
print(text)
O give me a home where the buffaloes roam
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
Where the air is so pure and the zephyrs so free
And the breezes so balmy and light
That I would not exchange my home on the range
For all of the cities so bright
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
How often at night when the heavens are bright
With the light of the glittering stars
I stand there amazed and I ask as I gaze
Does their glory exceed that of ours?
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day

High-Performance DataFrames with Polars

data-wrangler also supports Polars, a fast DataFrame library that can be 2-100x faster than pandas for many operations, depending on the workload. You can choose your DataFrame backend on a per-operation basis or globally.

Installation

Polars is included as a core dependency of data-wrangler, so no additional installation is required.

Basic Polars Usage

You can specify the backend for any wrangling operation by passing the backend keyword argument (for example, backend='polars'). Worked Polars examples appear in the final sections of this tutorial.

Wrangling DataFrames

Wrangling a well-formed DataFrame just returns itself:

[7]:
wrangled_df = dw.wrangle(dataframe)
assert np.allclose(dataframe, wrangled_df)
wrangled_df
[7]:
FirstDim SecondDim ThirdDim FourthDim FifthDim
ByTwos
0 1 2 3 4 5
2 2 4 6 8 10
4 3 6 9 12 15
5 4 8 12 16 20
6 5 10 15 20 25
8 6 12 18 24 30
10 7 14 21 28 35

Wrangling Arrays

Wrangling an Array turns it into a DataFrame. If the Array is 2D, the resulting DataFrame will have the same shape:

[8]:
wrangled_array = dw.wrangle(array)
assert np.allclose(dataframe, wrangled_array)
wrangled_array
[8]:
0 1 2 3 4
0 1 2 3 4 5
1 2 4 6 8 10
2 3 6 9 12 15
3 4 8 12 16 20
4 5 10 15 20 25
5 6 12 18 24 30
6 7 14 21 28 35

Note that we’ve recovered the original DataFrame, but the index and column labels have been reset. We can provide these labels to the wrangle function. The array_kwargs keyword argument specifies how array (or array-like) data objects should be turned into DataFrames:

[9]:
array_kwargs = {'index': dataframe.index, 'columns': dataframe.columns}
wrangled_array2 = dw.wrangle(array, array_kwargs=array_kwargs)
wrangled_array2
[9]:
FirstDim SecondDim ThirdDim FourthDim FifthDim
ByTwos
0 1 2 3 4 5
2 2 4 6 8 10
4 3 6 9 12 15
5 4 8 12 16 20
6 5 10 15 20 25
8 6 12 18 24 30
10 7 14 21 28 35
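
For intuition, the labeled result above has the same values and labels you would get by passing the labels straight to pandas. The sketch below assumes (this is an assumption, not something the tutorial states) that array_kwargs are handed on to the pandas DataFrame constructor. It rebuilds the sample array by hand rather than loading data_file:

```python
import numpy as np
import pandas as pd

# Rebuild the sample array by hand: row r is r * [1, 2, 3, 4, 5]
array = np.outer(np.arange(1, 8), np.arange(1, 6))

# Labels copied from the sample DataFrame shown earlier
index = pd.Index([0, 2, 4, 5, 6, 8, 10], name='ByTwos')
columns = ['FirstDim', 'SecondDim', 'ThirdDim', 'FourthDim', 'FifthDim']

unlabeled = pd.DataFrame(array)  # default integer labels
labeled = pd.DataFrame(array, index=index, columns=columns)

# Same values either way; only the labels differ
assert np.allclose(unlabeled, labeled)
print(labeled)
```

Only the row and column labels change between the two DataFrames; the underlying values are identical.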

Wrangling text data using natural language processing models

Next, let’s play with some text data. By default, data-wrangler embeds text using a Latent Dirichlet Allocation model trained on a curated version of Wikipedia, called the “minipedia” corpus. First we’ll split the text into its component lines, and then we’ll wrangle the result:

[10]:
lines = text.split('\n')  # creates a list of strings (one string per line)
wrangled_text = dw.wrangle(lines)
wrangled_text
loading corpus: minipedia...done!
[10]:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
1 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
2 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
3 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
4 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
5 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
6 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
7 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
8 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
9 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
10 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
11 0.01 0.01 0.01 0.51 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
12 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
13 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
14 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
15 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
16 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
17 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
18 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
19 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
20 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
21 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
22 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
23 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02

24 rows × 50 columns

In the resulting DataFrame, each row corresponds to a line of text, and each column corresponds to an embedding dimension. To get a better feel for what these dimensions mean, we can use the return_model flag to get back the fitted model, and then we can examine the top-weighted words from each topic:

[11]:
wrangled_text2, text_model = dw.wrangle(lines, text_kwargs={'return_model': True})
[12]:
# display top words from the model
def get_top_words(model, n_words=10):
    vectorizer = model[0]['model']
    embedder = model[1]['model']

    # map column indices back to words
    vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
    top_words = []
    for k in range(embedder.components_.shape[0]):
        # sort each topic's word weights in descending order and keep the top n_words
        top_words.append([vocab[i] for i in np.argsort(embedder.components_[k, :])[::-1][:n_words]])
    return top_words

def display_top_words(model, n_words=10):
    for k, w in enumerate(get_top_words(model, n_words=n_words)):
        print(f'Topic {k}: {", ".join(w)}')

print(f'Top words from each of the {wrangled_text2.shape[1]} discovered topics:\n')
display_top_words(text_model)
Top words from each of the 50 discovered topics:

Topic 0: angle, points, normal, units, equal, measure, distribution, distance, unit, constant
Topic 1: foot, feet, double, speed, running, course, action, round, distance, motion
Topic 2: game, play, team, played, rules, sports, field, events, competition, women
Topic 3: mm, metal, plastic, glass, steel, diameter, sizes, machine, strength, cm
Topic 4: 2008, million, 2007, 2009, 2011, 2010, march, december, 2006, january
Topic 5: color, green, blue, yellow, colors, brown, tree, dark, plant, plants
Topic 6: music, sound, film, rock, played, television, play, record, records, classical
Topic 7: hours, night, hour, days, minutes, 24, week, daily, sun, working
Topic 8: theory, self, behavior, science, cultural, concept, studies, individuals, model, relationship
Topic 9: earth, sun, god, million, appears, visible, bodies, believed, billion, ago
Topic 10: computer, data, management, project, key, electronic, access, online, devices, technology
Topic 11: church, god, christian, religious, roman, religion, tradition, eastern, traditions, st
Topic 12: military, forces, force, ii, arms, russian, royal, service, units, operations
Topic 13: objects, mass, object, field, core, space, matter, fields, visible, energy
Topic 14: building, built, buildings, construction, house, floor, room, space, houses, walls
Topic 15: class, classes, anti, active, business, fall, military, 1950s, divided, feature
Topic 16: cross, symbol, sign, al, et, shaped, version, appears, represents, link
Topic 17: wind, ice, scale, energy, temperature, weather, speed, pressure, flow, sea
Topic 18: gas, energy, temperature, heat, chemical, carbon, liquid, compounds, acid, reaction
Topic 19: gain, southern, species, models, california, northern, frequency, active, components, spring
Topic 20: pressure, flow, fluid, test, volume, supply, liquid, inner, internal, outer
Topic 21: gold, iron, silver, metal, steel, value, carbon, bc, pure, ii
Topic 22: death, stage, dead, die, remains, man, performed, stages, carried, bodies
Topic 23: species, humans, male, female, million, ago, live, population, living, animals
Topic 24: language, languages, books, writing, written, words, text, published, formal, literature
Topic 25: health, medical, care, treatment, poor, population, million, studies, report, risk
Topic 26: animals, animal, skin, humans, eye, eyes, wild, kept, domestic, raised
Topic 27: cells, cell, growth, plants, structures, layer, plant, acid, functions, biological
Topic 28: women, sexual, children, men, female, male, child, woman, man, mother
Topic 29: oil, fruit, varieties, hot, served, consumption, grown, fresh, content, sold
Topic 30: service, court, legal, services, civil, department, federal, government, laws, issued
Topic 31: blood, disease, heart, risk, diseases, treatment, health, causes, medical, loss
Topic 32: vehicles, electric, speed, built, drive, safety, equipment, transport, electrical, technology
Topic 33: art, style, london, works, 18th, museum, tradition, saw, famous, william
Topic 34: political, party, government, rights, legal, organization, exchange, status, organizations, economic
Topic 35: fish, sea, ft, river, deep, land, fresh, 200, island, 500
Topic 36: city, road, street, cities, town, river, urban, population, island, travel
Topic 37: god, fruit, trees, disease, risk, treatment, head, million, cultural, medical
Topic 38: soil, land, plant, bc, plants, rock, regions, region, india, stone
Topic 39: story, damage, article, loss, ring, exposure, protection, journal, published, severe
Topic 40: section, big, principal, differences, ii, fully, model, wind, 32, split
Topic 41: base, lines, wall, pieces, piece, opening, figure, upper, vertical, branch
Topic 42: property, elements, numbers, element, table, properties, real, value, classical, theory
Topic 43: worn, paper, wear, clothing, fashion, women, men, style, styles, cover
Topic 44: chinese, king, india, east, african, africa, japanese, asia, indian, spanish
Topic 45: species, trees, plants, leaves, winter, wild, tree, season, summer, northern
Topic 46: foot, et, al, feet, science, political, height, 18th, court, religious
Topic 47: market, price, product, products, goods, store, chain, supply, industry, company
Topic 48: school, education, training, schools, degree, college, children, professional, programs, degrees
Topic 49: wood, head, image, cut, wooden, tools, edge, tool, images, face
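
The ranking step inside get_top_words can be seen in isolation on a toy example. The weights and vocabulary below are made up for illustration; only np.argsort and the slicing pattern come from the cell above:

```python
import numpy as np

# Hypothetical topic-by-word weights (2 topics, 5 words) and vocabulary
components = np.array([[0.1, 0.7, 0.05, 0.9, 0.2],
                       [0.8, 0.1, 0.6, 0.05, 0.3]])
vocab = ['deer', 'home', 'sky', 'range', 'cloud']

n_words = 3
top_words = []
for weights in components:
    # argsort is ascending, so reverse it and keep the first n_words indices
    order = np.argsort(weights)[::-1][:n_words]
    top_words.append([vocab[i] for i in order])

for k, words in enumerate(top_words):
    print(f'Topic {k}: {", ".join(words)}')
```

Each topic's words are listed from highest to lowest weight, which is the ordering used in the topic listing above.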

Then we can ask which topics had the most weight in each line:

[13]:
i = 1
line_embedding = wrangled_text2.loc[i].values
line_top_topic = np.argmax(line_embedding)  # index of the largest topic weight

print(f'Line {i} put the most weight on topic {line_top_topic}: {lines[i]}')
Line 1 put the most weight on topic 2: Where the deer and the antelope play
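
The same lookup can be done for every line at once. The sketch below uses a small made-up embedding table in place of wrangled_text2 (an assumption made to keep the example self-contained) and relies on the pandas idxmax method:

```python
import pandas as pd

# Hypothetical embeddings: 3 lines of text by 4 topics
embeddings = pd.DataFrame([[0.1, 0.6, 0.2, 0.1],
                           [0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.2, 0.1, 0.5]])

# idxmax(axis=1) returns, for each row, the column label of the largest value
top_topics = embeddings.idxmax(axis=1)
print(top_topics.tolist())
```

With default integer column labels, the returned labels double as topic numbers.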

Note that each time the model is re-trained, the topic weights will change. If all text data are wrangled in a single pass, data-wrangler will automatically apply the same model to all text data. However, if the data are wrangled in multiple calls to dw.wrangle, the model fit during the first pass should be re-used in subsequent analyses:

[14]:
def match(a, b, label):
    # 'label' names what is being compared (avoids shadowing the built-in type)
    if np.allclose(a, b):
        print(f'{label.capitalize()}s match!')
    else:
        print(f'{label.capitalize()}s do NOT match!')

match(wrangled_text, wrangled_text2, 'topic')
Topics do NOT match!

We can re-apply the already-fitted model to “new” text:

[15]:
wrangled_text3 = dw.wrangle(lines, text_kwargs={'model': text_model})
match(wrangled_text2, wrangled_text3, 'topic')
Topics match!
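
Why re-use matters can be illustrated with a toy stand-in for a text model. The ToyEmbedder class below is hypothetical and is not how data-wrangler or Latent Dirichlet Allocation works internally. It is simply a random projection whose weights are drawn when the model is fit, which is enough to show that two separately fitted models disagree while one fitted model applied twice agrees with itself:

```python
import numpy as np

class ToyEmbedder:
    """Hypothetical stand-in for a text model with random initialization."""

    def __init__(self, n_dims=3, seed=None):
        self.n_dims = n_dims
        self.seed = seed
        self.weights_ = None

    def fit(self, X):
        # Weights are drawn at fit time, so every new fit gives a new model
        rng = np.random.default_rng(self.seed)
        self.weights_ = rng.normal(size=(X.shape[1], self.n_dims))
        return self

    def transform(self, X):
        return X @ self.weights_

# Made-up word-count features: 4 "lines" by 6 "words"
X = np.arange(24, dtype=float).reshape(4, 6)

model_a = ToyEmbedder(seed=0).fit(X)
model_b = ToyEmbedder(seed=1).fit(X)

refit_matches = np.allclose(model_a.transform(X), model_b.transform(X))
reuse_matches = np.allclose(model_a.transform(X), model_a.transform(X))
print(f'Separately fitted models match: {refit_matches}')
print(f'Re-used model matches: {reuse_matches}')
```

The practical upshot is the same as in the cells above: keep the fitted model object around and pass it back in when embedding new text.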

Modern Text Processing with Sentence-Transformers

In addition to scikit-learn text embedding models, data-wrangler provides comprehensive support for state-of-the-art sentence-transformers models via HuggingFace.

Installation Requirements

Sentence-transformers support requires additional ML libraries. To keep data-wrangler lightweight, these are optional dependencies:

pip install --upgrade "pydata-wrangler[hf]"

This installs sentence-transformers, transformers, and related HuggingFace libraries.

Basic Usage

Here’s how to use sentence-transformers with your text data:

[ ]:
# Fast model - good for quick prototyping
# Using simplified API - just pass the model name as a string!
fast_embeddings = dw.wrangle(lines, text_kwargs={'model': 'all-MiniLM-L6-v2'})

print(f"Fast model embeddings shape: {fast_embeddings.shape}")
print("Fast model (first few dimensions):")
fast_embeddings.iloc[:3, :5]
[ ]:
# High-quality model - better for production applications
quality_embeddings = dw.wrangle(lines, text_kwargs={'model': 'all-mpnet-base-v2'})

print(f"Quality model embeddings shape: {quality_embeddings.shape}")
print("Quality model (first few dimensions):")
quality_embeddings.iloc[:3, :5]

Model Comparison

Notice the difference in embedding dimensions:

  • Fast model (all-MiniLM-L6-v2): 384 dimensions - good for speed and memory efficiency

  • Quality model (all-mpnet-base-v2): 768 dimensions - better semantic understanding

Practical Applications

Different models work better for different tasks:

  1. Similarity Search: Use all-MiniLM-L6-v2 for fast similarity search

  2. Semantic Analysis: Use all-mpnet-base-v2 for deeper semantic understanding

  3. Paraphrase Detection: Use paraphrase-MiniLM-L6-v2 for finding similar content

Let’s see how these embeddings can be used for similarity analysis:

[ ]:
# Example: Find similar lines using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the pairwise similarity matrix (one row and column per line)
similarity_matrix = cosine_similarity(fast_embeddings)

# Rank all lines by similarity to line 0, most similar first
line_0_similarities = similarity_matrix[0]
ranked_indices = np.argsort(line_0_similarities)[::-1]

# Keep the top 3 matches, skipping line 0 itself
most_similar_indices = [idx for idx in ranked_indices if idx != 0][:3]

print("Original line 0:", lines[0])
print("\nMost similar lines:")
for idx in most_similar_indices:
    print(f"Line {idx} (similarity: {line_0_similarities[idx]:.3f}): {lines[idx]}")
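
For reference, cosine similarity itself is just the dot product of length-normalized vectors. The following NumPy sketch uses three made-up vectors rather than real sentence embeddings, and should agree with what scikit-learn's cosine_similarity computes for nonzero rows:

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (rows must be nonzero)."""
    X = np.asarray(X, dtype=float)
    unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

# Made-up 2D "embeddings": rows 0 and 1 point the same way, row 2 is orthogonal
vectors = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 3.0]])
sim = cosine_similarity_matrix(vectors)
print(sim.round(3))
```

Because the measure ignores vector length, rows 0 and 1 are treated as identical even though one is twice as long as the other.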

Wrangling images

Images (stored in any format supported by matplotlib) are treated as Arrays. Images are wrangled into DataFrames by slicing the image along axis 2 (i.e., the color dimension), horizontally concatenating the slices, and then turning the result into a DataFrame. In general, this approach is taken for all high-dimensional (> 2D) Arrays:

[17]:
wrangled_image = dw.wrangle(image)
plt.imshow(wrangled_image);
../_images/tutorials_wrangling_basics_36_0.png
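
The slicing-and-concatenating step described above can be sketched with NumPy and pandas. This mirrors the description in the text and is not taken from data-wrangler's source; the tiny random "image" is a stand-in for a real one:

```python
import numpy as np
import pandas as pd

# Stand-in for an RGB image: height 4, width 6, 3 color channels
rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))

# Slice along axis 2 (color), then place the slices side by side
slices = [image[:, :, c] for c in range(image.shape[2])]
flat = pd.DataFrame(np.hstack(slices))

print(flat.shape)  # 4 rows, 6 * 3 = 18 columns
```

Under this layout, an image of height H, width W, and C channels becomes a DataFrame with H rows and W * C columns, with each channel occupying a contiguous block of W columns.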

Objects, file paths, and URLs

Data supplied to data-wrangler may be passed in directly as a Python object that is already loaded into memory (as in the above examples). However, data may also be supplied as a (string) file path or URL. For example, wrangling the already loaded-in image versus wrangling the image’s file path will yield the same result:

[18]:
wrangled_image_from_path = dw.wrangle(image_file)
match(wrangled_image, wrangled_image_from_path, 'image')

Images match!

Handling multiple data types

Multiple objects, file paths, or URLs may be wrangled in a single function call. If desired, type-specific wrangling preferences may be provided. Specifying return_dtype=True also returns a list of the automatically detected data types for each object:

[37]:
# Using simplified API - just pass the model name as a string!
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

i = 10
first_lines = lines[:i]
last_lines = lines[i:]
[20]:
wrangled_data, dtypes = dw.wrangle([dataframe, array, image_file, first_lines, last_lines],
                                   text_kwargs=text_kwargs,
                                   return_dtype=True)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

We can check the inferred datatypes:

[21]:
dtypes
[21]:
['dataframe', 'array', 'array', 'text', 'text']

We can also verify that when the data are wrangled simultaneously, we get the same results as when each object is wrangled separately. For example, here’s the newly wrangled image:

[22]:
# visualize the wrangled image
plt.imshow(wrangled_data[2])
[22]:
<matplotlib.image.AxesImage at 0x7fb72b2ea890>
../_images/tutorials_wrangling_basics_45_1.png

And here’s how the text embeddings compare to our previous results:

[44]:
# compare the first lines' embeddings (fast_embeddings was computed above with the same model):
match(fast_embeddings.iloc[:i], wrangled_data[3], "first lines' sentence-transformers embedding")

# compare the last lines' embeddings
match(fast_embeddings.iloc[i:], wrangled_data[4], "last lines' sentence-transformers embedding")

Check out the other tutorials for more advanced data wrangling functions! The rest of this tutorial demonstrates the Polars DataFrame backend introduced earlier.

[ ]:
# Wrangle data using Polars backend for high performance
import polars as pl

# Convert array to Polars DataFrame
polars_df = dw.wrangle(array, backend='polars')
print(f"Polars DataFrame type: {type(polars_df)}")
print(f"Shape: {polars_df.shape}")
polars_df.head()

You can convert between pandas and Polars DataFrames seamlessly:

[ ]:
# Convert pandas DataFrame to Polars
pandas_df = wrangled_df  # Our original pandas DataFrame
polars_from_pandas = dw.wrangle(pandas_df, backend='polars')

print(f"Original: {type(pandas_df)}")
print(f"Converted to Polars: {type(polars_from_pandas)}")

# Convert Polars DataFrame back to pandas
pandas_from_polars = dw.wrangle(polars_from_pandas, backend='pandas')
print(f"Converted back to pandas: {type(pandas_from_polars)}")

# Verify data is preserved
import numpy as np
print(f"Data preserved: {np.allclose(pandas_df.values, pandas_from_polars.values)}")

You can set a global preference for DataFrame backend:

[ ]:
# Configure global backend preference
from datawrangler.core.configurator import set_dataframe_backend, get_dataframe_backend

# Check current default
print(f"Current default backend: {get_dataframe_backend()}")

# Set global preference to Polars
set_dataframe_backend('polars')
print(f"New default backend: {get_dataframe_backend()}")

# Now all operations use Polars by default
global_polars_df = dw.wrangle(array)  # No need to specify backend='polars'
print(f"Global setting result: {type(global_polars_df)}")

# Reset to pandas for rest of tutorial
set_dataframe_backend('pandas')
print(f"Reset to: {get_dataframe_backend()}")
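
The interplay between a global default and a per-call override can be pictured with a few lines of plain Python. Everything here (the _DEFAULT dictionary, set_default_backend, and resolve_backend) is a hypothetical illustration of the pattern, not data-wrangler's configurator code:

```python
# Hypothetical module-level setting, standing in for a library-wide default
_DEFAULT = {'backend': 'pandas'}
_VALID = ('pandas', 'polars')

def set_default_backend(name):
    """Change the default used when a call does not name a backend."""
    if name not in _VALID:
        raise ValueError(f'unknown backend: {name!r}')
    _DEFAULT['backend'] = name

def resolve_backend(backend=None):
    """An explicit per-call choice wins; otherwise use the global default."""
    return _DEFAULT['backend'] if backend is None else backend

print(resolve_backend())                  # pandas
set_default_backend('polars')
print(resolve_backend())                  # polars
print(resolve_backend(backend='pandas'))  # pandas (per-call override)
set_default_backend('pandas')             # restore the original default
```

Restoring the default at the end, as the tutorial cell also does, keeps later cells from silently inheriting the changed setting.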

Text processing also supports the Polars backend for high-performance embeddings:

[ ]:
# Text embeddings with Polars backend for better performance
sample_texts = ["Machine learning is fascinating",
                "Data science combines statistics and programming",
                "Polars is a fast DataFrame library"]

# Process text with Polars backend
polars_text_df = dw.wrangle(sample_texts, backend='polars')
print(f"Text embeddings with Polars: {type(polars_text_df)}")
print(f"Shape: {polars_text_df.shape}")

# Compare with pandas backend
pandas_text_df = dw.wrangle(sample_texts, backend='pandas')
print(f"Text embeddings with pandas: {type(pandas_text_df)}")
print(f"Shape: {pandas_text_df.shape}")

# Both should have the same shape
print(f"Same embedding dimensions: {polars_text_df.shape == pandas_text_df.shape}")

Polars can offer significant performance advantages (actual gains depend on the workload), especially for:

  • Large datasets: 2-10x faster operations on datasets with millions of rows

  • Aggregations: Group-by operations and statistical computations

  • Memory efficiency: Lower memory usage with columnar data format

  • Parallel processing: Built-in parallelization for multi-core systems

The choice between pandas and Polars depends on your specific needs:

  • Use pandas for: Familiarity, ecosystem compatibility, complex transformations

  • Use Polars for: Performance, large datasets, memory efficiency, parallel processing

data-wrangler automatically preserves your DataFrame type when no backend is specified:

[ ]:
# Demonstrate automatic type preservation
import pandas as pd
import polars as pl

# Create DataFrames of each type
pandas_input = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
polars_input = pl.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Wrangle without specifying backend - type is preserved
pandas_output = dw.wrangle(pandas_input)
polars_output = dw.wrangle(polars_input)

print(f"Pandas input: {type(pandas_input)} -> Output: {type(pandas_output)}")
print(f"Polars input: {type(polars_input)} -> Output: {type(polars_output)}")
print("✅ Types automatically preserved!")

Here’s a practical example showing the performance difference between pandas and Polars for array conversion:

[ ]:
import time
import numpy as np

# Create a moderately sized array for benchmarking
large_array = np.random.rand(10000, 50)
print(f"Array shape: {large_array.shape}")

# Benchmark pandas backend (perf_counter is a monotonic, high-resolution timer)
start_time = time.perf_counter()
pandas_result = dw.wrangle(large_array, backend='pandas')
pandas_time = time.perf_counter() - start_time

# Benchmark Polars backend
start_time = time.perf_counter()
polars_result = dw.wrangle(large_array, backend='polars')
polars_time = time.perf_counter() - start_time

print("\n📊 Performance Comparison:")
print(f"Pandas backend: {pandas_time:.4f} seconds")
print(f"Polars backend: {polars_time:.4f} seconds")
# A ratio above 1 means Polars was faster on this run; below 1 means pandas was faster
print(f"pandas/Polars time ratio: {pandas_time / polars_time:.1f}x")

# Verify results are equivalent
print(f"\n✅ Results equivalent: {np.allclose(pandas_result.values, polars_result.to_numpy())}")
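
A single timed run can be noisy, since it may include one-off costs such as lazy imports and cache warm-up. A somewhat more robust pattern is to repeat each measurement and keep the best run, using the standard-library timeit module. The sketch below times plain pandas.DataFrame construction as a stand-in workload (an assumption made so the example runs without data-wrangler); the same pattern would apply to the dw.wrangle calls above:

```python
import timeit

import numpy as np
import pandas as pd

large_array = np.random.rand(10000, 50)

def best_time(func, repeat=5, number=10):
    """Return the fastest per-call time (in seconds) over several repeats."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number

elapsed = best_time(lambda: pd.DataFrame(large_array))
print(f'Best of 5 repeats: {elapsed:.6f} seconds per call')
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, because slower runs mostly reflect interference from other processes.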

Summary: Choosing the Right Backend

data-wrangler now provides flexible DataFrame backend support:

Feature         pandas                     Polars
Performance     Standard                   2-100x faster
Memory Usage    Higher                     Lower (columnar)
Ecosystem       Mature, extensive          Growing rapidly
Learning Curve  Familiar to most           Similar API
Best For        General use, prototyping   Large data, production

Quick Start Guide

# Basic usage - specify backend per operation
import datawrangler as dw

# Use pandas (default); here `data` stands for any supported object, file path, or URL
df_pandas = dw.wrangle(data)

# Use Polars for performance
df_polars = dw.wrangle(data, backend='polars')

# Set global preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations use Polars

Both backends support all data-wrangler functionality including text processing, array conversion, and complex data wrangling pipelines. Choose the backend that best fits your performance requirements and workflow preferences!