Data wrangling basics: supported filetypes

[1]:
import datawrangler as dw
import numpy as np
import os
import pandas as pd
from matplotlib import pyplot as plt
from tutorial_helpers import data_file, image_file, text_file

Let’s load in some sample data:

[2]:
dataframe = dw.io.load(data_file, index_col=0)
array = dataframe.values
image = dw.io.load(image_file)
text = dw.io.load(text_file)

Sample DataFrame:

[3]:
dataframe
[3]:
        FirstDim  SecondDim  ThirdDim  FourthDim  FifthDim
ByTwos
0              1          2         3          4         5
2              2          4         6          8        10
4              3          6         9         12        15
5              4          8        12         16        20
6              5         10        15         20        25
8              6         12        18         24        30
10             7         14        21         28        35

Sample array:

[4]:
array
[4]:
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25],
       [ 6, 12, 18, 24, 30],
       [ 7, 14, 21, 28, 35]])

Sample image:

[5]:
plt.imshow(image)
[5]:
<matplotlib.image.AxesImage at 0x7fb72ab06cd0>
../_images/tutorials_wrangling_basics_9_1.png

Sample text:

[6]:
print(text)
O give me a home where the buffaloes roam
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
Where the air is so pure and the zephyrs so free
And the breezes so balmy and light
That I would not exchange my home on the range
For all of the cities so bright
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day
How often at night when the heavens are bright
With the light of the glittering stars
I stand there amazed and I ask as I gaze
Does their glory exceed that of ours?
Home, home on the range
Where the deer and the antelope play
Where seldom is heard a discouraging word
And the skies are not cloudy all day

Wrangling DataFrames

Wrangling a well-formed DataFrame simply returns the same DataFrame:

[7]:
wrangled_df = dw.wrangle(dataframe)
assert np.allclose(dataframe, wrangled_df)
wrangled_df
[7]:
        FirstDim  SecondDim  ThirdDim  FourthDim  FifthDim
ByTwos
0              1          2         3          4         5
2              2          4         6          8        10
4              3          6         9         12        15
5              4          8        12         16        20
6              5         10        15         20        25
8              6         12        18         24        30
10             7         14        21         28        35

Wrangling Arrays

Wrangling an Array turns it into a DataFrame. If the Array is 2D, the resulting DataFrame will have the same shape:

[8]:
wrangled_array = dw.wrangle(array)
assert np.allclose(dataframe, wrangled_array)
wrangled_array
[8]:
   0   1   2   3   4
0  1   2   3   4   5
1  2   4   6   8  10
2  3   6   9  12  15
3  4   8  12  16  20
4  5  10  15  20  25
5  6  12  18  24  30
6  7  14  21  28  35

Note that we’ve recovered the original DataFrame, but the index and column labels have been reset. We can provide these labels to the wrangle function. The array_kwargs keyword argument specifies how array (or array-like) data objects should be turned into DataFrames:

[9]:
array_kwargs = {'index': dataframe.index, 'columns': dataframe.columns}
wrangled_array2 = dw.wrangle(array, array_kwargs=array_kwargs)
wrangled_array2
[9]:
        FirstDim  SecondDim  ThirdDim  FourthDim  FifthDim
ByTwos
0              1          2         3          4         5
2              2          4         6          8        10
4              3          6         9         12        15
5              4          8        12         16        20
6              5         10        15         20        25
8              6         12        18         24        30
10             7         14        21         28        35

Wrangling text data using natural language processing models

Next, let’s play with some text data. By default, data-wrangler embeds text using a Latent Dirichlet Allocation model trained on a curated version of Wikipedia, called the “minipedia” corpus. First we’ll split the text into its component lines, and then we’ll wrangle the result:

[10]:
lines = text.split('\n')  # creates a list of strings (one string per line)
wrangled_text = dw.wrangle(lines)
wrangled_text
loading corpus: minipedia...done!
[10]:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
1 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
2 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
3 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
4 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
5 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
6 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
7 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
8 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
9 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
10 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
11 0.01 0.01 0.01 0.51 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
12 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
13 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
14 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
15 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
16 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
17 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
18 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
19 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
20 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
21 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
22 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
23 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ... 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02

24 rows × 50 columns

In the resulting DataFrame, each row corresponds to a line of text, and each column corresponds to an embedding dimension. To get a better feel for what these dimensions mean, we can use the return_model flag to get back the fitted model, and then we can examine the top-weighted words from each topic:

[11]:
wrangled_text2, text_model = dw.wrangle(lines, text_kwargs={'return_model': True})
[12]:
# display the top-weighted words from each topic in the fitted model
def get_top_words(model, n_words=10):
    vectorizer = model[0]['model']
    embedder = model[1]['model']

    # invert the vocabulary: map column indices back to words
    vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
    top_words = []
    for k in range(embedder.components_.shape[0]):
        # sort each topic's word weights in descending order and keep the top n_words
        top_words.append([vocab[i] for i in np.argsort(embedder.components_[k, :])[::-1][:n_words]])
    return top_words

def display_top_words(model, n_words=10):
    for k, w in enumerate(get_top_words(model, n_words=n_words)):
        print(f'Topic {k}: {", ".join(w)}')

print(f'Top words from each of the {wrangled_text2.shape[1]} discovered topics:\n')
display_top_words(text_model)
Top words from each of the 50 discovered topics:

Topic 0: angle, points, normal, units, equal, measure, distribution, distance, unit, constant
Topic 1: foot, feet, double, speed, running, course, action, round, distance, motion
Topic 2: game, play, team, played, rules, sports, field, events, competition, women
Topic 3: mm, metal, plastic, glass, steel, diameter, sizes, machine, strength, cm
Topic 4: 2008, million, 2007, 2009, 2011, 2010, march, december, 2006, january
Topic 5: color, green, blue, yellow, colors, brown, tree, dark, plant, plants
Topic 6: music, sound, film, rock, played, television, play, record, records, classical
Topic 7: hours, night, hour, days, minutes, 24, week, daily, sun, working
Topic 8: theory, self, behavior, science, cultural, concept, studies, individuals, model, relationship
Topic 9: earth, sun, god, million, appears, visible, bodies, believed, billion, ago
Topic 10: computer, data, management, project, key, electronic, access, online, devices, technology
Topic 11: church, god, christian, religious, roman, religion, tradition, eastern, traditions, st
Topic 12: military, forces, force, ii, arms, russian, royal, service, units, operations
Topic 13: objects, mass, object, field, core, space, matter, fields, visible, energy
Topic 14: building, built, buildings, construction, house, floor, room, space, houses, walls
Topic 15: class, classes, anti, active, business, fall, military, 1950s, divided, feature
Topic 16: cross, symbol, sign, al, et, shaped, version, appears, represents, link
Topic 17: wind, ice, scale, energy, temperature, weather, speed, pressure, flow, sea
Topic 18: gas, energy, temperature, heat, chemical, carbon, liquid, compounds, acid, reaction
Topic 19: gain, southern, species, models, california, northern, frequency, active, components, spring
Topic 20: pressure, flow, fluid, test, volume, supply, liquid, inner, internal, outer
Topic 21: gold, iron, silver, metal, steel, value, carbon, bc, pure, ii
Topic 22: death, stage, dead, die, remains, man, performed, stages, carried, bodies
Topic 23: species, humans, male, female, million, ago, live, population, living, animals
Topic 24: language, languages, books, writing, written, words, text, published, formal, literature
Topic 25: health, medical, care, treatment, poor, population, million, studies, report, risk
Topic 26: animals, animal, skin, humans, eye, eyes, wild, kept, domestic, raised
Topic 27: cells, cell, growth, plants, structures, layer, plant, acid, functions, biological
Topic 28: women, sexual, children, men, female, male, child, woman, man, mother
Topic 29: oil, fruit, varieties, hot, served, consumption, grown, fresh, content, sold
Topic 30: service, court, legal, services, civil, department, federal, government, laws, issued
Topic 31: blood, disease, heart, risk, diseases, treatment, health, causes, medical, loss
Topic 32: vehicles, electric, speed, built, drive, safety, equipment, transport, electrical, technology
Topic 33: art, style, london, works, 18th, museum, tradition, saw, famous, william
Topic 34: political, party, government, rights, legal, organization, exchange, status, organizations, economic
Topic 35: fish, sea, ft, river, deep, land, fresh, 200, island, 500
Topic 36: city, road, street, cities, town, river, urban, population, island, travel
Topic 37: god, fruit, trees, disease, risk, treatment, head, million, cultural, medical
Topic 38: soil, land, plant, bc, plants, rock, regions, region, india, stone
Topic 39: story, damage, article, loss, ring, exposure, protection, journal, published, severe
Topic 40: section, big, principal, differences, ii, fully, model, wind, 32, split
Topic 41: base, lines, wall, pieces, piece, opening, figure, upper, vertical, branch
Topic 42: property, elements, numbers, element, table, properties, real, value, classical, theory
Topic 43: worn, paper, wear, clothing, fashion, women, men, style, styles, cover
Topic 44: chinese, king, india, east, african, africa, japanese, asia, indian, spanish
Topic 45: species, trees, plants, leaves, winter, wild, tree, season, summer, northern
Topic 46: foot, et, al, feet, science, political, height, 18th, court, religious
Topic 47: market, price, product, products, goods, store, chain, supply, industry, company
Topic 48: school, education, training, schools, degree, college, children, professional, programs, degrees
Topic 49: wood, head, image, cut, wooden, tools, edge, tool, images, face

Then we can ask which topics had the most weight in each line:

[13]:
i = 1
line_embedding = wrangled_text2.loc[i].values
line_top_topic = np.where(line_embedding == np.max(line_embedding))[0][0]

print(f'Line {i} put the most weight on topic {line_top_topic}: {lines[i]}')
Line 1 put the most weight on topic 2: Where the deer and the antelope play
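The same lookup can be applied to every line at once with pandas' idxmax. Here is a minimal sketch on a toy topic-weight matrix (the values below are invented for illustration and are not taken from the fitted model):

```python
import pandas as pd

# hypothetical topic weights: 3 lines x 4 topics (values invented for illustration)
weights = pd.DataFrame([[0.10, 0.60, 0.20, 0.10],
                        [0.50, 0.20, 0.20, 0.10],
                        [0.10, 0.10, 0.10, 0.70]])

# idxmax along axis=1 returns, for each row, the column label of the largest weight
top_topics = weights.idxmax(axis=1)
print(top_topics.tolist())  # [1, 0, 3]
```

Applied to the wrangled DataFrame above, this yields one top topic per line of text.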

Note that each time the model is re-trained, the topic weights will change. If all text data are wrangled in a single pass, data-wrangler will automatically apply the same model to all text data. However, if the data are wrangled in multiple calls to dw.wrangle, the model fit during the first pass should be re-used in subsequent analyses:

[14]:
def match(a, b, label):
    if np.allclose(a, b):
        print(f'{label.capitalize()}s match!')
    else:
        print(f'{label.capitalize()}s do NOT match!')

match(wrangled_text, wrangled_text2, 'topic')
Topics do NOT match!

We can re-apply the already-fitted model to “new” text:

[15]:
wrangled_text3 = dw.wrangle(lines, text_kwargs={'model': text_model})
match(wrangled_text2, wrangled_text3, 'topic')
Topics match!
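This fit-once, apply-many pattern mirrors standard scikit-learn usage. As a sketch with plain scikit-learn (illustrating the pattern, not data-wrangler's internals), note that refitting can change the embedding, whereas re-applying an already-fitted model is deterministic:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the deer and the antelope play',
        'home home on the range',
        'the skies are not cloudy all day']

# fit the vectorizer and topic model once...
vec = CountVectorizer()
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# ...then re-apply the fitted models with transform (no refitting)
embedding1 = lda.transform(vec.transform(docs))
embedding2 = lda.transform(vec.transform(docs))
assert (embedding1 == embedding2).all()  # identical on repeated application
```

Passing the fitted model back to dw.wrangle (as in the next cell) plays the same role as calling transform here.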

In addition to training scikit-learn text embedding models and applying them to new text, data-wrangler also provides wrappers for the models hosted on Hugging Face.

Support for Hugging Face models requires installing pytorch along with several other “heavy” libraries. To keep data-wrangler lightweight, these dependencies are not installed by default. To use Hugging Face models, install data-wrangler with the hf extra to pull in the additional required dependencies:

pip install --upgrade pydata-wrangler[hf]

Once the relevant requirements have been installed, the example text may be embedded using BERT as follows:

[16]:
bert = {'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased'], 'kwargs': {}}
bert_embeddings = dw.wrangle(lines, text_kwargs={'model': bert})
bert_embeddings
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[16]:
0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 -0.179094 0.284875 -0.299491 -0.177137 -0.141404 -0.067665 0.565553 0.467750 -0.202324 -0.212224 ... -0.084236 -0.418129 0.018537 0.011057 -0.008122 0.272119 -0.206632 -0.628336 0.101313 0.264802
1 -0.355487 -0.017091 -0.654681 -0.259151 0.354129 -0.081604 0.058410 0.774030 -0.508033 -0.182541 ... 0.604850 -0.409842 -0.351007 0.006899 0.283090 0.510657 -0.163572 -0.187182 0.706947 0.126187
2 -0.434022 0.232728 -0.357610 -0.538824 -0.114096 0.091183 0.355937 0.533486 -0.258791 -0.517260 ... 0.326534 -0.007036 0.122069 0.427969 0.178177 0.006136 0.400360 -0.195199 -0.078082 0.425256
3 0.116562 0.352505 -0.202589 -0.516457 -0.240493 -0.251500 0.434644 1.104639 -0.548261 -0.573233 ... -0.207017 -0.543512 0.078864 0.427558 0.229642 0.177264 0.150222 -0.463897 0.718633 0.301247
4 -0.303647 0.064255 0.083019 -0.227822 -0.155865 -0.340454 0.406733 0.363462 -0.119832 0.008430 ... 0.045680 -0.289353 0.040407 0.220519 0.385223 -0.200164 -0.278663 -0.711024 0.326283 0.156556
5 -0.355487 -0.017091 -0.654681 -0.259151 0.354129 -0.081604 0.058410 0.774030 -0.508033 -0.182541 ... 0.604850 -0.409842 -0.351007 0.006899 0.283090 0.510657 -0.163572 -0.187182 0.706947 0.126187
6 -0.434022 0.232728 -0.357610 -0.538824 -0.114096 0.091183 0.355937 0.533486 -0.258791 -0.517260 ... 0.326534 -0.007036 0.122069 0.427969 0.178177 0.006136 0.400360 -0.195199 -0.078082 0.425256
7 0.116562 0.352505 -0.202589 -0.516457 -0.240493 -0.251500 0.434644 1.104639 -0.548261 -0.573233 ... -0.207017 -0.543512 0.078864 0.427558 0.229642 0.177264 0.150222 -0.463897 0.718633 0.301247
8 -0.154609 -0.117675 -0.193155 -0.120318 0.144618 -0.040556 0.055185 0.593648 -0.052717 -0.609513 ... 0.127333 -0.753203 -0.320297 0.160509 0.180886 0.219988 -0.039618 -0.503588 0.528944 0.037808
9 -0.055069 0.233549 -0.373195 -0.063233 -0.368840 -0.094472 -0.001174 0.553668 -0.308054 -0.308339 ... -0.078380 -0.351585 0.063690 0.097025 0.297684 0.268404 -0.114566 -0.443027 0.438811 0.249571
10 -0.267577 0.273886 -0.290938 -0.154544 -0.294967 -0.245634 0.110082 0.199497 0.071730 0.048779 ... 0.186747 -0.157770 0.178595 -0.231232 0.034351 -0.263620 -0.297643 -0.015021 0.064520 0.413469
11 -0.197301 0.388715 -0.570143 0.060791 -0.689461 -0.086844 0.264934 0.730778 -0.349251 -0.632186 ... 0.062608 -0.156804 0.015915 0.356509 0.126876 0.070972 -0.266137 -0.248545 0.569642 0.025831
12 -0.303647 0.064255 0.083019 -0.227822 -0.155865 -0.340454 0.406733 0.363462 -0.119832 0.008430 ... 0.045680 -0.289353 0.040407 0.220519 0.385223 -0.200164 -0.278663 -0.711024 0.326283 0.156556
13 -0.355487 -0.017091 -0.654681 -0.259151 0.354129 -0.081604 0.058410 0.774030 -0.508033 -0.182541 ... 0.604850 -0.409842 -0.351007 0.006899 0.283090 0.510657 -0.163572 -0.187182 0.706947 0.126187
14 -0.434022 0.232728 -0.357610 -0.538824 -0.114096 0.091183 0.355937 0.533486 -0.258791 -0.517260 ... 0.326534 -0.007036 0.122069 0.427969 0.178177 0.006136 0.400360 -0.195199 -0.078082 0.425256
15 0.116562 0.352505 -0.202589 -0.516457 -0.240493 -0.251500 0.434644 1.104639 -0.548261 -0.573233 ... -0.207017 -0.543512 0.078864 0.427558 0.229642 0.177264 0.150222 -0.463897 0.718633 0.301247
16 -0.042624 0.237888 -0.339024 -0.296698 -0.030120 0.325440 0.256775 0.668790 -0.397956 -0.412569 ... -0.078481 -0.429169 0.062366 0.155723 0.464718 0.164203 -0.323323 -0.087794 0.519133 -0.012814
17 -0.128437 0.139192 -0.030835 -0.021700 -0.219863 -0.022784 0.231964 0.285454 -0.195330 -0.291857 ... 0.021852 -0.090953 -0.045454 0.106142 0.084842 -0.083967 0.176070 -0.354902 0.105398 0.048799
18 -0.058371 0.577921 -0.205151 -0.409644 -0.182388 -0.365214 0.237212 0.274319 -0.231599 -0.452986 ... 0.180460 -0.142040 0.198233 -0.205660 -0.146888 0.240958 -0.033177 -0.298921 0.645759 0.109323
19 -0.095738 0.258930 -0.246360 -0.317963 -0.303286 0.059010 0.884104 0.638759 -0.023153 -0.363116 ... -0.093948 -0.062690 -0.032260 -0.303910 0.204218 -0.038480 0.264909 -0.098913 0.112675 0.251036
20 -0.303647 0.064255 0.083019 -0.227822 -0.155865 -0.340454 0.406733 0.363462 -0.119832 0.008430 ... 0.045680 -0.289353 0.040407 0.220519 0.385223 -0.200164 -0.278663 -0.711024 0.326283 0.156556
21 -0.355487 -0.017091 -0.654681 -0.259151 0.354129 -0.081604 0.058410 0.774030 -0.508033 -0.182541 ... 0.604850 -0.409842 -0.351007 0.006899 0.283090 0.510657 -0.163572 -0.187182 0.706947 0.126187
22 -0.434022 0.232728 -0.357610 -0.538824 -0.114096 0.091183 0.355937 0.533486 -0.258791 -0.517260 ... 0.326534 -0.007036 0.122069 0.427969 0.178177 0.006136 0.400360 -0.195199 -0.078082 0.425256
23 0.116562 0.352505 -0.202589 -0.516457 -0.240493 -0.251500 0.434644 1.104639 -0.548261 -0.573233 ... -0.207017 -0.543512 0.078864 0.427558 0.229642 0.177264 0.150222 -0.463897 0.718633 0.301247

24 rows × 768 columns

Wrangling images

Images (stored in any format supported by matplotlib) are treated as Arrays. Images are wrangled into DataFrames by slicing the image along axis 2 (i.e., the color dimension), horizontally concatenating the slices, and then turning the resulting 2D Array into a DataFrame. In general, this approach is taken for all high-dimensional (> 2D) Arrays:

[17]:
wrangled_image = dw.wrangle(image)
plt.imshow(wrangled_image);
../_images/tutorials_wrangling_basics_32_0.png
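To see what this slicing does to the shape of the data, here is a minimal sketch of the transformation on a toy 3-channel array using plain NumPy (the exact channel ordering data-wrangler uses internally is an implementation detail; this just illustrates the shape change):

```python
import numpy as np

# toy "image": 2 pixels tall, 3 pixels wide, 3 color channels
toy_image = np.arange(18).reshape(2, 3, 3)

# slice along axis 2 (the color dimension) and horizontally concatenate the slices
flattened = np.hstack([toy_image[:, :, i] for i in range(toy_image.shape[2])])

# the result is 2D: same height, one block of columns per channel
print(flattened.shape)  # (2, 9)
```

The flattened 2D array is then wrapped in a DataFrame, just like any other 2D Array.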

Objects, file paths, and URLs

Data supplied to data-wrangler may be passed in directly as a Python object that is already loaded into memory (as in the examples above). However, data may also be supplied as a (string) file path or URL. For example, wrangling the already-loaded image and wrangling the image’s file path yield the same result:

[18]:
wrangled_image_from_path = dw.wrangle(image_file)
match(wrangled_image, wrangled_image_from_path, 'image')

Images match!

Handling multiple data types

Multiple objects, file paths, or URLs may be wrangled in a single function call. If desired, type-specific wrangling preferences may be provided. Specifying return_dtype=True also returns a list of the automatically detected data types for each object:

[19]:
text_kwargs = {'model': bert}

i = 10
first_lines = lines[:i]
last_lines = lines[i:]
[20]:
wrangled_data, dtypes = dw.wrangle([dataframe, array, image_file, first_lines, last_lines],
                                   text_kwargs=text_kwargs,
                                   return_dtype=True)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

We can check the inferred datatypes; note that the image file is detected as an 'array', since images are treated as Arrays:

[21]:
dtypes
[21]:
['dataframe', 'array', 'array', 'text', 'text']

We can also verify that when the data are wrangled simultaneously, we get the same results as when each object is wrangled separately. For example, here’s the newly wrangled image:

[22]:
# visualize the wrangled image
plt.imshow(wrangled_data[2])
[22]:
<matplotlib.image.AxesImage at 0x7fb72b2ea890>
../_images/tutorials_wrangling_basics_41_1.png

And here’s how the text embeddings compare to our previous results:

[23]:
# compare the first lines' embeddings:
match(bert_embeddings.iloc[:i], wrangled_data[3], 'first lines\'s BERT embedding')

# compare the last lines' embeddings
match(bert_embeddings.iloc[i:], wrangled_data[4], 'last lines\'s BERT embedding')
First lines's bert embeddings match!
Last lines's bert embeddings match!

Check out the other tutorials for more advanced data wrangling functions!