datawrangler.zoo.text

datawrangler.zoo.text.is_text(x)

Test whether an object contains (or points to) text.

Parameters

param x:: the object to test

Returns

return:: True if the object is (or points to) text and False otherwise.
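
As a rough illustration of the kind of check described here, the following sketch accepts a string or a (nested) list of strings. The name looks_like_text is hypothetical, and treating an empty list as non-text is an assumption; the library's own test may consider other cases (for example, objects that point to files or URLs).

```python
def looks_like_text(x):
    """Return True for a string or a non-empty (nested) list of strings.

    Hypothetical stand-in for illustration; not datawrangler's implementation.
    """
    if isinstance(x, str):
        return True
    if isinstance(x, (list, tuple)) and len(x) > 0:
        # every element must itself be text (recursively)
        return all(looks_like_text(item) for item in x)
    return False


print(looks_like_text("Hello world"))
print(looks_like_text(["Hello", ["nested", "strings"]]))
print(looks_like_text([1, 2, 3]))
```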

datawrangler.zoo.text.wrangle_text(text, return_model=False, backend=None, **kwargs)

Turn text into DataFrames (pandas or Polars).

Parameters

param text:

A string or (nested) list of strings. Each string can contain either the to-be-wrangled text, a file path, or a URL.

param return_model:

if True, return a fitted model that may be applied to new text data, along with the wrangled text. Default: False.

param backend:

str, optional. The DataFrame backend to use (‘pandas’ or ‘polars’). If None, the default backend (pandas) is used.

param kwargs:

Other (optional) keyword arguments may be passed into the function to control the wrangling process:

‘corpus’: any built-in or hugging-face corpus (see get_corpus for more details); this argument is passed to the get_corpus function as the “dataset_name” keyword argument.
- the ‘config’ argument may be used to select a specific variant of the corpus (passed to get_corpus as the “config_name” keyword argument).
‘model’: any scikit-learn-compatible or hugging-face-compatible model (see apply_text_model for more details).
Simplified API examples:
- ‘all-MiniLM-L6-v2’ (string format for sentence-transformers)
- ‘CountVectorizer’ (string format for sklearn model)
- [‘CountVectorizer’, ‘LatentDirichletAllocation’] (list of strings for sklearn pipeline)
- {‘model’: ‘all-MiniLM-L6-v2’} (partial dict format)
Full dict format (backward compatible):
- {‘model’: ‘all-MiniLM-L6-v2’, ‘args’: [], ‘kwargs’: {}}
‘array_kwargs’: a dictionary of keyword arguments that may be passed to wrangle_array to control how the final DataFrame is structured (see wrangle_array for details).
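
The relationship between the simplified and full ‘model’ formats can be sketched as a small normalization step. The helper name normalize_model_spec is hypothetical and is not part of data-wrangler; it only shows how a string, a partial dict, or a list of either could be expanded into the full {‘model’, ‘args’, ‘kwargs’} form.

```python
def normalize_model_spec(spec):
    """Expand a simplified model spec into the full dict format.

    Hypothetical helper for illustration; not part of datawrangler's API.
    """
    if isinstance(spec, str):
        # 'CountVectorizer' -> full dict with empty args and kwargs
        return {'model': spec, 'args': [], 'kwargs': {}}
    if isinstance(spec, dict):
        if 'model' not in spec:
            raise ValueError("a dict spec needs a 'model' key")
        # fill in whichever of 'args' and 'kwargs' is missing
        return {'model': spec['model'],
                'args': list(spec.get('args', [])),
                'kwargs': dict(spec.get('kwargs', {}))}
    if isinstance(spec, (list, tuple)):
        # a pipeline: normalize each step, preserving order
        return [normalize_model_spec(step) for step in spec]
    raise TypeError(f"unsupported model spec: {type(spec).__name__}")


print(normalize_model_spec('all-MiniLM-L6-v2'))
print(normalize_model_spec({'model': 'all-MiniLM-L6-v2'}))
print(normalize_model_spec(['CountVectorizer', 'LatentDirichletAllocation']))
```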

Returns

return:: a DataFrame (pandas or Polars, depending on backend) or list of DataFrames containing the embedded text. If return_model is True, a tuple is returned instead: its first element contains the embedded text and its second element contains the fitted models.

Examples

>>> import datawrangler as dw
>>> # Create pandas DataFrame with sentence embeddings
>>> df_pandas = dw.wrangle(["Hello world", "How are you?"],
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'})
>>> # Create Polars DataFrame with sentence embeddings
>>> df_polars = dw.wrangle(["Hello world", "How are you?"],
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'},
...                        backend='polars')
>>> # Use sklearn pipeline with pandas backend (default)
>>> df_sklearn = dw.wrangle(["This is text", "More text here"],
...                         text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation']})

datawrangler.zoo.text.get_corpus(dataset_name='wikimedia/wikipedia', config_name='20231101.en')

Download (and return) a text corpus. By default, a 2023 snapshot of all English Wikipedia articles is returned.

Hugging-Face corpora must be referenced by their full namespace/name id (e.g. wikimedia/wikipedia, cam-cst/cbt); bare legacy names (e.g. wikipedia, cbt) are no longer accepted by datasets >= 4.

Parameters

param dataset_name:

a string containing the corpus name. Can be one of the following:

Corpora built into data-wrangler:
- ‘minipedia’: a curated and cleaned up subset of Wikipedia containing articles on a wide variety of topics
- ‘neurips’: a collection of NeurIPS articles
- ‘sotus’: transcripts of state of the union addresses from US Presidents from 1989 – 2018
- ‘khan’: transcripts of (most) Khan Academy YouTube videos
Any hugging-face corpus (for a full list see https://huggingface.co/datasets). Note that downloading hugging-face corpora also requires specifying a config_name.

param config_name:

configuration name or description for hugging-face corpora. This argument is ignored if dataset_name is set to one of the data-wrangler corpora described above.

Returns

return:: A list of strings, one per document in the corpus, where each string contains the text of that document.
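
The naming rules above (four built-in names, otherwise a full namespace/name id) can be checked before a download is attempted. This is a minimal sketch: classify_corpus_name is a hypothetical helper, not part of data-wrangler, and the single-slash test is only an approximation of a valid Hugging Face id.

```python
# Built-in corpus names listed in the dataset_name description
BUILTIN_CORPORA = {'minipedia', 'neurips', 'sotus', 'khan'}


def classify_corpus_name(dataset_name):
    """Label a corpus name as 'built-in' or 'hugging-face', or raise ValueError.

    Hypothetical pre-flight check for illustration only.
    """
    if dataset_name in BUILTIN_CORPORA:
        return 'built-in'
    parts = dataset_name.split('/')
    if len(parts) == 2 and all(parts):
        # looks like namespace/name, e.g. 'wikimedia/wikipedia'
        return 'hugging-face'
    raise ValueError(
        f"{dataset_name!r} is neither a built-in corpus nor a namespace/name id")


print(classify_corpus_name('sotus'))
print(classify_corpus_name('wikimedia/wikipedia'))
```

A bare legacy name such as 'wikipedia' raises ValueError in this sketch, mirroring the restriction described above.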

datawrangler.zoo.text.get_text_model(x)

Given a valid scikit-learn or sentence-transformers model, or a string matching the name of a valid model, return a callable function or class constructor for the given model.

Parameters

param x:: an object to turn into a valid scikit-learn or sentence-transformers model. Can be:
- An already-valid model instance
- A string matching sklearn model names (e.g., ‘LatentDirichletAllocation’, ‘CountVectorizer’)
- A string matching sentence-transformers model names (e.g., ‘all-MiniLM-L6-v2’, ‘all-mpnet-base-v2’)
- A normalized dict with ‘model’ key (e.g., {‘model’: ‘CountVectorizer’, ‘args’: [], ‘kwargs’: {}})

Returns

return:: A valid scikit-learn or sentence-transformers model (or None if no model matching the given description can be found)

Examples

>>> from datawrangler.zoo.text import get_text_model
>>> get_text_model('LatentDirichletAllocation')  # sklearn model
>>> get_text_model('all-MiniLM-L6-v2')  # sentence-transformers model
>>> get_text_model({'model': 'CountVectorizer'})  # dict format

datawrangler.zoo.text.apply_text_model(x, text, *args, mode='fit_transform', return_model=False, **kwargs)

Apply a scikit-learn or hugging-face text embedding model to one or more text datasets. Scikit-learn models are trained on the specified corpus and then applied to all datasets. All Hugging-Face models are pre-trained.

Parameters

param x:

the model to apply. Supported models include:

Scikit-learn models. The recommended pipeline is to specify a feature extraction model (for turning text into a number-of-documents by number-of-features matrix), and then to apply a matrix decomposition or embedding model (for turning the features matrix into text embeddings). When models are passed as a list, each model is applied in succession to the output of the previous model. The pipeline is first fit to the provided corpus, and then applied to the given text. Default: [‘CountVectorizer’, ‘LatentDirichletAllocation’]
- All scikit-learn text feature extraction models are supported (for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text). These may be passed either as callable modules (e.g., sklearn.feature_extraction.text.CountVectorizer) or as strings (e.g., ‘CountVectorizer’). Default options for each model are defined in config.ini.
- All scikit-learn matrix decomposition models are supported (for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition). These may be passed either as callable modules (e.g., sklearn.decomposition.NMF) or as strings (e.g., ‘NMF’). Default options for each model are defined in config.ini.
Hugging-face models. These take raw text as input and produce text embeddings as output. Models can be specified using the simplified API (recommended) or full dict format:

Simplified API (NEW):
- As a string: ‘all-MiniLM-L6-v2’
- As a partial dict: {‘model’: ‘all-MiniLM-L6-v2’}
Popular models include:
- ‘all-MiniLM-L6-v2’: Fast, good for general sentence similarity
- ‘all-mpnet-base-v2’: High quality sentence embeddings
- ‘paraphrase-MiniLM-L6-v2’: Good for paraphrase detection
For a full list see: https://www.sbert.net/docs/pretrained_models.html

Full dict format (backward compatible):
- {‘model’: ‘all-mpnet-base-v2’, ‘args’: [], ‘kwargs’: {}}
- or, using the SentenceTransformer class: {‘model’: ‘SentenceTransformer’, ‘args’: [‘all-MiniLM-L6-v2’], ‘kwargs’: {}}
The ‘kwargs’ dictionary may be further subdivided; if an ‘embedding_kwargs’ key is included in ‘kwargs’, its values will be treated as keyword arguments to be applied to the embedding model when it is initialized.

param text:

a string (a single word, sentence, or document), list of strings (a list of words, sentences, or documents), or a nested list of strings (a list of listed words, sentences, or documents). Strings and (shallow) lists of strings result in a single embedding matrix; nested lists produce a list of embedding matrices (one per lowest-level list)
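
The mapping from input shape to output shape can be sketched with a toy embedding in place of a real model. Everything here is illustrative: toy_embed and embed_text are hypothetical names, the two-number "embedding" is a placeholder, and treating a lone string as a one-row matrix is an assumption.

```python
def toy_embed(doc):
    """Stand-in embedding: [character count, word count]. Not a real model."""
    return [len(doc), len(doc.split())]


def embed_text(text):
    """Mirror the documented input shapes using the toy embedding."""
    if isinstance(text, str):
        # a single string becomes a one-row matrix (assumption)
        return [toy_embed(text)]
    if all(isinstance(t, str) for t in text):
        # shallow list -> one matrix with one row per document
        return [toy_embed(t) for t in text]
    # nested list -> one matrix per sub-list
    return [embed_text(t) for t in text]


flat = embed_text(["Hello world", "How are you?"])
nested = embed_text([["Hello world", "How are you?"], ["One more"]])
print(len(flat))                 # rows in the single matrix
print([len(m) for m in nested])  # rows in each matrix of the list
```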

param args:

a list of unnamed arguments to pass to every text embedding model or pipeline step. Default: [].

param mode:

one of: ‘fit’ (fit the model), ‘transform’ (apply an already-fitted model), or ‘fit_transform’ (fit a model and then apply it to the same text). The ‘fit’ mode is only supported for scikit-learn (and scikit-learn-compatible) models.
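
The three modes can be illustrated with a toy model that follows the scikit-learn fit/transform convention. ToyCounter and apply_toy_model are hypothetical names used only for this sketch, and returning the fitted model from ‘fit’ is an assumption rather than documented behavior.

```python
class ToyCounter:
    """Toy feature extractor: learns a vocabulary, then counts words."""

    def fit(self, docs):
        # the vocabulary is the sorted set of lower-cased words seen in docs
        self.vocab_ = sorted({w for d in docs for w in d.lower().split()})
        return self

    def transform(self, docs):
        # one row per document, one column per vocabulary word
        return [[d.lower().split().count(w) for w in self.vocab_] for d in docs]


def apply_toy_model(model, docs, mode='fit_transform'):
    """Dispatch on mode in the way the parameter description reads."""
    if mode == 'fit':
        return model.fit(docs)
    if mode == 'transform':
        return model.transform(docs)
    if mode == 'fit_transform':
        return model.fit(docs).transform(docs)
    raise ValueError(f"unknown mode: {mode!r}")


corpus = ["the cat sat", "the dog sat"]
model = apply_toy_model(ToyCounter(), corpus, mode='fit')
new_rows = apply_toy_model(model, ["the cat and the dog"], mode='transform')
print(model.vocab_)
print(new_rows)
```

Words that were not seen during fitting ("and" above) are simply ignored at transform time, which is why fitting on a representative corpus matters.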

param return_model:

if True, return both the embedded text and a trained model that may be applied to new text. If False, return only the text embeddings. Default: False.

param kwargs:

keyword arguments are passed to the embedding model; these are equivalent to specifying the embedding model as a dictionary. When a keyword argument appears in both model[‘kwargs’] and kwargs, the kwargs value is used preferentially.
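
The precedence rule can be sketched as a dictionary merge in which call-time values overwrite those stored in the model specification. merge_model_kwargs is a hypothetical helper, not part of data-wrangler; n_components and random_state are ordinary LatentDirichletAllocation options used here only as sample keys.

```python
def merge_model_kwargs(model, **kwargs):
    """Combine model['kwargs'] with call-time kwargs; call-time values win."""
    merged = dict(model.get('kwargs', {}))  # copy so the spec is not mutated
    merged.update(kwargs)                   # later values overwrite earlier ones
    return merged


spec = {'model': 'LatentDirichletAllocation',
        'args': [],
        'kwargs': {'n_components': 10, 'random_state': 0}}

print(merge_model_kwargs(spec, n_components=50))
```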

Returns

return:: The text embeddings (if return_model is False) or a tuple whose first element is the text embeddings and whose second element is a fitted model that may be applied to new text (if return_model is True).

datawrangler.zoo.text.to_str_list(x, encoding='utf-8')

Internal helper function used to wrangle text data. Handles binary strings, nested lists of strings, and arrays or DataFrames containing text.

Parameters

param x:: the text-containing object to be wrangled.
param encoding:: for objects of type bytes, specify the encoding. Default: ‘utf-8’.

Returns

return:: a string or (possibly nested) list of strings
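
A simplified sketch of this kind of conversion, covering only bytes and nested lists, is shown below. The name to_str_list_sketch is hypothetical, and arrays and DataFrames are deliberately left out to keep the example dependency-free.

```python
def to_str_list_sketch(x, encoding='utf-8'):
    """Decode bytes and walk nested lists, returning str or nested lists of str.

    Illustrative only; arrays and DataFrames are not handled here.
    """
    if isinstance(x, bytes):
        return x.decode(encoding)
    if isinstance(x, str):
        return x
    if isinstance(x, (list, tuple)):
        # preserve nesting while converting each element
        return [to_str_list_sketch(item, encoding) for item in x]
    raise TypeError(f"unsupported type: {type(x).__name__}")


print(to_str_list_sketch([b'caf\xc3\xa9', ['plain', b'bytes']]))
```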

datawrangler.zoo.text.get_text(x, force_literal=False)

Parse, load, or download one or more documents.

Parameters

param x:: A string or list of strings. Each string can be either the text of a document, a file path, or a URL. If a file path or URL is provided, the contents are loaded in, treated as text, and returned. If a list of strings is provided, the get_text function is applied to each element of the list.
param force_literal:: If True, interpret strings literally (rather than checking to see if the strings point to a local or remote file). Default: False.

Returns

return:: The text as a string or (potentially nested) list of strings
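
The resolution order described for x (literal text, file path, or URL) can be approximated with the standard library. This is a sketch under stated assumptions, not the library's implementation: get_text_sketch is a hypothetical name, only http and https URLs are recognized, and all content is decoded as UTF-8.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def get_text_sketch(x, force_literal=False):
    """Return literal text, file contents, or URL contents for x (recursively)."""
    if isinstance(x, (list, tuple)):
        return [get_text_sketch(item, force_literal) for item in x]
    if force_literal:
        return x  # skip the file and URL checks entirely
    if urlparse(x).scheme in ('http', 'https'):
        with urlopen(x) as response:  # network access; assumed UTF-8 content
            return response.read().decode('utf-8')
    if os.path.isfile(x):
        with open(x, encoding='utf-8') as f:
            return f.read()
    return x  # not a URL or an existing file: treat as the text itself


# Write a temporary file so that the path branch can be demonstrated offline
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("text stored on disk")
path = tmp.name

loaded = get_text_sketch(path)                       # file contents
literal = get_text_sketch(path, force_literal=True)  # the path string itself
os.remove(path)
print(loaded)
```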