datawrangler.zoo.text

datawrangler.zoo.text.is_text(x)

Test whether an object contains (or points to) text.

Parameters

param x:

the object to test

Returns

return:

True if the object is (or points to) text and False otherwise.

datawrangler.zoo.text.wrangle_text(text, return_model=False, backend=None, **kwargs)

Turn text into DataFrames (pandas or Polars).

Parameters

param text:

A string or (nested) list of strings. Each string can contain either the to-be-wrangled text, a file path, or a URL.

param return_model:

if True, return a fitted model that may be applied to new text data, along with the wrangled text. Default: False.

param backend:

str, optional. The DataFrame backend to use (‘pandas’ or ‘polars’). If None, the default backend (pandas) is used.

param kwargs:

Other (optional) keyword arguments may be passed into the function to control the wrangling process:

  • ‘corpus’: any built-in or hugging-face corpus (see get_corpus for more details); this argument is passed to the get_corpus function as the “dataset_name” keyword argument

  • ‘config’: may be used to select a specific variant of the corpus (passed to get_corpus as the “config_name” keyword argument)

  • ‘model’: any scikit-learn-compatible or hugging-face-compatible model (see apply_text_model for more details). Simplified API examples:

    • ‘all-MiniLM-L6-v2’ (string format for sentence-transformers)

    • ‘CountVectorizer’ (string format for sklearn model)

    • [‘CountVectorizer’, ‘LatentDirichletAllocation’] (list of strings for sklearn pipeline)

    • {‘model’: ‘all-MiniLM-L6-v2’} (partial dict format)

    Full dict format (backward compatible):
    • {‘model’: ‘all-MiniLM-L6-v2’, ‘args’: [], ‘kwargs’: {}}

  • ‘array_kwargs’: a dictionary of keyword arguments that may be passed to wrangle_array to control how the final DataFrame is structured (see wrangle_array for details).
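The ‘model’ formats listed above all describe the same information. As a rough illustration, the sketch below expands a string, a partial dict, or a list of either into the full dict format. The helper name normalize_model_spec is hypothetical and is not part of data-wrangler; the library's own normalization may differ.

```python
def normalize_model_spec(spec):
    """Expand a model specification into the full dict format (illustrative only)."""
    if isinstance(spec, str):
        # String format: only the model name is given.
        return {'model': spec, 'args': [], 'kwargs': {}}
    if isinstance(spec, dict):
        if 'model' not in spec:
            raise ValueError("a model dict needs a 'model' key")
        # Partial dict format: fill in any missing 'args' or 'kwargs'.
        return {'model': spec['model'],
                'args': list(spec.get('args', [])),
                'kwargs': dict(spec.get('kwargs', {}))}
    if isinstance(spec, (list, tuple)):
        # List format: a pipeline, normalized step by step.
        return [normalize_model_spec(step) for step in spec]
    raise TypeError(f"unsupported model specification: {spec!r}")


specs = [
    'all-MiniLM-L6-v2',
    {'model': 'all-MiniLM-L6-v2'},
    ['CountVectorizer', 'LatentDirichletAllocation'],
]
normalized = [normalize_model_spec(s) for s in specs]
```

Under this sketch, the string and partial dict forms normalize to the same full dict, and the list form becomes a list of full dicts, one per pipeline step.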

Returns

return:

a DataFrame (pandas or Polars, depending on backend) or a list of DataFrames containing the embedded text. If return_model is True, a tuple is returned instead: its first element contains the embedded text and its second element contains the fitted models.

Examples

>>> import datawrangler as dw
>>> # Create pandas DataFrame with sentence embeddings
>>> df_pandas = dw.wrangle(["Hello world", "How are you?"], 
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'})
>>> # Create Polars DataFrame with sentence embeddings  
>>> df_polars = dw.wrangle(["Hello world", "How are you?"],
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'},
...                        backend='polars')
>>> # Use sklearn pipeline with pandas backend (default)
>>> df_sklearn = dw.wrangle(["This is text", "More text here"],
...                         text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation']})

datawrangler.zoo.text.get_corpus(dataset_name='wikipedia', config_name='20200501.en')

Download (and return) a text corpus. By default, a 2020 snapshot of all English Wikipedia articles is returned.

Parameters

param dataset_name:

a string containing the corpus name. Can be one of the following:

  • Corpora built into data-wrangler:

    • ‘minipedia’: a curated and cleaned-up subset of Wikipedia containing articles on a wide variety of topics

    • ‘neurips’: a collection of NeurIPS articles

    • ‘sotus’: transcripts of State of the Union addresses by US Presidents from 1989 to 2018

    • ‘khan’: transcripts of (most) Khan Academy YouTube videos

  • Any hugging-face corpus; for a full list see https://huggingface.co/datasets. Note that downloading hugging-face corpora also requires specifying a config_name.

param config_name:

configuration name or description for hugging-face corpora. This argument is ignored if dataset_name is set to one of the data-wrangler corpora described above.

Returns

return:

A list of strings, one per document in the corpus, where each string contains the text of that document.
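The interaction between dataset_name and config_name described above can be summarized in a small sketch. The helper resolve_corpus_request is hypothetical; it downloads nothing and only reports which source a request would use, under the assumption that any name outside the built-in list is treated as a hugging-face corpus.

```python
# Names of the built-in corpora listed above.
BUILTIN_CORPORA = ('minipedia', 'neurips', 'sotus', 'khan')


def resolve_corpus_request(dataset_name='wikipedia', config_name='20200501.en'):
    """Hypothetical helper: describe where a corpus would come from."""
    if dataset_name in BUILTIN_CORPORA:
        # config_name is ignored for built-in corpora.
        return {'source': 'data-wrangler',
                'dataset_name': dataset_name,
                'config_name': None}
    if config_name is None:
        # Hugging-face corpora also require a configuration name.
        raise ValueError('hugging-face corpora require a config_name')
    return {'source': 'hugging-face',
            'dataset_name': dataset_name,
            'config_name': config_name}


builtin = resolve_corpus_request('sotus', config_name='ignored')
default = resolve_corpus_request()
```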

datawrangler.zoo.text.get_text_model(x)

Given a valid scikit-learn or sentence-transformers model, or a string matching the name of a valid model, return a callable function or class constructor for the given model.

Parameters

param x:

an object to turn into a valid scikit-learn or sentence-transformers model. Can be:

  • An already-valid model instance

  • A string matching an sklearn model name (e.g., ‘LatentDirichletAllocation’, ‘CountVectorizer’)

  • A string matching a sentence-transformers model name (e.g., ‘all-MiniLM-L6-v2’, ‘all-mpnet-base-v2’)

  • A normalized dict with a ‘model’ key (e.g., {‘model’: ‘CountVectorizer’, ‘args’: [], ‘kwargs’: {}})

Returns

return:

A valid scikit-learn or sentence-transformers model (or None if no model matching the given description can be found)
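The "return a constructor, or None if nothing matches" behavior can be pictured as a name lookup across a list of modules. The sketch below is only an analogy and not data-wrangler's implementation: find_by_name is a hypothetical helper, and standard-library modules stand in for the scikit-learn modules that would be searched in practice.

```python
import importlib


def find_by_name(name, module_names):
    """Return the first callable called `name` found in the given modules, else None."""
    for module_name in module_names:
        module = importlib.import_module(module_name)
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    return None


# Stand-in search path; a real lookup would search model-providing modules.
search_path = ['collections', 'fractions']
counter_cls = find_by_name('Counter', search_path)
missing = find_by_name('NoSuchModel', search_path)
```

Here counter_cls is a class constructor that can be called to build an instance, while missing is None because no module on the search path defines that name.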

Examples

>>> from datawrangler.zoo.text import get_text_model
>>> get_text_model('LatentDirichletAllocation')  # sklearn model
>>> get_text_model('all-MiniLM-L6-v2')  # sentence-transformers model
>>> get_text_model({'model': 'CountVectorizer'})  # dict format

datawrangler.zoo.text.apply_text_model(x, text, *args, mode='fit_transform', return_model=False, **kwargs)

Apply a scikit-learn or hugging-face text embedding model to one or more text datasets. Scikit-learn models are trained on the specified corpus and then applied to all datasets. All Hugging-Face models are pre-trained.

Parameters

param x:

the model to apply. Supported models include:

  • Scikit-learn models. The recommended pipeline is to specify a feature extraction model (for turning text into a number-of-documents by number-of-features matrix), and then to apply a matrix decomposition or embedding model (for turning the features matrix into text embeddings). When models are passed as a list, each model is applied in succession to the output of the previous model. The pipeline is first fit to the provided corpus, and then applied to the given text. Default: [‘CountVectorizer’, ‘LatentDirichletAllocation’]

    • All scikit-learn text feature extraction models are supported; for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

    These may be passed either as callable modules (e.g., sklearn.feature_extraction.text.CountVectorizer) or as strings (e.g., ‘CountVectorizer’). Default options for each model are defined in config.ini.

  • Hugging-face models. These take raw text as input and produce text embeddings as output. Models can be specified using the simplified API (recommended) or full dict format:

    Simplified API (NEW):
    • As a string: ‘all-MiniLM-L6-v2’

    • As a partial dict: {‘model’: ‘all-MiniLM-L6-v2’}

    Popular models include:
    • ‘all-MiniLM-L6-v2’: Fast, good for general sentence similarity

    • ‘all-mpnet-base-v2’: High quality sentence embeddings

    • ‘paraphrase-MiniLM-L6-v2’: Good for paraphrase detection

    For a full list see: https://www.sbert.net/docs/pretrained_models.html

    Full dict format (backward compatible):

    {‘model’: ‘all-mpnet-base-v2’, ‘args’: [], ‘kwargs’: {}}

    or using the SentenceTransformer class:

    {‘model’: ‘SentenceTransformer’, ‘args’: [‘all-MiniLM-L6-v2’], ‘kwargs’: {}}

    The ‘kwargs’ dictionary may be further subdivided; if an ‘embedding_kwargs’ key is included in ‘kwargs’, its values will be treated as keyword arguments to be applied to the embedding model when it is initialized.
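One way to picture the ‘embedding_kwargs’ subdivision is as a split of the ‘kwargs’ dictionary into two parts. The sketch below uses a hypothetical helper, split_embedding_kwargs, and illustrative keys (‘device’, ‘batch_size’). What data-wrangler does with the remaining keys is not specified in this section, so the sketch simply returns them separately.

```python
def split_embedding_kwargs(kwargs):
    """Separate 'embedding_kwargs' from the other keyword arguments (illustrative only)."""
    remaining = dict(kwargs)  # copy, so the caller's dict is left untouched
    embedding_kwargs = dict(remaining.pop('embedding_kwargs', {}))
    return embedding_kwargs, remaining


spec = {
    'model': 'SentenceTransformer',
    'args': ['all-MiniLM-L6-v2'],
    # The keys below are examples chosen for illustration.
    'kwargs': {'embedding_kwargs': {'device': 'cpu'}, 'batch_size': 32},
}
init_kwargs, other_kwargs = split_embedding_kwargs(spec['kwargs'])
```

In this sketch, init_kwargs holds the values intended for model initialization and other_kwargs holds everything else.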

param text:

a string (a single word, sentence, or document), list of strings (a list of words, sentences, or documents), or a nested list of strings (a list of listed words, sentences, or documents). Strings and (shallow) lists of strings result in a single embedding matrix; nested lists produce a list of embedding matrices (one per lowest-level list)
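The relationship between the shape of text and the shape of the result can be sketched as follows. Both functions are hypothetical: toy_embed is a stand-in that produces two numbers per document rather than a real embedding, and embed_text handles only a single level of nesting.

```python
def toy_embed(documents):
    """Stand-in embedding: one row per document, [character count, word count]."""
    return [[len(doc), len(doc.split())] for doc in documents]


def embed_text(text):
    """Return one matrix for a string or flat list, or a list of matrices for a nested list."""
    if isinstance(text, str):
        # A single string is treated as a one-document dataset.
        return toy_embed([text])
    if all(isinstance(item, str) for item in text):
        # A shallow list of strings yields a single matrix.
        return toy_embed(list(text))
    # A nested list yields one matrix per inner list.
    return [toy_embed(list(group)) for group in text]


single = embed_text("Hello world")
flat = embed_text(["Hello world", "How are you?"])
nested = embed_text([["a b", "c"], ["d"]])
```

Here single and flat are each one matrix (a list of rows), while nested is a list of two matrices, one per inner list.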

param args:

a list of unnamed arguments to pass to every text embedding model or pipeline step. Default: [].

param mode:

one of: ‘fit’ (fit the model), ‘transform’ (apply an already-fitted model), or ‘fit_transform’ (fit a model and then apply it to the same text). The ‘fit’ mode is only supported for scikit-learn (and scikit-learn-compatible) models.
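The three modes mirror the scikit-learn estimator interface. The sketch below uses a toy bag-of-words class (ToyCountVectorizer, not a real scikit-learn model) and a hypothetical dispatcher, apply_model, to show how each mode maps onto a method call. Returning the fitted model in ‘fit’ mode is an assumption made for this sketch.

```python
from collections import Counter


class ToyCountVectorizer:
    """Minimal bag-of-words model with a scikit-learn-style interface."""

    def fit(self, documents):
        # Learn a sorted vocabulary from the training documents.
        self.vocabulary_ = sorted({w for d in documents for w in d.lower().split()})
        return self

    def transform(self, documents):
        # Count vocabulary words per document; unseen words are dropped.
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            rows.append([counts[w] for w in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


def apply_model(model, documents, mode='fit_transform'):
    """Dispatch on mode (illustrative only)."""
    if mode == 'fit':
        return model.fit(documents)
    if mode == 'transform':
        return model.transform(documents)
    if mode == 'fit_transform':
        return model.fit_transform(documents)
    raise ValueError(f"unknown mode: {mode!r}")


corpus = ["the cat sat", "the dog sat"]
fitted = apply_model(ToyCountVectorizer(), corpus, mode='fit')
new_counts = apply_model(fitted, ["the cat and the dog"], mode='transform')
corpus_counts = apply_model(ToyCountVectorizer(), corpus)
```

Fitting once and then calling ‘transform’ on new text keeps the feature columns aligned with the original vocabulary, which is the usual reason to keep a fitted model around.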

param return_model:

if True, return both the embedded text and a trained model that may be applied to new text. If False, return only the text embeddings. Default: False.

param kwargs:

keyword arguments passed to the embedding model; these are equivalent to specifying the embedding model as a dictionary. When a keyword argument appears in both model[‘kwargs’] and kwargs, the value in kwargs takes precedence.
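The precedence rule amounts to a dictionary merge in which directly passed keyword arguments overwrite those stored in the model dict. The helper merge_model_kwargs below is a hypothetical sketch of that rule, not the library's code; n_components and random_state are used as example LatentDirichletAllocation options.

```python
def merge_model_kwargs(model, **kwargs):
    """Merge model['kwargs'] with directly passed kwargs; the latter win."""
    merged = dict(model.get('kwargs', {}))
    merged.update(kwargs)  # directly passed values overwrite stored ones
    return merged


model = {
    'model': 'LatentDirichletAllocation',
    'args': [],
    'kwargs': {'n_components': 10, 'random_state': 0},
}
merged = merge_model_kwargs(model, n_components=50)
```

In this sketch the merged options keep random_state from the model dict, take n_components from the call, and leave the original model dict unchanged.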

Returns

return:

The text embeddings (if return_model is False) or a tuple whose first element is the text embeddings and whose second element is a fitted model that may be applied to new text (if return_model is True).