datawrangler.zoo.text

datawrangler.zoo.text.is_text(x)

Test whether an object contains (or points to) text.

Parameters

x – the object to test

Returns

True if the object is (or points to) text and False otherwise.
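As a rough illustration of what "contains text" can mean for the inputs described in this module (a string or a nested list of strings), here is a minimal sketch. The helper name looks_like_text is hypothetical, and the real is_text may apply additional checks (for example, following file paths or URLs), which this sketch does not attempt.

```python
def looks_like_text(x):
    """Hypothetical helper: True if x is a string or a (nested) list/tuple of strings.

    This is an illustration only; it is not datawrangler's is_text.
    """
    if isinstance(x, str):
        return True
    if isinstance(x, (list, tuple)):
        # An empty container carries no text; every element must itself be text-like.
        return len(x) > 0 and all(looks_like_text(item) for item in x)
    return False


print(looks_like_text("a sentence"))                             # True
print(looks_like_text([["doc one", "doc two"], ["doc three"]]))  # True
print(looks_like_text(["mixed", 3.14]))                          # False
```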

datawrangler.zoo.text.wrangle_text(text, return_model=False, **kwargs)

Turn text into DataFrames

Parameters
  • text – A string or (nested) list of strings. Each string can contain either the to-be-wrangled text, a file path, or a URL.

  • return_model – if True, return a fitted model that may be applied to new text data, along with the wrangled text. Default: False.

  • kwargs

    Other (optional) keyword arguments may be passed into the function to control the wrangling process:

    • ‘corpus’: any built-in or hugging-face corpus (see get_corpus for more details); this argument is passed to the get_corpus function as the “dataset_name” keyword argument

    • ‘config’: may be used to select a specific variant of the corpus (passed to get_corpus as the “config_name” keyword argument)

    • ‘model’: any scikit-learn-compatible or hugging-face-compatible model (see apply_text_model for more details)

    • ‘array_kwargs’: a dictionary of keyword arguments that may be passed to wrangle_array to control how the final DataFrame is structured (see wrangle_array for details)

Returns

A DataFrame (or list of DataFrames) containing the embedded text. If return_model is True, a tuple is returned instead: its first element contains the embedded text and its second element contains the fitted models.
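The keyword-argument routing described above can be pictured with a small sketch. The function route_wrangle_kwargs and the placeholder key example_option are hypothetical names introduced here for illustration; this is not the library's implementation, only a restatement of the documented mapping (‘corpus’ to dataset_name, ‘config’ to config_name).

```python
def route_wrangle_kwargs(**kwargs):
    """Hypothetical helper: split wrangle_text-style keyword arguments by destination."""
    corpus_kwargs = {}
    # 'corpus' and 'config' are renamed to the keyword names that get_corpus expects.
    if "corpus" in kwargs:
        corpus_kwargs["dataset_name"] = kwargs.pop("corpus")
    if "config" in kwargs:
        corpus_kwargs["config_name"] = kwargs.pop("config")
    model = kwargs.pop("model", None)              # destined for apply_text_model
    array_kwargs = kwargs.pop("array_kwargs", {})  # destined for wrangle_array
    return corpus_kwargs, model, array_kwargs, kwargs  # last item: anything left over


corpus_kwargs, model, array_kwargs, rest = route_wrangle_kwargs(
    corpus="minipedia",
    model=["CountVectorizer", "LatentDirichletAllocation"],
    array_kwargs={"example_option": 1},  # placeholder key, not a real wrangle_array option
)
print(corpus_kwargs)  # {'dataset_name': 'minipedia'}
print(model)          # ['CountVectorizer', 'LatentDirichletAllocation']
```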

datawrangler.zoo.text.get_corpus(dataset_name='wikipedia', config_name='20200501.en')

Download (and return) a text corpus. By default, a 2020 snapshot of all English Wikipedia articles is returned.

Parameters
  • dataset_name

    a string containing the corpus name. Can be one of the following:

    • Corpora built into data-wrangler:

      • ‘minipedia’: a curated and cleaned up subset of Wikipedia containing articles on a wide variety of topics

      • ‘neurips’: a collection of NeurIPS articles

      • ‘sotus’: transcripts of State of the Union addresses from US Presidents from 1989 – 2018

      • ‘khan’: transcripts of (most) Khan Academy YouTube videos

    • Any hugging-face corpus; for a full list see https://huggingface.co/datasets. Note that downloading hugging-face corpora also requires specifying a config_name.

  • config_name – configuration name or description for hugging-face corpora. This argument is ignored if dataset_name is set to one of the data-wrangler corpora described above.

Returns

A list of strings (one per document), where each string contains the text of one document in the corpus.
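The interaction between dataset_name and config_name can be summarized in a short sketch. The function corpus_request and its return format are hypothetical and exist only to restate the rule above: config_name is ignored for built-in corpora and is needed for hugging-face corpora. Nothing is downloaded here.

```python
# Built-in corpus names as listed in the parameter description above.
BUILT_IN_CORPORA = {"minipedia", "neurips", "sotus", "khan"}


def corpus_request(dataset_name="wikipedia", config_name="20200501.en"):
    """Hypothetical helper: describe which source a get_corpus-style call would use."""
    if dataset_name in BUILT_IN_CORPORA:
        # config_name is ignored for corpora built into data-wrangler
        return {"source": "data-wrangler", "dataset_name": dataset_name}
    if config_name is None:
        raise ValueError("hugging-face corpora also require a config_name")
    return {
        "source": "hugging-face",
        "dataset_name": dataset_name,
        "config_name": config_name,
    }


print(corpus_request("sotus", config_name="ignored"))
print(corpus_request())  # defaults: the 2020 English Wikipedia snapshot
```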

datawrangler.zoo.text.get_text_model(x)

Given a valid scikit-learn or hugging-face model, or a string (e.g., ‘LatentDirichletAllocation’ or ‘TransformerDocumentEmbeddings’) matching the name of a valid scikit-learn or hugging-face model, return a callable function or class constructor for the given model.

Parameters

x – an object to turn into a valid scikit-learn or hugging-face model (e.g., an already-valid model or a string)

Returns

A valid scikit-learn or hugging-face model (or None if no model matching the given description can be found)
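The general idea of turning a name into a constructor can be sketched as a lookup over a list of modules. This is an assumption about the technique, not a description of how get_text_model is implemented; resolve_by_name is a hypothetical helper, and standard-library modules stand in for the scikit-learn and hugging-face packages so the example runs without extra dependencies.

```python
import importlib


def resolve_by_name(x, module_names):
    """Hypothetical helper: return x if callable, else look up its name in the given modules."""
    if callable(x):
        return x  # already a usable function or class constructor
    if isinstance(x, str):
        for name in module_names:
            try:
                module = importlib.import_module(name)
            except ImportError:
                continue  # skip packages that are not installed
            candidate = getattr(module, x, None)
            if callable(candidate):
                return candidate
    return None  # nothing matched the given description


# Standard-library modules used as stand-ins for model packages.
search_path = ["collections", "fractions"]
print(resolve_by_name("OrderedDict", search_path))
print(resolve_by_name("NoSuchModel", search_path))  # None
```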

datawrangler.zoo.text.apply_text_model(x, text, *args, mode='fit_transform', return_model=False, **kwargs)

Apply a scikit-learn or hugging-face text embedding model to one or more text datasets. Scikit-learn models are trained on the specified corpus and then applied to all datasets. All Hugging-Face models are pre-trained.

Parameters
  • x

    the model to apply. Supported models include:

    • Scikit-learn models. The recommended pipeline is to specify a feature extraction model (for turning text into a number-of-documents by number-of-features matrix), and then to apply a matrix decomposition or embedding model (for turning the features matrix into text embeddings). When models are passed as a list, each model is applied in succession to the output of the previous model. The pipeline is first fit to the provided corpus, and then applied to the given text. Default: [‘CountVectorizer’, ‘LatentDirichletAllocation’]

      • All scikit-learn text feature extraction models are supported; for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text These may be passed either as callable modules (e.g., sklearn.feature_extraction.text.CountVectorizer) or as strings (e.g., ‘CountVectorizer’). Default options for each model are defined in config.ini.

    • Hugging-face models. These take raw text as input and produce text embeddings as output. Hugging-face models
      are specified using dictionaries containing the keys ‘model’, ‘args’, and ‘kwargs’;
      for example, to embed a document using GPT-2, use

      {'model': 'TransformerDocumentEmbeddings', 'args': ['gpt2'], 'kwargs': {}}

      The ‘kwargs’ dictionary may be further subdivided; if an ‘embedding_kwargs’ key is included in ‘kwargs’, its values will be treated as keyword arguments to be applied to the embedding model when it is initialized. All other keyword arguments are passed on to flair.data.Sentence in order to tokenize the given text.

  • text – a string (a single word, sentence, or document), list of strings (a list of words, sentences, or documents), or a nested list of strings (a list of listed words, sentences, or documents). Strings and (shallow) lists of strings result in a single embedding matrix; nested lists produce a list of embedding matrices (one per lowest-level list)

  • args – a list of unnamed arguments to pass to every text embedding model or pipeline step. Default: [].

  • mode – one of: ‘fit’ (fit the model), ‘transform’ (apply an already-fitted model), or ‘fit_transform’ (fit a model and then apply it to the same text). The ‘fit’ mode is only supported for scikit-learn (and scikit-learn-compatible) models.

  • return_model – if True, return both the embedded text and a trained model that may be applied to new text. If False, return only the text embeddings. Default: False.

  • kwargs – keyword arguments are passed to the embedding model; these are equivalent to specifying the embedding model as a dictionary. When a keyword argument appears in both model[‘kwargs’] and kwargs, the kwargs value is used preferentially.
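The precedence rule for keyword arguments, and the separate handling of ‘embedding_kwargs’, can be sketched as a dictionary merge. The helper merge_model_kwargs and the option names option_a and option_b are hypothetical placeholders; the sketch assumes a shallow merge, which the documentation above does not specify.

```python
def merge_model_kwargs(model, **kwargs):
    """Hypothetical helper: combine model['kwargs'] with call-time kwargs.

    Call-time values win on conflict; 'embedding_kwargs' is split off so it can be
    handed to the embedding model, leaving the rest for text tokenization.
    """
    merged = {**model.get("kwargs", {}), **kwargs}  # later entries override earlier ones
    embedding_kwargs = merged.pop("embedding_kwargs", {})
    return embedding_kwargs, merged


model = {
    "model": "TransformerDocumentEmbeddings",
    "args": ["gpt2"],
    # option_a and option_b are placeholder names, not real flair options
    "kwargs": {"embedding_kwargs": {"option_a": 1}, "option_b": "from-dict"},
}
embedding_kwargs, other_kwargs = merge_model_kwargs(model, option_b="from-call")
print(embedding_kwargs)  # {'option_a': 1}
print(other_kwargs)      # {'option_b': 'from-call'}
```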

Returns

The text embeddings (if return_model is False) or a tuple whose first element is the text embeddings and whose second element is a fitted model that may be applied to new text (if return_model is True).
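To make the list-of-models behavior concrete, here is a dependency-free sketch of a two-step pipeline that is fit on a corpus and then applied to new text, with the optional tuple return. The classes WordCounter and RowNormalizer and the function apply_pipeline are toy stand-ins invented for this example (loosely playing the roles of a feature extractor and an embedding step); they are not scikit-learn models and this is not apply_text_model itself.

```python
class WordCounter:
    """Toy feature extractor: documents -> word-count rows over a fitted vocabulary."""

    def fit(self, docs):
        self.vocab_ = sorted({w for d in docs for w in d.lower().split()})
        return self

    def transform(self, docs):
        return [[d.lower().split().count(w) for w in self.vocab_] for d in docs]


class RowNormalizer:
    """Toy embedding step: scale each count row so that it sums to 1."""

    def fit(self, rows):
        return self  # nothing to learn in this toy step

    def transform(self, rows):
        return [[v / (sum(r) or 1) for v in r] for r in rows]


def apply_pipeline(steps, corpus, text, return_model=False):
    """Fit each step on the previous step's output for the corpus, then embed the text."""
    fitted, features = [], corpus
    for step in steps:
        step.fit(features)
        features = step.transform(features)  # output feeds the next step
        fitted.append(step)
    embedded = text
    for step in fitted:
        embedded = step.transform(embedded)
    return (embedded, fitted) if return_model else embedded


corpus = ["the cat sat", "the dog ran"]
embeddings, models = apply_pipeline(
    [WordCounter(), RowNormalizer()], corpus, ["the cat ran"], return_model=True
)
print(models[0].vocab_)  # ['cat', 'dog', 'ran', 'sat', 'the']
print(embeddings)
```

Because the fitted steps are returned alongside the embeddings, they can be reused on new text by calling transform on each step in order, which mirrors the purpose of return_model described above.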