datawrangler.zoo.text
- datawrangler.zoo.text.is_text(x)
Test whether an object contains (or points to) text.
Parameters
- param x:
the object to test
Returns
- return:
True if the object is (or points to) text and False otherwise.
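As an illustration only, the kind of check described above could be sketched as follows. The helper name `looks_like_text` and its rules (any string counts as text, and a nested list counts when every leaf is a string) are assumptions made for this example; they are not taken from the library's implementation.

```python
def looks_like_text(x):
    """Hypothetical stand-in for is_text: True for strings and (nested) lists of strings."""
    if isinstance(x, str):
        # A string may hold the text itself, a file path, or a URL;
        # this sketch does not try to tell those cases apart.
        return True
    if isinstance(x, (list, tuple)):
        # Assumption: an empty list carries no text, so it is rejected.
        return len(x) > 0 and all(looks_like_text(item) for item in x)
    return False


print(looks_like_text("Hello world"))
print(looks_like_text([["a", "b"], ["c"]]))
print(looks_like_text([1, 2, 3]))
```

The recursive call is what lets the same rule cover both shallow and nested lists.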
- datawrangler.zoo.text.wrangle_text(text, return_model=False, backend=None, **kwargs)
Turn text into DataFrames (pandas or Polars).
Parameters
- param text:
A string or (nested) list of strings. Each string can contain either the to-be-wrangled text, a file path, or a URL.
- param return_model:
if True, return a fitted model that may be applied to new text data, along with the wrangled text. Default: False.
- param backend:
str, optional. The DataFrame backend to use (‘pandas’ or ‘polars’). If None, the default backend (pandas) is used.
- param kwargs:
Other (optional) keyword arguments may be passed into the function to control the wrangling process:
- ‘corpus’: any built-in or hugging-face corpus (see get_corpus for more details); this argument is passed to the get_corpus function as the “dataset_name” keyword argument.
- ‘config’: may be used to select a specific variant of the corpus (passed to get_corpus as the “config_name” keyword argument).
- ‘model’: any scikit-learn-compatible or hugging-face-compatible model (see apply_text_model for more details).
  - Simplified API examples:
    ‘all-MiniLM-L6-v2’ (string format for sentence-transformers)
    ‘CountVectorizer’ (string format for sklearn model)
    [‘CountVectorizer’, ‘LatentDirichletAllocation’] (list of strings for sklearn pipeline)
    {‘model’: ‘all-MiniLM-L6-v2’} (partial dict format)
  - Full dict format (backward compatible):
    {‘model’: ‘all-MiniLM-L6-v2’, ‘args’: [], ‘kwargs’: {}}
- ‘array_kwargs’: a dictionary of keyword arguments that may be passed to wrangle_array to control how the final DataFrame is structured (see wrangle_array for details).
Returns
- return:
a DataFrame (pandas or Polars, depending on backend) or a list of DataFrames containing the embedded text. If return_model is True, a tuple is returned instead: its first element contains the embedded text and its second element contains the fitted models.
Examples
>>> import datawrangler as dw
>>> # Create pandas DataFrame with sentence embeddings
>>> df_pandas = dw.wrangle(["Hello world", "How are you?"],
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'})
>>> # Create Polars DataFrame with sentence embeddings
>>> df_polars = dw.wrangle(["Hello world", "How are you?"],
...                        text_kwargs={'model': 'all-MiniLM-L6-v2'},
...                        backend='polars')
>>> # Use sklearn pipeline with pandas backend (default)
>>> df_sklearn = dw.wrangle(["This is text", "More text here"],
...                         text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation']})
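The ‘model’ formats accepted by wrangle_text (string, list of strings, partial dict, full dict) all describe the same information. A minimal sketch of how they could be expanded into the full dict format is shown below; the helper `normalize_model_spec` is hypothetical and is not claimed to match the library's internal normalization.

```python
def normalize_model_spec(spec):
    """Hypothetical helper: expand a simplified model spec into the full dict format."""
    if isinstance(spec, str):
        # String format, e.g. 'all-MiniLM-L6-v2' or 'CountVectorizer'.
        return {'model': spec, 'args': [], 'kwargs': {}}
    if isinstance(spec, dict):
        # Partial or full dict format: fill in any missing 'args'/'kwargs'.
        if 'model' not in spec:
            raise ValueError("a dict spec needs a 'model' key")
        return {'model': spec['model'],
                'args': list(spec.get('args', [])),
                'kwargs': dict(spec.get('kwargs', {}))}
    if isinstance(spec, (list, tuple)):
        # A list describes a pipeline: normalize each step in order.
        return [normalize_model_spec(step) for step in spec]
    raise TypeError(f"unsupported model spec: {spec!r}")


print(normalize_model_spec('all-MiniLM-L6-v2'))
print(normalize_model_spec(['CountVectorizer', 'LatentDirichletAllocation']))
```

Copying ‘args’ and ‘kwargs’ rather than reusing them keeps the caller's dictionaries from being modified by later steps.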
- datawrangler.zoo.text.get_corpus(dataset_name='wikipedia', config_name='20200501.en')
Download (and return) a text corpus. By default, a 2020 snapshot of all English Wikipedia articles is returned.
Parameters
- param dataset_name:
a string containing the corpus name. Can be one of the following:
- Corpora built into data-wrangler:
  ‘minipedia’: a curated and cleaned-up subset of Wikipedia containing articles on a wide variety of topics
  ‘neurips’: a collection of NeurIPS articles
  ‘sotus’: transcripts of State of the Union addresses by US Presidents from 1989 to 2018
  ‘khan’: transcripts of (most) Khan Academy YouTube videos
- Any hugging-face corpus; for a full list see https://huggingface.co/datasets. Note that downloading hugging-face corpora also requires specifying a config_name.
- param config_name:
configuration name or description for hugging-face corpora. This argument is ignored if dataset_name is set to one of the data-wrangler corpora described above.
Returns
- return:
A list of strings, one per document, where each string contains the text of one document in the corpus.
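The relationship between dataset_name and config_name can be summarized with a small sketch. The helper `describe_corpus_request` and the dictionary it returns are hypothetical; the sketch only restates the routing rule documented above and does not download anything.

```python
# Names of the corpora documented as built into data-wrangler.
BUILTIN_CORPORA = {'minipedia', 'neurips', 'sotus', 'khan'}


def describe_corpus_request(dataset_name='wikipedia', config_name='20200501.en'):
    """Hypothetical helper: report how a get_corpus call would be routed."""
    if dataset_name in BUILTIN_CORPORA:
        # config_name is ignored for the built-in corpora.
        return {'source': 'data-wrangler',
                'dataset_name': dataset_name,
                'config_name': None}
    # Anything else is treated as a hugging-face corpus, which needs a config_name.
    return {'source': 'hugging-face',
            'dataset_name': dataset_name,
            'config_name': config_name}


print(describe_corpus_request('sotus', config_name='anything'))
print(describe_corpus_request())
```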
- datawrangler.zoo.text.get_text_model(x)
Given a valid scikit-learn or sentence-transformers model, or a string matching the name of a valid model, return a callable function or class constructor for the given model.
Parameters
- param x:
an object to turn into a valid scikit-learn or sentence-transformers model. Can be:
- An already-valid model instance
- A string matching sklearn model names (e.g., ‘LatentDirichletAllocation’, ‘CountVectorizer’)
- A string matching sentence-transformers model names (e.g., ‘all-MiniLM-L6-v2’, ‘all-mpnet-base-v2’)
- A normalized dict with ‘model’ key (e.g., {‘model’: ‘CountVectorizer’, ‘args’: [], ‘kwargs’: {}})
Returns
- return:
A valid scikit-learn or sentence-transformers model (or None if no model matching the given description can be found)
Examples
>>> from datawrangler.zoo.text import get_text_model
>>> get_text_model('LatentDirichletAllocation')  # sklearn model
>>> get_text_model('all-MiniLM-L6-v2')  # sentence-transformers model
>>> get_text_model({'model': 'CountVectorizer'})  # dict format
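The general idea of turning a name into a constructor, and returning None when nothing matches, can be sketched with a name-based lookup. The helper `find_by_name` is hypothetical, and standard-library modules stand in for the model namespaces a real lookup would search; none of this is taken from the library's source.

```python
import collections
import decimal


def find_by_name(name, modules):
    """Hypothetical helper: return the first callable called `name` in `modules`, else None."""
    for module in modules:
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    return None


# Stand-ins for the modules a real model lookup would search.
search_path = [collections, decimal]

print(find_by_name('OrderedDict', search_path))
print(find_by_name('NoSuchModel', search_path))
```

Returning None rather than raising lets the caller decide whether to try another lookup strategy or report an error.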
- datawrangler.zoo.text.apply_text_model(x, text, *args, mode='fit_transform', return_model=False, **kwargs)
Apply a scikit-learn or hugging-face text embedding model to one or more text datasets. Scikit-learn models are trained on the specified corpus and then applied to all datasets. All Hugging-Face models are pre-trained.
Parameters
- param x:
the model to apply. Supported models include:
- Scikit-learn models. The recommended pipeline is to specify a feature extraction model (for turning text into a number-of-documents by number-of-features matrix), and then to apply a matrix decomposition or embedding model (for turning the features matrix into text embeddings). When models are passed as a list, each model is applied in succession to the output of the previous model. The pipeline is first fit to the provided corpus, and then applied to the given text. Default: [‘CountVectorizer’, ‘LatentDirichletAllocation’]
  - All scikit-learn text feature extraction models are supported; for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text These may be passed either as classes (e.g., sklearn.feature_extraction.text.CountVectorizer) or as strings (e.g., ‘CountVectorizer’). Default options for each model are defined in config.ini.
  - All scikit-learn matrix decomposition models are supported; for a full list see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition These may be passed either as classes (e.g., sklearn.decomposition.NMF) or as strings (e.g., ‘NMF’). Default options for each model are defined in config.ini.
- Hugging-face models. These take raw text as input and produce text embeddings as output. Models can be specified using the simplified API (recommended) or full dict format:
  - Simplified API (NEW):
    As a string: ‘all-MiniLM-L6-v2’
    As a partial dict: {‘model’: ‘all-MiniLM-L6-v2’}
  - Popular models include:
    ‘all-MiniLM-L6-v2’: Fast, good for general sentence similarity
    ‘all-mpnet-base-v2’: High quality sentence embeddings
    ‘paraphrase-MiniLM-L6-v2’: Good for paraphrase detection
    For a full list see: https://www.sbert.net/docs/pretrained_models.html
  - Full dict format (backward compatible):
    {‘model’: ‘all-mpnet-base-v2’, ‘args’: [], ‘kwargs’: {}}
  - or using the SentenceTransformer class:
    {‘model’: ‘SentenceTransformer’, ‘args’: [‘all-MiniLM-L6-v2’], ‘kwargs’: {}}
  The ‘kwargs’ dictionary may be further subdivided; if an ‘embedding_kwargs’ key is included in ‘kwargs’, its values will be treated as keyword arguments to be applied to the embedding model when it is initialized.
- param text:
a string (a single word, sentence, or document), list of strings (a list of words, sentences, or documents), or a nested list of strings (a list of listed words, sentences, or documents). Strings and (shallow) lists of strings result in a single embedding matrix; nested lists produce a list of embedding matrices (one per lowest-level list)
- param args:
a list of unnamed arguments to pass to every text embedding model or pipeline step. Default: [].
- param mode:
one of: ‘fit’ (fit the model), ‘transform’ (apply an already-fitted model), or ‘fit_transform’ (fit a model and then apply it to the same text). The ‘fit’ mode is only supported for scikit-learn (and scikit-learn-compatible) models.
- param return_model:
if True, return both the embedded text and a trained model that may be applied to new text. If False, return only the text embeddings. Default: False.
- param kwargs:
keyword arguments are passed to the embedding model; these are equivalent to specifying the embedding model as a dictionary. When a keyword argument appears in both model[‘kwargs’] and kwargs, the kwargs value is used preferentially.
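The precedence rule for kwargs can be illustrated with a short sketch. The helper `merge_model_kwargs` is hypothetical; it only demonstrates the documented behavior that a call-time keyword argument overrides the same key in model[‘kwargs’].

```python
def merge_model_kwargs(model, **kwargs):
    """Hypothetical helper: combine model['kwargs'] with call-time keyword arguments."""
    merged = dict(model.get('kwargs', {}))  # start from the model's own options
    merged.update(kwargs)                   # call-time values take precedence
    return merged


model = {'model': 'CountVectorizer', 'args': [],
         'kwargs': {'max_features': 1000, 'lowercase': True}}

print(merge_model_kwargs(model, max_features=500))
# {'max_features': 500, 'lowercase': True}
```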
Returns
- return:
The text embeddings (if return_model is False) or a tuple whose first element is the text embeddings and whose second element is a fitted model that may be applied to new text (if return_model is True).
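The interaction between pipeline steps, mode, and return_model can be sketched with toy stand-ins. Everything here is hypothetical: `ToyCounter` and `ToyNormalizer` are minimal classes written for this example (not scikit-learn models), and `apply_pipeline` is a simplified illustration, not the library's implementation. In particular, returning the fitted steps in ‘fit’ mode is an assumption.

```python
class ToyCounter:
    """Toy feature extractor: maps documents to word-count vectors."""

    def fit(self, docs):
        self.vocab_ = sorted({w for d in docs for w in d.lower().split()})
        return self

    def transform(self, docs):
        return [[d.lower().split().count(w) for w in self.vocab_] for d in docs]


class ToyNormalizer:
    """Toy second step: rescales each row so that it sums to 1."""

    def fit(self, rows):
        return self  # nothing to learn in this toy step

    def transform(self, rows):
        return [[v / (sum(r) or 1) for v in r] for r in rows]


def apply_pipeline(steps, data, mode='fit_transform', return_model=False):
    """Hypothetical helper: each step consumes the output of the previous step."""
    if mode not in ('fit', 'transform', 'fit_transform'):
        raise ValueError(f"unknown mode: {mode!r}")
    for step in steps:
        if mode != 'transform':
            step.fit(data)
        # Later steps need transformed input, even when only fitting.
        data = step.transform(data)
    if mode == 'fit':
        return steps  # assumption: 'fit' yields the fitted steps only
    return (data, steps) if return_model else data


corpus = ["the cat sat", "the dog ran"]
embeddings, fitted = apply_pipeline([ToyCounter(), ToyNormalizer()], corpus,
                                    return_model=True)
# Reuse the fitted steps on new text without refitting.
new = apply_pipeline(fitted, ["the cat ran"], mode='transform')
print(fitted[0].vocab_)
print(new)
```

Requesting the fitted steps once and reusing them in ‘transform’ mode keeps new text in the same feature space as the original corpus.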