EntityExtractor Parameters

Check the init and runtime parameters you can pass to EntityExtractor.

YAML Init Parameters

You can specify the following parameters for EntityExtractor in the pipeline YAML:

| Parameter | Type | Possible Values | Description |
|---|---|---|---|
| model_name_or_path | String | Default: elastic/distilbert-base-cased-finetuned-conll03-english | The name of the model to use for entity extraction. Required. |
| model_version | String | Default: None | The version of the model. Optional. |
| use_gpu | Boolean | True, False. Default: True | Specifies whether to use a GPU for inference. Required. |
| progress_bar | Boolean | True, False. Default: True | Shows a progress bar when processing. Required. |
| batch_size | Integer | Default: 16 | The number of documents to extract entities from per batch. Required. |
| use_auth_token | String or Boolean | Default: None | The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Optional. |
| devices | List of strings or torch devices | Default: None | A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is not used and a single CPU device is used for inference. Optional. |
| aggregation_strategy | Literal | none, simple, first, average, max. Default: first | The strategy to fuse tokens based on the model prediction. See the strategy descriptions after this table. Required. |
| add_prefix_space | Boolean | True, False. Default: None | Set to True if you don't want the first word to be treated differently. This is relevant for model types such as bloom, gpt2, and roberta. For more information, see the Hugging Face documentation. Optional. |
| num_workers | Integer | Default: 0 | The number of workers used by the PyTorch DataLoader. Required. |
| flatten_entities_in_meta_data | Boolean | True, False. Default: False | Converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary. Required. |
| max_seq_len | Integer | Default: None | The maximum length of one input text for the model. If not provided, the maximum length is determined automatically from the tokenizer's model_max_length variable. Optional. |
| pre_split_text | Boolean | True, False. Default: False | Splits the text of a document into words before it's passed to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Required. |
| ignore_labels | List of strings | Default: None | A list of labels to ignore. If None, it defaults to ["O"]. Optional. |

The aggregation_strategy options work as follows:
- none: Doesn't aggregate and returns the raw results from the model.
- simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) end up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" could be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities.
- first: Uses the simple strategy, except that words cannot end up with different tags. If there's ambiguity, a word uses the tag of its first token.
- average: Uses the simple strategy, except that words cannot end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen.
- max: Uses the simple strategy, except that words cannot end up with different tags. The word entity is the token with the maximum score.
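The following is a minimal sketch of how some of these init parameters might appear in a pipeline YAML. The EntityExtractor component definition is the relevant part; the surrounding query pipeline and the Retriever node are placeholder assumptions and depend on your own pipeline setup:

```yaml
components:
  - name: EntityExtractor        # a name you choose for the node
    type: EntityExtractor
    params:
      model_name_or_path: elastic/distilbert-base-cased-finetuned-conll03-english
      use_gpu: true
      batch_size: 16
      aggregation_strategy: first
      flatten_entities_in_meta_data: true

pipelines:
  - name: query                  # placeholder pipeline name
    nodes:
      - name: Retriever          # assumed upstream node, not defined in this sketch
        inputs: [Query]
      - name: EntityExtractor    # runs entity extraction on the retrieved documents
        inputs: [Retriever]
```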

REST API Runtime Parameters

There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.