Check the init and runtime parameters you can pass to EntityExtractor.
YAML Init Parameters
You can specify the following parameters for EntityExtractor in the pipeline YAML (a configuration sketch follows the table):
Parameter | Type | Possible Values | Description |
---|---|---|---|
model_name_or_path | String | Default: elastic/distilbert-base-cased-finetuned-conll03-english | The name of the model to use for entity extraction. Required. |
model_version | String | Default: None | The version of the model. Optional. |
use_gpu | Boolean | True False Default: True | Specifies whether to use a GPU for inference. Required. |
progress_bar | Boolean | True False Default: True | Shows the progress bar when processing. Required. |
batch_size | Integer | Default: 16 | The number of documents to process per batch when extracting entities. Required. |
use_auth_token | Union of string and Boolean | Default: None | The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Optional. |
devices | List of union of string and torch device | Default: None | A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is ignored and a single CPU device is used for inference. Optional. |
aggregation_strategy | Literal | none simple first average max Default: first | The strategy for fusing tokens based on the model predictions. - none: Doesn't aggregate and returns the raw results from the model. - simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" could be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities. - first: Uses the simple strategy, except that a word can't end up with different tags. If there's ambiguity, the word uses the tag of its first token. - average: Uses the simple strategy, except that a word can't end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen. - max: Uses the simple strategy, except that a word can't end up with different tags. The word entity is the token with the maximum score. Required. |
add_prefix_space | Boolean | True False Default: None | Set to True if you don't want the first word to be treated differently. This is relevant for model types such as "bloom", "gpt2", and "roberta". For more information, see the Hugging Face documentation. Optional. |
num_workers | Integer | Default: 0 | The number of workers to use in the PyTorch DataLoader. Required. |
flatten_entities_in_meta_data | Boolean | True False Default: False | Flattens the entities predicted for a document from a list of dictionaries into a separate list per dictionary key in the document's meta data. Required. |
max_seq_len | Integer | Default: None | The maximum length of one input text for the model. If not provided, the maximum length is determined automatically from the tokenizer's model_max_length variable. Optional. |
pre_split_text | Boolean | True False Default: False | Splits the text of a document into words before it's passed to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Required. |
ignore_labels | List of strings | Default: None | A list of labels to ignore. If set to None, it defaults to ["O"]. Optional. |
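As a rough sketch, here's how these parameters might appear in a pipeline YAML. The component name (EntityExtractorNode), the pipeline name (query), and the surrounding pipeline structure are placeholders chosen for illustration; only the keys under params correspond to the parameters in the table above.

```yaml
components:
  - name: EntityExtractorNode   # placeholder name for this component
    type: EntityExtractor
    params:
      model_name_or_path: elastic/distilbert-base-cased-finetuned-conll03-english
      use_gpu: true
      batch_size: 16
      aggregation_strategy: first
      flatten_entities_in_meta_data: false

pipelines:
  - name: query                 # placeholder pipeline name
    nodes:
      - name: EntityExtractorNode
        inputs: [Query]         # assumes the extractor runs directly on the query input
```

Parameters you omit keep their defaults, so in practice you only set the ones you want to change.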
REST API Runtime Parameters
There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.