Check the init and runtime parameters you can pass to EntityExtractor.
YAML Init Parameters
You can specify the following parameters for EntityExtractor in the pipeline YAML (a configuration sketch follows the table):
Parameter | Type | Possible Values | Description |
---|---|---|---|
model_name_or_path | String | Default: elastic/distilbert-base-cased-finetuned-conll03-english | The name of the model to use for entity extraction. Required. |
model_version | String | Default: None | The version of the model. Optional. |
use_gpu | Boolean | True False Default: True | Specifies whether to use a GPU for inference. Required. |
progress_bar | Boolean | True False Default: True | Shows the progress bar when processing. Required. |
batch_size | Integer | Default: 16 | The number of documents to process per batch when extracting entities. Required. |
use_auth_token | Union of string and Boolean | Default: None | The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Optional. |
devices | List of union of string and torch device | Default: None | A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is ignored and a single CPU device is used for inference. Optional. |
aggregation_strategy | Literal | none simple first average max Default: first | The strategy for fusing tokens based on the model predictions. - none: Doesn't aggregate and returns the raw results from the model. - simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" could be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities. - first: Uses the simple strategy, except that a word can't end up with different tags. If there's ambiguity, the word uses the tag of its first token. - average: Uses the simple strategy, except that a word can't end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen. - max: Uses the simple strategy, except that a word can't end up with different tags. The word entity is the token with the maximum score. Required. |
add_prefix_space | Boolean | True False Default: None | Set to True if you don't want the first word to be treated differently. This is relevant for model types such as "bloom", "gpt2", and "roberta". For more information, see the Hugging Face documentation. Optional. |
num_workers | Integer | Default: 0 | The number of workers to use in the PyTorch DataLoader. Required. |
flatten_entities_in_meta_data | Boolean | True False Default: False | Flattens the entities predicted for a document from a list of dictionaries into a separate list per dictionary key in the document's meta data. Required. |
max_seq_len | Integer | Default: None | The maximum length of one input text for the model. If not provided, the maximum length is determined automatically from the tokenizer's model_max_length variable. Optional. |
pre_split_text | Boolean | True False Default: False | Splits the text of a document into words before it's passed to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Required. |
ignore_labels | List of strings | Default: None | A list of labels to ignore. If set to None, it defaults to ["O"]. Optional. |
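As a rough sketch, here's how these parameters might appear in a pipeline YAML. The component name (EntityExtractorNode), the pipeline name (query), and the surrounding pipeline structure are placeholders chosen for illustration; only the keys under params correspond to the parameters in the table above.

```yaml
components:
  - name: EntityExtractorNode   # placeholder name for this component
    type: EntityExtractor
    params:
      model_name_or_path: elastic/distilbert-base-cased-finetuned-conll03-english
      use_gpu: true
      batch_size: 16
      aggregation_strategy: first
      flatten_entities_in_meta_data: false

pipelines:
  - name: query                 # placeholder pipeline name
    nodes:
      - name: EntityExtractorNode
        inputs: [Query]         # assumes the extractor runs directly on the query input
```

Parameters you omit keep their defaults, so in practice you only set the ones you want to change.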
REST API Runtime Parameters
There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.