EntityExtractor

Use EntityExtractor in your pipelines to extract predefined entities out of text. It's most often used to perform named entity recognition (NER).

Apart from NER, you can use it, for example, to label each word by part of speech. Entity extraction can also be a powerful way to generate metadata that you can then use as search filters.

EntityExtractor uses the elastic/distilbert-base-cased-finetuned-conll03-english model by default. All entities the model extracts populate documents' metadata.
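To illustrate what that metadata can look like, here's a minimal sketch. The key names (entities, entity_group, word, start, end, score) follow the Hugging Face token-classification output format and are an assumption here; the exact keys depend on the model and configuration:

```python
# Hypothetical shape of a document after EntityExtractor runs.
# The "entities" key and its fields are assumptions modeled on the
# Hugging Face token-classification output, not a guaranteed schema.
doc = {
    "content": "Apple was founded in Cupertino.",
    "meta": {
        "entities": [
            {"entity_group": "ORG", "word": "Apple", "start": 0, "end": 5, "score": 0.99},
            {"entity_group": "LOC", "word": "Cupertino", "start": 21, "end": 30, "score": 0.98},
        ]
    },
}

# Metadata like this can then back search filters, for example
# collecting all organizations mentioned in a document:
orgs = [e["word"] for e in doc["meta"]["entities"] if e["entity_group"] == "ORG"]
print(orgs)  # ['Apple']
```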

Basic Information

Usage Example

Here's an example of EntityExtractor used in a query pipeline. It extracts entities from the retrieved documents only, and the extracted entities populate the documents' metadata.

components:
  - name: EntityExtractor
    type: EntityExtractor
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params: 
      embedding_dim: 768
      similarity: cosine 
  - name: Retriever
    type: EmbeddingRetriever
    params:
       document_store: DocumentStore
       embedding_model: intfloat/e5-base-v2 
       model_format: sentence_transformers
       top_k: 20      
    ...
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: EntityExtractor
        inputs: [Retriever] 
        ...

Here's an indexing pipeline with EntityExtractor. After indexing, all documents in the Document Store have the extracted entities in their metadata.

Tip

Make sure EntityExtractor comes after PreProcessor. PreProcessor splits documents into smaller chunks, so if EntityExtractor ran first, the start and end character indices it returns would no longer align with the text of the resulting chunks.
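A quick sketch of why the ordering matters. If an entity's character offsets are computed on the full document and the document is split afterwards, the offsets no longer point at the right text:

```python
# Why extracting entities BEFORE splitting breaks character offsets.
text = "Berlin is the capital of Germany."

# Suppose an extractor reported "Germany" at characters 25..32 of the full text.
start, end = 25, 32
assert text[start:end] == "Germany"

# After splitting, the same indices no longer line up with the chunk:
chunk = text[14:]  # "capital of Germany."
print(repr(chunk[start:end]))  # '' -- the indices fall outside the chunk
```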

components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
       embedding_dim: 768
       similarity: cosine
  - name: EntityExtractor
    type: EntityExtractor
    params:
       flatten_entities_in_meta_data: true
  - name: Converter
    type: TextConverter
  - name: Classifier
    type: FileTypeClassifier
  - name: Processor
    type: PreProcessor
    params:
       split_by: word
       split_length: 250
       split_overlap: 30
       split_respect_sentence_boundary: True
       language: en
    ...
pipelines:
  - name: indexing
    nodes:
      - name: Classifier
        inputs: [File]
      - name: Converter
        inputs: [Classifier.output_1] 
      - name: Processor
        inputs: [Converter]
      - name: EntityExtractor
        inputs: [Processor]
      - name: DocumentStore
        inputs: [EntityExtractor]
        ...

Parameters

You can specify the following parameters for EntityExtractor in the pipeline YAML:

- model_name_or_path (String, required): The name of the model to use for entity extraction. Default: elastic/distilbert-base-cased-finetuned-conll03-english.
- model_version (String, optional): The version of the model. Default: None.
- use_gpu (Boolean, required): Specifies whether to use a GPU. Possible values: True, False. Default: True.
- progress_bar (Boolean, required): Shows a progress bar when processing. Possible values: True, False. Default: True.
- batch_size (Integer, required): The number of documents to process in one batch. Default: 16.
- use_auth_token (Union of String and Boolean, optional): The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Default: None.
- devices (List of union of String and torch device, optional): A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is ignored and a single cpu device is used for inference. Default: None.
- aggregation_strategy (Literal, required): The strategy to fuse tokens based on the model prediction. Possible values: none, simple, first, average, max. Default: first.
  - none: Doesn't aggregate and returns raw results from the model.
  - simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" could be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities.
  - first: Uses the simple strategy, except that words cannot end up with different tags. If there's ambiguity, a word uses the tag of its first token.
  - average: Uses the simple strategy, except that words cannot end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen.
  - max: Uses the simple strategy, except that words cannot end up with different tags. The word entity is the token with the maximum score.
- add_prefix_space (Boolean, optional): Set to True if you don't want the first word to be treated differently. This is relevant for model types such as bloom, gpt2, and roberta. For more information, see the Hugging Face documentation. Possible values: True, False. Default: None.
- num_workers (Integer, required): The number of workers to use in the PyTorch DataLoader. Default: 0.
- flatten_entities_in_meta_data (Boolean, required): Converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary. Possible values: True, False. Default: False.
- max_seq_len (Integer, optional): The maximum length of one input text for the model. If not provided, the maximum length is determined automatically from the tokenizer's model_max_length variable. Default: None.
- pre_split_text (Boolean, required): Splits the text of a document into words before passing it to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Possible values: True, False. Default: False.
- ignore_labels (List of strings, optional): A list of labels to ignore. If None, it defaults to ["O"]. Default: None.