EntityExtractor

Use EntityExtractor in your pipelines to extract predefined entities out of text. It's most often used to perform named entity recognition (NER).

Apart from NER, you can use it, for example, to label each word by part of speech. Entity extraction can also be a powerful way to generate metadata that you can then use as search filters.

EntityExtractor uses the elastic/distilbert-base-cased-finetuned-conll03-english model by default. All entities the model extracts populate documents' metadata.
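To illustrate what that metadata can look like, here's a minimal sketch. The key names (entities, entity_group, word, start, end, score) follow the Hugging Face token-classification output format and are an assumption here; the exact keys depend on the model and configuration:

```python
# Hypothetical shape of a document after EntityExtractor runs.
# The "entities" key and its fields are assumptions modeled on the
# Hugging Face token-classification output, not a guaranteed schema.
doc = {
    "content": "Apple was founded in Cupertino.",
    "meta": {
        "entities": [
            {"entity_group": "ORG", "word": "Apple", "start": 0, "end": 5, "score": 0.99},
            {"entity_group": "LOC", "word": "Cupertino", "start": 21, "end": 30, "score": 0.98},
        ]
    },
}

# Metadata like this can then back search filters, for example
# collecting all organizations mentioned in a document:
orgs = [e["word"] for e in doc["meta"]["entities"] if e["entity_group"] == "ORG"]
print(orgs)  # ['Apple']
```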

Basic Information

Usage Example

Here's an example of EntityExtractor used in a query pipeline. It extracts entities from the retrieved documents only, and the extracted entities populate the documents' metadata.

components:
  - name: EntityExtractor
    type: EntityExtractor
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params: 
      embedding_dim: 768
      similarity: cosine 
  - name: Retriever
    type: EmbeddingRetriever
    params:
       document_store: DocumentStore
       embedding_model: intfloat/e5-base-v2 
       model_format: sentence_transformers
       top_k: 20      
    ...
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: EntityExtractor
        inputs: [Retriever] 
        ...

Here's an indexing pipeline with EntityExtractor. After indexing, all documents in the Document Store have the extracted entities in their metadata.

Tip

Make sure EntityExtractor comes after PreProcessor. PreProcessor splits documents into smaller chunks, so if EntityExtractor ran first, the start and end character indices it returns would no longer align with the text of the resulting chunks.
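A quick sketch of why the ordering matters. If an entity's character offsets are computed on the full document and the document is split afterwards, the offsets no longer point at the right text:

```python
# Why extracting entities BEFORE splitting breaks character offsets.
text = "Berlin is the capital of Germany."

# Suppose an extractor reported "Germany" at characters 25..32 of the full text.
start, end = 25, 32
assert text[start:end] == "Germany"

# After splitting, the same indices no longer line up with the chunk:
chunk = text[14:]  # "capital of Germany."
print(repr(chunk[start:end]))  # '' -- the indices fall outside the chunk
```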

components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
       embedding_dim: 768
       similarity: cosine
  - name: EntityExtractor
    type: EntityExtractor
    params:
       flatten_entities_in_meta_data: true
  - name: Converter
    type: TextConverter
  - name: Classifier
    type: FileTypeClassifier
  - name: Processor
    type: PreProcessor
    params:
       split_by: word
       split_length: 250
       split_overlap: 30
       split_respect_sentence_boundary: True
       language: en
    ...
pipelines:
  - name: indexing
    nodes:
      - name: Classifier
        inputs: [File]
      - name: Converter
        inputs: [Classifier.output_1] 
      - name: Processor
        inputs: [Converter]
      - name: EntityExtractor
        inputs: [Processor]
      - name: DocumentStore
        inputs: [EntityExtractor]
        ...

Parameters

You can specify the following parameters for EntityExtractor in the pipeline YAML:

- model_name_or_path (String, required): The name of the model to use for entity extraction. Default: elastic/distilbert-base-cased-finetuned-conll03-english.
- model_version (String, optional): The version of the model. Default: None.
- use_gpu (Boolean, required): Specifies whether to use a GPU. Possible values: True, False. Default: True.
- progress_bar (Boolean, required): Shows a progress bar when processing. Possible values: True, False. Default: True.
- batch_size (Integer, required): The number of documents to process in one batch. Default: 16.
- use_auth_token (Union of String and Boolean, optional): The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Default: None.
- devices (List of union of String and torch device, optional): A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is ignored and a single cpu device is used for inference. Default: None.
- aggregation_strategy (Literal, required): The strategy to fuse tokens based on the model prediction. Possible values: none, simple, first, average, max. Default: first.
  - none: Doesn't aggregate and returns raw results from the model.
  - simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" could be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities.
  - first: Uses the simple strategy, except that words cannot end up with different tags. If there's ambiguity, a word uses the tag of its first token.
  - average: Uses the simple strategy, except that words cannot end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen.
  - max: Uses the simple strategy, except that words cannot end up with different tags. The word entity is the token with the maximum score.
- add_prefix_space (Boolean, optional): Set to True if you don't want the first word to be treated differently. This is relevant for model types such as bloom, gpt2, and roberta. For more information, see the Hugging Face documentation. Possible values: True, False. Default: None.
- num_workers (Integer, required): The number of workers to use in the PyTorch DataLoader. Default: 0.
- flatten_entities_in_meta_data (Boolean, required): Converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary. Possible values: True, False. Default: False.
- max_seq_len (Integer, optional): The maximum length of one input text for the model. If not provided, the maximum length is determined automatically from the tokenizer's model_max_length variable. Default: None.
- pre_split_text (Boolean, required): Splits the text of a document into words before passing it to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Possible values: True, False. Default: False.
- ignore_labels (List of strings, optional): A list of labels to ignore. If None, it defaults to ["O"]. Default: None.