EntityExtractor
Use EntityExtractor in your pipelines to extract predefined entities out of text. It's most often used to perform named entity recognition (NER).
Apart from NER, you can use it, for example, to label each word by its part of speech. Entity extraction can also be a powerful way to generate metadata that you can then use as search filters.
EntityExtractor uses the elastic/distilbert-base-cased-finetuned-conll03-english
model by default. All entities the model extracts populate documents' metadata.
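To give you an idea of the result, here's a hedged sketch of a document after entity extraction. It assumes the entities are stored under an entities key in the document's metadata and use the typical Hugging Face token-classification fields (entity_group, word, start, end, score); the exact keys and labels depend on the model you use and on settings such as flatten_entities_in_meta_data.

```yaml
# Illustrative only: exact metadata keys and labels depend on the model and settings.
content: "Albert Einstein was born in Ulm."
meta:
  entities:
    - entity_group: PER      # label predicted by the model
      word: Albert Einstein  # the extracted text span
      start: 0               # character offset where the span starts
      end: 15                # character offset where the span ends (exclusive)
      score: 0.99            # model confidence
    - entity_group: LOC
      word: Ulm
      start: 28
      end: 31
      score: 0.98
```

Values like these are what you can later reference as metadata search filters.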
Basic Information
- Pipeline type: Used in indexing and query pipelines. In indexing pipelines, it extracts entities from all documents in the Document Store. In query pipelines, it extracts entities from the retrieved documents only.
- Nodes that can precede it in a pipeline:
- In indexing pipelines: PreProcessor, CNAzureConverter, PDFToTextConverter, TextConverter
- In query pipelines: Retriever, InterleaveDocuments, JoinDocuments, Ranker, RetrievalScoreAdjuster
- Nodes that can follow it in a pipeline:
- In indexing pipelines: None (it's recommended to place EntityExtractor right after PreProcessor)
- In query pipelines: Reader, ReferencePredictor, Ranker, RetrievalScoreAdjuster, PromptNode, JoinDocuments, InterleaveDocuments
- Node input: Documents
- Node output: Documents
- Available node classes: EntityExtractor
Usage Example
Here's an example of an EntityExtractor used in a query pipeline. It extracts entities from the retrieved documents only and the extracted entities populate the documents' metadata.
```yaml
components:
  - name: EntityExtractor
    type: EntityExtractor
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params:
      embedding_dim: 768
      similarity: cosine
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2
      model_format: sentence_transformers
      top_k: 20
...
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: EntityExtractor
        inputs: [Retriever]
...
```
Here's an indexing pipeline with EntityExtractor. After indexing, all documents in the Document Store have the extracted entities in their metadata.
Tip
Make sure EntityExtractor comes after PreProcessor. PreProcessor splits documents into smaller chunks, and if that splitting happened after entity extraction, it would break the alignment between the documents' text and the start and end character indices that EntityExtractor returns.
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: EntityExtractor
    type: EntityExtractor
    params:
      flatten_entities_in_meta_data: true
  - name: Converter
    type: TextConverter
  - name: Classifier
    type: FileTypeClassifier
  - name: Processor
    type: PreProcessor
    params:
      split_by: word
      split_length: 250
      split_overlap: 30
      split_respect_sentence_boundary: True
      language: en
...
pipelines:
  - name: indexing
    nodes:
      - name: Classifier
        inputs: [File]
      - name: Converter
        inputs: [Classifier.output_1]
      - name: Processor
        inputs: [Converter]
      - name: EntityExtractor
        inputs: [Processor]
      - name: DocumentStore
        inputs: [EntityExtractor]
...
```
Parameters
You can specify the following parameters for EntityExtractor in the pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
model_name_or_path | String | Default: elastic/distilbert-base-cased-finetuned-conll03-english | The name of the model to use for entity extraction. Required. |
model_version | String | Default: none | The version of the model. Optional. |
use_gpu | Boolean | True, False. Default: True | Specifies whether to use a GPU. Required. |
progress_bar | Boolean | True, False. Default: True | Shows a progress bar when processing. Required. |
batch_size | Integer | Default: 16 | The number of documents to extract entities from in one batch. Required. |
use_auth_token | Union of string and Boolean | Default: none | The API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. Optional. |
devices | List of union of string and torch device | Default: none | A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is not used and a single CPU device is used for inference. Optional. |
aggregation_strategy | Literal | none, simple, first, average, max. Default: first | The strategy to fuse tokens based on the model prediction. - none: Doesn't aggregate and returns raw results from the model. - simple: Attempts to group entities following the default schema. This means that (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Two consecutive B tags end up as different entities. In word-based languages, this may split words undesirably. For example, "Microsoft" would be tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Check the first, max, and average options for ways to mitigate this and disambiguate words. These mitigations only work on real words; "New York" might still be tagged with two different entities. - first: Uses the simple strategy, except that words can't end up with different tags. If there's ambiguity, a word uses the tag of its first token. - average: Uses the simple strategy, except that words can't end up with different tags. The scores are averaged across tokens, and the label with the maximum score is chosen. - max: Uses the simple strategy, except that words can't end up with different tags. The word entity is the token with the maximum score. Required. |
add_prefix_space | Boolean | True, False. Default: None | Set to True if you don't want the first word to be treated differently. This is relevant for model types such as bloom, gpt2, and roberta. For more information, see the Hugging Face documentation. Optional. |
num_workers | Integer | Default: 0 | The number of workers to use in the PyTorch DataLoader. Required. |
flatten_entities_in_meta_data | Boolean | True, False. Default: False | Converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary. Required. |
max_seq_len | Integer | Default: None | The maximum length of one input text for the model. If not set, it's determined automatically from the tokenizer's model_max_length. Optional. |
pre_split_text | Boolean | True, False. Default: False | Splits the text of a document into words before it's passed to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that don't use word-level tokenizers. Required. |
ignore_labels | List of strings | Default: None | A list of labels to ignore. If None, it defaults to ["O"] (tokens outside any entity). Optional. |
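To tie the parameters together, here's a hedged sketch of an EntityExtractor component definition in the pipeline YAML. The parameter names come from the table above; the values are placeholders for illustration, not recommendations.

```yaml
components:
  - name: EntityExtractor
    type: EntityExtractor
    params:
      model_name_or_path: elastic/distilbert-base-cased-finetuned-conll03-english # the default model
      aggregation_strategy: first         # merge tokens into whole-word entities
      batch_size: 16                      # documents processed per batch
      flatten_entities_in_meta_data: true # store entities as flat lists in the documents' metadata
      max_seq_len: 512                    # placeholder cap on input length; omit to use the tokenizer's model_max_length
      ignore_labels: ["O"]                # skip tokens outside any entity
```

Flattening the entities this way makes them easier to use later as metadata search filters.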