NamedEntityExtractor
Annotate named entities in documents and store the annotations in each document's metadata.
Key Features
- Identifies named entities (people, organizations, locations, and other named items) in document text.
- Stores entity annotations as metadata in the processed documents.
- Supports two backends: Hugging Face and spaCy.
- Works with any sequence classification model from the Hugging Face model hub or any spaCy model with an NER component.
- Automatically groups recognized entities by class based on the model used.
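To illustrate what "entities grouped by class" can look like once annotations are available, here is a minimal sketch in plain Python. The `Annotation` dataclass and the `group_by_class` helper are hypothetical stand-ins written for this example; they assume each annotation carries an entity class plus start and end character offsets, and they are not the component's own classes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-in for one entity annotation (not the Haystack class)."""
    entity: str  # entity class assigned by the model, e.g. "PER", "ORG", "LOC"
    start: int   # start character offset in the document text
    end: int     # end character offset (exclusive)


def group_by_class(text: str, annotations: List[Annotation]) -> Dict[str, List[str]]:
    """Collect the annotated text spans under their entity class."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ann in annotations:
        grouped[ann.entity].append(text[ann.start:ann.end])
    return dict(grouped)


# Illustrative input: the offsets below were written by hand for this sentence.
text = "Angela Merkel visited Berlin with Siemens executives."
annotations = [
    Annotation("PER", 0, 13),
    Annotation("LOC", 22, 28),
    Annotation("ORG", 34, 41),
]

entities_by_class = group_by_class(text, annotations)
print(entities_by_class)
```

The class labels themselves depend on the model you choose, so a different model may produce a different set of keys.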
Configuration
- Drag the NamedEntityExtractor component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Choose the backend: hugging_face or spacy.
  - Enter the model name or path. For Hugging Face, use model IDs like dslim/bert-base-NER. For spaCy, use model names like en_core_web_sm.
- Go to the Advanced tab to configure additional settings:
  - Optionally set pipeline_kwargs to pass additional arguments to the model pipeline.
  - Configure device to specify where to load the model.
  - Set the token for downloading private Hugging Face models.
Connections
NamedEntityExtractor receives a list of documents as input, typically from a converter such as TextFileToDocument or from DocumentSplitter. It outputs a list of documents with named entity annotations stored in their metadata.
Connect its output to a document embedder or directly to DocumentWriter for storage.
Source Code
To check this component's source code, open named_entity_extractor.py in the Haystack repository.
Usage Examples
Basic Configuration
NamedEntityExtractor:
  type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
  init_parameters:
    backend: hugging_face
    model: dslim/bert-base-NER
    token:
      type: env_var
      env_vars:
        - HF_API_TOKEN
        - HF_TOKEN
      strict: false
Using the Component in an Index
This index uses NamedEntityExtractor to annotate named entities in documents before storing them:
# haystack-pipeline
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  NamedEntityExtractor:
    type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
    init_parameters:
      backend: hugging_face
      model: dslim/bert-base-NER
      pipeline_kwargs:
      device:
      token:
        type: env_var
        env_vars:
          - HF_API_TOKEN
          - HF_TOKEN
        strict: false
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: NamedEntityExtractor.documents
  - sender: NamedEntityExtractor.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Processed documents with named entity annotations stored in metadata. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | Union[str, NamedEntityExtractorBackend] | | Backend to use for NER. Options: hugging_face or spacy. |
| model | str | | Name of the model or a path to the model on the local disk. For Hugging Face, use model IDs like dslim/bert-base-NER. For spaCy, use model names like en_core_web_sm. |
| pipeline_kwargs | Optional[Dict[str, Any]] | None | Keyword arguments passed to the pipeline. The pipeline can override these arguments. Dependent on the backend. |
| device | Optional[ComponentDevice] | None | The device on which the model is loaded. If None, the default device is automatically selected. If a device or device map is specified in pipeline_kwargs, it overrides this parameter (only applicable to the Hugging Face backend). |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token to download private models from Hugging Face. |
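The default token value lists two environment variables with strict=False. As a rough illustration of what that lookup amounts to, the sketch below uses only the standard library. The `resolve_token` function is a hypothetical helper written for this example, not part of Haystack, and it assumes the variables are checked in the listed order and that a non-strict lookup yields no token instead of failing.

```python
import os
from typing import List, Optional


def resolve_token(env_vars: List[str], strict: bool = False) -> Optional[str]:
    """Return the value of the first environment variable that is set.

    Assumed behavior for illustration: with strict=False, a missing token
    is not an error and None is returned; with strict=True it raises.
    """
    for name in env_vars:
        value = os.environ.get(name)
        if value is not None:
            return value
    if strict:
        raise ValueError(f"None of these environment variables are set: {env_vars}")
    return None


# Mirrors the default shown in the table: try HF_API_TOKEN first, then HF_TOKEN.
token = resolve_token(["HF_API_TOKEN", "HF_TOKEN"], strict=False)
print("token found" if token is not None else "no token set")
```

Under these assumptions, public models such as dslim/bert-base-NER can still be downloaded without a token, while private models need one of the two variables to be set.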
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |
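To make the effect of batch_size concrete, here is a small sketch of splitting a list of items into batches. The `batched` helper is hypothetical and only illustrates the general idea; how each backend actually batches documents internally is not described on this page.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 1) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder strings stand in for Document objects.
docs = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]
batches = list(batched(docs, batch_size=2))
print(batches)
```

With the default batch_size of 1, every document forms its own batch; larger values trade memory for fewer model calls.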