NamedEntityExtractor

Annotate named entities in documents and store the annotations in each document's metadata.

Key Features

  • Identifies entities such as people, organizations, and locations in document text
  • Supports two backends: Hugging Face and spaCy
  • Stores entity annotations in each document's metadata field
  • Compatible with any Hugging Face NER model (for example, dslim/bert-base-NER) or spaCy NER model
  • Supports GPU acceleration and custom device configuration (Hugging Face backend)
  • Configurable batch size for efficient processing of large document sets

Configuration

  1. Drag the NamedEntityExtractor component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. On the General tab:
    1. Select the backend type: hugging_face or spacy.
    2. Enter the model name or path. For Hugging Face, use a model ID such as dslim/bert-base-NER. For spaCy, use a model name such as en_core_web_sm.
  4. Go to the Advanced tab to configure pipeline_kwargs, device, and token.
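
The choices made in these steps end up in the component's YAML definition. As a minimal sketch, a spaCy-backed configuration might look like the fragment below. It assumes the en_core_web_sm model is available in the environment where the pipeline runs and leaves the Advanced tab parameters at their defaults.

```yaml
components:
  NamedEntityExtractor:
    type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
    init_parameters:
      # Assumption: en_core_web_sm is installed where the pipeline runs.
      backend: spacy
      model: en_core_web_sm
```

Only backend and model change compared with a Hugging Face configuration. The component's connections stay the same.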

Connections

NamedEntityExtractor accepts a list of documents and a batch size as input. It outputs the same documents with named entity annotations stored in their metadata. It typically receives documents from converters such as TextFileToDocument or from DocumentSplitter, and sends annotated documents to embedders or DocumentWriter.

Usage Example

Using the component in a pipeline

This index uses NamedEntityExtractor to annotate named entities in documents before storing them:

components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1

  NamedEntityExtractor:
    type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
    init_parameters:
      backend: hugging_face
      model: dslim/bert-base-NER
      pipeline_kwargs:
      device:
      token:
        type: env_var
        env_vars:
          - HF_API_TOKEN
          - HF_TOKEN
        strict: false

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: NamedEntityExtractor.documents
  - sender: NamedEntityExtractor.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - TextFileToDocument.sources

Parameters

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Processed documents with named entity annotations stored in metadata. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | Union[str, NamedEntityExtractorBackend] | | Backend to use for NER. Options: hugging_face or spacy. |
| model | str | | Name of the model or a path to the model on the local disk. For Hugging Face, use model IDs like dslim/bert-base-NER. For spaCy, use model names like en_core_web_sm. |
| pipeline_kwargs | Optional[Dict[str, Any]] | None | Keyword arguments passed to the pipeline. The pipeline can override these arguments. Dependent on the backend. |
| device | Optional[ComponentDevice] | None | The device on which the model is loaded. If None, the default device is automatically selected. If a device or device map is specified in pipeline_kwargs, it overrides this parameter (only applicable to the Hugging Face backend). |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token to download private models from Hugging Face. |
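
To illustrate the device parameter, the fragment below sketches how a specific GPU might be set in the component's YAML. It assumes that ComponentDevice serializes as a mapping with type: single and a device string. The device name cuda:0 is illustrative only, so check the form against your Haystack version before relying on it.

```yaml
NamedEntityExtractor:
  type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
  init_parameters:
    backend: hugging_face
    model: dslim/bert-base-NER
    # Assumption: serialized ComponentDevice form; "cuda:0" is an example device.
    device:
      type: single
      device: "cuda:0"
```

As the device description notes, a device or device map given in pipeline_kwargs takes precedence over this setting on the Hugging Face backend.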

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |