# NamedEntityExtractor
Annotate named entities in documents and store the annotations in the documents' metadata.
## Basic Information
- Type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
- Components it can connect with:
  - Converters: NamedEntityExtractor can receive documents from converters in an index.
  - DocumentSplitter: NamedEntityExtractor can receive split documents from DocumentSplitter or send annotated documents to it.
  - DocumentWriter: NamedEntityExtractor can send annotated documents to DocumentWriter.
## Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |
## Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Processed documents with named entity annotations stored in metadata. |
## Overview
NamedEntityExtractor extracts predefined named entities from a piece of text, identifying items such as people, organizations, and locations in the document content. The component recognizes entities and groups them by class, such as people's names, organizations, and locations. The exact classes are determined by the model you use with the component.
The component supports two backends:
- Hugging Face: Use any token classification model from the Hugging Face model hub. For example, dslim/bert-base-NER is a popular choice for general NER tasks.
- spaCy: Use any spaCy model that contains an NER component.
Annotations are stored as metadata in the documents.
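To make the metadata-based storage concrete, here is a minimal sketch in plain Python (no Haystack dependency) of how entity annotations attached to a document's metadata can be mapped back to the text spans they cover. The document content, the "named_entities" key, and the annotation fields (entity, start, end) are illustrative assumptions; check the annotation format produced by your Haystack version.

```python
# Illustrative document with NER annotations stored in its metadata.
# Assumption: each annotation carries an entity class and character offsets.
doc = {
    "content": "Sundar Pichai leads Google in Mountain View.",
    "meta": {
        "named_entities": [
            {"entity": "PER", "start": 0, "end": 13},
            {"entity": "ORG", "start": 20, "end": 26},
            {"entity": "LOC", "start": 30, "end": 43},
        ],
    },
}

def surface_forms(document):
    """Group the annotated text spans by entity class."""
    text = document["content"]
    grouped = {}
    for ann in document["meta"]["named_entities"]:
        grouped.setdefault(ann["entity"], []).append(text[ann["start"]:ann["end"]])
    return grouped

print(surface_forms(doc))
# → {'PER': ['Sundar Pichai'], 'ORG': ['Google'], 'LOC': ['Mountain View']}
```

Storing offsets rather than raw strings keeps the annotations unambiguous even when the same surface form appears more than once in a document.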
## Usage Example
### Using the component in a pipeline
This index uses NamedEntityExtractor to annotate named entities in documents before storing them:
```yaml
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  NamedEntityExtractor:
    type: haystack.components.extractors.named_entity_extractor.NamedEntityExtractor
    init_parameters:
      backend: hugging_face
      model: dslim/bert-base-NER
      pipeline_kwargs:
      device:
      token:
        type: env_var
        env_vars:
          - HF_API_TOKEN
          - HF_TOKEN
        strict: false
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: NamedEntityExtractor.documents
  - sender: NamedEntityExtractor.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - TextFileToDocument.sources
```
## Parameters
### Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | Union[str, NamedEntityExtractorBackend] | | Backend to use for NER. Options: hugging_face or spacy. |
| model | str | | Name of the model or a path to the model on the local disk. For Hugging Face, use model IDs like dslim/bert-base-NER. For spaCy, use model names like en_core_web_sm. |
| pipeline_kwargs | Optional[Dict[str, Any]] | None | Keyword arguments passed to the pipeline. The pipeline can override these arguments. Dependent on the backend. |
| device | Optional[ComponentDevice] | None | The device on which the model is loaded. If None, the default device is automatically selected. If a device or device map is specified in pipeline_kwargs, it overrides this parameter (only applicable to the Hugging Face backend). |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token to download private models from Hugging Face. |
### Run Method Parameters
These are the parameters you can configure for the run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the documents. |
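The batch_size parameter controls how many documents are handed to the model at once. A sketch of the grouping it implies, in plain Python with no Haystack dependency (the document list and batching helper are illustrative, not the component's internals):

```python
def batched(items, batch_size=1):
    """Yield successive groups of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
print(list(batched(docs, batch_size=2)))
# → [['doc-a', 'doc-b'], ['doc-c', 'doc-d'], ['doc-e']]
```

Larger batches generally improve GPU throughput at the cost of memory; the default of 1 processes documents one at a time.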