LLMMetadataExtractor
Extract structured metadata from documents using a Large Language Model (LLM) and add it to the document's metadata field.
Key Features
- Uses a ChatGenerator and a prompt to extract metadata from each document
- Adds extracted metadata directly to each document's metadata field
- Returns failed documents separately with error information for debugging or reprocessing
- Supports processing a configurable range of pages rather than the full document
- Works with any ChatGenerator configured to return JSON output
- Compatible with OpenAI, Anthropic, and other LLM providers
Configuration
- Drag the LLMMetadataExtractor component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Enter the prompt that instructs the LLM how to extract metadata. Use {{ document.content }} in the prompt to reference the document content.
  - Select a ChatGenerator and configure it to return JSON output. For example, with OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
- Go to the Advanced tab to configure expected_keys, page_range, raise_on_failure, and max_workers.
Connections
LLMMetadataExtractor accepts a list of documents and an optional page range as input. It outputs documents updated with extracted metadata and a separate list of failed_documents for any that could not be processed. Failed documents include metadata_extraction_error and metadata_extraction_response in their metadata. It typically receives documents from converters such as TextFileToDocument and sends documents with enriched metadata to DocumentSplitter or DocumentWriter.
Usage Example
Using the component in a pipeline
This index uses LLMMetadataExtractor to extract a title, named entities, and a one-sentence summary from each document and store them as metadata:
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  LLMMetadataExtractor:
    type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o-mini
          generation_kwargs:
            max_tokens: 500
            temperature: 0.0
            response_format:
              type: json_object
      prompt: |
        Extract the following metadata from the document and return it as JSON:
        - "title": The title or main topic of the document
        - "entities": A list of named entities (people, organizations, locations) mentioned
        - "summary": A brief one-sentence summary of the content
        Document content:
        {{ document.content }}
        Return only valid JSON with the keys: title, entities, summary
      expected_keys:
        - title
        - entities
        - summary
      raise_on_failure: false
      max_workers: 3
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE
connections:
  - sender: TextFileToDocument.documents
    receiver: LLMMetadataExtractor.documents
  - sender: LLMMetadataExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |
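To make the page_range notation concrete, here is a minimal standalone sketch of how a mixed list of page numbers and range strings could be expanded into individual page numbers. The function name `expand_pages` is hypothetical and uses only the Python standard library; it illustrates the notation and is not the component's own implementation.

```python
from typing import List, Union


def expand_pages(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries such as 3, '5', or '10-12' into a flat list of page numbers."""
    pages: List[int] = []
    for entry in page_range:
        text = str(entry).strip()
        if "-" in text:
            # A range string such as '10-12' covers both endpoints.
            start_text, end_text = text.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid page range: {text!r}")
            pages.extend(range(start, end + 1))
        else:
            # A single page given as an int or a numeric string.
            pages.append(int(text))
    return pages


print(expand_pages(["1-3", "5", "8", "10-12"]))
# [1, 2, 3, 5, 8, 10, 11, 12]
```

With this reading, ['1', '3'] selects only the first and third pages, while ['1-3'] selects pages one through three.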
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents that were successfully updated with the extracted metadata. |
| failed_documents | List[Document] | | Documents for which metadata extraction failed. These documents have metadata_extraction_error and metadata_extraction_response in their metadata. |
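The split between documents and failed_documents can be pictured with a small standalone sketch. It assumes each document is a plain dictionary with `content` and `meta` keys and that the LLM replies arrive as strings; the function `split_by_outcome` is hypothetical and only mirrors the behavior described above, it is not the component's code.

```python
import json
from typing import Any, Dict, List, Tuple


def split_by_outcome(
    docs: List[Dict[str, Any]], replies: List[str]
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Merge parsed JSON replies into document metadata; collect failures separately."""
    succeeded, failed = [], []
    for doc, reply in zip(docs, replies):
        meta = dict(doc.get("meta", {}))
        try:
            parsed = json.loads(reply)
            if not isinstance(parsed, dict):
                raise ValueError("Reply is not a JSON object")
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            # Keep the error and the raw reply so the document can be inspected or retried.
            meta["metadata_extraction_error"] = str(err)
            meta["metadata_extraction_response"] = reply
            failed.append({**doc, "meta": meta})
            continue
        meta.update(parsed)
        succeeded.append({**doc, "meta": meta})
    return succeeded, failed


docs = [
    {"content": "Berlin is the capital of Germany.", "meta": {}},
    {"content": "Untitled note", "meta": {}},
]
replies = ['{"title": "Berlin", "entities": ["Berlin", "Germany"]}', "Sorry, I cannot help."]
ok, failed = split_by_outcome(docs, replies)
print(ok[0]["meta"]["title"])
print(sorted(failed[0]["meta"]))
```

Because the raw reply is stored next to the error message, a follow-up job can filter on metadata_extraction_error and reprocess only the affected documents.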
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | | The prompt to be used for the LLM. Use {{ document.content }} to reference the document content. |
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM. The LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs. |
| expected_keys | Optional[List[str]] | None | The keys expected in the JSON output from the LLM. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. This parameter can be overridden in the run() method. |
| raise_on_failure | bool | False | Whether to raise an error on failure during the execution of the generator or validation of the JSON output. |
| max_workers | int | 3 | The maximum number of workers to use in the thread pool executor. |
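As a rough illustration of how max_workers, raise_on_failure, and expected_keys interact, the sketch below runs one LLM call per document on a bounded thread pool. The names `run_extractions`, `missing_keys`, and `fake_llm` are hypothetical, and `fake_llm` stands in for a real ChatGenerator call. How the component itself reacts to missing keys is not shown here; the helper only reports them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_extractions(
    contents: List[str],
    call_llm: Callable[[str], str],
    max_workers: int = 3,
    raise_on_failure: bool = False,
) -> List[Optional[str]]:
    """Call `call_llm` once per document on a bounded thread pool, preserving input order."""

    def safe_call(content: str) -> Optional[str]:
        try:
            return call_llm(content)
        except Exception:
            if raise_on_failure:
                raise
            return None  # the caller can treat None as a failed document

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_call, contents))


def missing_keys(parsed: dict, expected_keys: List[str]) -> List[str]:
    """Return the expected keys that are absent from a parsed reply."""
    return [key for key in expected_keys if key not in parsed]


def fake_llm(content: str) -> str:
    """Hypothetical stand-in for a ChatGenerator call; fails on empty input."""
    if not content:
        raise ValueError("empty document")
    return content.upper()


print(run_extractions(["a", "", "c"], fake_llm))
# ['A', None, 'C']
print(missing_keys({"title": "x"}, ["title", "entities", "summary"]))
# ['entities', 'summary']
```

A higher max_workers value shortens wall-clock time for large batches but sends more concurrent requests to the LLM provider, so provider rate limits are worth keeping in mind when raising it.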
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |