LLMMetadataExtractor

Extract structured metadata from documents using a Large Language Model (LLM) and add it to the document's metadata field.

Key Features

  • Uses a ChatGenerator and a prompt to extract metadata from each document
  • Adds extracted metadata directly to each document's metadata field
  • Returns failed documents separately with error information for debugging or reprocessing
  • Supports processing a configurable range of pages rather than the full document
  • Works with any ChatGenerator configured to return JSON output
  • Compatible with OpenAI, Anthropic, and other LLM providers

Configuration

  1. Drag the LLMMetadataExtractor component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. On the General tab:
    1. Enter the prompt that instructs the LLM how to extract metadata. Use {{ document.content }} in the prompt to reference the document content.
    2. Select a ChatGenerator and configure it to return JSON output. For example, with OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
  4. Go to the Advanced tab to configure expected_keys, page_range, raise_on_failure, and max_workers.

Connections

LLMMetadataExtractor accepts a list of documents and an optional page range as input. It outputs documents updated with extracted metadata and a separate list of failed_documents for any that could not be processed. Failed documents include metadata_extraction_error and metadata_extraction_response in their metadata. It typically receives documents from converters such as TextFileToDocument and sends documents with enriched metadata to DocumentSplitter or DocumentWriter.
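
The two output lists can be handled separately downstream. The following is a minimal sketch of inspecting failures, not the component's own code. The `Doc` class is a hypothetical stand-in for a document, the `meta` attribute name is an assumption, and the sample error strings are invented for illustration. Only the `metadata_extraction_error` and `metadata_extraction_response` keys come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus a metadata dict."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize_failures(failed_documents: List[Doc]) -> List[Dict[str, Any]]:
    """Collect the error details attached to each failed document."""
    report = []
    for doc in failed_documents:
        report.append(
            {
                "preview": doc.content[:40],
                "error": doc.meta.get("metadata_extraction_error"),
                "response": doc.meta.get("metadata_extraction_response"),
            }
        )
    return report


# Invented sample data shaped like the two output lists described above.
result = {
    "documents": [Doc("Berlin is the capital of Germany.", {"title": "Berlin"})],
    "failed_documents": [
        Doc(
            "Unparseable page",
            {
                "metadata_extraction_error": "Response is not valid JSON",
                "metadata_extraction_response": "Sorry, I cannot help with that.",
            },
        )
    ],
}

failure_report = summarize_failures(result["failed_documents"])
for entry in failure_report:
    print(entry)
```

A report like this can be logged for debugging, or the failed documents can be sent through extraction again with a revised prompt.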

Usage Example

Using the component in a pipeline

This index uses LLMMetadataExtractor to extract named entities from documents and store them as metadata:

components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  LLMMetadataExtractor:
    type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o-mini
          generation_kwargs:
            max_tokens: 500
            temperature: 0.0
            response_format:
              type: json_object
      prompt: |
        Extract the following metadata from the document and return it as JSON:
        - "title": The title or main topic of the document
        - "entities": A list of named entities (people, organizations, locations) mentioned
        - "summary": A brief one-sentence summary of the content

        Document content:
        {{ document.content }}

        Return only valid JSON with the keys: title, entities, summary
      expected_keys:
        - title
        - entities
        - summary
      raise_on_failure: false
      max_workers: 3

  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE

connections:
  - sender: TextFileToDocument.documents
    receiver: LLMMetadataExtractor.documents
  - sender: LLMMetadataExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - TextFileToDocument.sources

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |
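
To make the page_range format concrete, here is a minimal sketch of how a mixed list of page numbers and range strings could be expanded into individual page numbers. The `expand_pages` helper is hypothetical and written only for illustration. It is not the component's implementation, and details such as duplicate handling or validation may differ in practice.

```python
from typing import List, Union


def expand_pages(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries like 3, '5', or '10-12' into a sorted list of page numbers."""
    pages = set()
    for entry in page_range:
        text = str(entry).strip()
        if "-" in text:
            # A range string such as '10-12' covers both endpoints.
            start_text, end_text = text.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {text!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(text))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers start at 1")
    return sorted(pages)


print(expand_pages(["1-3", "5", "8", "10-12"]))  # [1, 2, 3, 5, 8, 10, 11, 12]
```

Under these assumptions, ['1', '3'] selects only pages 1 and 3, while ['1-3'] selects pages 1 through 3.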

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents that were successfully updated with the extracted metadata. |
| failed_documents | List[Document] | | Documents for which metadata extraction failed. These documents have metadata_extraction_error and metadata_extraction_response in their metadata. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | str | | The prompt to be used for the LLM. Use {{ document.content }} to reference the document content. |
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM. The LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs. |
| expected_keys | Optional[List[str]] | None | The keys expected in the JSON output from the LLM. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. This parameter can be overridden in the run() method. |
| raise_on_failure | bool | False | Whether to raise an error on failure during the execution of the generator or validation of the JSON output. |
| max_workers | int | 3 | The maximum number of workers to use in the thread pool executor. |
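
The interplay of expected_keys and raise_on_failure can be pictured with a small validation routine. This is a sketch under stated assumptions, not the component's implementation: `check_response` is a hypothetical helper, the error messages are invented, and treating a missing key as a failure (rather than, say, only logging a warning) is a choice made here for illustration.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def check_response(
    response: str,
    expected_keys: Optional[List[str]] = None,
    raise_on_failure: bool = False,
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (metadata, None) on success or (None, error message) on failure."""
    error = None
    parsed = None
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError as exc:
        error = f"Response is not valid JSON: {exc}"
    if error is None and not isinstance(parsed, dict):
        error = "Response is valid JSON but not an object"
    if error is None and expected_keys:
        missing = [key for key in expected_keys if key not in parsed]
        if missing:
            error = f"Missing expected keys: {missing}"
    if error is not None:
        if raise_on_failure:
            # Stop immediately instead of reporting the failure to the caller.
            raise ValueError(error)
        return None, error
    return parsed, None


keys = ["title", "entities", "summary"]
good, good_error = check_response(
    '{"title": "T", "entities": [], "summary": "S"}', keys
)
bad, bad_error = check_response('{"title": "T"}', keys)
print(good, good_error)
print(bad, bad_error)
```

With raise_on_failure left at False, a caller can collect the error message alongside the raw response, which mirrors the error details attached to failed documents.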

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |