LLMMetadataExtractor

Extract metadata from documents using a Large Language Model (LLM).

Basic Information

  • Type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
  • Components it can connect with:
    • Converters: LLMMetadataExtractor can receive documents from converters in an index.
    • DocumentSplitter: LLMMetadataExtractor can send documents with extracted metadata to DocumentSplitter for further processing.
    • DocumentWriter: LLMMetadataExtractor can send documents with extracted metadata to DocumentWriter.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |
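
The page_range syntax can be illustrated with a minimal sketch. The helpers expand_pages and select_pages are hypothetical names used only for illustration, not part of the component's API. The sketch also assumes that pages are separated by form feed characters ("\f"), which this page does not state.

```python
from typing import List, Union


def expand_pages(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries such as 3, '5', or '10-12' into a sorted list of page numbers."""
    pages = set()
    for entry in page_range:
        text = str(entry).strip()
        if "-" in text:
            # A range string such as '10-12' covers both endpoints.
            start, end = (int(part) for part in text.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {text!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(text))
    return sorted(pages)


def select_pages(content: str, page_range: List[Union[str, int]]) -> str:
    """Keep only the requested 1-based pages (assumption: form feeds separate pages)."""
    pages = content.split("\f")
    wanted = expand_pages(page_range)
    # Page numbers beyond the end of the document are skipped.
    return "\f".join(pages[n - 1] for n in wanted if 1 <= n <= len(pages))
```

For example, expand_pages(['1', '3']) selects the first and third pages only, while a range string such as '1-3' also includes the second page.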

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents that were successfully updated with the extracted metadata. |
| failed_documents | List[Document] | | Documents for which metadata extraction failed. These documents have metadata_extraction_error and metadata_extraction_response in their metadata. |

Overview

LLMMetadataExtractor extracts metadata from documents using a Large Language Model (LLM). The component requires two things: a ChatGenerator that wraps the LLM, and a prompt that tells the LLM which metadata to extract from the document.

The prompt should have a variable called document that points to a single document in the list of documents. To access the content of the document, use {{ document.content }} in the prompt.
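
As a rough illustration of how the document variable is filled in once per document, here is a minimal sketch. It uses a regular expression as a stand-in for real template rendering, so it handles only the {{ document.content }} placeholder. PROMPT, render_prompt, and render_all are hypothetical names, and documents are modeled as plain dictionaries rather than Document objects.

```python
import re
from typing import Dict, List

# Hypothetical prompt; only the {{ document.content }} placeholder comes from the docs.
PROMPT = 'Return the title of this text as JSON with the key "title":\n{{ document.content }}'

# Matches the placeholder with any amount of whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*document\.content\s*\}\}")


def render_prompt(prompt: str, document: Dict[str, str]) -> str:
    """Fill the placeholder with one document's content (stand-in for real templating)."""
    # A function replacement keeps backslashes in the content from being interpreted.
    return PLACEHOLDER.sub(lambda _match: document["content"], prompt)


def render_all(prompt: str, documents: List[Dict[str, str]]) -> List[str]:
    """Render one prompt per document, in input order."""
    return [render_prompt(prompt, doc) for doc in documents]
```

Each rendered prompt contains the content of exactly one document, which is why the prompt refers to document rather than to the whole list.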

The component runs the LLM on each document in the list and adds the extracted metadata to the document's metadata field. If the LLM fails to extract metadata from a document, the document is added to the failed_documents list instead. Failed documents have the keys metadata_extraction_error and metadata_extraction_response in their metadata. You can re-run these documents with another extractor and reference metadata_extraction_response and metadata_extraction_error in its prompt.
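
The split between successful and failed documents can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: documents are modeled as plain dictionaries with content and meta keys, route_by_response is a hypothetical helper, and the only failure it models is a reply that is not a JSON object.

```python
import json
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]  # stand-in for a Document: {"content": str, "meta": dict}


def route_by_response(documents: List[Doc], responses: List[str]) -> Tuple[List[Doc], List[Doc]]:
    """Merge each LLM reply into its document, or record why the merge failed."""
    updated, failed = [], []
    for doc, response in zip(documents, responses):
        try:
            extracted = json.loads(response)
            if not isinstance(extracted, dict):
                raise ValueError("LLM reply is valid JSON but not a JSON object")
        except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
            meta = {
                **doc["meta"],
                "metadata_extraction_error": str(error),
                "metadata_extraction_response": response,
            }
            failed.append({**doc, "meta": meta})
        else:
            # Extracted keys are added to the document's existing metadata.
            updated.append({**doc, "meta": {**doc["meta"], **extracted}})
    return updated, failed
```

Keeping the raw reply next to the error message is what makes a second pass possible: a follow-up prompt can show the LLM its earlier output and the reason it was rejected.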

The LLM must be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs.

Usage Example

Using the component in a pipeline

This index uses LLMMetadataExtractor to extract named entities from documents and store them as metadata:

components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  LLMMetadataExtractor:
    type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o-mini
          generation_kwargs:
            max_tokens: 500
            temperature: 0.0
            response_format:
              type: json_object
      prompt: |
        Extract the following metadata from the document and return it as JSON:
        - "title": The title or main topic of the document
        - "entities": A list of named entities (people, organizations, locations) mentioned
        - "summary": A brief one-sentence summary of the content

        Document content:
        {{ document.content }}

        Return only valid JSON with the keys: title, entities, summary
      expected_keys:
        - title
        - entities
        - summary
      raise_on_failure: false
      max_workers: 3

  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE

connections:
  - sender: TextFileToDocument.documents
    receiver: LLMMetadataExtractor.documents
  - sender: LLMMetadataExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - TextFileToDocument.sources

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | str | | The prompt to be used for the LLM. Use {{ document.content }} to reference the document content. |
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM. The LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs. |
| expected_keys | Optional[List[str]] | None | The keys expected in the JSON output from the LLM. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. This parameter can be overridden in the run() method. |
| raise_on_failure | bool | False | Whether to raise an error on failure during the execution of the generator or validation of the JSON output. |
| max_workers | int | 3 | The maximum number of workers to use in the thread pool executor. |
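
To make max_workers and expected_keys more concrete, here is a minimal sketch built on Python's standard thread pool. It is not the component's code: run_concurrently and missing_keys are hypothetical helpers, and call_llm stands for any function that takes a prompt and returns the raw LLM reply. What the component does when expected keys are missing is not described on this page, so the sketch only reports them.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(prompts: List[str], call_llm: Callable[[str], str], max_workers: int = 3) -> List[str]:
    """Send prompts to the LLM from a thread pool with at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the prompts.
        return list(executor.map(call_llm, prompts))


def missing_keys(response: str, expected_keys: List[str]) -> List[str]:
    """List the expected keys that are absent from a JSON object reply."""
    parsed = json.loads(response)
    return [key for key in expected_keys if key not in parsed]
```

A larger max_workers value sends more requests to the LLM at the same time, which can shorten indexing but makes it easier to hit the provider's rate limits.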

Run Method Parameters

These are the parameters you can configure for the run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |