LLMMetadataExtractor
Extract metadata from documents using a Large Language Model (LLM).
Basic Information
- Type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
- Components it can connect with:
  - Converters: LLMMetadataExtractor can receive documents from converters in an index.
  - DocumentSplitter: LLMMetadataExtractor can send documents with extracted metadata to DocumentSplitter for further processing.
  - DocumentWriter: LLMMetadataExtractor can send documents with extracted metadata to DocumentWriter.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |
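To make the page_range format concrete, here is a minimal standalone sketch of how such a list could be expanded into page numbers. It is an illustration only, not the component's own parsing: the helper name expand_pages is hypothetical, and the sketch assumes 1-based page numbers and inclusive ranges.

```python
from typing import List, Union


def expand_pages(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries such as '1-3' or 5 into a flat list of page numbers."""
    pages: List[int] = []
    for item in page_range:
        text = str(item).strip()
        if "-" in text:
            # Assumption: a range like '10-12' includes both endpoints.
            start, end = (int(part) for part in text.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(text))
    return pages


print(expand_pages(["1", "3"]))                 # [1, 3]
print(expand_pages(["1-3", "5", "8", "10-12"]))  # [1, 2, 3, 5, 8, 10, 11, 12]
```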
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents that were successfully updated with the extracted metadata. |
| failed_documents | List[Document] | | Documents for which metadata extraction failed. These documents have metadata_extraction_error and metadata_extraction_response in their metadata. |
Overview
LLMMetadataExtractor extracts metadata from documents using a Large Language Model (LLM). The component expects a ChatGenerator, which represents the LLM, and a prompt that tells the LLM how to extract metadata from a document.
The prompt must reference a variable called document, which is bound to one document from the list at a time. To access the content of the document, use {{ document.content }} in the prompt.
The component runs the LLM on each document in the list and adds the extracted metadata to the document's metadata field. If the LLM fails to extract metadata from a document, the document is added to the failed_documents list. Failed documents have the keys metadata_extraction_error and metadata_extraction_response in their metadata. To retry them, send the failed documents to another extractor whose prompt references metadata_extraction_response and metadata_extraction_error.
The LLM must be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs.
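The routing into documents and failed_documents described above can be sketched in plain Python. This is a sketch under stated assumptions, not the component's implementation: documents are plain dicts rather than Document objects, route_reply and missing_keys are hypothetical helper names, and missing expected keys are only reported here because the text does not say whether the component treats them as a failure.

```python
import json
from typing import Any, Dict, List, Tuple


def route_reply(doc: Dict[str, Any], reply: str) -> Tuple[str, Dict[str, Any]]:
    """Return the output name and the updated document for one LLM reply."""
    meta = dict(doc.get("meta", {}))
    try:
        parsed = json.loads(reply)
        if not isinstance(parsed, dict):
            raise ValueError("reply is valid JSON but not a JSON object")
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        # Keep the error and the raw reply so a later extractor can use them.
        meta["metadata_extraction_error"] = str(err)
        meta["metadata_extraction_response"] = reply
        return "failed_documents", {**doc, "meta": meta}
    meta.update(parsed)
    return "documents", {**doc, "meta": meta}


def missing_keys(meta: Dict[str, Any], expected_keys: List[str]) -> List[str]:
    """List expected keys absent from the metadata (assumption: report only)."""
    return [key for key in expected_keys if key not in meta]


doc = {"content": "Berlin is the capital of Germany.", "meta": {}}

name, updated = route_reply(doc, '{"title": "Berlin", "entities": ["Berlin", "Germany"]}')
print(name, missing_keys(updated["meta"], ["title", "entities", "summary"]))
# documents ['summary']

name, failed = route_reply(doc, "Sorry, I cannot help with that.")
print(name, sorted(failed["meta"]))
# failed_documents ['metadata_extraction_error', 'metadata_extraction_response']
```

Storing the raw reply next to the error is what makes a second pass possible: a follow-up prompt can show the model its earlier output and the reason it was rejected.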
Usage Example
Using the component in a pipeline
This index uses LLMMetadataExtractor to extract named entities from documents and store them as metadata:
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  LLMMetadataExtractor:
    type: haystack.components.extractors.llm_metadata_extractor.LLMMetadataExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o-mini
          generation_kwargs:
            max_tokens: 500
            temperature: 0.0
            response_format:
              type: json_object
      prompt: |
        Extract the following metadata from the document and return it as JSON:
        - "title": The title or main topic of the document
        - "entities": A list of named entities (people, organizations, locations) mentioned
        - "summary": A brief one-sentence summary of the content
        Document content:
        {{ document.content }}
        Return only valid JSON with the keys: title, entities, summary
      expected_keys:
        - title
        - entities
        - summary
      raise_on_failure: false
      max_workers: 3
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: OVERWRITE
connections:
  - sender: TextFileToDocument.documents
    receiver: LLMMetadataExtractor.documents
  - sender: LLMMetadataExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Init parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | | The prompt to be used for the LLM. Use {{ document.content }} to reference the document content. |
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM. The LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in the generation_kwargs. |
| expected_keys | Optional[List[str]] | None | The keys expected in the JSON output from the LLM. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. This parameter can be overridden in the run() method. |
| raise_on_failure | bool | False | Whether to raise an error on failure during the execution of the generator or validation of the JSON output. |
| max_workers | int | 3 | The maximum number of workers to use in the thread pool executor. |
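The max_workers parameter caps the size of a thread pool. As a rough illustration of what that means for per-document LLM calls, the sketch below fans out a stand-in function over a list of document contents. The names fake_llm_call and run_concurrently are hypothetical, and no real generator is called; how the component schedules its work internally may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_llm_call(content: str) -> str:
    """Hypothetical stand-in for a chat generator call; returns a JSON string."""
    return '{"length": %d}' % len(content)


def run_concurrently(contents: List[str], max_workers: int = 3) -> List[str]:
    """Process every content string with at most max_workers threads at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, whatever the completion order.
        return list(executor.map(fake_llm_call, contents))


print(run_concurrently(["alpha", "be", "gamma!"]))
# ['{"length": 5}', '{"length": 2}', '{"length": 6}']
```

Because LLM calls are mostly network-bound, threads are usually enough to overlap them. A higher max_workers can shorten indexing time but also makes it easier to hit provider rate limits.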
Run Method Parameters
These are the parameters you can configure for the run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents to extract metadata from. |
| page_range | Optional[List[Union[str, int]]] | None | A range of pages to extract metadata from. For example, page_range=['1', '3'] extracts metadata from the first and third pages of each document. It also accepts printable range strings, such as ['1-3', '5', '8', '10-12']. If None, metadata is extracted from the entire document. |