LLMDocumentContentExtractor

Extract textual content from image-based documents using a vision-enabled LLM (Large Language Model).

Basic Information

  • Type: haystack.components.extractors.image.LLMDocumentContentExtractor
  • Components it can connect with:
    • Converters: LLMDocumentContentExtractor can receive documents from Converters in an index.
    • DocumentSplitter: LLMDocumentContentExtractor can send extracted documents to DocumentSplitter for further processing.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of image-based documents to process. Each document must have a valid file path in its metadata. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Successfully processed documents, updated with extracted content. |
| failed_documents | List[Document] | | Documents that failed processing, annotated with failure metadata. |

Overview

LLMDocumentContentExtractor converts each input document into an image using the DocumentToImageContent component. It uses a prompt to instruct the LLM on how to extract content and then processes the image through a vision-capable ChatGenerator to extract structured textual content.

The prompt must only include instructions for the LLM, without any Jinja variables. Image data and the prompt are passed together to the LLM as a chat message.
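Because the prompt is passed to the LLM verbatim rather than rendered as a template, any Jinja syntax in it would be an error. A minimal sketch of a pre-flight check (the `has_jinja_syntax` helper is hypothetical, not part of Haystack):

```python
def has_jinja_syntax(prompt: str) -> bool:
    """Return True if the prompt contains Jinja template markers."""
    # Jinja uses {{ ... }} for variables and {% ... %} for statements.
    return "{{" in prompt or "{%" in prompt

# A plain instruction prompt is fine; a templated one is not.
assert not has_jinja_syntax("Extract all text content from this document image.")
assert has_jinja_syntax("Summarize {{ documents }}")
```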

Documents for which the LLM fails to extract content are returned in a separate failed_documents list. These failed documents have a content_extraction_error entry in their metadata. You can use this metadata for debugging or for reprocessing the documents later.
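The split output makes retry logic straightforward: partition on the `failed_documents` list and inspect each document's `content_extraction_error` entry. A sketch using plain dictionaries in place of Haystack `Document` objects (the error message shown is invented for illustration):

```python
# Hypothetical run() result: two lists, with each failed document carrying
# a "content_extraction_error" entry in its metadata.
result = {
    "documents": [
        {"id": "doc-1", "content": "Extracted text...", "meta": {}},
    ],
    "failed_documents": [
        {"id": "doc-2", "content": None,
         "meta": {"content_extraction_error": "LLM returned an empty response"}},
    ],
}

# Collect failures for logging or a later reprocessing pass.
to_retry = [d for d in result["failed_documents"]
            if "content_extraction_error" in d["meta"]]
for doc in to_retry:
    print(doc["id"], "->", doc["meta"]["content_extraction_error"])
```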

Usage Example

Using the component in a pipeline

This index uses LLMDocumentContentExtractor to extract text from image-based documents (such as scanned PDFs or images) using a vision-enabled LLM:

```yaml
components:
  PDFMinerToDocument:
    type: haystack.components.converters.pdf.PDFMinerToDocument
    init_parameters:

  LLMDocumentContentExtractor:
    type: haystack.components.extractors.image.LLMDocumentContentExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o
          generation_kwargs:
      prompt: "Extract all text content from this document image. Preserve the original structure including headings, paragraphs, lists, and tables. Return only the extracted text without any additional commentary."
      file_path_meta_field: file_path
      detail: auto
      size:
      raise_on_failure: false
      max_workers: 3

  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: WRITE

connections:
  - sender: PDFMinerToDocument.documents
    receiver: LLMDocumentContentExtractor.documents
  - sender: LLMDocumentContentExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - PDFMinerToDocument.sources
```

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response. |
| prompt | str | DEFAULT_PROMPT_TEMPLATE | Instructional text provided to the LLM. It must not contain Jinja variables and should only contain instructions on how to extract the content of the image-based document. |
| file_path_meta_field | str | file_path | The metadata field in the Document that contains the file path to the image or PDF. |
| root_path | Optional[str] | None | The root directory path where document files are located. If provided, file paths in document metadata are resolved relative to this path. If None, file paths are treated as absolute paths. |
| detail | Optional[Literal["auto", "high", "low"]] | None | Optional detail level of the image (only supported by OpenAI). Passed to chat_generator when processing the images. |
| size | Optional[Tuple[int, int]] | None | If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time. |
| raise_on_failure | bool | False | If True, exceptions from the LLM are raised. If False, failed documents are logged and returned. |
| max_workers | int | 3 | Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor. |
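The `size` parameter fits the image inside the given bounds without distorting it. The exact resizing rule is internal to the component, but the fit-within-while-preserving-aspect-ratio behavior described above can be sketched as (the `fit_within` helper is hypothetical):

```python
def fit_within(width: int, height: int, max_w: int, max_h: int) -> tuple[int, int]:
    """Scale (width, height) to fit inside (max_w, max_h), keeping aspect ratio."""
    # Use the smaller scale factor so both dimensions fit; never upscale.
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

# A 3000x2000 scan constrained to 800x800 becomes 800x533.
print(fit_within(3000, 2000, 800, 800))  # (800, 533)
```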

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of image-based documents to process. Each must have a valid file path in its metadata. |