LLMDocumentContentExtractor

Extract textual content from image-based documents using a vision-enabled LLM (Large Language Model).

Basic Information

Type: haystack.components.extractors.image.LLMDocumentContentExtractor
Components it can connect with:
- Converters: LLMDocumentContentExtractor can receive documents from Converters in an index.
- DocumentSplitter: LLMDocumentContentExtractor can send extracted documents to DocumentSplitter for further processing.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		List of image-based documents to process. Each document must have a valid file path in its metadata.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Successfully processed documents, updated with extracted content.
failed_documents	List[Document]		Documents that failed processing, annotated with failure metadata.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

LLMDocumentContentExtractor converts each input document into an image using the DocumentToImageContent component. It then passes the image, together with a prompt that tells the LLM how to extract content, to a vision-capable ChatGenerator, which returns the structured textual content.

The prompt must not contain Jinja variables. It should only include instructions for the LLM. Image data and the prompt are passed together to the LLM as a chat message.
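
To make the "no Jinja variables" rule concrete, here is a minimal sketch of a pre-flight check you could run on a prompt before putting it into the configuration. The function names and the regular expression are hypothetical and simplified; this is not the component's own validation logic, which may parse the template differently.

```python
import re

# Simplified, hypothetical check: matches Jinja expression and statement
# delimiters such as {{ name }} or {% if x %}. A stricter check might parse
# the prompt with Jinja2 instead of using a regular expression.
_JINJA_PATTERN = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)


def contains_jinja_syntax(prompt: str) -> bool:
    """Return True if the prompt appears to contain Jinja variables or tags."""
    return _JINJA_PATTERN.search(prompt) is not None


def validate_prompt(prompt: str) -> str:
    """Return the prompt unchanged, or raise ValueError if it looks templated."""
    if contains_jinja_syntax(prompt):
        raise ValueError("The prompt must not contain Jinja variables.")
    return prompt
```

A prompt such as "Extract the content from the provided image." passes this check, while "Summarize {{ document.content }}" is rejected.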

Documents for which the LLM fails to extract content are returned in a separate failed_documents list. These failed documents have a content_extraction_error entry in their metadata. You can use this metadata for debugging or for reprocessing the documents later.
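
The following sketch illustrates one way to use the content_extraction_error metadata for debugging and reprocessing. It assumes the component's output has the shape described under Outputs. The Doc class is a hypothetical stand-in for a Haystack Document so that the example runs without any dependencies, and the helper names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (content plus meta)."""
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize_failures(failed_documents: List[Doc], path_field: str = "file_path") -> Dict[str, str]:
    """Map each failed document's file path to its recorded error message."""
    return {
        str(doc.meta.get(path_field)): str(doc.meta.get("content_extraction_error"))
        for doc in failed_documents
    }


def prepare_for_retry(failed_documents: List[Doc]) -> List[Doc]:
    """Copy failed documents without the error entry so they can be resubmitted."""
    retry = []
    for doc in failed_documents:
        meta = {k: v for k, v in doc.meta.items() if k != "content_extraction_error"}
        retry.append(Doc(content=doc.content, meta=meta))
    return retry


# Example output in the shape described under Outputs (values are made up).
result = {
    "documents": [Doc(content="# Invoice", meta={"file_path": "invoice.png"})],
    "failed_documents": [
        Doc(meta={"file_path": "scan.pdf", "content_extraction_error": "timeout"})
    ],
}
failures = summarize_failures(result["failed_documents"])
retry_batch = prepare_for_retry(result["failed_documents"])
```

Here failures maps each file path to its error message for logging, and retry_batch holds clean copies that can be passed to the component again.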

Usage Example

Initializing the Component

components:
  LLMDocumentContentExtractor:
    type: haystack.components.extractors.image.llm_document_content_extractor.LLMDocumentContentExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          model: gpt-4o
      prompt: |
        You are part of an information extraction pipeline that extracts the content of image-based documents.
        
        Extract the content from the provided image.
        You need to extract the content exactly.
        Format everything as markdown.
        Make sure to retain the reading order of the document.
        
        **Visual Elements**
        Do not extract figures, drawings, maps, graphs or any other visual elements.
        Instead, add a caption that describes briefly what you see in the visual element.
        You must describe each visual element.
        If you only see a visual element without other content, you must describe this visual element.
        Enclose each image caption with [img-caption][/img-caption]
        
        **Tables**
        Make sure to format the table in markdown.
        Add a short caption below the table that describes the table's content.
        Enclose each table caption with [table-caption][/table-caption].
        The caption must be placed below the extracted table.
        
        **Forms**
        Reproduce checkbox selections with markdown.
        
        Go ahead and extract!
        
        Document:
      file_path_meta_field: file_path
      detail: high
      size: [1024, 768]
      raise_on_failure: false
      max_workers: 3
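
To show what a setting like size: [1024, 768] means in practice, here is a small sketch of fitting an image inside a bounding box while keeping its aspect ratio. The function fit_within is hypothetical, and the choice not to upscale images that are already small enough is an assumption; the component's actual resizing may round or behave differently.

```python
from typing import Tuple


def fit_within(original: Tuple[int, int], bounds: Tuple[int, int]) -> Tuple[int, int]:
    """Scale (width, height) down to fit inside bounds, keeping aspect ratio.

    Assumption: images that already fit inside the bounds are left unchanged.
    """
    width, height = original
    max_width, max_height = bounds
    # The smaller ratio is the limiting dimension; cap at 1.0 to avoid upscaling.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# A portrait A4 page rendered at 300 DPI is limited by its height.
page_size = fit_within((2480, 3508), (1024, 768))
# A landscape image with the same 4:3 ratio as the bounds fills them exactly.
photo_size = fit_within((2048, 1536), (1024, 768))
```

Note that a tall page ends up much narrower than 1024 pixels, so for portrait documents a taller bounding box may preserve more detail.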

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
chat_generator	ChatGenerator		A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response.
prompt	str	DEFAULT_PROMPT_TEMPLATE	Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document.
file_path_meta_field	str	file_path	The metadata field in the Document that contains the file path to the image or PDF.
root_path	Optional[str]	None	The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
detail	Optional[Literal["auto", "high", "low"]]	None	Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This value is passed to chat_generator when processing the images.
size	Optional[Tuple[int, int]]	None	If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time.
raise_on_failure	bool	False	If True, exceptions from the LLM are raised. If False, failures are logged and the affected documents are returned in failed_documents.
max_workers	int	3	Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor.
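
The interaction between max_workers and raise_on_failure can be sketched as follows. This is an illustrative model of the behavior described in the table, not the component's implementation: extract_all and fake_llm are hypothetical names, and fake_llm stands in for a call to a vision-capable LLM.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def extract_all(
    paths: List[str],
    extract: Callable[[str], str],
    max_workers: int = 3,
    raise_on_failure: bool = False,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` over paths in parallel and return (successes, failures)."""

    def safe_extract(path: str) -> Tuple[str, str, bool]:
        try:
            return path, extract(path), True
        except Exception as error:
            if raise_on_failure:
                raise
            # Record the error message instead of stopping the whole batch.
            return path, str(error), False

    successes: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order and re-raises worker exceptions.
        for path, value, ok in executor.map(safe_extract, paths):
            (successes if ok else failures)[path] = value
    return successes, failures


def fake_llm(path: str) -> str:
    """Hypothetical stand-in for a call to a vision-capable LLM."""
    if path.endswith(".broken"):
        raise RuntimeError("could not read image")
    return f"extracted text from {path}"


ok, failed = extract_all(["a.png", "b.broken", "c.pdf"], fake_llm)
```

With raise_on_failure left at False, one unreadable file does not stop the batch. Setting it to True surfaces the first error immediately, which can be useful while debugging a new prompt or model.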

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		List of image-based documents to process. Each must have a valid file path in its metadata.
