LLMDocumentContentExtractor
Extract textual content from image-based documents using a vision-enabled LLM.
Key Features
- Converts image-based documents (such as scanned PDFs or images) to text using a vision LLM.
- Works with any vision-capable ChatGenerator.
- Processes documents in parallel using a configurable thread pool.
- Returns failed documents separately with content_extraction_error metadata for debugging or reprocessing.
- Supports optional image resizing to reduce memory usage and processing time.
- Configurable detail level for image processing (for OpenAI models).
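The parallel processing and failure handling listed above can be pictured with a small standard-library sketch. It is not the component's implementation: `extract_text` is a hypothetical stand-in for the vision LLM call, and documents are modeled as plain dictionaries rather than Haystack `Document` objects.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(doc: dict) -> str:
    """Hypothetical stand-in for the vision LLM call (not a Haystack API)."""
    if not doc["meta"].get("file_path"):
        raise ValueError("missing file_path")
    return f"text extracted from {doc['meta']['file_path']}"


def _safe_extract(doc: dict):
    # Capture the exception instead of raising, so one bad document
    # does not abort the whole batch.
    try:
        return extract_text(doc), None
    except Exception as exc:
        return None, str(exc)


def process(documents: list, max_workers: int = 3) -> dict:
    # pool.map returns results in the same order as the input documents.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(_safe_extract, documents))

    succeeded, failed = [], []
    for doc, (text, error) in zip(documents, results):
        if error is None:
            succeeded.append({**doc, "content": text})
        else:
            meta = {**doc["meta"], "content_extraction_error": error}
            failed.append({**doc, "meta": meta})
    return {"documents": succeeded, "failed_documents": failed}


docs = [
    {"content": None, "meta": {"file_path": "scan_1.pdf"}},
    {"content": None, "meta": {}},  # no file path, so extraction fails
]
result = process(docs)
```

In this sketch, the second document ends up in `failed_documents` with the error message stored under `content_extraction_error`, mirroring the behavior the feature list describes.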
Configuration
- Drag the LLMDocumentContentExtractor component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Select or configure the vision-capable chat generator (LLM) to use for extraction.
- Go to the Advanced tab to configure additional settings:
  - Enter the prompt with instructions for the LLM on how to extract content from the document image. The prompt must not contain Jinja variables.
  - Set the file_path_meta_field to specify which metadata field contains the file path.
  - Optionally set the image detail level (auto, high, or low) if using an OpenAI model.
  - Optionally configure size to resize images before processing.
  - Set raise_on_failure and max_workers as needed.
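Because the prompt must not contain Jinja variables, it can help to check a prompt before pasting it into the configuration panel. The following is a rough heuristic only, assuming that any Jinja expression, statement, or comment opener should be rejected. `has_jinja_syntax` is a hypothetical helper and does not reflect how the component itself validates the prompt.

```python
import re

# Matches the openers of Jinja expressions ({{), statements ({%), and comments ({#).
_JINJA_MARKERS = re.compile(r"\{\{|\{%|\{#")


def has_jinja_syntax(prompt: str) -> bool:
    """Return True if the prompt appears to contain Jinja template syntax."""
    return _JINJA_MARKERS.search(prompt) is not None


plain = "Extract all text content from this document image."
templated = "Extract the text from {{ document.content }}."

print(has_jinja_syntax(plain))      # plain instructions pass the check
print(has_jinja_syntax(templated))  # template variables are flagged
```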
Connections
LLMDocumentContentExtractor receives a list of image-based documents as input, typically from a converter such as PDFMinerToDocument. Each document must have a valid file path in its metadata.
It outputs two lists: documents contains successfully processed documents updated with extracted text content, and failed_documents contains documents that could not be processed. Connect documents to DocumentSplitter or DocumentWriter for further processing.
Source Code
To check this component's source code, open llm_document_content_extractor.py in the Haystack repository.
Usage Examples
Basic Configuration
LLMDocumentContentExtractor:
  type: haystack.components.extractors.image.LLMDocumentContentExtractor
  init_parameters:
    chat_generator:
      type: haystack.components.generators.chat.openai.OpenAIChatGenerator
      init_parameters:
        api_key:
          type: env_var
          env_vars:
            - OPENAI_API_KEY
          strict: true
        model: gpt-4o
    prompt: "Extract all text content from this document image. Preserve the original structure including headings, paragraphs, lists, and tables. Return only the extracted text without any additional commentary."
    file_path_meta_field: file_path
    detail: auto
    raise_on_failure: false
    max_workers: 3
Using the component in a pipeline
This index uses LLMDocumentContentExtractor to extract text from image-based documents (such as scanned PDFs or images) using a vision-enabled LLM:
# haystack-pipeline
components:
  PDFMinerToDocument:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters: {}
  LLMDocumentContentExtractor:
    type: haystack.components.extractors.image.LLMDocumentContentExtractor
    init_parameters:
      chat_generator:
        type: haystack.components.generators.chat.openai.OpenAIChatGenerator
        init_parameters:
          api_key:
            type: env_var
            env_vars:
              - OPENAI_API_KEY
            strict: true
          model: gpt-4o
          generation_kwargs:
      prompt: "Extract all text content from this document image. Preserve the original structure including headings, paragraphs, lists, and tables. Return only the extracted text without any additional commentary."
      file_path_meta_field: file_path
      detail: auto
      size:
      raise_on_failure: false
      max_workers: 3
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: WRITE
connections:
  - sender: PDFMinerToDocument.documents
    receiver: LLMDocumentContentExtractor.documents
  - sender: LLMDocumentContentExtractor.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - PDFMinerToDocument.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of image-based documents to process. Each document must have a valid file path in its metadata. |
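To illustrate what "a valid file path in its metadata" involves, here is a minimal sketch of how a path could be looked up and resolved. It assumes metadata is a plain dictionary and that a relative path is joined onto an optional root directory. `resolve_file_path` is a hypothetical helper, not part of Haystack, and it does not check that the file exists.

```python
from pathlib import Path
from typing import Optional


def resolve_file_path(
    meta: dict,
    file_path_meta_field: str = "file_path",
    root_path: Optional[str] = None,
) -> Optional[Path]:
    """Return the path stored in the metadata, or None if the field is missing or empty."""
    value = meta.get(file_path_meta_field)
    if not value:
        return None
    # Without a root path, the stored value is used as-is.
    return Path(root_path or "") / value


with_root = resolve_file_path({"file_path": "invoice.png"}, root_path="/data/scans")
without_root = resolve_file_path({"file_path": "/data/scans/invoice.png"})
missing = resolve_file_path({"source": "upload"})
```

A document whose metadata lacks the configured field (the `missing` case here) has no path to resolve, so it is the kind of input that would be expected to fail extraction.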
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Successfully processed documents, updated with extracted content. |
| failed_documents | List[Document] | Documents that failed processing, annotated with failure metadata. |
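One way to act on the failed_documents output is to group failures by error message before deciding what to reprocess. The sketch below assumes each failed document carries a `content_extraction_error` entry in its metadata, as described in Key Features. The dictionary-based documents and the `group_failures` helper are illustrative, not part of Haystack.

```python
from collections import defaultdict


def group_failures(failed_documents: list) -> dict:
    """Map each error message to the file paths of the documents that hit it."""
    grouped = defaultdict(list)
    for doc in failed_documents:
        meta = doc["meta"]
        error = meta.get("content_extraction_error", "unknown error")
        grouped[error].append(meta.get("file_path"))
    return dict(grouped)


failed = [
    {"meta": {"file_path": "a.pdf", "content_extraction_error": "timeout"}},
    {"meta": {"file_path": "b.png", "content_extraction_error": "timeout"}},
    {"meta": {"file_path": "c.pdf", "content_extraction_error": "file not found"}},
]
report = group_failures(failed)
```

Grouping this way separates transient errors that are worth retrying from problems with the input files themselves.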
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| chat_generator | ChatGenerator | | A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response. |
| prompt | str | DEFAULT_PROMPT_TEMPLATE | Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document. |
| file_path_meta_field | str | file_path | The metadata field in the Document that contains the file path to the image or PDF. |
| root_path | Optional[str] | None | The root directory path where document files are located. If provided, file paths in document metadata are resolved relative to this path. If None, file paths are treated as absolute paths. |
| detail | Optional[Literal["auto", "high", "low"]] | None | Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This value is passed to chat_generator when processing the images. |
| size | Optional[Tuple[int, int]] | None | If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time. |
| raise_on_failure | bool | False | If True, exceptions from the LLM are raised. If False, failed documents are logged and returned. |
| max_workers | int | 3 | Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor. |
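The effect of the size parameter can be worked through with simple arithmetic. This sketch computes the dimensions an image would have after being scaled to fit inside a (width, height) box while keeping its aspect ratio. It assumes images smaller than the box are left unchanged, which is an assumption rather than documented behavior, and `fit_within` is a hypothetical helper.

```python
def fit_within(original: tuple, size: tuple) -> tuple:
    """Scale (width, height) to fit inside size while keeping the aspect ratio."""
    width, height = original
    max_width, max_height = size
    # Use the tighter of the two constraints. Capping at 1.0 means the image
    # is never upscaled (an assumption made for this sketch).
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page scanned at 300 DPI is about 2480 x 3508 pixels.
print(fit_within((2480, 3508), (1024, 1024)))  # (724, 1024)
print(fit_within((800, 600), (1024, 1024)))    # (800, 600)
```

Height is the limiting dimension for the scanned page, so it is scaled to 1024 pixels and the width follows proportionally.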
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of image-based documents to process. Each must have a valid file path in its metadata. |