
DeepsetVLMPDFToDocumentConverter

Convert PDF documents to text using a Vision Language Model (VLM).

Deprecation Notice

This component is deprecated. It continues to work in existing pipelines, but we recommend replacing it with the LLMDocumentContentExtractor component.

Key Features

  • Uses a vision language model (VLM) to convert PDF page screenshots into text.
  • Works well with PDFs that have complex layouts, tables, handwritten text, figures, or mixed image and text content.
  • Supports OpenAI models through the OpenAI API and Anthropic models through Amazon Bedrock.
  • Processes files and pages in parallel for faster conversion.
  • Lets you configure the image detail level and the conversion prompt to control output quality and format.
  • Detects truncated pages and automatically continues extraction where it left off (OpenAI provider only).
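File- and page-level parallelism of this kind can be sketched with Python's standard thread pools. This is an illustrative sketch, not the component's actual implementation; the `convert_page` stand-in and the two worker counts mirror the `max_workers_files` and `max_workers_pages` parameters described below.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a VLM call that converts one page screenshot to text.
def convert_page(file_name: str, page: int) -> str:
    return f"{file_name}-page-{page}"

def convert_file(file_name: str, num_pages: int, max_workers_pages: int = 5) -> list[str]:
    # Pages of a single file are converted in parallel.
    with ThreadPoolExecutor(max_workers=max_workers_pages) as pool:
        return list(pool.map(lambda p: convert_page(file_name, p), range(num_pages)))

def convert_files(files: dict[str, int], max_workers_files: int = 3) -> dict[str, list[str]]:
    # Files are processed in parallel too, each with its own page-level pool.
    with ThreadPoolExecutor(max_workers=max_workers_files) as pool:
        futures = {name: pool.submit(convert_file, name, pages) for name, pages in files.items()}
        return {name: future.result() for name, future in futures.items()}

result = convert_files({"a.pdf": 2, "b.pdf": 1})
```

Page order within each file is preserved because `pool.map` returns results in input order, which matters when the pages are later joined with `page_separator`.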
Costs

This component can incur high costs with OpenAI or Amazon Bedrock if you use it to convert thousands of PDF pages. With OpenAI, one PDF page consumes roughly 1,500 input tokens and produces roughly 800 to 3,000 output tokens.
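Using the rough per-page figures above, you can estimate token usage before starting a large conversion. A minimal sketch (the per-page numbers are the estimates from this page, not exact counts):

```python
# Rough per-page token estimates for OpenAI, taken from the figures above.
INPUT_TOKENS_PER_PAGE = 1_500
OUTPUT_TOKENS_PER_PAGE = (800, 3_000)  # (low, high) estimate

def estimate_tokens(num_pages: int) -> dict:
    low, high = OUTPUT_TOKENS_PER_PAGE
    return {
        "input_tokens": num_pages * INPUT_TOKENS_PER_PAGE,
        "output_tokens": (num_pages * low, num_pages * high),
    }

# Converting 10,000 pages:
estimate = estimate_tokens(10_000)
# input_tokens: 15,000,000; output_tokens: between 8,000,000 and 30,000,000
```

Multiply the results by your model's current per-token prices to get a cost range before committing to a run.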

Configuration

  1. Drag the DeepsetVLMPDFToDocumentConverter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. On the General tab, enter the model name, for example gpt-4o.
  4. Go to the Advanced tab to configure generation parameters, the API key, and device settings.

Connections

DeepsetVLMPDFToDocumentConverter accepts a list of PDF sources (sources) and optional metadata (meta) as input. It outputs a list of Document objects (documents).

Connect FileTypeRouter to the sources input to route PDF files to this converter. Connect the documents output to DocumentJoiner to combine converted documents with those from other converters.

Usage Example

This is an example index where DeepsetVLMPDFToDocumentConverter receives PDFs from FileTypeRouter and sends the converted documents to DocumentJoiner:

```yaml
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters: {}
  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}
  xlsx_converter:
    type: haystack.components.converters.xlsx.XLSXToDocument
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1024
          similarity: cosine
      policy: OVERWRITE
  DeepsetVLMPDFToDocumentConverter:
    type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
    init_parameters:
      vlm_provider: openai
      max_workers_files: 3
      max_workers_pages: 5
      max_retries: 3
      backoff_factor: 2
      initial_backoff_time: 30
      prompt: |-
        Extract the content from the document below.
        You need to extract the content exactly.
        Format everything as markdown.
        Make sure to retain the reading order of the document.

        **Headers and Footers**
        Remove repeating page headers or footers that disrupt the reading order.
        Place letterheads that appear at the side of a document at the top of the page.

        **Images**
        Do not extract images, drawings or maps.
        Instead, add a caption that describes briefly what you see on the image.
        Enclose each image caption with [img-caption][/img-caption].

        **Tables**
        Make sure to format the table in markdown.
        Add a short caption below the table that describes the table's content.
        Enclose each table caption with [table-caption][/table-caption].
        The caption must be placed below the extracted table.

        **Forms**
        Reproduce checkbox selections with markdown.

        Go ahead and extract!

        Document:
      model: gpt-4o
      max_splits_per_page: 3
      detail: auto
      generator_kwargs:
        generation_kwargs:
          temperature: 0
          seed: 0
          max_tokens: 4000
        timeout: 120
      response_extraction_pattern: null
      progress_bar: true
      page_separator: "\f"
connections:
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: writer.documents
  - sender: file_classifier.application/pdf
    receiver: DeepsetVLMPDFToDocumentConverter.sources
  - sender: DeepsetVLMPDFToDocumentConverter.documents
    receiver: joiner.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - file_classifier.sources
```

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] |  | List of PDF sources to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata or list of metadata dictionaries. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] |  | List of converted documents. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vlm_provider | Literal['openai', 'bedrock'] | openai | Type of VLM provider to use ('openai' or 'bedrock'). |
| max_workers_files | int | 3 | Maximum number of threads for processing files. |
| max_workers_pages | int | 5 | Maximum number of threads for processing pages. |
| max_retries | int | 3 | Maximum number of retries for page-level extraction. |
| backoff_factor | float | 2.0 | Factor for exponential backoff between retries. |
| initial_backoff_time | float | 30.0 | Initial backoff time in seconds. |
| prompt | str | Extract the content from this document page. Format everything as markdown to recreate the layout as best as possible. Retain the natural reading order. | Prompt to use for the VLM. |
| openai_api_key | Secret | Secret.from_env_var('OPENAI_API_KEY') | OpenAI API key. |
| model | str | gpt-4o | Model name to use with the generator. |
| max_splits_per_page | int | 3 | Maximum number of splits per page. Applies only when vlm_provider is 'openai'. The component detects when the conversion of a page was truncated because of the maximum number of output tokens and prompts the model to continue the extraction where it left off. Check the maximum number of output tokens for your model in the OpenAI documentation. With 'bedrock' as the provider, the output of a page is truncated if it exceeds the maximum number of output tokens. |
| detail | Literal['auto', 'high', 'low'] | auto | Detail level for image processing ('auto', 'high', 'low'). Pick 'high' for the best results and 'low' for the lowest inference costs. With 'auto', the API automatically adjusts the resolution based on the size of the image input. |
| generator_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for the generator. Check the Generator's documentation to learn about the parameters you can pass. |
| response_extraction_pattern | Optional[str] | None | Regex pattern to extract text from the generator's response. |
| progress_bar | bool | True | Whether to display a progress bar. |
| page_separator | str | \x0c | String to use to separate pages. |
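The retry and post-processing parameters are easiest to understand with a small sketch. The backoff formula and the extraction pattern below are illustrative assumptions, not the component's exact implementation:

```python
import re

# Assumed retry schedule: initial_backoff_time * backoff_factor ** attempt.
def backoff_times(max_retries=3, backoff_factor=2.0, initial_backoff_time=30.0):
    return [initial_backoff_time * backoff_factor**attempt for attempt in range(max_retries)]

# With the defaults, retries would wait 30, 60, then 120 seconds.
waits = backoff_times()

# response_extraction_pattern: a regex applied to the raw model response.
# Here, a hypothetical pattern pulls the text out of <output>...</output> tags.
response = "<output>Page text here.</output>"
match = re.search(r"<output>(.*?)</output>", response, re.DOTALL)
extracted = match.group(1) if match else response

# page_separator: converted pages are joined with this string ("\f" by default),
# so downstream code can split a document's content back into pages.
document_content = "Page 1 text\fPage 2 text"
pages = document_content.split("\f")
```

The form-feed default is convenient because it rarely appears in extracted text, so splitting on it is unambiguous; override `page_separator` if your documents contain literal form feeds.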

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] |  | List of PDF sources to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata or list of metadata dictionaries. |