DeepsetVLMPDFToDocumentConverter
Convert PDF documents to text using a Vision Language Model (VLM).
This component is deprecated. It will continue to work in your existing pipelines. You can replace it with the LLMDocumentContentExtractor component.
DeepsetVLMPDFToDocumentConverter uses a vision language model (VLM) to convert a screenshot of each PDF page into text based on your prompt. Use this converter with PDF files that have:
- complex layouts
- a mix of images and text
- tables
- handwritten text
- figures
Through prompting, you can convert tables, images, or figures into a textual representation, which can be useful for retrieval or for passing the resulting text to an LLM.
It also extracts text in a natural reading order from PDF documents with complex layouts, so you don't need to implement custom post-processing code.
This component can cause high costs with OpenAI or Amazon Bedrock if you use it to convert thousands of PDF pages. For OpenAI, one PDF page uses roughly 1,500 input tokens and produces roughly 800 to 3,000 output tokens.
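To get a feel for the scale, the rough per-page figures above can be turned into a quick token budget. This is a back-of-the-envelope sketch only: the function name `estimate_tokens` is made up for illustration, and actual usage depends on your PDFs, prompt, and model. Multiply the results by your provider's current token prices to get a cost estimate.

```python
# Rough per-page token figures quoted above for OpenAI models.
INPUT_TOKENS_PER_PAGE = 1_500
OUTPUT_TOKENS_PER_PAGE_RANGE = (800, 3_000)


def estimate_tokens(num_pages: int) -> dict:
    """Return a rough token budget for converting `num_pages` PDF pages.

    Hypothetical helper for illustration; not part of the component.
    """
    low, high = OUTPUT_TOKENS_PER_PAGE_RANGE
    return {
        "input_tokens": num_pages * INPUT_TOKENS_PER_PAGE,
        "output_tokens_min": num_pages * low,
        "output_tokens_max": num_pages * high,
    }


estimate = estimate_tokens(10_000)
print(estimate)
# {'input_tokens': 15000000, 'output_tokens_min': 8000000, 'output_tokens_max': 30000000}
```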
Key Features
- Uses a VLM to accurately convert complex PDF layouts, including tables, images, figures, and handwritten text.
- Supports OpenAI models via the OpenAI API and Anthropic models via Amazon Bedrock.
- Processes PDFs in parallel across files and pages for efficiency.
- Fully customizable through prompt configuration.
- Configurable detail level for image processing.
- Supports retry logic with exponential backoff.
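To illustrate the retry behavior, here is a minimal sketch of what an exponential backoff schedule could look like with the default settings. It assumes the wait before each retry is `initial_backoff_time * backoff_factor ** attempt`; the component's exact formula is not documented here, so treat the numbers as an approximation. The helper `backoff_schedule` is hypothetical.

```python
from typing import List


def backoff_schedule(
    max_retries: int = 3,
    initial_backoff_time: float = 30.0,
    backoff_factor: float = 2.0,
) -> List[float]:
    """Return the assumed wait time in seconds before each retry.

    Assumption: the delay grows as initial_backoff_time * backoff_factor ** attempt.
    """
    return [initial_backoff_time * backoff_factor**attempt for attempt in range(max_retries)]


print(backoff_schedule())
# [30.0, 60.0, 120.0]
```

Under this assumption, a page that fails on every attempt would wait about 210 seconds in total before the converter gives up, which is worth keeping in mind when you raise `max_retries`.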
Configuration
- Drag the DeepsetVLMPDFToDocumentConverter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Select the VLM Provider: openai or bedrock.
  - Enter the Model name (for example, gpt-4o).
  - Edit the Prompt to control how the VLM extracts content.
- Go to the Advanced tab to configure max_workers_files, max_workers_pages, max_retries, backoff_factor, initial_backoff_time, detail, generator_kwargs, response_extraction_pattern, max_splits_per_page, progress_bar, and page_separator.
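The two worker settings control parallelism at different levels. The sketch below shows one way file-level and page-level thread pools could be nested. It is an assumption about the general pattern, not the component's actual implementation, and `convert_page`, `convert_file`, and `convert_files` are hypothetical names with a placeholder instead of a real VLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def convert_page(file_name: str, page_number: int) -> str:
    # Placeholder for the VLM request that converts one page screenshot to text.
    return f"{file_name} page {page_number}"


def convert_file(file_name: str, num_pages: int, max_workers_pages: int) -> List[str]:
    # Pages of a single file are processed in parallel; map() keeps page order.
    with ThreadPoolExecutor(max_workers=max_workers_pages) as pool:
        return list(pool.map(lambda n: convert_page(file_name, n), range(1, num_pages + 1)))


def convert_files(
    files: Dict[str, int], max_workers_files: int = 3, max_workers_pages: int = 5
) -> Dict[str, List[str]]:
    # Files are processed in parallel, each with its own page-level pool.
    with ThreadPoolExecutor(max_workers=max_workers_files) as pool:
        futures = {
            name: pool.submit(convert_file, name, pages, max_workers_pages)
            for name, pages in files.items()
        }
        return {name: future.result() for name, future in futures.items()}


results = convert_files({"a.pdf": 2, "b.pdf": 3})
print(results["a.pdf"])
# ['a.pdf page 1', 'a.pdf page 2']
```

If the pools are nested like this, up to `max_workers_files * max_workers_pages` page requests (15 with the defaults) could be in flight at once, so lower these values if you run into provider rate limits.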
Connections
DeepsetVLMPDFToDocumentConverter receives PDF sources from FileTypeRouter through its sources input. It outputs converted documents through its documents output, which you typically connect to DocumentJoiner or a preprocessor for further processing.
Usage Examples
Basic Configuration
DeepsetVLMPDFToDocumentConverter:
  type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
  init_parameters:
    vlm_provider: openai
    max_workers_files: 3
    max_workers_pages: 5
    max_retries: 3
    backoff_factor: 2
    initial_backoff_time: 30
    prompt: |-
      Extract the content from the document below.
      You need to extract the content exactly.
      Format everything as markdown.
      Make sure to retain the reading order of the document.
      **Headers- and Footers**
      Remove repeating page headers or footers that disrupt the reading order.
      Place letter heads that appear at the side of a document at the top of the page.
      **Images**
      Do not extract images, drawings or maps.
      Instead, add a caption that describes briefly what you see on the image.
      Enclose each image caption with [img-caption][/img-caption]
      **Tables**
      Make sure to format the table in markdown.
      Add a short caption below the table that describes the table's content.
      Enclose each table caption with [table-caption][/table-caption].
      The caption must be placed below the extracted table.
      **Forms**
      Reproduce checkbox selections with markdown.
      Go ahead and extract!
      Document:
    model: gpt-4o
    max_splits_per_page: 3
    detail: auto
    generator_kwargs:
      generation_kwargs:
        temperature: 0
        seed: 0
        max_tokens: 4000
      timeout: 120
    progress_bar: true
    page_separator: "\f"
This is an example index, where DeepsetVLMPDFToDocumentConverter receives PDFs from FileTypeRouter and then sends the converted files to DocumentJoiner:
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters: {}
  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}
  xlsx_converter:
    type: haystack.components.converters.XLSXToDocument
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1024
          similarity: cosine
      policy: OVERWRITE
  DeepsetVLMPDFToDocumentConverter:
    type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
    init_parameters:
      vlm_provider: openai
      max_workers_files: 3
      max_workers_pages: 5
      max_retries: 3
      backoff_factor: 2
      initial_backoff_time: 30
      prompt: |-
        Extract the content from the document below.
        You need to extract the content exactly.
        Format everything as markdown.
        Make sure to retain the reading order of the document.
        **Headers- and Footers**
        Remove repeating page headers or footers that disrupt the reading order.
        Place letter heads that appear at the side of a document at the top of the page.
        **Images**
        Do not extract images, drawings or maps.
        Instead, add a caption that describes briefly what you see on the image.
        Enclose each image caption with [img-caption][/img-caption]
        **Tables**
        Make sure to format the table in markdown.
        Add a short caption below the table that describes the table's content.
        Enclose each table caption with [table-caption][/table-caption].
        The caption must be placed below the extracted table.
        **Forms**
        Reproduce checkbox selections with markdown.
        Go ahead and extract!
        Document:
      model: gpt-4o
      max_splits_per_page: 3
      detail: auto
      generator_kwargs:
        generation_kwargs:
          temperature: 0
          seed: 0
          max_tokens: 4000
        timeout: 120
      response_extraction_pattern: null
      progress_bar: true
      page_separator: "\f"
connections:
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: writer.documents
  - sender: file_classifier.application/pdf
    receiver: DeepsetVLMPDFToDocumentConverter.sources
  - sender: DeepsetVLMPDFToDocumentConverter.documents
    receiver: joiner.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - file_classifier.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | List of PDF sources to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | Optional metadata or list of metadata dictionaries. |
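Because meta accepts either a single dictionary or a list of dictionaries, it helps to see how the two forms could map onto the sources. The sketch below follows the convention commonly used by Haystack converters (one dictionary applies to every source, a list must match the sources one to one). That this component behaves the same way is an assumption, and `normalize_meta` is a hypothetical stand-in written for illustration.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]], sources_count: int
) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source (assumed convention)."""
    if meta is None:
        # No metadata given: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied to every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return list(meta)


print(normalize_meta({"department": "legal"}, sources_count=2))
# [{'department': 'legal'}, {'department': 'legal'}]
```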
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of converted documents. |
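If you need page-level text downstream, the page separator gives you a way to recover it. The sketch below assumes the content of a converted document is the page texts joined with the configured `page_separator` (the form feed character by default). The sample content and the helper `split_pages` are made up for illustration.

```python
from typing import List

# Default separator: the form feed character, written "\f" in YAML and "\x0c" in Python reprs.
PAGE_SEPARATOR = "\f"


def split_pages(content: str, page_separator: str = PAGE_SEPARATOR) -> List[str]:
    """Split converted document content back into per-page text."""
    return content.split(page_separator)


# Hypothetical content of one converted document with three pages.
content = "# Page one\fSecond page text\fThird page"
pages = split_pages(content)

for number, page in enumerate(pages, start=1):
    print(number, page)
```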
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| vlm_provider | Literal['openai', 'bedrock'] | openai | Type of VLM provider to use (openai or bedrock). |
| max_workers_files | int | 3 | Maximum number of threads for processing files. |
| max_workers_pages | int | 5 | Maximum number of threads for processing pages. |
| max_retries | int | 3 | Maximum number of retries for page-level extraction. |
| backoff_factor | float | 2.0 | Factor for exponential backoff between retries. |
| initial_backoff_time | float | 30.0 | Initial backoff time in seconds. |
| prompt | str | Extract the content from this document page. Format everything as markdown to recreate the layout as best as possible. Retain the natural reading order. | Prompt to use for the VLM. |
| openai_api_key | Secret | Secret.from_env_var('OPENAI_API_KEY') | OpenAI API key. |
| model | str | gpt-4o | Model name to use with the generator. |
| max_splits_per_page | int | 3 | Maximum number of splits per page. Only applies when vlm_provider is openai. |
| detail | Literal['auto', 'high', 'low'] | auto | Detail level for image processing. Choose high for the best results and low for the lowest inference costs. |
| generator_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for the generator. Check the generator's documentation to learn about the parameters that you can pass. |
| response_extraction_pattern | Optional[str] | None | Regex pattern to extract text from the generator's response. |
| progress_bar | bool | True | Whether to display a progress bar. |
| page_separator | str | \x0c | String used to separate pages. |
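To show what response_extraction_pattern is for, here is a minimal sketch of pulling the page content out of a response that wraps it in tags. How the component applies the pattern (for example, which regex function and which group it uses) is not documented here, so the logic below is an assumption. The `<content>` tags and the helper `extract_response` are hypothetical; you would need a prompt that asks the model to use such tags.

```python
import re
from typing import Optional


def extract_response(response: str, pattern: Optional[str]) -> str:
    """Return the part of `response` selected by `pattern` (assumed behavior)."""
    if pattern is None:
        # No pattern configured: keep the full response.
        return response
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        # Fall back to the full response if the pattern does not match.
        return response
    # Prefer the first capture group if the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


raw = "Here is the page:\n<content>\n# Invoice\nTotal: 42\n</content>"
print(extract_response(raw, r"<content>\s*(.*?)\s*</content>"))
# # Invoice
# Total: 42
```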
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | List of PDF sources to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata or list of metadata dictionaries. |