Convert PDF documents to text using a Vision Language Model (VLM).
Basic Information
- Pipeline type: Indexing or Query
- Type:
- Components it often connects to:
  - FileTypeRouter: DeepsetVLMPDFToDocumentConverter receives PDF files from FileTypeRouter and converts them into documents.
  - DocumentJoiner: DeepsetVLMPDFToDocumentConverter can send the converted documents to a DocumentJoiner that joins documents from all Converters in the pipeline.
  - PreProcessors: DeepsetVLMPDFToDocumentConverter can send the converted documents to a PreProcessor for further processing.
Required Inputs
Name | Type | Description |
sources | List of Path and ByteStream objects | The list of PDF sources to convert. |
Optional Inputs
Name | Type | Default | Description |
meta | Dictionary | None | A metadata dictionary or a list of metadata dictionaries. |
Outputs
Name | Type | Description |
documents | Dictionary with a list of Document objects | The converted documents. |
DeepsetVLMPDFToDocumentConverter uses a vision language model (VLM) to convert a screenshot of each PDF page into text based on your prompt. Use this converter with PDF files that have:
- complex layouts
- a mix of images and text
- tables
- handwritten text
- figures
Through prompting, you can convert tables, images, or figures into a textual representation, which is useful for retrieval or for passing the resulting text to an LLM.
The converter also extracts text from PDFs with complex layouts in a natural reading order, so you don't need custom post-processing code to restore that order.
This component can incur high costs with OpenAI or Amazon Bedrock if you use it to convert thousands of PDF pages. For OpenAI, one PDF page amounts to roughly 1,500 input tokens and between 800 and 3,000 output tokens.
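To get a feel for the scale before you start a large job, you can multiply the per-page figures above by your page count. The following is a minimal sketch, not part of the component: `estimate_tokens` is a hypothetical helper, and the constants are only the approximate OpenAI figures quoted above. Multiply the results by your provider's current per-token prices to get a cost estimate.

```python
# Rough token estimate for a conversion job, using the approximate
# per-page figures quoted in the text (OpenAI only).
INPUT_TOKENS_PER_PAGE = 1_500          # approximate input tokens per page
OUTPUT_TOKENS_PER_PAGE = (800, 3_000)  # approximate low/high output tokens per page


def estimate_tokens(num_pages: int) -> dict:
    """Return approximate input tokens and a low/high range of output tokens."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    low, high = OUTPUT_TOKENS_PER_PAGE
    return {
        "input": num_pages * INPUT_TOKENS_PER_PAGE,
        "output_low": num_pages * low,
        "output_high": num_pages * high,
    }


estimate = estimate_tokens(10_000)
print(estimate)
```

For 10,000 pages, this gives about 15 million input tokens and between 8 and 30 million output tokens.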
DeepsetVLMPDFToDocumentConverter supports OpenAI models through the OpenAI API and Anthropic models through Amazon Bedrock. It processes PDFs in parallel at both the file level and the page level.
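The two levels of parallelism can be pictured as nested thread pools: one pool across files and, within each file, one pool across pages. The sketch below only illustrates that idea and is not the component's actual implementation. `convert_page`, `convert_file`, and `convert_files` are hypothetical names, and the VLM call is replaced with a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_page(page: str) -> str:
    # Placeholder for the VLM call on one page screenshot (hypothetical).
    return page.upper()


def convert_file(pages: list[str], max_workers_pages: int = 5) -> str:
    # Pages of one file are converted concurrently; map() preserves page order.
    with ThreadPoolExecutor(max_workers=max_workers_pages) as pool:
        return "\f".join(pool.map(convert_page, pages))


def convert_files(files: list[list[str]], max_workers_files: int = 3) -> list[str]:
    # Files are processed concurrently too; each file opens its own page pool.
    with ThreadPoolExecutor(max_workers=max_workers_files) as pool:
        return list(pool.map(convert_file, files))


results = convert_files([["page 1", "page 2"], ["only page"]])
print(results)
```

With this layout, up to `max_workers_files * max_workers_pages` page conversions can be in flight at once, which is worth keeping in mind when you set these values against your provider's rate limits.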
You can adjust the conversion process by passing a custom prompt or adjusting any of the other parameters.
Use the generator_kwargs argument to pass additional parameters to the underlying VLM generator. Check the DeepsetOpenAIVisionGenerator or the DeepsetAmazonBedrockVisionGenerator to learn about the parameters that they accept.
Usage Example
This is an example indexing pipeline, where DeepsetVLMPDFToDocumentConverter receives PDFs from FileTypeRouter and then sends the converted documents to DocumentJoiner:
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/plain
      - application/pdf
      - text/markdown
      - text/html
      - application/vnd.openxmlformats-officedocument.wordprocessingml.document
      - application/vnd.openxmlformats-officedocument.presentationml.presentation
      - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters: {}
  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}
  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1024
          similarity: cosine
  DeepsetVLMPDFToDocumentConverter:
    type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
    init_parameters:
      vlm_provider: openai
      max_workers_files: 3
      max_workers_pages: 5
      max_retries: 3
      backoff_factor: 2
      initial_backoff_time: 30
      prompt: |-
        Extract the content from the document below.
        You need to extract the content exactly.
        Format everything as markdown.
        Make sure to retain the reading order of the document.
        **Headers- and Footers**
        Remove repeating page headers or footers that disrupt the reading order.
        Place letter heads that appear at the side of a document at the top of the page.
        Do not extract images, drawings or maps.
        Instead, add a caption that describes briefly what you see on the image.
        Enclose each image caption with [img-caption][/img-caption]
        Make sure to format the table in markdown.
        Add a short caption below the table that describes the table's content.
        Enclose each table caption with [table-caption][/table-caption].
        The caption must be placed below the extracted table.
        Reproduce checkbox selections with markdown.
        Go ahead and extract!
      model: gpt-4o
      max_splits_per_page: 3
      detail: auto
      temperature: 0
      seed: 0
      max_tokens: 4000
      timeout: 120
      response_extraction_pattern: null
      progress_bar: true
      page_separator: "\f"
connections:
- sender: file_classifier.text/plain
  receiver: text_converter.sources
- sender: file_classifier.text/markdown
  receiver: markdown_converter.sources
- sender: file_classifier.text/html
  receiver: html_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
  receiver: docx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
  receiver: pptx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  receiver: xlsx_converter.sources
- sender: text_converter.documents
  receiver: joiner.documents
- sender: markdown_converter.documents
  receiver: joiner.documents
- sender: html_converter.documents
  receiver: joiner.documents
- sender: docx_converter.documents
  receiver: joiner.documents
- sender: pptx_converter.documents
  receiver: joiner.documents
- sender: xlsx_converter.documents
  receiver: joiner.documents
- sender: joiner.documents
  receiver: writer.documents
- sender: file_classifier.application/pdf
  receiver: DeepsetVLMPDFToDocumentConverter.sources
- sender: DeepsetVLMPDFToDocumentConverter.documents
  receiver: joiner.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
  - file_classifier.sources
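The example prompt asks the model to wrap captions in [img-caption] and [table-caption] tags, and the converter joins pages with the form feed character. If you want to work with those markers downstream, a sketch like the following could split the text into pages and collect the captions. The `content` string is made-up sample output, and the regular expression assumes the model followed the tag convention exactly, which is not guaranteed.

```python
import re

# Hypothetical converter output: two pages joined by the "\f" page separator.
content = (
    "# Report\n[img-caption]A bar chart of sales by region[/img-caption]\f"
    "| a | b |\n|---|---|\n| 1 | 2 |\n[table-caption]Example values[/table-caption]"
)

# Matches [img-caption]...[/img-caption] and [table-caption]...[/table-caption].
CAPTION = re.compile(r"\[(img|table)-caption\](.*?)\[/\1-caption\]", re.DOTALL)

pages = content.split("\f")
captions = [
    (page_number, kind, text.strip())
    for page_number, page in enumerate(pages, start=1)
    for kind, text in CAPTION.findall(page)
]
print(captions)
```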
Init Parameters
These are the parameters you can configure in Pipeline Builder:
Parameter | Type | Possible values | Description |
vlm_provider | Literal | openai, bedrock. Default: openai | The type of VLM to use. You can choose OpenAI or Bedrock. Required. |
max_workers_files | Integer | Default: 3 | The maximum number of threads for processing files. Required. |
max_workers_pages | Integer | Default: 5 | The maximum number of threads for processing pages. Required. |
max_retries | Integer | Default: 3 | The maximum number of retries for page-level extraction. Required. |
backoff_factor | Float | Default: 2.0 | The factor for exponential backoff between retries. Required. |
initial_backoff_time | Float | Default: 30.0 | The initial backoff time in seconds. Required. |
prompt | String | Default: Extract the content from this document page. Format everything as markdown to recreate the layout as best as possible. Retain the natural reading order. | The prompt for the VLM. Required. |
openai_api_key | Secret | Default: Secret.from_env_var("OPENAI_API_KEY") | The API key for OpenAI. Required. |
model | String | Default: gpt-4o | The name of the model you want to use. Required. |
max_splits_per_page | Integer | Default: 3 | The maximum number of splits per page. This parameter only applies when using openai as vlm_provider. It detects when the conversion of a page was truncated because of the maximum number of output tokens and prompts the model to continue the extraction where it left off. Check the maximum number of output tokens for your model in the OpenAI documentation. If you select bedrock as vlm_provider, the output of a page is truncated if it exceeds the maximum number of output tokens. Required. |
detail | Literal | auto, low, high. Default: auto | The level of detail for image processing. Choose high for best results and low for lowest inference costs. If you choose auto, the API automatically adjusts the resolution based on the size of the image input. Required. |
generation_kwargs | Dictionary | Default: None | Additional keyword arguments for the generator. Check DeepsetOpenAIVisionGenerator or DeepsetAmazonBedrockVisionGenerator to learn about the parameters that you can pass. Optional. |
response_extraction_pattern | String | Default: None | A regex pattern to extract text from the Generator's response. Optional. |
progress_bar | Boolean | True, False. Default: True | Shows a progress bar. Required. |
page_separator | String | Default: \f | The string to use for separating pages. Required. |
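To see how max_retries, backoff_factor, and initial_backoff_time interact, it can help to compute the waiting times they imply. The sketch below assumes the common exponential backoff formula, delay = initial_backoff_time * backoff_factor ** attempt. The component's exact formula is not documented here, so treat `backoff_delays` as a hypothetical illustration only.

```python
def backoff_delays(
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    initial_backoff_time: float = 30.0,
) -> list[float]:
    """Seconds to wait before each retry, assuming initial * factor ** attempt."""
    return [
        initial_backoff_time * backoff_factor ** attempt
        for attempt in range(max_retries)
    ]


delays = backoff_delays()
print(delays)       # [30.0, 60.0, 120.0]
print(sum(delays))  # 210.0
```

Under this assumption, the default settings mean a page that keeps failing waits up to 210 seconds in total before the converter gives up on it.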
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
Parameter | Type | Description |
sources | List of string, Path, or ByteStream objects | List of PDF sources to convert. Required. |
meta | List of dictionaries or a dictionary | Metadata for the request. Optional. |
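Because meta accepts either a single dictionary or a list of dictionaries, it helps to think about how the two forms map onto sources. The sketch below assumes the convention that Haystack converters commonly follow: a single dictionary applies to every source, and a list must contain one entry per source. `normalize_meta` is a hypothetical helper written for illustration, not this component's code.

```python
from typing import Any, Optional, Union

Meta = dict[str, Any]


def normalize_meta(
    meta: Optional[Union[Meta, list[Meta]]], sources_count: int
) -> list[Meta]:
    """Return one metadata dictionary per source (assumed convention)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The metadata list must have one entry per source.")
    return [dict(entry) for entry in meta]


shared = normalize_meta({"department": "legal"}, sources_count=2)
per_source = normalize_meta([{"year": 2023}, {"year": 2024}], sources_count=2)
print(shared)
print(per_source)
```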