DeepsetVLMPDFToDocumentConverter

Convert PDF documents to text using a Vision Language Model (VLM).

Basic Information

  • Pipeline type: Indexing or Query
  • Type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
  • Components it often connects to:
    • FileTypeRouter: DeepsetVLMPDFToDocumentConverter receives sources from FileTypeRouter and converts them into documents.
    • DocumentJoiner: DeepsetVLMPDFToDocumentConverter can send the converted documents to a DocumentJoiner that joins documents from all Converters in the pipeline.
    • PreProcessors: DeepsetVLMPDFToDocumentConverter can send the converted documents to a PreProcessor for further processing.

Inputs

Required Inputs

| Name | Type | Description |
| --- | --- | --- |
| sources | List of Path and ByteStream objects | The list of PDF sources to convert. |

Optional Inputs

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| meta | Dictionary | None | Metadata or a list of metadata dictionaries. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| documents | Dictionary with a list of Document objects | The converted documents. |
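
For quick experiments outside of a pipeline, you can also run the converter directly in Python. This is a minimal sketch that assumes the standard run() interface described in the Inputs and Outputs tables above; the file name and metadata values are placeholders.

```python
from pathlib import Path

from deepset_cloud_custom_nodes.converters.vlm_pdf_to_document import (
    DeepsetVLMPDFToDocumentConverter,
)

# Minimal setup: the OpenAI key is read from the OPENAI_API_KEY
# environment variable, which is the documented default.
converter = DeepsetVLMPDFToDocumentConverter(vlm_provider="openai", model="gpt-4o")

# `sources` accepts Path and ByteStream objects. `meta` is optional:
# pass one dictionary for all sources or a list with one entry per source.
result = converter.run(sources=[Path("report.pdf")], meta={"company": "ACME"})

for document in result["documents"]:
    print(document.content[:200])
```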

Overview

DeepsetVLMPDFToDocumentConverter uses a vision language model (VLM) to convert a screenshot of each PDF page into text based on your prompt. Use this converter with PDF files that have:

  • complex layouts
  • a mix of images and text
  • tables
  • handwritten text
  • figures

Through prompting, you can convert tables, images, or figures into a textual representation, which is useful for retrieval or for passing the resulting text to an LLM.

The converter extracts text in a natural reading order from PDFs with complex layouts, without requiring custom post-processing code to restore that order.

🚧

This component can incur high costs with OpenAI or Amazon Bedrock if you use it to convert thousands of PDF pages. For OpenAI, one PDF page equals roughly 1,500 input tokens and produces roughly 800 to 3,000 output tokens.
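
The token figures above let you estimate conversion volume before running a large job. This back-of-envelope sketch only computes token counts from those figures; the per-million-token prices are placeholders that you need to replace with your provider's current rates.

```python
# Per-page figures from the note above: ~1,500 input tokens,
# ~800-3,000 output tokens per converted PDF page.
pages = 10_000

input_tokens = pages * 1_500
output_low, output_high = pages * 800, pages * 3_000

# Placeholder prices per 1M tokens -- substitute your model's actual rates.
input_price, output_price = 2.50, 10.00

cost_low = (input_tokens * input_price + output_low * output_price) / 1_000_000
cost_high = (input_tokens * input_price + output_high * output_price) / 1_000_000

print(f"{input_tokens:,} input tokens, {output_low:,}-{output_high:,} output tokens")
print(f"Estimated cost: ${cost_low:,.2f}-${cost_high:,.2f}")
```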

DeepsetVLMPDFToDocumentConverter supports OpenAI models through the OpenAI API and Anthropic models through Amazon Bedrock. It processes PDFs in parallel, both across files and across the pages of each file.

You can adjust the conversion process by passing a custom prompt or adjusting any of the other parameters. Use the generator_kwargs argument to pass additional parameters to the underlying VLM generator. Check DeepsetOpenAIVisionGenerator or DeepsetAmazonBedrockVisionGenerator to learn about the parameters they accept.
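
For example, in Python you can pass the same generator_kwargs shown in the YAML pipeline below. The keys used here (generation_kwargs and timeout) are taken from that example; which keys are valid ultimately depends on the generator you select.

```python
from deepset_cloud_custom_nodes.converters.vlm_pdf_to_document import (
    DeepsetVLMPDFToDocumentConverter,
)

# generator_kwargs is forwarded to the underlying vision generator.
# Here: deterministic, bounded generation with a 120-second timeout.
converter = DeepsetVLMPDFToDocumentConverter(
    vlm_provider="openai",
    model="gpt-4o",
    generator_kwargs={
        "generation_kwargs": {"temperature": 0, "seed": 0, "max_tokens": 4000},
        "timeout": 120,
    },
)
```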

Usage Example

This is an example indexing pipeline, where DeepsetVLMPDFToDocumentConverter receives PDFs from FileTypeRouter and then sends the converted documents to DocumentJoiner:

```yaml
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters: {}
  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}
  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1024
          similarity: cosine
      policy: OVERWRITE
  DeepsetVLMPDFToDocumentConverter:
    type: deepset_cloud_custom_nodes.converters.vlm_pdf_to_document.DeepsetVLMPDFToDocumentConverter
    init_parameters:
      vlm_provider: openai
      max_workers_files: 3
      max_workers_pages: 5
      max_retries: 3
      backoff_factor: 2
      initial_backoff_time: 30
      prompt: |-
        Extract the content from the document below.
        You need to extract the content exactly.
        Format everything as markdown.
        Make sure to retain the reading order of the document.

        **Headers and Footers**
        Remove repeating page headers or footers that disrupt the reading order.
        Place letterheads that appear at the side of a document at the top of the page.


        **Images**
        Do not extract images, drawings or maps.
        Instead, add a caption that describes briefly what you see on the image.
        Enclose each image caption with [img-caption][/img-caption].

        **Tables**
        Make sure to format the table in markdown.
        Add a short caption below the table that describes the table's content.
        Enclose each table caption with [table-caption][/table-caption].
        The caption must be placed below the extracted table.

        **Forms**
        Reproduce checkbox selections with markdown.

        Go ahead and extract!

        Document:
      model: gpt-4o
      max_splits_per_page: 3
      detail: auto
      generator_kwargs:
        generation_kwargs:
          temperature: 0
          seed: 0
          max_tokens: 4000
        timeout: 120
      response_extraction_pattern: null
      progress_bar: true
      page_separator: "\f"
connections:
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: writer.documents
  - sender: file_classifier.application/pdf
    receiver: DeepsetVLMPDFToDocumentConverter.sources
  - sender: DeepsetVLMPDFToDocumentConverter.documents
    receiver: joiner.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - file_classifier.sources
```


Init Parameters

| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| vlm_provider | Literal | openai, bedrock. Default: openai | The type of VLM to use. You can choose OpenAI or Bedrock. Required. |
| max_workers_files | Integer | Default: 3 | The maximum number of threads for processing files. Required. |
| max_workers_pages | Integer | Default: 5 | The maximum number of threads for processing pages. Required. |
| max_retries | Integer | Default: 3 | The maximum number of retries for page-level extraction. Required. |
| backoff_factor | Float | Default: 2.0 | The factor for exponential backoff between retries. Required. |
| initial_backoff_time | Float | Default: 30.0 | The initial backoff time in seconds. Required. |
| prompt | String | Default: "Extract the content from this document page. Format everything as markdown to recreate the layout as best as possible. Retain the natural reading order." | The prompt for the VLM. Required. |
| openai_api_key | Secret | Default: Secret.from_env_var("OPENAI_API_KEY") | The API key for OpenAI. Required. |
| model | String | Default: gpt-4o | The name of the model you want to use. Required. |
| max_splits_per_page | Integer | Default: 3 | The maximum number of splits per page. This parameter only applies when using openai as vlm_provider: it detects when the conversion of a page was truncated because of the maximum number of output tokens and prompts the model to continue the extraction where it left off. Check the maximum number of output tokens for your model in the OpenAI documentation. If you select bedrock as vlm_provider, the output of a page is truncated if it exceeds the maximum number of output tokens. Required. |
| detail | Literal | auto, low, high. Default: auto | The level of detail for image processing. Choose high for the best results and low for the lowest inference costs. If you choose auto, the API automatically adjusts the resolution based on the size of the image input. Required. |
| generator_kwargs | Dictionary | Default: None | Additional keyword arguments for the generator. Check DeepsetOpenAIVisionGenerator or DeepsetAmazonBedrockVisionGenerator to learn about the parameters you can pass. Optional. |
| response_extraction_pattern | String | Default: None | A regex pattern to extract text from the generator's response. Optional. |
| progress_bar | Boolean | True, False. Default: True | Shows a progress bar during conversion. Required. |
| page_separator | String | Default: \f | The string used to separate pages. Required. |