DoclingConverter

Convert PDFs and other supported files into Markdown documents with Docling, an open source document processing library.

Basic Information

Type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
Components it can connect with:
- Input: Send file references or ByteStreams to DoclingConverter.
- FileTypeRouter: Route supported file types (such as PDF) to DoclingConverter.
- DocumentJoiner or DocumentSplitter: Consume the generated documents for further processing before writing to a document store.

Inputs

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		File paths, Path objects, or ByteStreams to convert.
meta	Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]	None	Metadata for the generated documents: a single dictionary applied to every document, or a list of dictionaries aligned with `sources`.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Documents created from the Docling Markdown export.

Overview

DoclingConverter wraps the Docling pipeline to turn PDFs and other supported inputs into Markdown content. It can run OCR, preserve table structure, and keep page breaks by inserting a placeholder token. The component extracts Docling metadata, flattens it, and merges it with the incoming metadata so downstream components can filter or route by Docling-specific fields. Docling cannot reliably recover page numbers from DOCX files, so if the pipeline depends on page numbers, convert DOCX files to PDF before sending them to DoclingConverter.
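To make the flatten-and-merge step more concrete, here is a minimal sketch in plain Python. It is not the component's actual implementation: the function names, the dot-separated key scheme, the sample field names, and the rule that incoming metadata wins on key collisions are all assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_meta(nested: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `sep`.

    Hypothetical helper: the real key naming used by DoclingConverter may differ.
    """
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries and carry the key path along.
            flat.update(flatten_meta(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def merge_meta(incoming: Dict[str, Any], docling_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Merge flattened Docling metadata with the metadata passed to the component.

    Assumption: on a key collision, the incoming (user-supplied) value wins.
    """
    return {**flatten_meta(docling_meta), **incoming}


# Illustrative data only; these field names are made up for the example.
docling_meta = {
    "origin": {"filename": "report.pdf", "mimetype": "application/pdf"},
    "version": "1.0",
}
incoming = {"department": "finance"}

merged = merge_meta(incoming, docling_meta)
print(merged)
```

With flat keys such as `origin.filename`, a downstream filter can address a Docling field directly instead of navigating nested structures.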

For more information about Docling, see the Docling website.

GPU Acceleration

DoclingConverter runs faster with GPU acceleration. Enable GPU acceleration in the pipeline settings to improve performance:

1. Go to Pipelines and click the pipeline that contains the DoclingConverter component. You're redirected to the Pipeline Details page.
2. Go to Settings and click the GPU Acceleration toggle to turn it on.

For details, see GPU Acceleration.

Usage Example

Using the Component in an Index

In this example, DoclingConverter receives files from the Input component, then sends the processed documents to DocumentSplitter.

components:
  docling_converter:
    type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
    init_parameters:
      do_picture_classification: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          - ${OPENSEARCH_HOST}
          index: docling-demo
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: SKIP

connections:
- sender: docling_converter.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: writer.documents

inputs:
  files:
  - docling_converter.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Builder:

Parameter	Type	Default	Description
do_ocr	bool	True	Enable OCR when Docling processes PDF pages.
do_table_structure	bool	True	Preserve table structure information during conversion.
do_code_enrichment	bool	False	Enable Docling code OCR to capture code blocks.
do_formula_enrichment	bool	False	Enable formula OCR to keep math expressions readable.
do_picture_classification	bool	False	Attach Docling picture classification metadata.
do_picture_description	bool	False	Add Docling generated picture descriptions to metadata.
page_break_placeholder	str	"\f"	Token inserted whenever Docling encounters a page break.
filter_binary_hash	bool	True	Remove Docling `binary_hash` fields from metadata.
pipeline_options	PdfPipelineOptions | None	None	Advanced Docling pipeline options for Haystack users. Overrides the individual boolean flags.
convert_kwargs	Dict[str, Any] | None	None	Additional arguments forwarded to `DocumentConverter.convert()`, such as `headers` or `max_num_pages`.
md_export_kwargs	Dict[str, Any] | None	None	Overrides for Docling Markdown export, for example `image_placeholder` or custom formatting flags.
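As an illustration of what `page_break_placeholder` makes possible downstream, the sketch below splits converted Markdown into pages and looks up the page for a character offset. It uses only the Python standard library. The helper names and the sample content are hypothetical, and the sketch assumes the placeholder appears exactly once between consecutive pages.

```python
from typing import List, Tuple

PAGE_BREAK = "\f"  # default value of page_break_placeholder


def split_pages(content: str, placeholder: str = PAGE_BREAK) -> List[Tuple[int, str]]:
    """Return (page_number, text) pairs, numbering pages from 1."""
    return list(enumerate(content.split(placeholder), start=1))


def page_of_offset(content: str, offset: int, placeholder: str = PAGE_BREAK) -> int:
    """Return the 1-based page number that contains the character at `offset`."""
    # Every placeholder found before the offset means one more page has passed.
    return content.count(placeholder, 0, offset) + 1


# Illustrative content; a real document would come from the converter output.
content = "# Title\nIntro\fSecond page\fThird page"

pages = split_pages(content)
second_page_start = content.index("Second")
print(pages)
print(page_of_offset(content, second_page_start))
```

If you change `page_break_placeholder` to a custom token, pass the same token as `placeholder` so the page boundaries line up.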

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		Paths or ByteStreams for the files you want to convert.
meta	Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]	None	Metadata for the generated documents: a single dictionary applied to every document, or a list of dictionaries aligned with `sources`.
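The two accepted shapes of `meta` can be pictured as being normalized into one dictionary per source. The following is a minimal sketch of that idea, similar in spirit to how Haystack converters treat metadata. The function name `normalize_meta` and the error raised on a length mismatch are assumptions, not the component's confirmed behavior.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source (hypothetical helper)."""
    if meta is None:
        # No metadata supplied: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a list must match the number of sources exactly.
        raise ValueError(
            f"Expected {sources_count} metadata entries, got {len(meta)}."
        )
    return [dict(entry) for entry in meta]


sources = ["contract.pdf", "invoice.pdf"]

shared = normalize_meta({"project": "alpha"}, len(sources))
per_source = normalize_meta([{"kind": "contract"}, {"kind": "invoice"}], len(sources))
print(shared)
print(per_source)
```

Copying each dictionary keeps the per-document metadata independent, so a later change to one document's metadata does not leak into the others.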
