DoclingConverter

Converts PDFs and other supported files into Markdown documents using Docling, an open source document processing library.

GPU Acceleration

DoclingConverter runs faster with GPU acceleration. Enable GPU acceleration in the pipeline settings to improve performance:

  1. Go to Pipelines and click the pipeline that contains the DoclingConverter component. The Pipeline Details page opens.
  2. Go to Settings and click the GPU Acceleration toggle to turn it on.

For details, see GPU Acceleration.

Key Features

  • Converts PDFs and other supported file types to Markdown content.
  • Runs OCR and preserves table structure during conversion.
  • Inserts page break placeholders to maintain document structure.
  • Extracts and flattens Docling metadata so downstream components can filter by Docling-specific fields.
  • Accepts file paths, Path objects, or ByteStream inputs.

For more information about Docling, see the Docling website.
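
To illustrate what flattening metadata means for downstream filtering, here is a minimal sketch in plain Python. The helper name `flatten_metadata`, the dot separator, and the sample field names are assumptions made for illustration; the component's actual key naming may differ.

```python
from typing import Any, Dict


def flatten_metadata(meta: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in meta.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries and merge their flattened keys.
            flat.update(flatten_metadata(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested metadata; the real Docling field names may differ.
nested = {
    "origin": {"filename": "report.pdf", "mimetype": "application/pdf"},
    "version": "1.0",
}
flat = flatten_metadata(nested)
print(flat)
```

With a flat structure like this, a filter can target a single key such as `origin.filename` instead of walking a nested object.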

Configuration

  1. Drag the DoclingConverter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

DoclingConverter accepts file paths, Path objects, or ByteStream inputs as sources, and optional metadata as meta. It outputs a list of Document objects (documents).

Connect the pipeline's file input to the sources input. Connect the documents output to DocumentSplitter, DocumentJoiner, or DocumentWriter for further processing.

Usage Example

Using the Component in an Index

In this example, DoclingConverter receives files from the Input component, then sends the processed documents to DocumentSplitter.

components:
  docling_converter:
    type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
    init_parameters:
      do_picture_classification: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          index: docling-demo
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: SKIP

connections:
  - sender: docling_converter.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: writer.documents

inputs:
  files:
    - docling_converter.sources

max_runs_per_component: 100

metadata: {}


Parameters

Inputs

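| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | File paths, Path objects, or ByteStreams to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata to attach to the generated documents. A single dictionary is applied to every document; a list of dictionaries is aligned with the sources, one entry per source. |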
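
The two accepted shapes of `meta` can be sketched in plain Python as follows. The helper name `align_meta` is hypothetical, and raising an error on a length mismatch is an assumption modeled on how converters commonly behave, not a confirmed detail of this component.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def align_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source."""
    if meta is None:
        # No metadata supplied: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a list must contain exactly one entry per source.
        raise ValueError("meta list length must match the number of sources")
    return [dict(entry) for entry in meta]


shared = align_meta({"department": "legal"}, sources_count=2)
per_source = align_meta([{"lang": "en"}, {"lang": "de"}], sources_count=2)
print(shared)
print(per_source)
```

A single dictionary suits metadata that describes the whole upload, while a list suits values that differ per file.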

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents created from the Docling Markdown export. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| do_ocr | bool | True | Enable OCR when Docling processes PDF pages. |
| do_table_structure | bool | True | Preserve table structure information during conversion. |
| do_code_enrichment | bool | False | Enable Docling code OCR to capture code blocks. |
| do_formula_enrichment | bool | False | Enable formula OCR to keep math expressions readable. |
| do_picture_classification | bool | False | Attach Docling picture classification metadata. |
| do_picture_description | bool | False | Add Docling-generated picture descriptions to metadata. |
| page_break_placeholder | str | "\f" | Token inserted whenever Docling encounters a page break. |
| filter_binary_hash | bool | True | Remove Docling binary_hash fields from metadata. |
| pipeline_options | Optional[PdfPipelineOptions] | None | Advanced Docling pipeline options for Haystack users. Overrides the individual boolean flags. |
| convert_kwargs | Optional[Dict[str, Any]] | None | Additional arguments forwarded to DocumentConverter.convert(), such as headers or max_num_pages. |
| md_export_kwargs | Optional[Dict[str, Any]] | None | Overrides for Docling Markdown export, for example image_placeholder or custom formatting flags. |
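
As a rough illustration of how `page_break_placeholder` can be used downstream, the sketch below splits converted content back into pages on the default `"\f"` token. The helper `split_pages` and the sample text are hypothetical and stand in for real converter output.

```python
from typing import List


def split_pages(content: str, placeholder: str = "\f") -> List[str]:
    """Split converted Markdown into pages at each page break placeholder."""
    return [page.strip() for page in content.split(placeholder)]


# Hypothetical converter output: two pages separated by the default placeholder.
sample = "# Title\n\nFirst page text.\f## Section\n\nSecond page text."
pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(f"page {number}: {len(page)} characters")
```

Keeping the default form feed character also fits DocumentSplitter's `split_by: page` mode, which splits on the same character.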

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

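| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | File paths, Path objects, or ByteStreams for the files you want to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata to attach to the generated documents. A single dictionary is applied to every document; a list of dictionaries is aligned with the sources, one entry per source. |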