DoclingConverter
Convert PDFs and other supported files into Markdown documents with Docling, an open-source document processing library.
Key Features
- Wraps the Docling pipeline to convert PDFs and other supported inputs into Markdown content.
- Supports OCR, table structure preservation, code block capture, formula recognition, and picture classification.
- Extracts Docling metadata, flattens it, and merges it with incoming metadata for downstream filtering and routing.
- Inserts a configurable placeholder token at page breaks.
- Runs faster with GPU acceleration. For details, see GPU Acceleration.
- For more information about Docling, see the Docling website.
DoclingConverter runs faster with GPU acceleration. Enable GPU acceleration in the pipeline settings to improve performance:
- Go to Pipelines and click the pipeline that contains the DoclingConverter component. You're redirected to the Pipeline Details page.
- Go to Settings and click the GPU Acceleration toggle to turn it on.
For details, see GPU Acceleration.
Configuration
- Drag the DoclingConverter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Set do_ocr to enable or disable OCR when processing PDF pages (enabled by default).
  - Set do_table_structure to preserve table structure during conversion (enabled by default).
- Go to the Advanced tab to configure additional enrichment options such as do_code_enrichment, do_formula_enrichment, do_picture_classification, and do_picture_description, and advanced overrides like pipeline_options, convert_kwargs, and md_export_kwargs.
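The same settings can also be written directly in the pipeline YAML. The fragment below is a minimal sketch that uses only the parameter names listed on this page. The values are illustrative and the comments mapping parameters to tabs follow the steps above.

```yaml
docling_converter:
  type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
  init_parameters:
    # General tab: both options are enabled by default
    do_ocr: true
    do_table_structure: true
    # Advanced tab: enrichment options, disabled by default
    do_code_enrichment: true        # illustrative: capture code blocks
    do_formula_enrichment: true     # illustrative: keep math expressions readable
    do_picture_classification: false
    do_picture_description: false
```

Any parameter you leave out keeps its default value.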
When you convert DOCX files, Docling cannot reliably recover page numbers. Run DOCX through a PDF conversion first if your pipeline depends on page numbers.
Connections
DoclingConverter receives file paths, Path objects, or ByteStreams through its sources input, typically from the Input component or FileTypeRouter. It outputs a list of Markdown documents through its documents output, which you connect to DocumentSplitter or DocumentJoiner for further processing before writing to a document store.
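As a sketch of the routing described above, the fragment below places FileTypeRouter in front of DoclingConverter so that only PDFs reach the converter. It assumes Haystack's FileTypeRouter, which exposes one output per configured MIME type; the MIME type list and splitter settings are illustrative, not requirements of DoclingConverter.

```yaml
components:
  file_type_router:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/pdf          # illustrative: route only PDFs to Docling
  docling_converter:
    type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
    init_parameters: {}
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30

connections:
  # The router output is named after the MIME type it matched (assumed socket name)
  - sender: file_type_router.application/pdf
    receiver: docling_converter.sources
  - sender: docling_converter.documents
    receiver: splitter.documents

inputs:
  files:
    - file_type_router.sources
```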
Usage Examples
Basic Configuration
docling_converter:
  type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
  init_parameters:
    do_picture_classification: false
- Converts PDFs and other supported file types to Markdown content.
- Runs OCR and preserves table structure during conversion.
- Inserts page break placeholders to maintain document structure.
- Extracts and flattens Docling metadata so downstream components can filter by Docling-specific fields.
- Accepts file paths, Path objects, or ByteStream inputs.
- For more information about Docling, see the Docling website.
Configuration
- Drag the DoclingConverter component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
DoclingConverter accepts file paths, Path objects, or ByteStream inputs as sources, and optional metadata as meta. It outputs a list of Document objects (documents).
Connect the pipeline's file input to the sources input. Connect the documents output to DocumentSplitter, DocumentJoiner, or DocumentWriter for further processing.
Usage Example
Using the Component in an Index
In this example, DoclingConverter receives files from the Input component, then sends the processed documents to DocumentSplitter.
components:
  docling_converter:
    type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
    init_parameters:
      do_picture_classification: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          index: docling-demo
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: SKIP
connections:
  - sender: docling_converter.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: writer.documents
inputs:
  files:
    - docling_converter.sources
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | File paths, Path objects, or ByteStreams to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata applied to every generated document or aligned per source. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents created from the Docling Markdown export. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| do_ocr | bool | True | Enable OCR when Docling processes PDF pages. |
| do_table_structure | bool | True | Preserve table structure information during conversion. |
| do_code_enrichment | bool | False | Enable Docling code OCR to capture code blocks. |
| do_formula_enrichment | bool | False | Enable formula OCR to keep math expressions readable. |
| do_picture_classification | bool | False | Attach Docling picture classification metadata. |
| do_picture_description | bool | False | Add Docling generated picture descriptions to metadata. |
| page_break_placeholder | str | "\f" | Token inserted whenever Docling encounters a page break. |
| filter_binary_hash | bool | True | Remove Docling binary_hash fields from metadata. |
| pipeline_options | Optional[PdfPipelineOptions] | None | Advanced Docling pipeline options for Haystack users. Overrides the individual boolean flags. |
| convert_kwargs | Optional[Dict[str, Any]] | None | Additional arguments forwarded to DocumentConverter.convert(), such as headers or max_num_pages. |
| md_export_kwargs | Optional[Dict[str, Any]] | None | Overrides for Docling Markdown export, for example image_placeholder or custom formatting flags. |
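As a sketch of how the override parameters might look in pipeline YAML, the fragment below sets a custom page break token and passes the two arguments the table names as examples, max_num_pages and image_placeholder. All values are illustrative assumptions; check the Docling documentation for the arguments your Docling version accepts.

```yaml
docling_converter:
  type: deepset_cloud_custom_nodes.converters.docling_converter.DoclingConverter
  init_parameters:
    page_break_placeholder: "[PAGE_BREAK]"   # hypothetical token, replaces the default "\f"
    filter_binary_hash: true
    convert_kwargs:
      max_num_pages: 50                      # illustrative limit, forwarded to DocumentConverter.convert()
    md_export_kwargs:
      image_placeholder: "[image]"           # illustrative text used where an image appeared
```

If you split documents by page downstream, use a placeholder that the splitter also recognizes as a page boundary.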
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | Paths or ByteStreams for the files you want to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata applied to every generated document, or a list of metadata dictionaries aligned with the sources. |
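The two shapes that the meta type allows can be sketched as follows. The department key is hypothetical, and YAML is used here only as notation for the data; how you pass the value at query time depends on the interface you use.

```yaml
# Shape 1: a single dictionary, applied to every generated document.
meta:
  department: legal          # hypothetical key
---
# Shape 2: a list with one dictionary per entry in sources, aligned by position.
meta:
  - department: legal        # metadata for the first source
  - department: finance      # metadata for the second source
```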