DocumentCleaner

Clean the text in documents by removing extra whitespace, empty lines, repeated headers and footers, and more. Use this component in indexing pipelines to prepare documents for further processing by LLMs or embedders.

Key Features

  • Removes empty lines and extra whitespaces.
  • Removes repeated substrings such as headers and footers across pages.
  • Removes specific substrings or text matching a regular expression.
  • Normalizes Unicode characters (NFC, NFKC, NFD, or NFKD).
  • Converts text to ASCII only, removing accents and non-ASCII characters.
  • Optionally retains the original document ID.
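
To make the four normalization forms concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only illustrates what the forms do to a sample string; the sample text is made up, and the component's internals may differ.

```python
import unicodedata

# Sample text (hypothetical): a precomposed "é" (U+00E9) and the "ﬁ" ligature (U+FB01).
text = "caf\u00e9 \ufb01le"

nfc = unicodedata.normalize("NFC", text)    # canonical composition: stays as is
nfd = unicodedata.normalize("NFD", text)    # "é" becomes "e" + combining accent
nfkc = unicodedata.normalize("NFKC", text)  # composed, ligature expanded to "fi"
nfkd = unicodedata.normalize("NFKD", text)  # decomposed and ligature expanded

for name, value in [("NFC", nfc), ("NFD", nfd), ("NFKC", nfkc), ("NFKD", nfkd)]:
    print(name, len(value), repr(value))
```

The compatibility forms (NFKC, NFKD) change characters such as ligatures, so they suit search-oriented indexing more than cases where the exact original characters matter.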

Configuration

  1. Drag the DocumentCleaner component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. Configure the component settings:
    • Toggle Remove Empty Lines to remove empty lines from the document.
    • Toggle Remove Extra Whitespaces to remove extra whitespaces from the document.
    • Toggle Remove Repeated Substrings to remove repeated headers and footers from pages. Pages must be separated by a form feed character \f, which is supported by TextFileToDocument and AzureOCRDocumentConverter.
    • Set Remove Substrings to specify a list of strings to remove from the document.
    • Set Remove Regex to specify a regular expression pattern whose matches are removed.
    • Set Unicode Normalization to apply Unicode normalization to the text.
    • Toggle ASCII Only to convert text to ASCII, removing accents and other non-ASCII characters.
    • Toggle Keep ID to retain the original document ID.
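
The role of the form feed character in Remove Repeated Substrings can be sketched as follows. This is a deliberately simplified, line-based illustration and not the algorithm the component uses: the helper `strip_repeated_edges` is hypothetical and only drops a first or last line when it is identical on every page.

```python
def strip_repeated_edges(text: str) -> str:
    """Drop the first/last line of each page when it repeats on every page.

    Simplified illustration only; pages are separated by form feed ("\f").
    """
    pages = [page.strip("\n").split("\n") for page in text.split("\f")]
    if len(pages) < 2:
        return text  # a single page has nothing to compare against

    same_header = len({lines[0] for lines in pages}) == 1
    same_footer = len({lines[-1] for lines in pages}) == 1

    cleaned = []
    for lines in pages:
        start = 1 if same_header else 0
        end = len(lines) - 1 if same_footer else len(lines)
        cleaned.append("\n".join(lines[start:end]))
    return "\f".join(cleaned)


sample = (
    "ACME Report\nIntro text\nConfidential"
    "\f"
    "ACME Report\nMore text\nConfidential"
)
print(repr(strip_repeated_edges(sample)))
```

Without the `\f` separators the text is a single page, so there is nothing to compare and no header or footer can be detected. This is why the converter in front of the cleaner has to emit form feeds.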

Connections

DocumentCleaner accepts a list of Document objects and outputs cleaned Document objects.

It typically receives documents from converters like TextFileToDocument, PDFMinerToDocument, or AzureOCRDocumentConverter, and sends cleaned documents to DocumentSplitter for chunking or directly to document embedders.

Source Code

To check this component's source code, open document_cleaner.py in the Haystack repository.

Usage Examples

Basic Configuration

DocumentCleaner:
  type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  init_parameters:
    remove_empty_lines: true
    remove_extra_whitespaces: true
    remove_repeated_substrings: false
    keep_id: false
    ascii_only: false

Using the Component in an Index

This example shows a typical indexing pipeline where DocumentCleaner cleans documents after conversion and before splitting.

# haystack-pipeline
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      remove_substrings:
      remove_regex:
      keep_id: false
      unicode_normalization:
      ascii_only: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - TextFileToDocument.sources

Parameters

Inputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | List of Documents to clean. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | List of cleaned documents. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter. |
| remove_substrings | Optional[List[str]] | None | List of substrings to remove from the text. |
| remove_regex | Optional[str] | None | Regex to match and replace substrings by "". |
| keep_id | bool | False | If True, keeps the IDs of the original documents. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. Note: This will run before any other steps. |
| ascii_only | bool | False | Whether to convert the text to ASCII only. Will remove accents from characters and replace them with ASCII characters. Other non-ASCII characters will be removed. Note: This will run before any pattern matching or removal. |
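
The ordering notes above (Unicode normalization first, ASCII conversion before any pattern matching or removal) can be illustrated with a small standard-library sketch. The function `clean_text` is hypothetical and only mirrors the parameter names; the relative order of the whitespace, empty-line, substring, and regex steps is an assumption here, and the whitespace handling is simplified, so treat this as an illustration rather than the component's implementation.

```python
import re
import unicodedata
from typing import List, Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
    unicode_normalization: Optional[str] = None,
    ascii_only: bool = False,
) -> str:
    # 1. Unicode normalization runs before every other step.
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    # 2. ASCII conversion runs before pattern matching or removal:
    #    decompose accents, then drop anything that is not ASCII.
    if ascii_only:
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Remaining steps (order assumed for this sketch).
    if remove_extra_whitespaces:
        text = re.sub(r"[ \t]+", " ", text).strip()
    if remove_empty_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


raw = "Caf\u00e9   au lait [draft]\n\n\nSecond line\n"
print(repr(clean_text(raw, remove_regex=r"\s*\[draft\]", ascii_only=True)))
```

One practical consequence of this ordering is that a remove_regex or remove_substrings value should be written against the already normalized text. With ascii_only enabled, for example, a pattern containing "é" would no longer match anything.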

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | List of Documents to clean. |