DocumentCleaner
Clean text documents by removing extra whitespace, empty lines, headers, footers, and other unwanted content to prepare them for LLM processing.
Key Features
- Removes empty lines and extra whitespace from document text.
- Removes repeated substrings such as headers and footers across pages.
- Removes specific substrings or patterns matching a regular expression.
- Applies Unicode normalization (NFC, NFKC, NFD, or NFKD).
- Converts text to ASCII only, removing accented and non-ASCII characters.
- Optionally retains the original document ID in the output.
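To make the features above concrete, here is a minimal plain-Python sketch of the kind of text cleanup they describe. It is an illustration only, not the component's implementation: the function names `clean_text` and `strip_repeated_edges` are hypothetical, and the header and footer step assumes the simplest case, where the first or last line is identical on every page and pages are separated by a form feed character ("\f"). The component's own detection of repeated substrings is likely more general.

```python
import re


def clean_text(text, remove_substrings=None, remove_regex=None):
    """Illustrative cleaning pass (hypothetical helper, not the component's code)."""
    # Drop each listed substring verbatim.
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    # Delete everything matched by the regular expression.
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Discard lines that are empty after trimming.
    return "\n".join(line for line in lines if line)


def strip_repeated_edges(text):
    """Remove a first or last line that is identical on every page.

    Assumes pages are separated by the form feed character "\\f".
    """
    pages = [page.strip("\n").split("\n") for page in text.split("\f")]
    if len(pages) < 2:
        return text  # nothing can repeat across a single page
    shared_header = len({page[0] for page in pages}) == 1
    shared_footer = len({page[-1] for page in pages}) == 1
    cleaned = []
    for page in pages:
        body = page
        if shared_header and len(body) > 1:
            body = body[1:]
        if shared_footer and len(body) > 1:
            body = body[:-1]
        cleaned.append("\n".join(body))
    return "\f".join(cleaned)


raw = "Hello   world \n\n\n  DRAFT second line [1]\n"
print(clean_text(raw, remove_substrings=["DRAFT"], remove_regex=r"\[\d+\]"))

paged = "ACME Report\nPage one text\nConfidential\fACME Report\nPage two text\nConfidential"
print(strip_repeated_edges(paged).split("\f"))
```

In this sketch, substring and regex removal run before whitespace cleanup so that gaps left by removed text are collapsed as well. That ordering is a choice made for the example and is not confirmed by this page.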
Configuration
- Drag the DocumentCleaner component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
DocumentCleaner accepts a list of documents (documents) as input and outputs a list of cleaned documents (documents).
Connect converters such as TextFileToDocument, PDFMinerToDocument, or AzureOCRDocumentConverter to the input. Connect the output to DocumentSplitter for chunking, or directly to document embedders if no splitting is needed.
Usage Example
Using the Component in an Index
This example shows a typical indexing pipeline where DocumentCleaner cleans documents after conversion and before splitting.
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      remove_substrings:
      remove_regex:
      keep_id: false
      unicode_normalization:
      ascii_only: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to clean. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of cleaned documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter. |
| remove_substrings | Optional[List[str]] | None | List of substrings to remove from the text. |
| remove_regex | Optional[str] | None | Regular expression whose matches are replaced with an empty string (""). |
| keep_id | bool | False | If True, keeps the IDs of the original documents. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. Note: This will run before any other steps. |
| ascii_only | bool | False | If True, converts the text to ASCII only: accented characters are replaced with their unaccented ASCII equivalents, and other non-ASCII characters are removed. Note: This runs before any pattern matching or removal. |
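The ordering notes for unicode_normalization and ascii_only matter in practice, because a pattern written in plain ASCII only matches accented text after the conversion has run. The sketch below illustrates that order with Python's standard `unicodedata` module. The function `normalize_then_remove` is hypothetical; it borrows the parameter names from the table for readability, and the NFKD-then-encode approach to ASCII conversion is an assumption about one reasonable way to do it, not a statement of how the component works internally.

```python
import re
import unicodedata


def normalize_then_remove(text, unicode_normalization=None, ascii_only=False, remove_regex=None):
    """Hypothetical helper showing the documented order of operations."""
    # Step 1: Unicode normalization runs before any other step.
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    # Step 2: ASCII conversion (assumed approach): decompose accented
    # characters, then drop every code point that is not ASCII.
    if ascii_only:
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Step 3: pattern removal sees the already converted text.
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


sample = "Caf\u00e9 menu"
# With ascii_only, the accented "e" becomes a plain "e" before the regex runs.
print(normalize_then_remove(sample, ascii_only=True, remove_regex="Cafe "))
# Without it, the ASCII pattern does not match the accented text.
print(normalize_then_remove(sample, remove_regex="Cafe "))
```

If a regular expression or substring seems to have no effect, checking whether it was written against the original or the converted text is a reasonable first step.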
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to clean. |