DocumentCleaner

Cleans the text in the documents.

Basic Information

Type: haystack.components.preprocessors.document_cleaner.DocumentCleaner

Inputs

Parameter	Type	Default	Description
documents	List[Document]		List of Documents to clean.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		List of cleaned Documents, returned in a dictionary under the `documents` key.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

DocumentCleaner removes extra whitespace, empty lines, specified substrings, regex matches, and repeated page headers and footers, in this order.
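The order of the steps matters, because each one works on the output of the previous one. The following is a minimal standard-library sketch of that order, not the component's actual implementation. The function names (`clean_text`, `strip_repeated_edges`) and the `remove_repeated` flag are hypothetical, and the header and footer detection is deliberately naive: it only drops a first or last line that is identical on every page.

```python
import re


def strip_repeated_edges(pages):
    """Drop a page's first or last line only if it is identical on every page."""
    split_pages = [page.split("\n") for page in pages]
    if len(split_pages) < 2:
        return pages
    for edge in (0, -1):  # 0 = header candidate, -1 = footer candidate
        edge_lines = {lines[edge] for lines in split_pages}
        if len(edge_lines) == 1 and all(len(lines) > 1 for lines in split_pages):
            for lines in split_pages:
                del lines[edge]
    return ["\n".join(lines) for lines in split_pages]


def clean_text(text, remove_substrings=(), remove_regex=None, remove_repeated=False):
    """Apply the cleaning steps in the documented order (simplified)."""
    pages = text.split("\f")  # pages are separated by form feed characters

    # 1. Extra whitespace: collapse runs of spaces and tabs inside each line.
    pages = [
        "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in page.split("\n"))
        for page in pages
    ]

    # 2. Empty lines: keep only lines that still have content.
    pages = ["\n".join(line for line in page.split("\n") if line) for page in pages]
    text = "\f".join(pages)

    # 3. Specified substrings are removed verbatim.
    for substring in remove_substrings:
        text = text.replace(substring, "")

    # 4. Every regex match is replaced with an empty string.
    if remove_regex:
        text = re.sub(remove_regex, "", text)

    # 5. Repeated page headers and footers (naive first/last line comparison).
    if remove_repeated:
        text = "\f".join(strip_repeated_edges(text.split("\f")))

    return text


sample = (
    "Report 2024\nHello   world\n\n\nSee DRAFT notes [1]\nPage footer"
    "\f"
    "Report 2024\nSecond    page\nPage footer"
)
cleaned = clean_text(
    sample,
    remove_substrings=["DRAFT "],
    remove_regex=r"\s*\[\d+\]",
    remove_repeated=True,
)
print(cleaned.split("\f"))  # ['Hello world\nSee notes', 'Second page']
```

In this sketch the regex runs after the substring removal, so a pattern has to match the text as it looks once the earlier steps have been applied.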

Usage Example

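components:
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      # Defaults from the Init Parameters table, written out explicitly
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false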

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
remove_empty_lines	bool	True	If `True`, removes empty lines.
remove_extra_whitespaces	bool	True	If `True`, removes extra whitespaces.
remove_repeated_substrings	bool	False	If `True`, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
remove_substrings	Optional[List[str]]	None	List of substrings to remove from the text.
remove_regex	Optional[str]	None	Regex pattern to match. Every match is replaced with an empty string ("").
keep_id	bool	False	If `True`, keeps the IDs of the original documents.
unicode_normalization	Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']]	None	Unicode normalization form to apply to the text. Note: This runs before any other step.
ascii_only	bool	False	If `True`, converts the text to ASCII only: accented characters are replaced with their unaccented ASCII equivalents, and all other non-ASCII characters are removed. Note: This runs before any pattern matching or removal.
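To make the effect of `unicode_normalization` and `ascii_only` more concrete, here is a small sketch built on Python's standard `unicodedata` module. The helper names `normalize_unicode` and `to_ascii` are hypothetical, and the approach (decompose, then drop non-ASCII bytes) is an assumption about how such a conversion can be done, not a statement about the component's internals.

```python
import re
import unicodedata


def normalize_unicode(text, form):
    """Apply one of the four Unicode normalization forms (NFC, NFKC, NFD, NFKD)."""
    return unicodedata.normalize(form, text)


def to_ascii(text):
    """Strip accents, then drop whatever is still not ASCII."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards the marks and other non-ASCII characters.
    return decomposed.encode("ascii", "ignore").decode("ascii")


composed = normalize_unicode("e\u0301", "NFC")    # "e" + combining acute -> one character
ligature = normalize_unicode("\ufb01le", "NFKC")  # "fi" ligature -> plain "fi"
plain = to_ascii("Crème brûlée")

# ASCII conversion happens before pattern matching, so a pattern that
# contains an accented character finds nothing to remove afterwards.
after_regex = re.sub("è", "", plain)

print(len(composed), ligature, plain, after_regex)
```

A practical consequence: when `ascii_only` is enabled, write `remove_substrings` and `remove_regex` against the already converted text, for example `Creme` rather than `Crème`.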

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		List of Documents to clean.
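The input and output contract, together with the `keep_id` option, can be sketched as follows. `Doc` is a hypothetical stand-in for a Document, and deriving a missing ID from a hash of the content is an assumption made only for this illustration. The whitespace collapse is a placeholder for the real cleaning steps.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: some text and an ID."""

    content: str
    id: str = ""

    def __post_init__(self):
        # Assumption for this sketch: a missing ID is derived from the content.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def run(documents, keep_id=False):
    """Return cleaned copies of the documents under the `documents` key."""
    cleaned = []
    for doc in documents:
        text = " ".join(doc.content.split())  # placeholder for the real cleaning
        cleaned.append(Doc(content=text, id=doc.id if keep_id else ""))
    return {"documents": cleaned}


original = Doc("Hello   world")
result = run([original])
kept = run([original], keep_id=True)

print(result["documents"][0].content)             # Hello world
print(result["documents"][0].id == original.id)   # False
print(kept["documents"][0].id == original.id)     # True
```

In this sketch the input documents are left untouched and new objects are returned, so anything that tracks documents by ID downstream would need `keep_id` set to `True` to see the same IDs after cleaning.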
