DocumentCleaner

Cleans the text in the documents.

Basic Information

  • Type: haystack.components.preprocessors.document_cleaner.DocumentCleaner

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Documents to clean. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of cleaned Documents, returned in a dictionary under the documents key. |

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

DocumentCleaner removes extra whitespace, empty lines, specified substrings, regex matches, and repeated page headers and footers, in that order. If configured, Unicode normalization and ASCII conversion run before these steps.

Usage Example

components:
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters: {}
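
As a slightly fuller sketch, the same component can be declared with its documented default values written out explicitly. The parameter names and defaults are taken from the Init Parameters table. The type path is an assumption that the component is loaded from Haystack's haystack.components.preprocessors module, so adjust it if your setup differs.

```yaml
components:
  DocumentCleaner:
    # Assumed module path; verify it against your installation.
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      remove_substrings: null
      remove_regex: null
      keep_id: false
      unicode_normalization: null
      ascii_only: false
```

Spelling out the defaults changes nothing at runtime, but it makes the cleaning behavior visible to anyone reading the pipeline file.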

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespace. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter. |
| remove_substrings | Optional[List[str]] | None | List of substrings to remove from the text. |
| remove_regex | Optional[str] | None | Regex pattern. Every match is replaced with an empty string (""). |
| keep_id | bool | False | If True, keeps the IDs of the original documents. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. Note: This runs before any other step. |
| ascii_only | bool | False | If True, converts the text to ASCII only: accented characters are replaced with their unaccented ASCII equivalents, and other non-ASCII characters are removed. Note: This runs before any pattern matching or removal. |
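
To show how these parameters combine, here is a minimal sketch of a configuration for documents converted from paginated files. The substring and regex values are hypothetical placeholders chosen for illustration, and the type path is assumed to be Haystack's haystack.components.preprocessors module. Everything else uses the parameter names from the table above.

```yaml
components:
  DocumentCleaner:
    # Assumed module path; verify it against your installation.
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      # Runs first: fold compatibility characters into canonical forms.
      unicode_normalization: NFKC
      ascii_only: false
      remove_extra_whitespaces: true
      remove_empty_lines: true
      # Hypothetical watermark text to strip from every document.
      remove_substrings:
        - "CONFIDENTIAL"
      # Hypothetical pattern for bracketed citation markers such as [12].
      # Single quotes keep the backslashes literal in YAML.
      remove_regex: '\[\d+\]'
      # Only effective when pages are separated by a form feed ("\f").
      remove_repeated_substrings: true
      keep_id: false
```

Single-quoting the regex avoids YAML escape processing, so the pattern reaches the component exactly as written.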

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Documents to clean. |