DocumentCleaner Parameters

Customize DocumentCleaner to preprocess your text documents.

YAML Init Parameters

These are the parameters you can pass to this component in the pipeline YAML configuration:

Parameter

Type

Possible values

Description

remove_empty_lines

Boolean

True, False
Default: True

Removes empty lines.
Required.

remove_extra_whitespaces

Boolean

True, False
Default: True

Removes extra whitespace.
Required.

remove_repeated_substrings

Boolean

True, False
Default: False

Removes repeated substrings, such as headers and footers, from pages. Pages in the text must be separated by a form feed character (\f), which TextFileToDocument and AzureOCRDocumentConverter support.
Required.

keep_id

Boolean

True, False
Default: False

Keeps the IDs of the original documents.
Required.

remove_substrings

List of strings

Default: None

List of substrings to remove from the text.
Optional.

remove_regex

String

Default: None

Regular expression to match. Substrings that match it are replaced with an empty string ("").
Optional.
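
As a rough illustration of how the parameters above might be set together, here is a minimal sketch of a component entry in a pipeline YAML file. It assumes the Haystack 2.x serialization layout (components, type, init_parameters) and that the type path shown matches your installed version. The component name document_cleaner, the substring, and the regular expression are hypothetical values chosen for the example.

```yaml
components:
  document_cleaner:  # hypothetical component name, pick your own
    # Assumed type path, based on the Haystack 2.x module layout
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      # Only has an effect when pages are separated by a form feed character (\f)
      remove_repeated_substrings: true
      keep_id: false
      # Illustrative values, not defaults (both parameters default to None)
      remove_substrings:
        - "CONFIDENTIAL"
      # Single quotes keep the backslashes literal in YAML
      remove_regex: 'Page \d+ of \d+'
```

Leaving out remove_substrings and remove_regex keeps their default of None, so no substring or pattern removal is applied.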


REST API Runtime Parameters

There are no runtime parameters you can pass to this component when making a request to the Search REST API endpoint.