DocumentCleaner Parameters

Customize DocumentCleaner to preprocess your text documents.

YAML Init Parameters

These are the parameters you can pass to this component in the pipeline YAML configuration:

| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| remove_empty_lines | Boolean | True, False. Default: True | Removes empty lines. Required. |
| remove_extra_whitespaces | Boolean | True, False. Default: True | Removes extra whitespaces. Required. |
| remove_repeated_substrings | Boolean | True, False. Default: False | Removes repeated substrings (headers and footers) from pages. Pages in the text must be separated by the form feed character \f, which TextFileToDocument and AzureOCRDocumentConverter support. Required. |
| keep_id | Boolean | True, False. Default: False | Keeps the IDs of the original documents. Required. |
| remove_substrings | List of strings | Default: None | List of substrings to remove from the text. Optional. |
| remove_regex | String | Default: None | Regex to match. Each matching substring is replaced with "" (an empty string). Optional. |
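
As an illustration, the parameters above might be set in a pipeline YAML like the sketch below. It assumes the serialization layout and import path used by open-source Haystack 2.x; your pipeline's exact schema may differ. The component name document_cleaner and the values given for remove_substrings and remove_regex are hypothetical placeholders.

```yaml
components:
  document_cleaner:  # hypothetical component name
    # Assumed import path, taken from open-source Haystack 2.x
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      # Only useful when pages are separated by the form feed character
      remove_repeated_substrings: false
      keep_id: false
      # Illustrative values only
      remove_substrings:
        - "CONFIDENTIAL"
      # Single quotes keep the backslashes literal in YAML
      remove_regex: 'Page \d+ of \d+'
```

Parameters left out of init_parameters fall back to the defaults listed in the table.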

REST API Runtime Parameters

There are no runtime parameters you can pass to this component when making a request to the Search REST API endpoint.