Customize DocumentCleaner to preprocess your text documents.
YAML Init Parameters
These are the parameters you can pass to this component in the pipeline YAML configuration:
Parameter | Type | Possible values | Description |
---|---|---|---|
remove_empty_lines | Boolean | True, False. Default: True | Removes empty lines. Required. |
remove_extra_whitespaces | Boolean | True, False. Default: True | Removes extra whitespaces. Required. |
remove_repeated_substrings | Boolean | True, False. Default: False | Removes repeated substrings, such as headers and footers, from pages. Pages in the text must be separated by the form feed character \\f, which TextFileToDocument and AzureOCRDocumentConverter support. Required. |
keep_id | Boolean | True, False. Default: False | Keeps the IDs of the original documents. Required. |
remove_substrings | List of strings | Default: None | List of substrings to remove from the text. Optional. |
remove_regex | String | Default: None | Regular expression whose matches are replaced with an empty string (""). Optional. |
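
To make the effect of these parameters concrete, here is a minimal standard-library sketch that applies similar cleaning steps to a plain string. It is an illustration only, not the component's implementation: `clean_text` is a hypothetical helper, the order of the steps is an assumption, and `remove_repeated_substrings` and `keep_id` are left out because they depend on page and document handling.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical helper that mimics the documented parameters on a string."""
    if remove_regex:
        # Replace every match of the regular expression with an empty string.
        text = re.sub(remove_regex, "", text)
    if remove_substrings:
        # Delete each listed substring wherever it occurs.
        for substring in remove_substrings:
            text = text.replace(substring, "")
    if remove_extra_whitespaces:
        # Collapse runs of spaces and tabs inside each line and trim its edges.
        # Assumption: line breaks are preserved in this sketch.
        text = "\n".join(
            re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")
        )
    if remove_empty_lines:
        # Drop lines that contain nothing but whitespace.
        text = "\n".join(line for line in text.split("\n") if line.strip())
    return text


raw = "Report  2023\n\n\nPage 1 of 9\nTotal:   42 units [draft]\n"
cleaned = clean_text(
    raw,
    remove_substrings=["[draft]"],
    remove_regex=r"Page \d+ of \d+",
)
print(cleaned)
```

With this input, the sketch removes the page marker and the `[draft]` tag, collapses the repeated spaces, and drops the blank lines, which leaves two lines of text. The actual component may differ in details such as step order and how it treats line breaks.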
REST API Runtime Parameters
This component takes no runtime parameters when you make a request to the Search REST API endpoint.