DocumentCleaner
Cleans the text in the documents.
Basic Information
- Type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to clean. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of cleaned Documents, returned in a dictionary under the key documents. |
Overview
Work in Progress
Bear with us while we add pipeline examples and the most common component connections.
DocumentCleaner removes extra whitespace, empty lines, specified substrings, regex matches, and repeated page headers and footers, in this order. If enabled, Unicode normalization and ASCII conversion run before these steps.
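The order of the cleaning steps can be illustrated with a minimal, standard-library sketch. This is not the component's implementation: `clean_text` is a hypothetical helper, the exact whitespace rules are an assumption, and header and footer removal is left out. It only shows how whitespace cleanup, empty-line removal, substring removal, and regex removal build on each other.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical helper that mimics the documented order of cleaning steps."""
    cleaned_pages = []
    # Assumption: work page by page so form feed ("\f") separators survive.
    for page in text.split("\f"):
        # 1. Extra whitespace: collapse runs of spaces and tabs, trim each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in page.split("\n")]
        # 2. Empty lines: drop lines that are empty after trimming.
        lines = [line for line in lines if line]
        page = "\n".join(lines)
        # 3. Specified substrings: remove each one literally.
        for substring in remove_substrings or []:
            page = page.replace(substring, "")
        # 4. Regex: replace every match with an empty string.
        if remove_regex:
            page = re.sub(remove_regex, "", page)
        cleaned_pages.append(page)
    return "\f".join(cleaned_pages)


raw = "Invoice   #42  \n\n\n  CONFIDENTIAL Total:\t 10 EUR [p. 1]\n"
cleaned = clean_text(
    raw,
    remove_substrings=["CONFIDENTIAL "],
    remove_regex=r"\s*\[p\. \d+\]",
)
print(repr(cleaned))  # 'Invoice #42\nTotal: 10 EUR'
```

Because whitespace is normalized first, substrings and regexes are matched against the already cleaned text, so patterns should be written for single spaces rather than for the raw layout.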
Usage Example
components:
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter. |
| remove_substrings | Optional[List[str]] | None | List of substrings to remove from the text. |
| remove_regex | Optional[str] | None | Regular expression to match. Every match is replaced with an empty string (""). |
| keep_id | bool | False | If True, keeps the IDs of the original documents. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. Note: This will run before any other steps. |
| ascii_only | bool | False | Whether to convert the text to ASCII only. Will remove accents from characters and replace them with ASCII characters. Other non-ASCII characters will be removed. Note: This will run before any pattern matching or removal. |
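The effect of unicode_normalization and ascii_only can be previewed with Python's standard `unicodedata` module. The sketch below is an approximation under stated assumptions, not the component's code: `normalize` and `to_ascii` are hypothetical helper names, and `to_ascii` assumes that accents are stripped by decomposing characters and discarding everything outside ASCII.

```python
import unicodedata


def normalize(text: str, form: str) -> str:
    """Apply one of the four Unicode normalization forms: NFC, NFKC, NFD, NFKD."""
    return unicodedata.normalize(form, text)


def to_ascii(text: str) -> str:
    """Hypothetical approximation of ascii_only."""
    # Split accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks and any other character that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# NFKC replaces compatibility characters such as the "fi" ligature (U+FB01).
print(normalize("\ufb01le", "NFKC"))  # file

# NFD splits "é" (U+00E9) into "e" plus a combining acute accent.
print(len(normalize("\u00e9", "NFD")))  # 2

# Accents are removed; the coffee symbol has no ASCII equivalent and is dropped.
print(repr(to_ascii("Cr\u00e8me br\u00fbl\u00e9e \u2615")))  # 'Creme brulee '
```

Since both options run before pattern matching, values in remove_substrings and remove_regex should target the normalized text, for example "Creme" rather than "Crème" when ascii_only is True.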
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to clean. |