CSVDocumentCleaner
A component for cleaning CSV documents by removing empty rows and columns.
Basic Information
- Type: haystack_integrations.preprocessors.csv_document_cleaner.CSVDocumentCleaner
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A dictionary with a list of cleaned Documents under the key "documents". |

Processing steps:
1. Reads each document's content as a CSV table.
2. Sets aside the first ignore_rows rows from the top and the first ignore_columns columns from the left.
3. Drops any rows and columns in the remaining table that are entirely empty (if enabled by remove_empty_rows and remove_empty_columns).
4. Reattaches the ignored rows and columns so that they keep their original positions.
5. Returns the cleaned CSV content as a new Document object, with an option to retain the original document ID.
Overview
Bear with us while we work on adding pipeline examples and the most common component connections.
This component processes CSV content stored in Documents. You can tell it to skip a given number of rows from the top and columns from the left before it cleans the rest of the table. You can also choose whether to keep the original document IDs and whether to remove empty rows, empty columns, or both.
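
To make the cleaning behavior concrete, here is a minimal sketch in plain Python that mimics the processing steps described for this component. It is an illustration only, not the component's actual implementation: the function name `clean_csv` is hypothetical, it uses only the standard `csv` module, and it assumes that a cell counts as empty only when it is an empty string.

```python
import csv
import io
from typing import Iterable, List


def clean_csv(
    content: str,
    ignore_rows: int = 0,
    ignore_columns: int = 0,
    remove_empty_rows: bool = True,
    remove_empty_columns: bool = True,
) -> str:
    """Hypothetical helper that imitates the documented cleaning steps."""
    rows: List[List[str]] = list(csv.reader(io.StringIO(content)))
    if not rows:
        return content

    # Pad ragged rows so that every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    # Assumption: if the ignored area covers the whole table, nothing is cleaned.
    if ignore_rows >= len(rows) or ignore_columns >= width:
        return content

    def is_empty(cells: Iterable[str]) -> bool:
        return all(cell == "" for cell in cells)

    # Ignored rows are always kept. Other rows are judged only by the cells
    # to the right of the ignored columns.
    keep_row = [
        i < ignore_rows
        or not remove_empty_rows
        or not is_empty(row[ignore_columns:])
        for i, row in enumerate(rows)
    ]
    # Ignored columns are always kept. Other columns are judged only by the
    # cells below the ignored rows.
    keep_col = [
        j < ignore_columns
        or not remove_empty_columns
        or not is_empty(row[j] for row in rows[ignore_rows:])
        for j in range(width)
    ]

    # Write the surviving cells back out in their original order.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row, keep in zip(rows, keep_row):
        if keep:
            writer.writerow([cell for cell, k in zip(row, keep_col) if k])
    return out.getvalue()


raw = "title,,\n,,\nname,,age\n,,\nAda,,36\n"
print(clean_csv(raw, ignore_rows=1))
# title,
# name,age
# Ada,36
```

In this example, the first row is ignored, so it stays in place even though most of its cells are empty. The empty middle column is still dropped from the whole table, including the ignored row, because the column is empty everywhere below that row.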
Usage Example
components:
  CSVDocumentCleaner:
    type: components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_rows | int | 0 | Number of rows to ignore from the top of the CSV table before processing. |
| ignore_columns | int | 0 | Number of columns to ignore from the left of the CSV table before processing. |
| remove_empty_rows | bool | True | Whether to remove rows that are entirely empty. |
| remove_empty_columns | bool | True | Whether to remove columns that are entirely empty. |
| keep_id | bool | False | Whether to retain the original document ID in the output document. |

Rows and columns skipped with ignore_rows and ignore_columns are preserved in the final output and are not considered when removing empty rows and columns.
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |