CSVDocumentCleaner
Clean CSV documents by removing empty rows and columns.
Basic Information
- Type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
- Components it can connect with:
  - Converters: CSVDocumentCleaner receives CSV documents from converters like CSVToDocument.
  - CSVDocumentSplitter: CSVDocumentCleaner can send cleaned CSV documents to CSVDocumentSplitter for splitting.
  - DocumentWriter: CSVDocumentCleaner can send cleaned documents to DocumentWriter for storage.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of cleaned documents with empty rows and columns removed. |
Overview
CSVDocumentCleaner processes CSV content stored in documents, removing empty rows and columns to create cleaner data for downstream processing.
The component performs the following steps:
- Reads each document's content as a CSV table.
- Sets aside the number of rows from the top specified in ignore_rows and the number of columns from the left specified in ignore_columns.
- Drops any remaining rows and columns that are entirely empty (if enabled).
- Reattaches the ignored rows and columns to maintain their original positions.
- Returns the cleaned CSV content as a new document.
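The steps above can be sketched with Python's standard csv module. This is a minimal illustration of the described behavior, not the component's actual implementation: the function name clean_csv is hypothetical, and treating whitespace-only cells as empty is an assumption. Its keyword arguments simply mirror the init parameters documented on this page.

```python
import csv
import io


def clean_csv(content, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Hypothetical sketch of the cleaning steps; not the library code."""
    rows = list(csv.reader(io.StringIO(content)))
    if not rows:
        return ""
    # Pad short rows so every row has the same number of cells.
    width = max(len(r) for r in rows)
    table = [r + [""] * (width - len(r)) for r in rows]

    def is_empty(cells):
        # Assumption: whitespace-only cells count as empty.
        return all(c.strip() == "" for c in cells)

    # Ignored rows are always kept; other rows are judged only on
    # their non-ignored columns.
    keep_rows = [
        i for i, row in enumerate(table)
        if i < ignore_rows
        or not remove_empty_rows
        or not is_empty(row[ignore_columns:])
    ]
    # Ignored columns are always kept; other columns are judged only
    # on their non-ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or not remove_empty_columns
        or not is_empty(table[i][j] for i in range(ignore_rows, len(table)))
    ]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([table[i][j] for j in keep_cols])
    return out.getvalue()


sample = "id,name,,score\n1,Ada,,90\n,,,\n2,Bob,,85\n"
print(clean_csv(sample))
```

With the default settings, the blank third row and the blank third column of sample are dropped and the remaining cells keep their order.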
Use this component in indexes to prepare CSV data before splitting or embedding.
Usage Example
Using the Component in an Index
This example shows an index that cleans CSV documents and then writes them to a document store.
components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  CSVDocumentCleaner:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - CSVToDocument.sources
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_rows | int | 0 | Number of rows to ignore from the top of the CSV table before processing. |
| ignore_columns | int | 0 | Number of columns to ignore from the left of the CSV table before processing. |
| remove_empty_rows | bool | True | Whether to remove rows that are entirely empty. |
| remove_empty_columns | bool | True | Whether to remove columns that are entirely empty. |
| keep_id | bool | False | Whether to retain the original document ID in the output document. |

Rows and columns ignored using ignore_rows and ignore_columns are preserved in the final output, meaning they are not considered when removing empty rows and columns.
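
To make the effect of ignore_rows on the emptiness check concrete, here is a small hedged sketch. The helper droppable_columns is hypothetical and only reports which columns would count as entirely empty once the ignored rows and columns are excluded from inspection. It does not model how the real component reassembles the output.

```python
def droppable_columns(table, ignore_rows=0, ignore_columns=0):
    """Hypothetical helper: indices of columns seen as entirely empty."""
    # Only rows after the ignored ones take part in the check.
    body = table[ignore_rows:]
    width = max((len(row) for row in table), default=0)
    return [
        j for j in range(ignore_columns, width)
        if all(j >= len(row) or row[j].strip() == "" for row in body)
    ]


table = [
    ["Quarterly report", "", ""],   # title row
    ["region", "sales", "notes"],   # header row
    ["north", "10", ""],
    ["south", "12", ""],
]

# With no ignored rows, the "notes" header counts as content.
print(droppable_columns(table))
# With the first two rows ignored, only the data rows are inspected.
print(droppable_columns(table, ignore_rows=2))
```

In this sketch the third column counts as empty only when the title and header rows are ignored, because the check then looks at the data rows alone.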
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |