CSVDocumentCleaner
Clean CSV documents by removing empty rows and columns. Use this component in indexing pipelines to prepare CSV data before splitting or embedding.
Key Features
- Removes entirely empty rows and columns from CSV content.
- Preserves a specified number of rows from the top and columns from the left before processing.
- Reattaches the preserved rows and columns to maintain their original positions in the output.
- Optionally retains the original document ID.
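The features above can be illustrated with a minimal standard-library sketch. This is not the Haystack implementation: the function name `clean_csv` is hypothetical, and treating whitespace-only cells as empty is an assumption. It only mirrors the behavior described here: rows and columns inside the ignored region are always kept, and emptiness is judged on the remaining cells alone.

```python
import csv
import io


def clean_csv(text, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Hypothetical sketch of the cleaning logic, not Haystack's own code."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""

    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    def is_empty(cells):
        # Assumption: whitespace-only cells count as empty.
        return all(cell.strip() == "" for cell in cells)

    # A row survives if it is in the ignored region, if row removal is off,
    # or if it has at least one non-empty cell outside the ignored columns.
    keep_rows = [
        i for i, row in enumerate(rows)
        if i < ignore_rows
        or not remove_empty_rows
        or not is_empty(row[ignore_columns:])
    ]

    # The same rule for columns, judged only on rows below the ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or not remove_empty_columns
        or not is_empty([rows[i][j] for i in range(ignore_rows, len(rows))])
    ]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([rows[i][j] for j in keep_cols])
    return out.getvalue()


# The empty middle row and the empty middle column are both removed.
print(repr(clean_csv("name,,age\n,,\nAda,,36\n")))
# 'name,age\nAda,36\n'

# With ignore_rows=1, the empty first row is preserved in place.
print(repr(clean_csv(",,\nname,,age\nAda,,36\n", ignore_rows=1)))
# ',\nname,age\nAda,36\n'
```

In the second call, the preserved first row still loses its middle cell, because that column is empty in every row below the ignored region and is therefore dropped from the whole table.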
Configuration
- Drag the CSVDocumentCleaner component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the component behavior:
  - Set Ignore Rows to specify how many rows from the top of the CSV table to preserve before processing.
  - Set Ignore Columns to specify how many columns from the left to preserve before processing.
  - Toggle Remove Empty Rows to remove rows that are entirely empty.
  - Toggle Remove Empty Columns to remove columns that are entirely empty.
  - Toggle Keep ID to retain the original document ID in the output document.
Connections
CSVDocumentCleaner accepts a list of Document objects containing CSV-formatted content. It outputs cleaned Document objects with empty rows and columns removed.
It typically receives documents from converters like CSVToDocument and sends cleaned documents to CSVDocumentSplitter or DocumentWriter.
Source Code
To check this component's source code, open csv_document_cleaner.py in the Haystack repository.
Usage Examples
Basic Configuration
CSVDocumentCleaner:
  type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
  init_parameters:
    ignore_rows: 0
    ignore_columns: 0
    remove_empty_rows: true
    remove_empty_columns: true
    keep_id: false
Using the Component in an Index
This example shows an index that cleans CSV documents and then writes them to a document store.
# haystack-pipeline
components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
          method:
          mappings:
          settings:
            index.knn: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: NONE
  CSVDocumentCleaner:
    type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - CSVToDocument.sources
  documents:
    - CSVToDocument.documents
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of Documents containing CSV-formatted content. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of cleaned documents with empty rows and columns removed. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_rows | int | 0 | Number of rows to ignore from the top of the CSV table before processing. |
| ignore_columns | int | 0 | Number of columns to ignore from the left of the CSV table before processing. |
| remove_empty_rows | bool | True | Whether to remove rows that are entirely empty. |
| remove_empty_columns | bool | True | Whether to remove columns that are entirely empty. |
| keep_id | bool | False | Whether to retain the original document ID in the output document. |

Rows and columns ignored with ignore_rows and ignore_columns are preserved in the final output and are not considered when removing empty rows and columns.
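To show why keep_id can matter, here is a small sketch under a stated assumption: that a document created without an explicit ID receives one derived from its content, so cleaning the content would otherwise produce a different ID. The `Doc` class, `content_id`, and `with_cleaned_content` are hypothetical stand-ins, and the hashing scheme is illustrative rather than Haystack's actual ID function.

```python
import hashlib
from dataclasses import dataclass


def content_id(content: str) -> str:
    # Illustrative only: derive an ID from the content with SHA-256.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class Doc:
    """Hypothetical stand-in for a document with a content-derived ID."""
    content: str
    id: str = ""

    def __post_init__(self):
        # Assumption: without an explicit ID, one is computed from the content.
        if not self.id:
            self.id = content_id(self.content)


def with_cleaned_content(doc: Doc, cleaned: str, keep_id: bool = False) -> Doc:
    # With keep_id=True the original ID is carried over;
    # otherwise the new document gets a fresh content-derived ID.
    return Doc(content=cleaned, id=doc.id if keep_id else "")


original = Doc("a,,b\n")
regenerated = with_cleaned_content(original, "a,b\n")
preserved = with_cleaned_content(original, "a,b\n", keep_id=True)

print(regenerated.id == original.id)  # False
print(preserved.id == original.id)    # True
```

Keeping the ID can help when downstream components or stored records refer to the document by the ID it had before cleaning.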
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of Documents containing CSV-formatted content. |