CSVDocumentCleaner

Clean CSV documents by removing empty rows and columns. Use this component in indexing pipelines to prepare CSV data before splitting or embedding.

Key Features

  • Removes entirely empty rows and columns from CSV content.
  • Preserves a specified number of rows from the top and columns from the left before processing.
  • Reattaches the preserved rows and columns to maintain their original positions in the output.
  • Optionally retains the original document ID.
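
The behavior listed above can be approximated in a few lines of standard-library Python. This is a minimal sketch for illustration only, not the component's actual implementation: the function name `clean_csv` is hypothetical, and the sketch assumes that emptiness is judged only on cells outside the preserved rows and columns, and that a cell counts as empty when it contains nothing but whitespace.

```python
import csv
import io


def clean_csv(text, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Illustrative sketch: drop empty rows/columns outside the preserved region."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad ragged rows so every row has the same number of cells.
    grid = [r + [""] * (width - len(r)) for r in rows]

    first_row = min(ignore_rows, len(grid))
    first_col = min(ignore_columns, width)
    body_rows = range(first_row, len(grid))
    body_cols = range(first_col, width)

    def is_empty(cell):
        return cell.strip() == ""

    # Preserved rows are always kept; body rows are kept if they have content.
    keep_rows = list(range(first_row))
    for i in body_rows:
        if not remove_empty_rows or any(not is_empty(grid[i][j]) for j in body_cols):
            keep_rows.append(i)

    # Preserved columns are always kept; body columns are kept if they have content.
    keep_cols = list(range(first_col))
    for j in body_cols:
        if not remove_empty_columns or any(not is_empty(grid[i][j]) for i in body_rows):
            keep_cols.append(j)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([grid[i][j] for j in keep_cols])
    return out.getvalue()


sample = "id,name,,score\n,,,\n1,Ada,,90\n2,Bob,,85\n"
print(clean_csv(sample, ignore_rows=1))
# id,name,score
# 1,Ada,90
# 2,Bob,85
```

In this sample, the header row is preserved through `ignore_rows=1`, the blank second row is dropped, and the third column is dropped because it has no values below the header.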

Configuration

  1. Drag the CSVDocumentCleaner component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the component behavior:
    • Set Ignore Rows to specify how many rows from the top of the CSV table to preserve before processing.
    • Set Ignore Columns to specify how many columns from the left to preserve before processing.
    • Toggle Remove Empty Rows to remove rows that are entirely empty.
    • Toggle Remove Empty Columns to remove columns that are entirely empty.
    • Toggle Keep ID to retain the original document ID in the output document.

Connections

CSVDocumentCleaner accepts a list of Document objects containing CSV-formatted content. It outputs cleaned Document objects with empty rows and columns removed.

It typically receives documents from converters like CSVToDocument and sends cleaned documents to CSVDocumentSplitter or DocumentWriter.

Source Code

To check this component's source code, open csv_document_cleaner.py in the Haystack repository.

Usage Examples

Basic Configuration

CSVDocumentCleaner:
  type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
  init_parameters:
    ignore_rows: 0
    ignore_columns: 0
    remove_empty_rows: true
    remove_empty_columns: true
    keep_id: false

Using the Component in an Index

This example shows an index that cleans CSV documents and then writes them to a document store.

# haystack-pipeline
components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
          method:
          mappings:
          settings:
            index.knn: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: NONE
  CSVDocumentCleaner:
    type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false

connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - CSVToDocument.sources
  documents:
    - CSVToDocument.documents



Parameters

Inputs

Parameter | Type | Description
documents | List[Document] | List of Documents containing CSV-formatted content.

Outputs

Parameter | Type | Description
documents | List[Document] | List of cleaned documents with empty rows and columns removed.

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter | Type | Default | Description
ignore_rows | int | 0 | Number of rows to ignore from the top of the CSV table before processing.
ignore_columns | int | 0 | Number of columns to ignore from the left of the CSV table before processing.
remove_empty_rows | bool | True | Whether to remove rows that are entirely empty.
remove_empty_columns | bool | True | Whether to remove columns that are entirely empty.
keep_id | bool | False | Whether to retain the original document ID in the output document.

Rows and columns ignored with ignore_rows and ignore_columns are preserved in the final output and are not considered when removing empty rows and columns.

Run Method Parameters

These are the parameters you can configure for the component's run() method. You can pass them at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Description
documents | List[Document] | List of Documents containing CSV-formatted content.