CSVDocumentCleaner

Clean CSV documents by removing empty rows and columns to prepare data for downstream processing.

Key Features

  • Removes entirely empty rows and columns from CSV documents.
  • Preserves a configurable number of header rows and left-most columns before cleaning.
  • Reattaches ignored rows and columns after cleaning to maintain their original positions.
  • Optionally retains the original document ID in the output.
  • Works with any converter that outputs CSV-formatted documents, such as CSVToDocument.

Configuration

  1. Drag the CSVDocumentCleaner component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

CSVDocumentCleaner accepts a list of documents (documents) containing CSV-formatted content as input and outputs a list of cleaned documents (documents).

Connect a CSV converter such as CSVToDocument to the input. Connect the output to CSVDocumentSplitter for further splitting, or directly to DocumentWriter for storage.

Usage Example

Using the Component in an Index

This example shows an index that cleans CSV documents and then writes them to a document store.

components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  CSVDocumentCleaner:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - CSVToDocument.sources

Parameters

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of cleaned documents with empty rows and columns removed. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_rows | int | 0 | Number of rows to ignore from the top of the CSV table before processing. |
| ignore_columns | int | 0 | Number of columns to ignore from the left of the CSV table before processing. |
| remove_empty_rows | bool | True | Whether to remove rows that are entirely empty. |
| remove_empty_columns | bool | True | Whether to remove columns that are entirely empty. |
| keep_id | bool | False | Whether to retain the original document ID in the output document. |

Rows and columns skipped with ignore_rows and ignore_columns are not considered when removing empty rows and columns, and they are preserved in the final output.
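
To make the interaction between these parameters concrete, here is a minimal standard-library sketch of the cleaning behavior described above. It is not the component's implementation: clean_csv is a hypothetical helper, it treats only the empty string as an empty cell, and it assumes that preserved rows are trimmed to the columns that survive cleaning.

```python
import csv
import io


def clean_csv(text, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Hypothetical stand-in for the cleaning step (illustration only)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    grid = [row + [""] * (width - len(row)) for row in rows]

    # Ignored rows are always kept; other rows are judged only on the
    # cells to the right of the ignored columns.
    keep_rows = [
        i for i, row in enumerate(grid)
        if i < ignore_rows
        or not remove_empty_rows
        or any(cell != "" for cell in row[ignore_columns:])
    ]
    # Ignored columns are always kept; other columns are judged only on
    # the cells below the ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or not remove_empty_columns
        or any(grid[i][j] != "" for i in range(ignore_rows, len(grid)))
    ]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([grid[i][j] for j in keep_cols])
    return out.getvalue()


raw = "Q1 report,,\n,,\nid,name,\n1,Ada,\n"

print(clean_csv(raw))
# Q1 report,
# id,name
# 1,Ada

print(clean_csv(raw, ignore_rows=2))
# Q1 report,
# ,
# id,name
# 1,Ada
```

With the defaults, the blank second row and the empty third column are both dropped. With ignore_rows=2, the title row and the blank spacer row stay in place because they are excluded from the emptiness check, while the empty third column is still removed.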

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing CSV-formatted content. |