CSVDocumentSplitter

Split CSV documents into sub-tables based on empty rows or columns.

Key Features

  • Splits CSV documents into smaller sub-tables for easier processing and embedding.
  • Supports threshold mode: splits where the number of consecutive empty rows or columns reaches a configurable threshold.
  • Supports row-wise mode: splits each row into a separate sub-table.
  • Tracks split metadata including source_id, row_idx_start, col_idx_start, and split_id.
  • Accepts custom pandas.read_csv keyword arguments for flexible CSV parsing.

Configuration

  1. Drag the CSVDocumentSplitter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

CSVDocumentSplitter accepts a list of documents (documents) containing CSV-formatted content as input and outputs a list of split documents (documents), each representing an extracted sub-table.

Connect CSVDocumentCleaner or a CSV converter such as CSVToDocument to the input. Connect the output to document embedders or DocumentWriter for storage.

Usage Example

Using the Component in an Index

This example shows an index that converts CSV files to documents, cleans and splits them, and then writes the resulting sub-tables to a document store.

components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  CSVDocumentCleaner:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
  CSVDocumentSplitter:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_splitter.CSVDocumentSplitter
    init_parameters:
      row_split_threshold: 2
      column_split_threshold: 2
      read_csv_kwargs:
      split_mode: threshold
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: CSVDocumentSplitter.documents
  - sender: CSVDocumentSplitter.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - CSVToDocument.sources

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents, each representing an extracted sub-table from the original CSV. Document metadata includes source_id, row_idx_start, col_idx_start, and split_id. |
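The metadata fields listed above make it possible to trace each sub-table back to its source document. The following is a minimal sketch, not the component's own code: it uses plain dictionaries as stand-ins for Document objects, the `group_by_source` helper and all sample values are hypothetical, and it assumes split_id is an integer that reflects the order of the sub-tables.

```python
from collections import defaultdict


def group_by_source(split_docs):
    """Group split documents by source_id and order each group by split_id."""
    groups = defaultdict(list)
    for doc in split_docs:
        groups[doc["meta"]["source_id"]].append(doc)
    for docs in groups.values():
        docs.sort(key=lambda d: d["meta"]["split_id"])
    return dict(groups)


# Hypothetical splitter output: dict stand-ins, not real Document objects.
split_docs = [
    {"content": "c,d\n3,4\n",
     "meta": {"source_id": "doc-1", "row_idx_start": 4, "col_idx_start": 0, "split_id": 1}},
    {"content": "a,b\n1,2\n",
     "meta": {"source_id": "doc-1", "row_idx_start": 0, "col_idx_start": 0, "split_id": 0}},
    {"content": "x,y\n",
     "meta": {"source_id": "doc-2", "row_idx_start": 0, "col_idx_start": 0, "split_id": 0}},
]

grouped = group_by_source(split_docs)
for source_id, docs in grouped.items():
    starts = [d["meta"]["row_idx_start"] for d in docs]
    print(source_id, "sub-tables start at rows", starts)
```

Grouping by source_id and sorting by split_id is one way to reassemble or audit the sub-tables of a single CSV after retrieval.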

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| row_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty rows required to trigger a split. |
| column_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty columns required to trigger a split. |
| read_csv_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments to pass to pandas.read_csv. By default, the component calls pandas.read_csv with these options: header=None; skip_blank_lines=False, to preserve blank lines; and dtype=object, to prevent type inference (for example, converting numbers to floats). See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information. |
| split_mode | SplitMode | threshold | If threshold, the component splits the document wherever the number of consecutive empty rows or columns reaches row_split_threshold or column_split_threshold. If row-wise, the component splits each row into a separate sub-table. |
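To make the threshold semantics concrete, here is a minimal sketch of row-based splitting using only the Python standard library. It is an illustration, not the component's implementation: the function names are hypothetical, it handles rows only (not columns), and it assumes that a gap shorter than the threshold stays inside the sub-table.

```python
import csv
import io


def is_empty(row):
    """A row is empty when it has no cells or every cell is blank."""
    return all(cell.strip() == "" for cell in row)


def split_rows_by_threshold(csv_text, row_split_threshold=2):
    """Split CSV text into sub-tables at runs of empty rows.

    A run of at least `row_split_threshold` consecutive empty rows ends the
    current sub-table. Shorter runs are kept inside it (an assumption).
    Returns a list of (row_idx_start, rows) tuples.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))  # blank lines become []
    tables = []
    current = []   # rows of the sub-table being built
    start = None   # index of its first row in the original CSV
    pending = []   # empty rows seen since the last non-empty row

    for idx, row in enumerate(rows):
        if is_empty(row):
            pending.append(row)
            continue
        if current and len(pending) >= row_split_threshold:
            tables.append((start, current))  # gap is long enough: split
            current = []
        elif current:
            current.extend(pending)          # short gap: keep it in the table
        if not current:
            start = idx
        current.append(row)
        pending = []

    if current:
        tables.append((start, current))
    return tables


sample = "a,b\n1,2\n\n\nc,d\n3,4\n\ne,f\n"
for threshold in (2, 1):
    starts = [start for start, _ in split_rows_by_threshold(sample, threshold)]
    print(f"threshold={threshold}: sub-tables start at rows {starts}")
# threshold=2: sub-tables start at rows [0, 4]
# threshold=1: sub-tables start at rows [0, 4, 7]
```

With a threshold of 2, the single blank line before `e,f` does not trigger a split, so that row stays in the second sub-table. Lowering the threshold to 1 turns every blank line into a split point.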

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |