CSVDocumentSplitter
Split CSV documents into sub-tables based on empty rows or columns. Use this component in indexing pipelines to break large CSV files into smaller, more manageable chunks before embedding.
Key Features
- Splits CSV documents into smaller sub-tables using empty row or column delimiters.
- Supports threshold mode: splits on runs of consecutive empty rows or columns that reach a configurable threshold.
- Supports row-wise mode: splits each row into a separate sub-table document.
- Tracks metadata for each split, including source document ID, starting row and column index, and split order.
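The threshold idea for rows can be sketched in plain Python. This is an illustration only, not the component's implementation: the function name `split_on_empty_rows` and the returned dictionary layout are hypothetical, it uses only the standard `csv` module, and it handles rows only (the component also splits on empty columns).

```python
import csv
import io


def split_on_empty_rows(csv_text: str, row_split_threshold: int = 2) -> list[dict]:
    """Split CSV text wherever a run of empty rows reaches the threshold."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # A row counts as empty when it has no cells or every cell is blank.
    is_empty = [all(cell.strip() == "" for cell in row) for row in rows]

    # First pass: mark empty runs that are long enough to act as separators.
    is_separator = [False] * len(rows)
    i = 0
    while i < len(rows):
        if not is_empty[i]:
            i += 1
            continue
        j = i
        while j < len(rows) and is_empty[j]:
            j += 1
        if j - i >= row_split_threshold:
            for k in range(i, j):
                is_separator[k] = True
        i = j

    # Second pass: group the remaining rows into sub-tables.
    splits: list[dict] = []
    current: list[list[str]] = []
    start = 0
    for idx, row in enumerate(rows):
        if is_separator[idx]:
            if current:
                splits.append({"rows": current, "row_idx_start": start, "split_id": len(splits)})
                current = []
        else:
            if not current:
                start = idx
            current.append(row)
    if current:
        splits.append({"rows": current, "row_idx_start": start, "split_id": len(splits)})
    return splits


parts = split_on_empty_rows("a,b\n1,2\n\n\nc,d\n3,4\n", row_split_threshold=2)
for part in parts:
    print(part["split_id"], part["row_idx_start"], part["rows"])
```

In this sketch, a run of empty rows shorter than the threshold does not trigger a split and stays inside the surrounding sub-table.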
Configuration
- Drag the CSVDocumentSplitter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Choose a Split Mode: threshold to split on consecutive empty rows or columns, or row-wise to split each row into a separate document.
  - Set Row Split Threshold to the minimum number of consecutive empty rows required to trigger a split (used in threshold mode).
  - Set Column Split Threshold to the minimum number of consecutive empty columns required to trigger a split (used in threshold mode).
  - Set Read CSV Kwargs to pass additional options to pandas.read_csv, such as custom delimiters or encodings.
Connections
CSVDocumentSplitter accepts a list of Document objects containing CSV-formatted content. It outputs a list of Document objects, each representing an extracted sub-table from the original CSV. Each output document includes source_id, row_idx_start, col_idx_start, and split_id metadata fields.
It typically receives documents from CSVDocumentCleaner or directly from converters like CSVToDocument, and sends split documents to embedders or DocumentWriter.
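Downstream code can use these metadata fields to relate sub-tables back to their source. The sketch below is a hypothetical helper, not part of Haystack: it uses plain dictionaries with a `meta` key as stand-ins for Document objects, and the sample IDs and contents are made up for illustration. It groups splits by source_id and orders each group by split_id.

```python
from collections import defaultdict


def group_splits_by_source(split_docs: list[dict]) -> dict[str, list[dict]]:
    """Group split documents by source_id, ordered by split_id within each group."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for doc in split_docs:
        grouped[doc["meta"]["source_id"]].append(doc)
    for docs in grouped.values():
        docs.sort(key=lambda d: d["meta"]["split_id"])
    return dict(grouped)


# Stand-in documents; real output would be Haystack Document objects.
sample = [
    {"content": "c,d\n3,4\n", "meta": {"source_id": "doc-1", "split_id": 1, "row_idx_start": 4, "col_idx_start": 0}},
    {"content": "x,y\n", "meta": {"source_id": "doc-2", "split_id": 0, "row_idx_start": 0, "col_idx_start": 0}},
    {"content": "a,b\n1,2\n", "meta": {"source_id": "doc-1", "split_id": 0, "row_idx_start": 0, "col_idx_start": 0}},
]

for source_id, docs in group_splits_by_source(sample).items():
    print(source_id, [d["meta"]["split_id"] for d in docs])
```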
Source Code
To check this component's source code, open csv_document_splitter.py in the Haystack repository.
Usage Examples
Basic Configuration
CSVDocumentSplitter:
  type: haystack.components.preprocessors.csv_document_splitter.CSVDocumentSplitter
  init_parameters:
    row_split_threshold: 2
    column_split_threshold: 2
    split_mode: threshold
Using the Component in an Index
This example shows an index that converts CSV files to documents, cleans and splits them, and writes the resulting sub-tables to an OpenSearch document store.
# haystack-pipeline
components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
          method:
          mappings:
          settings:
            index.knn: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: NONE
  CSVDocumentSplitter:
    type: haystack.components.preprocessors.csv_document_splitter.CSVDocumentSplitter
    init_parameters:
      row_split_threshold: 2
      column_split_threshold: 2
      read_csv_kwargs:
      split_mode: threshold
  CSVDocumentCleaner:
    type: haystack.components.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: CSVDocumentSplitter.documents
  - sender: CSVDocumentSplitter.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - CSVToDocument.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents, each representing an extracted sub-table from the original CSV. Document metadata includes source_id, row_idx_start, col_idx_start, and split_id. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| row_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty rows required to trigger a split. |
| column_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty columns required to trigger a split. |
| read_csv_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments to pass to pandas.read_csv. By default, the component uses header=None, skip_blank_lines=False to preserve blank lines, and dtype=object to prevent type inference. See the pandas documentation for more information. |
| split_mode | SplitMode | threshold | If threshold, splits the document wherever the number of consecutive empty rows or columns reaches the corresponding threshold. If row-wise, splits each row into a separate sub-table. |
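As a rough illustration of row-wise mode, the following sketch turns each CSV row into its own one-row sub-table. It is not the component's implementation: the function name `split_row_wise` and the dictionary layout are hypothetical, and skipping fully empty rows is an assumption made for this sketch that the component may handle differently. Because no header row is assumed (matching the header=None default described above), the first line is treated as data.

```python
import csv
import io


def split_row_wise(csv_text: str) -> list[dict]:
    """Emit one single-row sub-table per non-empty CSV row."""
    splits: list[dict] = []
    for row_idx, row in enumerate(csv.reader(io.StringIO(csv_text))):
        # Assumption for this sketch: fully empty rows produce no sub-table.
        if all(cell.strip() == "" for cell in row):
            continue
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerow(row)
        splits.append(
            {
                "content": buffer.getvalue(),
                "row_idx_start": row_idx,  # position of the row in the original CSV
                "split_id": len(splits),  # order in which the split was produced
            }
        )
    return splits


for part in split_row_wise("a,b\n1,2\n\n3,4\n"):
    print(part)
```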
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |