CSVDocumentSplitter
Split CSV documents into sub-tables based on empty rows or columns.
Basic Information
- Type: haystack.components.preprocessors.csv_document_splitter.CSVDocumentSplitter
- Components it can connect with:
  - CSVDocumentCleaner: CSVDocumentSplitter can receive cleaned CSV documents from CSVDocumentCleaner.
  - Converters: CSVDocumentSplitter can receive CSV documents from converters like CSVToDocument.
  - Embedders: CSVDocumentSplitter can send split documents to document embedders.
  - DocumentWriter: CSVDocumentSplitter can send split documents to DocumentWriter for storage.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents, each representing an extracted sub-table from the original CSV. Document metadata includes source_id, row_idx_start, col_idx_start, and split_id. |
Overview
CSVDocumentSplitter is used in indexes to split CSV documents into smaller sub-tables, making them easier to process and embed. It supports two modes of operation:
- Threshold mode: Identifies runs of consecutive empty rows or columns that are at least as long as the configured threshold and uses them as delimiters to segment the document into smaller tables.
- Row-wise mode: Splits each row into a separate sub-table, represented as a document.
Each split document includes metadata to track:
- source_id: The original document ID
- row_idx_start: Starting row index of the sub-table
- col_idx_start: Starting column index of the sub-table
- split_id: Order of the split in the original document
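To make threshold mode and the split metadata more concrete, here is a minimal sketch using only the Python standard library. It is not the component's implementation: the function name `split_rows_by_threshold`, the dictionary layout, and the sample CSV are hypothetical, and the sketch handles empty rows only (column splitting is left out, so `col_idx_start` is always 0).

```python
import csv
import io


def split_rows_by_threshold(csv_text, row_split_threshold=2, source_id="doc-1"):
    """Split CSV text into sub-tables at runs of empty rows (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # A row counts as empty when it has no cells or only blank cells.
    is_empty = [all(cell.strip() == "" for cell in row) for row in rows]

    # Find runs of empty rows that are long enough to act as delimiters.
    delimiters = []  # (start, end) pairs, end exclusive
    i = 0
    while i < len(rows):
        if not is_empty[i]:
            i += 1
            continue
        j = i
        while j < len(rows) and is_empty[j]:
            j += 1
        if j - i >= row_split_threshold:
            delimiters.append((i, j))
        i = j

    # Everything between two delimiter runs becomes one sub-table.
    sub_tables = []
    start = 0
    for d_start, d_end in delimiters + [(len(rows), len(rows))]:
        block = rows[start:d_start]
        if block:
            sub_tables.append(
                {
                    "content": block,
                    "meta": {
                        "source_id": source_id,
                        "row_idx_start": start,
                        "col_idx_start": 0,  # rows only in this sketch
                        "split_id": len(sub_tables),
                    },
                }
            )
        start = d_end
    return sub_tables


# Two blank lines separate the tables; the single blank line is below the
# threshold of 2, so it stays inside the second sub-table.
sample = "A,B\n1,2\n\n\nX,Y\n\n3,4\n"
tables = split_rows_by_threshold(sample, row_split_threshold=2)
for table in tables:
    print(table["meta"], table["content"])
```

With a threshold of 2, the sample yields two sub-tables starting at rows 0 and 4. Lowering the threshold to 1 would also split at the single blank line and yield three.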
Usage Example
Using the Component in an Index
This example shows an index that converts CSV files to documents, cleans and splits them, and then writes the resulting sub-tables to a document store.
components:
  CSVToDocument:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  CSVDocumentCleaner:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_cleaner.CSVDocumentCleaner
    init_parameters:
      ignore_rows: 0
      ignore_columns: 0
      remove_empty_rows: true
      remove_empty_columns: true
      keep_id: false
  CSVDocumentSplitter:
    type: deepset_cloud_custom_nodes.preprocessors.csv_document_splitter.CSVDocumentSplitter
    init_parameters:
      row_split_threshold: 2
      column_split_threshold: 2
      read_csv_kwargs:
      split_mode: threshold
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: csv-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: CSVToDocument.documents
    receiver: CSVDocumentCleaner.documents
  - sender: CSVDocumentCleaner.documents
    receiver: CSVDocumentSplitter.documents
  - sender: CSVDocumentSplitter.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - CSVToDocument.sources
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| row_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty rows required to trigger a split. |
| column_split_threshold | Optional[int] | 2 | The minimum number of consecutive empty columns required to trigger a split. |
| read_csv_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments to pass to pandas.read_csv. By default, the component calls pandas.read_csv with these options: header=None, skip_blank_lines=False (to preserve blank lines), and dtype=object (to prevent type inference, such as converting numbers to floats). See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information. |
| split_mode | SplitMode | threshold | The splitting strategy. If threshold, the component splits the document wherever a run of consecutive empty rows or columns reaches row_split_threshold or column_split_threshold. If row-wise, the component splits each row into a separate sub-table. |
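The effect of the default read options listed for read_csv_kwargs can be checked with pandas directly. The sketch below is only an illustration: it calls `pandas.read_csv` itself rather than the component, and the CSV string `csv_text` is made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content: two small tables separated by two blank lines.
csv_text = "A,B\n1,2\n\n\nX,Y\n3,4\n"

# The options the component is described as using by default.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,             # treat the first line as data, not column names
    skip_blank_lines=False,  # keep blank lines as all-NaN rows
    dtype=object,            # keep cell values as strings, no type inference
)

# Blank lines survive as rows in which every cell is NaN, which is what
# makes runs of empty rows detectable afterwards.
empty_rows = df.isna().all(axis=1).tolist()
print(df.shape)
print(empty_rows)
```

Without skip_blank_lines=False, pandas drops the blank lines while parsing, so the two tables would be read as one contiguous block with nothing left to split on.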
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns. |