CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

Basic Information

Type: haystack.components.preprocessors.csv_document_splitter.CSVDocumentSplitter

Inputs

Parameter	Type	Default	Description
documents	List[Document]		A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		A list of new `Document` objects, returned under the key `"documents"`, each representing a sub-table extracted from the original CSV. The metadata of each document includes: - A field `source_id` to track the original document. - A field `row_idx_start` to indicate the starting row index of the sub-table in the original table. - A field `col_idx_start` to indicate the starting column index of the sub-table in the original table. - A field `split_id` to indicate the order of the split in the original document. - All other metadata copied from the original document's `meta` field. If a document cannot be processed, it is returned unchanged.
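The metadata fields above make it possible to trace each sub-table back to its source. The following is a minimal sketch of regrouping split results by `source_id` and ordering them by `split_id`. Plain dictionaries stand in for `Document` objects, and the helper `group_by_source` and all sample values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Plain dicts stand in for Document objects; the field names follow the
# metadata listed above, the values are invented for this example.
split_docs = [
    {"content": "x,y\n5,6\n",
     "meta": {"source_id": "doc-1", "split_id": 1, "row_idx_start": 6, "col_idx_start": 0}},
    {"content": "a,b\n1,2\n",
     "meta": {"source_id": "doc-1", "split_id": 0, "row_idx_start": 0, "col_idx_start": 0}},
    {"content": "k,v\n",
     "meta": {"source_id": "doc-2", "split_id": 0, "row_idx_start": 0, "col_idx_start": 0}},
]


def group_by_source(docs):
    """Group sub-tables by their source document, ordered by split_id."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc["meta"]["source_id"]].append(doc)
    for parts in grouped.values():
        parts.sort(key=lambda d: d["meta"]["split_id"])
    return dict(grouped)


grouped = group_by_source(split_docs)
for source_id, parts in grouped.items():
    print(source_id, [p["meta"]["split_id"] for p in parts])
```

With real `Document` objects the same idea should apply by reading `doc.meta` instead of `doc["meta"]`.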

Overview

The splitter supports two modes of operation, selected with `split_mode`:

- `threshold`: identifies runs of consecutive empty rows or columns that reach a given threshold and uses them as delimiters to segment the document into smaller tables.
- `row-wise`: splits each row into a separate sub-table, represented as a Document.
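To make the threshold idea concrete, here is a minimal sketch that splits CSV text on runs of empty rows using only the Python standard library. It is not the component's implementation: the function name `split_on_empty_rows` is hypothetical, it handles rows only (not columns), and keeping shorter gaps inside a sub-table is an assumption of this sketch.

```python
import csv
import io


def split_on_empty_rows(csv_text, row_split_threshold=2):
    """Return (row_idx_start, rows) pairs, one per sub-table.

    A run of at least `row_split_threshold` consecutive empty rows
    acts as a delimiter between sub-tables.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    tables = []      # finished sub-tables as (row_idx_start, rows)
    current = []     # rows of the sub-table being built
    start = 0        # index of the first row of `current`
    empty_run = []   # empty rows seen since the last non-empty row

    for idx, row in enumerate(rows):
        if not any(cell.strip() for cell in row):
            empty_run.append(row)
            continue
        if current and len(empty_run) >= row_split_threshold:
            # The gap is long enough: close the current sub-table.
            tables.append((start, current))
            current = []
        if not current:
            start = idx
        else:
            # Assumption: gaps below the threshold stay inside the table.
            current.extend(empty_run)
        empty_run = []
        current.append(row)

    if current:
        tables.append((start, current))
    return tables


csv_text = "a,b\n1,2\n,\n3,4\n,\n,\nx,y\n5,6\n"
for row_idx_start, table in split_on_empty_rows(csv_text):
    print(row_idx_start, table)
```

Lowering `row_split_threshold` to 1 in this sketch makes the single empty row after `1,2` a delimiter as well, so the same input yields three sub-tables instead of two.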

Usage Example

components:
  CSVDocumentSplitter:
    type: haystack.components.preprocessors.csv_document_splitter.CSVDocumentSplitter
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
row_split_threshold	Optional[int]	2	The minimum number of consecutive empty rows required to trigger a split.
column_split_threshold	Optional[int]	2	The minimum number of consecutive empty columns required to trigger a split.
read_csv_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments to pass to `pandas.read_csv`. By default, the component calls `pandas.read_csv` with these options: - `header=None` - `skip_blank_lines=False` to preserve blank lines - `dtype=object` to prevent type inference (e.g., converting numbers to floats). See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
split_mode	SplitMode	threshold	If `threshold`, the component splits the document wherever the number of consecutive empty rows or columns reaches `row_split_threshold` or `column_split_threshold`. If `row-wise`, the component splits each row into a separate sub-table.
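As a rough illustration of what `row-wise` mode produces, the sketch below turns each non-empty CSV row into its own record with the metadata fields described in the Outputs section. It uses only the standard library, with plain dictionaries in place of `Document` objects. The function name `split_row_wise`, the zero-based `split_id`, and the choice to skip empty rows are assumptions of this sketch, not confirmed behavior of the component.

```python
import csv
import io


def split_row_wise(csv_text, source_id):
    """Emit one record per non-empty row of the CSV text."""
    docs = []
    for row_idx, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if not any(cell.strip() for cell in row):
            continue  # assumption: empty rows produce no sub-table
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerow(row)
        docs.append({
            "content": buffer.getvalue(),
            "meta": {
                "source_id": source_id,
                "row_idx_start": row_idx,   # position in the original table
                "col_idx_start": 0,
                "split_id": len(docs),      # order of the split (zero-based here)
            },
        })
    return docs


for doc in split_row_wise("a,b\n1,2\n\n3,4\n", source_id="doc-1"):
    print(doc["meta"]["row_idx_start"], repr(doc["content"]))
```

In this sketch `row_idx_start` keeps the row's position in the original table, so rows that follow a blank line remain traceable even though the blank line itself is dropped.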

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns.
