DeepsetCSVRowsToDocumentsConverter
Read CSV files from various sources and convert each row into Haystack documents.
Key Features
- Converts each CSV row into a separate Haystack
Documentobject - Uses a configurable column as the document content
- Adds all other columns to the document's metadata automatically
- Produces pre-chunked documents that can bypass the splitter in your pipeline
- Supports UTF-8 encoding by default, with support for custom encodings
- Accepts file paths and ByteStream objects as input
Configuration
- Drag the
DeepsetCSVRowsToDocumentsConvertercomponent onto the canvas from the Component Library. - Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
DeepsetCSVRowsToDocumentsConverter receives CSV files as input — typically from FileTypeRouter, which routes files by MIME type. Because it produces pre-chunked documents (one per row), you can connect its output directly to an embedder like SentenceTransformersDocumentEmbedder, skipping DocumentSplitter.
Usage Example
Using the Component in a Pipeline
This is an example of an index that processes multiple file types. It starts with FilesInput followed by file_classifier (FileTypeRouter) which classifies files by type and sends them to an appropriate converter.
DeepsetCSVRowsToDocumentsConverter receives CSV files from file_classifier (FileTypeRouter) and outputs a list of pre-chunked documents. Since these documents are already chunked, they bypass the splitter (DeepsetDocumentSplitter) and go directly to the document_embedder (SentenceTransformersDocumentEmbedder) and finally to the writer (DocumentWriter), which writes them into the document store.
YAML configuration:
components:
file_classifier:
type: haystack.components.routers.file_type_router.FileTypeRouter
init_parameters:
mime_types:
- text/markdown
- text/html
- text/csv
markdown_converter:
type: haystack.components.converters.markdown.MarkdownToDocument
init_parameters: {}
html_converter:
type: haystack.components.converters.html.HTMLToDocument
init_parameters:
extraction_kwargs:
output_format: txt
target_language: null
include_tables: true
include_links: false
splitter:
type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
init_parameters:
split_by: word
split_length: 250
split_overlap: 30
respect_sentence_boundary: true
language: en
document_embedder:
type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
init_parameters:
model: intfloat/e5-base-v2
writer:
type: haystack.components.writers.document_writer.DocumentWriter
init_parameters:
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
embedding_dim: 768
similarity: cosine
policy: OVERWRITE
DeepsetCSVRowsToDocumentsConverter:
type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
init_parameters:
content_column: content
encoding: utf-8
connections:
- sender: file_classifier.text/markdown
receiver: markdown_converter.sources
- sender: file_classifier.text/html
receiver: html_converter.sources
- sender: markdown_converter.documents
receiver: splitter.documents
- sender: html_converter.documents
receiver: splitter.documents
- sender: splitter.documents
receiver: document_embedder.documents
- sender: DeepsetCSVRowsToDocumentsConverter.documents
receiver: document_embedder.documents
- sender: document_embedder.documents
receiver: writer.documents
- sender: file_classifier.text/csv
receiver: DeepsetCSVRowsToDocumentsConverter.sources
max_loops_allowed: 100
metadata: {}
inputs:
files:
- file_classifier.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | List of CSV file paths (str or Path) or ByteStream objects. | |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. Can be a single dict or a list of dicts. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | A dictionary containing a list of Haystack Documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| content_column | str | content | Name of the column to use as content when processing the CSV file. |
| encoding | str | utf-8 | Encoding type to use when reading the files. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | List of CSV file paths (str or Path) or ByteStream objects. | |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. Can be a single dict or a list of dicts. |
Was this page helpful?