DeepsetCSVRowsToDocumentsConverter
Read CSV files from various sources and convert each row into a Haystack document.
DeepsetCSVRowsToDocumentsConverter reads a CSV file and converts each row into a Document object, using one column as the document's main content. All other columns are added to the document's metadata.
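The row-to-document mapping described above can be sketched with the Python standard library. This is an illustration only, not the component's actual implementation: the function name `rows_to_documents` is hypothetical, and plain dictionaries stand in for Haystack Document objects.

```python
import csv
import io


def rows_to_documents(csv_text, content_column="content"):
    """Turn each CSV row into a dict with 'content' and 'meta' keys.

    Hypothetical helper: plain dicts stand in for Haystack Documents.
    """
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Copy the row so the reader's own dict stays untouched.
        meta = dict(row)
        # The chosen column becomes the content; all other columns stay in meta.
        content = meta.pop(content_column)
        documents.append({"content": content, "meta": meta})
    return documents


sample = "content,author,year\nFirst row text,Ada,1843\nSecond row text,Alan,1950\n"
docs = rows_to_documents(sample)
```

In this sketch, a missing content column raises a `KeyError`. How the real component handles that case is not covered here.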
Key Features
- Converts each CSV row into a separate document, which is useful for pre-chunked datasets.
- Configurable content column to select which column becomes the document content.
- All remaining columns are automatically stored in document metadata.
- Reads files as UTF-8 by default, with a configurable encoding.
Configuration
- Drag the DeepsetCSVRowsToDocumentsConverter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Set the Content Column to specify which CSV column to use as the document content. The default is content.
  - Set the Encoding for the CSV files. The default is UTF-8.
Connections
DeepsetCSVRowsToDocumentsConverter accepts a list of file paths or ByteStream objects through its sources input. It outputs a list of Document objects, one per CSV row.
Because the output documents are already one-row chunks, you can bypass a document splitter and connect directly to an embedder. It typically connects with:
- FileTypeRouter: receives CSV files routed by MIME type.
- SentenceTransformersDocumentEmbedder or other embedders: sends documents directly for embedding.
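As a rough illustration of the sources input, the sketch below reads a mix of file paths and in-memory bytes and flattens all rows into one list. It is a sketch under stated assumptions, not the component's code: `read_rows` and `read_all` are hypothetical names, and raw `bytes` stand in for the data held by a ByteStream.

```python
import csv
import io
from pathlib import Path


def read_rows(source, encoding="utf-8"):
    """Return the rows of one CSV source as a list of dicts (hypothetical helper)."""
    if isinstance(source, (bytes, bytearray)):
        # Stand-in for ByteStream data: decode the in-memory bytes.
        text = bytes(source).decode(encoding)
    else:
        # str or Path: read the file from disk with the same encoding.
        text = Path(source).read_text(encoding=encoding)
    return list(csv.DictReader(io.StringIO(text)))


def read_all(sources, encoding="utf-8"):
    """Flatten the rows of every source into a single list."""
    rows = []
    for source in sources:
        rows.extend(read_rows(source, encoding))
    return rows


# Bytes encoded as Latin-1 need the matching encoding to decode correctly.
in_memory = "content,lang\nGrüße,de\n".encode("latin-1")
rows = read_all([in_memory], encoding="latin-1")
```

Passing the wrong encoding either raises a decode error or silently garbles non-ASCII characters, which is why the Encoding setting should match the files you index.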
Usage Examples
Basic Configuration
DeepsetCSVRowsToDocumentsConverter:
  type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
  init_parameters:
    content_column: content
    encoding: utf-8
Using the Component in a Pipeline
This example shows an index that processes multiple file types. It starts with FilesInput, followed by file_classifier (FileTypeRouter), which classifies files by MIME type and sends each one to the appropriate converter.
DeepsetCSVRowsToDocumentsConverter receives CSV files from file_classifier and outputs a list of documents, one per row. Because these documents are already chunked, they bypass the splitter (DocumentSplitter) and go directly to document_embedder (SentenceTransformersDocumentEmbedder) and then to writer (DocumentWriter), which writes them into the document store.
YAML configuration:
# haystack-pipeline
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/markdown
        - text/html
        - text/csv
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language:
        include_tables: true
        include_links: false
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
          hosts:
          index: ""
          max_chunk_bytes: 104857600
          return_embedding: false
          method:
          mappings:
          settings:
            index.knn: true
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: content
      encoding: utf-8
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
      skip_empty_documents: true
connections:
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: file_classifier.text/csv
    receiver: DeepsetCSVRowsToDocumentsConverter.sources
  - sender: markdown_converter.documents
    receiver: DocumentSplitter.documents
  - sender: html_converter.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
metadata: {}
inputs:
  files:
    - file_classifier.sources
max_runs_per_component: 100
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | List of CSV file paths (str or Path) or ByteStream objects. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. Can be a single dict or a list of dicts. |
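The two accepted shapes of meta (a single dict or a list of dicts) can be reconciled as sketched below. This is a minimal illustration with hypothetical helpers `normalize_meta` and `merge_meta`. It assumes that a single dict applies to every source, that a list must contain one entry per source, and that user-supplied values override CSV columns with the same name. None of these behaviors is confirmed for the component itself.

```python
def normalize_meta(meta, sources_count):
    """Return one metadata dict per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Assumption: a single dict is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a list needs exactly one entry per source.
        raise ValueError("The meta list must have one entry per source.")
    return [dict(entry) for entry in meta]


def merge_meta(row_meta, source_meta):
    """Combine metadata from CSV columns with user-supplied metadata."""
    # Assumption: user-supplied values win when keys clash.
    return {**row_meta, **source_meta}


per_source = normalize_meta({"dataset": "faq"}, sources_count=2)
merged = merge_meta({"author": "Ada"}, per_source[0])
```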
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of Haystack Documents, one per CSV row. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| content_column | str | content | Name of the column to use as content when processing the CSV file. |
| encoding | str | utf-8 | Encoding type to use when reading the files. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | List of CSV file paths (str or Path) or ByteStream objects. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. Can be a single dict or a list of dicts. |