DeepsetCSVRowsToDocumentsConverter
Convert each row of a CSV file into a document object.
Basic Information
- Pipeline type: Indexing
- Type:
deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
- Components it can connect with:
- FileTypeRouter: DeepsetCSVRowsToDocumentsConverter can receive CSV files from FileTypeRouter.
- DocumentJoiner: DeepsetCSVRowsToDocumentsConverter can send converted documents to DocumentJoiner. This is useful if you have other converters in your pipeline and want to join their output with DeepsetCSVRowsToDocumentsConverter's output before sending it further down the pipeline.
Inputs
Required Inputs
Name | Type | Description |
---|---|---|
sources | List of strings, Paths, or ByteStream objects | A list of CSV file paths (strings or Paths) or ByteStream objects. |
Optional Inputs
Name | Type | Description |
---|---|---|
meta | Dictionary or a list of dictionaries | Metadata to attach to the documents. |
Outputs
Name | Type | Description |
---|---|---|
documents | List of Document objects | A list of converted documents. |
Overview
DeepsetCSVRowsToDocumentsConverter reads a CSV file and converts each row into a Document object, using the `content` column as the document's main content. To use a different column for content, set the `content_column` parameter. All other columns are added to the document's metadata.
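To make the row-to-document mapping concrete, here is a minimal Python sketch of the behavior described above. It is not the component's actual implementation; it only illustrates how the content column becomes a document's content while the remaining columns land in its metadata (the function name and the plain-dict document shape are illustrative):

```python
import csv
import io


def rows_to_documents(csv_text, content_column="content"):
    """Illustrative sketch: one document per CSV row.

    The named content column becomes the document's content;
    every other column is copied into the document's metadata.
    """
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        meta = {key: value for key, value in row.items() if key != content_column}
        documents.append({"content": row[content_column], "meta": meta})
    return documents


csv_text = "content,author,year\nHello world,Alice,2023\nSecond row,Bob,2024\n"
docs = rows_to_documents(csv_text)
# docs[0] -> {"content": "Hello world", "meta": {"author": "Alice", "year": "2023"}}
```

Note that a CSV with ten rows yields ten documents, each carrying its own metadata, which is why the converter's output can skip the splitter in the pipeline below.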
Usage Example
This is an example of an indexing pipeline that processes multiple file types. It starts with FilesInput, followed by file_classifier (FileTypeRouter), which classifies files by type and routes them to the appropriate converter. DeepsetCSVRowsToDocumentsConverter receives CSV files from file_classifier and outputs a list of pre-chunked documents. Because these documents are already chunked, they bypass the splitter (DeepsetDocumentSplitter) and go directly to joiner_csv (DocumentJoiner), which combines them with the splitter's output into a single list. The joined list is then sent to document_embedder (SentenceTransformersDocumentEmbedder) and finally to writer (DocumentWriter), which writes the documents into the document store.
YAML configuration:
```yaml
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/markdown
        - text/html
        - text/csv
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  joiner_csv:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  splitter:
    type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: OVERWRITE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: content
      encoding: utf-8

connections:
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_csv.documents
  - sender: joiner_csv.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: file_classifier.text/csv
    receiver: DeepsetCSVRowsToDocumentsConverter.sources
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: joiner_csv.documents

max_loops_allowed: 100
metadata: {}
inputs:
  files:
    - file_classifier.sources
```
Init Parameters
Parameter | Type | Possible values | Description |
---|---|---|---|
content_column | String | Default: content | Specifies the column name to use as content when processing the CSV file. Optional. |
encoding | String | Default: utf-8 | Specifies the encoding type to use when reading the files. Optional. |
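For example, if your CSV files store the main text in a column named `text` and are encoded in Latin-1 rather than UTF-8, the component's entry in the pipeline YAML could look like this (the column name and encoding here are illustrative, not requirements):

```yaml
DeepsetCSVRowsToDocumentsConverter:
  type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
  init_parameters:
    content_column: text   # use the "text" column as document content
    encoding: latin-1      # read files as Latin-1 instead of the default utf-8
```

With this configuration, every column other than `text` still ends up in the resulting documents' metadata.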