DeepsetCSVRowsToDocumentsConverter

Convert each row of a CSV file into a document object.

Basic Information

  • Pipeline type: Indexing
  • Type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
  • Components it can connect with:
    • FileTypeRouter: DeepsetCSVRowsToDocumentsConverter can receive CSV files from FileTypeRouter.
    • DocumentJoiner: DeepsetCSVRowsToDocumentsConverter can send converted documents to DocumentJoiner. This is useful if you have other converters in your pipeline and want to join their output with DeepsetCSVRowsToDocumentsConverter's output before sending it further down the pipeline.
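
For example, the following connections (an excerpt from the YAML example further down this page) route CSV files from `file_classifier` (FileTypeRouter) to the converter and pass the converted documents on to `joiner_csv` (DocumentJoiner):

connections:
  - sender: file_classifier.text/csv
    receiver: DeepsetCSVRowsToDocumentsConverter.sources
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: joiner_csv.documents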

Inputs

Required Inputs

| Name | Type | Description |
| --- | --- | --- |
| sources | List of strings, Paths, and ByteStream objects | A list of CSV file paths (strings or Paths) or ByteStream objects. |

Optional Inputs

| Name | Type | Description |
| --- | --- | --- |
| meta | Dictionary or a list of dictionaries | Metadata to attach to the documents. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| documents | List of Document objects | A list of converted documents. |

Overview

DeepsetCSVRowsToDocumentsConverter reads a CSV file and converts each row into a Document object, using the `content` column as the document's main content by default. To use a different column as the content, set the `content_column` parameter.

All other columns are added to the document’s metadata.
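
For example, given a hypothetical CSV file like this, where `content` is the default content column:

content,author,year
"Haystack is an open source framework.",Jane Doe,2023
"deepset Cloud builds on Haystack.",John Smith,2024

each row becomes one document: the value of the `content` column becomes the document's content, and the `author` and `year` values are stored in that document's metadata.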

Usage Example

This is an example of an indexing pipeline that processes multiple file types. It starts with FilesInput, followed by `file_classifier` (FileTypeRouter), which classifies files by type and sends them to the appropriate converter.

`DeepsetCSVRowsToDocumentsConverter` receives CSV files from `file_classifier` (FileTypeRouter) and outputs a list of pre-chunked documents. Since these documents are already chunked, they bypass `splitter` (DeepsetDocumentSplitter) and go directly to `joiner_csv` (DocumentJoiner), which combines them with the documents coming from `splitter` into a single list. The joined list is then sent to `document_embedder` (SentenceTransformersDocumentEmbedder) and finally to `writer` (DocumentWriter), which writes the documents into the document store.

YAML configuration:

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/markdown
        - text/html
        - text/csv
  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters: {}
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  joiner_csv:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  splitter:
    type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: OVERWRITE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: content
      encoding: utf-8
connections:
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_csv.documents
  - sender: joiner_csv.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: file_classifier.text/csv
    receiver: DeepsetCSVRowsToDocumentsConverter.sources
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: joiner_csv.documents
max_loops_allowed: 100
metadata: {}
inputs:
  files:
    - file_classifier.sources


Init Parameters

| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| content_column | String | Default: content | Specifies the column name to use as content when processing the CSV file. Optional. |
| encoding | String | Default: utf-8 | Specifies the encoding type to use when reading the files. Optional. |
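
For example, if your CSV files store the main text in a column named text rather than content (text is a hypothetical column name here), you could configure the converter like this, where `csv_converter` is an arbitrary component name:

csv_converter:
  type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
  init_parameters:
    content_column: text
    encoding: utf-8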