XLSXToDocument

Convert XLSX files to documents your pipeline can search.

Basic Information

  • Pipeline type: Indexing
  • Type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
  • Components it can connect with:
    • FileTypeRouter: XLSXToDocument can receive XLSX files from FileTypeRouter to convert them to documents.
    • DocumentJoiner: DocumentJoiner can receive the converted documents and send them on to DocumentSplitter or Cleaner.

Inputs

Required Inputs

Name | Type | Description
sources | List of file paths or ByteStream objects | The files to be converted.

Optional Inputs

Name | Type | Description
meta | Dictionary or a list of dictionaries of string and any | The metadata to attach to the converted documents. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources because the two lists are zipped. If sources contain ByteStream objects, their meta is added to the output documents.
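
The single-dictionary and list cases for meta can be illustrated with a minimal sketch. This is not the component's actual code: normalize_meta is a hypothetical helper, and the file names are made up for the example. It only mirrors the behavior described above, where one dictionary is applied to every source and a list must match the sources one-to-one.

```python
def normalize_meta(meta, sources_count):
    """Hypothetical helper: return one metadata dictionary per source."""
    if meta is None:
        # No metadata supplied: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # A list is zipped with the sources, so the lengths must match.
        raise ValueError("The length of meta must match the number of sources.")
    return [dict(item) for item in meta]


# Illustrative file names only.
sources = ["q1.xlsx", "q2.xlsx"]
paired = list(zip(sources, normalize_meta({"department": "finance"}, len(sources))))
```

Here both sources end up paired with their own copy of the same dictionary. Passing a list with a single entry for the two sources would raise a ValueError in this sketch.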

Outputs

Name | Type | Description
documents | List of Document objects | A list of documents.

Overview

Use the XLSXToDocument converter to turn XLSX files into a searchable format. It uses the pandas.read_excel function under the hood.
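
To give a feel for what pandas.read_excel returns, here is a small sketch that assumes pandas and openpyxl are installed. It is not the converter's implementation: read_all_sheets is a hypothetical wrapper, and the two-sheet workbook is built in memory only so the example is self-contained. Passing sheet_name=None asks pandas for every sheet and returns a dictionary that maps sheet names to DataFrames.

```python
import io

import pandas as pd


def read_all_sheets(source, **kwargs):
    """Hypothetical wrapper: load every sheet as {sheet_name: DataFrame}."""
    return pd.read_excel(source, sheet_name=None, **kwargs)


# Build a small two-sheet workbook in memory (sample data, not from the docs).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"product": ["apple", "pear"], "price": [1.2, 0.8]}).to_excel(
        writer, sheet_name="prices", index=False
    )
    pd.DataFrame({"product": ["apple"], "stock": [40]}).to_excel(
        writer, sheet_name="stock", index=False
    )
buffer.seek(0)

sheets = read_all_sheets(buffer)
for name, frame in sheets.items():
    # The first row of each sheet becomes the DataFrame's column names.
    print(name, list(frame.columns), len(frame))
```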

Usage Example

This example shows an indexing pipeline that uses XLSXToDocument with default settings, so it generates one document for each sheet in the XLSX file. To keep the column names from the first row together with the data, the documents from XLSXToDocument bypass DocumentSplitter. Instead, a second DocumentJoiner (joiner_xlsx) combines the split documents from the other converters with the unsplit documents from XLSXToDocument and sends them all to DocumentEmbedder.

components:
    file_classifier:
      type: haystack.components.routers.file_type_router.FileTypeRouter
      init_parameters:
        mime_types:
        - text/plain
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    text_converter:
      type: haystack.components.converters.txt.TextFileToDocument
      init_parameters:
        encoding: utf-8
    pptx_converter:
      type: haystack.components.converters.pptx.PPTXToDocument
      init_parameters: {}
    xlsx_converter:
      type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
      init_parameters: {}
    joiner:
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate
        sort_by_score: false
    joiner_xlsx:  # merge split documents with non-split xlsx documents
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate
        sort_by_score: false
    splitter:
      type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
      init_parameters:
        split_by: word
        split_length: 250
        split_overlap: 30
        respect_sentence_boundary: True
        language: en
    document_embedder:
      type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
      init_parameters:
        model: "intfloat/e5-base-v2"
    writer:
      type: haystack.components.writers.document_writer.DocumentWriter
      init_parameters:
        document_store:
          type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
          init_parameters:
            embedding_dim: 768
            similarity: cosine
        policy: OVERWRITE
connections:  # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents # DocumentJoiner receives the split documents from other converters
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents # DocumentJoiner receives unsplit documents from XLSXToDocument
  - sender: joiner_xlsx.documents # DocumentJoiner sends all documents to DocumentEmbedder
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents

Init Parameters

Parameter | Type | Possible values | Description
document_per | Literal | sheet, row. Default: sheet | Specifies how to create documents. sheet: creates a separate document for each sheet. row: creates a separate document for each row. Required.
content_column | String | Default: content | The name of the column to use as document content if document_per is set to row. Required.
sheet_name | List of strings and integers | Default: None | Specifies the sheet to convert. It can be the sheet name, number, or both. If None, converts all sheets.
**kwargs | Any | | Additional arguments to pass to pandas.read_excel. For a list of possible options, see the pandas.read_excel documentation.
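
The difference between the two document_per modes can be sketched in plain Python. This is an illustration under stated assumptions, not the component's code: sheet_to_documents is a hypothetical function, rows are modeled as dictionaries, the sheet content is rendered as CSV, and the remaining columns plus a "sheet" key are stored in meta. The page does not specify the real content format or metadata keys.

```python
import csv
import io


def sheet_to_documents(sheet_name, rows, document_per="sheet", content_column="content"):
    """Hypothetical sketch: turn one sheet (a list of row dicts) into documents."""
    if not rows:
        return []
    if document_per == "sheet":
        # One document for the whole sheet; CSV rendering is an assumption.
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return [{"content": out.getvalue(), "meta": {"sheet": sheet_name}}]
    if document_per == "row":
        # One document per row; content comes from content_column.
        documents = []
        for row in rows:
            # Assumption: the other columns are kept as metadata.
            meta = {key: value for key, value in row.items() if key != content_column}
            meta["sheet"] = sheet_name
            documents.append({"content": str(row[content_column]), "meta": meta})
        return documents
    raise ValueError("document_per must be 'sheet' or 'row'")


# Sample rows, made up for the example.
rows = [
    {"content": "Refund policy", "region": "EU"},
    {"content": "Shipping times", "region": "US"},
]
per_sheet = sheet_to_documents("faq", rows, document_per="sheet")
per_row = sheet_to_documents("faq", rows, document_per="row")
```

With these sample rows, sheet mode yields a single document that holds the header and both rows, while row mode yields two documents whose content is taken from the content column.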