Basic Information

Pipeline type: Indexing
Type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
Components it can connect with:
- FileTypeRouter: XLSXToDocument can receive XLSX files from FileTypeRouter to convert them to documents.
- DocumentJoiner: DocumentJoiner can receive the converted documents and send them on to DocumentSplitter or Cleaner.

Inputs

Required Inputs

Name	Type	Description
`sources`	List of file paths or ByteStream objects	The files to be converted.

Optional Inputs

Name	Type	Description
`meta`	Dictionary or a list of dictionaries of string and any	The metadata to attach to the converted documents. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources because the two lists are zipped, pairing each source with its own metadata. If `sources` contains ByteStream objects, their `meta` is added to the output documents.

Outputs

Name	Type	Description
`documents`	List of Document objects	A list of documents.

Overview

Use the XLSXToDocument converter to turn XLSX files into documents that can be indexed and searched. It uses the `pandas.read_excel` function under the hood.

Usage Example

This example shows an indexing pipeline using XLSXToDocument with default settings, so it generates one document for each sheet in the XLSX file. To keep each sheet together with the column names from its first row, the XLSX documents bypass DocumentSplitter. Instead, a second DocumentJoiner combines the split documents from the other converters with the unsplit documents from XLSXToDocument and sends them all to DocumentEmbedder.

components:
    file_classifier:
      type: haystack.components.routers.file_type_router.FileTypeRouter
      init_parameters:
        mime_types:
        - text/plain
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    text_converter:
      type: haystack.components.converters.txt.TextFileToDocument
      init_parameters:
        encoding: utf-8
    pptx_converter:
      type: haystack.components.converters.pptx.PPTXToDocument
      init_parameters: {}
    xlsx_converter:
      type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
      init_parameters: {}
    joiner:
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate
        sort_by_score: false
    joiner_xlsx:  # merge split documents with non-split xlsx documents
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate
        sort_by_score: false
    splitter:
      type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
      init_parameters:
        split_by: word
        split_length: 250
        split_overlap: 30
        respect_sentence_boundary: true
        language: en
    document_embedder:
      type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
      init_parameters:
        model: "intfloat/e5-base-v2"
    writer:
      type: haystack.components.writers.document_writer.DocumentWriter
      init_parameters:
        document_store:
          type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
          init_parameters:
            embedding_dim: 768
            similarity: cosine
        policy: OVERWRITE
connections:  # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents # DocumentJoiner receives the split documents from other converters
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents # DocumentJoiner receives unsplit documents from XLSXToDocument
  - sender: joiner_xlsx.documents # DocumentJoiner sends all documents to DocumentEmbedder
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents

Init Parameters

Parameter	Type	Possible values	Description
`document_per`	Literal	`sheet`, `row`. Default: `sheet`	Specifies how to create documents: `sheet` creates a separate document for each sheet, and `row` creates a separate document for each row. Required.
`content_column`	String	Default: `content`	The name of the column to use as document content if `document_per` is set to `row`. Required.
`sheet_name`	List of strings and integers	Default: `None`	Specifies the sheets to convert. Each entry can be a sheet name or a sheet number, and the list can mix both. If `None`, all sheets are converted.
`**kwargs`	Any		Additional arguments to pass to `pandas.read_excel`. For a list of possible options, see the `pandas.read_excel` documentation.
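
As an illustration of the parameters above, the following is a minimal sketch of the converter configured to create one document per row instead of one per sheet. The sheet name `Products` and the column name `description` are hypothetical placeholders, not values from a real workbook; replace them with names from your own file.

```yaml
# Sketch only: `Products` and `description` are hypothetical names
# standing in for a sheet and a column in your own workbook.
xlsx_converter:
  type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
  init_parameters:
    document_per: row             # one document per row instead of per sheet
    content_column: description   # column used as the document content in row mode
    sheet_name:
      - Products                  # convert only this sheet; omit to convert all sheets
```

This fragment would take the place of the `xlsx_converter` entry in the usage example, with its indentation adjusted to match the surrounding pipeline. The connections can stay as they are.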