XLSXToDocument

Turn XLSX worksheets or rows into documents. This component uses pandas and openpyxl to read spreadsheets.

Deprecation Notice

This component is deprecated. It will continue to work in your existing pipelines for now. You can replace it with the XLSXToDocument component.

Basic Information

  • Type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
  • Components it can connect with:
    • FileTypeRouter: Route XLSX files to XLSXToDocument.
    • DocumentJoiner: Combine spreadsheet output with documents produced by other converters.
    • DocumentSplitter or downstream embedding components: Receive the list of documents that XLSXToDocument produces.

Inputs

Parameter | Type | Default | Description
sources | List[Union[str, Path, ByteStream]] |  | Paths or ByteStreams that point to XLSX files.
meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata forwarded to every document or aligned per source.

Outputs

Parameter | Type | Default | Description
documents | List[Document] |  | Documents that contain CSV content or the selected row content plus merged metadata.

Overview

XLSXToDocument uses pandas and openpyxl to read spreadsheets. You can create one document per sheet or one document per row. In row mode, the component uses the column named in content_column as the document content and moves the remaining columns into metadata. The component preserves metadata from the ByteStream input and records the sheet name and row index for traceability.
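
Row mode can be pictured with a minimal sketch in plain Python. This is an illustration, not the component's actual implementation: the SimpleDocument class, the rows_to_documents function, and the metadata keys sheet_name and row_index are assumptions made for this example, and the rows are given as dictionaries rather than read from a real XLSX file.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: content plus metadata."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    rows: List[Dict[str, Any]],
    sheet_name: str,
    content_column: str = "content",
    source_meta: Optional[Dict[str, Any]] = None,
) -> List[SimpleDocument]:
    """Build one document per row (illustrative helper, not the real API)."""
    documents = []
    for row_index, row in enumerate(rows):
        if content_column not in row:
            raise KeyError(f"Column '{content_column}' not found in sheet '{sheet_name}'")
        # Every column except the content column is carried over as metadata.
        row_meta = {key: value for key, value in row.items() if key != content_column}
        # Keep the source metadata and record where the row came from.
        # The key names "sheet_name" and "row_index" are assumed for illustration.
        meta = {**(source_meta or {}), **row_meta,
                "sheet_name": sheet_name, "row_index": row_index}
        documents.append(SimpleDocument(content=str(row[content_column]), meta=meta))
    return documents


rows = [
    {"summary": "Q1 revenue grew", "region": "EMEA"},
    {"summary": "Q1 revenue fell", "region": "APAC"},
]
docs = rows_to_documents(
    rows,
    sheet_name="Sheet1",
    content_column="summary",
    source_meta={"file_name": "report.xlsx"},
)
print(docs[0].content)
print(docs[1].meta)
```

In this sketch, the summary column becomes the document content, while region, the source metadata, the sheet name, and the row index all end up in meta.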

Usage Example

Using the Component in an Index

components:
  file_router:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv
  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters:
      document_per: row
      content_column: summary
  csv_converter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:
  - sender: file_router.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_router.text/csv
    receiver: csv_converter.sources
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: csv_converter.documents
    receiver: joiner.documents

  - sender: joiner.documents
    receiver: DocumentWriter.documents

inputs:
  files:
    - file_router.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Builder:

Parameter | Type | Default | Description
document_per | Literal["sheet", "row"] | "sheet" | Create a document per worksheet or per row.
content_column | str | "content" | Column that holds the content when document_per is set to row.
sheet_name | Union[str, int, List[Union[str, int]], None] | None | Limit conversion to one sheet, several sheets, or leave None to read all sheets.
kwargs | Dict[str, Any] |  | Arguments forwarded to pandas.read_excel, such as engine, skiprows, or nrows.

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Default | Description
sources | List[Union[str, Path, ByteStream]] |  | XLSX file paths or ByteStreams.
meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata applied to every generated document or aligned per source entry.
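
The two accepted shapes of meta can be illustrated with a short sketch. It assumes a hypothetical helper, align_meta, that is not part of the component's documented API. Raising an error when a list does not match the number of sources is also an assumption made for this example.

```python
from typing import Any, Dict, List, Optional, Union

MetaInput = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def align_meta(meta: MetaInput, source_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source (illustrative helper)."""
    if meta is None:
        return [{} for _ in range(source_count)]
    if isinstance(meta, dict):
        # A single dictionary applies to every source.
        return [dict(meta) for _ in range(source_count)]
    if len(meta) != source_count:
        # Assumed behavior: reject lists that do not line up with the sources.
        raise ValueError("meta list length must match the number of sources")
    # A list is matched to the sources by position.
    return [dict(item) for item in meta]


shared = align_meta({"department": "finance"}, source_count=2)
per_source = align_meta([{"year": 2023}, {"year": 2024}], source_count=2)
print(shared)
print(per_source)
```

A single dictionary is copied for each source, while a list is matched to the sources by position.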