XLSXToDocument
Turn XLSX worksheets or rows into documents. This component uses pandas and openpyxl to read spreadsheets.
This component is deprecated. It will continue to work in your existing pipelines for now. You can replace it with the XLSXToDocument component from Haystack.
Basic Information
- Type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
- Components it can connect with:
  - FileTypeRouter: Route XLSX files to XLSXToDocument.
  - DocumentJoiner: Combine spreadsheet output with documents produced by other converters.
  - DocumentSplitter or downstream embedding components that accept document lists.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | Paths or ByteStreams that point to XLSX files. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata forwarded to every document or aligned per source. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents that contain CSV content or the selected row content plus merged metadata. |
Overview
XLSXToDocument uses pandas and openpyxl to read spreadsheets. You can create one document per sheet or per row. When you choose row mode, the component picks the column you set in content_column for the document content and moves the remaining columns into metadata. The component preserves metadata from the ByteStream input and notes the sheet name and row index for traceability.
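The row mode described above can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: rows are modeled as dictionaries (roughly what a spreadsheet row looks like once parsed), documents are modeled as plain dictionaries, and the function name `rows_to_documents` and the metadata keys `sheet_name` and `row_index` are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List, Optional


def rows_to_documents(
    rows: List[Dict[str, Any]],
    content_column: str,
    sheet_name: str,
    source_meta: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Build one document-like dict per row (illustrative sketch only)."""
    documents = []
    for row_index, row in enumerate(rows):
        if content_column not in row:
            raise KeyError(
                f"Column {content_column!r} not found in sheet {sheet_name!r}"
            )
        # Every column except the content column becomes metadata.
        row_meta = {key: value for key, value in row.items() if key != content_column}
        meta = {
            **(source_meta or {}),  # metadata carried over from the source
            **row_meta,
            # Hypothetical key names for the traceability fields.
            "sheet_name": sheet_name,
            "row_index": row_index,
        }
        documents.append({"content": str(row[content_column]), "meta": meta})
    return documents


rows = [
    {"summary": "Quarterly revenue grew.", "region": "EMEA", "year": 2023},
    {"summary": "Costs stayed flat.", "region": "APAC", "year": 2023},
]
docs = rows_to_documents(
    rows,
    content_column="summary",
    sheet_name="Report",
    source_meta={"file_name": "report.xlsx"},
)
for doc in docs:
    print(doc["content"], doc["meta"])
```

In this sketch, the `summary` column supplies the document content, while `region` and `year` end up in metadata next to the source metadata and the sheet and row markers.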
Usage Example
Using the Component in an Index
components:
  file_router:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv
  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters:
      document_per: row
      content_column: summary
  csv_converter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
connections:
  - sender: file_router.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_router.text/csv
    receiver: csv_converter.sources
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: csv_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - file_router.sources
max_runs_per_component: 100
metadata: {}
Parameters
Init Parameters
These are the parameters you can configure in Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_per | Literal["sheet", "row"] | "sheet" | Create a document per worksheet or per row. |
| content_column | str | "content" | Column that holds the content when document_per is set to row. |
| sheet_name | Union[str, int, List[Union[str, int]], None] | None | Limit conversion to one sheet, several sheets, or leave None to read all sheets. |
| kwargs | Dict[str, Any] | | Arguments forwarded to pandas.read_excel, such as engine, skiprows, or nrows. |
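To make the default sheet mode and the sheet_name parameter more concrete, here is a small sketch in plain Python. It assumes a workbook that is already parsed into a dictionary of sheets and mimics the selection rules that pandas.read_excel documents for sheet_name: a string is a sheet name, an integer is a zero-based position, a list selects several sheets, and None selects all of them. The helpers `select_sheets` and `sheet_to_csv` and the `sheet_name` metadata key are hypothetical names for this example, and always returning a dictionary keyed by sheet name is a simplification rather than the component's confirmed behavior.

```python
import csv
import io
from typing import Any, Dict, List, Union

Sheet = List[List[Any]]  # header row followed by data rows
SheetSelector = Union[str, int, List[Union[str, int]], None]


def select_sheets(workbook: Dict[str, Sheet], sheet_name: SheetSelector) -> Dict[str, Sheet]:
    """Pick sheets by name (str), zero-based position (int), a list of either, or all (None)."""
    if sheet_name is None:
        return dict(workbook)
    names = list(workbook)
    selectors = sheet_name if isinstance(sheet_name, list) else [sheet_name]
    selected = {}
    for selector in selectors:
        # Integers are positions, strings are sheet names.
        name = names[selector] if isinstance(selector, int) else selector
        selected[name] = workbook[name]
    return selected


def sheet_to_csv(sheet: Sheet) -> str:
    """Render a whole sheet as CSV text, used here as the document content."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(sheet)
    return buffer.getvalue()


workbook = {
    "Sales": [["region", "total"], ["EMEA", 120], ["APAC", 95]],
    "Notes": [["summary"], ["Costs stayed flat."]],
}

# One document-like dict per selected sheet.
documents = [
    {"content": sheet_to_csv(sheet), "meta": {"sheet_name": name}}
    for name, sheet in select_sheets(workbook, sheet_name=0).items()
]
print(documents[0]["content"])
```

Passing `sheet_name=0` keeps only the first sheet, so the sketch produces a single document whose content is the CSV rendering of the "Sales" sheet.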
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | XLSX file paths or ByteStreams. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata applied to every generated document or aligned per source entry. |
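The two accepted shapes of meta can be illustrated with a short sketch. It assumes that a single dictionary is copied to every source, that a list must contain exactly one entry per source, and that a mismatched list length is an error. The helper name `align_meta` is hypothetical, and the exact error handling of the component may differ.

```python
from typing import Any, Dict, List, Optional, Union

MetaInput = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def align_meta(meta: MetaInput, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dict per source (illustrative sketch only)."""
    if meta is None:
        # No metadata given: every source starts with an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict applies to every source; copy it so entries stay independent.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The meta list must have the same length as sources.")
    # A list is aligned with the sources by position.
    return [dict(entry) for entry in meta]


shared = align_meta({"department": "finance"}, sources_count=2)
per_source = align_meta([{"year": 2022}, {"year": 2023}], sources_count=2)
print(shared)
print(per_source)
```

With a single dictionary, both sources receive the same metadata. With a list, the first entry belongs to the first source and the second entry to the second source.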