XLSXToDocument
Turn XLSX worksheets or rows into documents. This component uses pandas and openpyxl to read spreadsheets.
This component is deprecated. It will continue to work in your existing pipelines for now. You can replace it with the XLSXToDocument component.
Key Features
- Creates one document per worksheet or one document per row, controlled by the document_per parameter.
- In row mode, uses the column specified by content_column as the document content and moves other columns into metadata.
- Preserves metadata from ByteStream inputs and records the sheet name and row index for traceability.
- Limits conversion to specific sheets using the sheet_name parameter.
- Forwards additional arguments to pandas.read_excel for fine-grained control.
- Integrates with FileTypeRouter and DocumentJoiner in multi-format indexing pipelines.
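The row-mode behavior described above can be sketched in plain Python. This is only an illustration, not the component's actual code: rows are modeled as dictionaries, rows_to_documents is a hypothetical helper, and the metadata keys sheet_name and row_index, as well as the merge order, are assumptions.

```python
from typing import Any, Dict, List


def rows_to_documents(
    rows: List[Dict[str, Any]],
    content_column: str,
    sheet_name: str,
    source_meta: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for row mode: one document per spreadsheet row."""
    documents = []
    for row_index, row in enumerate(rows):
        # The column named by content_column becomes the document content.
        content = str(row[content_column])
        # Every other column is moved into metadata.
        row_meta = {key: value for key, value in row.items() if key != content_column}
        # Assumed key names and merge order: source metadata first, then the
        # row's columns, then the sheet name and row index for traceability.
        meta = {**source_meta, **row_meta, "sheet_name": sheet_name, "row_index": row_index}
        documents.append({"content": content, "meta": meta})
    return documents


rows = [
    {"summary": "Login fails on mobile", "priority": "high"},
    {"summary": "Typo on pricing page", "priority": "low"},
]
docs = rows_to_documents(
    rows,
    content_column="summary",
    sheet_name="Tickets",
    source_meta={"file_name": "tickets.xlsx"},
)
```

Here each row yields one document whose content is the summary cell, while priority ends up in the document's metadata alongside the source metadata.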
Configuration
- Drag the XLSXToDocument component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
XLSXToDocument accepts a list of file paths or ByteStream objects (sources) as input, along with optional metadata (meta). It outputs a list of converted documents (documents).
Typically, XLSXToDocument receives XLSX files routed from FileTypeRouter and sends its output to DocumentJoiner, which combines documents from multiple converters before passing them downstream for indexing.
Usage Example
Using the Component in an Index
components:
  file_router:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv
  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters:
      document_per: row
      content_column: summary
  csv_converter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters: {}
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
connections:
  - sender: file_router.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_router.text/csv
    receiver: csv_converter.sources
  - sender: xlsx_converter.documents
    receiver: joiner.documents
  - sender: csv_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - file_router.sources
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | Paths or ByteStreams that point to XLSX files. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata forwarded to every document or aligned per source. |
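The two accepted shapes of meta can be illustrated with a small sketch. It assumes that a single dictionary is copied to every source and that a list must contain exactly one entry per source; normalize_meta is a hypothetical helper, and raising an error on a length mismatch is an assumption rather than confirmed component behavior.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        # No metadata given: every source starts with an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is forwarded (as a copy) to every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumed behavior: a list must align one-to-one with the sources.
        raise ValueError("The length of meta must match the number of sources.")
    return [dict(entry) for entry in meta]


shared = normalize_meta({"department": "finance"}, sources_count=2)
aligned = normalize_meta([{"year": 2023}, {"year": 2024}], sources_count=2)
```

In this sketch, shared holds the same metadata for both sources, while aligned keeps one distinct entry per source in input order.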
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents that contain CSV content or the selected row content plus merged metadata. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_per | Literal["sheet", "row"] | "sheet" | Create a document per worksheet or per row. |
| content_column | str | "content" | Column that holds the content when document_per is set to row. |
| sheet_name | Union[str, int, List[Union[str, int]], None] | None | Limit conversion to one sheet, several sheets, or leave None to read all sheets. |
| kwargs | Dict[str, Any] | | Arguments forwarded to pandas.read_excel, such as engine, skiprows, or nrows. |
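The accepted forms of sheet_name follow the pandas.read_excel convention: a string selects a sheet by name, an integer selects it by zero-indexed position, a list selects several sheets, and None selects all of them. The sketch below mimics that selection on an ordinary dictionary standing in for a workbook; select_sheets is a hypothetical helper, and unlike pandas it always returns a dictionary keyed by the resolved sheet name.

```python
from typing import Any, Dict, List, Union

SheetSelector = Union[str, int, List[Union[str, int]], None]


def select_sheets(sheets: Dict[str, Any], sheet_name: SheetSelector) -> Dict[str, Any]:
    """Hypothetical helper: pick worksheets from an ordered name-to-data mapping."""
    if sheet_name is None:
        # None means every worksheet is read.
        return dict(sheets)
    names = list(sheets)  # Sheet names in workbook order.
    selectors = sheet_name if isinstance(sheet_name, list) else [sheet_name]
    selected = {}
    for selector in selectors:
        # Integers are zero-indexed positions; strings are sheet names.
        name = names[selector] if isinstance(selector, int) else selector
        selected[name] = sheets[name]
    return selected


workbook = {"Summary": ["..."], "Q1": ["..."], "Q2": ["..."]}
all_sheets = select_sheets(workbook, None)
one_sheet = select_sheets(workbook, "Q1")
mixed = select_sheets(workbook, [0, "Q2"])
```

Mixing positions and names in one list, as in the last call, picks the first sheet and the sheet named Q2.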
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | XLSX file paths or ByteStreams. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Metadata applied to every generated document or aligned per source entry. |