DeepsetFileDownloader

Download files referenced by documents to the local filesystem. Use this component in visual question answering pipelines to fetch PDFs and images before conversion.

Deprecation Notice

This component is deprecated. Use S3Downloader from the Haystack AWS integration instead. Existing pipelines that use this component continue to work for now.

Key Features

  • Downloads files associated with documents from platform storage to local disk.
  • Filters downloads by file extension.
  • Caches downloaded files locally with a configurable cache size.
  • Returns updated documents with file_path set in metadata.
  • Supports concurrent downloads for multiple files.

Configuration

  1. Drag the DeepsetFileDownloader component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Set file_extensions to limit which file types are downloaded.
  4. Configure sources_target_type and max_cache_size as needed.

Warm-up required

This component requires a warm-up step before it can download files. The platform handles this automatically when the pipeline runs.

Connections

DeepsetFileDownloader accepts documents or sources as input. It outputs documents with file_path metadata and sources in the configured target type.

Connect a Ranker or retriever that returns documents with file_id metadata to the input. Connect the documents or sources output to DeepsetPDFDocumentToBase64Image, DeepsetFileToBase64Image, or a visual Generator.
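
The wiring described above could look roughly like the following connections fragment. This is a sketch only: ranker, image_downloader, and pdf_to_image are placeholder component names, and it assumes the upstream component exposes a documents output whose documents carry file_id metadata.

```yaml
connections:
  # Hypothetical upstream component named "ranker"; its documents
  # are assumed to carry file_id in their metadata.
  - sender: ranker.documents
    receiver: image_downloader.documents
  # The downloader's documents output (now with file_path in metadata)
  # feeds the PDF-to-image converter.
  - sender: image_downloader.documents
    receiver: pdf_to_image.documents
```

Use the names you gave your own components in place of the placeholders.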

Usage Example

This example downloads PDF files before converting them to images:

components:
  image_downloader:
    type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
    init_parameters:
      file_extensions:
        - .pdf
      sources_target_type: str
      max_cache_size: 100

  pdf_to_image:
    type: deepset_cloud_custom_nodes.converters.pdf_to_image.DeepsetPDFDocumentToBase64Image
    init_parameters:
      detail: auto

connections:
  - sender: image_downloader.documents
    receiver: pdf_to_image.documents

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | Optional[List[Document]] | None | Documents with file_id in metadata to download. |
| sources | Optional[List[Union[ByteStream, UUID, str]]] | None | File IDs or byte streams to download. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents with file_path set in metadata. |
| sources | List[Union[str, Path, ByteStream]] | | Downloaded file paths or byte streams in the configured target type. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file_extensions | Optional[List[str]] | None | File extensions to download, such as [".pdf", ".png"]. If None, all file types are downloaded. |
| sources_target_type | Literal["str", "pathlib.Path", "haystack.dataclasses.ByteStream"] | str | Type of the sources returned in the output. |
| max_cache_size | int | 100 | Maximum number of files to keep in the local cache. |
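
As an illustration of how these parameters combine, the fragment below sketches a downloader limited to PDF and PNG files that returns its sources as byte streams and keeps a smaller cache. The component name file_downloader and the cache size of 50 are arbitrary choices for this example; the parameter names and allowed values are taken from the table above.

```yaml
components:
  file_downloader:  # placeholder component name
    type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
    init_parameters:
      # Only files with these extensions are downloaded.
      file_extensions:
        - .pdf
        - .png
      # One of: str, pathlib.Path, haystack.dataclasses.ByteStream
      sources_target_type: haystack.dataclasses.ByteStream
      # Keep at most 50 files in the local cache (default is 100).
      max_cache_size: 50
```

Omitting file_extensions leaves it at None, in which case all file types are downloaded.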

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | Optional[List[Document]] | None | Documents with file_id in metadata to download. |
| sources | Optional[List[Union[ByteStream, UUID, str]]] | None | File IDs or byte streams to download. |