S3Downloader

Download files from AWS S3 buckets to the local filesystem. Use this component in indexes when your documents reference files stored in S3 that need to be processed locally.

Key Features

  • Downloads files from Amazon S3 buckets for use in pipeline processing.
  • Supports concurrent downloads for improved performance.
  • Filters downloads by file extension to reduce unnecessary processing.
  • Caches downloaded files to avoid redundant downloads.
  • Supports custom S3 key generation functions if your S3 file structure doesn't match your document metadata.
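As a rough illustration of how concurrent downloads and caching can work together, here is a minimal sketch. It is not the component's source code: `DownloadCache`, `download_all`, and `fake_fetch` are hypothetical names, the least-recently-used eviction policy is an assumption, and `fake_fetch` stands in for a real S3 download.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class DownloadCache:
    """Tiny LRU cache mapping a file name to its local path (illustrative only)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._paths: "OrderedDict[str, str]" = OrderedDict()

    def get(self, name: str) -> Optional[str]:
        if name in self._paths:
            self._paths.move_to_end(name)  # mark as most recently used
            return self._paths[name]
        return None

    def put(self, name: str, path: str) -> None:
        self._paths[name] = path
        self._paths.move_to_end(name)
        while len(self._paths) > self.max_size:
            self._paths.popitem(last=False)  # evict the least recently used entry


def download_all(
    names: List[str],
    fetch: Callable[[str], str],
    cache: DownloadCache,
    max_workers: int = 32,
) -> List[str]:
    """Return a local path for every name, fetching only files that are not cached."""
    results: Dict[str, str] = {}
    missing: List[str] = []
    for name in dict.fromkeys(names):  # de-duplicate while keeping order
        cached = cache.get(name)
        if cached is None:
            missing.append(name)
        else:
            results[name] = cached
    # Fetch the missing files in parallel with a bounded worker pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, path in zip(missing, pool.map(fetch, missing)):
            cache.put(name, path)
            results[name] = path
    return [results[name] for name in names]


calls: List[str] = []


def fake_fetch(name: str) -> str:
    """Hypothetical stand-in for an S3 download; records the call and returns a path."""
    calls.append(name)
    return f"/tmp/s3/{name}"


cache = DownloadCache(max_size=100)
first = download_all(["a.pdf", "b.txt", "a.pdf"], fake_fetch, cache)
second = download_all(["a.pdf"], fake_fetch, cache)  # served from the cache
print(first, second, sorted(calls))
```

In this sketch, the second call triggers no new fetch because `a.pdf` is already cached. The `max_workers` and `max_size` arguments play the roles that `max_workers` and `max_cache_size` play in the component's configuration.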

Configuration

To use this component, connect Haystack Platform with your AWS account. Create secrets with the following keys:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_DEFAULT_REGION

For details on how to create secrets, see Add Secrets.

Then configure the component in Pipeline Builder:

  1. Drag the S3Downloader component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. On the General tab:
    • Set file_extensions to specify which file types to download. For example, [".pdf", ".txt"].
    • Set file_name_meta_key to the metadata field in your documents that contains the S3 file name.
    • Optionally, set file_root_path to the local directory where files will be downloaded.
  4. Go to the Advanced tab to configure additional settings, such as max_workers, max_cache_size, s3_key_generation_function, and boto3_config.

File Extension Filtering

You can use the file_extensions parameter to download only specific file types, reducing unnecessary downloads and processing time. For example, file_extensions=[".pdf", ".txt"] downloads only PDF and TXT files while skipping others.
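The filtering logic can be pictured with a short sketch. This is an illustration of the idea, not the component's actual code: `should_download` is a hypothetical helper, the sample metadata is made up, and case-insensitive matching is an assumption of this sketch.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def should_download(file_name: str, file_extensions: Optional[List[str]]) -> bool:
    """Return True if the file passes the extension filter.

    None means no filter, so every extension is allowed (the documented default).
    """
    if file_extensions is None:
        return True
    # Assumption of this sketch: extensions are compared case-insensitively.
    allowed = {ext.lower() for ext in file_extensions}
    return PurePosixPath(file_name).suffix.lower() in allowed


# Hypothetical document metadata, for illustration only.
metas: List[Dict[str, str]] = [
    {"file_name": "report.pdf"},
    {"file_name": "notes.txt"},
    {"file_name": "photo.png"},
]

selected = [
    m["file_name"] for m in metas if should_download(m["file_name"], [".pdf", ".txt"])
]
print(selected)  # ['report.pdf', 'notes.txt']
```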

Custom S3 Key Generation

By default, the component uses the value stored under file_name_meta_key in the Document metadata (file_name unless you change it) as the S3 key. If your S3 file structure doesn't match the file names in metadata, you can provide an optional s3_key_generation_function to customize how S3 keys are generated from Document metadata.
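As a sketch of what such a function might look like, the example below builds a key from several metadata fields. It assumes the function receives a single Document and returns the key as a string. The `Document` class here is a minimal stand-in so the example runs on its own, and `build_s3_key` and the `tenant` and `year` metadata fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Minimal stand-in for Haystack's Document so the sketch is self-contained."""

    content: str = ""
    meta: Dict[str, Any] = field(default_factory=dict)


def build_s3_key(document: Document) -> str:
    """Build an S3 key of the form '<tenant>/<year>/<file_name>' from metadata.

    The 'tenant' and 'year' fields are hypothetical; use whatever your documents carry.
    """
    meta = document.meta
    return f"{meta['tenant']}/{meta['year']}/{meta['file_name']}"


doc = Document(meta={"tenant": "acme", "year": 2024, "file_name": "report.pdf"})
print(build_s3_key(doc))  # acme/2024/report.pdf
```

In a pipeline YAML, you would then reference the function by its import path in s3_key_generation_function.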

Connections

S3Downloader receives documents that contain S3 file references in their metadata. Connect a converter's documents output to its documents input.

It outputs the same documents with the file_path metadata updated to point to the downloaded local files. Connect its documents output to a document cleaner, splitter, or other preprocessor.

Source Code

To check this component's source code, open s3_downloader.py in the Haystack Core Integrations repository.

Usage Examples

Basic Configuration

  s3_downloader:
    type: haystack_integrations.components.downloaders.s3.s3_downloader.S3Downloader
    init_parameters:
      aws_access_key_id:
        type: env_var
        env_vars:
          - AWS_ACCESS_KEY_ID
        strict: false
      aws_secret_access_key:
        type: env_var
        env_vars:
          - AWS_SECRET_ACCESS_KEY
        strict: false
      aws_session_token:
        type: env_var
        env_vars:
          - AWS_SESSION_TOKEN
        strict: false
      aws_region_name:
        type: env_var
        env_vars:
          - AWS_DEFAULT_REGION
        strict: false
      aws_profile_name:
        type: env_var
        env_vars:
          - AWS_PROFILE
        strict: false
      file_extensions:
        - .pdf
        - .txt
        - .docx
        - .html
      file_name_meta_key: file_name
      max_workers: 32
      max_cache_size: 100
      s3_key_generation_function: deepset_cloud_custom_nodes.utils.storage.get_s3_key

This is an example indexing pipeline with S3Downloader to download and process files from S3:

components:
  s3_downloader:
    type: haystack_integrations.components.downloaders.s3.s3_downloader.S3Downloader
    init_parameters:
      aws_access_key_id:
        type: env_var
        env_vars:
          - AWS_ACCESS_KEY_ID
        strict: false
      aws_secret_access_key:
        type: env_var
        env_vars:
          - AWS_SECRET_ACCESS_KEY
        strict: false
      aws_session_token:
        type: env_var
        env_vars:
          - AWS_SESSION_TOKEN
        strict: false
      aws_region_name:
        type: env_var
        env_vars:
          - AWS_DEFAULT_REGION
        strict: false
      aws_profile_name:
        type: env_var
        env_vars:
          - AWS_PROFILE
        strict: false
      boto3_config:
      file_root_path:
      file_extensions:
        - .pdf
        - .txt
        - .docx
        - .html
      file_name_meta_key: file_name
      max_workers: 32
      max_cache_size: 100
      s3_key_generation_function: deepset_cloud_custom_nodes.utils.storage.get_s3_key

  converter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  cleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      keep_id: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
      split_threshold: 0

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'default'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections:
  - sender: cleaner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: converter.documents
    receiver: s3_downloader.documents
  - sender: s3_downloader.documents
    receiver: cleaner.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - converter.sources

Parameters

Inputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents with S3 file references in their metadata. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | Documents with the file_path metadata updated to point to the downloaded local files. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| aws_access_key_id | Optional[Secret] | Secret.from_env_var('AWS_ACCESS_KEY_ID') | AWS access key ID. |
| aws_secret_access_key | Optional[Secret] | Secret.from_env_var('AWS_SECRET_ACCESS_KEY') | AWS secret access key. |
| aws_session_token | Optional[Secret] | Secret.from_env_var('AWS_SESSION_TOKEN') | AWS session token for temporary credentials. |
| aws_region_name | Optional[Secret] | Secret.from_env_var('AWS_DEFAULT_REGION') | AWS region name. |
| aws_profile_name | Optional[Secret] | Secret.from_env_var('AWS_PROFILE') | AWS profile name. |
| boto3_config | Optional[Dict[str, Any]] | None | Configuration for the boto3 client. |
| file_root_path | Optional[str] | None | The path where files are downloaded. Can be set through this parameter or the FILE_ROOT_PATH environment variable. If the specified directory doesn't exist, it's created. |
| file_extensions | Optional[List[str]] | None | File extensions permitted for download. By default, all file extensions are allowed. |
| file_name_meta_key | str | "file_name" | The metadata key that contains the file name to download. |
| max_workers | int | 32 | Maximum number of workers for concurrent downloads. |
| max_cache_size | int | 100 | Maximum number of files to cache. |
| s3_key_generation_function | Optional[Callable] | None | A function to generate S3 keys from documents. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. You can pass them at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents with S3 file references in their metadata. |