Upload Files with Python

Use this method if you have many files to upload or want to upload files with metadata.

About This Task

Uploading files using the Python methods included in the SDK is asynchronous and uses sessions under the hood. It's best for uploading large numbers of files with metadata. To upload files using this method, create and run a Python script. You can find example scripts at the bottom of this page.

To learn more, see also Synchronous and asynchronous upload and Working with Metadata.

Sessions

Asynchronous upload uses the mechanism of sessions to upload your files to deepset Cloud. A session stores the ingestion status of the files: the number of failed and finished files. Each session has an ID so you can check its details anytime.

A session starts when you initiate the upload. For the SDK, it opens when you call the upload method or command and closes when the upload is finished. A session expires after 24 hours. You can have a maximum of 10 open sessions.

When using the SDK, you don't have to manage sessions yourself; the SDK takes care of opening and closing them for you. They're there if you want to check the status of your past and current uploads.
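
If you want to check on your past or current uploads, here's a minimal sketch for listing your sessions. It assumes your SDK version exposes the list_upload_sessions helper in the sync client; check the SDK reference for the exact fields it returns:

from deepset_cloud_sdk.workflows.sync_client.files import list_upload_sessions

# Assumes list_upload_sessions is available in your SDK version.
# Prints the details of each session, including its ID and ingestion status.
for session in list_upload_sessions(
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
):
    print(session)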

Folder Structure

You don't need to follow any specific folder structure. If your folder contains files with the same name, all these files are uploaded by default. You can set the write mode to overwrite the files, keep them all, or fail the upload.
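
For example, to overwrite files that share a name, you can pass a write mode to the upload call. This is a minimal sketch; it assumes upload() accepts the same write_mode parameter that the upload_texts examples further down this page use:

from pathlib import Path

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, upload

upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    # WriteMode.OVERWRITE replaces files with the same name;
    # WriteMode.KEEP and WriteMode.FAIL are the other options described above
    write_mode=WriteMode.OVERWRITE,
)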

File Extensions

Make sure your files have lowercase extensions, for example, my_file.pdf, instead of my_file.PDF. The SDK doesn't upload files with uppercase extensions.
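
If some of your files have uppercase extensions, you can normalize them before uploading. Here's a minimal sketch that uses only the standard library:

from pathlib import Path

folder = Path("<your_path_to_the_upload_folder>")
for path in folder.rglob("*"):
    # Renames, for example, my_file.PDF to my_file.pdf
    if path.is_file() and path.suffix != path.suffix.lower():
        path.rename(path.with_suffix(path.suffix.lower()))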


Prerequisites

  1. Install the SDK. You can install it with pip, as shown below.
  2. Generate an API Key to connect to a deepset Cloud workspace.
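
The SDK is available on PyPI. For example:

pip install deepset-cloud-sdk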

Example Upload Scripts

Upload From a Folder

Here are examples of a synchronous and an asynchronous way to upload files from a folder.

Note: When using Jupyter Notebooks, apply nest_asyncio before loading the SDK. You can install it with pip install nest-asyncio.

import nest_asyncio
nest_asyncio.apply()

Synchronous upload:

from pathlib import Path

from deepset_cloud_sdk.workflows.sync_client.files import upload

# Uploads all files from a given path
upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    blocking=True,  # waits until the files are displayed in deepset Cloud,
    # this may take a couple of minutes
    timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
    show_progress=True,  # shows the progress bar
    recursive=True,  # uploads files from all subfolders as well
)
Asynchronous upload:

import asyncio
from pathlib import Path

from deepset_cloud_sdk.workflows.async_client.files import upload

# Uploads all files from a given path.
async def my_async_context() -> None:
    await upload(
        paths=[Path("<your_path_to_the_upload_folder>")],
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",
        blocking=True,  # waits until the files are displayed in deepset Cloud,
        # this may take a couple of minutes
        timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
        show_progress=True,  # shows the progress bar
        recursive=True,  # uploads files from all subfolders as well
    )

# Run the async function
if __name__ == "__main__":
    asyncio.run(my_async_context())

Upload Bytes

You can upload files as bytes to a deepset Cloud workspace. This method is suitable for all file types. Here are examples of a synchronous and an asynchronous way to do this:

Synchronous upload:

from deepset_cloud_sdk.workflows.sync_client.files import upload_bytes, DeepsetCloudFileBytes

upload_bytes(
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>", # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
    files=[
        DeepsetCloudFileBytes(
            name="example.txt",
            file_bytes=b"this is text",
            meta={"key": "value"},  # optional
        )
    ],
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
)
Asynchronous upload:

import asyncio

from deepset_cloud_sdk.workflows.async_client.files import upload_bytes, DeepsetCloudFileBytes

async def my_async_context() -> None:
    await upload_bytes(
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[
          DeepsetCloudFileBytes(
              name="example.txt",
              file_bytes=b"this is some byte text",
              meta={"key": "value"},  # optional
          )
        ],
        blocking=True,  # optional, by default True
        timeout_s=300,  # optional, by default 300
    )

# Run the async function
if __name__ == "__main__":
    asyncio.run(my_async_context())

Synchronize GitHub Files with deepset Cloud

Here's an example script that fetches the contents of TXT and MD files from GitHub as text and sends them to deepset Cloud without saving them to disk.

from typing import List
import httpx
from urllib.parse import urlparse

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key here
API_KEY: str = "<YOUR-API-KEY>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_prepare_files(urls: List[str]) -> List[DeepsetCloudFile]:
    """Fetches files from URLs and converts them to DeepsetCloudFile objects.

    These objects can be uploaded to deepset Cloud directly
    without having to first copy them to disk.

    :param urls: List of URLs to fetch files from
    :return: List of DeepsetCloudFile objects
    """
    files_to_upload: List[DeepsetCloudFile] = []

    for url in urls:
        response = httpx.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        files_to_upload.append(file)

    return files_to_upload


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

files = fetch_and_prepare_files(DOWNLOAD_URLS)

# Upload .txt and .md files to deepset Cloud
upload_texts(
    workspace_name="upload-test-123",  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
    files=files,
    blocking=False,  # Set to False for non-blocking uploads
    timeout_s=300,  # Optional, default is 300 seconds
    show_progress=True,  # Optional, default is True
    api_key=API_KEY,
    write_mode=WriteMode.OVERWRITE,
)

Download Files from a URL and Upload to deepset Cloud

Using Multiprocessing

This script downloads TXT and MD files from the URLs you specify and then uploads them to deepset Cloud using a pool of worker processes. Note that the maximum concurrency (processes in multiprocessing.Pool(processes=3)) is limited by the number of cores in your system. For maximum utilization, you can use multiprocessing.cpu_count() to set the number of processes.

import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly
    without having to first copy it to disk.

    :param url: URL to fetch the file from
    """
    response = httpx.get(url)
    response.raise_for_status()

    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )

    upload_texts(
        workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
        files=[file],
        blocking=False,  # Set to False for non-blocking uploads
        timeout_s=300,  # Optional, default is 300 seconds
        show_progress=True,  # Optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

if __name__ == '__main__':
    # Upload .txt and .md files to deepset Cloud
    # Start one worker process per URL to download and upload the files
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)

Async

This example downloads files from a URL you specify and then uploads them asynchronously to deepset Cloud:

import asyncio
import httpx
from typing import List
from urllib.parse import urlparse

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, DeepsetCloudFile
from deepset_cloud_sdk.workflows.async_client.files import upload_texts

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


async def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly
    without having to first copy it to disk.

    :param url: URL to fetch the file from
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )

        await upload_texts(
            workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" environment variable by default
            files=[file],
            blocking=False,  # Set to False for non-blocking uploads
            timeout_s=300,  # Optional, default is 300 seconds
            show_progress=True,  # Optional, default is True
            api_key=API_KEY,
            write_mode=WriteMode.OVERWRITE,
        )


async def main(urls: List[str]) -> None:
    """Main function to run the asynchronous fetching and uploading of files."""
    tasks = [fetch_and_upload_file(url) for url in urls]
    await asyncio.gather(*tasks)


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

# Run the main function
if __name__ == "__main__":
    asyncio.run(main(DOWNLOAD_URLS))

From Memory in Byte Format

Here's an example of how to fetch files (including a PDF) from given URLs, keep them in memory as bytes, and upload them to deepset Cloud:

import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import (
    DeepsetCloudFileBytes,
    WriteMode,
    upload_bytes,
)

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That Object can be uploaded to the Deepset Cloud directly without
    having to first copy them to disk.

    :param url: URL to fetch files from
    """
    response = httpx.get(url)
    response.raise_for_status()

    file = DeepsetCloudFileBytes(
        file_bytes=response.content,
        name=_parse_filename(url),
        meta={"url": url},
    )

    upload_bytes(
        workspace_name=WORKSPACE,  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[file],
        # by default blocking=True - by setting to False it will mean that you can immediately
        # continue uploading another batch of files
        blocking=False,
        timeout_s=300,  # optional, by default 300
        show_progress=True,  # optional, by default True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

if __name__ == '__main__':
    # Upload the .txt, .md, and .pdf files to deepset Cloud
    # Start one worker process per URL to download and upload the files
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)

Google Colab Notebook

Here's a Colab notebook with different upload scenarios you can test: Upload files with SDK in Google Colab.