Upload Files with Python

Use this method if you have many files to upload or want to upload files with metadata.

About This Task

Uploading files with the Python methods included in the SDK is asynchronous and uses sessions under the hood. It's best suited to uploading large numbers of files together with metadata. To upload files using this method, create and run a Python script. You can find example scripts at the bottom of this page.

To learn more, see also Synchronous and asynchronous upload and Working with Metadata.

Sessions

Asynchronous upload uses sessions to send your files to deepset Cloud. A session stores the ingestion status of the files: the number of failed and finished files. Each session has an ID, so you can check its details at any time.

A session starts when you initiate the upload. With the SDK, it opens when you call the upload method or command and closes when the upload finishes. A session expires after 24 hours, and you can have a maximum of 10 open sessions.

When using the SDK, you don't have to manage sessions yourself, as the SDK takes care of opening and closing them for you. They're there in case you want to check the status of your past and current uploads.

Folder Structure

You don't need to follow any specific folder structure. If your folder contains files with the same name, all of these files are uploaded by default. You can change this by setting the write mode to overwrite the duplicates, keep them all, or fail the upload.

Prerequisites

  1. Install the SDK
  2. Generate an API Key to connect to a deepset Cloud workspace.

Upload Scripts Examples

Upload Files From a Folder

Here's an example script you can use:

from pathlib import Path

from deepset_cloud_sdk.workflows.sync_client.files import upload

## Uploads all files from a given path.
upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    blocking=True,  # waits until the files are displayed in deepset Cloud,
    # this may take a couple of minutes
    timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
    show_progress=True,  # shows the progress bar
    recursive=True,  # uploads files from all subfolders as well
)
Here's the same upload using the asynchronous client. Note that you need to run the coroutine, for example with asyncio.run():

import asyncio
from pathlib import Path

from deepset_cloud_sdk.workflows.async_client.files import upload

async def my_async_context() -> None:
    await upload(
        paths=[Path("<your_path_to_the_upload_folder>")],
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",
        blocking=True,  # waits until the files are displayed in deepset Cloud,
        # this may take a couple of minutes
        timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
        show_progress=True,  # shows the progress bar
        recursive=True,  # uploads files from all subfolders as well
    )

asyncio.run(my_async_context())
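To attach metadata to files uploaded from a folder, deepset Cloud expects a sidecar JSON file next to each file, named after the file with a .meta.json suffix (see Working with Metadata for the authoritative description of this convention). Here's a minimal sketch that prepares such a pair; the folder name and metadata keys are made up for illustration:

```python
import json
from pathlib import Path

folder = Path("upload_folder")  # hypothetical upload folder
folder.mkdir(exist_ok=True)

# The file to upload and its metadata sidecar
(folder / "example.txt").write_text("this is text")
meta = {"source": "manual", "year": 2024}
(folder / "example.txt.meta.json").write_text(json.dumps(meta))

print(sorted(p.name for p in folder.iterdir()))
# → ['example.txt', 'example.txt.meta.json']
```

Once the folder looks like this, a single upload() call picks up both the files and their metadata.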

Upload Texts

You can upload raw text to a deepset Cloud workspace, like this:

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, DeepsetCloudFile

upload_texts(
    workspace_name="<default_workspace>", # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
    files=[
        DeepsetCloudFile(
            name="example.txt",
            text="this is text",
            meta={"key": "value"},  # optional
        )
    ],
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
)
Here's the asynchronous version, which you run with asyncio.run():

import asyncio

from deepset_cloud_sdk.workflows.async_client.files import upload_texts, DeepsetCloudFile

async def my_async_context() -> None:
    await upload_texts(
        workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[
            DeepsetCloudFile(
                name="example.txt",
                text="this is text",
                meta={"key": "value"},  # optional
            )
        ],
        blocking=True,  # optional, by default True
        timeout_s=300,  # optional, by default 300
    )

asyncio.run(my_async_context())

Synchronize GitHub Files with deepset Cloud

Here's an example script that loads TXT and PDF files from GitHub and sends them to deepset Cloud. It fetches the file contents as text and forwards them to deepset Cloud without writing them to disk first.

from typing import List
import httpx
from urllib.parse import urlparse

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key here
API_KEY: str = "<YOUR-API-KEY>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_prepare_files(urls: List[str]) -> List[DeepsetCloudFile]:
    """Fetches files from URLs and converts them to DeepsetCloudFile objects.

    These objects can be uploaded to deepset Cloud directly without
    having to first copy them to disk.

    :param urls: List of URLs to fetch files from
    :return: List of DeepsetCloudFile objects
    """
    files_to_upload: List[DeepsetCloudFile] = []

    for url in urls:
        response = httpx.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        files_to_upload.append(file)

    return files_to_upload


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

files = fetch_and_prepare_files(DOWNLOAD_URLS)

# Upload .txt and .pdf files to deepset Cloud
upload_texts(
    workspace_name="upload-test-123",  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
    files=files,
    blocking=False,  # Set to False for non-blocking uploads
    timeout_s=300,  # Optional, default is 300 seconds
    show_progress=True,  # Optional, default is True
    api_key=API_KEY,
    write_mode=WriteMode.OVERWRITE,
)

Download Files from a URL and Upload Them to deepset Cloud with Multiprocessing

This script downloads TXT files from the URLs you specify and then uploads them to deepset Cloud using a pool of worker processes. Note that although this example uses four processes (multiprocessing.Pool(processes=4)), the maximum concurrency is 10.

import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the fiven URL and converts it to a DeepsetCloudFile object.

    That Object can be uploaded to the Deepset Cloud directly without
    having to first copy them to disk.

    :param urls: List of URLs to fetch files from
    """
    response = httpx.get(url)
    response.raise_for_status()

    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )

    upload_texts(
        workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
        files=[file],
        blocking=False,  # Set to False for non-blocking uploads
        timeout_s=300,  # Optional, default is 300 seconds
        show_progress=True,  # Optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

# Upload .txt and .pdf files to deepset Cloud
# We use a pool of worker processes to download and upload the files in parallel
with multiprocessing.Pool(processes=4) as pool:  # the maximum number of processes is 10
    results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
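Because downloading and uploading are I/O-bound, a thread pool works just as well as a process pool here and avoids the overhead of spawning processes. This sketch shows the same map-style pattern with concurrent.futures; a stand-in function replaces the real fetch_and_upload_file so the example runs without network access or an API key:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_and_upload_file(url: str) -> str:
    # Stand-in for the real download-and-upload step above;
    # returns a string so the result is easy to inspect.
    return f"uploaded {url}"

urls = [
    "https://example.com/a.txt",
    "https://example.com/b.txt",
]

# At most 4 URLs are processed concurrently; map preserves input order
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_and_upload_file, urls))

print(results)
# → ['uploaded https://example.com/a.txt', 'uploaded https://example.com/b.txt']
```

To use threads in the script above, you can also keep multiprocessing and swap Pool for multiprocessing.dummy.Pool, which has the same interface but is backed by threads.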

Download Files from a URL and Upload Them Async

This example downloads files from a URL you specify and then uploads them asynchronously to deepset Cloud:

import asyncio
import httpx
from typing import List
from urllib.parse import urlparse

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, DeepsetCloudFile
from deepset_cloud_sdk.workflows.async_client.files import upload_texts

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


async def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly without
    having to first copy it to disk.

    :param url: URL to fetch file from
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )

        await upload_texts(
            workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
            files=[file],
            blocking=False,  # Set to False for non-blocking uploads
            timeout_s=300,  # Optional, default is 300 seconds
            show_progress=True,  # Optional, default is True
            api_key=API_KEY,
            write_mode=WriteMode.OVERWRITE,
        )


async def main(urls: List[str]) -> None:
    """Main function to run the asynchronous fetching and uploading of files."""
    tasks = [fetch_and_upload_file(url) for url in urls]
    await asyncio.gather(*tasks)


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

# Run the main function
if __name__ == "__main__":
    asyncio.run(main(DOWNLOAD_URLS))
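The asyncio.gather call above starts all downloads at once. If you have many URLs and want to cap how many run concurrently (for example, to stay well within the 10-session limit), an asyncio.Semaphore is a common pattern. Here's a minimal sketch with a dummy coroutine standing in for the real fetch_and_upload_file so it runs without network access:

```python
import asyncio
from typing import List

async def fetch_and_upload_file(url: str) -> str:
    # Dummy stand-in for the real download-and-upload coroutine above
    await asyncio.sleep(0)
    return f"done {url}"

async def main(urls: List[str], max_concurrency: int = 4) -> List[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # Only max_concurrency tasks run the body at any one time
        async with semaphore:
            return await fetch_and_upload_file(url)

    # gather preserves the input order of the results
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(main([f"https://example.com/{i}.txt" for i in range(8)]))
print(results)
```

Wrapping each task in a semaphore-guarded coroutine keeps the fan-out pattern intact while bounding concurrency.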

Google Colab Notebook

Here's a Colab notebook with different upload scenarios you can test: Upload files with SDK in Google Colab.