Upload Files with Python
Use this method if you have many files to upload or want to upload files with metadata.
About This Task
Uploading files using the Python methods included in the SDK is asynchronous and uses sessions under the hood. It's best for uploading large numbers of files with metadata. To upload files using this method, create and run a Python script. You can find example scripts at the bottom of this page.
To learn more, see also Synchronous and asynchronous upload and Working with Metadata.
Sessions
Asynchronous upload uses the mechanism of sessions to upload your files to deepset Cloud. A session stores the ingestion status of the files: the number of failed and finished files. Each session has an ID so you can check its details anytime.
A session starts when you initiate the upload. For the SDK, a session opens when you call the upload method or command and closes when the upload finishes. A session expires after 24 hours. You can have a maximum of 10 open sessions.
When using the SDK, you don't have to manage sessions yourself; the SDK opens and closes them for you. They're there if you want to check the status of your past and current uploads.
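If you want to inspect your sessions from Python, here's a minimal sketch, assuming your SDK version exports a list_upload_sessions helper from the sync client (check the SDK reference for the exact signature and return type):
from deepset_cloud_sdk.workflows.sync_client.files import list_upload_sessions

# Iterate over batches of upload session details for a workspace.
# The exact fields on each session object depend on your SDK version,
# but each session includes an ID and its ingestion status.
for batch in list_upload_sessions(
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
):
    for session in batch:
        print(session)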
Folder Structure
You don't need to follow any specific folder structure. If your folder contains multiple files with the same name, they're all uploaded by default. You can set the write mode to overwrite such duplicates, keep them all, or fail the upload, as shown in the sketch below.
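The upload methods accept a write_mode argument (used later on this page with upload_texts and upload_bytes). A minimal sketch, assuming the WriteMode enum exposes KEEP, OVERWRITE, and FAIL to match the three behaviors described above:
from pathlib import Path
from deepset_cloud_sdk.workflows.sync_client.files import upload, WriteMode

upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    # KEEP keeps duplicate files; WriteMode.OVERWRITE replaces them,
    # WriteMode.FAIL aborts the upload if a duplicate exists
    write_mode=WriteMode.KEEP,
)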
File Extensions
Make sure your files have lowercase extensions, for example, my_file.pdf, instead of my_file.PDF. The SDK doesn't upload files with uppercase extensions.
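If some of your files have uppercase extensions, you can normalize them before uploading. A minimal sketch using only the standard library:
from pathlib import Path

folder = Path("<your_path_to_the_upload_folder>")
for path in folder.rglob("*"):
    # Rename my_file.PDF to my_file.pdf so the SDK picks it up
    if path.is_file() and path.suffix != path.suffix.lower():
        path.rename(path.with_suffix(path.suffix.lower()))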
Prerequisites
- Install the SDK
- Generate an API Key to connect to a deepset Cloud workspace.
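The examples on this page pass the API key directly. To avoid hardcoding it in scripts, you can read it from an environment variable instead. A minimal sketch, where DEEPSET_CLOUD_API_KEY is a hypothetical variable name of your choosing:
import os
from pathlib import Path
from deepset_cloud_sdk.workflows.sync_client.files import upload

# DEEPSET_CLOUD_API_KEY is a hypothetical name; use whatever your environment defines
api_key = os.environ["DEEPSET_CLOUD_API_KEY"]

upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key=api_key,
    workspace_name="<default_workspace>",
)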
Upload Script Examples
Upload From a Folder
Here are examples of a synchronous and an asynchronous way to upload files from a folder.
Note: When using Jupyter notebooks, apply nest_asyncio before using the SDK. You can install it with pip install nest-asyncio.
import nest_asyncio
nest_asyncio.apply()
from pathlib import Path
from deepset_cloud_sdk.workflows.sync_client.files import upload

# Uploads all files from a given path
upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    blocking=True,  # waits until the files are displayed in deepset Cloud,
    # this may take a couple of minutes
    timeout_s=300,  # the timeout for the `blocking` parameter, in seconds
    show_progress=True,  # shows the progress bar
    recursive=True,  # uploads files from all subfolders as well
)
import asyncio
from pathlib import Path
from deepset_cloud_sdk.workflows.async_client.files import upload

# Uploads all files from a given path
async def my_async_context() -> None:
    await upload(
        paths=[Path("<your_path_to_the_upload_folder>")],
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",
        blocking=True,  # waits until the files are displayed in deepset Cloud,
        # this may take a couple of minutes
        timeout_s=300,  # the timeout for the `blocking` parameter, in seconds
        show_progress=True,  # shows the progress bar
        recursive=True,  # uploads files from all subfolders as well
    )

# Run the async function
if __name__ == "__main__":
    asyncio.run(my_async_context())
Upload Bytes
You can upload files as bytes to a deepset Cloud workspace. This method is suitable for all file types. Here are examples of a synchronous and an asynchronous way to do this:
from deepset_cloud_sdk.workflows.sync_client.files import upload_bytes, DeepsetCloudFileBytes

upload_bytes(
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
    files=[
        DeepsetCloudFileBytes(
            name="example.txt",
            file_bytes=b"this is text",
            meta={"key": "value"},  # optional
        )
    ],
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
)
import asyncio
from deepset_cloud_sdk.workflows.async_client.files import upload_bytes, DeepsetCloudFileBytes

async def my_async_context() -> None:
    await upload_bytes(
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[
            DeepsetCloudFileBytes(
                name="example.txt",
                file_bytes=b"this is some byte text",
                meta={"key": "value"},  # optional
            )
        ],
        blocking=True,  # optional, by default True
        timeout_s=300,  # optional, by default 300
    )

# Run the async function
if __name__ == "__main__":
    asyncio.run(my_async_context())
Synchronize GitHub Files with deepset Cloud
Here's an example script that loads TXT and MD files from GitHub and sends them to deepset Cloud. It fetches the file contents as text and forwards them to deepset Cloud without writing them to disk first.
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key here
API_KEY: str = "<YOUR-API-KEY>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_prepare_files(urls: List[str]) -> List[DeepsetCloudFile]:
    """Fetches files from URLs and converts them to DeepsetCloudFile objects.

    These objects can be uploaded to deepset Cloud directly without
    having to copy them to disk first.

    :param urls: List of URLs to fetch files from
    :return: List of DeepsetCloudFile objects
    """
    files_to_upload: List[DeepsetCloudFile] = []
    for url in urls:
        response = httpx.get(url)
        response.raise_for_status()
        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        files_to_upload.append(file)
    return files_to_upload


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

files = fetch_and_prepare_files(DOWNLOAD_URLS)

# Upload .txt and .md files to deepset Cloud
upload_texts(
    workspace_name="upload-test-123",  # optional, uses the "DEFAULT_WORKSPACE_NAME" environment variable by default
    files=files,
    blocking=False,  # set to False for non-blocking uploads
    timeout_s=300,  # optional, default is 300 seconds
    show_progress=True,  # optional, default is True
    api_key=API_KEY,
    write_mode=WriteMode.OVERWRITE,
)
Download Files from a URL and Upload to deepset Cloud
Using Multiprocessing
This script downloads TXT and MD files from the URLs you specify and uploads them to deepset Cloud using a pool of worker processes. Note that the maximum concurrency (processes in multiprocessing.Pool(processes=3)) is limited by the number of cores in your system. For maximum utilization, you can use multiprocessing.cpu_count() to set the number of processes.
import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly without
    having to copy it to disk first.

    :param url: URL to fetch the file from
    """
    response = httpx.get(url)
    response.raise_for_status()
    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )
    upload_texts(
        workspace_name=WORKSPACE,  # optional, uses the "DEFAULT_WORKSPACE_NAME" environment variable by default
        files=[file],
        blocking=False,  # set to False for non-blocking uploads
        timeout_s=300,  # optional, default is 300 seconds
        show_progress=True,  # optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

if __name__ == "__main__":
    # Upload .txt and .md files to deepset Cloud
    # Start one worker process per URL to download and upload the files
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
Async
This example downloads files from the URLs you specify and then uploads them asynchronously to deepset Cloud:
import asyncio
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, DeepsetCloudFile
from deepset_cloud_sdk.workflows.async_client.files import upload_texts

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


async def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly without
    having to copy it to disk first.

    :param url: URL to fetch the file from
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )
    await upload_texts(
        workspace_name=WORKSPACE,  # optional, uses the "DEFAULT_WORKSPACE_NAME" environment variable by default
        files=[file],
        blocking=False,  # set to False for non-blocking uploads
        timeout_s=300,  # optional, default is 300 seconds
        show_progress=True,  # optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


async def main(urls: List[str]) -> None:
    """Main function to run the asynchronous fetching and uploading of files."""
    tasks = [fetch_and_upload_file(url) for url in urls]
    await asyncio.gather(*tasks)


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

# Run the main function
if __name__ == "__main__":
    asyncio.run(main(DOWNLOAD_URLS))
From Memory in Byte Format
Here's an example of how to fetch a PDF file from a given URL, convert it to bytes, and then upload it to deepset Cloud:
import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import (
    DeepsetCloudFileBytes,
    WriteMode,
    upload_bytes,
)

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"


def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename


def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFileBytes object.

    That object can be uploaded to deepset Cloud directly without
    having to copy it to disk first.

    :param url: URL to fetch the file from
    """
    response = httpx.get(url)
    response.raise_for_status()
    file = DeepsetCloudFileBytes(
        file_bytes=response.content,
        name=_parse_filename(url),
        meta={"url": url},
    )
    upload_bytes(
        workspace_name=WORKSPACE,  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[file],
        blocking=False,  # by default True; setting it to False lets you immediately
        # continue uploading another batch of files
        timeout_s=300,  # optional, by default 300
        show_progress=True,  # optional, by default True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )


# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

if __name__ == "__main__":
    # Upload .txt and .pdf files to deepset Cloud
    # Start one worker process per URL to download and upload the files
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
Google Colab Notebook
Here's a Colab notebook with different upload scenarios you can test: Upload files with SDK in Google Colab.