Upload Files with Python
Use this method if you have many files to upload or want to upload files with metadata.
About This Task
Uploading files using the Python methods included in the SDK is asynchronous and uses sessions under the hood. It's best for uploading large numbers of files with metadata. To upload files using this method, create and run a Python script. You can find example scripts at the bottom of this page.
To learn more, see also Synchronous and asynchronous upload and Working with Metadata.
Sessions
Asynchronous upload uses the mechanism of sessions to upload your files to deepset Cloud. A session stores the ingestion status of the files: the number of failed and finished files. Each session has an ID so you can check its details anytime.
A session starts when you initiate the upload. With the SDK, a session opens when you call the upload method or command and closes when the upload is finished. A session expires after 24 hours. You can have a maximum of 10 open sessions.
When using the SDK, you don't have to worry about sessions, as the SDK opens and closes them for you. They're there in case you want to check the status of your past and current uploads.
Folder Structure
You don't need to follow any specific folder structure. If your folder contains multiple files with the same name, all of them are uploaded by default. You can set the write mode to overwrite the duplicates, keep them all, or fail the upload.
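The effect of each write mode can be sketched conceptually. This is an illustration of the semantics described above, not the SDK's implementation: the `KEEP` and `FAIL` member names and the duplicate-naming scheme are assumptions; only `OVERWRITE` appears verbatim in the scripts on this page.

```python
from enum import Enum

class WriteMode(Enum):
    # Member names assumed to mirror the SDK's WriteMode enum.
    KEEP = "KEEP"            # upload duplicates alongside existing files
    OVERWRITE = "OVERWRITE"  # replace files that share a name
    FAIL = "FAIL"            # abort the upload on a name collision

def apply_write_mode(workspace: dict, name: str, text: str, mode: WriteMode) -> None:
    """Conceptual sketch of what each mode means when a file name already exists."""
    if name not in workspace:
        workspace[name] = text  # no collision, upload proceeds in every mode
        return
    if mode is WriteMode.FAIL:
        raise FileExistsError(f"{name} already exists in the workspace")
    if mode is WriteMode.OVERWRITE:
        workspace[name] = text  # replace the existing file
    else:
        # KEEP: store the duplicate under a distinct name; the suffix here is
        # illustrative only, not how deepset Cloud actually names duplicates.
        workspace[f"{name} (1)"] = text

ws = {"report.txt": "v1"}
apply_write_mode(ws, "report.txt", "v2", WriteMode.OVERWRITE)
print(ws["report.txt"])  # v2
```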
Prerequisites
- Install the SDK
- Generate an API Key to connect to a deepset Cloud workspace.
Upload Script Examples
Upload Files From a Folder
Here's an example script using the synchronous client:
```python
from pathlib import Path

from deepset_cloud_sdk.workflows.sync_client.files import upload

# Uploads all files from a given path.
upload(
    paths=[Path("<your_path_to_the_upload_folder>")],
    api_key="<deepsetCloud_API_key>",
    workspace_name="<default_workspace>",
    blocking=True,  # waits until the files are displayed in deepset Cloud,
    # this may take a couple of minutes
    timeout_s=300,  # the timeout for the `blocking` parameter in seconds
    show_progress=True,  # shows the progress bar
    recursive=True,  # uploads files from all subfolders as well
)
```
Here's the same upload using the asynchronous client:

```python
from pathlib import Path

from deepset_cloud_sdk.workflows.async_client.files import upload

async def my_async_context() -> None:
    await upload(
        paths=[Path("<your_path_to_the_upload_folder>")],
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",
        blocking=True,  # waits until the files are displayed in deepset Cloud,
        # this may take a couple of minutes
        timeout_s=300,  # the timeout for the `blocking` parameter in seconds
        show_progress=True,  # shows the progress bar
        recursive=True,  # uploads files from all subfolders as well
    )
```
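The asynchronous variant only defines a coroutine; nothing runs until you hand it to an event loop, typically with `asyncio.run(my_async_context())` at the bottom of your script. Here's a minimal runnable sketch of that pattern, with a stand-in coroutine instead of the real upload call:

```python
import asyncio

async def my_async_context() -> str:
    # Stand-in for `await upload(...)` from the example above.
    await asyncio.sleep(0)
    return "upload finished"

# In the real script you'd call: asyncio.run(my_async_context())
result = asyncio.run(my_async_context())
print(result)  # upload finished
```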
Upload Texts
You can upload raw text to a deepset Cloud workspace, like this:
```python
from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, DeepsetCloudFile

upload_texts(
    workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
    files=[
        DeepsetCloudFile(
            name="example.txt",
            text="this is text",
            meta={"key": "value"},  # optional
        )
    ],
    blocking=True,  # optional, by default True
    timeout_s=300,  # optional, by default 300
)
```
Here's the asynchronous version:

```python
from deepset_cloud_sdk.workflows.async_client.files import upload_texts, DeepsetCloudFile

async def my_async_context() -> None:
    await upload_texts(
        workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[
            DeepsetCloudFile(
                name="example.txt",
                text="this is text",
                meta={"key": "value"},  # optional
            )
        ],
        blocking=True,  # optional, by default True
        timeout_s=300,  # optional, by default 300
    )
```
Synchronize GitHub Files with deepset Cloud
Here's an example script that loads TXT and PDF files from GitHub and sends them to deepset Cloud. It fetches the file contents as text and forwards them as `DeepsetCloudFile` objects.
```python
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key here
API_KEY: str = "<YOUR-API-KEY>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

def fetch_and_prepare_files(urls: List[str]) -> List[DeepsetCloudFile]:
    """Fetches files from URLs and converts them to DeepsetCloudFile objects.

    These objects can be uploaded to deepset Cloud directly without
    having to first copy them to disk.

    :param urls: List of URLs to fetch files from
    :return: List of DeepsetCloudFile objects
    """
    files_to_upload: List[DeepsetCloudFile] = []
    for url in urls:
        response = httpx.get(url)
        response.raise_for_status()
        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        files_to_upload.append(file)
    return files_to_upload

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

files = fetch_and_prepare_files(DOWNLOAD_URLS)

# Upload the .txt and .pdf files to deepset Cloud
upload_texts(
    workspace_name="upload-test-123",  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
    files=files,
    blocking=False,  # set to False for non-blocking uploads
    timeout_s=300,  # optional, default is 300 seconds
    show_progress=True,  # optional, default is True
    api_key=API_KEY,
    write_mode=WriteMode.OVERWRITE,
)
```
Download Files from a URL and Upload Them to deepset Cloud with Threading
This script downloads TXT files from the URLs you specify and then uploads them to deepset Cloud using a pool of worker processes. Note that the maximum number of workers (`processes` in `multiprocessing.Pool(processes=4)`) is 10.
```python
import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly without
    having to first copy it to disk.

    :param url: URL to fetch the file from
    """
    response = httpx.get(url)
    response.raise_for_status()
    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )
    upload_texts(
        workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
        files=[file],
        blocking=False,  # set to False for non-blocking uploads
        timeout_s=300,  # optional, default is 300 seconds
        show_progress=True,  # optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

# Upload the files to deepset Cloud; one worker process per URL
# downloads and uploads its file
with multiprocessing.Pool(processes=4) as pool:  # the maximum number of processes is 10
    results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
```
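Because the work here is I/O-bound (downloading and uploading), a thread pool is a lighter-weight alternative to worker processes under the same cap of 10. This is a hedged sketch of that pattern, reusing the filename-parsing logic on example URLs with the network calls left out:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Hypothetical example URLs, not real files.
URLS = [
    "https://example.com/files/a.txt",
    "https://example.com/files/b.pdf",
]

def parse_filename(url: str) -> str:
    # Same logic as _parse_filename in the script above.
    return urlparse(url).path.split("/")[-1]

# Cap workers at 10, matching the concurrency limit noted above.
with ThreadPoolExecutor(max_workers=min(10, len(URLS))) as pool:
    names = list(pool.map(parse_filename, URLS))

print(names)  # ['a.txt', 'b.pdf']
```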
Download Files from a URL and Upload Them Async
This example downloads files from the URLs you specify and then uploads them asynchronously to deepset Cloud:
```python
import asyncio
from typing import List
from urllib.parse import urlparse

import httpx

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, DeepsetCloudFile
from deepset_cloud_sdk.workflows.async_client.files import upload_texts

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

async def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL and converts it to a DeepsetCloudFile object.

    That object can be uploaded to deepset Cloud directly without
    having to first copy it to disk.

    :param url: URL to fetch the file from
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        await upload_texts(
            workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
            files=[file],
            blocking=False,  # set to False for non-blocking uploads
            timeout_s=300,  # optional, default is 300 seconds
            show_progress=True,  # optional, default is True
            api_key=API_KEY,
            write_mode=WriteMode.OVERWRITE,
        )

async def main(urls: List[str]) -> None:
    """Main function to run the asynchronous fetching and uploading of files."""
    tasks = [fetch_and_upload_file(url) for url in urls]
    await asyncio.gather(*tasks)

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

# Run the main function
if __name__ == "__main__":
    asyncio.run(main(DOWNLOAD_URLS))
```
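`asyncio.gather` launches every download at once; with a long URL list you may want to cap the number of in-flight tasks, for example at the limit of 10 mentioned for the pooled example. One way to do that is an `asyncio.Semaphore`. This is a sketch with a stub worker standing in for the httpx download and `upload_texts` call:

```python
import asyncio
from typing import List

async def fetch_and_upload_file(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most 10 coroutines pass this point at a time
        await asyncio.sleep(0)  # stand-in for the real download and upload
        return url

async def main(urls: List[str]) -> List[str]:
    sem = asyncio.Semaphore(10)  # cap concurrent workers at 10
    return await asyncio.gather(*(fetch_and_upload_file(u, sem) for u in urls))

# Hypothetical URLs for illustration only.
urls = [f"https://example.com/file{i}.txt" for i in range(25)]
done = asyncio.run(main(urls))
print(len(done))  # 25
```

`asyncio.gather` preserves input order, so `done` lines up with `urls` even though the tasks complete in batches.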
Google Colab Notebook
Here's a Colab notebook with different upload scenarios you can test: Upload files with SDK in Google Colab.