Data Synchronization
Data keeps changing, and you want your search to always run on the latest data. Here are some tips on how to keep your search up to date.
To find an answer to a query, your pipeline searches the files you uploaded to deepset Cloud. When you deploy the pipeline, the files are indexed and ready to be searched. However, data is rarely static, and it's only natural that your files change: they get added, updated, or deleted. Your app should always be in sync with your file storage.
Syncing Your Data
The easiest way to ensure your search runs on the latest version of your files is to create a script that periodically syncs your data with deepset Cloud. deepset Cloud provides endpoints for uploading and deleting files, so you can easily remove outdated files and add new ones.
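Here's a minimal sketch of such a periodic sync. It assumes the sync logic from the example script further down this page is wrapped in a function called sync_files() (a hypothetical name) and that syncing once per hour is frequent enough for your data:

import time

def sync_files():
    # Placeholder: put the upload and delete logic from the example script below here.
    ...

SYNC_INTERVAL_SECONDS = 3600  # hypothetical interval: sync once per hour

while True:
    sync_files()
    time.sleep(SYNC_INTERVAL_SECONDS)

Instead of keeping a Python process running, you can also run the sync script as a scheduled job, for example with cron or your CI system.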
Coming Soon: Updating Files
Currently, there is no endpoint for updating files, so if a file that's already in deepset Cloud is modified, you must delete it and upload the latest version.
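For example, here's a minimal sketch of that workaround using the same Haystack DeepsetCloud file client as in the example script below. The file name my_file.txt is just an illustration:

from pathlib import Path

from haystack.utils import DeepsetCloud

file_client = DeepsetCloud.get_file_client()

# Delete the outdated version of the file from deepset Cloud
for file in file_client.list_files():
    if file["name"] == "my_file.txt":
        file_client.delete_file(file["file_id"])

# Upload the latest local version
file_client.upload_files([Path("my_file.txt")])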
An example script for synchronizing data
from pathlib import Path
import datetime

from haystack.utils import DeepsetCloud

# Specify where your files are stored locally
doc_dir = Path("../../test/samples/docs")
files_offline = {f.name: f for f in doc_dir.iterdir()}

# Initiate the deepset Cloud file client
file_client = DeepsetCloud.get_file_client()

# List all your files in deepset Cloud
files_online = {f["name"]: f for f in file_client.list_files()}

files_to_upload = []
files_to_delete = []

# Check if there are any files in your local directory that are not in deepset Cloud yet
for file_offline in files_offline.values():
    if file_offline.name not in files_online:
        # Add new files to the upload list
        files_to_upload.append(file_offline)
    else:
        # Check if the files that exist in both places are up to date
        datetime_online = datetime.datetime.fromisoformat(files_online[file_offline.name]["created_at"])
        datetime_offline = datetime.datetime.fromtimestamp(file_offline.stat().st_mtime).astimezone()  # consider using st_ctime on Windows
        # If the local file is newer, add it to the upload list
        # and add the outdated version in deepset Cloud to the delete list
        if datetime_offline > datetime_online:
            files_to_upload.append(file_offline)
            files_to_delete.append(files_online[file_offline.name]["file_id"])

# If you also want to delete files that no longer exist locally, uncomment the following lines:
# for file_online_name, file_online in files_online.items():
#     if file_online_name not in files_offline:
#         files_to_delete.append(file_online["file_id"])

# Upload new files to deepset Cloud
# and delete outdated files from deepset Cloud
file_client.upload_files(files_to_upload)
for file_to_delete in files_to_delete:
    file_client.delete_file(file_to_delete)
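Note that upload_files expects a list of local file paths, while delete_file expects the file_id that deepset Cloud assigned to the file. That's why the script collects paths for uploads and IDs for deletions.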
Indexing
Whenever you add a file, it's automatically indexed. You don't have to redeploy the pipeline to trigger indexing.