Data Synchronization

Data keeps changing, and you want your search to always run on the latest version. Here are some tips on how to keep it in sync.

To answer a query, your pipeline searches the files that you uploaded to deepset Cloud. When you deploy the pipeline, the files are indexed and become searchable. However, data is rarely static, and it's only natural that your files change: they get added, updated, or deleted. Your app should always stay in sync with your file storage.

Syncing Your Data

The easiest way to ensure your search runs on the latest version of your files is to create a script that periodically syncs your data with deepset Cloud. deepset Cloud provides endpoints for uploading and deleting files, so you can easily remove outdated files and add new ones.
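
One simple way to make the sync periodic is a loop that calls your sync routine at a fixed interval. The sketch below is only an illustration: `run_periodically`, `sync_once`, and `max_runs` are hypothetical names that are not part of deepset Cloud, and in production you may prefer an external scheduler such as cron.

```python
import time
from typing import Callable, Optional


def run_periodically(sync_once: Callable[[], None],
                     interval_seconds: float,
                     max_runs: Optional[int] = None) -> int:
    """Call sync_once repeatedly, sleeping interval_seconds between runs.

    Stops after max_runs attempts, or loops forever when max_runs is None.
    Returns the number of attempted runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            sync_once()
        except Exception as error:
            # Keep the loop alive if a single run fails
            print(f"Sync failed, will retry on the next run: {error}")
        runs += 1
        # Sleep only if another run is going to follow
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs
```

For example, `run_periodically(my_sync, interval_seconds=3600)` would call a hypothetical `my_sync` function once an hour until the process is stopped.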

🚧

Coming Soon: Updating Files

Currently, there is no endpoint for updating files, so if a file that's already in deepset Cloud is modified, you must delete it and upload the latest version.
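
The delete-then-upload workaround can be wrapped in a small helper. This is a minimal sketch, assuming a file client that exposes `delete_file(file_id)` and `upload_files(paths)` as used in the example script on this page; the helper name `replace_file` is hypothetical.

```python
from pathlib import Path
from typing import Any


def replace_file(file_client: Any, file_id: str, local_path: Path) -> None:
    """Replace a file in deepset Cloud by deleting it and uploading the new version.

    Assumes file_client exposes delete_file(file_id) and upload_files(paths).
    """
    if not local_path.is_file():
        # Refuse to delete the online copy if there is nothing to upload in its place
        raise FileNotFoundError(f"No local file at {local_path}")
    file_client.delete_file(file_id)
    file_client.upload_files([local_path])
```

Checking the local path first avoids deleting the online copy when there is no replacement. If the upload fails after the delete, the file is missing from deepset Cloud until the next successful sync, so you may want to log such failures.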

An example script for synchronizing data
from pathlib import Path
import datetime

from haystack.utils import DeepsetCloud

# Specify where your files are stored locally
local_dir = Path("../../test/samples/docs")
# Only consider regular files; skip subdirectories
files_offline = {f.name: f for f in local_dir.iterdir() if f.is_file()}

# Initiate the deepset Cloud file client
file_client = DeepsetCloud.get_file_client()

# List all your files in deepset Cloud
files_online = {f["name"]: f for f in file_client.list_files()}

files_to_upload = []
files_to_delete = []

# Check if there are any files in your local directory that are not in deepset Cloud
for file_offline in files_offline.values():
    if file_offline.name not in files_online:
        # Add new files to the upload list
        files_to_upload.append(file_offline)
    else:
        # The file exists in both places, so compare timestamps.
        # Note: comparing a timezone-aware datetime with a naive one raises a TypeError,
        # so this assumes "created_at" includes a UTC offset.
        datetime_online = datetime.datetime.fromisoformat(files_online[file_offline.name]["created_at"])
        datetime_offline = datetime.datetime.fromtimestamp(file_offline.stat().st_mtime).astimezone() # consider using st_ctime on Windows
        # If the local file is newer than the copy in deepset Cloud,
        # upload it again and delete the outdated online copy
        if datetime_offline > datetime_online:
            files_to_upload.append(file_offline)
            files_to_delete.append(files_online[file_offline.name]["file_id"])

# If you want to delete files from deepset Cloud that no longer exist locally, uncomment the following lines:
# for file_online_name, file_online in files_online.items():
#     if file_online_name not in files_offline:
#         files_to_delete.append(file_online["file_id"])


# Upload new files to deepset Cloud
# and delete old files from deepset Cloud
file_client.upload_files(files_to_upload)
for file_to_delete in files_to_delete:
    file_client.delete_file(file_to_delete)
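
Before running the script against real data, it can help to compute the upload and delete lists without touching any files. The following is a sketch of the same comparison logic as a pure function. The name `plan_sync` and its parameters are hypothetical, and treating timestamps without a UTC offset as UTC is an assumption, not documented deepset Cloud behavior.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def plan_sync(local_mtimes: Dict[str, datetime],
              online_files: Dict[str, dict],
              delete_missing: bool = False) -> Tuple[List[str], List[str]]:
    """Return (names_to_upload, file_ids_to_delete) without changing any files.

    local_mtimes maps file name -> timezone-aware modification time.
    online_files maps file name -> {"file_id": ..., "created_at": ISO 8601 string},
    mirroring the fields the script above reads from list_files().
    """
    to_upload: List[str] = []
    to_delete: List[str] = []
    for name, modified in local_mtimes.items():
        if name not in online_files:
            # New local file: upload only
            to_upload.append(name)
            continue
        created = datetime.fromisoformat(online_files[name]["created_at"])
        if created.tzinfo is None:
            # Assumption: timestamps without an offset are UTC
            created = created.replace(tzinfo=timezone.utc)
        if modified > created:
            # Local file is newer: upload it and delete the outdated online copy
            to_upload.append(name)
            to_delete.append(online_files[name]["file_id"])
    if delete_missing:
        # Optionally remove online files that no longer exist locally
        for name, info in online_files.items():
            if name not in local_mtimes:
                to_delete.append(info["file_id"])
    return to_upload, to_delete
```

Printing the two returned lists gives you a dry run: you can review what would be uploaded and deleted before passing them to the file client.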

Indexing

Whenever you add a file, it's indexed automatically. You don't have to redeploy the pipeline to trigger indexing.

