Let's look at how data flows through deepset Cloud and where it's stored at each workflow step.
Let's start with what happens to the files when you upload them to deepset Cloud and where they are stored. This differs a bit depending on whether you're uploading synchronously or asynchronously. It's also different if you connect storage from your own virtual private cloud (VPC), such as an AWS S3 bucket. Let's look at all these scenarios.
All files uploaded to deepset Cloud are eventually stored in the deepset Cloud AWS S3 bucket. There are two methods for uploading files: synchronous and asynchronous. When you upload synchronously, there's an additional step where the files go through the deepset Cloud main API service before they're sent to S3. When you upload asynchronously, using sessions, your files go directly to the S3 bucket.
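To make the two upload paths concrete, here is a minimal Python sketch. It is only an illustration of the routing described above, not deepset Cloud code: the `S3Bucket` class and the `upload` function are hypothetical in-memory stand-ins, and no real API is called.

```python
from dataclasses import dataclass, field


@dataclass
class S3Bucket:
    """Hypothetical in-memory stand-in for the deepset Cloud S3 bucket."""
    objects: dict = field(default_factory=dict)

    def put(self, name: str, content: bytes) -> None:
        self.objects[name] = content


def upload(name: str, content: bytes, bucket: S3Bucket, synchronous: bool) -> list:
    """Store a file and return the hops it passed through on the way."""
    hops = ["client"]
    if synchronous:
        # Synchronous uploads take an extra step through the main API service.
        hops.append("main API service")
    # Both paths end in the same bucket.
    bucket.put(name, content)
    hops.append("S3 bucket")
    return hops


bucket = S3Bucket()
sync_path = upload("a.txt", b"hello", bucket, synchronous=True)
async_path = upload("b.txt", b"world", bucket, synchronous=False)
print(" -> ".join(sync_path))
print(" -> ".join(async_path))
```

The only difference between the two calls is the extra hop. The final storage location is the same.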
deepset Cloud is also connected to a SQL database. This database stores information about files, such as the file name, the file ID, and the time the file was created. It does not store the contents of the files.
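As a rough illustration of what such a metadata record might look like, here is a small Python sketch. The field names, the UUID format of the ID, and the `make_record` helper are assumptions made for this example. The actual database schema is not described here.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata row: describes a file, never holds its contents."""
    file_id: str
    name: str
    created_at: datetime


def make_record(name: str) -> FileRecord:
    # The ID format (a UUID string) is an assumption for illustration only.
    return FileRecord(
        file_id=str(uuid4()),
        name=name,
        created_at=datetime.now(timezone.utc),
    )


record = make_record("report.pdf")
column_names = [f.name for f in fields(record)]
print(column_names)
```

Note that the record has no field for the file contents. Those live in the file storage only.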
When you store files in your own VPC resources, such as an AWS S3 bucket or an OpenSearch cluster, they remain in your storage at all times. You authorize deepset Cloud to communicate with the file storage when it needs to index the files. We'll cover that in Deploying Pipelines in more detail.
When you deploy a pipeline, it triggers indexing. Indexing means your files are preprocessed, chunked into pieces of raw text called Documents, and stored in the DeepsetCloudDocumentStore, which is an OpenSearch document database.
After a pipeline is deployed and it's time for indexing:
- deepset Cloud fetches the names of files to index from the database.
- It then communicates these file names to the indexing pipeline.
- The indexing pipeline fetches the actual files from the data storage and starts indexing. During indexing, the files are temporarily stored in the deepset Cloud indexing service.
- The indexing pipeline preprocesses the files and sends the resulting documents to the DeepsetCloudDocumentStore (OpenSearch). After that, the files are deleted from the temporary location.
The process is the same regardless of whether you store your files in deepset Cloud or in a private AWS S3 bucket. If you want your data to stay in your accounts only, we recommend connecting both a private AWS S3 bucket and a private OpenSearch cluster. Otherwise, the documents, which are chunks of your files, are still stored in the deepset Cloud OpenSearch.
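The indexing steps listed above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual indexing service: the database, the file storage, and the document store are plain Python lists and dictionaries, the `chunk` and `run_indexing` functions are hypothetical, and chunking by a fixed number of words stands in for real preprocessing.

```python
import tempfile
from pathlib import Path


def chunk(text: str, size: int) -> list:
    """Split text into pieces of at most `size` words (toy preprocessing)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def run_indexing(file_names, file_storage, document_store, chunk_size=5):
    """Index the named files and return the temporary location that was used."""
    # The temporary directory plays the role of the indexing service's
    # temporary storage. It is deleted when the `with` block exits.
    with tempfile.TemporaryDirectory() as tmp:
        temp_dir = Path(tmp)
        for name in file_names:
            # Fetch the actual file from the data storage.
            local_copy = temp_dir / name
            local_copy.write_bytes(file_storage[name])
            # Preprocess it and send the resulting documents to the store.
            text = local_copy.read_text(encoding="utf-8")
            for i, piece in enumerate(chunk(text, chunk_size)):
                document_store.append({"file": name, "chunk": i, "content": piece})
    return temp_dir


# Stand-ins: file names from the database, files in storage, an empty store.
file_names = ["notes.txt", "faq.txt"]
file_storage = {
    "notes.txt": b"one two three four five six seven",
    "faq.txt": b"alpha beta",
}
document_store = []

temp_dir = run_indexing(file_names, file_storage, document_store)
print(len(document_store), "documents indexed")
print("temporary copy still present:", temp_dir.exists())
```

Because `file_storage` is just a parameter, the same function works whether it represents deepset Cloud storage or a private bucket, which mirrors the point that the process is the same in both cases.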
Let's have a look at what happens at search time. Your query goes to the query pipeline. The query pipeline (and, more specifically, the Retriever node) connects with the DeepsetCloudDocumentStore and fetches the documents that match the query.
If you're using your private OpenSearch cluster, you authorize deepset Cloud to connect to it at query time. The query pipeline then reaches out to your OpenSearch cluster to fetch the documents from there.
These documents are held in memory only temporarily; they're not saved anywhere. If it's a question answering pipeline, the Retriever passes the documents on to the Reader or Generator, which comes up with the final answer based on them.
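The query-time flow can be sketched in the same spirit. The code below is an assumption-laden toy, not the real Retriever or Reader: `retrieve` ranks documents by simple keyword overlap, `read` just returns the best document's text, and the document store is a plain list.

```python
def retrieve(query: str, document_store: list, top_k: int = 2) -> list:
    """Toy Retriever: rank documents by the number of shared words."""
    terms = set(query.lower().split())
    scored = []
    for doc in document_store:
        overlap = len(terms & set(doc["content"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def read(query: str, documents: list):
    """Toy Reader: answer with the content of the best-matching document."""
    return documents[0]["content"] if documents else None


def run_query(query: str, document_store: list):
    # The fetched documents exist only in this local variable.
    # Nothing is written back to the document store or saved elsewhere.
    documents = retrieve(query, document_store)
    return read(query, documents)


document_store = [
    {"content": "Files are stored in S3"},
    {"content": "Documents live in OpenSearch"},
    {"content": "Pipelines are deployed from the UI"},
]
answer = run_query("opensearch documents", document_store)
print(answer)
```

Swapping the list for a private OpenSearch cluster would change only where `retrieve` reads from. The rest of the flow stays the same.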
The results of the query are stored in the deepset Cloud SQL database. The database is protected, and only a select few deepset employees can access it.