Data Flow in deepset Cloud

Explore the journey of data in deepset Cloud, from the moment you upload your files or connect your data storage, through processing, to output.

Let's look at how data flows through deepset Cloud and where it's stored at each workflow step.

Where Is My Data?

Let's start with what happens to your files when you upload them to deepset Cloud and where they are stored. This differs slightly depending on whether you upload synchronously or asynchronously. It's also different if you connect your own storage in a virtual private cloud (VPC), such as an AWS S3 bucket. Let's look at each of these scenarios.

Uploading Files

All files uploaded to deepset Cloud are eventually stored in the deepset Cloud AWS S3 bucket. There are two methods for uploading files: synchronous and asynchronous. When you upload synchronously, there's an additional step where the files go through the deepset Cloud main API service before they're sent to S3. When you upload asynchronously, using sessions, your files go directly to the S3 bucket.

deepset Cloud is also connected to a SQL database. This database stores information about files, such as the file name, the file ID, and when the file was created. It does not store the contents of the files.
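
To make the two upload routes and the metadata-only database concrete, here is a minimal in-memory sketch. It is a toy model, not the deepset Cloud API: the `Platform` and `FileRecord` classes, the `upload` method, and the route labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class FileRecord:
    """Metadata row only: the file contents never go into the SQL database."""
    file_id: str
    name: str
    created_at: datetime


@dataclass
class Platform:
    """Toy stand-in (hypothetical) for the S3 bucket and the SQL database."""
    s3_bucket: dict = field(default_factory=dict)  # file_id -> file contents
    sql_files: dict = field(default_factory=dict)  # file_id -> FileRecord

    def upload(self, name: str, content: bytes, synchronous: bool) -> list:
        # Synchronous uploads take one extra hop through the main API service.
        route = ["client"]
        if synchronous:
            route.append("main API service")
        route.append("S3 bucket")

        file_id = str(uuid4())
        self.s3_bucket[file_id] = content  # contents live in S3 only
        self.sql_files[file_id] = FileRecord(
            file_id, name, datetime.now(timezone.utc)
        )
        return route


platform = Platform()
sync_route = platform.upload("report.txt", b"quarterly numbers", synchronous=True)
async_route = platform.upload("notes.txt", b"meeting notes", synchronous=False)
print(sync_route)
print(async_route)
```

Both routes end in the same bucket. The only difference in this sketch is the extra hop for synchronous uploads, and the database rows carry names, IDs, and timestamps but no file contents.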

A representation of files with an arrow going through deepset Cloud to an AWS S3 bucket, labeled as synchronous upload. Another arrow goes from the files straight to the AWS S3 bucket, labeled as asynchronous upload. A third arrow goes from the deepset Cloud logo to a database icon.

Files stored in deepset Cloud

When you store files in your own VPC, for example in AWS S3 or OpenSearch, they remain in your storage at all times. You authorize deepset Cloud to communicate with your file storage when it needs to index the files. The Deploying Pipelines section covers this in more detail.

An icon representing files with two arrows: one going to the OpenSearch logo and another one going to the AWS S3 bucket. Two dotted-line arrows go from OpenSearch and S3 to the deepset Cloud logo.

Files stored in VPC

Deploying Pipelines

When you deploy a pipeline, it triggers indexing. Indexing means your files are preprocessed, chunked into pieces of raw text called Documents, and stored in the OpenSearchDocumentStore, which is an OpenSearch document database.

After a pipeline is deployed and it's time for indexing:

  1. deepset Cloud fetches the names of files to index from the database.
  2. It then communicates these file names to the indexing pipeline.
  3. The indexing pipeline fetches the actual files from the data storage and starts indexing. During indexing, the files are temporarily stored in the deepset Cloud indexing service.
  4. The indexing pipeline preprocesses the files and sends the resulting documents to the OpenSearchDocumentStore. After that, the files are deleted from the temporary location.

A diagram showing the logo of deepset Cloud with two arrows stemming from it: a bidirectional one going to the database and another one going to the indexing service, illustrated by an icon of four connected squares. Above the indexing service, there's the logo of the AWS S3 bucket with icons of files and an arrow going towards the indexing service. From the indexing service, another arrow goes to the deepset Cloud document store, depicted by a square with the OpenSearch logo.

Indexing with files stored in deepset Cloud
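
The four indexing steps can be sketched with plain Python objects. This is an illustration under stated assumptions, not the actual indexing service: `run_indexing` and `chunk_text` are hypothetical helpers, a dictionary stands in for the file storage, a list stands in for the OpenSearchDocumentStore, and the word-count chunking is an arbitrary choice.

```python
import tempfile
from pathlib import Path


def chunk_text(text: str, words_per_chunk: int = 5) -> list:
    """Split text into word-based chunks (the chunking strategy is an assumption)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def run_indexing(database_file_names, file_storage, document_store):
    # Steps 1 and 2: the file names come from the database and are handed
    # to the indexing pipeline.
    for name in database_file_names:
        # Step 3: fetch the actual file and keep a temporary copy while indexing.
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_path = Path(tmp_dir) / name
            tmp_path.write_text(file_storage[name], encoding="utf-8")
            text = tmp_path.read_text(encoding="utf-8")
            # Step 4: preprocess into documents and send them to the document store.
            for position, chunk in enumerate(chunk_text(text)):
                document_store.append(
                    {"file_name": name, "position": position, "content": chunk}
                )
        # Leaving the `with` block deletes the temporary copy.


file_storage = {"guide.txt": "one two three four five six seven"}
document_store = []
run_indexing(["guide.txt"], file_storage, document_store)
print(document_store)
```

The temporary directory mirrors the temporary location in the indexing service: the file exists there only while its documents are being produced.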

The process is the same regardless of whether you store your files in deepset Cloud or in a private AWS S3 bucket. If you want your data to stay in your accounts, we recommend connecting both a private AWS S3 bucket and a private OpenSearch cluster. If you connect only a private S3 bucket, the documents, which are chunks of your files, are still stored in the deepset Cloud OpenSearch.

A diagram showing a box with the deepset logo in it and an indexing service depicted by four connected squares. This box has three arrows linked to it. The first one, bidirectional, goes downwards towards a database icon. The second one goes upwards towards a bucket icon depicting the AWS S3 bucket in the client's private account. The third one goes to the OpenSearch logo, which represents the client's private OpenSearch cluster.

Indexing with files stored in VPC

Searching

Let's examine what happens at search time. Your query goes to the query pipeline, where the Retriever node connects to the OpenSearchDocumentStore and fetches the documents that match the query.

A question mark illustrating a query with an arrow going from it towards the deepset Cloud logo. The deepset Cloud logo has an arrow going towards a magnifying glass icon depicting the search pipeline. From the search pipeline icon, there's a bidirectional arrow going towards the OpenSearch logo. Then, there's another arrow from the search pipeline towards a green tick icon depicting the search result. The search result then goes towards the database icon and a user icon.

Searching with files stored in deepset Cloud

If you use your private OpenSearch cluster, you authorize deepset Cloud to connect to it at query time. The query pipeline then reaches out to your OpenSearch cluster to fetch the documents from there.

A question mark illustrating a query with an arrow going from it towards the deepset Cloud logo. The deepset Cloud logo has an arrow going towards a magnifying glass icon depicting the search pipeline. From the search pipeline icon, there's a bidirectional arrow going towards the OpenSearch logo. OpenSearch is in an orange box, meaning it's in the VPC. Then, there's another arrow from the search pipeline towards a green tick icon depicting the search result. The search result then goes towards the database icon and a user icon.

Searching with files stored in VPC

The retrieved documents are held in temporary memory only and are not saved anywhere. If it's a question answering pipeline, the Retriever passes the documents on to the Reader or Generator, which produces the final answer based on them.

The results of the query are stored in the deepset Cloud SQL database. The database is protected, and only a limited number of deepset employees can access it.
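
The query-time flow described in this section can be sketched as follows. This is a toy model under stated assumptions: `retrieve` is a hypothetical keyword-overlap retriever, the Reader or Generator is replaced by a placeholder that returns the top document, and plain lists stand in for the document store and the SQL database.

```python
def retrieve(query, document_store, top_k=2):
    """Rank documents by how many query words they share (a toy keyword retriever)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in document_store:
        overlap = len(query_words & set(doc["content"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Sort by score only, so the document dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def run_query(query, document_store, results_db):
    # The retrieved documents exist only in this local variable (temporary memory).
    documents = retrieve(query, document_store)
    # Placeholder for the Reader or Generator: use the best match as the answer.
    answer = documents[0]["content"] if documents else None
    # Only the query result is persisted, mirroring the SQL database step.
    results_db.append({"query": query, "answer": answer})
    return answer


document_store = [
    {"content": "Berlin is the capital of Germany"},
    {"content": "Paris is the capital of France"},
]
results_db = []
answer = run_query("capital of Germany", document_store, results_db)
print(answer)
```

In this sketch, the retrieved documents disappear when `run_query` returns, while the query and its answer remain in `results_db`.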

Using Hosted Models

You can use models hosted by OpenAI, Hugging Face, Cohere, Azure OpenAI, SageMaker, or Amazon Bedrock in your query pipelines. For a full list, see Using Hosted Models and External Services.
Hosted models are especially useful for large language models, which require substantial infrastructure to run.

To use a hosted model, you first connect to the model provider using your credentials. The encrypted credentials are securely stored in the database. When you make a query, deepset Cloud sends an HTTP request to the model provider, including your credentials in the request header. If the authorization is successful, the model generates the response and sends it back to deepset Cloud.
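
As a rough illustration of credentials traveling in the request header, the sketch below builds such a request with Python's standard library without sending it. The URL, the JSON payload shape, the `build_model_request` helper, and the `Bearer` scheme are all assumptions made for the example. Each model provider defines its own endpoint and authentication format.

```python
import json
import urllib.request


def build_model_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request to a hypothetical model provider."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url="https://model-provider.example/v1/generate",  # hypothetical endpoint
        data=payload,
        headers={
            # The credentials go in the request header, not in the body.
            "Authorization": f"Bearer {api_key}",  # the scheme is an assumption
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_model_request("my-secret-key", "What is the capital of Germany?")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

If the provider rejects the credentials, no response is generated, which is why the authorization step comes before the model produces an answer.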

A diagram illustrating the process flow of a search query using deepset Cloud. A query starts from the left, symbolized by a question mark, and enters a component labeled 'deepset Cloud.' It then passes through a 'Query pipeline' where it interacts with the 'DeepsetCloudDocumentStore' and 'OpenSearch' services, depicted with their logos. The process flow shows an HTTP request moving to a 'Model provider,' symbolized by a desktop computer icon, and then an HTTP response (answer) returning through the pipeline. The result is marked with a green check mark before being presented as 'Search results' to a user icon on the right. The bottom part of the diagram shows the results also being saved in a secured 'Database,' represented by a padlocked data icon.

Using hosted models at query time

When using a hosted model, remember that your pipeline's stability depends on the model provider. If you disconnect deepset Cloud from a model provider, all pipelines using models hosted by this provider stop working.