Data Flow in deepset AI Platform

Explore the journey of data in deepset AI Platform, from the moment you upload your files or connect your data storage, through processing, to output.


Let's look at how data flows through deepset AI Platform and where it's stored at each workflow step.

Where Is My Data?

Let's start with what happens to the files when you upload them to deepset and where they are stored. This differs a bit depending on whether you're uploading synchronously or asynchronously. It's also different if you connect storage in your own virtual private cloud (VPC), such as an AWS S3 bucket. Let's look at all these scenarios.

Uploading Files

All files uploaded to deepset AI Platform are eventually stored in the deepset AWS S3 bucket. There are two methods for uploading files: synchronous and asynchronous. When you upload synchronously, there's an additional step where the files go through the deepset main API service before they're sent to S3. When you upload asynchronously, using sessions, your files go directly to the S3 bucket.

deepset AI Platform is also connected to a SQL database. This database stores information about files, such as file name, file id, and when it was created. It does not store the contents of the files.
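To make the split between metadata and file contents concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the register_file helper are assumptions made for illustration; they are not the actual deepset database schema. The point is only that the table holds the file name, file ID, and creation time, and has no column for the file contents.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema for illustration only; the real deepset schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE files (
        file_id    TEXT PRIMARY KEY,
        name       TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
    """
)


def register_file(name: str) -> str:
    """Record metadata for an uploaded file. The file body is stored elsewhere."""
    file_id = str(uuid.uuid4())
    created_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO files (file_id, name, created_at) VALUES (?, ?, ?)",
        (file_id, name, created_at),
    )
    conn.commit()
    return file_id


file_id = register_file("annual_report.pdf")
row = conn.execute(
    "SELECT name, created_at FROM files WHERE file_id = ?", (file_id,)
).fetchone()
print(row[0])
```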

Files stored in deepset AI Platform

When you store files in your own VPC, for example in AWS S3 or OpenSearch, your files remain in your storage at all times. You authorize deepset to communicate with the file storage when it needs to index the files. We'll cover that in Deploying Pipelines in more detail.

Files stored in VPC

Enabling Indexes

When you enable an index, it triggers indexing. Indexing means your files are preprocessed, chunked into pieces of raw text called Documents, and stored in the document store. OpenSearchDocumentStore is the default, core document store of deepset AI Platform, but you can use any other supported database.

During indexing:

1. deepset AI Platform fetches the names of files to index from the database.
2. It then communicates these file names to the index.
3. The index fetches the actual files from the data storage and starts indexing. During indexing, the files are temporarily stored in the deepset indexing service.
4. The index preprocesses the files and sends the resulting documents to the document store. After that, the files are deleted from the temporary location.
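The indexing steps can be sketched as a toy simulation. Everything here is a stand-in: metadata_db, file_storage, document_store, and run_indexing are hypothetical names, a plain dictionary plays the role of the file storage, and splitting on blank lines stands in for real preprocessing. It is not how the deepset indexing service is implemented; it only shows the order of operations and the cleanup of temporary copies.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the components described in the steps.
metadata_db = ["notes.txt"]  # file names known to the database
file_storage = {"notes.txt": "First paragraph.\n\nSecond paragraph."}  # S3-like storage
document_store = []  # where the resulting documents end up


def run_indexing() -> None:
    # Fetch the names of the files to index from the database.
    file_names = list(metadata_db)
    # Temporary location that exists only while indexing runs.
    with tempfile.TemporaryDirectory() as tmp:
        for name in file_names:
            # Fetch the actual file from storage into the temporary location.
            local_copy = Path(tmp) / name
            local_copy.write_text(file_storage[name], encoding="utf-8")
            # Preprocess: split the text into chunks (documents).
            text = local_copy.read_text(encoding="utf-8")
            for position, chunk in enumerate(text.split("\n\n")):
                document_store.append(
                    {"file_name": name, "chunk": position, "content": chunk}
                )
    # Leaving the "with" block deletes the temporary copies.


run_indexing()
print(len(document_store))
```

Only the chunks remain in document_store after the run; the temporary directory and the copies in it are gone.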

This graphic shows the flow using the example of OpenSearchDocumentStore:

Indexing with files stored in deepset

The process is the same regardless of whether you store your files in deepset or in a private AWS S3 bucket. If you want your data to stay in your accounts, we recommend connecting both a private AWS S3 bucket and a private OpenSearch cluster. If you connect only the S3 bucket, the documents, which are chunks of your files, are still stored in the deepset OpenSearch cluster.

Indexing with files stored in VPC

Searching

Let's examine what happens at search time. Your query goes to the query pipeline. The pipeline, or more specifically its Retriever node, connects to the document store and fetches the documents that match the query.

Searching with files stored in deepset AI Platform

If you use your private OpenSearch cluster, you authorize deepset to connect to it at query time. The query pipeline then reaches out to your OpenSearch cluster to fetch the documents from there.

Searching with files stored in VPC

The fetched documents are held in memory only temporarily and are not saved anywhere. If it's a question answering pipeline, the Retriever passes the documents on to the Reader or Generator, which comes up with the final answer based on them.
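As a rough illustration of the retrieval step, here is a toy retriever that ranks documents by how many query words they share. This keyword-overlap scoring and the retrieve function are assumptions chosen for brevity; they are not the retrieval method deepset AI Platform uses. The matching documents exist only as an in-memory list that a later node would consume.

```python
def retrieve(query, document_store, top_k=2):
    """Return up to top_k documents sharing the most words with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for doc in document_store:
        overlap = len(query_terms & set(doc["content"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep store order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


documents = [
    {"content": "Berlin is the capital of Germany"},
    {"content": "Paris is the capital of France"},
    {"content": "Bananas are yellow"},
]

# The results live only in this in-memory list until the next node consumes them.
results = retrieve("capital of Germany", documents)
print(results[0]["content"])
```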

The results of the query are stored in the deepset SQL database. The database is protected, and only a select number of deepset employees can access it.

Using Hosted Models

You can use models hosted by OpenAI, Hugging Face, Cohere, Azure OpenAI, SageMaker, or Amazon Bedrock in your query pipelines. For a full list, see Using Hosted Models and External Services.
A hosted model is especially useful for large language models requiring substantial infrastructure.

To use a hosted model, you first connect to the model provider using your credentials. The encrypted credentials are securely stored in the database. When you make a query, deepset sends an HTTP request to the model provider, including your credentials in the request header. If the authorization is successful, the model generates the response and sends it back to deepset AI Platform.
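Here is a minimal sketch of what such a request could look like, built with Python's standard urllib module. The endpoint URL, the JSON payload shape, and the Bearer authorization scheme are all assumptions for illustration; each model provider defines its own API, and this is not the code deepset runs. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; real providers publish their own URLs and payload formats.
PROVIDER_URL = "https://api.example.com/v1/generate"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build a POST request that carries the credentials in a request header."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        PROVIDER_URL,
        data=body,
        headers={
            # Assumed scheme: many providers accept a bearer token here.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("my-secret-key", "What is the capital of Germany?")
print(request.get_method())
```

Passing the request to urllib.request.urlopen would send it. The provider would then check the header before generating a response.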

Using hosted models at query time

When using a hosted model, remember that your pipeline's stability depends on the model provider. If you disconnect deepset AI Platform from a model provider, all pipelines using models hosted by this provider stop working.
