Document Stores

A document store is a database that stores the preprocessed documents produced by your indexing pipeline. The query pipeline uses retrievers to access the document store and fetch the documents relevant to a query.

Document Store Concepts

A document store is a Haystack concept that refers to an object that stores your data. It's an interface to a database like OpenSearch, Weaviate, or Pinecone, for storing and retrieving your data.

A document store stores data as Document objects. Each document has a unique ID and metadata, and can optionally include a vector representation (embedding) to support vector-based search.
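As a rough illustration of that shape, the following sketch models a document with a unique ID, metadata, and an optional embedding. The SketchDocument class, its field names, and the content-hash ID are assumptions made for this example; this is not Haystack's Document class.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document object (not the Haystack class)."""

    content: str
    meta: dict = field(default_factory=dict)
    # The vector representation is optional; it stays None until you embed the text.
    embedding: Optional[List[float]] = None
    id: str = ""

    def __post_init__(self):
        # Assumption: when no ID is given, derive a stable one from the content.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


doc = SketchDocument(
    content="Berlin is the capital of Germany.",
    meta={"file_name": "cities.txt"},
)
same_text = SketchDocument(content="Berlin is the capital of Germany.")
```

In this sketch, two documents with identical content get the same ID, which is one simple way to detect duplicates. A real store may compute IDs differently.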

Document

To store your data in a document store, you must first convert it into Document objects. Documents are individual pieces of information that can include text, data frames, or binary data. When an indexing pipeline runs, files uploaded to your deepset Cloud workspace are preprocessed, cleaned, split, and converted into Document objects using PreProcessor components. One file can be split into multiple documents. Once processed, the DocumentWriter component writes them into a document store.
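The step where one file becomes several documents can be sketched as follows. This is a minimal example that cleans whitespace and splits by a fixed word count. The split_into_documents function, the words_per_doc parameter, and the split_id metadata key are hypothetical names, not the behavior of any specific PreProcessor component.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document object."""

    content: str
    meta: dict = field(default_factory=dict)


def split_into_documents(file_name, text, words_per_doc=5):
    """Clean the text and split it into word chunks, one document per chunk."""
    # str.split() with no arguments also collapses repeated spaces and newlines.
    words = text.split()
    documents = []
    for start in range(0, len(words), words_per_doc):
        chunk = " ".join(words[start:start + words_per_doc])
        # Each document remembers which file it came from and its position.
        meta = {"file_name": file_name, "split_id": len(documents)}
        documents.append(SketchDocument(content=chunk, meta=meta))
    return documents


docs = split_into_documents(
    "notes.txt",
    "one two  three four five six\nseven eight nine ten eleven twelve",
)
```

Here a single 12-word file yields three documents, each carrying the source file name in its metadata.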

Query pipelines work on the documents stored in the document store, not directly on the uploaded files. A Retriever fetches the relevant documents from the document store and passes them to subsequent components in the pipeline to resolve queries or run other tasks.

Index

An index is a data structure that helps you quickly find relevant documents without scanning every single document. Think of it as a book index that helps you find specific topics without reading the entire book. It's crucial for fast, efficient search, making it possible to handle large-scale datasets.
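To make the book-index analogy concrete, here is a small sketch of an inverted index, one common index structure for keyword search. It is an illustration only: the actual index structure depends on the database behind the document store, and the function names are made up for this example.

```python
from collections import defaultdict

documents = {
    "doc1": "document stores keep your data",
    "doc2": "retrievers fetch relevant documents",
    "doc3": "an index speeds up search over your data",
}


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return IDs of documents containing every query term.

    Only the index is consulted; the document texts are never scanned.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(documents)
matches = lookup(index, "your data")
```

The index is built once at write time, so each lookup costs a few set operations instead of a pass over every document.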

Writing Documents into the Store

You can write documents into a document store using DocumentWriter. As a best practice, include DocumentWriter as the last component in your indexing pipeline, specifying the document store in its configuration.
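The following sketch shows the idea of a writer as the final step of an indexing flow. The classes below are simplified, hypothetical stand-ins rather than the Haystack DocumentWriter or a real document store, and overwriting documents with an existing ID is an assumption of this example.

```python
class SketchDocumentStore:
    """Hypothetical in-memory store keyed by document ID."""

    def __init__(self):
        self.documents = {}

    def write_documents(self, documents):
        written = 0
        for doc in documents:
            # Assumption: a document with an existing ID is overwritten.
            self.documents[doc["id"]] = doc
            written += 1
        return written


class SketchDocumentWriter:
    """Hypothetical writer that is configured with the store it writes to."""

    def __init__(self, document_store):
        self.document_store = document_store

    def run(self, documents):
        count = self.document_store.write_documents(documents)
        return {"documents_written": count}


def clean(texts):
    """Collapse whitespace in each raw text."""
    return [" ".join(text.split()) for text in texts]


def to_documents(texts):
    """Turn cleaned texts into simple document dictionaries."""
    return [{"id": f"doc-{i}", "content": t} for i, t in enumerate(texts)]


store = SketchDocumentStore()
writer = SketchDocumentWriter(document_store=store)

# The writer runs last, after cleaning and conversion.
result = writer.run(to_documents(clean(["  Hello   world ", "Second\nfile"])))
```

Placing the writer last means everything that reaches the store has already been cleaned and converted.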

Retrieving Documents

Document stores work with retrievers. Retrievers in your query pipeline access the document store to fetch the documents relevant to the query. Each document store has dedicated retrievers, usually a keyword retriever, a vector retriever, and sometimes a hybrid retriever that combines both. This is because retrievers rely on the document store technology to fetch documents.
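The difference between keyword, vector, and hybrid retrieval can be sketched with a toy scorer. Everything here is an assumption for illustration: the two-dimensional embeddings are hand-written, the keyword score is a simple term-overlap ratio, and the hybrid score is a weighted sum. Real retrievers delegate scoring to the database and may combine results differently.

```python
import math

DOCS = [
    {"id": "a", "content": "reset your password in account settings", "embedding": [0.9, 0.1]},
    {"id": "b", "content": "change login credentials", "embedding": [0.8, 0.2]},
    {"id": "c", "content": "shipping times for orders", "embedding": [0.1, 0.9]},
]


def keyword_score(query, doc):
    """Fraction of query terms that appear in the document text."""
    terms = query.lower().split()
    words = set(doc["content"].lower().split())
    return sum(term in words for term in terms) / len(terms)


def vector_score(query_embedding, doc):
    """Cosine similarity between the query and document embeddings."""
    dot = sum(q * d for q, d in zip(query_embedding, doc["embedding"]))
    query_norm = math.sqrt(sum(q * q for q in query_embedding))
    doc_norm = math.sqrt(sum(d * d for d in doc["embedding"]))
    return dot / (query_norm * doc_norm)


def hybrid_retrieve(query, query_embedding, docs, top_k=2, alpha=0.5):
    """Rank documents by a weighted sum of keyword and vector scores."""
    scored = []
    for doc in docs:
        score = alpha * keyword_score(query, doc)
        score += (1 - alpha) * vector_score(query_embedding, doc)
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


top = hybrid_retrieve("reset password", [1.0, 0.0], DOCS)
```

Document "b" shares no words with the query, so keyword scoring alone gives it zero, while its embedding is close to the query's. Combining both signals lets it rank second, behind the exact keyword match.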

Configuring a Document Store

To configure a document store, use the document_store parameter. You can configure this parameter for DocumentWriter (in your indexing pipeline) or retrievers (in your query pipeline):

[Screenshot: OpenSearchEmbeddingRetriever in Studio with the document_store parameter configuration open]

You configure the settings in YAML. For details on the parameters you can customize, check the database you want to use in Haystack's Integrations API documentation.
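As a sketch of what such a configuration can look like, the example below builds a nested document_store setting as a Python dictionary and prints it as indented, YAML-style lines. It assumes Haystack's serialized component layout (a type path plus init_parameters). The type paths, the embedding_dim and top_k parameters, and their values are illustrative and should be checked against the Integrations API documentation for your database. The to_yaml_lines helper is a minimal renderer written for this example, not a YAML library.

```python
# Assumed layout: each component is described by "type" and "init_parameters".
retriever_config = {
    "type": (
        "haystack_integrations.components.retrievers.opensearch."
        "embedding_retriever.OpenSearchEmbeddingRetriever"
    ),
    "init_parameters": {
        # The document store is nested inside the retriever's parameters.
        "document_store": {
            "type": (
                "haystack_integrations.document_stores.opensearch."
                "document_store.OpenSearchDocumentStore"
            ),
            "init_parameters": {"embedding_dim": 768},  # placeholder value
        },
        "top_k": 10,  # placeholder value
    },
}


def to_yaml_lines(value, indent=0):
    """Render a nested dict of scalars as indented 'key: value' lines."""
    lines = []
    pad = " " * indent
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(item, indent + 2))
        else:
            lines.append(f"{pad}{key}: {item}")
    return lines


lines = to_yaml_lines(retriever_config)
print("\n".join(lines))
```

The same nested document_store block would appear under DocumentWriter in an indexing pipeline, so both pipelines point at the same store.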

Supported Document Stores

Currently, you can use the following document stores in deepset Cloud:

Core Document Stores

OpenSearch

OpenSearch is the only core document store of deepset Cloud. We manage its infrastructure and credentials and have access to its index and indexing information, such as the detailed status of files being indexed. deepset Cloud also manages file updates, including the metadata and deletions, and keeps the document store in sync.

Integrations

The integrated document stores (Elasticsearch, Pinecone, Qdrant, and Weaviate) run on your infrastructure, and you're responsible for managing the credentials, which you provide in the configuration. When you deploy your indexing pipeline, deepset Cloud creates the index for these document stores, but the number of indexed files is always displayed as 0.

For integrations, deepset Cloud also handles metadata updates and file deletions, ensuring that changes are reflected in the document store.

Other

Snowflake is a table database that doesn't have an index. You can query your Snowflake data using DeepsetSnowflakeRetriever, which accesses the database and fetches a table that matches the SQL query.
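To illustrate how this differs from index-based retrieval, the sketch below runs a SQL query and gets back a table (column names plus rows) instead of ranked documents. It uses Python's built-in sqlite3 module as a stand-in for Snowflake, and the orders table is invented for the example. It does not show the actual interface of DeepsetSnowflakeRetriever.

```python
import sqlite3


def fetch_table(connection, sql_query):
    """Run a SQL query and return the result as (column names, rows)."""
    cursor = connection.execute(sql_query)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    return columns, rows


# Stand-in database with a made-up table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 120), ("US", 80), ("EU", 30)],
)

columns, rows = fetch_table(
    connection,
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
)
```

The result is whatever table the SQL query defines, so the query itself, not an index, determines what is retrieved.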

Comparison

This table compares the document stores in deepset Cloud:

| Document store | Infrastructure | Index | Indexing status | File updates (deleting, metadata updates) |
|---|---|---|---|---|
| OpenSearch | Managed by deepset Cloud | Managed by deepset Cloud | Shown in details (indexed, skipped, and failed files) | Managed by deepset Cloud |
| Elasticsearch | Your own | Created on pipeline deploy; deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
| Pinecone | Your own (you need to host the database locally) | Created on pipeline deploy; deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
| Qdrant | Your own | Created on pipeline deploy; deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
| Snowflake | Your own | Not available (Snowflake doesn't use indexes) | Not available | You're responsible for managing the tables |
| Weaviate | Your own | Created on pipeline deploy; deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |

Choosing the Right Document Store

Have a look at the Haystack Guide to help you choose the document store that will work best for your scenario.