Document Stores
A document store is a database that stores the preprocessed documents produced by your indexing pipeline. The query pipeline uses retrievers to access the document store and fetch the documents relevant to a query.
Document Store Concepts
A document store is a Haystack concept that refers to an object that stores your data. It's an interface to a database like OpenSearch, Weaviate, or Pinecone, for storing and retrieving your data.
A document store stores data as Document objects. Each document has a unique ID and metadata, and can optionally include vector representations (embeddings) for vector-based search.
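As a rough illustration of the fields described above, here is a minimal sketch of a document-like object. `SketchDocument` is a hypothetical class written for this example, not the Haystack `Document` class, and deriving the ID from a content hash is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional
import hashlib


@dataclass
class SketchDocument:
    """Illustrative stand-in for a Document object (not the Haystack class)."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)
    embedding: Optional[list[float]] = None  # optional vector representation
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption for this sketch: derive a stable ID from the content
        # when no ID is given. A real store may compute IDs differently.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


doc = SketchDocument(
    content="Document stores hold preprocessed documents.",
    meta={"file_name": "intro.txt"},
)
```

The embedding stays `None` until an embedding step fills it in, which mirrors the "optional" nature of vector representations.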
Document
To store your data in a document store, you must first convert it into Document objects. Documents are individual pieces of information that can include text, data frames, or binary data. When an indexing pipeline runs, files uploaded to your deepset Cloud workspace are preprocessed, cleaned, split, and converted into Document objects using PreProcessor components. One file can be split into multiple documents. Once processed, the DocumentWriter component writes them into a document store.
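To make the "one file, multiple documents" idea concrete, the following sketch splits a file's text into fixed-size word chunks. The function name `split_into_documents`, the `words_per_doc` parameter, and the `split_id` metadata key are hypothetical choices for this example. Actual PreProcessor components offer their own splitting options.

```python
def split_into_documents(file_name: str, text: str, words_per_doc: int = 50) -> list[dict]:
    """Split one file's text into several document-like dictionaries."""
    words = text.split()
    documents = []
    for start in range(0, len(words), words_per_doc):
        split_id = start // words_per_doc
        documents.append(
            {
                "id": f"{file_name}-{split_id}",
                "content": " ".join(words[start:start + words_per_doc]),
                # Each document keeps a pointer back to the file it came from.
                "meta": {"file_name": file_name, "split_id": split_id},
            }
        )
    return documents


# A 120-word file with 50 words per document yields three documents.
docs = split_into_documents("report.txt", "word " * 120, words_per_doc=50)
```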
Query pipelines work on the documents stored in the document store, not directly on the uploaded files. A Retriever fetches the relevant documents from the document store and passes them to subsequent components in the pipeline to resolve queries or run other tasks.
Index
An index is a data structure that helps you quickly find relevant documents without scanning every single document. Think of it as a book index that helps you find specific topics without reading the entire book. It's crucial for fast, efficient search, making it possible to handle large-scale datasets.
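The book-index analogy corresponds closely to an inverted index, one common index structure for keyword search. The sketch below is a simplified illustration with made-up sample data. It does not show how any particular document store implements its index.

```python
from collections import defaultdict

# Made-up sample data for the sketch.
documents = {
    "doc1": "document stores hold documents",
    "doc2": "retrievers fetch documents from the store",
    "doc3": "an index speeds up search",
}

# Build the index once: map each term to the IDs of the documents containing it.
inverted_index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)


def lookup(term: str) -> set[str]:
    """Return matching document IDs with one dictionary lookup, without scanning all texts."""
    return inverted_index.get(term.lower(), set())
```

Looking up a term costs one dictionary access regardless of how many documents exist, which is the property that makes search over large datasets practical.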
Writing Documents into the Store
You can write documents into a document store using DocumentWriter. As a best practice, include DocumentWriter as the last component in your indexing pipeline, specifying the document store in its configuration.
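The pattern of placing the writer last can be sketched as follows. `SketchStore`, `SketchWriter`, and `run_indexing` are hypothetical stand-ins written for this example. They imitate the idea of a writer configured with a document store, not the actual DocumentWriter API.

```python
class SketchStore:
    """Toy in-memory stand-in for a document store."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def write_documents(self, documents: list[dict]) -> int:
        # Assumption for this sketch: a document with an existing ID is overwritten.
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def count_documents(self) -> int:
        return len(self._docs)


class SketchWriter:
    """Final step of a toy indexing pipeline: persists documents to its store."""

    def __init__(self, document_store: SketchStore) -> None:
        self.document_store = document_store

    def run(self, documents: list[dict]) -> dict:
        return {"documents_written": self.document_store.write_documents(documents)}


def run_indexing(raw_texts: list[str], writer: SketchWriter) -> dict:
    # Earlier steps (here, trivial cleaning) produce documents; the writer runs last.
    cleaned = [text.strip() for text in raw_texts if text.strip()]
    documents = [{"id": f"doc-{i}", "content": text} for i, text in enumerate(cleaned)]
    return writer.run(documents)


store = SketchStore()
result = run_indexing(["  first text ", "", "second text"], SketchWriter(store))
```

Keeping the writer at the end means every earlier step has finished transforming the data before anything is persisted.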
Retrieving Documents
Document stores work with retrievers. Retrievers in your query pipeline access the document store to fetch the documents relevant to the query. Each document store has dedicated retrievers, usually a keyword retriever, a vector retriever, and sometimes a hybrid retriever that combines both. This is because retrievers rely on the document store technology to fetch documents.
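The difference between keyword and vector retrieval can be sketched with plain Python. The sample documents, the two-dimensional embeddings, and the function names are all invented for illustration. Real retrievers delegate scoring to the underlying database instead of computing it in application code.

```python
import math

# Made-up documents with tiny, hand-written embeddings.
documents = [
    {"id": "a", "content": "opensearch stores documents", "embedding": [1.0, 0.0]},
    {"id": "b", "content": "retrievers fetch relevant documents", "embedding": [0.0, 1.0]},
    {"id": "c", "content": "pipelines connect components", "embedding": [0.7, 0.7]},
]


def keyword_retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by how many query terms appear in their content."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d["content"].split())), d["id"]) for d in documents]
    matches = [(score, doc_id) for score, doc_id in scored if score > 0]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in matches[:top_k]]


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def vector_retrieve(query_embedding: list[float], top_k: int = 2) -> list[str]:
    """Rank documents by cosine similarity between embeddings."""
    ranked = sorted(
        documents,
        key=lambda d: cosine(query_embedding, d["embedding"]),
        reverse=True,
    )
    return [d["id"] for d in ranked[:top_k]]
```

A hybrid retriever would combine the two rankings. How the scores are merged varies by implementation and is left out of this sketch.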
Configuring a Document Store
To configure a document store, use the document_store parameter. You can set this parameter for DocumentWriter (in your indexing pipeline) or for retrievers (in your query pipeline).
You configure the settings in YAML. For details on the parameters you can customize, check the database you want to use in Haystack's Integrations API documentation.
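The sketch below mirrors, as Python dictionaries, the nested shape such a configuration may take. It assumes the `type` / `init_parameters` layout that Haystack uses when serializing components. The component names (`writer`, `retriever`), the type paths, and the empty parameter sets are illustrative and should be checked against the documentation for your document store. `document_store_types` is a hypothetical helper.

```python
# Assumed type path, shown for illustration only.
OPENSEARCH_STORE = {
    "type": "haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore",
    "init_parameters": {},  # store-specific settings would go here
}

# Indexing pipeline: the writer receives the document store.
indexing_config = {
    "components": {
        "writer": {
            "type": "haystack.components.writers.document_writer.DocumentWriter",
            "init_parameters": {"document_store": OPENSEARCH_STORE},
        },
    },
}

# Query pipeline: the retriever receives the same kind of document store.
query_config = {
    "components": {
        "retriever": {
            "type": "haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever",
            "init_parameters": {"document_store": OPENSEARCH_STORE},
        },
    },
}


def document_store_types(config: dict) -> dict[str, str]:
    """List which components carry a document_store parameter, and its type."""
    found = {}
    for name, component in config["components"].items():
        store = component.get("init_parameters", {}).get("document_store")
        if store is not None:
            found[name] = store["type"]
    return found
```

In YAML, the same structure would appear as nested mappings under each component's `init_parameters`.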
Supported Document Stores
Currently, you can use the following document stores in deepset Cloud:
Core Document Stores
OpenSearch is the only core document store of deepset Cloud. We manage its infrastructure and credentials and have access to its index and indexing information, such as the detailed status of files being indexed. deepset Cloud also manages file updates, including the metadata and deletions, and keeps the document store in sync.
Integrations
These document stores run on your infrastructure, and you're responsible for managing the credentials (you provide them in the configuration). When you deploy your indexing pipeline, deepset Cloud creates the index for these document stores, but the number of indexed files will always display as 0.
For integrations, deepset Cloud also handles metadata updates and file deletions, ensuring that changes are reflected in the document store.
Other
Snowflake is a table database that doesn't have an index. You can query your Snowflake data using DeepsetSnowflakeRetriever, which accesses the database and fetches a table that matches the SQL query.
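To illustrate what "fetching a table that matches the SQL query" means, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a table database. It does not use Snowflake or DeepsetSnowflakeRetriever. The `orders` table, its contents, and the `fetch_table` helper are invented for this example.

```python
import sqlite3

# In-memory SQLite database standing in for a table database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 45.5)],
)


def fetch_table(sql: str, params: tuple = ()) -> tuple[list[str], list[tuple]]:
    """Run a SQL query and return the result as column names plus rows."""
    cursor = connection.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


columns, rows = fetch_table(
    "SELECT region, SUM(amount) AS total FROM orders WHERE region = ? GROUP BY region",
    ("EU",),
)
```

The result is a small table (column names and rows), not a ranked list of documents, which is why no document index is involved.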
Comparison
This table compares the document stores in deepset Cloud:
Document store | Infrastructure | Index | Indexing status | File updates (deletions, metadata updates) |
---|---|---|---|---|
OpenSearch | Managed by deepset Cloud | Managed by deepset Cloud | Shown in detail (indexed, skipped, and failed files) | Managed by deepset Cloud |
Elasticsearch | Your own | Created on pipeline deploy, deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
Pinecone | Your own (you need to host the database locally) | Created on pipeline deploy, deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
Qdrant | Your own | Created on pipeline deploy, deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
Snowflake | Your own | Not available (Snowflake doesn't use indexes) | Not available | You're responsible for managing the tables |
Weaviate | Your own | Created on pipeline deploy, deleted on pipeline undeploy | No information, always shown as 0 | Managed by deepset Cloud |
Choosing the Right Document Store
See the Haystack Guide to choose the document store that works best for your scenario.