DeepsetCloudDocumentStore

A document store is a kind of database that stores text and metadata and then provides them to the Retriever at query time. In deepset Cloud, you can use the DeepsetCloudDocumentStore. Learn how it works.

Working with pipelines in different environments requires a document store that can be shared among them and is compatible with all retrievers. This is why we created the DeepsetCloudDocumentStore. It makes it possible to interact with documents stored in deepset Cloud without having to index your data again.

DeepsetCloudDocumentStore is designed to access data that's already stored in deepset Cloud. It is not intended for use in production-like scenarios. For these scenarios, use API endpoints.

Usage

In most cases, you use DeepsetCloudDocumentStore within a pipeline. However, if you want to initialize it on its own, run:

import os
os.environ["DEEPSET_CLOUD_API_KEY"] = "<your_api_key>"
os.environ["DEEPSET_CLOUD_API_ENDPOINT"] = "https://api.cloud.deepset.ai/api/v1"

from haystack.document_stores import DeepsetCloudDocumentStore

# Replace the placeholder with the name of an existing pipeline in your workspace.
document_store = DeepsetCloudDocumentStore(index="<your_pipeline_name>")
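
The snippet above sets both environment variables in code. The arguments table describes a fallback for api_key and api_endpoint: an explicitly passed value is used, otherwise the value is read from the environment. As a minimal sketch of that lookup order, the helper below mimics it in plain Python. The name resolve_setting is hypothetical and is not part of Haystack; it only illustrates the documented behavior, not the library's implementation.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_var: str) -> str:
    """Return the explicit value if given, otherwise fall back to the environment.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    if explicit is not None:
        return explicit
    value = os.environ.get(env_var)
    if value is None:
        # Neither source provided a value, so fail early with a clear message.
        raise ValueError(f"Pass a value explicitly or set {env_var}.")
    return value
```

For example, resolve_setting(None, "DEEPSET_CLOUD_API_KEY") returns whatever is stored in that environment variable, which lets you keep the key out of your scripts.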

Arguments

These are the arguments the DeepsetCloudDocumentStore takes.

Note

When you create DeepsetCloudDocumentStore using a pipeline YAML in the deepset Cloud pipeline editor, these parameters are ignored:

  • api_key
  • workspace
  • index
  • api_endpoint
  • label_index

In the Python SDK, all parameters are used.
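
As a sketch of passing the parameters explicitly from Python, the snippet below collects them in a dictionary and hands them to the constructor. The keyword names follow the arguments table on this page; the pipeline name my-pipeline and the helper build_document_store are placeholders chosen for illustration. The import is deferred into the function on the assumption that creating the document store needs Haystack installed and may contact the deepset Cloud API, so the settings can be inspected without either.

```python
import os
from typing import Any, Dict

# Placeholder values; replace them with your own.
# Keyword names follow the arguments table on this page.
store_kwargs: Dict[str, Any] = {
    "api_key": os.environ.get("DEEPSET_CLOUD_API_KEY"),
    "api_endpoint": "https://api.cloud.deepset.ai/api/v1",
    "workspace": "default",
    "index": "my-pipeline",  # hypothetical pipeline name
    "duplicate_documents": "skip",
    "similarity": "dot_product",
    "return_embedding": False,
    "embedding_dim": 768,
    "use_prefiltering": False,
}


def build_document_store(kwargs: Dict[str, Any]):
    """Create the document store from the given settings.

    Hypothetical helper; requires Haystack and valid credentials.
    """
    # Imported here so the settings can be reviewed without Haystack installed.
    from haystack.document_stores import DeepsetCloudDocumentStore

    return DeepsetCloudDocumentStore(**kwargs)
```

Calling build_document_store(store_kwargs) is then the only step that needs a working deepset Cloud setup.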

| Argument | Type | Possible Values | Description |
|---|---|---|---|
| api_key | String | | The secret value of the API key. This is the value that you copy in step 4 of Generating an API Key. If you do not specify it, it is read from the DEEPSET_CLOUD_API_KEY environment variable. |
| workspace | String | default | Specifies the deepset Cloud workspace containing the pipeline. Optional. Set this value to default. |
| index | String | | The name of the pipeline to access within the deepset Cloud workspace. In deepset Cloud, indexes share their names with their respective pipelines. |
| duplicate_documents | String | skip: ignores duplicate documents; overwrite: updates any existing documents with the same ID when adding documents; fail: raises an error if the ID of the document being added already exists | Specifies how to handle duplicate documents. |
| api_endpoint | String | <https://api.cloud.deepset.ai/api/v1> | Specifies the URL of the deepset Cloud API. If you don't specify it, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. |
| similarity | String | dot_product (default): shows better performance with DPR embeddings; cosine: recommended if you are using a Sentence-BERT model | Specifies the similarity function used to compare document vectors. |
| label_index | String | | Specifies the name of the evaluation set uploaded to deepset Cloud. In deepset Cloud, label indexes share their names with their corresponding evaluation sets. |
| return_embedding | Boolean | True/False | Specifies whether to return document embeddings. |
| embedding_dim | Integer | Default: 768 | Specifies the dimensionality of the embedding vector. This is only needed when using a dense retriever, such as a DensePassageRetriever or EmbeddingRetriever. |
| use_prefiltering | Boolean | True/False. Default: False | Specifies when to apply filters to a search. By default, DeepsetCloudDocumentStore uses post-filtering when querying with filters, which means the filters are applied after the documents are retrieved. Set this to True to use pre-filtering, where the filters are applied before the documents are retrieved. Pre-filtering comes at the cost of higher latency. |
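
To make the difference between post-filtering and pre-filtering concrete, here is a toy sketch in plain Python. It does not use deepset Cloud or Haystack: the corpus, the score field, and the function names are all invented for illustration, and the real implementation may differ. It only shows the order of operations that use_prefiltering switches between.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]

# Toy corpus: "score" stands in for the retriever's relevance score.
DOCS: List[Document] = [
    {"id": "a", "score": 0.9, "lang": "de"},
    {"id": "b", "score": 0.8, "lang": "en"},
    {"id": "c", "score": 0.7, "lang": "de"},
    {"id": "d", "score": 0.6, "lang": "en"},
]


def top_k(docs: List[Document], k: int) -> List[Document]:
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:k]


def post_filter(docs: List[Document], keep: Callable[[Document], bool], k: int) -> List[Document]:
    """Retrieve the top k first, then drop documents that fail the filter."""
    return [d for d in top_k(docs, k) if keep(d)]


def pre_filter(docs: List[Document], keep: Callable[[Document], bool], k: int) -> List[Document]:
    """Apply the filter first, then retrieve the top k of what remains."""
    return top_k([d for d in docs if keep(d)], k)


def is_english(doc: Document) -> bool:
    return doc["lang"] == "en"


post_ids = [d["id"] for d in post_filter(DOCS, is_english, k=2)]
pre_ids = [d["id"] for d in pre_filter(DOCS, is_english, k=2)]
print(post_ids)  # ['b']
print(pre_ids)   # ['b', 'd']
```

In this toy model, post-filtering returns fewer than k documents because one of the top two results fails the filter, while pre-filtering considers only matching documents and fills both slots.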

Known Limitations

DeepsetCloudDocumentStore is read-only and cannot be used for creating new pipelines. To use this document store, the pipeline must already exist.