DeepsetCloudDocumentStore
A document store is a kind of database that stores text and metadata and then provides them to the Retriever at query time. Learn how it works.
Working with pipelines in different environments requires a DocumentStore that can be shared among them and is compatible with all retrievers. This is why we created the DeepsetCloudDocumentStore. It makes it possible to interact with Documents stored in deepset Cloud without having to index your data again.
When a pipeline is deployed, it indexes the files. This means it turns them into Documents, and then stores these Documents together with their metadata in the DocumentStore. These Documents are then used at query time. The Retriever fetches them from the DocumentStore.
DeepsetCloudDocumentStore is designed to access data that's already stored in deepset Cloud. It's read-only and not intended for production scenarios. For those, use the deepset Cloud API endpoints instead.
Basic Information
- Pipeline type: Used in indexing pipelines.
- Position in the pipeline: The last node in the pipeline, after the Retriever; takes the Retriever's output as input.
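At query time, the roles are reversed: the Retriever reads from the DocumentStore instead of feeding it. A minimal query pipeline referencing the same DocumentStore might look like the following sketch (the `BM25Retriever` component and the `query` pipeline name are illustrative; your component names and retriever type may differ):

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore   # reads Documents stored in deepset Cloud
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
```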
Usage Example
In most cases, you use DeepsetCloudDocumentStore within a pipeline:
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
...
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
...
```
Avoiding Duplicate Documents
To prevent duplicate documents from being cited as sources multiple times, set the `duplicate_documents` parameter to `skip` or `overwrite`, and use the PreProcessor's `id_hash_keys` parameter to configure how duplicates are identified.
For example, to identify duplicate documents by their content and skip the duplicates, set DeepsetCloudDocumentStore's `duplicate_documents` parameter to `skip` and PreProcessor's `id_hash_keys` parameter to `content`, like this:
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params:
      duplicate_documents: skip
  - name: Preprocessor
    type: PreProcessor
    params:
      id_hash_keys:
        - content
...
```
During indexing, we add contextual metadata to your documents, such as `file_id`. This means that even if your files have the same name and the same content, their metadata will differ because each file is assigned a different `file_id`. That's why setting `id_hash_keys` to `meta` doesn't identify duplicates.
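The effect of `id_hash_keys` can be illustrated with a small stand-alone sketch. This is not deepset Cloud's actual hashing code; `make_id` and the sample documents below are made up purely to show why hashing content detects duplicates while hashing metadata does not:

```python
import hashlib

def make_id(doc: dict, id_hash_keys: list) -> str:
    """Derive a document ID by hashing the selected fields (illustrative only)."""
    material = "".join(str(doc[key]) for key in id_hash_keys)
    return hashlib.md5(material.encode("utf-8")).hexdigest()

# Two files with identical content but different file_id metadata:
doc_a = {"content": "Same text.", "meta": {"file_id": "f-001"}}
doc_b = {"content": "Same text.", "meta": {"file_id": "f-002"}}

# Hashing the content yields the same ID, so the duplicate is detected:
print(make_id(doc_a, ["content"]) == make_id(doc_b, ["content"]))  # True

# Hashing the metadata yields different IDs, so the duplicate goes unnoticed:
print(make_id(doc_a, ["meta"]) == make_id(doc_b, ["meta"]))        # False
```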
Parameters
When you create DeepsetCloudDocumentStore in the deepset Cloud Pipeline Designer, these parameters are ignored:
- `api_key`
- `workspace`
- `index`
- `api_endpoint`
- `label_index`
In the Python SDK, all parameters are used.
These are the parameters you can specify for DeepsetCloudDocumentStore in the YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
`api_key` | String | | The secret value of the API key. This is the value that you copy in step 4 of Generate an API Key. If you don't specify it, it's read from the `DEEPSET_CLOUD_API_KEY` environment variable. Optional. |
`workspace` | String | Default: `default` | Specifies the deepset Cloud workspace you want to use. Required. |
`index` | String | Default: `None` | The name of the pipeline to access within the deepset Cloud workspace. In deepset Cloud, indexes share their names with their respective pipelines. Optional. |
`duplicate_documents` | String | `skip` - Ignores duplicate documents.<br>`overwrite` - Updates any existing documents with the same ID when adding documents.<br>`fail` - Raises an error if the ID of a document being added already exists.<br>Default: `overwrite` | Specifies how to handle duplicate documents. This setting only has an effect if you specify the fields used to identify duplicates in the PreProcessor's `id_hash_keys` parameter. For example, to identify duplicate documents by their content, set `id_hash_keys: content`. Note that we add contextual metadata, like `file_id`, to your documents during indexing. This is why setting `id_hash_keys: meta` doesn't work. Required. |
`api_endpoint` | String | Default: `None` | Specifies the URL of the deepset Cloud API: `https://api.cloud.deepset.ai/api/v1`. If you don't specify it, it's read from the `DEEPSET_CLOUD_API_ENDPOINT` environment variable. Optional. |
`similarity` | String | `dot_product` - Use if the embedding model was optimized for dot product similarity.<br>`cosine` - Use if the embedding model was optimized for cosine similarity.<br>Default: `dot_product` | Specifies the similarity function used to compare document vectors. Required. |
`label_index` | String | Default: `default` | Specifies the name of the evaluation set uploaded to deepset Cloud. In deepset Cloud, label indexes share their names with their corresponding evaluation sets. Required. |
`return_embedding` | Boolean | `True`/`False`<br>Default: `False` | Specifies whether to return document embeddings. Required. |
`embedding_dim` | Integer | Default: `768` | Specifies the dimensionality of the embedding vector. You only need this parameter if you're using a vector-based retriever, such as `DensePassageRetriever` or `EmbeddingRetriever`. Required. |
`use_prefiltering` | Boolean | `True`/`False`<br>Default: `False` | Specifies when filters are applied during search. This is only relevant if you use an `EmbeddingRetriever`. With `EmbeddingRetriever`, DeepsetCloudDocumentStore defaults to post-filtering when querying with filters, meaning the filters are applied after the documents are retrieved. You can switch to pre-filtering, where the filters are applied before retrieving the documents, at the cost of higher latency. For `BM25Retriever`, filters are always applied before the search. Required. |
`search_fields` | Union[str, list] | Default: `content` | The names of the fields `BM25Retriever` uses to match the incoming query against documents. For example: `["content", "title"]`. Required. |
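Pulling several of these parameters together, a DocumentStore definition in the YAML might look like the following sketch (the values are illustrative; choose the `similarity` and `embedding_dim` that match your embedding model, and a `search_fields` list that matches your document metadata):

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params:
      duplicate_documents: skip           # skip, overwrite, or fail
      similarity: cosine                  # match your embedding model
      embedding_dim: 768                  # dimensionality of the model's vectors
      return_embedding: false
      search_fields: ["content", "title"] # fields BM25Retriever searches
...
```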