Boosting Retrieval with OpenSearch Queries

deepset Cloud uses OpenSearch through OpenSearchDocumentStore to store documents. At query time, the Retriever connects to OpenSearch to fetch the documents that are most relevant to the query. You can pass a custom query to the Retriever to fetch the documents based on this query.

💡

Recommended Reading

Before you dig into this topic, it's good to understand how the Retriever works. Have a look at Retrievers to learn more.

We also recommend you get a general understanding of the OpenSearch query syntax.

ℹ️

OpenSearch Document Store

This guide is for users of the OpenSearch document store, the core document store deepset Cloud uses. If you're using a different document store, like Qdrant, Pinecone, or Weaviate, you won't be able to use the techniques described here.

You can pass a custom OpenSearch query to OpenSearchBM25Retriever and OpenSearchEmbeddingRetriever to specify how you want it to fetch Documents from OpenSearchDocumentStore. With a custom query, you can prioritize documents based on their characteristics, such as a metadata field value, the date a document was created, and so on. This page covers the most common cases when you may want to write custom OpenSearch queries.

Prioritizing the Most Recent Documents

You can use the Gaussian decay function to prioritize the most recent documents. This function lowers a document's relevance score gradually as the document gets older.

Example Query

This query matches documents based on the $query string against the content field and any submitted filters. It then decreases the relevance score of the documents as they get older. The query favors documents newer than 30 days.

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "gauss": {
        "file_created_at": {
          "origin": "now",
          "offset": "30d",
          "scale": "730d"
        }
      }
    }
  }
}

The first part defines function_score as query. This query type changes the relevance score the Retriever assigns to Documents using the function you specify in your custom query. Within the function_score, we further specify the query to match the $query value against the content field in your documents. If $query contains multiple words, the match query returns all documents that contain any of the words from the $query. This means that if the query is "Joe Biden", the query returns documents containing "Joe", "Biden", or both. $filters allows us to use filters as usual.

Queries that prioritize the most recent documents often use the Gaussian decay function, and this query does too. gauss is the Gaussian function the query applies to your documents based on the file_created_at field. The score of each document decreases (or decays) the further the value in its file_created_at field is from the origin value.

Parameters

The origin parameter defines the reference point from which the distance is calculated. Setting it to now means the distance is the document's age relative to the current time.

The offset parameter is the distance from the origin at which the decay starts. Up until the value defined in offset, the relevance score of your documents is not affected. This query sets the offset to 30 days, which means documents created within the last 30 days won't have their score affected by the Gaussian function.

The scale parameter defines the rate of decay. It's the distance beyond the offset at which the decay function returns the value of the decay parameter, which is 0.5 by default. Here, it's set to 730 days (about 2 years), which means the score of a document is halved when the value in its file_created_at field is 760 days (offset plus scale) away from the origin.
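To make the interplay of offset and scale concrete, here is a minimal Python sketch of the multiplier a gauss function produces. It assumes origin is now (so the distance is simply the document's age in days) and the default decay value of 0.5. The function name gauss_decay and its arguments are illustrative and not part of any OpenSearch or Haystack API.

```python
import math


def gauss_decay(age_days: float, offset_days: float = 30.0,
                scale_days: float = 730.0, decay: float = 0.5) -> float:
    """Multiplier applied to the relevance score of a document of a given age."""
    # Documents within the offset are not penalized at all.
    distance = max(0.0, abs(age_days) - offset_days)
    # sigma^2 is chosen so that the multiplier equals `decay` at distance == scale.
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


for age in (0, 30, 395, 760, 1490):
    print(f"{age:>5} days old -> multiplier {gauss_decay(age):.3f}")
```

With the values from the example query, a document up to 30 days old keeps its full score, and a document that is 760 days old gets half of it.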

Used in a Pipeline

When using an OpenSearch query in a YAML pipeline, pass it in the retriever's custom_query parameter and then connect the retriever to other components, like you normally would:

components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - "${OPENSEARCH_HOST}"
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20
      custom_query:
          {
            "query": {
              "function_score": {
                "query": {
                  "bool": {
                    "must": {
                      "match": {
                        "content": $query
                      }
                    },
                    "filter": $filters
                  }
                },
                "gauss": {
                  "file_created_at": {
                    "origin": "now",
                    "offset": "30d",
                    "scale": "730d"
                  }
                }
              }
            }
          }
          
          ...

Other Options

If the Gaussian decay function doesn't match your use case, you can also use the following functions:

  • exp for exponential decay.
  • linear for linear decay.

For more information about functions and their parameters, see OpenSearch documentation.
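As a rough comparison, the sketch below shows how the exp and linear functions behave for the same scale. It assumes the default decay value of 0.5 and takes the distance already reduced by the offset. The helper names exp_decay and linear_decay are illustrative only.

```python
import math


def exp_decay(distance: float, scale: float, decay: float = 0.5) -> float:
    """Exponential decay: the multiplier shrinks by the same ratio every `scale` units."""
    lam = math.log(decay) / scale
    return math.exp(lam * distance)


def linear_decay(distance: float, scale: float, decay: float = 0.5) -> float:
    """Linear decay: the multiplier falls on a straight line and bottoms out at 0."""
    s = scale / (1.0 - decay)
    return max((s - distance) / s, 0.0)


# `distance` is the document's distance from origin beyond the offset, in days.
for distance in (0, 365, 730, 1460, 2000):
    print(distance, round(exp_decay(distance, 730), 3), round(linear_decay(distance, 730), 3))
```

Both functions return 0.5 at a distance equal to scale. Beyond that, linear reaches 0 at twice the scale, while exp keeps shrinking but never reaches 0.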

Prioritizing Documents Based on Metadata Field Values

You can favor documents with a specific value in a given metadata field. This can be a textual value or a numerical value.

Prioritizing Based on Textual Values

This method is useful if you want to prioritize documents with metadata fields containing a text string. For example, you can give the highest priority to documents with the metadata field file_type="article", slightly lower priority to documents with file_type="paper", and the lowest priority to documents with file_type="comment". You do this by assigning a weight to each metadata field value.

The query below includes the $filters placeholder, so any filters you pass at query time still apply.

Example Query

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "functions": [
        {
          "filter": { 
            "terms": { 
              "file_type": ["article", "paper"]
            }
          },
          "weight": 2.0
        },
        {
          "filter": { 
            "terms": { 
              "file_type": ["comment"]
            }
          },
          "weight": 1.5
        },
        {
          "filter": { 
            "terms": { 
              "file_type": ["archive"]
            }
          },
          "weight": 0.5
        }
      ]
    }
  }
}

As in the query above, it starts with the function_score, which means we want to change the relevance scores of the documents. Within function_score, we tell the query to match the $query value against the content field in your documents. Filters will also be applied as usual thanks to $filters.

The next part of the query contains the functions to be used. Within the functions, the query is defined to filter for particular metadata field values and assign weight to them. The higher the weight, the higher the priority of the metadata field value.

In this particular query, documents with file_type="article" or file_type="paper" metadata fields have the highest priority with their BM25 score doubled. Then, we tell the query to boost the relevance score of documents with the file_type="comment" metadata field by 1.5 times, which is less than articles and papers but still higher than other document types. And finally, we penalize documents with file_type="archive" metadata field by decreasing their relevance score by half.

Parameters

  • filter - The filter to apply. Use the terms filter for categorical data.
  • weight - The value by which the BM25 score of the documents matching the filter criteria is multiplied.
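The effect of the weights can be sketched in a few lines of Python. This is an illustration only: it assumes the default score_mode and boost_mode (both multiply) and that each document has a single file_type value. The names WEIGHTS and boosted_score are made up for this example.

```python
# Weights taken from the example query above.
WEIGHTS = {"article": 2.0, "paper": 2.0, "comment": 1.5, "archive": 0.5}


def boosted_score(bm25_score: float, file_type: str) -> float:
    """Relevance score after the matching weight function is applied."""
    # A document that matches none of the filters keeps its score (multiplier 1.0).
    return bm25_score * WEIGHTS.get(file_type, 1.0)


for file_type in ("article", "comment", "report", "archive"):
    print(file_type, boosted_score(4.0, file_type))
```

A document type that isn't listed in any filter, like the hypothetical "report" above, is neither boosted nor penalized.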

Used in a Pipeline

To use the custom query in a pipeline, pass the query in the custom_query parameter of the retriever and then connect the retriever to other components, like you normally would:


components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - "${OPENSEARCH_HOST}"
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20
      custom_query:
            {
              "query": {
                "function_score": {
                  "query": {
                    "bool": {
                      "must": {
                        "match": {
                          "content": $query
                        }
                      },
                      "filter": $filters
                    }
                  },
                  "functions": [
                    {
                      "filter": { 
                        "terms": { 
                          "file_type": ["article", "paper"]
                        }
                      },
                      "weight": 2.0
                    },
                    {
                      "filter": { 
                        "terms": { 
                          "file_type": ["comment"]
                        }
                      },
                      "weight": 1.5
                    },
                    {
                      "filter": { 
                        "terms": { 
                          "file_type": ["archive"]
                        }
                      },
                      "weight": 0.5
                    }
                  ]
                }
              }
            }

Prioritizing Based on Numerical Values

You can construct a query to prioritize documents with a metadata field containing numerical values. Say you collect popularity metrics for your documents, such as likes, and you want to favor documents that are the most popular.

Example Query

Let's look at a query that prioritizes documents with the most likes.

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "field_value_factor": {
        "field": "likes_last_month",
        "factor": 0.1,
        "modifier": "log1p",
        "missing": 0
      }
    }
  }
}

As previously, first we tell the query to score documents based on how well they match the provided $query in the content field. Filters will also be applied as usual thanks to $filters.

It uses the field_value_factor function to adjust the scores of the documents based on the metadata field likes_last_month. factor multiplies the value of the field by 0.1 before the modifier function is applied.

The factor parameter controls how quickly likes translate into a boost. With factor set to 0.1 and the log1p modifier, the multiplier reaches exactly 1.0 at 90 likes and is about 1.04 at 100 likes, so roughly 100 likes is the point where the boost is neutral.

This kind of tuning comes in handy because different fields can have different ranges of values. For example, the likes_last_month field might have values ranging from 0 to several thousand. By setting the factor to 0.1, we make sure that at around 100 likes the original text matching score is multiplied by roughly 1.0.

If a document has fewer likes than that, its text matching score is decreased. If it has more, the text matching score is increased according to the modifier function.

modifier defines how the value of the metadata field is transformed after it's multiplied by factor. Here, the modifier is log1p, which adds 1 to the multiplied value and then takes the base-10 logarithm. Adding 1 keeps the result from being negative or undefined for values between 0 and 1. The logarithm dampens large values, which prevents documents with exceptionally high numbers of likes from dominating the results.

Finally, "missing": 0 means that a document without the likes_last_month metadata field is treated as if it had 0 likes. With log1p, this results in a multiplier of 0, so such a document gets a relevance score of 0.
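The arithmetic described above can be checked with a short Python sketch. It assumes the default boost_mode (multiply), so the returned value is the number the text matching score is multiplied by. The function name likes_multiplier is illustrative and not part of any API.

```python
import math
from typing import Optional


def likes_multiplier(likes: Optional[float], factor: float = 0.1,
                     missing: float = 0.0) -> float:
    """Multiplier produced by field_value_factor with the log1p modifier."""
    # Documents without the field fall back to the `missing` value.
    value = missing if likes is None else likes
    # log1p here means: add 1, then take the base-10 logarithm.
    return math.log10(1.0 + factor * value)


for likes in (None, 0, 9, 90, 100, 1000):
    print(likes, round(likes_multiplier(likes), 3))
```

With these settings, a document with no likes or without the field gets a multiplier of 0, and the multiplier only exceeds 1.0 once a document has more than 90 likes.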

Parameters

  • field - the name of the metadata field whose value you want to use for calculating the relevance score of the document.
  • factor - the number by which the value of the metadata field will be multiplied before the modifier is applied to the value.
  • modifier - the function you want to use to modify the field value.
  • missing - the value you want to use for the document if the field is missing from it.

For more details, see the OpenSearch Field Value Factor function documentation.

Used in a Pipeline


components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - "${OPENSEARCH_HOST}"
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20
      custom_query:
            {
              "query": {
                "function_score": {
                  "query": {
                    "bool": {
                      "must": {
                        "match": {
                          "content": $query
                        }
                      },
                      "filter": $filters
                    }
                  },
                  "field_value_factor": {
                    "field": "likes_last_month",
                    "factor": 0.1,
                    "modifier": "log1p",
                    "missing": 0
                  }
                }
              }
            }
 

Other Options

You can use other functions as the modifier. For a full list, see OpenSearch documentation.

Enabling Fuzzy Matching

You can use OpenSearch queries to account for small spelling errors users may make when typing the query.

Example Query

{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": $query,  
          "fields": ["content"], 
          "fuzziness": "AUTO", 
          "operator": "or"
        }
      },
      "filter": $filters
    }
  }
}

You can experiment with the fuzziness parameter. It sets the maximum number of single-character edits allowed per query term. AUTO picks this number based on the length of each term. You can also set it to a small integer (0, 1, or 2, as OpenSearch allows at most two edits per term) to fix the number of corrections allowed. For example, in the query "waht is political correctness", the term "waht" is one transposition away from "what", so a single edit is enough to match it. In our experiments, setting fuzziness: "AUTO" handled queries that needed three corrections well. Consider setting this parameter manually if you find that AUTO is too permissive or too strict for your use case.
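To get a feel for what AUTO allows, the sketch below mimics its default length thresholds (exact match for terms of 1 to 2 characters, one edit for 3 to 5 characters, two edits for longer terms) and computes an edit distance in which swapping two adjacent characters counts as a single edit. It's a simplified illustration, not the OpenSearch implementation, and the function names are made up for this example.

```python
def auto_fuzziness(term: str) -> int:
    """Number of edits allowed for a term under the default AUTO thresholds."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


def edit_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent characters costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]


typo, intended = "waht", "what"
print(edit_distance(typo, intended) <= auto_fuzziness(typo))
```

Here, "waht" is within the one edit that AUTO allows for a four-character term, so it can still match documents containing "what".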

Enabling fuzzy matching can potentially lower the pipeline's recall, so it's important to run experiments before moving your pipeline to production. To avoid drastic recall loss, make sure you set the operator to or. With this setting, documents are returned even if they don't contain all the words from the query. That's useful because questions usually contain auxiliary words that don't need to be present in the document. For example, if you ask, "How can I do X in Y app?", "How can I" are auxiliary words, and you only care about "X in Y app." With the and operator, a document is returned only if it contains all the words from the query.
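The difference between the two operators can be illustrated with a deliberately simplified Python sketch. It splits text on whitespace and ignores analyzers, scoring, and fuzziness, so it only shows which documents would match at all. The function name matches and the sample texts are hypothetical.

```python
def matches(query: str, document: str, operator: str = "or") -> bool:
    """Return True if the document matches the query under the given operator."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if operator == "and":
        # Every query term must appear in the document.
        return query_terms <= doc_terms
    # With "or", a single shared term is enough.
    return bool(query_terms & doc_terms)


document = "export reports in the billing app"
query = "how can i export reports"
print(matches(query, document, "or"))
print(matches(query, document, "and"))
```

The document lacks the auxiliary words "how", "can", and "i", so it matches only with the or operator.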

Used in a Pipeline

Pass it as a multi-line string in the custom_query parameter of the retriever:


components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - "${OPENSEARCH_HOST}"
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20
      custom_query:
          {
            "query": {
              "bool": {
                "must": {
                  "multi_match": {
                    "query": $query,  
                    "fields": ["content"], 
                    "fuzziness": "AUTO", 
                    "operator": "or"
                  }
                },
                "filter": $filters
              }
            }
          }