Boosting Retrieval with OpenSearch Queries

deepset Cloud uses OpenSearch through DeepsetCloudDocumentStore to store documents. At query time, the Retriever connects to OpenSearch to fetch the documents that are most relevant to the query. You can pass a custom query to the Retriever to fetch the documents based on this query.

Recommended Reading

Before you dig into this topic, it's good to understand how the Retriever works. Have a look at Retriever to learn more.

We also recommend you get a general understanding of the OpenSearch query syntax.

You can pass a custom OpenSearch query to BM25Retriever to specify how you want it to fetch Documents from DeepsetCloudDocumentStore. With a custom query, you can prioritize documents based on their characteristics, such as a metadata field value, the date a document was created, and so on. This page covers the most common cases when you may want to write custom OpenSearch queries.

Prioritizing Most Recent Documents

You can use the Gaussian decay function to prioritize the most recent documents. With this function, documents close to the origin date keep most of their score, and the score drops off as documents get older.

Example Query

This query matches documents based on the $query string against the content field and any submitted filters. It then decreases the relevance score of the documents as they get older. The query favors documents newer than 30 days.

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "gauss": {
        "_file_created_at": {
          "origin": "now",
          "offset": "30d",
          "scale": "730d"
        }
      }
    }
  }
}

The first part sets function_score as the query type. This query type changes the relevance score the Retriever assigns to Documents using the function you specify in your custom query. Within function_score, we further specify the query to match the $query value against the content field in your documents. If $query contains multiple words, the match query returns all documents that contain any of the words from $query. This means that if the query is "Joe Biden", the query returns documents containing "Joe", "Biden", or both. $filters allows us to use filters as usual.
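
Before putting a custom query into a pipeline, it can help to check that it becomes valid JSON once the placeholders are filled in. The following is a minimal sketch for local checks, not a description of how deepset Cloud renders queries internally. It assumes that $query is replaced with a JSON-encoded string and $filters with a JSON-encoded list of filter clauses. The render_custom_query helper and the sample filter clause are hypothetical.

```python
import json
from string import Template

# The example query, with $query and $filters left as placeholders.
CUSTOM_QUERY = """
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {"match": {"content": $query}},
          "filter": $filters
        }
      },
      "gauss": {
        "_file_created_at": {"origin": "now", "offset": "30d", "scale": "730d"}
      }
    }
  }
}
"""


def render_custom_query(template: str, query: str, filters: list) -> dict:
    """Fill in the placeholders and parse the result as JSON.

    Assumption: placeholders are replaced with JSON-encoded values.
    Hypothetical helper, meant only for checking a query locally.
    """
    rendered = Template(template).substitute(
        query=json.dumps(query),
        filters=json.dumps(filters),
    )
    # json.loads raises a ValueError if the rendered query is malformed.
    return json.loads(rendered)


body = render_custom_query(
    CUSTOM_QUERY,
    query="Joe Biden",
    filters=[{"term": {"file_type": "article"}}],  # hypothetical filter clause
)
print(json.dumps(body["query"]["function_score"]["query"], indent=2))
```

If the template has a missing brace or a stray comma, the call fails with a parsing error instead of returning a dictionary.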

Queries prioritizing the most recent documents often use the Gaussian decay function, and this query does too. gauss refers to the Gaussian function the query applies to your documents based on the _file_created_at field. This means the score of each document decreases (or decays) the further the value in its _file_created_at field is from the origin value.

Parameters

The origin parameter defines the reference point from which the distance of each document is measured. Setting it to now means the distance is measured from the current time.

The offset parameter is the distance from the origin at which the decay starts. Up until the value defined in offset, the relevance score of your documents is not affected. This query sets the offset to 30 days, which means documents created within the last 30 days won't have their score affected by the Gaussian function.

The scale parameter defines the rate of decay. Here, it's set to 730 days (about 2 years), which means the score of a document is decreased by half when the value in its _file_created_at field is 730 days beyond the offset, that is, 760 days away from the value set in origin. This assumes the default decay value of 0.5.
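
To get a feel for how origin, offset, and scale interact, you can compute the score multiplier yourself. This is a minimal sketch of the Gaussian decay formula as described in the OpenSearch documentation for decay functions. It assumes origin is now, so the distance is simply the age of the document in days, and it assumes the default decay value of 0.5. The gauss_decay function name is hypothetical.

```python
import math


def gauss_decay(age_days: float, offset_days: float = 30.0,
                scale_days: float = 730.0, decay: float = 0.5) -> float:
    """Score multiplier for a document that is `age_days` old (origin = now)."""
    # Only the distance beyond the offset counts toward the decay.
    distance = max(0.0, abs(age_days) - offset_days)
    # Pick sigma^2 so that the multiplier equals `decay` at distance == scale.
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


# Documents up to 30 days old keep their full score; older ones lose score.
for age in (0, 30, 395, 760, 1490):
    print(f"{age:>5} days old -> multiplier {gauss_decay(age):.3f}")
```

The relevance score of each document is multiplied by this value, so a multiplier of 1.0 leaves the score unchanged.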

Used in a Pipeline

When using an OpenSearch query in a YAML pipeline, it's recommended to pass it as a multi-line string using >:


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "gauss": {
                "_file_created_at": {
                  "origin": "now",
                  "offset": "30d",
                  "scale": "730d"
                }
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

Other Options

If the Gaussian decay function doesn't match your use case, you can also use the following functions:

  • exp for exponential decay.
  • linear for linear decay.

For more information about functions and their parameters, see the OpenSearch documentation.
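
To compare the alternatives, here is a minimal sketch of the exp and linear decay formulas as described in the OpenSearch documentation for decay functions. It reuses the values from the example query (offset of 30 days, scale of 730 days) and assumes origin set to now and the default decay value of 0.5. The function names are hypothetical.

```python
import math


def _distance(age_days: float, offset_days: float) -> float:
    # Only the distance beyond the offset counts toward the decay.
    return max(0.0, abs(age_days) - offset_days)


def exp_decay(age_days: float, offset_days: float = 30.0,
              scale_days: float = 730.0, decay: float = 0.5) -> float:
    """Exponential decay: drops quickly at first, then flattens out."""
    lam = math.log(decay) / scale_days
    return math.exp(lam * _distance(age_days, offset_days))


def linear_decay(age_days: float, offset_days: float = 30.0,
                 scale_days: float = 730.0, decay: float = 0.5) -> float:
    """Linear decay: drops at a constant rate until it reaches 0."""
    s = scale_days / (1.0 - decay)
    return max((s - _distance(age_days, offset_days)) / s, 0.0)


for age in (30, 395, 760, 1490):
    print(f"{age:>5} days: exp={exp_decay(age):.3f} linear={linear_decay(age):.3f}")
```

Both functions reach the decay value at the same distance as gauss does. They differ in what happens before and after that point. With linear, documents that are old enough end up with a multiplier of 0.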

Prioritizing Documents Based on Metadata Field Values

You can favor documents with a specific value in a given metadata field. This can be a textual value or a numerical value.

Prioritizing Based on Textual Values

This method is useful if you want to prioritize documents with metadata fields containing a text string. For example, you can give the highest priority to documents with metadata field file_type="article", slightly lower priority to documents with metadata field file_type="paper", and the lowest priority to documents with metadata field file_type="comment". You do this by assigning a weight to each metadata field value.

Note that filters won't work with this query.

Example Query

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "functions": [
        {
          "filter": { 
            "terms": { 
              "file_type": ["article", "paper"]
            }
          },
          "weight": 2.0
        },
        {
          "filter": { 
            "terms": { 
              "file_type": ["comment"]
            }
          },
          "weight": 1.5
        },
        {
          "filter": { 
            "terms": { 
              "file_type": ["archive"]
            }
          },
          "weight": 0.5
        }
      ]
    }
  }
}

As in the query above, it starts with the function_score, which means we want to change the relevance scores of the documents. Within function_score, we tell the query to match the $query value against the content field in your documents. Filters will also be applied as usual thanks to $filters.

The next part of the query contains the functions to be used. Within the functions, the query is defined to filter for particular metadata field values and assign weight to them. The higher the weight, the higher the priority of the metadata field value.

In this particular query, documents with file_type="article" or file_type="paper" metadata fields have the highest priority with their BM25 score doubled. Then, we tell the query to boost the relevance score of documents with the file_type="comment" metadata field by 1.5 times, which is less than articles and papers but still higher than other document types. And finally, we penalize documents with file_type="archive" metadata field by decreasing their relevance score by half.
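
The effect of the weights can be sketched in a few lines. This is a simplified illustration, not what OpenSearch runs internally. It assumes the function result is multiplied with the BM25 score and that documents matching none of the filters keep their score unchanged. The boosted_score helper, the "report" file type, and the sample scores are hypothetical.

```python
# Weights from the example query, keyed by the file_type value.
FILE_TYPE_WEIGHTS = {"article": 2.0, "paper": 2.0, "comment": 1.5, "archive": 0.5}


def boosted_score(bm25_score: float, file_type: str) -> float:
    """Multiply the BM25 score by the weight of the document's file_type."""
    # Assumption: file types without a matching function keep their score.
    return bm25_score * FILE_TYPE_WEIGHTS.get(file_type, 1.0)


# Hypothetical hits that all received the same BM25 score.
hits = [("archive", 4.0), ("report", 4.0), ("comment", 4.0), ("article", 4.0)]
ranked = sorted(hits, key=lambda hit: boosted_score(hit[1], hit[0]), reverse=True)

for file_type, score in ranked:
    print(f"{file_type:<8} {boosted_score(score, file_type):.1f}")
```

With equal BM25 scores, the weights alone decide the order: articles first, then comments, then unweighted file types, and archived documents last.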

Parameters

  • filter - The filter to apply. Use the terms filter for categorical data.
  • weight - The value by which the BM25 score of the documents matching the filter criteria is multiplied.

Used in a Pipeline

Here's how to use this custom query within your pipeline. It's recommended to pass it as a multi-line string using >:


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "functions": [
                {
                  "filter": { 
                    "terms": { 
                      "file_type": ["article", "paper"]
                    }
                  },
                  "weight": 2.0
                },
                {
                  "filter": { 
                    "terms": { 
                      "file_type": ["comment"]
                    }
                  },
                  "weight": 1.5
                },
                {
                  "filter": { 
                    "terms": { 
                      "file_type": ["archive"]
                    }
                  },
                  "weight": 0.5
                }
              ]
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

Prioritizing Based on Numerical Values

You can construct a query to prioritize documents with a metadata field containing numerical values. Say you collect popularity metrics for your documents, such as likes, and you want to favor documents that are the most popular.

Example Query

Let's look at a query that prioritizes documents with the most likes. Note that filters won't work with this query.

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "field_value_factor": {
        "field": "likes_last_month",
        "factor": 0.1,
        "modifier": "log1p",
        "missing": 0
      }
    }
  }
}

As previously, first we tell the query to score documents based on how well they match the provided $query in the content field. Filters will also be applied as usual thanks to $filters.

It uses the field_value_factor function to adjust the scores of the documents based on the metadata field likes_last_month. factor multiplies the value of the field by 0.1 before the modifier function is applied.

You can think of the factor parameter like a spotlight. Setting it to a smaller value, like 0.1, focuses the spotlight on the most relevant documents. Together with the log1p modifier, it's like saying that the number of likes is most meaningful at the values around 100.

This kind of tweaking comes in handy because different fields can have different ranges of values. For example, the likes_last_month field might have values ranging from 0 to several thousand. By setting the factor to 0.1, we make sure that when we reach the value of 100 likes, it's like we're multiplying the original text matching score by roughly 1.0.

If the likes are under 100, the text matching score is decreased. If the likes are above 100, the text matching score is increased according to the modifier function.

modifier defines how we want to modify the value of the metadata field after it's multiplied by factor. Here, the modifier is the log1p mathematical function, which adds 1 to the multiplied value and then takes the base-10 logarithm of the result. Adding 1 keeps the result from being negative or undefined for values between 0 and 1. The logarithm prevents documents with exceptionally high numbers of likes from dominating the rest, which balances out the impact of likes on the overall score of the document.

Finally, if a document doesn't have the likes_last_month metadata field, we want to treat it as having 0 likes, hence "missing": 0. With the log1p modifier, this gives such documents a ranking score of 0.
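
The arithmetic behind factor, modifier, and missing can be checked with a short sketch. It assumes the result of the function is multiplied with the text matching score, as described above. The likes_multiplier helper is hypothetical.

```python
import math
from typing import Optional


def likes_multiplier(likes: Optional[float], factor: float = 0.1,
                     missing: float = 0.0) -> float:
    """Multiplier applied to the text matching score for a given number of likes."""
    # Documents without the field use the `missing` value instead.
    value = missing if likes is None else likes
    # log1p modifier: add 1 to the scaled value, then take the base-10 logarithm.
    return math.log10(1.0 + factor * value)


for likes in (None, 0, 10, 100, 1000, 5000):
    print(f"likes={likes!s:>5} -> multiplier {likes_multiplier(likes):.3f}")
```

Running it shows the multiplier crossing 1.0 at around 100 likes and growing slowly beyond that, so a document with thousands of likes gets a moderate boost, not an overwhelming one.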

Parameters

  • field - the name of the metadata field whose value you want to use for calculating the relevance score of the document.
  • factor - the number by which the value of the metadata field will be multiplied before the modifier is applied to the value.
  • modifier - the function you want to use to modify the field value.
  • missing - the value you want to use for the document if the field is missing from it.

For more details, see the OpenSearch Field Value Factor function documentation.

Used in a Pipeline


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "field_value_factor": {
                "field": "likes_last_month",
                "factor": 0.1,
                "modifier": "log1p",
                "missing": 0
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

Other Options

You can use other functions as the modifier. For a full list, see the OpenSearch documentation.

Enabling Fuzzy Matching

You can use OpenSearch queries to account for small spelling errors users may make when typing the query.

Example Query

{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": $query,  
          "fields": ["content"], 
          "fuzziness": "AUTO", 
          "operator": "or"
        }
      },
      "filter": $filters
    }
  }
}

You can experiment with the fuzziness parameter. AUTO sets the number of allowed corrections based on the length of each term. You can also choose a small integer to set a specific number of corrections allowed. For example, to handle the query "waht is political correctness", you need fuzziness: 3, as it takes three modifications to restore the originally intended meaning. In our experiments, setting fuzziness: "AUTO" handled cases with three corrections well. Consider setting this parameter manually if you find that AUTO is too permissive or too strict for your use case.
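
To see what AUTO allows for a given term, you can reproduce its default length thresholds as described in the OpenSearch documentation: terms of up to 2 characters must match exactly, terms of 3 to 5 characters may differ by one edit, and longer terms by two edits. This is a minimal sketch of that rule only, and the auto_fuzziness function name is hypothetical.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum number of edits AUTO allows for a term (default thresholds)."""
    if len(term) <= 2:
        return 0  # very short terms must match exactly
    if len(term) <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


for term in "is app political correctness".split():
    print(f"{term!r}: up to {auto_fuzziness(term)} edit(s)")
```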

Enabling fuzzy matching can potentially lower the pipeline's recall, so it's always important to run experiments before moving your pipeline to production. To avoid drastic recall loss, make sure you set the operator to OR. It allows returning matches even if not all words from the query are present. That's very useful, as when asking questions, you use a lot of auxiliary words to formulate the query, but they don't need to be present in the document.
For example, if you ask "How can I do X in Y app?", "How can I" are auxiliary words, and you only care about "X in Y app". With the AND operator, the document will only be returned if it has all words from the query.
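
The difference between the two operators can be illustrated with plain word sets. This is a deliberately simplified sketch that ignores analyzers, scoring, and fuzziness. The tokenize and matches helpers and the sample document are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def matches(query: str, document: str, operator: str = "or") -> bool:
    """Check whether a document matches under the given operator."""
    query_terms = tokenize(query)
    doc_terms = tokenize(document)
    if operator == "and":
        # Every query word must appear in the document.
        return query_terms <= doc_terms
    # At least one query word must appear in the document.
    return bool(query_terms & doc_terms)


query = "How can I do X in Y app?"
document = "To do X in the Y app, open the settings page."

print("or: ", matches(query, document, operator="or"))
print("and:", matches(query, document, operator="and"))
```

The document answers the question, but it doesn't contain "how", "can", or "I", so it only matches with the OR operator.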

Used in a Pipeline

Pass it as a multi-line string in the custom_query parameter, using >:


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store in deepset Cloud
  - name: BM25Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 5 # The number of results to return
      all_terms_must_match: true
      custom_query: >
        {
          "query": {
            "bool": {
              "must": {
                "multi_match": {
                  "query": $query,  
                  "fields": ["content"], 
                  "fuzziness": "AUTO", 
                  "operator": "or"
                }
              },
              "filter": $filters
            }
          }
        }
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 5 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: Ranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
      top_k: 5
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: MarkdownConverter # Converts Markdown files into documents
    type: MarkdownConverter
    params:
      add_frontmatter_to_meta: false 
      extract_headlines: true
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 50 # The max number of words in a document
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines: 
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Ranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: Preprocessor
        inputs: [MarkdownConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]