Boosting Retrieval with OpenSearch Queries
deepset Cloud uses OpenSearch through DeepsetCloudDocumentStore to store documents. At query time, the Retriever connects to OpenSearch to fetch the documents that are most relevant to the query. You can pass a custom query to the Retriever to control how it fetches these documents.
Recommended Reading
Before you dig into this topic, it's good to understand how the Retriever works. Have a look at Retriever to learn more.
We also recommend you get a general understanding of the OpenSearch query syntax.
You can pass a custom OpenSearch query to BM25Retriever to specify how you want it to fetch Documents from DeepsetCloudDocumentStore. With a custom query, you can prioritize documents based on their characteristics, such as a metadata field value, the date a document was created, and so on. This page covers the most common cases when you may want to write custom OpenSearch queries.
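For example, here's a minimal sketch of a custom query that reproduces the default keyword search. deepset Cloud substitutes the $query and $filters placeholders at query time; the queries in the sections below all build on this pattern:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": $query
        }
      },
      "filter": $filters
    }
  }
}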
Prioritizing Most Recent Documents
You can use the Gaussian decay function to prioritize the most recent documents. This function penalizes small time deltas only marginally, so recent documents keep most of their score while much older ones are pushed down.
Example Query
This query matches documents based on the $query string against the content field and any submitted filters. It then decreases the relevance score of the documents as they get older. The query favors documents newer than 30 days.
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "gauss": {
        "_file_created_at": {
          "origin": "now",
          "offset": "30d",
          "scale": "730d"
        }
      }
    }
  }
}
The first part defines function_score as the query type. This query type changes the relevance score the Retriever assigns to documents using the function you specify in your custom query. Within the function_score, we further specify the query to match the $query value against the content field in your documents. If $query contains multiple words, the match query returns all documents that contain any of the words from $query. This means that if the query is "Joe Biden", the query returns documents containing "Joe", "Biden", or both. $filters allows us to use filters as usual.
Queries prioritizing the most recent documents often use the Gaussian decay function, which penalizes small time deltas only marginally. This is the case in this query as well: gauss refers to the Gaussian function the query applies to your documents based on the _file_created_at field. This means the score of each document decreases (or decays) the further the value in its _file_created_at field is from the origin value.
Parameters
The origin parameter defines the point from which the decay is calculated. Setting it to now means the decay is measured from the current time.
The offset parameter is the distance from the origin at which the decay starts. Up until the value defined in offset, the relevance score of your documents is not affected. This query sets the offset to 30 days, which means documents created within the last 30 days won't have their score affected by the Gaussian function.
The scale parameter defines the rate of decay. Here, it's set to 730 days (about 2 years), which means a document's score drops to half when the value in its _file_created_at field is 730 days beyond the offset.
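For reference, this is roughly how OpenSearch computes the Gaussian multiplier, based on the function_score decay documentation (the decay parameter defaults to 0.5):

\[
S(d) = \exp\left(-\frac{\max\bigl(0,\ |t_d - \mathrm{origin}| - \mathrm{offset}\bigr)^2}{2\sigma^2}\right),
\qquad
\sigma^2 = -\frac{\mathrm{scale}^2}{2\ln(\mathrm{decay})}
\]

Here, t_d is the document's _file_created_at value. With offset 30d, scale 730d, and the default decay of 0.5, a document created 760 days ago ends up with exactly half its text-matching score.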
Used in a Pipeline
When using an OpenSearch query in a YAML pipeline, it's recommended to pass it as a multi-line string using >:
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "gauss": {
                "_file_created_at": {
                  "origin": "now",
                  "offset": "30d",
                  "scale": "180d"
                }
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]
Other Options
If the Gaussian decay function doesn't match your use case, you can also use the following functions:
- exp for exponential decay
- linear for linear decay
For more information about functions and their parameters, see OpenSearch documentation.
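For example, to switch the query above from Gaussian to exponential decay, you would replace the gauss clause with an exp clause; the parameters keep their meaning (a sketch):
"exp": {
  "_file_created_at": {
    "origin": "now",
    "offset": "30d",
    "scale": "730d"
  }
}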
Prioritizing Documents Based on Metadata Field Values
You can favor documents with a specific value in a given metadata field. This can be a textual value or a numerical value.
Prioritizing Based on Textual Values
This method is useful if you want to prioritize documents whose metadata fields contain a particular text string. For example, you can give the highest priority to documents with the metadata field file_type="article" or file_type="paper", a slightly lower priority to documents with file_type="comment", and the lowest priority to documents with file_type="archive". You do this by assigning a weight to each metadata field value.
Example Query
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "functions": [
        {
          "filter": {
            "terms": {
              "file_type": ["article", "paper"]
            }
          },
          "weight": 2.0
        },
        {
          "filter": {
            "terms": {
              "file_type": ["comment"]
            }
          },
          "weight": 1.5
        },
        {
          "filter": {
            "terms": {
              "file_type": ["archive"]
            }
          },
          "weight": 0.5
        }
      ]
    }
  }
}
Like the query above, this one starts with function_score, which means we want to change the relevance scores of the documents. Within function_score, we tell the query to match the $query value against the content field in your documents. Filters are also applied as usual thanks to $filters.
The next part of the query lists the functions to apply. Each function filters for particular metadata field values and assigns a weight to them. The higher the weight, the higher the priority of documents with that metadata field value.
In this particular query, documents with the file_type="article" or file_type="paper" metadata fields have the highest priority, with their BM25 score doubled. Then, we tell the query to boost the relevance score of documents with the file_type="comment" metadata field by 1.5 times, which is less than articles and papers but still higher than other document types. And finally, we penalize documents with the file_type="archive" metadata field by decreasing their relevance score by half.
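To see the effect in numbers, here's a worked sketch assuming a BM25 score of 3.2 and the default boost_mode (multiply), where each document matches at most one filter:

\[
\begin{aligned}
\text{article or paper:} &\quad 3.2 \times 2.0 = 6.4 \\
\text{comment:} &\quad 3.2 \times 1.5 = 4.8 \\
\text{archive:} &\quad 3.2 \times 0.5 = 1.6
\end{aligned}
\]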
Parameters
- filter - The filter to apply. Use the terms filter for categorical data.
- weight - The value by which the BM25 score of the documents matching the filter criteria is multiplied.
Used in a Pipeline
Here's how to use this custom query within your pipeline. It's recommended to pass it as a multi-line string using >:
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "functions": [
                {
                  "filter": {
                    "terms": {
                      "file_type": ["article", "paper"]
                    }
                  },
                  "weight": 2.0
                },
                {
                  "filter": {
                    "terms": {
                      "file_type": ["comment"]
                    }
                  },
                  "weight": 1.5
                },
                {
                  "filter": {
                    "terms": {
                      "file_type": ["archive"]
                    }
                  },
                  "weight": 0.5
                }
              ]
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]
Prioritizing Based on Numerical Values
You can construct a query to prioritize documents with a metadata field containing numerical values. Say you collect popularity metrics for your documents, such as likes, and you want to favor documents that are the most popular.
Example Query
Let's look at a query that prioritizes documents with the most likes.
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": $query
            }
          },
          "filter": $filters
        }
      },
      "field_value_factor": {
        "field": "likes_last_month",
        "factor": 0.1,
        "modifier": "log1p",
        "missing": 0
      }
    }
  }
}
As previously, we first tell the query to score documents based on how well they match the provided $query in the content field. Filters are also applied as usual thanks to $filters.
It uses the field_value_factor function to adjust the scores of the documents based on the metadata field likes_last_month. factor multiplies the value of the field by 0.1 before the modifier function is applied.
You can think of the factor parameter like a spotlight. Setting it to a smaller value, like 0.1, focuses the spotlight on the most relevant documents. Together with the log1p modifier, it's like saying that the number of likes is most meaningful at values around 100.
This kind of tweaking comes in handy because different fields can have different ranges of values. For example, the likes_last_month field might have values ranging from 0 to several thousand. By setting the factor to 0.1, we make sure that when we reach the value of 100 likes, it's like we're multiplying the original text matching score by roughly 1.0. If the likes are under 100, the text matching score is decreased. If the likes are above 100, the text matching score is increased according to the modifier function.
modifier defines how we want to transform the value of the metadata field after it's multiplied by factor. Here the modifier is the log1p mathematical function, which adds 1 to the value of the likes_last_month field, so the logarithm is defined and never negative, and then takes the base-10 logarithm. This helps prevent documents with exceptionally high numbers of likes from dominating the rest of the documents and balances out the impact of likes on the overall score.
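To make this concrete, here's the multiplier this function produces for a few assumed like counts (the final score is the text-matching score times this multiplier, given the default boost_mode of multiply):

\[
\log_{10}(1 + 0.1 \times 0) = 0, \qquad
\log_{10}(1 + 0.1 \times 90) = 1.0, \qquad
\log_{10}(1 + 0.1 \times 990) = 2.0
\]

So at around 100 likes, the text-matching score passes through roughly unchanged, while ten times as many likes only doubles it.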
Finally, if a document doesn't have the likes_last_month metadata field, we want to assign the ranking score of 0 to it, hence "missing": 0.
Parameters
- field - The name of the metadata field whose value you want to use for calculating the relevance score of the document.
- factor - The number by which the value of the metadata field is multiplied before the modifier is applied.
- modifier - The function you want to use to modify the field value.
- missing - The value to use for a document if the field is missing from it.
For more details, see the OpenSearch Field Value Factor function documentation.
Used in a Pipeline
components:
- name: DocumentStore
type: DeepsetCloudDocumentStore
- name: Retriever
type: BM25Retriever
params:
document_store: DocumentStore
custom_query: >
{
"query": {
"function_score": {
"query": {
"bool": {
"must": {
"match": {
"content": $query
}
},
"filter": $filters
}
},
"field_value_factor": {
"field": "likes_last_month",
"factor": 0.1,
"modifier": "log1p",
"missing": 0
}
}
}
}
- name: TextConverter
type: TextConverter
- name: Preprocessor
type: PreProcessor
pipelines:
- name: query
nodes:
- name: Retriever
inputs: [Query]
- name: indexing
nodes:
- name: TextConverter
inputs: [File]
- name: Preprocessor
inputs: [TextConverter]
- name: DocumentStore
inputs: [Preprocessor]
Other Options
You can use other functions as the modifier. For a full list, see the OpenSearch documentation.
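For example, here's a sketch of the same function using the sqrt modifier instead of log1p; it dampens large counts less aggressively than a logarithm:
"field_value_factor": {
  "field": "likes_last_month",
  "factor": 0.1,
  "modifier": "sqrt",
  "missing": 0
}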
Enabling Fuzzy Matching
You can use OpenSearch queries to account for small spelling errors users may make when typing the query.
Example Query
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": $query,
          "fields": ["content"],
          "fuzziness": "AUTO",
          "operator": "or"
        }
      },
      "filter": $filters
    }
  }
}
You can experiment with the fuzziness parameter. AUTO sets it to a value that makes sense in a given context. You can also choose a small integer to set the specific number of corrections allowed. For example, to handle the query "waht is political correctness", you need fuzziness: 3, as three modifications are needed to bring the query back to its originally intended form. In our experiments, setting fuzziness: "AUTO" handled cases with three corrections well. Consider setting this parameter manually if you find that AUTO is too permissive or too strict for your use case.
Enabling fuzzy matching can potentially lower the pipeline's recall, so it's always important to run experiments before moving your pipeline to production. To avoid a drastic recall loss, make sure you set the operator to or. It allows returning matches even if not all words from the query are present. That's very useful because when asking questions, you use a lot of auxiliary words to formulate the query, but they don't need to be present in the document. For example, if you ask "How can I do X in Y app?", "How can I" are auxiliary words, and you only care about "X in Y app". With the and operator, a document is only returned if it contains all words from the query.
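For example, here's a sketch that caps corrections at one edit per term instead of letting AUTO decide:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": $query,
          "fields": ["content"],
          "fuzziness": 1,
          "operator": "or"
        }
      },
      "filter": $filters
    }
  }
}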
Used in a Pipeline
Pass it as a multi-line string in the custom_query parameter, using >:
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 5 # The number of results to return
      all_terms_must_match: true
      custom_query: >
        {
          "query": {
            "bool": {
              "must": {
                "multi_match": {
                  "query": $query,
                  "fields": ["content"],
                  "fuzziness": "AUTO",
                  "operator": "or"
                }
              },
              "filter": $filters
            }
          }
        }
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 5 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: Ranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
      top_k: 5
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: MarkdownConverter # Converts markdown files into documents
    type: MarkdownConverter
    params:
      add_frontmatter_to_meta: false
      extract_headlines: true
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 50 # The max number of words in a document
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Ranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, files are routed to the appropriate converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_3] # Ensures this converter gets markdown files
      - name: Preprocessor
        inputs: [MarkdownConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]