Document Retrieval Pipelines
Document retrieval pipelines, also called document search pipelines, return whole documents as results. They also serve as the first stage in other pipeline types, for example, retrieval-augmented generation (RAG) pipelines.
Semantic Document Retrieval Pipeline
This pipeline uses a vector-based retriever to fetch relevant documents based on their semantic similarity to the query.
version: '1.21.0'
name: 'SemanticDocumentSearch'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; use it to give your component a friendly name. You then refer to components by name when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to detect sentence boundaries for that language
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
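If you want to try a pipeline like this outside deepset Cloud, open-source Haystack 1.x can load the same YAML. Below is a minimal sketch; the file name semantic_search.yaml, the example files, and the query are placeholders, and DeepsetCloudDocumentStore needs your deepset Cloud credentials configured (for a purely local test, you can swap in another document store).

# A minimal sketch of running this pipeline with open-source Haystack 1.x,
# assuming the YAML above is saved as semantic_search.yaml (hypothetical name)
# and your document store credentials are available to the process.
from pathlib import Path

from haystack.pipelines import Pipeline

# Index a few files first...
indexing_pipeline = Pipeline.load_from_yaml(Path("semantic_search.yaml"), pipeline_name="indexing")
indexing_pipeline.run(file_paths=["manual.pdf", "notes.txt"])  # Placeholder files

# ...then query; the result contains whole documents, not extracted answers.
query_pipeline = Pipeline.load_from_yaml(Path("semantic_search.yaml"), pipeline_name="query")
result = query_pipeline.run(
    query="How do I reset my password?",
    params={"Retriever": {"top_k": 5}},  # Overrides the YAML's top_k at query time
)
for doc in result["documents"]:
    print(doc.score, doc.content[:100])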
Semantic Document Retrieval with a Ranker
This document retrieval pipeline searches for documents based on semantic similarity. It uses vector-based retrieval followed by re-ranking with a powerful cross-encoder model, so the resulting documents are ordered from most to least relevant.
version: '1.21.0'
name: 'English_Rerank_Doc_Search'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; use it to give your component a friendly name. You then refer to components by name when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      similarity: dot_product
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the retriever
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-12-v2 # Best performing model on the cross-encoders page: https://www.sbert.net/docs/pretrained-models/ce-msmarco.html
      top_k: 10 # The number of results to return
      batch_size: 20 # Try to keep this number equal to the retriever's top_k so all documents are processed at once
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      language: en
      split_by: word
      split_length: 250
      split_overlap: 10
      split_respect_sentence_boundary: true
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reranker
        inputs: [Retriever]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
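The Reranker works differently from the Retriever: instead of comparing precomputed embeddings, a cross-encoder reads the query and each candidate document together, which is slower but more accurate. Here's a rough sketch of that scoring step using the sentence-transformers library directly; the query and candidate texts are made-up examples.

# A minimal sketch of what the Reranker does under the hood, using the
# sentence-transformers library directly; query and candidates are made up.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
query = "how to renew a passport"
candidates = [
    "Passport renewal applications can be submitted online or by mail.",
    "Our office issues fishing licenses and boat permits.",
]
# The cross-encoder produces one relevance score per (query, document) pair.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in reranked:
    print(f"{score:.2f}  {doc}")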
Keyword Document Retrieval
This pipeline is a good starting point for document search. It returns documents based on keyword matches with your query, using the BM25 algorithm to rank them.
version: '1.21.0'
name: 'KeywordDocumentSearch'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; use it to give your component a friendly name. You then refer to components by name when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters; by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 500 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to detect sentence boundaries for that language
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
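BM25 rewards documents that contain rare query terms often, while damping very frequent terms and very long documents. The toy function below illustrates the formula for intuition only; the document store computes this for you.

# A toy sketch of the BM25 scoring formula the retriever relies on.
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # Average document length
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        df = sum(1 for d in corpus if term in d)  # Documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # Rare terms weigh more
        # Term-frequency saturation, normalized by document length
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["deepset", "cloud", "pipelines"], ["keyword", "search", "with", "bm25"]]
print(bm25_score(["bm25", "search"], corpus[1], corpus))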
Hybrid Document Retrieval
These pipelines combine the advantages of keyword-based and vector-based search. Such a combination usually yields the best results without any model training.
Hybrid Document Retrieval with a Ranker
This pipeline uses a Ranker to rank the documents according to how relevant they are to the query.
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# This is a document search pipeline that combines vector-based and keyword-based searches. Such a combination usually yields the best results without any training.
version: '1.21.0'
name: 'HybridDocumentSearch'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; use it to give your component a friendly name. You then refer to components by name when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-6-v2 # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 30 # Try to keep this number equal to or greater than the sum of the two retrievers' top_k values so all documents are processed at once
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters; useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to detect sentence boundaries for that language
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
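The concatenate join mode merges both retrievers' result lists into one, so a document found by both retrievers appears only once, roughly as sketched below (the exact deduplication and tie-breaking behavior lives in JoinDocuments). Note that raw BM25 and embedding scores are not comparable, which is why the cross-encoder Reranker follows to put the merged candidates in a meaningful order.

# A rough sketch of the concatenate join mode, keeping the highest-scoring
# copy of any document that both retrievers returned; an illustration only.
def join_concatenate(*result_lists):
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # The Reranker downstream re-orders these, so the merge order here
    # matters less than simply not losing any candidate.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

bm25_results = [("doc1", 12.3), ("doc2", 9.1)]   # Made-up BM25 scores
dense_results = [("doc2", 0.83), ("doc3", 0.79)]  # Made-up embedding scores
print(join_concatenate(bm25_results, dense_results))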
Hybrid Document Retrieval with Fuzzy Matching
This pipeline tolerates typos in the user's query. It does this by passing a custom OpenSearch query with fuzzy matching to the BM25Retriever.
version: '1.21.0'
name: 'hybrid-doc-search-reranker-fuzzymatchingbm25'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; use it to give your component a friendly name. You then refer to components by name when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 5 # The number of results to return
      all_terms_must_match: true
      custom_query: >
        {
          "query": {
            "multi_match": {
              "query": $query,
              "fields": ["content"],
              "fuzziness": "AUTO",
              "operator": "or"
            }
          },
          "highlight": {
            "fields": {
              "content": {}
            }
          }
        }
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 5 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: Ranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
      top_k: 5
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters; by default txt, pdf, md, docx, html
    type: FileTypeClassifier
    params:
      supported_types: [md] # This pipeline indexes Markdown files only, so route them all to output_1
  - name: MarkdownConverter # Converts Markdown files into documents
    type: MarkdownConverter
    params:
      add_frontmatter_to_meta: false
      extract_headlines: true
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      split_by: word # The unit by which you want to split the documents
      split_length: 50 # The maximum number of words in a document
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to detect sentence boundaries for that language
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Ranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives Markdown files
      - name: Preprocessor
        inputs: [MarkdownConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
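Unlike concatenate, the reciprocal_rank_fusion join mode ignores the incompatible raw scores and combines the two result lists by rank alone. A minimal sketch, using the k=60 constant from the original RRF paper (Haystack's internal constant may differ):

# A minimal sketch of reciprocal rank fusion (RRF): each document is scored
# by the sum of 1 / (k + rank) over all result lists, so only ranks matter,
# not the incompatible BM25 and embedding scores.
def reciprocal_rank_fusion(result_lists, k=60):
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_ranked = ["doc1", "doc2", "doc3"]
dense_ranked = ["doc2", "doc3", "doc4"]
# doc2 appears near the top of both lists, so it wins.
print(reciprocal_rank_fusion([bm25_ranked, dense_ranked]))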
Pipelines Prioritizing Documents Based on Their Metadata
Prioritizing the Newest Documents with RecentnessRanker
You can add RecentnessRanker to your query pipeline to prioritize documents based on criteria you specify. This pipeline uses RecentnessRanker to prioritize the newest documents, based on a metadata field that contains the date when each document was created or last updated.
version: '1.21.0'
name: 'search_by_recency'
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store so that the LLM can base its answer on them
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 1 # The number of documents to return
  - name: PromptNode # The component that generates the answer based on the documents it gets from the retriever
    type: PromptNode
    params:
      default_prompt_template: question-answering # A default prompt for question answering
      model_name_or_path: google/flan-t5-large # A free large language model for PromptNode. For production scenarios, we recommend a paid model.
      top_k: 3 # The number of answers to generate; you can change this value
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters; by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to detect sentence boundaries for that language
  - name: Ranker
    type: RecentnessRanker
    params:
      date_identifier: updated_at # The name of the document's metadata field containing the date
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Ranker
        inputs: [Retriever]
      - name: PromptNode
        inputs: [Ranker]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
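Conceptually, RecentnessRanker blends how relevant a document is with how new it is. The hypothetical helper below illustrates one way to do that, fusing the relevance rank with a newest-first rank; treat it as an illustration of the idea, not the node's exact algorithm, which also supports a weight and different ranking modes.

# A heavily hedged sketch of recency ranking: fuse each document's
# relevance rank with its recency rank (newest first). Illustration only.
from datetime import date

def rank_by_recency(docs, k=60):
    # docs: list of (doc_id, updated_at), already ordered by relevance
    by_recency = sorted(docs, key=lambda d: d[1], reverse=True)
    scores = {}
    for rank, (doc_id, _) in enumerate(docs, start=1):
        scores[doc_id] = 1.0 / (k + rank)
    for rank, (doc_id, _) in enumerate(by_recency, start=1):
        scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

docs = [
    ("most_relevant_but_old", date(2021, 5, 1)),
    ("relevant_and_recent", date(2024, 11, 30)),
    ("less_relevant_oldish", date(2022, 1, 15)),
]
print(rank_by_recency(docs))  # The relevant and recent document comes out on top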
Prioritizing the Newest Documents with an OpenSearch Query
You can pass a custom query to BM25Retriever to configure how it fetches documents from the document store. Here's an example of a query that makes the retriever prioritize the newest documents:
name: 'newest_docs'
version: '1.21.0'
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "gauss": {
                "_file_created_at": {
                  "origin": "now",
                  "offset": "30d",
                  "scale": "180d"
                }
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]
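The gauss clause multiplies each document's BM25 score by a Gaussian decay of its age: documents newer than the 30-day offset keep their full score, and the multiplier falls to the default decay of 0.5 at scale (180 days) past the offset. A small sketch of the math, following the OpenSearch/Elasticsearch function_score documentation:

# A sketch of the Gaussian decay the query above applies.
import math

def gauss_decay(age_days, offset=30, scale=180, decay=0.5):
    distance = max(0.0, age_days - offset)          # Days past the offset
    sigma_sq = -scale**2 / (2 * math.log(decay))    # Width derived from scale and decay
    return math.exp(-distance**2 / (2 * sigma_sq))

for age in [10, 30, 120, 210, 400]:
    print(f"{age:3d} days old -> score multiplied by {gauss_decay(age):.2f}")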
For an in-depth explanation, see Boosting Retrieval with OpenSearch Queries.
Prioritizing Documents Based on Textual Values
You can prioritize documents whose metadata fields contain a particular text string. This pipeline gives the highest priority to documents whose file_type metadata field is article or paper, a lower priority to documents whose file_type is comment, and the lowest priority to documents whose file_type is archive.
name: "text_metadata_boost"
version: '1.21.0'
components:
- name: DocumentStore
type: DeepsetCloudDocumentStore
- name: Retriever
type: BM25Retriever
params:
document_store: DocumentStore
custom_query: >
{
"query": {
"function_score": {
"query": {
"bool": {
"must": {
"match": {
"content": $query
}
},
"filter": $filters
}
},
"functions": [
{
"filter": {
"terms": {
"file_type": ["article", "paper"]
}
},
"weight": 2.0
},
{
"filter": {
"terms": {
"file_type": ["comment"]
}
},
"weight": 1.5
},
{
"filter": {
"terms": {
"file_type": ["archive"]
}
},
"weight": 0.5
}
]
}
}
}
- name: TextConverter
type: TextConverter
- name: Preprocessor
type: PreProcessor
pipelines:
- name: query
nodes:
- name: Retriever
inputs: [Query]
- name: indexing
nodes:
- name: TextConverter
inputs: [File]
- name: Preprocessor
inputs: [TextConverter]
- name: DocumentStore
inputs: [Preprocessor]
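With function_score, each matching filter multiplies the document's BM25 score by its weight, so the boost can reorder results that keyword relevance alone would rank differently. A toy illustration with made-up scores:

# A toy illustration of how function_score weights reorder results:
# each matching filter multiplies the document's BM25 score by its weight.
weights = {"article": 2.0, "paper": 2.0, "comment": 1.5, "archive": 0.5}

docs = [  # (file_type, raw BM25 score) -- made-up numbers
    ("archive", 6.0),
    ("article", 4.0),
    ("comment", 4.0),
]
boosted = sorted(
    ((bm25 * weights[ftype], ftype) for ftype, bm25 in docs), reverse=True
)
for score, ftype in boosted:
    print(f"{ftype:8s} final score {score:.1f}")  # The archive drops to the bottom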
Prioritizing Documents Based on Numerical Values Such As "Likes"
You can create a query that prioritizes documents based on a metadata field containing numerical values. Say you collect popularity metrics for your documents, such as likes, and you want to favor the most popular ones.
name: "numeric_metadata_boost"
version: '1.21.0'
components:
- name: DocumentStore
type: DeepsetCloudDocumentStore
- name: Retriever
type: BM25Retriever
params:
document_store: DocumentStore
custom_query: >
{
"query": {
"function_score": {
"query": {
"bool": {
"must": {
"match": {
"content": $query
}
},
"filter": $filters
}
},
"field_value_factor": {
"field": "likes_last_month",
"factor": 0.1,
"modifier": "log1p",
"missing": 0
}
}
}
}
- name: TextConverter
type: TextConverter
- name: Preprocessor
type: PreProcessor
pipelines:
- name: query
nodes:
- name: Retriever
inputs: [Query]
- name: indexing
nodes:
- name: TextConverter
inputs: [File]
- name: Preprocessor
inputs: [TextConverter]
- name: DocumentStore
inputs: [Preprocessor]
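Here, field_value_factor multiplies the BM25 score by log10(1 + 0.1 x likes_last_month): the log1p modifier takes the common logarithm, and missing: 0 treats documents without the field as having zero likes. A quick worked example:

# A worked example of the field_value_factor above.
import math

def popularity_multiplier(likes, factor=0.1):
    return math.log10(1 + factor * likes)

for likes in [0, 9, 90, 990]:
    print(f"{likes:4d} likes -> BM25 score x {popularity_multiplier(likes):.2f}")
# Note that 0 likes yields a multiplier of log10(1) = 0, which zeroes the
# score; raise "missing" or adjust the modifier if such documents should be kept.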