Use Case: A Document Retrieval System
This example explains how to set up a system that searches for documents. It describes the benefits of this type of system, its users, and the data and pipelines you'll need.
Description
A document retrieval system, also called document search, is a system that returns whole documents relevant to the query.
A document retrieval system is best if:
- Answers can't be short spans of text but need to be longer, more complex passages.
- You need a fast system. A document retrieval system doesn't use a reader, which speeds it up significantly.
- You need a system that can handle millions of requests and has very low latency.
- You need a system that can handle natural language questions.
- Word-based approaches, such as Elasticsearch, are not enough for your use case.
Compared to a question answering system, document retrieval is faster and cheaper, and it can even run on a CPU in production. The available document retrieval models are also already very powerful, so domain adaptation is easier than it is for question answering.
An example of document search
Here's what a document retrieval system looks like:

[Screenshot: Answers returned by a document search system]
Data
You can use any text data. For a fast prototype, restrict your data to one domain. You can divide your data into the underlying text data to search through and an annotated set for evaluating your pipelines.
Users
- Data scientists: Design the system, create the pipelines, and supervise domain experts.
- Domain experts: Use the system and provide their feedback in the deepset Cloud interface.
Pipelines
A pipeline file in deepset Cloud contains both an indexing pipeline and a query pipeline. Each of the YAML examples below defines both.
A pipeline using semantic search
This pipeline uses EmbeddingRetriever, which is an embedding-based (dense) retriever.
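To get an intuition for what the retriever does, here's a minimal, illustrative sketch of dense retrieval using the sentence-transformers library directly. This is not the EmbeddingRetriever implementation, just the core idea: encode the query and the documents with the same model, then rank the documents by similarity. The example documents are made up.

```python
# Illustrative sketch of dense retrieval, not the actual EmbeddingRetriever code.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# The same model the pipeline below uses; it's tuned for dot-product similarity.
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "The warranty covers manufacturing defects for two years.",
    "Shipping usually takes three to five business days.",
]

# At indexing time: embed every document once and store the vectors.
doc_embeddings = model.encode(documents)

# At query time: embed the query and score it against all document vectors.
query_embedding = model.encode("How long do I have to send an item back?")
scores = util.dot_score(query_embedding, doc_embeddings)[0]

# Return the top_k documents, highest score first (top_k=2 here for brevity).
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True)[:2]:
    print(f"{score:.2f}  {doc}")
```

Note that the query shares almost no keywords with the best-matching document; that's exactly the case where semantic search beats keyword matching. The full pipeline configuration looks like this: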
```yaml
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# This is a document search pipeline that searches for documents based on semantic similarity. It uses a vector-based search.
version: 1.14.0
name: 'SemanticDocumentSearch'

# This section defines the nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in the split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or a PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
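The split_length and split_overlap settings above implement a sliding window: each new document starts 30 words before the previous one ends, so information near a split boundary is never lost. Here's a simplified sketch of that behavior; the actual PreProcessor additionally cleans the text and respects sentence boundaries.

```python
# Simplified sketch of sliding-window splitting. The real PreProcessor also
# cleans text and keeps sentences intact (split_respect_sentence_boundary).
def split_with_overlap(text: str, split_length: int = 250, split_overlap: int = 30):
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):  # the last window reached the end
            break
    return chunks

# With 250-word chunks and a 30-word overlap, a 600-word file yields
# three chunks starting at words 0, 220, and 440.
print(len(split_with_overlap("word " * 600)))  # 3
```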
A pipeline that combines vector-based and keyword-based searches
Combining an embedding-based and a keyword-based retriever gives your search the advantages of both. In this type of pipeline, you need a node that joins the documents each retriever returns. Here, that's JoinDocuments with reciprocal_rank_fusion as the join mode.
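Reciprocal rank fusion combines the two result lists by rank rather than by raw score, since BM25 and embedding scores aren't directly comparable: each document receives the sum of 1/(k + rank) over all lists it appears in. Here's a minimal, illustrative sketch; the constant k and other details may differ from Haystack's JoinDocuments implementation.

```python
# Minimal sketch of reciprocal rank fusion (RRF). The constant k and other
# details may differ from Haystack's JoinDocuments implementation.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # raw scores are ignored, only rank counts
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]       # keyword-based results
embedding_ranking = ["doc_2", "doc_5", "doc_7"]  # vector-based results

# doc_2 and doc_7 rank high in both lists, so they rise to the top:
# ['doc_2', 'doc_7', 'doc_5', 'doc_9']
print(reciprocal_rank_fusion([bm25_ranking, embedding_ranking]))
```

Here's the full pipeline: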
```yaml
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# This is a document search pipeline that combines vector-based and keyword-based searches. This combination usually yields the best results without any training.
version: 1.14.0
name: 'HybridDocumentSearch'

# This section defines the nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in the split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or a PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
A pipeline that uses a keyword-based search
This pipeline uses a keyword-based retriever built on Elasticsearch's BM25 algorithm. BM25 relies on matching keywords between the query and a document and is a solid baseline to start with.
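For intuition, here's a compact sketch of the classic BM25 scoring formula. It's illustrative only; Elasticsearch's implementation adds analyzers, normalization, and its own parameter defaults.

```python
# Compact, illustrative BM25 scorer showing the core formula only.
import math

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)         # documents containing the term
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # rare terms weigh more
        tf = doc.count(term)                             # term frequency in this document
        # Term frequency saturates (k1) and long documents are penalized (b).
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

corpus = [
    "how to reset your password".split(),
    "shipping and return policy".split(),
    "password security best practices".split(),
]
query = "reset password".split()
# The first document matches both query terms, so it scores highest.
print(max(corpus, key=lambda d: bm25_score(query, d, corpus)))
```

Here's the full pipeline: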
```yaml
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# A pipeline for document search that uses a traditional, keyword-based retriever (based on Elasticsearch's BM25 algorithm).
# It relies on matching keywords between the query and a document and is often a solid baseline to start with.
version: 1.14.0
name: 'KeywordDocumentSearch'

# This section defines the nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to the appropriate converters; by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 500 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in the split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
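Once a pipeline is deployed, you can also query it programmatically. Here's a rough sketch using Python's requests library; the endpoint path, payload shape, and response format shown here are assumptions, so verify them against the deepset Cloud API reference before relying on this.

```python
# Sketch of querying a deployed pipeline over REST. The endpoint path, payload
# shape, and response format are assumptions; check the deepset Cloud API
# reference for the authoritative details.
import os
import requests

API_KEY = os.environ["DEEPSET_CLOUD_API_KEY"]  # your deepset Cloud API key
WORKSPACE = "default"                          # your workspace name
PIPELINE = "KeywordDocumentSearch"             # the pipeline name from the YAML

url = (
    "https://api.cloud.deepset.ai/api/v1"
    f"/workspaces/{WORKSPACE}/pipelines/{PIPELINE}/search"
)
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"queries": ["How do I reset my password?"]},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # inspect the returned documents and their scores
```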
What to Do Next?
You can now demo your search system to your users. Share your pipeline prototype and have them test it. Have a look at the Guidelines for Onboarding Your Users to ensure that your demo is successful.