Use Case: A Document Retrieval System

This example explains how to set up a system that searches for documents. It describes the benefits of this type of system, its users, the data, and the pipelines you need.

Description

A document retrieval system, also called document search, is a system that returns whole documents relevant to the query.

A document retrieval system is best if:

  • Short spans of text are not enough as answers and you need longer, more complex text passages.
  • You need a fast system. A document retrieval system doesn't use a reader, which speeds it up significantly.
  • You need a system that can handle millions of requests and has very low latency.
  • You need a system that can handle natural language questions.
  • Keyword-based approaches, such as BM25 search in Elasticsearch, are not enough for your use case.

Compared to a question answering system, document retrieval is faster and cheaper. It can even run on CPU in production. Also, the available document retrieval models are already very powerful, so domain adaptation is easier than it is for question answering.

An example of document search

Here's what a document retrieval system looks like:

[Screenshot: the search page displaying results for the query "summary of winds of winter". The results are passages of text. Caption: Answers returned by a document search system.]

Data

You can use any text data. For a fast prototype, your data should be restricted to one domain.

You can divide your data into the underlying text data that you search and an annotated set that you use to evaluate your pipelines.
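
As an illustration of how an annotated set can be used for evaluation, here is a minimal sketch that computes recall at k for a retriever. The queries, document IDs, and the `recall_at_k` helper are hypothetical and are not part of deepset Cloud; they only show the idea of comparing ranked results against expert labels.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Return the fraction of relevant documents found in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical annotated set: query -> IDs of documents marked as relevant
annotations = {
    "who wrote the report": {"doc1", "doc4"},
    "quarterly revenue": {"doc2"},
}

# Hypothetical retriever output: ranked document IDs for each query
results = {
    "who wrote the report": ["doc4", "doc3", "doc1"],
    "quarterly revenue": ["doc5", "doc2", "doc1"],
}

scores = {q: recall_at_k(results[q], annotations[q], k=2) for q in annotations}
mean_recall = sum(scores.values()) / len(scores)
```

Averaging the per-query scores gives a single number you can track while you compare pipelines on the same annotated set.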

Users

  • Data scientists: Design the system, create the pipelines, and supervise domain experts.
  • Domain experts: Use the system and provide their feedback in the deepset Cloud interface.

Pipelines

A pipeline file in deepset Cloud contains both an indexing pipeline and a query pipeline. The examples below are YAML files that contain both of these pipelines.

A pipeline using semantic search

This pipeline uses EmbeddingRetriever, which is an embedding-based (dense) retriever.
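
To make the idea of embedding-based retrieval concrete, here is a small sketch that ranks documents by the dot product between a query vector and document vectors. The three-dimensional vectors and the `retrieve` helper are made up for illustration; a real retriever gets its vectors from the embedding model, and dot-product scoring is an assumption here.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, doc_vecs, top_k):
    """Rank documents by dot-product similarity to the query and keep top_k."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); a model would produce much longer vectors
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

top = retrieve(query_vec, doc_vecs, top_k=2)
```

The `top_k` argument plays the same role as the `top_k` parameter of the retriever: it limits how many results come back.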

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a document search pipeline that searches for documents based on semantic similarity. It uses a vector-based search.
version: 1.14.0
name: 'SemanticDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
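
The Preprocessor settings in this pipeline (`split_by: word`, `split_length`, `split_overlap`) describe a sliding window over words. The sketch below shows that idea only. It is not the PreProcessor implementation, it ignores `split_respect_sentence_boundary`, and the `split_words` helper is hypothetical.

```python
def split_words(text, split_length, split_overlap):
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words (sliding window).
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks


# Small numbers keep the example readable; the pipeline uses 250 and 30
text = " ".join(f"w{i}" for i in range(10))
chunks = split_words(text, split_length=4, split_overlap=1)
```

With these toy settings, each chunk starts with the last word of the previous chunk, which is what the overlap is meant to achieve: text near a split point appears in both neighboring documents.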

A pipeline that combines vector-based and keyword-based searches

Combining an embedding-based retriever with a keyword-based one gives your search the advantages of both. In this type of pipeline, you need a node that joins the documents returned by each retriever.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a document search pipeline that combines vector-based and keyword-based searches. Such a combination usually yields the best results without any training.
version: 1.14.0
name: 'HybridDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
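
The JoinResults node in this pipeline uses `join_mode: reciprocal_rank_fusion`. As a rough sketch of how rank-based fusion works in general, the function below scores each document by summing 1 / (k + rank) over the rankings it appears in. The constant `k=60` and the `reciprocal_rank_fusion` helper are assumptions for illustration; the exact constant and details used by JoinDocuments may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document gets 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the best result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first
bm25_results = ["doc1", "doc2", "doc3"]
embedding_results = ["doc3", "doc1", "doc4"]

fused = reciprocal_rank_fusion([bm25_results, embedding_results])
```

Because only ranks are used, the two retrievers do not need comparable score scales. Documents that both retrievers return end up ahead of documents that only one of them returns.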

A pipeline that uses a keyword-based search

This pipeline uses a keyword-based retriever built on the BM25 algorithm as implemented in Elasticsearch. It relies on matching keywords between a query and a document and is a solid baseline to start with.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# A pipeline for document search that uses a traditional, keyword-based retriever (using Elasticsearch's BM25 algorithm).
# It relies on matching keywords between query and document and is often a solid baseline to start with.
version: 1.14.0
name: 'KeywordDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 500 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines: 
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
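
To show what keyword matching with BM25 means in practice, here is a simplified sketch of BM25 scoring over a few toy documents. It uses whitespace tokenization and the commonly cited defaults `k1=1.2` and `b=0.75`. The `bm25_scores` helper and the sample documents are hypothetical, and a real search engine adds analysis steps such as stemming and stop-word handling.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the winds of winter summary",
    "a guide to summer gardening",
    "winter storms and strong winds",
]
scores = bm25_scores("winds of winter", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

A document that shares no terms with the query scores zero, which is why keyword search can miss paraphrased questions that an embedding-based retriever would still match.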

What to Do Next?

You can now demo your search system to the users. Share your pipeline prototype and have them test it. Have a look at the Guidelines for Onboarding Your Users to ensure that your demo is successful.