Use Case: A Document Retrieval System

This example explains how to set up a system that searches for documents. It describes the benefits of this type of system, the users, the data, and the pipelines that you will need.

Description

A document retrieval system, also called document search, is a system that returns whole documents that are relevant to the query.

A document retrieval system is best if:

  • Answers cannot be short spans of text but need to be longer, more complex text passages
  • You need a fast system. A document retrieval system doesn't use a reader, which speeds it up significantly.
  • You need a system that can handle millions of requests and has very low latency
  • You need a system that can handle natural language questions
  • Word-based approaches, such as Elasticsearch, are not enough for your use case

Compared to a question answering system, document retrieval is faster and cheaper, and it can even run on a CPU in production. The available document retrieval models are also already very powerful, so domain adaptation is easier than for question answering.

An example of document search

Here's what a document retrieval system looks like:

A screenshot of the search page displaying results to the query "summary of winds of winter". The results are passages of text.

Answers returned by a document search system

Data

You can use any text data. For a fast prototype, your data should be restricted to one domain.

You can divide your data into underlying text data and an annotated set for evaluating your pipelines.
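
For example, once you annotate a handful of queries with the documents that should be returned for them, you can compute a simple metric such as recall@k over your pipeline's results. Here's a minimal sketch in Python; the annotation format and the search function are placeholders for whatever your setup provides:

def recall_at_k(annotations, search, k=20):
    # annotations: list of (query, relevant_doc_ids) pairs
    # search: any function that returns a ranked list of document ids for a query
    hits = 0
    for query, relevant_ids in annotations:
        retrieved = search(query)[:k]
        if any(doc_id in relevant_ids for doc_id in retrieved):
            hits += 1
    return hits / len(annotations)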

Users

  • Data scientists: Design the system, create the pipelines, and supervise domain experts
  • Domain experts: Use the system and provide their feedback in the deepset Cloud interface

Pipelines

A pipeline file in deepset Cloud contains both the indexing pipeline and the query pipeline. Each example below is a YAML file that defines both of these pipelines.

A pipeline using an embedding-based retriever

This pipeline uses EmbeddingRetriever, which is an embedding-based (dense) retriever.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline-using-a-yaml-file.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press control + space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a default document search pipeline with a good embedding-based Retriever
version: '1.10.0'
name: 'DenseDocSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #the only supported document store in deepset Cloud
  - name: Retriever #selects the most relevant documents from the document store
    type: EmbeddingRetriever #uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 #model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 #the number of results to return
  - name: FileTypeClassifier #routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter #converts files into documents
    type: TextConverter
  - name: PDFConverter #converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor #splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      #With a dense retriever, it's good to split your documents into smaller ones
      split_by: word #the unit by which you want to split the documents
      split_length: 250 #the max number of words in a document
      split_overlap: 30 #enables the sliding window approach
      split_respect_sentence_boundary: True #retains complete sentences in split documents
      language: en #used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] #ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] #ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
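
If you want to experiment with this setup locally before deploying it to deepset Cloud, here's a minimal sketch of the same dense retrieval flow in open-source Haystack (v1.x), which deepset Cloud pipelines are based on. It swaps DeepsetCloudDocumentStore for a local InMemoryDocumentStore and indexes two toy documents, so treat it as an illustration rather than the deepset Cloud deployment path:

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import Pipeline

# A local stand-in for DeepsetCloudDocumentStore; the model produces 768-dim vectors
document_store = InMemoryDocumentStore(embedding_dim=768, similarity="dot_product")
document_store.write_documents([
    {"content": "The Winds of Winter is the planned sixth novel in A Song of Ice and Fire."},
    {"content": "Document retrieval systems return whole documents relevant to a query."},
])

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
    top_k=20,
)
document_store.update_embeddings(retriever)  # the indexing step: embed all documents

# The query pipeline: a single retriever node, as in the YAML above
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
result = pipeline.run(query="summary of winds of winter")
for doc in result["documents"]:
    print(doc.score, doc.content[:80])
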
A pipeline that combines an embedding-based retriever with a keyword-based retriever

Combining a dense (embedding-based) and a sparse (keyword-based) retriever brings together the advantages of both in your search. In this type of pipeline, you need a node that joins the documents returned by each of the retrievers (there's a sketch of how this joining works after the YAML below).

# A document search pipeline that combines a dense and a sparse retriever
version: '1.10.0'
name: 'HybridDocSearch'

# This section defines nodes that we want to use in our pipelines
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #the only supported document store in deepset Cloud
  - name: ESRetriever #the keyword-based retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore 
      top_k: 20 #the number of results to return
  - name: EmbeddingRetriever #the embedding-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 #model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 #the number of results to return
  - name: JoinResults #joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion #applies rank-based scoring to the results
  - name: FileTypeClassifier #routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter #converts files into documents
    type: TextConverter
  - name: PDFConverter #converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor #splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      #With a dense retriever, it's good to split your documents into smaller ones
      split_by: word #the unit by which you want to split the documents
      split_length: 250 #the max number of words in a document
      split_overlap: 30 #enables the sliding window approach
      split_respect_sentence_boundary: True #retains complete sentences in split documents
      language: en #used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: ESRetriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [ESRetriever, EmbeddingRetriever]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] #ensures that this converter gets txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] #ensures that this converter gets pdf files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever #computes the document embeddings at indexing time
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
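
To build intuition for what the reciprocal_rank_fusion join mode does, here's a small Python sketch of the underlying scoring scheme: each document's fused score is the sum of 1/(k + rank) over all the rankings it appears in, where k is a smoothing constant (60 in the original paper; the exact constant JoinDocuments uses may differ):

def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one ranked list of document ids per retriever
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# 'a' ranks high in both lists, so it comes out on top: ['a', 'c', 'b', 'd']
print(reciprocal_rank_fusion([["a", "b", "c"], ["a", "c", "d"]]))
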
A pipeline that uses a keyword-based retriever

This pipeline uses a sparse (keyword-based) retriever that implements the Elasticsearch BM25 algorithm. BM25 relies on matching keywords between the query and the documents and is a solid baseline to start with (there's a small scoring sketch after the YAML below).

# A baseline pipeline for document search with a sparse retriever
version: '1.10.0'
name: 'SparseDocSearch_BM25'

# This section defines nodes that we want to use in our pipelines
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #this is the only supported document store in deepset Cloud
  - name: Retriever #selects the most relevant documents from the document store
    type: BM25Retriever #sparse retriever
    params:
      document_store: DocumentStore
      top_k: 20 #the number of results to return
  - name: FileTypeClassifier #routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter #converts files into documents
    type: TextConverter
  - name: PDFConverter #converts PDFs into documents
    type: PDFToTextConverter 
  - name: Preprocessor #splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a sparse retriever, you can keep slightly longer documents
      split_by: word #the unit by which you want to split the documents
      split_length: 500 #the max number of words in a document
      split_overlap: 30 #enables the sliding window approach
      split_respect_sentence_boundary: True #retains complete sentences in split documents
      language: en #used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] #ensures that this converter gets txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] #ensures that this converter gets pdf files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
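
Elasticsearch implements BM25 for you, so there's nothing to code for this pipeline. If you want to see the keyword-matching idea in isolation, here's a small sketch using the third-party rank_bm25 package (unrelated to deepset Cloud, purely for illustration):

from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "The Winds of Winter is the planned sixth novel in A Song of Ice and Fire.",
    "Document retrieval systems return whole documents relevant to a query.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "summary of winds of winter".lower().split()
print(bm25.get_scores(query))  # one relevance score per document; higher = better match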

What to Do Next?

You can now demo your search system to your users. Invite them to your organization and have them test your pipelines. Have a look at the Guidelines for Onboarding Your Users to make sure your demo is successful.

