Description

An extractive question answering (QA) system returns answers as highlighted spans in text passages, so you can find the answer quickly without reading through the returned documents.

Such a system is best for:

Users looking for Google-style answers to their natural language questions.
Users who want to verify their answers quickly.
Finding answers in large amounts of text data.

For this type of search to work best, queries should be constrained to a specific topic, such as IT product documentation. Users should phrase them in natural language rather than, for example, pasting in error messages.

Data

You can use any text data. For a fast prototype, your data should be restricted to one domain.

You can divide your data into underlying text data and an annotated question-answer set for evaluating your pipelines.
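
One common way to score an extractive pipeline against an annotated question-answer set is exact match (EM) and token-level F1, in the style of SQuAD evaluation. The sketch below is a minimal illustration under that assumption, not the evaluation code deepset Cloud runs; the function names and the sample pairs are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotated pairs: (gold answer, answer returned by the pipeline)
examples = [
    ("the admin console", "Admin console"),
    ("port 8080", "on port 8080"),
]
em = sum(exact_match(pred, gold) for gold, pred in examples) / len(examples)
f1 = sum(f1_score(pred, gold) for gold, pred in examples) / len(examples)
print(f"EM={em:.2f} F1={f1:.2f}")
```

EM rewards only answers that match the annotation after normalization, while F1 gives partial credit when the predicted span overlaps the annotated one.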

Users

Data scientists: Design the QA system, create the pipelines, and supervise domain experts.
Domain experts: Prepare annotated data.
End users: Use the system, evaluate its usefulness for business, and provide feedback in the deepset Cloud UI.

Pipelines

Here is an example of a pipeline definition file for this use case. It contains both the indexing and the query pipeline.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      model_format: sentence_transformers
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
      top_k: 10 # The number of results to return
      batch_size: 40  # Try to keep this number equal to or larger than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: Reader # The component that actually extracts answers from the 10 documents returned by the Reranker
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/deberta-v3-large-squad2 # DeBERTa v3 large fine-tuned on SQuAD 2.0, a strong all-round model
      max_seq_len: 384
      context_window_size: 700 # The size of the window around the answer span
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
      - name: Reader
        inputs: [Reranker]
  - name: indexing
    nodes:
    # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
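
The Preprocessor settings in the definition above (split_by: word, split_length: 250, split_overlap: 30) describe a sliding window over words. The following is a simplified sketch of that idea, not the actual PreProcessor implementation: it ignores sentence boundaries and cleaning, and the function name `split_words` is hypothetical.

```python
from typing import List


def split_words(text: str, split_length: int = 250, split_overlap: int = 30) -> List[str]:
    """Split text into chunks of at most split_length words,
    where consecutive chunks share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # How far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # This window already reaches the end of the text
    return chunks


# A made-up 600-word text to show the resulting chunk sizes
text = " ".join(f"w{i}" for i in range(600))
chunks = split_words(text)
print([len(chunk.split()) for chunk in chunks])
```

Because each window starts 220 words after the previous one, the last 30 words of a chunk are repeated at the start of the next, which keeps an answer that straddles a boundary intact in at least one chunk.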

If the EmbeddingRetriever doesn't work well for your domain, you can remove it and rely on the BM25Retriever alone. BM25 works on word overlap between the query and documents and may be a better choice for domains with complex vocabulary.
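
To make the word-overlap idea concrete, here is a minimal, self-contained sketch of BM25 scoring. It is for illustration only and is not the implementation deepset Cloud uses; the values k1=1.2 and b=0.75 are commonly used defaults assumed here, tokenization is a plain whitespace split, and the sample documents are made up.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # Only terms shared by query and document contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "restart the sync service after changing the config file",
    "the billing page shows your invoices",
    "sync errors appear when the service is offline",
]
scores = bm25_scores("restart sync service", docs)
print(scores)
```

The second document shares no words with the query, so it scores zero. A query using a synonym such as "reboot" would get no credit for "restart" either, which is the gap an embedding-based retriever is meant to cover.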

For more examples, see Pipeline Examples.

What To Do Next?

You can now demo your search system to the users. Share your pipeline prototype and have them test your pipelines. Have a look at the Guidelines for Onboarding Your Users to ensure your demo is successful.