Use Case: A Live QA System
This is an example of how to create the indexing and query pipelines for a live question answering system. It describes the data you need, the users involved, and the actual pipelines.
Description
A live question answering (QA) system returns answers highlighted within text passages, so you can spot the answer easily without reading through the returned documents.
A live QA system is best for:
- Users looking for Google-style answers to their natural language questions.
- Users who want to verify their answers quickly.
- Finding answers in large amounts of text data.
For this type of search to work best, constrain queries to a specific topic, such as IT product documentation. Queries should be phrased in natural language rather than, for example, copied error messages: "How do I reset the router to factory settings?" works better than a pasted stack trace.
Data
You can use any text data. For a fast prototype, restrict your data to a single domain.
You can divide your data into the underlying text corpus and an annotated question-answer set for evaluating your pipelines.
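An annotated set typically ties each question to a text passage and to the exact answer span within it. The sketch below illustrates the structure of a single label using the public SQuAD 2.0 field names; the document title, passage text, and offsets are made up for illustration, and you should check the deepset Cloud documentation for the exact file format it expects for evaluation sets.

data:
  - title: router_manual # An identifier for the source document
    paragraphs:
      - context: "To restore factory settings, hold the reset button for ten seconds."
        qas:
          - question: "How do I reset the router to factory settings?"
            id: "q-0001"
            answers:
              - text: "hold the reset button for ten seconds" # The answer span, verbatim from the context
                answer_start: 29 # Character offset of the answer within the context
            is_impossible: false # SQuAD 2.0 also allows unanswerable questions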
Users
- Data scientists: Design the QA system, create the pipelines, and supervise domain experts.
- Domain experts: Prepare annotated data.
- End users: Use the system, evaluate its usefulness for business, and provide feedback in the deepset Cloud UI.
Pipelines
Here is an example of a pipeline definition file for this use case. It contains both the indexing and the query pipeline.
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see the available models.
# This is a baseline question answering pipeline for English. It has a strong, vector-based Retriever and a small, fast Reader.
version: '1.21.0'
name: "QA_en"
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you; you can give your component a friendly name. You then use the components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store and passes them on to the Reader
    type: EmbeddingRetriever # Uses a Transformer model to encode the documents and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      pooling_strategy: cls_token # Specifies how the token embeddings from the model are combined into one embedding
      top_k: 20 # The number of documents to return
  - name: Reader # The component that fetches answers from the 20 documents the Retriever returns
    type: FARMReader # Transformer-based Reader that specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-base-squad2 # A RoBERTa model (an optimized BERT variant) fine-tuned on SQuAD 2.0; a strong all-round model
      context_window_size: 700 # The size of the window of text around the answer span that is returned as context
  - name: TextFileConverter # Converts files to documents
    type: TextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      split_by: word # The unit by which you want to split your documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach: consecutive splits share 30 words
      split_respect_sentence_boundary: True # Retains complete sentences in the split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language
pipelines: # Here you define the pipelines. For each node, specify its input.
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever # The Retriever is used here to create the document embeddings
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
To learn more about the sentence-transformers/multi-qa-mpnet-base-dot-v1 model, see its documentation on Hugging Face. If it doesn't work well for your domain, you can use a BM25 Retriever instead of the EmbeddingRetriever, as the sketch below shows. BM25 relies on word overlap between the query and the documents and may be a better choice for domains with specialized vocabulary.
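Swapping in BM25 is mainly a change to the Retriever component. Here is a minimal sketch of what that swap could look like in the pipeline definition above; it assumes BM25 retrieval is supported for your document store, so treat it as an illustration rather than a drop-in replacement:

  - name: Retriever
    type: BM25Retriever # Sparse, keyword-based retrieval; no embedding model to configure
    params:
      document_store: DocumentStore
      top_k: 20 # The number of documents to return

Because BM25 computes no embeddings, the indexing pipeline no longer needs the Retriever step: connect the DocumentStore's input directly to the Preprocessor (inputs: [Preprocessor]). The query pipeline stays exactly as defined above.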
For more examples, see Pipeline Examples.
What To Do Next?
You can now demo your search system to the users. Share your pipeline prototype and have them test it. Have a look at the Guidelines for Onboarding Your Users to ensure your demo is successful.