Extractive Question Answering Pipelines
These pipelines return highlighted text passages as answers. Use them when the answer is stated in your documents and you want to see exactly where it appears.
API Key
To reuse these pipelines, first make sure you have the API keys needed to access the models. You need an API key for OpenAI, Cohere, and Hugging Face models. You can add the keys on the Connections tab in deepset Cloud.
English Question Answering Pipeline
This is a good starting point for a question answering system. It combines keyword-based (BM25) and vector-based retrieval, reranks the joined results, and uses a reader node to highlight the answers in text passages.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      model_format: sentence_transformers
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: CNSentenceTransformersRanker
    params:
      model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
      top_k: 10 # The number of results to return
      batch_size: 40 # Keep this equal to or larger than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs: # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: Reader # Extracts the answers from the 10 documents returned by the reranker
    type: CNFARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/deberta-v3-large-squad2 # An optimized variant of BERT, a strong all-round model
      max_seq_len: 384
      context_window_size: 700 # The size of the window around the answer span
      model_kwargs: # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
      - name: Reader
        inputs: [Reranker]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
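To make the Preprocessor settings above more concrete, here is a minimal Python sketch of word-based splitting with a sliding window, using the same values (split_length: 250, split_overlap: 30). It is an illustration only, not the PreProcessor implementation: the function name split_words is hypothetical, and the sketch ignores split_respect_sentence_boundary and any cleaning steps.

```python
from __future__ import annotations


def split_words(text: str, split_length: int = 250, split_overlap: int = 30) -> list[str]:
    """Split text into chunks of at most split_length words.

    Each chunk starts split_length - split_overlap words after the previous
    one, so consecutive chunks share split_overlap words.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # A synthetic 600-word text: w0 w1 ... w599
    sample = " ".join(f"w{i}" for i in range(600))
    for chunk in split_words(sample):
        tokens = chunk.split()
        print(len(tokens), tokens[0], tokens[-1])
```

With these settings, each new chunk begins 220 words after the previous one, so a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk.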
German Question Answering Pipeline
This pipeline is a good starting point for question answering on German documents. It combines keyword-based (BM25) and vector-based retrieval with a German reranker and a German question answering model. A reader node highlights the answers within text passages.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      model_format: sentence_transformers
      embedding_model: intfloat/multilingual-e5-base # Multilingual model optimized for semantic search
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: svalabs/cross-electra-ms-marco-german-uncased # Fast model optimized for reranking
      top_k: 10 # The number of results to return
      batch_size: 40 # Keep this equal to or larger than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs: # Additional keyword arguments for the model
        torch_dtype: torch.float16
  # A "Reader" model that goes through the 10 reranked candidate documents and identifies the exact answer
  - name: Reader
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/gelectra-large-germanquad
      context_window_size: 700 # The size of the window around the answer span
      model_kwargs: # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based (dense) retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: de # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
      - name: Reader
        inputs: [Reranker]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
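The batch_size comment in the Reranker settings above follows from simple arithmetic: each retriever returns up to 20 documents, so the join step can pass at most 40 candidates to the reranker. The Python sketch below illustrates this under one stated assumption, namely that the join keeps a single copy of a document that both retrievers return. The helper names concatenate and num_batches are hypothetical and are not part of Haystack or deepset Cloud.

```python
import math
from itertools import chain


def concatenate(*result_lists):
    """Join ranked lists of document IDs, keeping the first occurrence of each ID.

    Assumption for this sketch: a document found by both retrievers is
    passed on only once.
    """
    return list(dict.fromkeys(chain.from_iterable(result_lists)))


def num_batches(num_docs, batch_size):
    """Number of forward passes the reranker needs for num_docs candidates."""
    return math.ceil(num_docs / batch_size)


if __name__ == "__main__":
    # Hypothetical retriever outputs with top_k = 20 each; 10 documents overlap.
    bm25_hits = [f"d{i}" for i in range(20)]
    embedding_hits = [f"d{i}" for i in range(10, 30)]

    candidates = concatenate(bm25_hits, embedding_hits)
    print(len(candidates))                   # number of unique candidates
    print(num_batches(len(candidates), 40))  # batches needed with batch_size: 40
    print(num_batches(40, 16))               # worst case (no overlap) with a smaller batch size
```

With no overlap between the retrievers, the reranker receives 40 documents, which still fit in one batch at batch_size: 40. A smaller batch size splits the same work into several passes.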