AnswerDeduplication
Add AnswerDeduplication to your extractive question answering pipeline to make sure it only returns unique answers.
In extractive question answering, where the answer is highlighted in the documents, AnswerDeduplication removes overlapping answers that come from the same document. It uses two attributes of an answer to determine if it's a duplicate: document_ids
(used to determine if two answers are from the same document) and offsets_in_document
(indicates the start and end positions of the answer in the document).
This node only works for extractive answers (answers highlighted in documents). It does't work for generated answers (answers generated by a large languge models in RAG pipelines).
We recommend using AnswerDeduplication in any extractive question answering pipeline after a FARMReader. FARMReader often produces numerous overlapping answers and AnswerDeduplication effectively eliminates them.
Basic Information
- Pipeline type: Used in query pipelines in extractive question answering.
- Nodes that can precede it in a pipeline: Used after FARMReader.
- Nodes that can follow it in a pipeline: The last node in the extractive query pipeline.
- Input: Answers
- Output: Answers
- Available node classes: AnswerDeduplication
Usage Example
Here's how you could configure AnswerDeduplication and use it in a pipeline:
components:
- name: DocumentStore
type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
params:
embedding_dim: 768
similarity: cosine
- name: EmbeddingRetriever # Selects the most relevant documents from the document store
type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
params:
document_store: DocumentStore
model_format: sentence_transformers
embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
top_k: 20 # The number of results to return
- name: Reader # The component that actually fetches answers from among the 20 documents returned by retriever
type: CNFARMReader # Transformer-based reader, specializes in extractive QA
params:
model_name_or_path: deepset/deberta-v3-large-squad2 # An optimized variant of BERT, a strong all-round model
max_seq_len: 384
context_window_size: 700 # The size of the window around the answer span
model_kwargs: # Additional keyword arguments for the model
torch_dtype: torch.float16
- name: Deduplicator
type: AnswerDeduplication
params:
overlap_fraction: 0.2
...
pipelines:
- name: query
nodes:
- name: EmbeddingRetriever
inputs: [Query]
- name: Reader
inputs: [EmbeddingRetriever]
- name: Deduplicator
inputs: [Reader]
...
Parameters
Here are the parameters you can pass to AnswerDeduplication in the pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
overlap_fraction | Float | Default: 0.01 | The threshold for determining if two answers overlap. It calculates the overlap of answer 1 with answer 2, and the overlap of answer 2 with answer 1. Then, it takes the bigger number and checks it against the threshold. For example, for these two answers: "in the river in Maine" and "the river in Maine", AnswerDeduplication would remove one of these answers as the second answer has 100% (1.0 overlap_fraction) with the first answer. Required. |
top_k | Integer | Default: 10 | The maximum number of answers to return. These answers are not duplicates. Required. |
Updated 7 months ago