In extractive question answering, where the answer is highlighted in the documents, AnswerDeduplication removes overlapping answers that come from the same document. It uses two attributes of an answer to determine if it's a duplicate: document_ids (used to determine if two answers are from the same document) and offsets_in_document (indicates the start and end positions of the answer in the document).

📘
This node only works for extractive answers (answers highlighted in documents). It does't work for generated answers (answers generated by a large languge models in RAG pipelines).

We recommend using AnswerDeduplication in any extractive question answering pipeline after a FARMReader. FARMReader often produces numerous overlapping answers and AnswerDeduplication effectively eliminates them.

Basic Information

Pipeline type: Used in query pipelines in extractive question answering.
Nodes that can precede it in a pipeline: Used after FARMReader.
Nodes that can follow it in a pipeline: The last node in the extractive query pipeline.
Input: Answers
Output: Answers
Available node classes: AnswerDeduplication

Usage Example

Here's how you could configure AnswerDeduplication and use it in a pipeline:

components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      model_format: sentence_transformers
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      top_k: 20 # The number of results to return
  - name: Reader # The component that actually fetches answers from among the 20 documents returned by retriever 
    type: CNFARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/deberta-v3-large-squad2 # An optimized variant of BERT, a strong all-round model
      max_seq_len: 384
      context_window_size: 700 # The size of the window around the answer span
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: Deduplicator
    type: AnswerDeduplication
    params:
      overlap_fraction: 0.2
      ...
      
 pipelines:
  - name: query
    nodes:
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: Reader
        inputs: [EmbeddingRetriever]
      - name: Deduplicator
        inputs: [Reader]
        ...

Parameters

Here are the parameters you can pass to AnswerDeduplication in the pipeline YAML:

Parameter	Type	Possible Values	Description
`overlap_fraction`	Float	Default: `0.01`	The threshold for determining if two answers overlap. It calculates the overlap of answer 1 with answer 2, and the overlap of answer 2 with answer 1. Then, it takes the bigger number and checks it against the threshold. For example, for these two answers: "in the river in Maine" and "the river in Maine", AnswerDeduplication would remove one of these answers as the second answer has 100% (1.0 overlap_fraction) with the first answer. Required.
`top_k`	Integer	Default: `10`	The maximum number of answers to return. These answers are not duplicates. Required.