ExtractiveReader

Locates and extracts answers to a given query from Documents.

Basic Information

  • Type: haystack_integrations.readers.extractive.ExtractiveReader

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str |  | Query string. |
| documents | List[Document] |  | List of Documents in which you want to search for an answer to the query. |
| top_k | Optional[int] | None | The maximum number of answers to return. An additional answer is returned if no_answer is set to True (default). |
| score_threshold | Optional[float] | None | Returns only answers with a score above this threshold. |
| max_seq_length | Optional[int] | None | Maximum number of tokens. If a sequence exceeds it, the sequence is split. |
| stride | Optional[int] | None | Number of tokens that overlap when a sequence is split because it exceeds max_seq_length. |
| max_batch_size | Optional[int] | None | Maximum number of samples fed through the model at the same time. |
| answers_per_seq | Optional[int] | None | Number of answer candidates to consider per sequence. Relevant when a Document was split into multiple sequences because of max_seq_length. |
| no_answer | Optional[bool] | None | Whether to return no answer scores. |
| overlap_threshold | Optional[float] | None | If set, removes duplicate answers whose overlap exceeds this threshold. For example, of the answers "in the river in Maine" and "the river", one is removed because the second overlaps the first completely (1.0). The answers "the river in" and "in Maine" have a maximum overlap of only 25%, so both are kept if this is set to 0.24 or lower. If None, all answers are kept. |
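The overlap check behind overlap_threshold can be sketched in plain Python. This is an illustration of the idea using character spans, not the component's actual implementation:

```python
def span_overlap(span_a: tuple[int, int], span_b: tuple[int, int]) -> float:
    """Overlap of two (start, end) character spans in the same document,
    as a fraction of the shorter span's length."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    if end <= start:
        return 0.0  # spans do not intersect
    shorter = min(span_a[1] - span_a[0], span_b[1] - span_b[0])
    return (end - start) / shorter

# Document text: "in the river in Maine"
# "in the river in Maine" -> (0, 21); "the river" -> (3, 12)
# "the river" lies entirely inside the longer answer: overlap 1.0
print(span_overlap((0, 21), (3, 12)))   # -> 1.0
# "the river in" -> (3, 15); "in Maine" -> (13, 21): only "in" is shared
# 2 shared characters / 8 characters in the shorter span = 0.25
print(span_overlap((3, 15), (13, 21)))  # -> 0.25
```

With overlap_threshold set to 0.24 or lower, only the first pair would be deduplicated.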

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| answers | List[ExtractedAnswer] |  | List of answers sorted by score in descending order. |

Overview

Work in Progress

Bear with us while we work on adding pipeline examples and the most common component connections.

The ExtractiveReader component performs extractive question answering: it assigns a score to every possible answer span independently of all other spans. This avoids a common issue in other implementations, which normalize each document's answers independently and thereby make scores harder to compare across documents.

Example usage:

```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

reader = ExtractiveReader()
reader.warm_up()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
assert "Python" in result["answers"][0].data
```

Usage Example

```yaml
components:
  ExtractiveReader:
    type: components.readers.extractive.ExtractiveReader
    init_parameters:
```
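The init_parameters mapping accepts any of the parameters listed under Init Parameters below. A minimal sketch with illustrative values:

```yaml
components:
  ExtractiveReader:
    type: components.readers.extractive.ExtractiveReader
    init_parameters:
      model: deepset/roberta-base-squad2-distilled
      top_k: 10
      no_answer: true
```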

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | Union[Path, str] | deepset/roberta-base-squad2-distilled | A Hugging Face transformers question answering model. Can be a path to a local folder containing the model files or a model identifier on the Hugging Face Hub. |
| device | Optional[ComponentDevice] | None | The device the model is loaded on. If None, the default device is automatically selected. |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token used to download private models from Hugging Face. |
| top_k | int | 20 | Number of answers to return per query. Required even if score_threshold is set. An additional answer with no text is returned if no_answer is set to True (default). |
| score_threshold | Optional[float] | None | Returns only answers with a probability score above this threshold. |
| max_seq_length | int | 384 | Maximum number of tokens. If a sequence exceeds it, the sequence is split. |
| stride | int | 128 | Number of tokens that overlap when a sequence is split because it exceeds max_seq_length. |
| max_batch_size | Optional[int] | None | Maximum number of samples fed through the model at the same time. |
| answers_per_seq | Optional[int] | None | Number of answer candidates to consider per sequence. Relevant when a Document was split into multiple sequences because of max_seq_length. |
| no_answer | bool | True | Whether to return an additional "no answer" with empty text and a score representing the probability that the other top_k answers are incorrect. |
| calibration_factor | float | 0.1 | Factor used for calibrating probabilities. |
| overlap_threshold | Optional[float] | 0.01 | If set, removes duplicate answers whose overlap exceeds this threshold. For example, of the answers "in the river in Maine" and "the river", one is removed because the second overlaps the first completely (1.0). The answers "the river in" and "in Maine" have a maximum overlap of only 25%, so both are kept if this is set to 0.24 or lower. If None, all answers are kept. |
| model_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments passed to AutoModelForQuestionAnswering.from_pretrained when loading the model specified in model. For details on supported kwargs, see the model's documentation. |
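The "no answer" score is described as the probability that the other top_k answers are all incorrect. One plausible reading is that, treating the calibrated answer probabilities as independent, this combines as a product. The following is a sketch of that reading, not the component's exact computation:

```python
def no_answer_score(answer_scores: list[float]) -> float:
    """Probability that every candidate answer is incorrect, treating the
    answer probabilities as independent (illustrative sketch only)."""
    p = 1.0
    for s in answer_scores:
        p *= 1.0 - s
    return p

# Two confident answers leave little probability mass for "no answer":
print(round(no_answer_score([0.9, 0.8]), 4))  # -> 0.02
```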

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str |  | Query string. |
| documents | List[Document] |  | List of Documents in which you want to search for an answer to the query. |
| top_k | Optional[int] | None | The maximum number of answers to return. An additional answer is returned if no_answer is set to True (default). |
| score_threshold | Optional[float] | None | Returns only answers with a score above this threshold. |
| max_seq_length | Optional[int] | None | Maximum number of tokens. If a sequence exceeds it, the sequence is split. |
| stride | Optional[int] | None | Number of tokens that overlap when a sequence is split because it exceeds max_seq_length. |
| max_batch_size | Optional[int] | None | Maximum number of samples fed through the model at the same time. |
| answers_per_seq | Optional[int] | None | Number of answer candidates to consider per sequence. Relevant when a Document was split into multiple sequences because of max_seq_length. |
| no_answer | Optional[bool] | None | Whether to return no answer scores. |
| overlap_threshold | Optional[float] | None | If set, removes duplicate answers whose overlap exceeds this threshold. For example, of the answers "in the river in Maine" and "the river", one is removed because the second overlaps the first completely (1.0). The answers "the river in" and "in Maine" have a maximum overlap of only 25%, so both are kept if this is set to 0.24 or lower. If None, all answers are kept. |
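The way top_k and score_threshold jointly narrow the result list can be sketched in plain Python. This illustrates the selection behaviour the table describes, using hypothetical dict-shaped answers rather than the component's actual code:

```python
def select_answers(scored, top_k=None, score_threshold=None):
    """Sort answers by score (descending), keep at most top_k,
    and drop any whose score is not above score_threshold."""
    ranked = sorted(scored, key=lambda a: a["score"], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if score_threshold is not None:
        ranked = [a for a in ranked if a["score"] > score_threshold]
    return ranked

candidates = [
    {"data": "Python", "score": 0.95},
    {"data": "a language", "score": 0.40},
    {"data": "programming", "score": 0.10},
]
print([a["data"] for a in select_answers(candidates, top_k=2, score_threshold=0.3)])
# -> ['Python', 'a language']
```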