ExtractiveReader

Locates and extracts answers to a given query from Documents.

Basic Information

Type: haystack.components.readers.extractive.ExtractiveReader

Inputs

Parameter	Type	Default	Description
query	str		Query string.
documents	List[Document]		List of Documents in which you want to search for an answer to the query.
top_k	Optional[int]	None	The maximum number of answers to return. An additional answer is returned if no_answer is set to True (default).
score_threshold	Optional[float]	None	Returns only answers with a score above this threshold.
max_seq_length	Optional[int]	None	Maximum number of tokens. If a sequence exceeds it, the sequence is split.
stride	Optional[int]	None	Number of tokens that overlap when a sequence is split because it exceeds max_seq_length.
max_batch_size	Optional[int]	None	Maximum number of samples that are fed through the model at the same time.
answers_per_seq	Optional[int]	None	Number of answer candidates to consider per sequence. This is relevant when a Document was split into multiple sequences because of max_seq_length.
no_answer	Optional[bool]	None	Whether to return an additional `no answer` with an empty text and a score representing the probability that the other top_k answers are incorrect.
overlap_threshold	Optional[float]	None	If set, removes duplicate answers whose overlap is larger than this threshold. For example, for the answers "in the river in Maine" and "the river", one of the two is removed because the second answer has a 100% (1.0) overlap with the first. For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so both answers are kept if this value is set to 0.25 or higher. If None is provided, all answers are kept.
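
To make the interaction between max_seq_length and stride concrete, here is a minimal sketch of a sliding-window split. It is an illustration only, not the component's implementation: the helper name `split_with_stride` is hypothetical, tokens are plain integers, and the sketch ignores the query and any special tokens that a real tokenizer would add to each sequence.

```python
def split_with_stride(tokens, max_seq_length, stride):
    """Split tokens into windows of at most max_seq_length tokens.

    Consecutive windows share `stride` tokens. Hypothetical helper
    for illustration; a real tokenizer also reserves room for the
    query and special tokens.
    """
    if stride >= max_seq_length:
        raise ValueError("stride must be smaller than max_seq_length")
    step = max_seq_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_seq_length])
        if start + max_seq_length >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Ten tokens, windows of four tokens, two tokens shared between neighbors
windows = split_with_stride(list(range(10)), max_seq_length=4, stride=2)
```

A larger stride lowers the chance that an answer is cut at a window boundary, at the cost of more sequences to run through the model.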

Outputs

Parameter	Type	Default	Description
answers	List[ExtractedAnswer]		List of answers sorted by score in descending order.

Overview

The ExtractiveReader component performs extractive question answering. It scores every possible answer span independently of all other spans. This avoids a common problem in other implementations, which normalize each document's answers separately and so make scores hard to compare across documents.
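
The difference between the two scoring styles can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the component's code: the logits are made-up numbers, and the sigmoid is used here only as one example of a scoring function that treats each span independently.

```python
import math


def softmax(xs):
    """Normalize a list of logits so that they sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a single logit to (0, 1) without looking at other logits."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical span logits (made-up values for illustration)
doc_a_logits = [8.0, 2.0]   # Document A contains a strong candidate
doc_b_logits = [1.0, -5.0]  # Document B contains only weak candidates

# Per-document normalization: the best span of each document looks
# equally good, because only the gap within a document matters.
per_doc_a = softmax(doc_a_logits)
per_doc_b = softmax(doc_b_logits)

# Independent scoring: each span is judged on its own logit, so the
# strong candidate in Document A clearly outranks Document B's best.
indep_a = [sigmoid(x) for x in doc_a_logits]
indep_b = [sigmoid(x) for x in doc_b_logits]
```

With per-document normalization, the top spans of both documents receive the same score even though one is far weaker. Independent scores keep that difference visible, which is what makes a single ranking across documents meaningful.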

Example usage:

from haystack import Document
from haystack.components.readers import ExtractiveReader

# Two short Documents to search for an answer
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

reader = ExtractiveReader()
# Load the model before the first run
reader.warm_up()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
# Answers are sorted by score, so the first one is the best match
assert "Python" in result["answers"][0].data

Usage Example

components:
  ExtractiveReader:
    type: haystack.components.readers.extractive.ExtractiveReader
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
model	Union[Path, str]	deepset/roberta-base-squad2-distilled	A Hugging Face transformers question answering model. Can either be a path to a folder containing the model files or an identifier for the Hugging Face hub.
device	Optional[ComponentDevice]	None	The device on which the model is loaded. If `None`, the default device is automatically selected.
token	Optional[Secret]	Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False)	The API token used to download private models from Hugging Face.
top_k	int	20	Number of answers to return per query. It is required even if score_threshold is set. An additional answer with no text is returned if no_answer is set to True (default).
score_threshold	Optional[float]	None	Returns only answers with the probability score above this threshold.
max_seq_length	int	384	Maximum number of tokens. If a sequence exceeds it, the sequence is split.
stride	int	128	Number of tokens that overlap when a sequence is split because it exceeds max_seq_length.
max_batch_size	Optional[int]	None	Maximum number of samples that are fed through the model at the same time.
answers_per_seq	Optional[int]	None	Number of answer candidates to consider per sequence. This is relevant when a Document was split into multiple sequences because of max_seq_length.
no_answer	bool	True	Whether to return an additional `no answer` with an empty text and a score representing the probability that the other top_k answers are incorrect.
calibration_factor	float	0.1	Factor used for calibrating probabilities.
overlap_threshold	Optional[float]	0.01	If set, removes duplicate answers whose overlap is larger than this threshold. For example, for the answers "in the river in Maine" and "the river", one of the two is removed because the second answer has a 100% (1.0) overlap with the first. For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so both answers are kept if this value is set to 0.25 or higher. If None is provided, all answers are kept.
model_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments passed to `AutoModelForQuestionAnswering.from_pretrained` when loading the model specified in `model`. For details on what kwargs you can pass, see the model's documentation.
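
The no_answer description above speaks of "the probability that the other top_k answers are incorrect". One natural way to read that is sketched below. It assumes that each answer score is a probability and that answers are independent of each other; both are assumptions made for this illustration, and `no_answer_score` is a hypothetical helper, not a function of the component.

```python
import math


def no_answer_score(answer_scores):
    """Probability that every given answer is wrong.

    Assumes each score is an independent probability of being correct.
    Hypothetical helper for illustration only.
    """
    return math.prod(1.0 - score for score in answer_scores)


# Two confident answers leave little room for "no answer" ...
low = no_answer_score([0.9, 0.5])
# ... while two weak answers make "no answer" the likely outcome.
high = no_answer_score([0.1, 0.2])
```

Under these assumptions, a single answer with a score close to 1 is enough to push the no-answer score close to 0.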

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
query	str		Query string.
documents	List[Document]		List of Documents in which you want to search for an answer to the query.
top_k	Optional[int]	None	The maximum number of answers to return. An additional answer is returned if no_answer is set to True (default).
score_threshold	Optional[float]	None	Returns only answers with a score above this threshold.
max_seq_length	Optional[int]	None	Maximum number of tokens. If a sequence exceeds it, the sequence is split.
stride	Optional[int]	None	Number of tokens that overlap when a sequence is split because it exceeds max_seq_length.
max_batch_size	Optional[int]	None	Maximum number of samples that are fed through the model at the same time.
answers_per_seq	Optional[int]	None	Number of answer candidates to consider per sequence. This is relevant when a Document was split into multiple sequences because of max_seq_length.
no_answer	Optional[bool]	None	Whether to return an additional `no answer` with an empty text and a score representing the probability that the other top_k answers are incorrect.
overlap_threshold	Optional[float]	None	If set, removes duplicate answers whose overlap is larger than this threshold. For example, for the answers "in the river in Maine" and "the river", one of the two is removed because the second answer has a 100% (1.0) overlap with the first. For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so both answers are kept if this value is set to 0.25 or higher. If None is provided, all answers are kept.
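
The overlap_threshold examples can be checked with a small sketch. It assumes that overlap is measured in characters, that the overlap fraction is taken relative to each answer's own length (using the larger of the two values), and that an answer is dropped when this fraction is strictly larger than the threshold. The helpers `overlap_fraction` and `deduplicate` are hypothetical and written only for this illustration.

```python
def overlap_fraction(span_a, span_b):
    """Largest share of either span that is covered by the other.

    Spans are (start, end) character offsets, end exclusive.
    """
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    overlap = max(0, end - start)
    if overlap == 0:
        return 0.0
    return max(overlap / (span_a[1] - span_a[0]),
               overlap / (span_b[1] - span_b[0]))


def deduplicate(spans, overlap_threshold):
    """Keep spans in order, dropping any that overlap a kept span too much.

    `spans` is assumed to be sorted by descending score already.
    """
    if overlap_threshold is None:
        return list(spans)  # no filtering at all
    kept = []
    for span in spans:
        if all(overlap_fraction(span, k) <= overlap_threshold for k in kept):
            kept.append(span)
    return kept


text = "in the river in Maine"
whole = (0, 21)         # "in the river in Maine"
the_river = (3, 12)     # "the river"
the_river_in = (3, 15)  # "the river in"
in_maine = (13, 21)     # "in Maine"

full_overlap = overlap_fraction(whole, the_river)
partial_overlap = overlap_fraction(the_river_in, in_maine)
```

Under these assumptions, "the river" overlaps the full sentence completely, while "the river in" and "in Maine" share only the two characters of "in", which is a quarter of the shorter answer. That pair survives a threshold of 0.25 but not one of 0.24.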
