ExtractiveReader
Locate and extract answers to a query directly from documents using a transformer-based question answering model. Unlike generative models, ExtractiveReader returns exact spans of text from the source documents as answers.
Key Features
- Assigns independent scores to every possible answer span, making comparisons across documents consistent.
- Returns a configurable number of top answers ranked by score.
- Supports a score threshold to filter out low-confidence answers.
- Optionally returns a "no answer" result when no confident answer is found.
- Removes duplicate answers based on a configurable overlap threshold.
- Works with any Hugging Face question answering model.
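The ranking features above can be illustrated with a small, library-independent sketch. It is not the component's actual implementation. `Candidate` and `select_answers` are hypothetical names, the "no answer" score is computed as the product of (1 - score) on the assumption that span scores are independent probabilities, and the threshold uses a strict comparison to match the "above this threshold" wording.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for an extracted answer."""
    text: Optional[str]  # None marks the "no answer" entry
    score: float         # assumed to be an independent probability in [0, 1]


def select_answers(
    candidates: List[Candidate],
    top_k: int,
    score_threshold: Optional[float] = None,
    no_answer: bool = True,
) -> List[Candidate]:
    # Keep the top_k highest-scoring spans.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]

    if no_answer:
        # Assumption: the probability that every kept span is wrong is the
        # product of the individual "wrong" probabilities.
        no_answer_score = math.prod(1.0 - c.score for c in ranked)
        ranked.append(Candidate(text=None, score=no_answer_score))
        ranked.sort(key=lambda c: c.score, reverse=True)

    if score_threshold is not None:
        # Drop low-confidence entries (strict comparison is an assumption).
        ranked = [c for c in ranked if c.score > score_threshold]

    return ranked


if __name__ == "__main__":
    spans = [Candidate("Paris", 0.9), Candidate("France", 0.5), Candidate("Europe", 0.2)]
    for answer in select_answers(spans, top_k=2):
        print(answer)
```

In this sketch, `top_k` is applied before the "no answer" entry is added, so the result can contain up to `top_k + 1` entries.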
Configuration
- Drag the ExtractiveReader component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab, enter the model to use. This can be a Hugging Face Hub model ID (for example, deepset/roberta-base-squad2-distilled) or a path to a local folder containing the model files.
- Optionally, go to the Advanced tab to configure more settings, such as top_k, model_kwargs, and score_threshold (which filters out answers below a minimum confidence score).
Connections
ExtractiveReader accepts a query string and a list of Document objects as inputs. It outputs a list of ExtractedAnswer objects ranked by score.
Typically, you connect a retriever (such as OpenSearchBM25Retriever) or a Ranker to the ExtractiveReader's documents input, and pass the user query to its query input. The answers output then connects to Output's answers input to get the final response.
Source Code
To check this component's source code, open extractive.py in the Haystack repository.
Usage Examples
Basic Configuration
reader:
  type: haystack.components.readers.extractive.ExtractiveReader
  init_parameters:
    answers_per_seq: 20
    calibration_factor: 1.0
    max_seq_length: 384
    model: "deepset/deberta-v3-large-squad2"
    model_kwargs:
      torch_dtype: "torch.float16"
    no_answer: false
    top_k: 10
Using ExtractiveReader in a Pipeline
ExtractiveReader is typically used in a pipeline to extract answers from a list of documents. Here's an example:
components:
  retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.open_search_hybrid_retriever.OpenSearchHybridRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          hosts:
          index: ""
          max_chunk_bytes: 104857600
          return_embedding: false
          method:
          mappings:
          settings:
            index.knn: true
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20 # The number of results to return
      fuzziness: 0
      embedder:
        type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
        init_parameters:
          normalize_embeddings: true
          model: intfloat/e5-base-v2
  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: tomaarsen/Qwen3-Reranker-0.6B-seq-cls
      top_k: 10
  reader:
    type: haystack.components.readers.extractive.ExtractiveReader
    init_parameters:
      answers_per_seq: 20
      calibration_factor: 1.0
      max_seq_length: 384
      model: "deepset/deberta-v3-large-squad2"
      model_kwargs:
        torch_dtype: "torch.float16"
      no_answer: false
      top_k: 10
  attachments_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      weights:
      top_k:
      sort_by_score: true
  multi_file_converter:
    type: haystack.core.super_component.super_component.SuperComponent
    init_parameters:
      input_mapping:
        sources:
          - file_classifier.sources
      is_pipeline_async: false
      output_mapping:
        score_adder.output: documents
      pipeline:
        components:
          file_classifier:
            type: haystack.components.routers.file_type_router.FileTypeRouter
            init_parameters:
              mime_types:
                - text/plain
                - application/pdf
                - text/markdown
                - text/html
                - application/vnd.openxmlformats-officedocument.wordprocessingml.document
                - application/vnd.openxmlformats-officedocument.presentationml.presentation
                - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
                - text/csv
          text_converter:
            type: haystack.components.converters.txt.TextFileToDocument
            init_parameters:
              encoding: utf-8
          pdf_converter:
            type: haystack.components.converters.pdfminer.PDFMinerToDocument
            init_parameters:
              line_overlap: 0.5
              char_margin: 2
              line_margin: 0.5
              word_margin: 0.1
              boxes_flow: 0.5
              detect_vertical: true
              all_texts: false
              store_full_path: false
          markdown_converter:
            type: haystack.components.converters.txt.TextFileToDocument
            init_parameters:
              encoding: utf-8
          html_converter:
            type: haystack.components.converters.html.HTMLToDocument
            init_parameters:
              # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
              # For the full list of available arguments, see
              # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
              extraction_kwargs:
                output_format: markdown # Extract text from HTML. You can also choose "txt"
                target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
                include_tables: true # If true, includes tables in the output
                include_links: true # If true, keeps links along with their targets
          docx_converter:
            type: haystack.components.converters.docx.DOCXToDocument
            init_parameters:
              link_format: markdown
          pptx_converter:
            type: haystack.components.converters.pptx.PPTXToDocument
            init_parameters: {}
          xlsx_converter:
            type: haystack.components.converters.xlsx.XLSXToDocument
            init_parameters: {}
          csv_converter:
            type: haystack.components.converters.csv.CSVToDocument
            init_parameters:
              encoding: utf-8
          splitter:
            type: haystack.components.preprocessors.document_splitter.DocumentSplitter
            init_parameters:
              split_by: word
              split_length: 250
              split_overlap: 30
              respect_sentence_boundary: true
              language: en
          score_adder:
            type: haystack.components.converters.output_adapter.OutputAdapter
            init_parameters:
              template: |
                {%- set scored_documents = [] -%}
                {%- for document in documents -%}
                  {%- set doc_dict = document.to_dict() -%}
                  {%- set _ = doc_dict.update({'score': 100.0}) -%}
                  {%- set scored_doc = document.from_dict(doc_dict) -%}
                  {%- set _ = scored_documents.append(scored_doc) -%}
                {%- endfor -%}
                {{ scored_documents }}
              output_type: List[haystack.Document]
              custom_filters:
              unsafe: true
          tabular_joiner:
            type: haystack.components.joiners.document_joiner.DocumentJoiner
            init_parameters:
              join_mode: concatenate
              sort_by_score: false
        connections:
          - sender: file_classifier.text/plain
            receiver: text_converter.sources
          - sender: file_classifier.application/pdf
            receiver: pdf_converter.sources
          - sender: file_classifier.text/markdown
            receiver: markdown_converter.sources
          - sender: file_classifier.text/html
            receiver: html_converter.sources
          - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
            receiver: docx_converter.sources
          - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
            receiver: pptx_converter.sources
          - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
            receiver: xlsx_converter.sources
          - sender: file_classifier.text/csv
            receiver: csv_converter.sources
          - sender: text_converter.documents
            receiver: splitter.documents
          - sender: pdf_converter.documents
            receiver: splitter.documents
          - sender: markdown_converter.documents
            receiver: splitter.documents
          - sender: html_converter.documents
            receiver: splitter.documents
          - sender: pptx_converter.documents
            receiver: splitter.documents
          - sender: docx_converter.documents
            receiver: splitter.documents
          - sender: xlsx_converter.documents
            receiver: tabular_joiner.documents
          - sender: csv_converter.documents
            receiver: tabular_joiner.documents
          - sender: splitter.documents
            receiver: tabular_joiner.documents
          - sender: tabular_joiner.documents
            receiver: score_adder.documents
connections:
  - sender: retriever.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: attachments_joiner.documents
  - sender: multi_file_converter.documents
    receiver: attachments_joiner.documents
  - sender: attachments_joiner.documents
    receiver: reader.documents
inputs:
  query:
    - retriever.query
    - ranker.query
    - reader.query
  filters:
    - retriever.filters_bm25
    - retriever.filters_embedding
  files:
    - multi_file_converter.sources
outputs:
  documents: attachments_joiner.documents
  answers: reader.answers
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
- Drag the ExtractiveReader component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab, select the model: enter a Hugging Face model identifier or a local path. The default is deepset/roberta-base-squad2-distilled.
- Go to the Advanced tab to configure the device, API token, top_k, score threshold, maximum sequence length, stride, batch size, answers per sequence, no-answer scoring, calibration factor, overlap threshold, and model keyword arguments.
Outputs
| Parameter | Type | Description |
|---|---|---|
| answers | List[ExtractedAnswer] | List of answers sorted by score in descending order. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | Union[Path, str] | deepset/roberta-base-squad2-distilled | A Hugging Face transformers question answering model. Can either be a path to a folder containing the model files or an identifier for the Hugging Face hub. |
| device | Optional[ComponentDevice] | None | The device on which the model is loaded. If None, the default device is automatically selected. |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token used to download private models from Hugging Face. |
| top_k | int | 20 | Number of answers to return per query. It is required even if score_threshold is set. An additional answer with no text is returned if no_answer is set to True (default). |
| score_threshold | Optional[float] | None | Returns only answers with the probability score above this threshold. |
| max_seq_length | int | 384 | Maximum number of tokens. If a sequence exceeds it, the sequence is split. |
| stride | int | 128 | Number of tokens that overlap when a sequence is split because it exceeds max_seq_length. |
| max_batch_size | Optional[int] | None | Maximum number of samples that are fed through the model at the same time. |
| answers_per_seq | Optional[int] | None | Number of answer candidates to consider per sequence. This is relevant when a Document was split into multiple sequences because of max_seq_length. |
| no_answer | bool | True | Whether to return an additional "no answer" result with empty text and a score representing the probability that the other top_k answers are incorrect. |
| calibration_factor | float | 0.1 | Factor used for calibrating probabilities. |
| overlap_threshold | Optional[float] | 0.01 | If set, removes duplicate answers whose overlap is larger than this threshold. For example, for the answers "in the river in Maine" and "the river", one of the answers is removed because the second answer has a 100% (1.0) overlap with the first. However, the answers "the river in" and "in Maine" have a maximum overlap of 25%, so both are kept only if this value is set to 0.25 or higher. If None, all answers are kept. |
| model_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments passed to AutoModelForQuestionAnswering.from_pretrained when loading the model specified in model. For details on what kwargs you can pass, see the model's documentation. |
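The overlap_threshold behavior can be sketched without any library code. This is a minimal illustration, not the component's actual implementation. It assumes answers are character-offset spans, that overlap is measured relative to the shorter of the two spans, and that a span is dropped when its overlap with an already kept, higher-scoring span is strictly larger than the threshold. `Span`, `span_of`, `overlap_fraction`, and `deduplicate` are hypothetical names, and the scores are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical answer span: character offsets into the document text."""
    start: int   # inclusive
    end: int     # exclusive
    score: float


def span_of(text: str, phrase: str, score: float) -> Span:
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return Span(start, start + len(phrase), score)


def overlap_fraction(a: Span, b: Span) -> float:
    """Largest share of either span that is covered by the other."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    if overlap == 0:
        return 0.0
    return max(overlap / (a.end - a.start), overlap / (b.end - b.start))


def deduplicate(spans: List[Span], overlap_threshold: Optional[float]) -> List[Span]:
    ranked = sorted(spans, key=lambda s: s.score, reverse=True)
    if overlap_threshold is None:
        return ranked  # None keeps every answer
    kept: List[Span] = []
    for candidate in ranked:
        # Keep the candidate only if it does not overlap too much with any kept span.
        if all(overlap_fraction(candidate, k) <= overlap_threshold for k in kept):
            kept.append(candidate)
    return kept


text = "in the river in Maine"
full = span_of(text, "in the river in Maine", 0.9)
river = span_of(text, "the river", 0.8)
river_in = span_of(text, "the river in", 0.7)
in_maine = span_of(text, "in Maine", 0.6)
```

Under these assumptions, "the river" is fully contained in "in the river in Maine" (overlap 1.0), so only the higher-scoring span survives at the default threshold of 0.01. The spans "the river in" and "in Maine" share two characters, which is 25% of the shorter span.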
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | Query string. |
| documents | List[Document] | | List of Documents in which you want to search for an answer to the query. |
| top_k | Optional[int] | None | The maximum number of answers to return. An additional answer is returned if no_answer is set to True (default). |
| score_threshold | Optional[float] | None | Returns only answers with the score above this threshold. |
| max_seq_length | Optional[int] | None | Maximum number of tokens. If a sequence exceeds it, the sequence is split. |
| stride | Optional[int] | None | Number of tokens that overlap when a sequence is split because it exceeds max_seq_length. |
| max_batch_size | Optional[int] | None | Maximum number of samples that are fed through the model at the same time. |
| answers_per_seq | Optional[int] | None | Number of answer candidates to consider per sequence. This is relevant when a Document was split into multiple sequences because of max_seq_length. |
| no_answer | Optional[bool] | None | Whether to return no answer scores. |
| overlap_threshold | Optional[float] | None | If set, removes duplicate answers whose overlap is larger than this threshold. For example, for the answers "in the river in Maine" and "the river", one of the answers is removed because the second answer has a 100% (1.0) overlap with the first. However, the answers "the river in" and "in Maine" have a maximum overlap of 25%, so both are kept only if this value is set to 0.25 or higher. If None, all answers are kept. |
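To make the interaction between max_seq_length and stride concrete, here is a minimal sketch of splitting a token list into overlapping windows. It is an illustration only: `split_with_stride` is a hypothetical helper, tokens are plain list items rather than real tokenizer output, and the sketch ignores the query and special tokens that a real model input also has to fit within max_seq_length.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_with_stride(tokens: Sequence[T], max_seq_length: int, stride: int) -> List[List[T]]:
    """Split tokens into windows of at most max_seq_length items,
    where consecutive windows share `stride` items."""
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    if not 0 <= stride < max_seq_length:
        raise ValueError("stride must be >= 0 and smaller than max_seq_length")

    if len(tokens) <= max_seq_length:
        return [list(tokens)]  # short input: no split needed

    step = max_seq_length - stride  # how far each new window advances
    windows: List[List[T]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_seq_length]))
        if start + max_seq_length >= len(tokens):
            break  # this window reached the end of the input
        start += step
    return windows


if __name__ == "__main__":
    for window in split_with_stride(list(range(10)), max_seq_length=4, stride=2):
        print(window)
```

With the defaults of 384 and 128, each new window in this sketch advances by 256 tokens, so a 1,000-token input yields four windows. answers_per_seq would then limit how many candidate spans are considered from each window.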