TopPSampler

Implements top-p (nucleus) sampling for document filtering based on cumulative probability scores.

Basic Information

  • Type: haystack.components.samplers.top_p.TopPSampler

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Document objects to be filtered. |
| top_p | Optional[float] | None | If specified, a float to override the cumulative probability threshold set during initialization. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | A dictionary with the following key: documents, the list of Document objects that have been selected based on the top-p sampling. |

Overview

This component filters a list of documents, keeping those whose scores fall within the top_p fraction of the cumulative probability distribution. Use it to focus on high-probability documents and drop less relevant ones based on their assigned scores.
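To make the selection rule concrete, here is a minimal pure-Python sketch of nucleus filtering over raw scores. It is not the component's source code. It assumes that scores are turned into probabilities with a softmax, that documents are kept in descending order while the running total stays at or below top_p, and that the best document is always returned when nothing qualifies. The function name top_p_indices is hypothetical.

```python
import math
from typing import List


def top_p_indices(scores: List[float], top_p: float) -> List[int]:
    """Return indices of the scores kept by top-p filtering, best first.

    Illustrative sketch only: softmax normalization and the
    "always keep at least one" fallback are assumptions.
    """
    if not scores:
        return []

    # Softmax, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank indices from most to least probable.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)

    selected: List[int] = []
    cumulative = 0.0
    for i in order:
        cumulative += probs[i]
        # Small tolerance so that top_p=1.0 keeps every document
        # despite floating-point rounding.
        if cumulative > top_p + 1e-6:
            break
        selected.append(i)

    # Fall back to the single best index if the threshold excluded everything.
    return selected or [order[0]]


# Scores in the order Berlin, Belgrade, Sarajevo.
kept = top_p_indices([-10.6, -8.9, -4.6], top_p=0.95)
```

Under these assumptions the highest score alone already carries more than 0.95 of the probability mass, so only index 2 (Sarajevo) is kept through the fallback.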

Usage example:

from haystack import Document
from haystack.components.samplers import TopPSampler

# Read each document's score from meta["similarity_score"] and apply a 0.95 threshold.
sampler = TopPSampler(top_p=0.95, score_field="similarity_score")
docs = [
    Document(content="Berlin", meta={"similarity_score": -10.6}),
    Document(content="Belgrade", meta={"similarity_score": -8.9}),
    Document(content="Sarajevo", meta={"similarity_score": -4.6}),
]
output = sampler.run(documents=docs)
docs = output["documents"]
# Only the highest-scoring document is returned.
assert len(docs) == 1
assert docs[0].content == "Sarajevo"

Usage Example

components:
  TopPSampler:
    type: haystack.components.samplers.top_p.TopPSampler
    init_parameters:
Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| top_p | float | 1.0 | Float between 0 and 1 representing the cumulative probability threshold for document selection. A value of 1.0 indicates no filtering (all documents are retained). |
| score_field | Optional[str] | None | Name of the field in each document's metadata that contains the score. If None, the default document score field is used. |
| min_top_k | Optional[int] | None | If specified, the minimum number of documents to return. If the top_p selects fewer documents, additional ones with the next highest scores are added to the selection. |
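
The min_top_k rule can be illustrated with a short sketch: when top-p selection returns fewer items than the minimum, the next-best items are appended until the minimum is met or the list runs out. The helper apply_min_top_k and its arguments are hypothetical and exist only for illustration. They are not part of the component's API.

```python
from typing import List, Optional


def apply_min_top_k(
    selected: List[int], ranked: List[int], min_top_k: Optional[int]
) -> List[int]:
    """Pad a top-p selection up to min_top_k items.

    selected: indices chosen by top-p, best first.
    ranked:   all indices ordered from best to worst score.
    """
    # Nothing to do if no minimum is set or it is already satisfied.
    if min_top_k is None or len(selected) >= min_top_k:
        return list(selected)

    chosen = set(selected)
    padded = list(selected)
    for i in ranked:
        if len(padded) >= min_top_k:
            break
        # Append the next-best index that top-p did not already pick.
        if i not in chosen:
            padded.append(i)
    return padded


# Top-p kept only index 2, but at least two documents are requested.
padded = apply_min_top_k(selected=[2], ranked=[2, 1, 0], min_top_k=2)
```

If min_top_k exceeds the number of available documents, this sketch simply returns all of them.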

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Document objects to be filtered. |
| top_p | Optional[float] | None | If specified, a float to override the cumulative probability threshold set during initialization. |