TransformersZeroShotDocumentClassifier
Classify documents based on the labels you provide and add the predicted label to the document's metadata.
Key Features
- Uses a Hugging Face zero-shot classification pipeline to classify documents without task-specific training.
- Adds the predicted label to the document's classification metadata field.
- Supports multi-label classification by setting multi_label=True.
- Runs classification on document content by default, or on any metadata field via classification_field.
- Compatible models include valhalla/distilbart-mnli-12-3, cross-encoder/nli-distilroberta-base, and cross-encoder/nli-deberta-v3-xsmall.
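As a rough illustration of the choice between document content and classification_field, the sketch below picks the text that would be classified for a document. It uses plain dictionaries in place of Haystack Document objects. The helper name text_to_classify and the error raised for a missing field are assumptions made for this example, not the component's actual code.

```python
from typing import Any, Dict, Optional


def text_to_classify(document: Dict[str, Any], classification_field: Optional[str] = None) -> str:
    """Return the text a classifier would see for this document (illustrative only)."""
    if classification_field is None:
        # Default: classify the document's main content.
        return document["content"]
    meta = document.get("meta", {})
    if classification_field not in meta:
        # Assumption for this sketch: a missing metadata field is treated as an error.
        raise ValueError(f"Metadata field '{classification_field}' not found in document")
    return meta[classification_field]


doc = {"content": "The battery lasts two days.", "meta": {"title": "Great phone"}}
print(text_to_classify(doc))           # classifies the content
print(text_to_classify(doc, "title"))  # classifies the 'title' metadata field
```

Classifying a short metadata field such as a title can be faster than classifying the full content, at the cost of giving the model less context.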
Configuration
- Drag the TransformersZeroShotDocumentClassifier component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Set the model name or path for zero-shot classification.
  - Set the labels list with the categories to classify documents into, such as ["positive", "negative"].
- Go to the Advanced tab to configure multi_label, classification_field, device, and huggingface_pipeline_kwargs.
Connections
TransformersZeroShotDocumentClassifier accepts a list of documents as input. Connect it to any component that outputs documents, such as TextFileToDocument.
It outputs a list of documents with the classification metadata field added. Connect its documents output to MetadataRouter to route documents based on their classification, or to DocumentWriter for storage.
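To make the routing step more concrete, here is a minimal plain-Python sketch of splitting classified documents by their predicted label. It is not MetadataRouter's implementation. Documents are modeled as dictionaries, the function name route_by_label is hypothetical, and the only assumption about the metadata shape is that the predicted label sits under classification.label, as in the routing rules used on this page.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def route_by_label(documents: List[Doc], labels: List[str]) -> Tuple[Dict[str, List[Doc]], List[Doc]]:
    """Group documents by meta['classification']['label']; collect the rest as unmatched."""
    routed: Dict[str, List[Doc]] = {label: [] for label in labels}
    unmatched: List[Doc] = []
    for doc in documents:
        label = doc.get("meta", {}).get("classification", {}).get("label")
        if label in routed:
            routed[label].append(doc)
        else:
            # Documents without a known label are kept separate rather than dropped.
            unmatched.append(doc)
    return routed, unmatched


docs = [
    {"content": "Loved it.", "meta": {"classification": {"label": "positive"}}},
    {"content": "Terrible.", "meta": {"classification": {"label": "negative"}}},
    {"content": "No label yet.", "meta": {}},
]
routed, unmatched = route_by_label(docs, ["positive", "negative"])
print(len(routed["positive"]), len(routed["negative"]), len(unmatched))
```

Keeping unmatched documents in their own bucket makes it easy to spot documents that were never classified.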
Source Code
To check this component's source code, open zero_shot_document_classifier.py in the Haystack repository.
Usage Examples
Basic Configuration
TransformersZeroShotDocumentClassifier:
  type: haystack.components.classifiers.zero_shot_document_classifier.TransformersZeroShotDocumentClassifier
  init_parameters:
    model: cross-encoder/nli-deberta-v3-xsmall
    labels:
      - positive
      - negative
    multi_label: false
    token:
      type: env_var
      env_vars:
        - HF_API_TOKEN
        - HF_TOKEN
      strict: false
In this index, TransformersZeroShotDocumentClassifier classifies documents by sentiment (positive or negative) and sends classified documents to MetadataRouter. MetadataRouter then routes positive documents to one document store and negative documents to another.
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  TransformersZeroShotDocumentClassifier:
    type: haystack.components.classifiers.zero_shot_document_classifier.TransformersZeroShotDocumentClassifier
    init_parameters:
      model: cross-encoder/nli-deberta-v3-xsmall
      labels:
        - positive
        - negative
      multi_label: false
      classification_field:
      device:
      token:
        type: env_var
        env_vars:
          - HF_API_TOKEN
          - HF_TOKEN
        strict: false
      huggingface_pipeline_kwargs:
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        positive:
          operator: OR
          conditions:
            - field: classification.label
              operator: ==
              value: positive
        negative:
          operator: OR
          conditions:
            - field: classification.label
              operator: ==
              value: negative
  DocumentWriter_Positive:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'positive-sentiment-index'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_Negative:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'negative-sentiment-index'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:  # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: TransformersZeroShotDocumentClassifier.documents
  - sender: TransformersZeroShotDocumentClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.positive
    receiver: DocumentWriter_Positive.documents
  - sender: MetadataRouter.negative
    receiver: DocumentWriter_Negative.documents

inputs:  # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources

max_runs_per_component: 100

metadata: {}
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int (Optional) | 1 | Batch size used for processing the content in each document. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with an added metadata field called classification. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | | The name or path of a Hugging Face model for zero-shot document classification. |
| labels | List[str] | | The set of possible class labels to classify each document into, for example, ["positive", "negative"]. The labels depend on the selected model. |
| multi_label | bool | False | Whether multiple candidate labels can be true. If False, the scores are normalized so that the label likelihoods for each sequence sum to 1. If True, the labels are considered independent, and the probability for each candidate is computed with a softmax of the entailment score vs. the contradiction score. |
| classification_field | Optional[str] | None | Name of the document's meta field to use for classification. If not set, Document.content is used by default. |
| device | Optional[ComponentDevice] | None | The device on which the model is loaded. If None, the default device is automatically selected. If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter. |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The Hugging Face token to use as HTTP bearer authorization. |
| huggingface_pipeline_kwargs | Optional[Dict[str, Any]] | None | Dictionary containing keyword arguments used to initialize the Hugging Face pipeline for text classification. |
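The two normalization modes described for multi_label in the table above can be compared with a small numeric sketch. It assumes each label comes with a raw entailment logit and a contradiction logit from an NLI model. The function names and the example logits are made up for illustration, and this is not the Hugging Face pipeline's code.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment: List[float]) -> List[float]:
    """multi_label=False: softmax across the entailment logits of all labels, so scores sum to 1."""
    return softmax(entailment)


def multi_label_scores(entailment: List[float], contradiction: List[float]) -> List[float]:
    """multi_label=True: per label, softmax of contradiction vs. entailment; keep the entailment share."""
    return [softmax([c, e])[1] for e, c in zip(entailment, contradiction)]


# Hypothetical logits for the labels ["positive", "negative"].
entailment_logits = [2.0, 0.5]
contradiction_logits = [-1.0, 0.5]

print(single_label_scores(entailment_logits))                      # competes across labels
print(multi_label_scores(entailment_logits, contradiction_logits))  # each label scored on its own
```

With multi_label set to True, the scores no longer need to sum to 1, so several labels can score high for the same document.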
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
| batch_size | int | 1 | Batch size used for processing the content in each document. |
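For intuition about batch_size, the following sketch groups a list of texts into batches of a given size. It only illustrates the grouping idea. The helper name batched is hypothetical, and how the underlying pipeline actually batches its inputs is not shown here.

```python
from typing import Iterator, List


def batched(texts: List[str], batch_size: int = 1) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = ["first document", "second document", "third document"]
for batch in batched(texts, batch_size=2):
    print(len(batch), batch)
```

Larger batches usually mean fewer model calls but more memory per call, so the right value depends on the model and the device.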