QueryClassifier
QueryClassifier distinguishes between different types of queries and routes them to the pipeline branch that can handle them best. It can categorize queries into keyword-based and natural language queries.
Overview
A common use case for QueryClassifier is in a question answering pipeline where it routes keyword queries to a less computationally expensive sparse Retriever and natural language questions to a dense Retriever. This helps you save time and can produce better results for your keyword queries.
To handle these tasks, QueryClassifier uses a classification model.
Basic Information
- Position in a pipeline: Use QueryClassifier at the beginning of the query pipeline.
- Input and output: QueryClassifier takes a query as input and returns a classified query as output.
- Available classes: There are two types of QueryClassifier: TransformersQueryClassifier and SklearnQueryClassifier.
When used in a pipeline, it acts as a decision node, which means it routes the queries to a specific node, depending on how the query is classified.
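The decision-node behavior described above can be illustrated with a minimal, library-free sketch. It is not how QueryClassifier is implemented: the real node uses a classification model, while `toy_classify` below is a crude hypothetical heuristic, and `dense_branch`, `sparse_branch`, and `route` are made-up stand-ins for pipeline branches. The edge names `output_1` and `output_2` mirror the ones used in the pipeline example later on this page.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for pipeline branches; in a real pipeline these are Retriever nodes.
def dense_branch(query: str) -> str:
    return f"dense:{query}"


def sparse_branch(query: str) -> str:
    return f"sparse:{query}"


# Assumption: a small word list is enough for this illustration.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def toy_classify(query: str) -> str:
    """Crude heuristic standing in for the classification model.

    Returns the name of the outgoing edge: "output_1" for natural language
    queries, "output_2" for keyword queries.
    """
    tokens = query.lower().split()
    looks_natural = query.strip().endswith("?") or (
        bool(tokens) and tokens[0] in QUESTION_WORDS
    )
    return "output_1" if looks_natural else "output_2"


def route(
    query: str,
    classify: Callable[[str], str],
    branches: Dict[str, Callable[[str], str]],
) -> str:
    """Act as a decision node: pick one branch based on the classification."""
    edge = classify(query)
    return branches[edge](query)


BRANCHES = {"output_1": dense_branch, "output_2": sparse_branch}

print(route("Who is the father of Arya Stark?", toy_classify, BRANCHES))
print(route("arya stark father", toy_classify, BRANCHES))
```

Only one branch runs for each query, which is where the time savings for keyword queries come from.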
TransformersQueryClassifier
This QueryClassifier is sensitive to the syntax of a sentence as it uses a transformers model to classify queries.
The default model for TransformersQueryClassifier is shahrukhx01/bert-mini-finetune-question-detection. It was trained using the mini BERT architecture and is about 50 MB in size, which allows relatively fast inference on a CPU.
Main features:
- Uses a transformers model to classify an incoming query
- More accurate than SklearnQueryClassifier
- Supports zero-shot classification
Arguments
These are the arguments you can specify for TransformersQueryClassifier:
Argument | Type | Mandatory | Possible Values | Description |
---|---|---|---|---|
model_name_or_path | String | Yes | - | Specifies the model you want to use. You can either type a path to the model stored on your computer or the name of a public model from Hugging Face. We recommend the shahrukhx01/bert-mini-finetune-question-detection model. It was trained on the mini BERT architecture and can distinguish between keyword queries and natural language queries (questions or statements). |
model_version | String | No | Tag name Branch name Commit hash | The version of the model from Hugging Face. |
tokenizer | String | No | - | The name of the tokenizer, usually the same as the model name. |
use_gpu | Boolean | Yes | True/False Default: True | Specifies if GPU should be used. |
task | String | Yes | text-classification zero-shot-classification | Specifies the type of classification the node should perform. |
labels | A list of strings | Yes | - | If you choose text-classification as task and provide an ordered list of labels, the first label corresponds to output_1, the second label corresponds to output_2, and so on. The labels must match the model labels; only their order can differ. If you choose zero-shot-classification as task, these are the candidate labels. |
batch_size | Integer | Yes | Default: 16 | The number of queries you want to process at one time. |
progress_bar | Boolean | Yes | True/False Default: True | Shows the progress bar when processing queries. |
use_auth_token | String or Boolean | No | - | Specifies the API token used to download private models from Hugging Face. If you set it to True, it uses the token generated when running transformers-cli login. |
devices | A list of strings or torch.device objects | No | - | A list of torch devices, such as cuda, cpu, or mps, to limit inference to specific devices. Example: [torch.device("cuda:0"), "mps", "cuda:1"]. If you set use_gpu to False, this parameter is not used and a single cpu device is used for inference. |
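To make the task and labels arguments more concrete, here is a hedged sketch of a zero-shot setup. The helper names (`zero_shot_kwargs`, `label_to_output`, `build_classifier`) are hypothetical, and the model typeform/distilbert-base-uncased-mnli is only an assumed example of a public zero-shot model, not a recommendation from this page. The Haystack import sits inside `build_classifier` so the rest of the sketch runs without Haystack installed.

```python
from typing import Any, Dict, List


def zero_shot_kwargs(labels: List[str]) -> Dict[str, Any]:
    """Collect arguments from the table above for a zero-shot classifier."""
    return {
        # Assumption: any public zero-shot model from Hugging Face could be used here.
        "model_name_or_path": "typeform/distilbert-base-uncased-mnli",
        "task": "zero-shot-classification",
        "labels": labels,  # candidate labels for zero-shot classification
        "use_gpu": False,  # run on a single cpu device
    }


def label_to_output(labels: List[str]) -> Dict[str, str]:
    """Map each label to its outgoing edge: first label -> output_1, and so on."""
    return {label: f"output_{i}" for i, label in enumerate(labels, start=1)}


def build_classifier(labels: List[str]):
    """Create the node. Needs Haystack 1.x installed and downloads the model."""
    from haystack.nodes import TransformersQueryClassifier

    return TransformersQueryClassifier(**zero_shot_kwargs(labels))


labels = ["music", "cinema"]
print(label_to_output(labels))
```

With these labels, you would connect the branch for music queries to `QueryClassifier.output_1` and the branch for cinema queries to `QueryClassifier.output_2`.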
SklearnQueryClassifier
This QueryClassifier class uses a lightweight sklearn model to classify queries. It's less accurate than TransformersQueryClassifier but needs fewer resources.
Main features:
- Lightweight
- Uses a sklearn model
Arguments
Here are the arguments you can specify for SklearnQueryClassifier:
Argument | Type | Mandatory | Possible Values | Description |
---|---|---|---|---|
model_name_or_path | String | Yes | - | The path or URL of a gradient boosting-based binary classifier that distinguishes either keywords from statements and questions, or statements from questions. You can use the following pre-trained query classifier: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle To learn how it was trained and how it performed, see the readme. |
vectorizer_name_or_path | String | Yes | - | The path or URL of an ngram-based TF-IDF vectorizer for extracting features from the query. You can use the following pre-trained query vectorizer: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle To learn how it was trained and how it performed, see the readme. |
batch_size | Integer | No | - | Specifies the number of queries you want to process at one time. |
progress_bar | Boolean | Yes | True/False Default: True | Shows a progress bar when processing the queries. |
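As a rough sketch of how the arguments above fit together, the snippet below collects them for SklearnQueryClassifier using the pre-trained files listed in the table. The helper names (`sklearn_classifier_kwargs`, `build_sklearn_classifier`) are hypothetical. The Haystack import is kept inside `build_sklearn_classifier`, so the argument handling can be checked without Haystack installed.

```python
from typing import Any, Dict, Optional

# Pre-trained classifier and vectorizer listed in the arguments table
MODEL_URL = (
    "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/"
    "gradboost_query_classifier/model.pickle"
)
VECTORIZER_URL = (
    "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/"
    "gradboost_query_classifier/vectorizer.pickle"
)


def sklearn_classifier_kwargs(batch_size: Optional[int] = None) -> Dict[str, Any]:
    """Collect the arguments; batch_size is only passed when you set it."""
    kwargs: Dict[str, Any] = {
        "model_name_or_path": MODEL_URL,
        "vectorizer_name_or_path": VECTORIZER_URL,
        "progress_bar": True,
    }
    if batch_size is not None:
        kwargs["batch_size"] = batch_size
    return kwargs


def build_sklearn_classifier(batch_size: Optional[int] = None):
    """Create the node. Needs Haystack 1.x installed and downloads both files."""
    from haystack.nodes import SklearnQueryClassifier

    return SklearnQueryClassifier(**sklearn_classifier_kwargs(batch_size))


print(sorted(sklearn_classifier_kwargs(batch_size=8)))
```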
Handling Different Query Types with QueryClassifier
Queries come in different shapes: keywords, questions, and statements. You can optimize your search by routing each query type to the node that handles it best, which also saves time and resources.
Query Types
There are two main query types you may want to distinguish between: keywords and natural language queries. Keyword queries are just keywords. They don't have a sentence structure, and the order of words doesn't matter, for example:
- last year results
- results 2022
- USA president
Natural language queries can be questions or statements. They're complete, grammatical sentences, such as:
- What were the results last year?
- What were the results in 2022?
- Who is the president of the USA
or
- Last year's results were good.
- Results in 2022 were not satisfying.
- The president of the USA is Joe Biden.
(Pipelines in deepset Cloud don't need a question mark to process a query.)
Optimizing the Pipeline to Handle Each Query Type
You can adjust the architecture of your pipeline so that only statements and questions are routed to the Reader, while for keywords, the pipeline performs a regular document search. This way, you save time and computational resources.
Here's the code for an example pipeline with this setup:
import os

# Replace the placeholder with your own deepset Cloud API key
os.environ["DEEPSET_CLOUD_API_KEY"] = "<your-api-key>"
os.environ["DEEPSET_CLOUD_API_ENDPOINT"] = "https://api.cloud.deepset.ai/api/v1"

from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier, FARMReader, BM25Retriever, EmbeddingRetriever
from haystack.document_stores import DeepsetCloudDocumentStore

document_store = DeepsetCloudDocumentStore()
query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")
embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
    top_k=20,
)
bm25_retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/deberta-v3-base-squad2", use_gpu=True)

# output_1 carries questions and statements, output_2 carries keyword queries
pipe = Pipeline()
pipe.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=bm25_retriever, name="BM25", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=reader, name="QAReader", inputs=["EmbeddingRetriever"])

# Pass a question -> run EmbeddingRetriever + Reader -> return answers
res_1 = pipe.run(query="Who is the father of Arya Stark?")

# Pass keywords -> run only BM25Retriever -> return Documents
res_2 = pipe.run(query="arya stark father")

# This example contains just the query pipeline, without the indexing pipeline
version: 1.9.1
name: QueryClassifierPipeline
components:
  # Here's how you specify QueryClassifier:
  - name: QueryClassifier
    type: TransformersQueryClassifier
    params:
      model_name_or_path: shahrukhx01/bert-mini-finetune-question-detection
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: DenseRetriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1
      model_format: sentence_transformers
      top_k: 20
  - name: SparseRetriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: deepset/deberta-v3-base-squad2
      use_gpu: True
pipelines:
  - name: query
    nodes:
      - name: QueryClassifier
        inputs: [Query]
      - name: DenseRetriever
        inputs: [QueryClassifier.output_1]
      - name: SparseRetriever
        inputs: [QueryClassifier.output_2]
      - name: Reader
        inputs: [DenseRetriever]
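Because the two branches end in different nodes, the results of the Python example differ in shape: a question goes through the Reader and comes back with answers, while a keyword query stops at the sparse Retriever and comes back with documents only. The helper below is a small sketch of how you might handle both cases. `summarize_result` is a hypothetical name, and the sketch assumes the result is a dictionary with `answers` and `documents` keys whose items expose `.answer` and `.content` attributes, as Haystack 1.x objects do. The `SimpleNamespace` objects are fake stand-ins so the sketch runs without a pipeline.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any], max_items: int = 3) -> List[str]:
    """Return short lines describing the answers or, if there are none, the documents."""
    answers = result.get("answers") or []
    if answers:
        return [f"answer: {a.answer}" for a in answers[:max_items]]
    documents = result.get("documents") or []
    return [f"document: {d.content[:80]}" for d in documents[:max_items]]


# Fake results that imitate the two branches (assumed shapes, not real pipeline output)
question_result = {
    "answers": [SimpleNamespace(answer="Eddard")],
    "documents": [SimpleNamespace(content="Arya Stark is the daughter of Eddard Stark.")],
}
keyword_result = {
    "documents": [SimpleNamespace(content="Arya Stark is the daughter of Eddard Stark.")],
}

print(summarize_result(question_result))
print(summarize_result(keyword_result))
```

In the Python example above, you could call `summarize_result(res_1)` and `summarize_result(res_2)` in the same way.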