Tutorial: Creating a Custom RegexBooster Component
Write your own component, upload it to deepset Cloud, and use it in your pipelines. In this tutorial, you'll create a RegexBooster component that adjusts document scores based on regex patterns. You'll then learn how to add it to your pipelines.
- Level: Intermediate
- Time to complete: 20 minutes
- Prerequisites:
- Good knowledge of Python.
- Basic knowledge of regular expressions.
- Understanding of how components and pipelines work. Read the following resources:
- A GitHub account and basic knowledge of working with GitHub repositories.
- A deepset Cloud API key. For instructions, see Generate an API Key.
- Goal: After completing this tutorial, you'll have created a custom component that boosts document scores based on regex patterns. You'll then have uploaded this component to your deepset Cloud workspace and added it to a pipeline.
Prepare the Custom Components Template
You'll create the component using a template we provide. First, we need to prepare it.
- Fork the dc-custom-component-template GitHub repository.
- Clone the forked repository to your local machine.
- Navigate to the ./dc-custom-component-template/src/dc_custom_component/example_components/ directory.
- Delete the preprocessors folder.
- Rename the example_components folder to custom_components and open it.
  The forked repo should now have the following structure: ./dc-custom-component-template/src/dc_custom_component/custom_components/rankers/.
- Open the rankers folder and rename the keyword_booster.py file to regex_booster.py.
Result: Your template now has the following structure:
dc-custom-component-template/
├── src/
│   └── dc_custom_component/
│       └── custom_components/
│           ├── __about__.py
│           ├── __init__.py
│           └── rankers/
│               └── regex_booster.py
├── pyproject.toml
├── readme.md
└── tests/
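If you want to double-check the renames before moving on, a short script can confirm that the key paths exist. This is a minimal sketch using only the standard library; the `EXPECTED_PATHS` list and the `missing_paths` helper are illustrative names, not part of the template.

```python
from pathlib import Path
from typing import List

# Paths the tutorial expects after the rename steps, relative to the repository root.
# (Illustrative list; extend it if you want to check more files.)
EXPECTED_PATHS = [
    "src/dc_custom_component/custom_components/rankers/regex_booster.py",
    "pyproject.toml",
    "tests",
]


def missing_paths(repo_root: Path) -> List[str]:
    """Return the expected paths that do not exist under repo_root."""
    return [p for p in EXPECTED_PATHS if not (repo_root / p).exists()]
```

Calling `missing_paths(Path("./dc-custom-component-template"))` returns an empty list when everything is in place.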
Set Up a Virtual Environment
- Install Hatch by running:
  pip install hatch
- In your terminal, navigate to the root directory (./dc-custom-component-template) of your cloned repository.
- Create a virtual environment by running:
  hatch shell
Result: The virtual environment is running.
Implement RegexBooster
- Open the ./dc-custom-component-template/src/dc_custom_component/custom_components/rankers/regex_booster.py file.
- Paste the following code, replacing all the file contents, and save the file.

import re
from typing import Dict, List

from haystack import component, Document


@component
class RegexBooster:
    r"""
    A component for boosting document scores based on regex patterns.

    This component adjusts the scores of documents based on whether their content matches
    specified regular expression patterns. After adjusting scores, it sorts the documents
    in descending order of their new scores.

    Note:
    - Regex matching is case-insensitive by default.
    - Multiple regex patterns can match a single document, in which case the boosts are multiplied together.
    - Documents that don't match any patterns keep their original score.
    - The component assumes documents already have a 'score' attribute. Documents without a score
      are not boosted and are sorted as if their score were 0.

    Example:
    ```python
    booster = RegexBooster({
        r"\bpython\b": 1.5,  # Boost documents mentioning "python" by 50%
        r"machine\s+learning": 1.3,  # Boost "machine learning" by 30%
        r"\bsql\b": 0.8,  # Reduce score for documents mentioning "sql" by 20%
    })
    ```

    In this example, a document containing both "python" and "machine learning" would have its score
    multiplied by 1.5 * 1.3 = 1.95, effectively boosting it by 95%.
    """

    def __init__(self, regex_boosts: Dict[str, float]):
        """
        Initialize the component.

        :param regex_boosts: A dictionary where:
            - Keys are string representations of regular expression patterns.
            - Values are float numbers representing the boost factor.
              The boost factor must be greater than 1.0 to increase the score,
              or between 0 and 1 to decrease it. A boost of exactly 1.0 will have no effect.
        """
        # Compile each pattern once, case-insensitively, so run() can reuse it.
        self.regex_boosts = {re.compile(k, re.IGNORECASE): v for k, v in regex_boosts.items()}

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        """
        Apply regex-based score boosting to the input documents.

        :param documents: The list of documents to process.

        Returns:
            A dictionary with a single key 'documents', containing the list of processed documents,
            sorted by their new scores.
        """
        for regex, boost in self.regex_boosts.items():
            for doc in documents:
                # Only documents that already have a score are boosted.
                if doc.score is not None and regex.search(doc.content):
                    doc.score *= boost

        # Documents without a score are sorted as if their score were 0.
        documents = sorted(documents, key=lambda x: x.score or 0, reverse=True)
        return {"documents": documents}
Or follow this recipe for a step-by-step explanation of the code.
- Format your code by running hatch run code-quality:all from the project root directory.
- Update the RegexBooster version:
  - Open the file ./dc-custom-component-template/src/dc_custom_component/__about__.py.
  - Change the version to __version__ = "1.0.0".
Result: The Python implementation for the RegexBooster component is now in the regex_booster.py file, the code is formatted, and the component version is updated.
Add Tests
- Open the ./dc-custom-component-template/tests folder.
- Create a file called test_regex_booster.py.
- Paste this code into this file and save it:

import pytest
from typing import List, Dict, Any

from haystack import component, Document, Pipeline
from haystack.components.joiners import DocumentJoiner

from dc_custom_component.custom_components.rankers.regex_booster import RegexBooster


# Unit Tests
def test_regex_booster_initialization():
    booster = RegexBooster({"pattern": 1.5})
    assert len(booster.regex_boosts) == 1
    assert list(booster.regex_boosts.values())[0] == 1.5


def test_regex_booster_case_insensitivity():
    booster = RegexBooster({r"\bPython\b": 1.5})
    doc = Document(content="python is great", score=1.0)
    result = booster.run(documents=[doc])
    assert result["documents"][0].score == 1.5


def test_regex_booster_multiple_patterns():
    booster = RegexBooster({r"\bPython\b": 1.5, r"\bgreat\b": 1.2})
    doc = Document(content="Python is great", score=1.0)
    result = booster.run(documents=[doc])
    assert result["documents"][0].score == 1.5 * 1.2


def test_regex_booster_no_match():
    booster = RegexBooster({r"\bJava\b": 1.5})
    doc = Document(content="Python is great", score=1.0)
    result = booster.run(documents=[doc])
    assert result["documents"][0].score == 1.0


def test_regex_booster_sorting():
    booster = RegexBooster({r"\bPython\b": 1.5, r"\bJava\b": 1.2})
    docs = [
        Document(content="Java is okay", score=1.0),
        Document(content="Python is great", score=1.0),
        Document(content="C++ is fast", score=1.0),
    ]
    result = booster.run(documents=docs)
    assert [doc.content for doc in result["documents"]] == ["Python is great", "Java is okay", "C++ is fast"]


def test_regex_booster_no_score():
    booster = RegexBooster({r"\bPython\b": 1.5})
    doc = Document(content="Python is great")
    result = booster.run(documents=[doc])
    assert result["documents"][0].score is None


# Integration Tests
@component
class MockRetriever:
    @component.output_types(documents=List[Document])
    def run(self, query: str) -> Dict[str, Any]:
        docs = [
            Document(content="Python is a programming language", score=0.9),
            Document(content="Java is also a programming language", score=0.7),
            Document(content="Machine learning is a subset of AI", score=0.5),
        ]
        return {"documents": docs}


@pytest.fixture
def regex_pipeline():
    retriever = MockRetriever()
    regex_booster = RegexBooster({r"\bPython\b": 1.5, r"\bAI\b": 1.3})
    joiner = DocumentJoiner()

    pipeline = Pipeline()
    pipeline.add_component("retriever", retriever)
    pipeline.add_component("regex_booster", regex_booster)
    pipeline.add_component("joiner", joiner)

    pipeline.connect("retriever.documents", "regex_booster.documents")
    pipeline.connect("regex_booster.documents", "joiner.documents")

    return pipeline


def test_regex_booster_in_pipeline(regex_pipeline):
    results = regex_pipeline.run(data={"query": "programming languages"})
    documents = results["joiner"]["documents"]

    assert len(documents) == 3
    assert documents[0].content == "Python is a programming language"
    assert pytest.approx(documents[0].score, 0.01) == 0.9 * 1.5
    assert documents[1].content == "Java is also a programming language"
    assert pytest.approx(documents[1].score, 0.01) == 0.7
    assert documents[2].content == "Machine learning is a subset of AI"
    assert pytest.approx(documents[2].score, 0.01) == 0.5 * 1.3


def test_regex_booster_pipeline_no_matches():
    @component
    class NoMatchRetriever:
        @component.output_types(documents=List[Document])
        def run(self, query: str) -> Dict[str, Any]:
            return {
                "documents": [
                    Document(content="C++ is a compiled language", score=0.8),
                    Document(content="Ruby is dynamic", score=0.6),
                ]
            }

    new_pipeline = Pipeline()
    new_pipeline.add_component("retriever", NoMatchRetriever())
    new_pipeline.add_component("regex_booster", RegexBooster({r"\bPython\b": 1.5, r"\bAI\b": 1.3}))
    new_pipeline.add_component("joiner", DocumentJoiner())

    new_pipeline.connect("retriever.documents", "regex_booster.documents")
    new_pipeline.connect("regex_booster.documents", "joiner.documents")

    results = new_pipeline.run(data={"query": "programming languages"})
    documents = results["joiner"]["documents"]

    assert len(documents) == 2
    assert documents[0].content == "C++ is a compiled language"
    assert pytest.approx(documents[0].score, 0.01) == 0.8
    assert documents[1].content == "Ruby is dynamic"
    assert pytest.approx(documents[1].score, 0.01) == 0.6
- From the root directory of your project, where pyproject.toml is located, run:
  hatch run tests
If the tests pass, you can upload your component to deepset Cloud.
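The integration tests compare scores with pytest.approx and a relative tolerance of 0.01 instead of ==, because exact equality on floating-point products can be brittle. As a rough standard-library analogue (math.isclose is similar in spirit to pytest.approx but not identical in how it applies the tolerance), the same idea looks like this:

```python
import math

# A boosted score like the one checked in the integration test.
boosted_score = 0.9 * 1.5
expected = 1.35

# Compare within a 1% relative tolerance rather than with ==.
print(math.isclose(boosted_score, expected, rel_tol=0.01))  # True
```

A tolerance this loose only guards against rounding noise; it would still catch a boost that was not applied, since 0.9 and 1.35 differ by far more than 1%.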
Import RegexBooster to deepset Cloud
- On macOS and Linux:
  - Set up your deepset Cloud API key:
    export API_KEY=<your_api_key>
  - Navigate to ./dc-custom-component-template and run:
    hatch run dc:build-and-push
    This creates a .zip file called custom_component.zip in the dist directory and uploads it to deepset Cloud.
- On Windows:
  - Zip the repository from the template folder by running:
    Compress-Archive -Path .\* -DestinationPath ..\custom_component.zip -Force
  - In the project root folder, run the following command, replacing api_XXX with your actual API key:
    curl --request POST \
      --url https://api.cloud.deepset.ai/api/v2/custom_components \
      --header 'accept: application/json' \
      --header 'Authorization: Bearer api_XXX' \
      --form 'file=@"/<parent_folder>/custom_component.zip";type=application/zip'
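If PowerShell is not available, the zip step could also be done with Python's standard library. This is a minimal sketch, not part of the template's tooling: `zip_repository` is a hypothetical helper, and it archives everything inside the folder, including hidden files.

```python
import shutil
from pathlib import Path


def zip_repository(repo_root: Path) -> Path:
    """Zip the contents of repo_root into custom_component.zip in its parent folder."""
    repo_root = repo_root.resolve()
    # Writing the archive outside repo_root keeps it from being included in itself.
    base_name = repo_root.parent / "custom_component"
    archive = shutil.make_archive(str(base_name), "zip", root_dir=str(repo_root))
    return Path(archive)
```

Calling `zip_repository(Path("."))` from the template folder produces ../custom_component.zip, the same location the Compress-Archive command writes to.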
Verify if the import was correct by running:
curl --request GET \
--url https://api.cloud.deepset.ai/api/v2/custom_components \
--header 'accept: application/json' \
--header 'Authorization: Bearer api_XXX'
If the component status is finished, it was uploaded correctly and is ready to be used in pipelines.
Result: RegexBooster is in deepset Cloud, ready to be added to a pipeline.
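The verification request can also be sent from Python. The sketch below mirrors the curl call above using only the standard library; the function names are illustrative, and because the shape of the response body is not described here, it is returned as raw text rather than parsed.

```python
import urllib.request

API_URL = "https://api.cloud.deepset.ai/api/v2/custom_components"


def build_status_request(api_key: str) -> urllib.request.Request:
    """Build the GET request that lists custom components, mirroring the curl call."""
    return urllib.request.Request(
        API_URL,
        headers={"accept": "application/json", "Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_components(api_key: str) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_status_request(api_key)) as response:
        return response.read().decode("utf-8")
```

For example, `print(fetch_components(os.environ["API_KEY"]))` (after `import os`) reuses the key exported in the macOS and Linux steps; look for the finished status in the output.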
Add RegexBooster to a Pipeline
Let's quickly create a pipeline. If you have one ready, open it for editing.
- Go to Pipeline Templates and switch to the deepset Cloud 2.0 tab.
- Choose Document Search templates, hover over Semantic Document Search, and click Use template.
- Click Create a pipeline. You're redirected to the Pipelines page.
- Find your newly created pipeline (it should be at the top of the page), click More actions next to it, and choose Edit.
- Open the Query Pipeline tab and in the components section, replace ranker with regex_booster:
  regex_booster:
    type: dc_custom_component.custom_components.rankers.regex_booster.RegexBooster
    init_parameters:
      regex_boosts: # This example boosts medical documents with these keywords
        '\bcovid-19\b|\bcoronavirus\b': 2.0
        '\bcancer\b': 1.5
        '\basthma\b': 1.3
        '\btreatment\b': 1.2
        '\bsymptoms\b': 1.1
- In the connections section, replace ranker.documents with regex_booster.documents. RegexBooster should receive documents from the retriever:
  connections:
    - sender: query_embedder.embedding
      receiver: embedding_retriever.query_embedding
    - sender: embedding_retriever.documents
      receiver: regex_booster.documents
- Remove ranker.query from the inputs section.
- In the outputs section, replace embedding_retriever.documents with regex_booster.documents as this is the final set of documents we want as the pipeline's output:
  outputs:
    documents: "regex_booster.documents"
This is what your query pipeline should look like:
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/v2.0/docs/create-a-pipeline#create-a-pipeline-using-pipeline-editor.
# This section defines components that you want to use in your pipelines. Each component must have a name and a type. You can also set the component's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying the connections in the pipeline.
# Type is the class path of the component. You can check the type on the component's documentation page.
components:
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"

  embedding_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20 # The number of results to return

  regex_booster:
    type: dc_custom_component.custom_components.rankers.regex_booster.RegexBooster
    init_parameters:
      regex_boosts: # This example boosts medical documents with these keywords
        '\bcovid-19\b|\bcoronavirus\b': 2.0
        '\bcancer\b': 1.5
        '\basthma\b': 1.3
        '\btreatment\b': 1.2
        '\bsymptoms\b': 1.1

connections: # Defines how the components are connected
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: regex_booster.documents

max_loops_allowed: 100

inputs: # Define the inputs for your pipeline
  query: # These components will receive the query as input
    - "query_embedder.text"
  filters: # These components will receive a potential query filter as input
    - "embedding_retriever.filters"

outputs: # Defines the output of your pipeline
  documents: "regex_booster.documents" # The output of the pipeline is the retrieved documents
- Save and deploy your pipeline.
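Before deploying, it can help to try the regex_boosts patterns from the configuration locally. The sketch below applies them case-insensitively, as the component does, and multiplies the boosts of every pattern that matches; `combined_boost` is a hypothetical helper and the sample sentences are made up for illustration.

```python
import re

# The same patterns and boosts as in the pipeline configuration.
regex_boosts = {
    r"\bcovid-19\b|\bcoronavirus\b": 2.0,
    r"\bcancer\b": 1.5,
    r"\basthma\b": 1.3,
    r"\btreatment\b": 1.2,
    r"\bsymptoms\b": 1.1,
}


def combined_boost(text: str) -> float:
    """Return the product of the boosts of all patterns found in text."""
    factor = 1.0
    for pattern, boost in regex_boosts.items():
        if re.search(pattern, text, re.IGNORECASE):
            factor *= boost
    return factor


print(combined_boost("New asthma treatment shows promise"))    # asthma and treatment both match
print(combined_boost("Coronavirus symptoms in children"))      # coronavirus and symptoms both match
print(combined_boost("Cancerous cells under the microscope"))  # no match
```

The last sentence gets no boost because \b requires a word boundary: "Cancerous" contains "cancer" but the pattern only matches the standalone word. Drop the trailing \b if you want such variants to count.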
Test the Pipeline with RegexBooster
Let's see if it works on actual files.
- Download this set of medical articles and extract it to your machine. You should have a set of 10 articles.
- In deepset Cloud, open the same workspace where you created the Semantic_Document_Search pipeline and go to Files.
- Click Upload files, choose the files you extracted in step 1, and click Upload.
- Let's create a document search pipeline without RegexBooster to compare the results:
  - Go to Pipeline Templates and switch to the deepset Cloud 2.0 tab.
  - Choose Document Search templates, hover over Semantic Document Search, and click Use template.
  - Change the pipeline name to no_regex and click Create Pipeline.
  - Find the no_regex pipeline on the Pipelines page and deploy it.
- When the pipeline is indexed, go to Playground and choose the Semantic_Document_Search pipeline.
- In the Search field, type the following query: Tell me about ongoing medical research.
  You can see that the top three documents are about coronavirus, cancer, and asthma. All of them are keywords RegexBooster prioritizes.
- Now, switch to the no_regex pipeline and repeat the same query. The top three documents are about Alzheimer's, coronavirus, and cancer. The ranking is different because this pipeline doesn't prioritize any keywords.
Congratulations! You have implemented your custom component and imported it to deepset Cloud. You can now run pipelines with RegexBooster!