RegexTextExtractor
Extract text from chat messages or strings using a regex pattern.
Key Features
- Extracts specific text from strings or ChatMessage lists using a regular expression pattern.
- Supports capture groups to isolate exactly the text you need.
- Returns the entire match when no capture group is defined.
- Configurable behavior when no match is found: return an empty string or skip output.
- Useful for parsing structured output from LLMs, such as URLs, codes, or tagged content.
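The difference between a pattern with a capture group and one without can be sketched with Python's standard re module. This is only an illustration of the regex behavior described above, not the component's source code; the sample text and variable names are made up for the example.

```python
import re

# Hypothetical input text, made up for illustration.
text = 'See <issue url="https://github.com/example/repo/issues/123">broken link</issue>'

# With a capture group, group(1) holds only the text inside the parentheses.
with_group = re.search(r'<issue url="([^"]+)">', text)
url_only = with_group.group(1)

# Without a capture group, group(0) holds the entire match.
without_group = re.search(r'<issue url="[^"]+">', text)
whole_tag = without_group.group(0)

print(url_only)
print(whole_tag)
```

Here url_only contains just the URL, while whole_tag contains the full opening tag, including the attribute name and quotes.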
Configuration
- Drag the RegexTextExtractor component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Enter the regular expression pattern to use for extraction. Include a capture group to extract only the desired text. For example, <issue url="([^"]+)"> captures the URL from the tag.
- Go to the Advanced tab to configure additional settings:
  - Set return_empty_on_no_match to control what happens when the pattern finds no match.
Connections
RegexTextExtractor accepts either a string or a list of ChatMessage objects as input. Connect it to the Input component or to any component that produces string or ChatMessage output, such as LLM.
Connect its captured_text output to any component that consumes a text string.
Source Code
To check this component's source code, open regex_text_extractor.py in the Haystack repository.
Usage Examples
Basic Configuration
RegexTextExtractor:
  type: haystack.components.extractors.regex_text_extractor.RegexTextExtractor
  init_parameters:
    regex_pattern: '<issue url="([^"]+)">'
    return_empty_on_no_match: true
In a Pipeline
This query pipeline uses an LLM to analyze text and produce structured output, then uses RegexTextExtractor to extract a specific issue URL from the response:
# haystack-pipeline
components:
  RegexTextExtractor:
    type: haystack.components.extractors.regex_text_extractor.RegexTextExtractor
    init_parameters:
      regex_pattern: '<issue url="([^"]+)">'
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
  LLM:
    type: haystack.components.generators.chat.llm.LLM
    init_parameters:
      chat_generator:
        init_parameters:
          model: gpt-5.5
        type: haystack.components.generators.chat.openai_responses.OpenAIResponsesChatGenerator
      system_prompt: ""
      user_prompt: >-
        {% message role="user" %}
        You are a helpful assistant that identifies issues in text.
        When you find an issue, format your response as:
        <issue url="https://github.com/example/repo/issues/123">Description of
        the issue</issue>
        Analyze the following text and identify any issues:
        {{ query }}
        {% endmessage %}
      required_variables: "*"
      streaming_callback:
connections:
  - sender: RegexTextExtractor.captured_text
    receiver: AnswerBuilder.replies
  - sender: LLM.last_message
    receiver: RegexTextExtractor.text_or_messages
inputs:
  query:
    - AnswerBuilder.query
    - LLM.query
outputs:
  answers: AnswerBuilder.answers
max_runs_per_component: 100
metadata: {}
In this example:
- LLM generates a response containing the structured output.
- RegexTextExtractor uses the pattern <issue url="([^"]+)"> to extract the URL from the response. The capture group ([^"]+) matches any characters except quotes inside the url attribute.
- AnswerBuilder formats the extracted URL as the final answer.
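To illustrate why the capture group excludes quotes, the sketch below compares ([^"]+) with a greedy (.+) using Python's standard re module. It assumes first-match semantics like re.search and a made-up reply that contains two tags on one line; it is not output from the pipeline above.

```python
import re

# Hypothetical LLM reply with two tags on one line (made up for illustration).
reply = (
    '<issue url="https://github.com/example/repo/issues/123">Typo</issue> '
    '<issue url="https://github.com/example/repo/issues/456">Dead link</issue>'
)

# [^"]+ stops at the first closing quote, so only the first URL is captured.
bounded = re.search(r'<issue url="([^"]+)">', reply).group(1)

# .+ is greedy: it extends to the last `">` on the line, spanning both tags.
greedy = re.search(r'<issue url="(.+)">', reply).group(1)

print(bounded)
print(greedy)
```

The bounded pattern returns only the first URL. The greedy pattern returns everything from the first URL through the second one, including the markup in between, which is rarely what a downstream component expects.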
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| text_or_messages | Union[str, List[ChatMessage]] | Either a string or a list of ChatMessage objects to search through. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| captured_text | str | The matched text if a match is found. If no match is found, the output is an empty string when return_empty_on_no_match=False and is omitted when return_empty_on_no_match=True. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| regex_pattern | str | | The regular expression pattern used to extract text. The pattern should include a capture group to extract the desired text. Example: '<issue url="(.+)">' captures the URL from the tag. |
| return_empty_on_no_match | bool | True | If True, returns an empty dictionary when no match is found. If False, returns {"captured_text": ""}. |
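The two no-match behaviors can be sketched as a small function. This is a minimal approximation built on Python's standard re module, based only on the table above; run_sketch is a hypothetical helper, not the component's actual run() method.

```python
import re
from typing import Dict

def run_sketch(pattern: str, text: str, return_empty_on_no_match: bool = True) -> Dict[str, str]:
    """Hypothetical stand-in that mimics the documented output shapes."""
    match = re.search(pattern, text)
    if match:
        # Use the first capture group if the pattern has one, else the whole match.
        captured = match.group(1) if match.re.groups else match.group(0)
        return {"captured_text": captured}
    # No match: an empty dict, or a dict with an empty string, depending on the flag.
    return {} if return_empty_on_no_match else {"captured_text": ""}

pattern = r'<issue url="([^"]+)">'
no_output = run_sketch(pattern, "No tags here.")
empty_string = run_sketch(pattern, "No tags here.", return_empty_on_no_match=False)
print(no_output, empty_string)
```

With the default setting, a non-matching input produces no captured_text key at all, so a connected component receives nothing from this output. Setting the flag to False always yields a captured_text value, which may be preferable when the downstream component must receive a string.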
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| text_or_messages | Union[str, List[ChatMessage]] | Either a string or a list of ChatMessage objects to search through. |