RegexTextExtractor
Extract text from chat messages or strings using a regex pattern.
Key Features
- Parses input text or ChatMessage objects using a regular expression you provide
- Supports capture groups to extract a specific part of a match
- Returns the entire match if no capture group is defined
- Can search through all messages or just the last message in a list of ChatMessage objects
- Returns either an empty result or an empty string when no match is found, depending on return_empty_on_no_match
- Useful in query pipelines for extracting structured data from LLM responses
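The capture-group behavior listed above can be illustrated with Python's standard re module. This is a minimal sketch of the idea, not the component's implementation; the `extract` helper and the sample reply string are assumptions made for illustration.

```python
import re


def extract(pattern: str, text: str) -> str:
    """Return the first capture group if the pattern has one, else the whole match."""
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: an empty string stands in for "no match"
    # match.re.groups is the number of capture groups in the compiled pattern
    return match.group(1) if match.re.groups else match.group(0)


# Hypothetical LLM reply used only for this sketch
reply = 'Found one: <issue url="https://github.com/example/repo/issues/123">Broken link</issue>'

with_group = extract(r'<issue url="([^"]+)">', reply)     # only the URL
without_group = extract(r'<issue url="[^"]+">', reply)    # the whole opening tag
```

With the capture group, only the URL is returned; without it, the whole matched tag comes back.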
Configuration
- Drag the RegexTextExtractor component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Enter the regular expression pattern to use for extraction. Include a capture group to extract specific text from a match. For example, <issue url="([^"]+)"> captures the URL from the tag.
- Go to the Advanced tab to configure return_empty_on_no_match.
Connections
RegexTextExtractor accepts either a string or a list of ChatMessage objects as input. It outputs the matched text as a string (captured_text). It is typically used in query pipelines to extract structured data from a ChatGenerator's output, and it sends the extracted text to components such as PromptBuilder, AnswerBuilder, or Retrievers.
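To make the two accepted input shapes concrete, here is a rough sketch that normalizes a string or a list of messages before searching. The `Message` dataclass is a hypothetical stand-in for ChatMessage, `extract_from` is an illustrative helper, and searching only the last message is an assumption of this sketch rather than a statement about the component's defaults.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage; only the text matters here."""
    text: str


def extract_from(pattern: str, text_or_messages: Union[str, List[Message]]) -> Dict[str, str]:
    # Normalize both accepted input shapes to a single string to search.
    if isinstance(text_or_messages, str):
        text = text_or_messages
    else:
        # Assumption for this sketch: only the last message is searched.
        text = text_or_messages[-1].text if text_or_messages else ""
    match = re.search(pattern, text)
    captured = ""
    if match:
        captured = match.group(1) if match.re.groups else match.group(0)
    # Mirror the output name used on this page: captured_text
    return {"captured_text": captured}


pattern = r'<issue url="([^"]+)">'
from_string = extract_from(pattern, '<issue url="https://example.com/1">A</issue>')
from_messages = extract_from(
    pattern,
    [Message("no tag here"), Message('<issue url="https://example.com/2">B</issue>')],
)
```

Either input shape produces the same kind of output: a dictionary with a single captured_text string.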
Usage Example
This query pipeline uses a ChatGenerator to analyze text and produce structured output, then uses RegexTextExtractor to extract a specific issue URL from the response:
components:
  ChatPromptBuilder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      template:
        - _content:
            - text: |
                You are a helpful assistant that identifies issues in text.
                When you find an issue, format your response as:
                <issue url="https://github.com/example/repo/issues/123">Description of the issue</issue>
                Analyze the following text and identify any issues:
                {{ query }}
          _role: user
      required_variables:
      variables:
  OpenAIChatGenerator:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      model: gpt-4o-mini
      generation_kwargs:
        temperature: 0.3
  RegexTextExtractor:
    type: haystack.components.extractors.regex_text_extractor.RegexTextExtractor
    init_parameters:
      regex_pattern: '<issue url="([^"]+)">'
      return_empty_on_no_match: true
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
connections:
  - sender: ChatPromptBuilder.prompt
    receiver: OpenAIChatGenerator.messages
  - sender: OpenAIChatGenerator.replies
    receiver: RegexTextExtractor.text_or_messages
  - sender: RegexTextExtractor.captured_text
    receiver: AnswerBuilder.replies
inputs:
  query:
    - ChatPromptBuilder.query
    - AnswerBuilder.query
outputs:
  answers: AnswerBuilder.answers
In this example:
- ChatPromptBuilder creates a prompt asking the LLM to identify issues and format them with a specific XML-like tag.
- OpenAIChatGenerator generates a response containing the structured output.
- RegexTextExtractor uses the pattern <issue url="([^"]+)"> to extract the URL from the response. The capture group ([^"]+) matches any characters except quotes inside the url attribute.
- AnswerBuilder formats the extracted URL as the final answer.
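The choice of ([^"]+) over a greedy (.+) matters when a response contains more than one tag on the same line. The following sketch uses Python's re module and a made-up reply string (an assumption for illustration) to compare the two patterns.

```python
import re

# Hypothetical LLM reply with two tags on one line
reply = (
    '<issue url="https://example.com/1">First</issue> '
    '<issue url="https://example.com/2">Second</issue>'
)

# [^"]+ stops at the first closing quote, so only the first URL is captured.
negated = re.search(r'<issue url="([^"]+)">', reply).group(1)

# .+ is greedy: it runs to the last '">' on the line and captures
# everything in between, including markup from both tags.
greedy = re.search(r'<issue url="(.+)">', reply).group(1)
```

Here `negated` holds just the first URL, while `greedy` spans from the first URL to the end of the second one. A negated character class is usually the safer choice for attribute values.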
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| text_or_messages | Union[str, List[ChatMessage]] | | Either a string or a list of ChatMessage objects to search through. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| captured_text | str | | The matched text if a match is found. Empty string if no match is found and return_empty_on_no_match=False. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| regex_pattern | str | | The regular expression pattern used to extract text. The pattern should include a capture group to extract the desired text. Example: '<issue url="(.+)">' captures the URL from the tag. |
| return_empty_on_no_match | bool | True | If True, returns an empty dictionary when no match is found. If False, returns {"captured_text": ""}. |
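The two no-match behaviors described for return_empty_on_no_match can be sketched as follows. This is an illustration built on Python's re module under the semantics stated in the table, not the component's source code; `run_sketch` is a hypothetical helper.

```python
import re
from typing import Dict


def run_sketch(pattern: str, text: str, return_empty_on_no_match: bool = True) -> Dict[str, str]:
    match = re.search(pattern, text)
    if match is None:
        # True  -> empty dictionary (no captured_text key at all)
        # False -> captured_text key present, holding an empty string
        return {} if return_empty_on_no_match else {"captured_text": ""}
    captured = match.group(1) if match.re.groups else match.group(0)
    return {"captured_text": captured}


pattern = r'<issue url="([^"]+)">'
no_tag = "The text looks fine."

empty_dict = run_sketch(pattern, no_tag, return_empty_on_no_match=True)
empty_string = run_sketch(pattern, no_tag, return_empty_on_no_match=False)
```

With the empty dictionary there is no captured_text key at all, so anything consuming the output has to account for its absence. With the second setting the key is always present.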
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| text_or_messages | Union[str, List[ChatMessage]] | | Either a string or a list of ChatMessage objects to search through. |