RegexTextExtractor

Extract text from chat messages or strings using a regex pattern.

Basic Information

Type: haystack.components.extractors.regex_text_extractor.RegexTextExtractor
Components it can connect with:
- Any component that outputs a string or a list of ChatMessage objects, which RegexTextExtractor receives as text_or_messages. It's typically used in query pipelines to extract text from the query passed in by the Input component. You can also use it to extract text from a ChatGenerator's output.
- Any component that consumes a text string. You can use it in query pipelines to send the extracted text to PromptBuilder, AnswerBuilder, Retrievers, and similar.

Inputs

Parameter	Type	Default	Description
text_or_messages	Union[str, List[ChatMessage]]		Either a string or a list of ChatMessage objects to search through.

Outputs

Parameter	Type	Default	Description
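captured_text	str		The matched text if a match is found. If no match is found, the result depends on return_empty_on_no_match: when it is False, captured_text is an empty string; when it is True, the component returns an empty dictionary without captured_text.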

Overview

The RegexTextExtractor parses input text or ChatMessages using a regular expression pattern you provide. You can configure it to search through all messages or only the last message in a list of ChatMessages.

The pattern should include a capture group to extract the desired text. If the pattern has no capture groups, the component returns the entire match.
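The capture-group behavior described above can be illustrated with Python's standard re module. This is a minimal sketch, not the component's source code: the extract helper is a hypothetical name introduced only for this illustration, and the sample text is made up.

```python
import re


def extract(pattern: str, text: str) -> str:
    """Hypothetical helper: return group 1 if the pattern defines a
    capture group, otherwise return the entire match."""
    match = re.search(pattern, text)
    if match is None:
        return ""
    # match.groups() is an empty tuple when the pattern has no capture groups.
    return match.group(1) if match.groups() else match.group(0)


# Made-up sample text for illustration.
text = 'See <issue url="https://github.com/example/repo/issues/123">Typo</issue>'

# With a capture group, only the URL is returned.
with_group = extract(r'<issue url="([^"]+)">', text)

# Without a capture group, the whole matched tag is returned.
without_group = extract(r'<issue url="[^"]+">', text)
```

Here with_group holds only the URL, while without_group holds the full opening tag, including the attribute name and quotes.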

Usage Example

This query pipeline uses a ChatGenerator to analyze text and produce structured output, then uses RegexTextExtractor to extract a specific issue URL from the response:

components:
  ChatPromptBuilder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      template:
        - _content:
            - text: |
                You are a helpful assistant that identifies issues in text.
                When you find an issue, format your response as:
                <issue url="https://github.com/example/repo/issues/123">Description of the issue</issue>
                
                Analyze the following text and identify any issues:
                {{ query }}
          _role: user
      required_variables:
      variables:

  OpenAIChatGenerator:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      model: gpt-4o-mini
      generation_kwargs:
        temperature: 0.3

  RegexTextExtractor:
    type: haystack.components.extractors.regex_text_extractor.RegexTextExtractor
    init_parameters:
      regex_pattern: '<issue url="([^"]+)">'
      return_empty_on_no_match: true

  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:

connections:
  - sender: ChatPromptBuilder.prompt
    receiver: OpenAIChatGenerator.messages
  - sender: OpenAIChatGenerator.replies
    receiver: RegexTextExtractor.text_or_messages
  - sender: RegexTextExtractor.captured_text
    receiver: AnswerBuilder.replies

inputs:
  query:
    - ChatPromptBuilder.query
    - AnswerBuilder.query

outputs:
  answers: AnswerBuilder.answers

In this example:

- ChatPromptBuilder creates a prompt asking the LLM to identify issues and format them with a specific XML-like tag.
- OpenAIChatGenerator generates a response containing the structured output.
- RegexTextExtractor uses the pattern <issue url="([^"]+)"> to extract the URL from the response. The capture group ([^"]+) matches any characters except quotes inside the url attribute.
- AnswerBuilder formats the extracted URL as the final answer.
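To see why the pattern excludes quotes inside the capture group, it may help to compare it with a greedy (.+) group on a response that contains more than one tag. The sketch below uses only Python's re module and a made-up response string (the second issue URL is invented for this illustration); it does not run the pipeline itself.

```python
import re

# Made-up LLM response containing two issue tags on one line.
response = (
    '<issue url="https://github.com/example/repo/issues/123">Broken link</issue> '
    '<issue url="https://github.com/example/repo/issues/456">Typo</issue>'
)

# [^"]+ stops at the first closing quote, so only the first URL is captured.
narrow = re.search(r'<issue url="([^"]+)">', response).group(1)

# .+ is greedy: it runs to the last '">' on the line and captures
# everything in between, including the first tag's text and the second tag.
greedy = re.search(r'<issue url="(.+)">', response).group(1)
```

In this sample, narrow is the first URL alone, whereas greedy spans from the first URL through the end of the second one, which is rarely what you want to pass downstream.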

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
regex_pattern	str		The regular expression pattern used to extract text. The pattern should include a capture group to extract the desired text. Example: `'<issue url="([^"]+)">'` captures the URL from the tag.
return_empty_on_no_match	bool	True	If True, returns an empty dictionary when no match is found. If False, returns `{"captured_text": ""}`.
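The two no-match outcomes in the table can be sketched in plain Python. The function below is a hypothetical stand-in that mirrors the documented outputs using the standard re module; it is not the component's implementation, and the name extract_to_dict is an assumption made for this illustration.

```python
import re
from typing import Dict


def extract_to_dict(
    pattern: str, text: str, return_empty_on_no_match: bool = True
) -> Dict[str, str]:
    """Hypothetical stand-in that mirrors the documented outputs."""
    match = re.search(pattern, text)
    if match:
        return {"captured_text": match.group(1)}
    # No match: the flag decides between an empty dictionary
    # and a dictionary holding an empty string.
    return {} if return_empty_on_no_match else {"captured_text": ""}


pattern = r'<issue url="([^"]+)">'
no_match_text = "No issues found."  # made-up input without an issue tag

default_result = extract_to_dict(pattern, no_match_text)
explicit_result = extract_to_dict(
    pattern, no_match_text, return_empty_on_no_match=False
)
```

With the default setting, default_result is an empty dictionary, so no captured_text value is produced at all. With the flag set to False, explicit_result always contains the captured_text key, which may be preferable when a downstream component expects a value on every run.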

Run Method Parameters

These are the parameters you can configure for the component's run() method.

Parameter	Type	Default	Description
text_or_messages	Union[str, List[ChatMessage]]		Either a string or a list of ChatMessage objects to search through.
