TextCleaner

Clean text strings by removing patterns, converting to lowercase, or removing punctuation and numbers.

Basic Information

  • Type: haystack.components.preprocessors.text_cleaner.TextCleaner
  • Components it can connect with:
    • Generators: TextCleaner can receive generated text replies from generators.
    • Evaluators: TextCleaner can send cleaned text to evaluation components for comparison.
    • Any component that outputs text strings.

Inputs

Parameter | Type | Default | Description
texts | List[str] | | List of strings to clean.

Outputs

Parameter | Type | Default | Description
texts | List[str] | | List of cleaned text strings.

Overview

TextCleaner cleans text strings by applying the transformations you enable. Unlike DocumentCleaner, which works with Document objects, TextCleaner operates on plain text strings.

This component is particularly useful for:

  • Cleaning up text data before evaluation
  • Normalizing text for comparison
  • Preprocessing generated responses

Available cleaning options:

  • remove_regexps: Remove substrings matching regular expressions
  • convert_to_lowercase: Convert all characters to lowercase
  • remove_punctuation: Remove punctuation from text
  • remove_numbers: Remove numerical digits from text
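The four options above can be approximated with the Python standard library. The sketch below is not TextCleaner's implementation: the helper name clean_texts is hypothetical, and the order of the steps, the joining of patterns with "|", and the use of string.punctuation as the punctuation set are assumptions made for illustration.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical helper that mimics the four cleaning options."""
    cleaned = list(texts)
    if remove_regexps:
        # Assumption: any substring matching any of the patterns is deleted.
        pattern = re.compile("|".join(remove_regexps))
        cleaned = [pattern.sub("", text) for text in cleaned]
    if convert_to_lowercase:
        cleaned = [text.lower() for text in cleaned]
    if remove_punctuation:
        # Assumption: punctuation means the ASCII set in string.punctuation.
        table = str.maketrans("", "", string.punctuation)
        cleaned = [text.translate(table) for text in cleaned]
    if remove_numbers:
        # Deletes the ASCII digits 0-9 and leaves everything else in place.
        table = str.maketrans("", "", string.digits)
        cleaned = [text.translate(table) for text in cleaned]
    return cleaned


print(
    clean_texts(
        ["Answer: The Total is 42, not 41!"],
        remove_regexps=[r"^Answer:\s*"],
        convert_to_lowercase=True,
        remove_punctuation=True,
    )
)
```

Note that removing characters does not collapse the whitespace around them, so a string can keep double or trailing spaces after cleaning.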

Usage Example

Using the Component in a Pipeline

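This example shows a RAG pipeline that passes the generator's replies through TextCleaner, which converts them to lowercase, and returns the cleaned strings as the pipeline's answers, for example for later evaluation.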

components:
  retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      top_k: 5
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        Answer the question based on the context.
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ question }}
        Answer:
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - OPENAI_API_KEY
        strict: true
      model: gpt-4o-mini
  TextCleaner:
    type: haystack.components.preprocessors.text_cleaner.TextCleaner
    init_parameters:
      remove_regexps:
      convert_to_lowercase: true
      remove_punctuation: false
      remove_numbers: false

connections:
- sender: text_embedder.embedding
  receiver: retriever.query_embedding
- sender: retriever.documents
  receiver: prompt_builder.documents
- sender: prompt_builder.prompt
  receiver: generator.prompt
- sender: generator.replies
  receiver: TextCleaner.texts

max_runs_per_component: 100

metadata: {}

inputs:
  query:
  - text_embedder.text
  - prompt_builder.question

outputs:
  answers: TextCleaner.texts

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter | Type | Default | Description
remove_regexps | Optional[List[str]] | None | A list of regex patterns to remove matching substrings from the text.
convert_to_lowercase | bool | False | If True, converts all characters to lowercase.
remove_punctuation | bool | False | If True, removes punctuation from the text.
remove_numbers | bool | False | If True, removes numerical digits from the text.

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Default | Description
texts | List[str] | | List of strings to clean.