TextCleaner
Clean text strings by removing patterns, converting to lowercase, or removing punctuation and numbers. Unlike DocumentCleaner, which works with Document objects, TextCleaner operates on plain text strings.
Key Features
- Removes substrings matching regular expressions.
- Converts all text to lowercase.
- Removes punctuation from text.
- Removes numerical digits from text.
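As a rough illustration of what these four options do to a string, the following stdlib-only sketch mimics them. `clean_texts` is a hypothetical helper written for this example, not the component's actual code, and the order in which it applies the steps is an assumption.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics the four cleaning options."""
    # Build one translation table that deletes the unwanted characters.
    to_delete = ""
    if remove_punctuation:
        to_delete += string.punctuation
    if remove_numbers:
        to_delete += string.digits
    table = str.maketrans("", "", to_delete)

    cleaned = []
    for text in texts:
        # Remove every substring that matches any of the patterns.
        for pattern in remove_regexps or []:
            text = re.sub(pattern, "", text)
        if convert_to_lowercase:
            text = text.lower()
        cleaned.append(text.translate(table))
    return cleaned


print(clean_texts(
    ["Hello, World 42!"],
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
))
# → ['hello world ']
```

Note that deleting characters does not collapse the whitespace around them, which is why the trailing space survives in the output.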
Configuration
- Drag the TextCleaner component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Set Remove Regexps to specify a list of regular expression patterns. The component removes all matching substrings from the text.
  - Toggle Convert to Lowercase to convert all characters to lowercase.
  - Toggle Remove Punctuation to remove punctuation from the text.
  - Toggle Remove Numbers to remove numerical digits from the text.
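Before pasting patterns into Remove Regexps, it can help to confirm that each one compiles. The sketch below assumes the patterns follow Python's `re` syntax. `check_patterns` and the two sample patterns are hypothetical and are not part of the component.

```python
import re
from typing import List, Tuple


def check_patterns(patterns: List[str]) -> Tuple[List[str], List[str]]:
    """Split patterns into those that compile and those that do not."""
    valid, invalid = [], []
    for pattern in patterns:
        try:
            re.compile(pattern)
            valid.append(pattern)
        except re.error:
            invalid.append(pattern)
    return valid, invalid


# Illustrative patterns: a bracketed citation marker and an unbalanced group.
valid, invalid = check_patterns([r"\[\d+\]", r"(unclosed"])
print("usable:", valid)
print("rejected:", invalid)
```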
Connections
TextCleaner accepts a list of text strings and outputs a list of cleaned text strings.
It typically receives generated text from generators and sends cleaned text to evaluation components for comparison. It connects with any component that outputs text strings.
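To see why cleaning helps before comparison, consider an exact-match check between generated answers and reference answers. This is a minimal stdlib sketch: `normalize` and `exact_match_rate` are hypothetical helpers, not deepset or Haystack APIs, and the extra `strip()` call is an addition that the component's toggles do not perform.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase and drop punctuation, then trim surrounding whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).strip()


def exact_match_rate(predictions: List[str], references: List[str]) -> float:
    """Share of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Without normalization "Paris." and "paris" would count as a mismatch.
print(exact_match_rate(["Paris.", "Rome"], ["paris", "Berlin"]))
# → 0.5
```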
Source Code
To check this component's source code, open text_cleaner.py in the Haystack repository.
Usage Examples
Basic Configuration
TextCleaner:
  type: haystack.components.preprocessors.text_cleaner.TextCleaner
  init_parameters:
    convert_to_lowercase: true
    remove_punctuation: false
    remove_numbers: false
Using the Component in a Pipeline
This example shows a pipeline that cleans generated answers before evaluation.
# haystack-pipeline
components:
  retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      top_k: 5
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        Answer the question based on the context.
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ question }}
        Answer:
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - OPENAI_API_KEY
        strict: true
      model: gpt-4o-mini
  TextCleaner:
    type: haystack.components.preprocessors.text_cleaner.TextCleaner
    init_parameters:
      remove_regexps:
      convert_to_lowercase: true
      remove_punctuation: false
      remove_numbers: false
connections:
- sender: text_embedder.embedding
  receiver: retriever.query_embedding
- sender: retriever.documents
  receiver: prompt_builder.documents
- sender: prompt_builder.prompt
  receiver: generator.prompt
- sender: generator.replies
  receiver: TextCleaner.texts
max_runs_per_component: 100
metadata: {}
inputs:
  query:
  - text_embedder.text
  - prompt_builder.question
outputs:
  answers: TextCleaner.texts
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | List of strings to clean. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | List of cleaned text strings. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_regexps | Optional[List[str]] | None | A list of regex patterns to remove matching substrings from the text. |
| convert_to_lowercase | bool | False | If True, converts all characters to lowercase. |
| remove_punctuation | bool | False | If True, removes punctuation from the text. |
| remove_numbers | bool | False | If True, removes numerical digits from the text. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | List of strings to clean. |