Skip to main content

Tutorial: Build a PII Masking Index with Custom Code

Build an index that hides sensitive personal information using custom code. This tutorial is a great starting point if you're building an app where you need to keep sensitive data out of the LLM, such as a CV screening system. You'll also learn how ot use the Code component in your apps.


  • Level: Beginner
  • Time to complete: 10 minutes
  • Prerequisites:
    • You must have an Editor role in the workspace where you'll create the index.
    • You must have a deepset workspace where you'll create the index. For instructions on how to create a workspace, see Quick Start Guide.
  • Goal: After completing this tutorial, you'll have create an index that can process PDF and TXT files and masks sensitive personal information using custom code. You'll also learn how to use the Code component in your apps.

Create the Index

To keep it simple, let's create an index that can process PDF files.

  1. In deepset, make sure in the correct workspace and go to Indexes.
  2. Click Create Index>Build your own to create an index from scratch.
    The Build your own index button on the Idexe templates page
  3. Type PII-masking-index as the name and An index that masks sensitive personal information as the description and click Create Index.
  4. From the Component Library, drag the following components onto the canvas:
tip

You can use the search field above the Component Library to find components quickly.

  • Input
  • PDFMinerToDocument
  • Code
  • DocumentSplitter
  • DeepsetNvidiaDocumentEmbedder
  • DocumentWriter
  • OpenSearchDocumentStore
  1. On the Code component card, click Code. This opens the code editor. You can expand it to make it more convenient.

    The expand code button on the Code component card
  2. Replace the example code with the following code:

    from haystack import component, Document
    import re
    from typing import List

    @component
    class Code:
    """
    Masks personally identifiable information in CV text before LLM processing.
    """
    @component.output_types(cleaned_docs=List[Document])
    def run(self, documents: List[Document]) -> dict:
    cleaned_docs = []

    for doc in documents:
    redacted = doc.content

    # Email addresses
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    redacted = re.sub(email_pattern, '[EMAIL]', redacted)

    # LinkedIn URLs
    linkedin_pattern = r'linkedin\.com/in/[A-Za-z0-9_-]+'
    redacted = re.sub(linkedin_pattern, '[LINKEDIN]', redacted, flags=re.IGNORECASE)

    # GitHub URLs
    github_pattern = r'github\.com/[A-Za-z0-9_-]+'
    redacted = re.sub(github_pattern, '[GITHUB]', redacted, flags=re.IGNORECASE)

    # Phone numbers: +44 7700 112233, 07700 112233, 798072811, etc.
    phone_pattern = r'(\+\d{1,3}[\s.-]?)?\(?\d{2,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{3,4}|\b\d{9,11}\b'
    redacted = re.sub(phone_pattern, '[PHONE]', redacted)

    # UK postcodes: BS8 1AF, M1 4HE, EC1A 1BB
    postcode_pattern = r'\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b'
    redacted = re.sub(postcode_pattern, '[POSTCODE]', redacted, flags=re.IGNORECASE)

    # Street addresses (line containing number + street type)
    street_pattern = r'\d+\s+[\w\s]+(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr|Way|Court|Ct|Place|Pl|Square|Sq|Close|Crescent|Terrace|Grove|Park|Gardens|Row|Mews|Hill|Green|Wood|Gate|Walk|Fields|Rise|View|Valley|Meadow|Heights)\b[^,\n]*'
    redacted = re.sub(street_pattern, '[ADDRESS]', redacted, flags=re.IGNORECASE)

    # Date of birth patterns: DOB: 15/03/1988, Date of Birth: 15 March 1988
    dob_pattern = r'(DOB|Date of Birth|Born|Birthday|D\.O\.B\.?)[\s:]*\d{1,2}[\s./-](\d{1,2}|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\s./-]\d{2,4}'
    redacted = re.sub(dob_pattern, '[DATE_OF_BIRTH]', redacted, flags=re.IGNORECASE)

    # National Insurance Number (UK): AB 12 34 56 C
    nino_pattern = r'\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-Z]\b'
    redacted = re.sub(nino_pattern, '[NATIONAL_ID]', redacted, flags=re.IGNORECASE)

    cleaned_docs.append(Document(content=redacted, meta=doc.meta))

    return {"cleaned_docs": cleaned_docs}

    Your code is instantly validated.

  3. Close the code editor.

  4. Connect the components as follows:

    • Connect Input's file output to PDFMinerToDocument's sources input.
    • Connect PDFMinerToDocument's documents output to Code's documents input.
    • Connect Code's cleaned_docs output to DocumentSplitter's documents input.
    • Connect DocumentSplitter's documents output to DeepsetNvidiaDocumentEmbedder's documents input.
    • Connect DeepsetNvidiaDocumentEmbedder's embeddings output to DocumentWriter's documents input.
    • Connect DocumentWriter's documents output to DocumentStore's documents input.
    • Connect DocumentStore's document_store input to OpenSearchDocumentStore's document_store output.

    This is what your index should look like:

    The index component connections
  5. Save and enable the index.

Click to view the complete index configuration
components:
Code:
type: deepset_cloud_custom_nodes.code.code_component.Code
init_parameters:
code: "from haystack import component, Document\nimport re\nfrom typing import List\n\n@component\nclass Code:\n \"\"\"\n Masks personally identifiable information in CV text before LLM processing.\n \"\"\"\n @component.output_types(cleaned_docs=List[Document])\n def run(self, documents: List[Document]) -> dict:\n cleaned_docs = []\n \n for doc in documents:\n redacted = doc.content\n\n # Email addresses\n email_pattern = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b'\n redacted = re.sub(email_pattern, '[EMAIL]', redacted)\n\n # LinkedIn URLs\n linkedin_pattern = r'linkedin\\.com/in/[A-Za-z0-9_-]+'\n redacted = re.sub(linkedin_pattern, '[LINKEDIN]', redacted, flags=re.IGNORECASE)\n\n # GitHub URLs\n github_pattern = r'github\\.com/[A-Za-z0-9_-]+'\n redacted = re.sub(github_pattern, '[GITHUB]', redacted, flags=re.IGNORECASE)\n\n # Phone numbers: +44 7700 112233, 07700 112233, 798072811, etc.\n phone_pattern = r'(\\+\\d{1,3}[\\s.-]?)?\\(?\\d{2,4}\\)?[\\s.-]?\\d{3,4}[\\s.-]?\\d{3,4}|\\b\\d{9,11}\\b'\n redacted = re.sub(phone_pattern, '[PHONE]', redacted)\n\n # UK postcodes: BS8 1AF, M1 4HE, EC1A 1BB\n postcode_pattern = r'\\b[A-Z]{1,2}\\d[A-Z\\d]?\\s?\\d[A-Z]{2}\\b'\n redacted = re.sub(postcode_pattern, '[POSTCODE]', redacted, flags=re.IGNORECASE)\n\n # Street addresses (line containing number + street type)\n street_pattern = r'\\d+\\s+[\\w\\s]+(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr|Way|Court|Ct|Place|Pl|Square|Sq|Close|Crescent|Terrace|Grove|Park|Gardens|Row|Mews|Hill|Green|Wood|Gate|Walk|Fields|Rise|View|Valley|Meadow|Heights)\\b[^,\\n]*'\n redacted = re.sub(street_pattern, '[ADDRESS]', redacted, flags=re.IGNORECASE)\n\n # Date of birth patterns: DOB: 15/03/1988, Date of Birth: 15 March 1988\n dob_pattern = r'(DOB|Date of Birth|Born|Birthday|D\\.O\\.B\\.?)[\\s:]*\\d{1,2}[\\s./-](\\d{1,2}|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\\s./-]\\d{2,4}'\n redacted = re.sub(dob_pattern, '[DATE_OF_BIRTH]', redacted, flags=re.IGNORECASE)\n\n # National Insurance Number (UK): AB 12 34 56 C\n nino_pattern = r'\\b[A-Z]{2}\\s?\\d{2}\\s?\\d{2}\\s?\\d{2}\\s?[A-Z]\\b'\n redacted = re.sub(nino_pattern, '[NATIONAL_ID]', redacted, flags=re.IGNORECASE)\n\n cleaned_docs.append(Document(content=redacted, meta=doc.meta))\n\n return {\"cleaned_docs\": cleaned_docs}"
PDFMinerToDocument:
type: haystack.components.converters.pdfminer.PDFMinerToDocument
init_parameters:
line_overlap: 0.5
char_margin: 2
line_margin: 0.5
word_margin: 0.1
boxes_flow: 0.5
detect_vertical: true
all_texts: false
store_full_path: false
DocumentSplitter:
type: haystack.components.preprocessors.document_splitter.DocumentSplitter
init_parameters:
split_by: word
split_length: 200
split_overlap: 0
split_threshold: 0
splitting_function:
respect_sentence_boundary: false
language: en
use_split_rules: true
extend_abbreviations: true
skip_empty_documents: true
DeepsetNvidiaDocumentEmbedder:
type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
init_parameters:
model: intfloat/multilingual-e5-base
prefix: ''
suffix: ''
batch_size: 32
meta_fields_to_embed:
embedding_separator: \n
truncate:
normalize_embeddings: true
timeout:
backend_kwargs:
DocumentWriter:
type: haystack.components.writers.document_writer.DocumentWriter
init_parameters:
policy: NONE
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
hosts:
index: ''
max_chunk_bytes: 104857600
embedding_dim: 768
return_embedding: false
method:
mappings:
settings:
create_index: true
http_auth:
use_ssl:
verify_certs:
timeout:

connections: # Defines how the components are connected
- sender: PDFMinerToDocument.documents
receiver: Code.documents
- sender: Code.cleaned_docs
receiver: DocumentSplitter.documents
- sender: DocumentSplitter.documents
receiver: DeepsetNvidiaDocumentEmbedder.documents
- sender: DeepsetNvidiaDocumentEmbedder.documents
receiver: DocumentWriter.documents

inputs: # Define the inputs for your pipeline
files: # This component will receive the files to index as input
- PDFMinerToDocument.sources

max_runs_per_component: 100

metadata: {}

Result: You have created an index that can process PDF files and masks sensitive personal information using custom code. The index is enabled and ready to process files.

Index Files

Now, let's upload sample files to the workspace and index them.

  1. Download the sample files from the sample-cvs folder in Google Drive and unzip it to a location on your computer. You should have five PDF files.
  2. In deepset, go to Files and click Upload Files.
  3. Choose the five PDF files you just downloaded and click Upload.
  4. Wait until the upload finishes. You should have five PDF files in your workspace.
  5. Go to Indexes and click your PII-masking-index. This opens the index details page where you can see it's indexing the files you uploaded.
    The index details page with the indexing status
  6. Wait until the index status changes to Indexed and all the files show as Indexed.
  7. Click View next to the first file and switch to the Documents tab. You should see the document produced from this file with the sensitive information masked.
    The document preview with the sensitive information masked

Result: Congratulations! You have created an index that can process PDF files and masks sensitive personal information using custom code. You've also learned how to use the Code component in your apps.