# Tutorial: Build a PII Masking Index with Custom Code

Build an index that masks sensitive personal information using custom code. This tutorial is a good starting point if you're building an app that must keep sensitive data away from the LLM, such as a CV screening system. You'll also learn how to use the `Code` component in your apps.

***

- **Level**: Beginner
- **Time to complete**: 10 minutes
- **Prerequisites**: 
    - You must have an Editor role in the workspace where you'll create the index.
    - You must have a <ProductShortName /> workspace where you'll create the index. For instructions on how to create a workspace, see [Quick Start Guide](/docs/getting-started/quick-start-guide.mdx).
- **Goal**: After completing this tutorial, you'll have created an index that processes PDF files and masks sensitive personal information using custom code. You'll also have learned how to use the `Code` component in your apps.

***

## Create the Index

To keep it simple, let's create an index that can process PDF files. 

1. In <ProductShortName />, make sure you're in the correct workspace and go to **Indexes**.
2. Click **Create Index > Build your own** to create an index from scratch.
    <ClickableImage
      src="/img/tutorials/build-your-own-index.png"
      alt="The Build your own index button on the Index templates page"
      size="standard"
    />
3. Type *PII-masking-index* as the name and *An index that masks sensitive personal information* as the description, then click **Create Index**.
4. From the Component Library, drag the following components onto the canvas:

    :::tip
    You can use the search field above the Component Library to find components quickly.
    :::

    - `Input`
    - `PDFMinerToDocument`
    - `Code`
    - `DocumentSplitter`
    - `DeepsetNvidiaDocumentEmbedder`
    - `DocumentWriter`
    - `OpenSearchDocumentStore`
    
5. On the `Code` component card, click **Code**. This opens the code editor. You can expand the editor to make it easier to work in.
    <ClickableImage
      src="/img/tutorials/expand-code.png"
      alt="The expand code button on the Code component card"
      size="large"
    />
6. Replace the example code with the following code:

    ```python
    from haystack import component, Document
    import re
    from typing import List

    @component
    class Code:
        """
        Masks personally identifiable information in CV text before LLM processing.
        """
        @component.output_types(cleaned_docs=List[Document])
        def run(self, documents: List[Document]) -> dict:
            cleaned_docs = []
            
            for doc in documents:
                redacted = doc.content

                # Email addresses
                email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
                redacted = re.sub(email_pattern, '[EMAIL]', redacted)

                # LinkedIn URLs
                linkedin_pattern = r'linkedin\.com/in/[A-Za-z0-9_-]+'
                redacted = re.sub(linkedin_pattern, '[LINKEDIN]', redacted, flags=re.IGNORECASE)

                # GitHub URLs
                github_pattern = r'github\.com/[A-Za-z0-9_-]+'
                redacted = re.sub(github_pattern, '[GITHUB]', redacted, flags=re.IGNORECASE)

                # Phone numbers: +44 7700 112233, 07700 112233, 798072811, etc.
                phone_pattern = r'(\+\d{1,3}[\s.-]?)?\(?\d{2,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{3,4}|\b\d{9,11}\b'
                redacted = re.sub(phone_pattern, '[PHONE]', redacted)

                # UK postcodes: BS8 1AF, M1 4HE, EC1A 1BB
                postcode_pattern = r'\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b'
                redacted = re.sub(postcode_pattern, '[POSTCODE]', redacted, flags=re.IGNORECASE)

                # Street addresses (line containing number + street type)
                street_pattern = r'\d+\s+[\w\s]+(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr|Way|Court|Ct|Place|Pl|Square|Sq|Close|Crescent|Terrace|Grove|Park|Gardens|Row|Mews|Hill|Green|Wood|Gate|Walk|Fields|Rise|View|Valley|Meadow|Heights)\b[^,\n]*'
                redacted = re.sub(street_pattern, '[ADDRESS]', redacted, flags=re.IGNORECASE)

                # Date of birth patterns: DOB: 15/03/1988, Date of Birth: 15 March 1988
                dob_pattern = r'(DOB|Date of Birth|Born|Birthday|D\.O\.B\.?)[\s:]*\d{1,2}[\s./-](\d{1,2}|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\s./-]\d{2,4}'
                redacted = re.sub(dob_pattern, '[DATE_OF_BIRTH]', redacted, flags=re.IGNORECASE)

                # National Insurance Number (UK): AB 12 34 56 C
                nino_pattern = r'\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-Z]\b'
                redacted = re.sub(nino_pattern, '[NATIONAL_ID]', redacted, flags=re.IGNORECASE)

                cleaned_docs.append(Document(content=redacted, meta=doc.meta))

            return {"cleaned_docs": cleaned_docs}
    ```
    Your code is instantly validated.
7. Close the code editor.
8. Connect the components as follows:
    - Connect `Input`'s `files` output to `PDFMinerToDocument`'s `sources` input.
    - Connect `PDFMinerToDocument`'s `documents` output to `Code`'s `documents` input.
    - Connect `Code`'s `cleaned_docs` output to `DocumentSplitter`'s `documents` input.
    - Connect `DocumentSplitter`'s `documents` output to `DeepsetNvidiaDocumentEmbedder`'s `documents` input.
    - Connect `DeepsetNvidiaDocumentEmbedder`'s `documents` output to `DocumentWriter`'s `documents` input.
    - Connect `OpenSearchDocumentStore`'s `document_store` output to `DocumentWriter`'s `document_store` input.
    
    This is what your index should look like:

    <ClickableImage
      src="/img/tutorials/pii-component-connections.png"
      alt="The index component connections"
      size="standard"
    />
9. Save and enable the index.

<details>
<summary>Click to view the complete index configuration</summary>

```yaml
components:
  Code:
    type: deepset_cloud_custom_nodes.code.code_component.Code
    init_parameters:
      code: "from haystack import component, Document\nimport re\nfrom typing import List\n\n@component\nclass Code:\n    \"\"\"\n    Masks personally identifiable information in CV text before LLM processing.\n    \"\"\"\n    @component.output_types(cleaned_docs=List[Document])\n    def run(self, documents: List[Document]) -> dict:\n        cleaned_docs = []\n        \n        for doc in documents:\n            redacted = doc.content\n\n            # Email addresses\n            email_pattern = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b'\n            redacted = re.sub(email_pattern, '[EMAIL]', redacted)\n\n            # LinkedIn URLs\n            linkedin_pattern = r'linkedin\\.com/in/[A-Za-z0-9_-]+'\n            redacted = re.sub(linkedin_pattern, '[LINKEDIN]', redacted, flags=re.IGNORECASE)\n\n            # GitHub URLs\n            github_pattern = r'github\\.com/[A-Za-z0-9_-]+'\n            redacted = re.sub(github_pattern, '[GITHUB]', redacted, flags=re.IGNORECASE)\n\n            # Phone numbers: +44 7700 112233, 07700 112233, 798072811, etc.\n            phone_pattern = r'(\\+\\d{1,3}[\\s.-]?)?\\(?\\d{2,4}\\)?[\\s.-]?\\d{3,4}[\\s.-]?\\d{3,4}|\\b\\d{9,11}\\b'\n            redacted = re.sub(phone_pattern, '[PHONE]', redacted)\n\n            # UK postcodes: BS8 1AF, M1 4HE, EC1A 1BB\n            postcode_pattern = r'\\b[A-Z]{1,2}\\d[A-Z\\d]?\\s?\\d[A-Z]{2}\\b'\n            redacted = re.sub(postcode_pattern, '[POSTCODE]', redacted, flags=re.IGNORECASE)\n\n            # Street addresses (line containing number + street type)\n            street_pattern = r'\\d+\\s+[\\w\\s]+(?:Street|St|Road|Rd|Avenue|Ave|Lane|Ln|Drive|Dr|Way|Court|Ct|Place|Pl|Square|Sq|Close|Crescent|Terrace|Grove|Park|Gardens|Row|Mews|Hill|Green|Wood|Gate|Walk|Fields|Rise|View|Valley|Meadow|Heights)\\b[^,\\n]*'\n            redacted = re.sub(street_pattern, '[ADDRESS]', redacted, flags=re.IGNORECASE)\n\n            # Date of birth patterns: DOB: 15/03/1988, Date of Birth: 15 March 1988\n            dob_pattern = r'(DOB|Date of Birth|Born|Birthday|D\\.O\\.B\\.?)[\\s:]*\\d{1,2}[\\s./-](\\d{1,2}|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\\s./-]\\d{2,4}'\n            redacted = re.sub(dob_pattern, '[DATE_OF_BIRTH]', redacted, flags=re.IGNORECASE)\n\n            # National Insurance Number (UK): AB 12 34 56 C\n            nino_pattern = r'\\b[A-Z]{2}\\s?\\d{2}\\s?\\d{2}\\s?\\d{2}\\s?[A-Z]\\b'\n            redacted = re.sub(nino_pattern, '[NATIONAL_ID]', redacted, flags=re.IGNORECASE)\n\n            cleaned_docs.append(Document(content=redacted, meta=doc.meta))\n\n        return {\"cleaned_docs\": cleaned_docs}"
  PDFMinerToDocument:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      line_overlap: 0.5
      char_margin: 2
      line_margin: 0.5
      word_margin: 0.1
      boxes_flow: 0.5
      detect_vertical: true
      all_texts: false
      store_full_path: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
      skip_empty_documents: true
  DeepsetNvidiaDocumentEmbedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/multilingual-e5-base
      prefix: ''
      suffix: ''
      batch_size: 32
      meta_fields_to_embed:
      embedding_separator: \n
      truncate:
      normalize_embeddings: true
      timeout:
      backend_kwargs:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:  # Defines how the components are connected
- sender: PDFMinerToDocument.documents
  receiver: Code.documents
- sender: Code.cleaned_docs
  receiver: DocumentSplitter.documents
- sender: DocumentSplitter.documents
  receiver: DeepsetNvidiaDocumentEmbedder.documents
- sender: DeepsetNvidiaDocumentEmbedder.documents
  receiver: DocumentWriter.documents

inputs:  # Define the inputs for your pipeline
  files:                            # This component will receive the files to index as input
  - PDFMinerToDocument.sources

max_runs_per_component: 100

metadata: {}

```

</details>  

**Result**: You have created an index that processes PDF files and masks sensitive personal information using custom code. The index is enabled and ready to process files.
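
If you want to try the masking logic outside the platform before relying on it, the following is a minimal sketch that applies three of the tutorial's patterns to a plain string using only Python's standard `re` module. The `RULES` list, the `mask_text` helper, and the sample text are illustrative assumptions and not part of the `Code` component; they only mirror what the component's `run` method does to each document's `content`.

```python
import re

# (pattern, placeholder, flags) triples copied from the tutorial's Code component.
# RULES and mask_text are hypothetical names used only for this local check.
RULES = [
    (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', 0),
    (r'linkedin\.com/in/[A-Za-z0-9_-]+', '[LINKEDIN]', re.IGNORECASE),
    (r'\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b', '[POSTCODE]', re.IGNORECASE),
]


def mask_text(text: str) -> str:
    """Apply each rule in order, replacing every match with its placeholder."""
    for pattern, placeholder, flags in RULES:
        text = re.sub(pattern, placeholder, text, flags=flags)
    return text


# Made-up sample text, not taken from the tutorial's sample CVs.
sample = "Contact: jane.doe@example.com, linkedin.com/in/jane-doe, Bristol BS8 1AF"
masked = mask_text(sample)
print(masked)  # Contact: [EMAIL], [LINKEDIN], Bristol [POSTCODE]
```

Keeping the patterns in one list makes it easier to add or reorder rules. The rules run in sequence, so an earlier replacement changes the text that later patterns see.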

## Index Files

Now, let's upload sample files to the workspace and index them. 

1. Download the sample files from the <a href="https://drive.google.com/drive/folders/1w_XcScBBN29GlvW0n9MuobyBNMW9h-Cb?usp=sharing" target="_blank">sample-cvs</a> folder in Google Drive and unzip them to a location on your computer. You should have five PDF files.
2. In <ProductShortName />, go to **Files** and click **Upload Files**.
3. Choose the five PDF files you just downloaded and click **Upload**.
4. Wait until the upload finishes. You should have five PDF files in your workspace.
5. Go to **Indexes** and click your PII-masking-index. This opens the index details page where you can see it's indexing the files you uploaded. 
    <ClickableImage
      src="/img/tutorials/pii-indexing-status.png"
      alt="The index details page with the indexing status"
      size="large"
    />
6. Wait until the index status changes to *Indexed* and all the files show as *Indexed*.
7. Click **View** next to the first file and switch to the **Documents** tab. You should see the document produced from this file with the sensitive information masked.
    <ClickableImage
      src="/img/tutorials/document-pii-hidden.png"
      alt="The document preview with the sensitive information masked"
      size="large"
    />

**Result**: Congratulations! You have created an index that processes PDF files and masks sensitive personal information using custom code. You've also learned how to use the `Code` component in your apps.
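
Regex-based masking can miss formats it wasn't written for, so it may be worth spot-checking document text copied from the **Documents** tab. The sketch below is one possible way to do that locally: it reuses three patterns from the tutorial to report anything that still looks like PII. The `PII_CHECKS` dictionary, the `find_unmasked` helper, and both sample strings are hypothetical and exist only for this illustration.

```python
import re
from typing import Dict, List

# Patterns copied from the tutorial's Code component; the dictionary itself
# and the labels are assumptions made for this sketch.
PII_CHECKS = {
    "email": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    "linkedin": re.compile(r'linkedin\.com/in/[A-Za-z0-9_-]+', re.IGNORECASE),
    "github": re.compile(r'github\.com/[A-Za-z0-9_-]+', re.IGNORECASE),
}


def find_unmasked(text: str) -> Dict[str, List[str]]:
    """Return every substring that still matches a PII pattern, grouped by label."""
    hits: Dict[str, List[str]] = {}
    for label, pattern in PII_CHECKS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            hits[label] = matches
    return hits


# Made-up examples: one fully masked string and one with a leftover GitHub URL.
clean = "Contact: [EMAIL] | [LINKEDIN] | [GITHUB]"
leaky = "Contact: [EMAIL] | github.com/jdoe"

print(find_unmasked(clean))  # {}
print(find_unmasked(leaky))  # {'github': ['github.com/jdoe']}
```

An empty result only means these specific patterns found nothing. It doesn't guarantee the text is free of personal information.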
