Data Anonymizer

The Data Anonymizer lets you protect sensitive information in your text.
It scans the content you give it, finds names, emails, phone numbers, and other personal data, and swaps them with placeholder text so you can share or store the data safely.

How it Works

  1. Read the text – The component looks at the field you specify (default is "text") in each data item.
  2. Split into chunks – Long passages are broken into smaller pieces so the analysis runs smoothly.
  3. Detect language – It figures out which language the chunk is in to use the right model.
  4. Choose a model
    • spaCy (fast, local) or
    • LLM (e.g., Gemma3) for more advanced recognition.
      You can pick the model size (small, medium, large) if you use the LLM.
  5. Find entities – The component looks for the types you selected (e.g., PERSON, EMAIL_ADDRESS).
    You can also add your own custom patterns or tell it to ignore certain entities.
  6. Replace with dummy values – Each detected piece of personal data is replaced with a generic placeholder (e.g., “John Doe” → “Person_1”).
  7. Return results – The original data is kept, and two new fields are added:
    • text_anonymized – the cleaned text.
    • anonymizer_mapping – a map showing what was replaced with what, useful for audit or re‑identification if needed.

All of this happens inside Nappai; no external services are called unless you choose an LLM model, which will use the LLM API.
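The steps above can be sketched in plain Python. This is an illustrative sketch only: the real component uses NER models (spaCy or an LLM) rather than regexes, and the `anonymize` function and its patterns are assumptions for demonstration, not the component's actual implementation.

```python
import re

# Illustrative detection patterns. The real component detects entities with
# an NER model; regexes are used here only to keep the sketch self-contained.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def anonymize(text: str) -> dict:
    """Replace detected entities with placeholders and record the mapping."""
    mapping = {}   # original value -> placeholder
    counters = {}  # per-entity-type counter for numbered placeholders
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            if match in mapping:
                continue
            counters[entity_type] = counters.get(entity_type, 0) + 1
            placeholder = f"{entity_type}_{counters[entity_type]}"
            mapping[match] = placeholder
            text = text.replace(match, placeholder)
    return {"text_anonymized": text, "anonymizer_mapping": mapping}

result = anonymize("Contact jane@example.com or +1 555 123 4567.")
print(result["text_anonymized"])
# Contact EMAIL_ADDRESS_1 or PHONE_NUMBER_1.
```

The key idea the sketch preserves is that the mapping is returned alongside the cleaned text, so every replacement remains auditable.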

Inputs

Input Fields

  • Data: Data to be anonymized
  • Analyze Fields: Fields to be analyzed
  • Custom Recognizers: Custom recognizers to be added
  • Source data input key: Input key to read from each data item
  • Spacy Ignore Entities: Entities to be ignored during recognition
  • Model Size: Model size to use for anonymization
  • NER Model Name: NER model name to use for anonymization
  • Remark Anonymization: Remark anonymization

Outputs

  • Data: The original data enriched with two new keys:
    • text_anonymized – the anonymized text.
    • anonymizer_mapping – a dictionary that shows which original values were replaced.
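For illustration, a single output data item might look like the following (the field names match the Outputs list above; the values themselves are hypothetical). The mapping can also be inverted to restore the original text for an audit:

```python
# A hypothetical data item after anonymization. The original "text" field is
# kept; "text_anonymized" and "anonymizer_mapping" are added by the component.
item = {
    "text": "Contact Jane Doe at jane@example.com.",
    "text_anonymized": "Contact Person_1 at EMAIL_ADDRESS_1.",
    "anonymizer_mapping": {
        "Jane Doe": "Person_1",
        "jane@example.com": "EMAIL_ADDRESS_1",
    },
}

# Invert the mapping (placeholder -> original) to re-identify the text.
reverse = {v: k for k, v in item["anonymizer_mapping"].items()}
restored = item["text_anonymized"]
for placeholder, original in reverse.items():
    restored = restored.replace(placeholder, original)

assert restored == item["text"]
```

Because re-identification is this easy, treat the mapping itself as sensitive data.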

Usage Example

  1. Add the component to your workflow.
  2. Connect a data source (e.g., a “Data Importer” that pulls raw text).
  3. Set the inputs:
    • Source data input key → "text" (or whatever field holds the text).
    • Analyze Fields → choose the types you care about (e.g., PERSON, EMAIL_ADDRESS).
    • NER Model Name → "spacy" for quick local runs or "gemma3:12b" for deeper analysis.
  4. Run the workflow.
  5. Use the output:
    • Feed the anonymized data into a “Data Exporter” to save it.
    • Pass it to a “Data Visualizer” to create charts without exposing personal info.
    • Store it in a database for compliance audits.

Related Components
  • Data Importer – Pull raw data into the workflow.
  • Data Exporter – Save the anonymized data to files or databases.
  • Data Processor – Perform additional transformations after anonymization.
  • Data Analyzer – Run analytics on the cleaned data.
  • Data Visualizer – Create charts and dashboards from anonymized data.

Tips and Best Practices

  • Start small: Test the component on a short sample to verify that the right entities are being replaced.
  • Use custom recognizers when you have domain‑specific patterns (e.g., company IDs).
  • Ignore non‑critical entities (like dates or numbers) to keep the text readable.
  • Choose the right model: spaCy is fast for most use‑cases; switch to an LLM only if you need higher accuracy.
  • Check the mapping: The anonymizer_mapping field lets you audit what was changed, which is handy for compliance.
  • Keep the key consistent: If you change the Source data input key, update all downstream components that rely on the original field.
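To illustrate the custom-recognizer tip, a domain-specific pattern such as a hypothetical company-ID format ("ACME-" followed by six digits) can be captured with a regex. The actual configuration format for the Custom Recognizers input depends on Nappai; this snippet only demonstrates the kind of pattern you would supply:

```python
import re

# Hypothetical company-ID format: "ACME-" followed by exactly six digits.
# Word boundaries keep the pattern from matching inside longer tokens.
COMPANY_ID = re.compile(r"\bACME-\d{6}\b")

text = "Order placed by customer ACME-004217 yesterday."
matches = COMPANY_ID.findall(text)
print(matches)  # ['ACME-004217']
```

Testing the regex on a short sample like this, before wiring it into the workflow, follows the "start small" tip above.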

Security Considerations

  • The component never sends your data outside of Nappai unless you select an LLM model that uses an external API.
  • When using an LLM, ensure that the API key is stored securely and that the LLM provider complies with your data‑privacy policies.
  • The anonymized output is still part of your workflow; treat it with the same security controls as any other sensitive data.