Show HN: Deidentification, Python tool for removing personal info using NLP
1 point by jftuga 5 months ago | 0 comments
I created a Python library and CLI to automatically identify and remove personal information from text documents using Natural Language Processing. It has been used to de-identify internal employee surveys and patient satisfaction surveys.
What my project does:
* Identifies and replaces person names using spaCy's transformer model
* Converts gender-specific pronouns to neutral alternatives
* Handles possessives and hyphenated names
* Offers HTML output with color-coded replacements
___
Here's a quick example:
Input: John Smith's report was excellent. He clearly understands the topic.
Output: [PERSON]'s report was excellent. HE/SHE clearly understands the topic.
___
This was a fun project to work on, especially solving the challenge of maintaining correct character positions during replacements. Processing the replacements backwards, from the end of the text toward the start, was a neat way to avoid recalculating positions after each replacement, since a replacement never shifts the offsets of anything before it.
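To make the backwards idea concrete, here is a minimal sketch in plain Python. It is not the library's actual code: the `replace_spans` helper and the hard-coded spans are hypothetical stand-ins for whatever the NLP pass reports, and the replacement strings are simply copied from the example above.

```python
def replace_spans(text, spans):
    """Apply (start, end, replacement) edits to text, last span first.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    # Sorting by start offset in descending order means each edit only
    # changes text to the right of the spans still waiting to be applied,
    # so their offsets stay valid without any recalculation.
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith's report was excellent. He clearly understands the topic."

# Assumed spans: in a real pipeline these character offsets would come
# from the entity and pronoun detection step, not from str.find.
name_start = text.find("John Smith")
pronoun_start = text.find("He ")
spans = [
    (name_start, name_start + len("John Smith"), "[PERSON]"),
    (pronoun_start, pronoun_start + len("He"), "HE/SHE"),
]

result = replace_spans(text, spans)
print(result)
```

Going front to back instead would require adding the length difference of every earlier replacement to each later offset, which is the bookkeeping the backwards order avoids.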
* blog post: https://gitgist.com/posts/introducing-deidentification-pytho...