Impact of Deep Natural Anonymization on the Training of Machine Learning Models

12 March 2021

Management Summary

A major obstacle to training machine learning models for image recognition is obtaining large amounts of visual data that comply with data privacy regulations. Our Deep Natural Anonymization (DNAT) automatically anonymizes personally identifiable information in image and video data while preserving the relevant visual information and context. This analysis shows that Deep Natural Anonymization has no significant impact on the training of machine learning models compared to using the original images, making it a valuable tool for ensuring privacy when training machine learning models on image data.

What is Deep Natural Anonymization (DNAT) and why does it exist?

DNAT is an advanced solution for protecting personally identifiable information (PII) in image and video data. It automatically detects and anonymizes personal information such as faces and license plates, thereby ensuring privacy in machine learning. Conventional video redaction techniques blur the PII, which causes a loss of information and context in the image. DNAT instead replaces the original PII with an artificial counterpart that has a natural appearance and preserves the content of the image.

Sample image from the Cityscapes dataset after being processed by Deep Natural Anonymization.

How do you evaluate the impact of DNAT on machine learning?

We trained on both unmodified data and anonymized data to understand the differences in model accuracy. Keeping the hyperparameters identical for both training runs ensures that any differences in accuracy are attributable to the difference between the unmodified and anonymized data.
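The controlled comparison above can be sketched as follows. This is a hypothetical illustration, not brighter AI's actual training code: the dataset names, hyperparameter values, and placeholder metric are assumptions, and `train_and_evaluate` stands in for a full Mask R-CNN training run.

```python
import random

# Hypothetical, fixed hyperparameters shared by both runs (values are illustrative).
HYPERPARAMS = {"lr": 0.02, "batch_size": 8, "epochs": 24, "seed": 42}

def train_and_evaluate(dataset_name, hp):
    """Stand-in for training a model on `dataset_name` and reporting its mAP."""
    random.seed(hp["seed"])  # same seed in both runs, so only the data differs
    # ... a real experiment would train Mask R-CNN here and evaluate it ...
    return 0.36  # placeholder metric; a real run reports the measured mAP

# Two runs that differ only in the dataset; everything else is held constant.
map_orig = train_and_evaluate("cityscapes_original", HYPERPARAMS)
map_anon = train_and_evaluate("cityscapes_dnat", HYPERPARAMS)
delta = abs(map_orig - map_anon)
```

Because the hyperparameters and seed are identical, any non-zero `delta` in a real experiment can only come from the anonymization itself.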

We chose a standardized publicly available dataset called Cityscapes. It contains images of street scenes recorded from various locations, in different weather conditions, and spanning different dates and times. We used brighter AI’s DNAT to create an anonymized copy of the entire Cityscapes dataset.

We selected Mask R-CNN, a detection and instance segmentation approach, for our experiment, most notably for its applicability to our dataset and its state-of-the-art performance across multiple public benchmarks.

What are the results of your analysis?

Our experiments show that brighter AI's DNAT has no significant impact on the accuracy of a state-of-the-art machine learning model (Mask R-CNN) trained on the public Cityscapes dataset: the difference in mean average precision (mAP) between training on original versus anonymized data is negligible.
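For context, mAP is the mean over object classes of the average precision (AP), the area under a class's precision-recall curve over ranked detections. A minimal sketch of the non-interpolated AP computation, assuming every ground-truth object of the class appears in the ranked list:

```python
def average_precision(scores, labels):
    """AP for one class: area under the precision-recall curve.

    scores: confidence of each detection; labels: 1 = true positive, 0 = false positive.
    Assumes every ground-truth object is represented in `labels`.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # rank by confidence
    total_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            # each true positive adds precision * recall-step (1 / total_pos)
            ap += (tp / (tp + fp)) * (1.0 / total_pos)
        else:
            fp += 1
    return ap

# mAP is the mean of per-class APs (toy values for two classes).
aps = [average_precision([0.9, 0.8, 0.7], [1, 1, 1]),       # perfect ranking -> 1.0
       average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1])]
map_value = sum(aps) / len(aps)
```

A negligible difference in this quantity between the two training runs means the model ranks and localizes objects equally well on anonymized data.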

The accuracy of brighter AI's Deep Natural Anonymization
Data anonymized by brighter AI’s Deep Natural Anonymization preserves the same level of accuracy as unmodified data during machine learning model training.

Andreea Mandeal
Head of Marketing