Impact of Deep Natural Anonymization on the Training of Machine Learning Models

12. March 2021

Management Summary

A major problem for training machine learning models for image recognition is the availability of large amounts of visual data that is compliant with data privacy regulations. Our Deep Natural Anonymization Technology automatically anonymizes personally identifiable information in image and video data while keeping relevant visual information and context. This analysis shows that Deep Natural Anonymization has no significant impact on the training of machine learning models compared to using the original images. It is therefore a valuable tool to protect identities when working with image data for the training of machine learning models.


What is Deep Natural Anonymization and why does it exist?

Our Deep Natural Anonymization Technology (DNAT) is an advanced solution to protect personally identifiable information (PII) in image and video data. This technology automatically detects and anonymizes personal information such as faces and license plates. Conventional video redaction techniques such as blurring also hide PII, but they lose information and context in the image. This is why we use DNAT, which replaces the original PII with an artificial counterpart that has a natural appearance and preserves the content information of the image.

Sample image from Cityscapes dataset after processing it with Deep Natural Anonymization.

How do you evaluate the impact of DNAT on machine learning?

Our aim is to train a model on both the original data and the anonymized data and to measure the difference in model accuracy between the two training paths. Keeping the hyperparameters the same for both training paths lets us attribute any differences in accuracy to the differences between the original and anonymized data.
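The paired setup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the configuration used in the white paper: the `TrainConfig` fields, the default values, and the dataset paths are all hypothetical. The point is that the two runs are built from one shared set of hyperparameters and differ only in the data they read.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical training configuration; every field and default is illustrative."""
    data_root: str
    learning_rate: float = 0.01
    batch_size: int = 8
    max_iterations: int = 24000
    seed: int = 0


def paired_configs(original_root, anonymized_root, **hyperparameters):
    """Build two configs that share all hyperparameters and differ only in data_root."""
    base = TrainConfig(data_root=original_root, **hyperparameters)
    return base, replace(base, data_root=anonymized_root)


def differing_fields(a, b):
    """Return the sorted names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


# Hypothetical paths: one copy of the dataset per training path.
original_cfg, anonymized_cfg = paired_configs(
    "data/cityscapes",
    "data/cityscapes_anonymized",
    learning_rate=0.02,
)

# Sanity check before training: the data source must be the only difference.
print(differing_fields(original_cfg, anonymized_cfg))
```

Checking the differing fields before launching both runs guards against an accidental hyperparameter change being mistaken for an effect of anonymization.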

We choose Cityscapes, a standardized, publicly available dataset. It contains images of street scenes recorded in a wide range of locations, in different weather conditions, and on different dates and at different times of day. We use brighter AI’s DNAT to create an anonymized copy of the entire Cityscapes dataset.

We select a detection and instance segmentation approach called Mask R-CNN for our experiment, most notably due to its applicability to our dataset and its state-of-the-art performance across multiple public benchmarks.

What are the results of your analysis?

Through experiments, we demonstrate that brighter AI’s DNAT does not have any significant impact on the accuracy of Mask R-CNN, a state-of-the-art machine learning model, when it is trained on the public Cityscapes dataset. We show that the difference in mean average precision (mAP) between a model trained on original data and one trained on anonymized data is negligible.
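As a rough illustration of how such a comparison could be expressed, the sketch below computes the absolute and relative mAP gap between two runs and checks it against a tolerance. The function names and the default tolerance of 0.01 are assumptions made for this example. The white paper's own threshold for "negligible" is not stated here, and no measured values are reproduced.

```python
def map_gap(map_original, map_anonymized):
    """Return (absolute, relative) mAP difference, anonymized minus original.

    Both inputs are expected as fractions in [0, 1].
    """
    for value in (map_original, map_anonymized):
        if not 0.0 <= value <= 1.0:
            raise ValueError("mAP must be a fraction between 0 and 1")
    absolute = map_anonymized - map_original
    # The relative gap is undefined when the baseline mAP is zero.
    relative = absolute / map_original if map_original > 0 else float("nan")
    return absolute, relative


def is_negligible(map_original, map_anonymized, tolerance=0.01):
    """True if the absolute mAP gap is within the tolerance (assumed value, not from the paper)."""
    absolute, _ = map_gap(map_original, map_anonymized)
    return abs(absolute) <= tolerance
```

In practice the tolerance would be chosen relative to the run-to-run variation of the model, for example by repeating each training path with several random seeds.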


Sreenjoy Chatterjee
Machine Learning Engineer