Invented by Ishaan Nerurkar, Christopher Hockenbrocht, Liam Damewood, Mihai Maruseac, Alexander Rozenshteyn, Leapyear Technologies Inc

Differentially Private Machine Learning Using a Random Forest Classifier

Random forests have several advantages over other decision-tree classifiers, such as tolerance for unbalanced data and noise. Unfortunately, a trained forest can still leak information about its training data, so it is not an ideal solution for protecting private information on its own.

This paper describes a method for creating a differentially private random forest. We use pruning strategies to reduce the feature set and conserve the privacy budget.

What is Differentially Private Machine Learning?

Differential privacy is a method for safeguarding an individual’s privacy when using their information in machine learning algorithms. It establishes a mathematical framework guaranteeing that an algorithm’s output changes only slightly whether or not any single individual’s record is present in the training data set, so adversaries cannot use the output to identify individuals.

Implementing differentially private algorithms requires several techniques and methods. One popular option is to perturb the data or the model’s outputs with calibrated noise so they cannot be used for identification purposes. Another approach replaces non-private optimizers with differentially private ones.
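As a rough illustration of the second approach (a differentially private optimizer), the sketch below shows one DP-SGD-style gradient step for 1-D linear regression: per-example gradients are clipped to bound any single record’s influence, then Gaussian noise scaled to that bound is added. All names and hyperparameters here are illustrative, not taken from the patent.

```python
import random

def dp_gradient_step(w, examples, lr, clip_norm, noise_std):
    """One differentially private SGD step for 1-D linear regression
    (squared loss). Clipping bounds each example's contribution
    (its sensitivity); Gaussian noise masks the remainder.
    """
    grads = []
    for x, y in examples:
        g = 2.0 * (w * x - y) * x                      # d/dw of (w*x - y)^2
        g *= min(1.0, clip_norm / (abs(g) + 1e-12))    # clip to clip_norm
        grads.append(g)
    avg = sum(grads) / len(grads)
    # Noise scale is proportional to the clipping bound.
    avg += random.gauss(0.0, noise_std * clip_norm / len(grads))
    return w - lr * avg
```

With `noise_std=0` this reduces to ordinary clipped SGD; raising `noise_std` trades accuracy for privacy, which is the trade-off the article discusses below.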

These approaches present both a challenge and a potential solution for the security of machine learning algorithms. They are complex and require extensive development effort; however, they have the potential to provide valuable advantages across numerous industries.

Though the market for Differentially Private Machine Learning has seen rapid growth in recent years, several obstacles must still be overcome before the technology can be widely adopted. Most notably, algorithms must be able to generate synthetic data from sensitive sources without disclosing users’ personal details.

Second, algorithms must be able to learn accurately from data without violating individual privacy rights. To do this, they add noise to the data or to intermediate computations in a way that makes it difficult for an attacker to single out any individual’s record.

Random forest classifiers are one type of machine learning method that can be employed to accomplish this objective. A random forest combines many decision trees to predict labels for data points, which may include sensitive records such as email addresses or phone numbers.

The algorithm works on the principle that an ensemble of trees can learn and make predictions by combining various features. Studies have demonstrated its efficacy for various machine learning tasks, such as image recognition and speech recognition.

What is the Market for Differentially Private Machine Learning?

Differentially private machine learning is a method for training algorithms on sensitive data without endangering individual users’ privacy. It enables analysts to conduct all statistical analysis as they would with non-private information, while restricting how far each data point can be used to infer anything about an individual user’s identity.

The market for differentially private machine learning is being driven by research firms that need to protect their customer and partner data from hackers and other malicious actors. Large corporations like Google and Apple have adopted privacy-preserving data analysis into their business models, while smaller firms and software startups are also finding value in this technique.

Many machine learning algorithms can potentially expose individual training data points, even with formal security and confidentiality mechanisms in place. This creates a trade-off between model accuracy and privacy, as higher levels of privacy protection may reduce the model’s usefulness.

Researchers have devised various methods for safeguarding individual privacy while still using machine learning algorithms. One promising technique is differential privacy, which describes patterns within data without disclosing personal information about individuals in the database.

Differential privacy (DP) has been successfully applied to numerous datasets, providing strong statistical performance in various scenarios. As an example, statistical query release allows data owners to specify counting queries such as “how many men live in Massachusetts?” and receive approximate answers that allow researchers to draw similar conclusions without access to the full dataset.
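A minimal sketch of such a counting query using the Laplace mechanism is shown below. A counting query has sensitivity 1 (adding or removing one person changes the count by at most 1), so Laplace noise with scale 1/ε suffices. The record format and predicate are invented for illustration.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one zero-mean Laplace sample via the inverse CDF."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(max(1.0 - 2.0 * abs(u), 1e-12))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer a counting query with epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical records: (sex, state) pairs.
people = [("M", "MA"), ("F", "MA"), ("M", "NY"), ("M", "MA")]
noisy = private_count(people, lambda p: p == ("M", "MA"), epsilon=0.5)
```

Smaller ε means larger noise and stronger privacy; the analyst receives only the noisy answer, never the true count.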

A more complex use case is data sharing in bioinformatics, where conventional anonymization measures may not be enough to protect research subjects’ personal information. Here, DP can be especially beneficial, as it offers the chance to share research data without disclosing anyone’s identity.

This paper proposes a novel approach to achieving differential privacy when using random decision trees for classification. We first demonstrate that a differentially private random forest can be efficiently trained with minimal noise, thus limiting the impact of the privacy budget on accuracy. Secondly, we present an efficient general method using smooth sensitivities to reduce noise required by random forests and demonstrate its applicability across multiple classes of classification problems with varying levels of accuracy.

What are the Challenges for Differentially Private Machine Learning?

Differential privacy is an increasingly critical topic in machine learning. This concept seeks to protect individual privacy by preventing data leaks that might arise from using private datasets. This is particularly crucial in areas like medical imaging, where models that generalize well require large multi-centre datasets during training but must never reveal personal test data once deployed.

Although many advances have been made to enable the use of private deep learning models, existing techniques show large performance drops compared with training without a privacy-protection layer in the pipeline. This is mainly due to reduced sample efficiency and the added complexity of learning, under noise, the high-level features that enable classifiers to perform certain tasks.

To address this, we developed a novel approach to privacy budget allocation based on out-of-bag estimation in random forests. The algorithm calculates decision tree weights and feature weights from out-of-bag data under differential privacy protection, then applies statistical methods to partition features into best, pruned, and removable sets.

Our results demonstrate that our technique not only achieves comparable accuracy to thresholdout, but also has negligible overfitting when applied to simulated data with characteristics commonly encountered in biological imaging. Furthermore, it performs well across a variety of simulated datasets, demonstrating its capacity for handling long-tailed distributions and sparse datasets alike.

We are also developing model agnostic private learning, a class of machine learning algorithms that provides differential privacy without concern about the internal workings of any particular algorithm. Utilizing the Sample and Aggregate framework [NRS07], we have created an efficient generic method to add differential privacy to non-private algorithms without regard to how the model was trained.
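A toy sketch of the Sample and Aggregate idea follows: split the data into disjoint chunks, train any black-box (non-private) learner on each, then release only the label with the highest noisy vote count (report-noisy-max with Laplace noise). The chunking scheme, noise scale, and function names are illustrative assumptions, not the framework’s actual API.

```python
import math
import random
from collections import Counter

def laplace(scale: float) -> float:
    """Zero-mean Laplace sample via the inverse CDF."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(max(1.0 - 2.0 * abs(u), 1e-12))

def sample_and_aggregate(data, train_fn, point, k, epsilon):
    """Split data into k disjoint chunks, train a black-box model on
    each, and release the noisy majority label for `point`. Only the
    aggregated, noised vote counts leave the system, so the internal
    workings of `train_fn` never matter.
    """
    data = list(data)
    random.shuffle(data)
    chunks = [data[i::k] for i in range(k)]
    votes = Counter(train_fn(chunk)(point) for chunk in chunks)
    noisy = {label: n + laplace(2.0 / epsilon) for label, n in votes.items()}
    return max(noisy, key=noisy.get)
```

Because each record influences only one chunk (hence at most one vote), the sensitivity of the vote tally is bounded regardless of how the underlying learner behaves.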

This approach is remarkably straightforward to implement and has been demonstrated to produce impressive results, especially on complex data. Our experiments show that training a model on moderately-sized data sets consistently produces better results than simply decreasing its learning rate (in non-private training) over time.

What are the Solutions for Differentially Private Machine Learning?

One popular solution for Differentially Private Machine Learning is a Random Forest Classifier. This algorithm has several advantages over other supervised machine learning algorithms, such as its tolerance for unbalanced data and noise, and its capacity to handle large training sets.

Random forest models can still overfit, particularly during feature selection. Overfitting can lead the model to draw incorrect conclusions and degrade its performance, especially in bioinformatics, where data combines large feature spaces with small sample sizes.

Controlling feature selection, so that desirable features can be distinguished from undesirable ones, is therefore essential. This can be accomplished by evaluating each feature’s importance and eliminating those that contribute little to the model’s accuracy.

This approach also avoids repeated holdout evaluations, which can be costly when a feature selection algorithm must run across multiple data sets to construct models for different use cases.

In this paper, we propose a solution to this problem by designing an algorithm that allocates an adequate privacy protection budget via out-of-bag estimation in a random forest. The method utilizes tree weights, feature weights and statistical techniques in an efficient manner to segment features into best, pruned and removable feature sets.
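A simplified sketch of the feature-segmentation step is shown below: given per-feature importance scores (in the paper these would come from out-of-bag estimation under differential privacy), features are ranked and split into best, pruned, and removable sets. The threshold fractions and function name are illustrative placeholders, not the paper’s actual parameters.

```python
def partition_features(importances, keep_frac=0.5, drop_frac=0.2):
    """Partition features by importance score into best / pruned /
    removable sets. `importances` maps feature name -> score.
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    n = len(ranked)
    n_best = max(1, int(n * keep_frac))   # top fraction: keep and spend budget on
    n_drop = int(n * drop_frac)           # bottom fraction: remove outright
    best = set(ranked[:n_best])
    removable = set(ranked[n - n_drop:]) if n_drop else set()
    pruned = set(ranked) - best - removable
    return best, pruned, removable
```

The privacy budget can then be concentrated on the best set, since noise added to low-importance features buys little accuracy.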

Furthermore, tree weights and feature weights are calculated and optimized using out-of-bag data under differential privacy protection, so that the budget allocation algorithm satisfies ε-differential privacy. This makes the algorithm more targeted and principled, and allows it to be applied to other tree-based models as well.

Consequently, this algorithm offers a flexible and privacy-preserving solution to overfitting and feature selection in random forests without needing additional resources. Furthermore, it is scalable to large datasets. It could be applied in bioinformatics, finance, insurance, and other fields where an accurate model is essential to obtain satisfactory outcomes.

The Leapyear Technologies Inc. invention works as follows

A client requests that a differentially private random forest classifier be generated using restricted data. In response to the request, the differentially private random forest classifier is created: the system determines a number of decision trees and then generates that number of trees. Each decision tree is created by generating a set of candidate splits from the restricted data; each split is assigned an information gain, and the exponential mechanism selects a split to add to the tree. The client is then given the differentially private random forest classifier.

Background for Differentially Private Machine Learning Using a Random Forest Classifier

Field of Disclosure

The invention generally concerns building classifiers for computerized machine learning and, more specifically, protecting the privacy of the training data required to build a machine-learned classifier.

Description of Related Art

Data about individuals, including financial records, health data, location information, and browsing habits, can be valuable for collaboration and analysis. Many technologies allow statistical or predictive analysis of personal information. Medical research institutions, for example, use information on individuals’ medical histories to support epidemiological studies. Map providers use location data gathered from mobile devices to provide routing guidance and traffic information. Technology companies collect data about Internet users’ behavior to improve their products, for example by redesigning user interfaces to make human-computer interaction easier and by offering recommendations and sponsored messages.

However, this data’s personal nature limits its utility. Strict guidelines govern how personal data can be collected, used, and shared. Individuals have expectations about the use of their personal data and can react negatively to it being made public. Companies that gather and store personal data try to extract value without falling foul of these expectations and rules.

Limiting access to personal data is one of the many techniques that can be used: access controls restrict the data to those with the appropriate credentials. Another set of techniques removes personally identifiable information from the data through masking, hashing, anonymization, aggregation, and tokenization. These techniques are resource-intensive and can compromise analytical utility; data masking, for example, may remove or alter data in ways that distort its statistical properties.

Differential privacy is another technique. To protect private data, differential privacy technology injects noise into statistical databases. Open questions in this space concern how noise should be added for different use cases and how much noise is needed. These questions are difficult to answer because of the resources potentially available to determined adversaries (e.g., the computing power of an attacker trying to gain access to the private information), the resources (e.g., computing power) available to the database, and the types of queries the database supports.

Machine-learned classification is one example of where these issues arise. Here, the known properties of training data are used to build statistical models called “classifiers” that map new data into one or more output classes. The training data may contain private information; for example, it could include patients’ medical histories and be used to create a classifier that detects a specific medical condition. Absent a privacy-protection mechanism such as differential privacy, an adversary might be able to recover at least some of the private training data by examining the machine-learned classifier.

A differentially private security system is communicatively connected to a database of restricted data. A client requests that the system generate a differentially private random forest classifier using the restricted data. The system then identifies the level of differential privacy appropriate to the request; a privacy parameter ε indicates the amount of information about the database that may be released as a result of the query.

The differentially private security system generates the differentially private random forest classifier in response to the request: it determines a number of decision trees and then generates that number of trees. Each decision tree is created by generating a set of candidate splits from the restricted data; each split is assigned an information gain, and the exponential mechanism selects a split to add to the tree. The client is then given the differentially private random forest classifier.
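The split-selection step described above can be sketched as follows: given candidate splits scored by information gain, the exponential mechanism samples a split with probability proportional to exp(ε · gain / 2Δ), so higher-gain splits are exponentially more likely, yet any split can be chosen, which is what protects privacy. The data structures and the sensitivity value are illustrative assumptions, not the patent’s exact formulation.

```python
import math
import random

def exponential_mechanism_split(candidate_splits, epsilon, sensitivity=1.0):
    """Pick a decision-tree split via the exponential mechanism.

    `candidate_splits` maps each split (any hashable description) to
    its information gain, used as the utility score.
    """
    splits = list(candidate_splits)
    # Subtract the max score for numerical stability before exponentiating.
    m = max(candidate_splits.values())
    weights = [math.exp(epsilon * (candidate_splits[s] - m) / (2.0 * sensitivity))
               for s in splits]
    return random.choices(splits, weights=weights, k=1)[0]
```

As ε grows, the choice concentrates on the best split (approaching the non-private greedy tree builder); as ε shrinks, the choice approaches uniform and reveals less about the training data.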

FIG. 1 illustrates the system, according to one embodiment.

FIG. 7A illustrates a recursive process to identify threshold points in the classification output vector for a differentially private model test query, according to one embodiment.

FIG. 7B is an illustration of a confusion matrix generated by a differentially private model test query.

FIG. 8 shows a system-level modification of FIG. 1 that allows the client access to a differentially private synthetic database, according to one embodiment.

