Microsoft: This clever open-source technique helps to protect your privacy
Adding statistical noise to a data set can guarantee that there are no accidental information leaks. It’s a difficult task, made easier by the open-source SmartNoise framework.
Data is the new oil, as the saying goes: both valuable and requiring a lot of clean-up if it leaks. The fear that individuals can be re-identified from anonymised data puts people off contributing their information and makes it harder for researchers to get access to sensitive data and unlock insights that could help everyone. That applies to everything from health and education to Windows bugs and how Office is used.
Even with clear documentation of what’s collected, some users worry that the telemetry sent by Windows might reveal personal information. But the Windows data science team doesn’t want personal information when they’re looking for patterns of bugs and configurations, Sarah Bird, principal program manager for responsible AI at Microsoft, told TechRepublic.
“We don’t even want to know this information about our users. We want to know aggregate [information]. We don’t want a situation where we accidentally learned something that we didn’t even want to know.”
There’s a similar problem with a lot of machine learning, and the solution is differential privacy. This adds random ‘statistical noise’ to the results of queries — enough to protect individual privacy without compromising the accuracy of answers — in a way that can be proved to protect privacy.
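To make 'adding statistical noise to the results of queries' concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a counting query. It is an illustration only, not the code used in Windows or SmartNoise; the function names, the made-up device data and the epsilon value are all assumptions, and a production system would also need to guard against floating-point weaknesses that this sketch ignores.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer 'how many records match?' with Laplace noise added.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: 1 means the device hit a bug, 0 means it did not.
rng = random.Random(0)
devices = [1] * 3000 + [0] * 7000
answer = noisy_count(devices, lambda d: d == 1, epsilon=0.5, rng=rng)
print(round(answer))  # close to 3,000, but not tied to any one device
```

The noise is a few units wide while the count is in the thousands, which is the sense in which the large pattern survives and the contribution of any single device is hidden.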
“You only want to learn the larger patterns in the data, and so what differential privacy is doing is adding some noise to hide those smaller patterns that you didn’t want to know anyway,” Bird explained.
Differential privacy protects against both attackers trying to dig out individual information and systems accidentally exposing it, she added. “If you’ve set the parameters correctly, it shouldn’t harm your analysis at all. It should enable you to learn those big patterns, but protect you from learning the smaller patterns that you shouldn’t learn. The models are going to learn all sorts of things whether you want them to or not. We can actually guarantee, with a strong statistical guarantee, that we’re not going to learn that information as a result of this computation.”
Before the data collected on a Windows PC is sent to Microsoft, the telemetry system adds noise, so Microsoft can see the big picture of how Windows performs without getting information tied to any specific Windows user.
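The article does not say which mechanism Windows telemetry uses, so the sketch below swaps in randomized response, a classic way of adding noise on the device before anything is sent. Each device flips its own yes/no answer with a known probability, and the collector corrects for that bias in aggregate. The function names, the 20% crash rate and the epsilon value are hypothetical.

```python
import math
import random


def randomize_bit(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_rate(reports, epsilon):
    """Undo the known flipping bias to estimate the true fraction of 1s."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Hypothetical fleet: 20% of devices experienced a crash (bit = 1).
rng = random.Random(42)
true_bits = [1] * 20000 + [0] * 80000
reports = [randomize_bit(bit, 1.0, rng) for bit in true_bits]  # runs on each device
estimate = estimate_rate(reports, 1.0)                          # runs on the server
print(f"estimated crash rate: {estimate:.3f}")
```

No single report can be trusted, since any device may have flipped its answer, yet the aggregate estimate lands close to the real rate when enough devices report.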
It’s already common to add noise during machine learning to prevent a problem called over-fitting, which occurs when the system learns the training data so well that it gets impressive results that don’t transfer over to the live data you want to use it with. “This is conceptually similar,” Bird said, “except the great thing about differential privacy is that mathematical guarantee that if you add the right type of noise and you keep track of how much information you reveal, then you’ll actually be able to say ‘I cannot reverse-engineer this; I am not able to learn anything about any individual in the data set’.”
The idea of differential privacy goes back about 15 years. In 2006, Microsoft Research distinguished scientist Cynthia Dwork, one of the researchers who came up with the idea, described it to us as ‘working on answers to problems not everybody has figured out they have yet’.
As organizations like Netflix and AOL started releasing data sets that were supposed to have the personal data removed, it quickly became clear that if you had extra information about people who had contributed data, you could sometimes identify them in the anonymised data set. That had implications for sharing medical data, census information and other useful data sets for research.
The idea behind differential privacy is to remove the risk of putting your information in a database by guaranteeing that it can’t leak what you specifically contributed. The key point is whether the system behaves differently when your data is in the database and when it isn’t. Differential privacy hides that difference using a precisely calculated amount of noise in the query results.
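One way to see what 'hiding that difference' means is to compare how likely each published answer is with and without one person's record. The sketch below assumes a counting query protected with Laplace noise of scale 1/epsilon (a standard choice, not something the article specifies) and checks that the two likelihoods never differ by more than a factor of e^epsilon. The counts 41 and 42 are made up.

```python
import math


def laplace_pdf(x, center, scale):
    """Density of a Laplace distribution centred on `center`, evaluated at x."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


epsilon = 0.5
scale = 1.0 / epsilon        # noise scale for a count (sensitivity 1)
count_without_you = 41       # hypothetical true answer without your record
count_with_you = 42          # hypothetical true answer with your record

worst_ratio = 0.0
for step in range(0, 401):
    x = step / 4.0           # candidate published answers from 0 to 100
    with_you = laplace_pdf(x, count_with_you, scale)
    without_you = laplace_pdf(x, count_without_you, scale)
    worst_ratio = max(worst_ratio, with_you / without_you, without_you / with_you)

# The worst-case likelihood ratio stays at or below e^epsilon.
print(worst_ratio, math.exp(epsilon))
```

Whatever answer comes out, an observer can become at most e^epsilon times more confident that your record was in the database, which is the guarantee in miniature.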
“Suppose you have a corpus of private information and you’re seeking to understand the underlying population; you want to carry out statistical analyses of data,” Dwork explained at the time. “You also want to allow people to form their own queries, and you want to allow even adversarial people [to do that]. You can not only believe, but mathematically provably guarantee, that you’re preserving privacy.”
The amount of noise required depends not on the size of the database, but on how many times it will be queried. To avoid someone homing in on the real answer by repeatedly asking very similar questions, the magnitude of the noise added is tied to the number of queries that can be made against the database, or against specific data in it. Think of that as a privacy budget for the database (technically, it’s referred to as ‘epsilon’, and calculating the slope of the privacy risk using differential calculus gives the technique its name).
Sticking with the privacy budget means only sharing a database until that number of queries has been run against it.
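A privacy budget can be sketched as a simple counter. The example below assumes basic sequential composition, where the epsilon values of successive queries simply add up, and an even split of the budget across queries; real systems may use tighter accounting. The class and function names are hypothetical and are not part of any library.

```python
class BudgetExhausted(Exception):
    """Raised when a query would push spending past the total budget."""


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record the cost of one query, or refuse it if the budget is gone."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExhausted("no budget left for this query")
        self.spent += epsilon

    def remaining(self):
        return max(0.0, self.total_epsilon - self.spent)


def noise_scale(total_epsilon, num_queries, sensitivity=1.0):
    """Laplace scale per query when the budget is split evenly."""
    return sensitivity * num_queries / total_epsilon


budget = PrivacyBudget(total_epsilon=1.0)
for _ in range(4):
    budget.charge(0.25)          # four queries use up the whole budget
try:
    budget.charge(0.25)          # a fifth query is refused
    refused = False
except BudgetExhausted:
    refused = True

print(refused, budget.remaining(), noise_scale(1.0, 4), noise_scale(1.0, 40))
```

The last two values show the trade-off described above: allowing ten times as many queries on the same budget means ten times as much noise on each answer.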
“We know how much noise we have to add to ensure our definition of privacy,” Dwork told us. In some cases (but not all), that would be less than the sampling error in the database, giving you privacy ‘for free’.
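A back-of-the-envelope comparison shows how privacy can come 'for free'. The sketch below assumes a survey of 10,000 people estimating a proportion of 30%, a fixed sample size so that changing one person's answer moves the mean by at most 1/n, and Laplace noise with epsilon of 0.5. All of the numbers are illustrative.

```python
import math

n = 10_000        # hypothetical number of people in the database
p = 0.3           # hypothetical true proportion being estimated
epsilon = 0.5     # hypothetical privacy parameter

# Standard error that comes from sampling n people at all.
sampling_error = math.sqrt(p * (1.0 - p) / n)

# Laplace noise on a mean: sensitivity 1/n, so scale = 1 / (n * epsilon).
# The standard deviation of a Laplace distribution is sqrt(2) * scale.
noise_sd = math.sqrt(2.0) / (n * epsilon)

print(f"sampling error: {sampling_error:.5f}")
print(f"privacy noise:  {noise_sd:.5f}")
print(f"noise is smaller: {noise_sd < sampling_error}")
```

Under these assumptions the privacy noise is more than ten times smaller than the uncertainty the sample already carries. With a much smaller database or a much stricter epsilon the comparison can go the other way, which is why Dwork says it holds in some cases but not all.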
Differential privacy means thinking about how the data set will be queried, but one big advantage, Dwork told us, is that: “You don’t have to decide in advance what’s identifiable information. One of our goals is that you don’t have to think too much.”
But putting differential privacy into practice has taken a lot of work, and it’s mainly been used by large, sophisticated organizations like Apple, Microsoft and the US Census Bureau (whose use of it has proven controversial).
“We’re seeing organisations start using it, but it has been the more tech-savvy ones like Microsoft saying, ‘I want that guarantee that we’re not going to have that data leave’,” Bird said.
In fact, it was almost too hard even for Microsoft to use, especially because Windows telemetry uses the trickiest but most privacy-protecting option of adding noise locally, before the data even goes into the database.
“Our original use case in Windows telemetry was successful and it was released in production, but the experience was that they had to work closely with Microsoft researchers and build up a lot of differential privacy expertise themselves in Windows,” Bird said. “And they came out the other side of this going, ‘Wow, that was way too hard and we want to do it a lot more’.”
“We had several teams in Microsoft who were wanting to use this technology because it has that higher level of privacy and there isn’t any other technology that can give you that guarantee that you won’t leak information in the output of the computation,” she added.
That included Office and the AI for Good program, which wanted researchers to have better access to sensitive data like healthcare and education information. “We all want to use differential privacy and it cannot be as hard as it was in Windows, or no-one’s going to adopt this technology,” said Bird.
To help with that, Microsoft partnered with Harvard University (where Dwork is a professor) as part of the OpenDP initiative and released the SmartNoise open-source framework. Built in Rust, SmartNoise has connections for data lakes, SQL Server, Postgres, Apache Spark, Apache Presto and CSV files, and a runtime that can be used from C, C++, Python, R and other languages to generate and validate differential privacy results. It also has ways to control the number of queries that are allowed, so you don’t run out of the ‘budget’ of queries that can be protected by the level of noise set for the database.
When you train a model or query data protected by SmartNoise, it adds statistical noise to the results, calculates how much privacy risk that adds to the database and subtracts that amount from the budget for future queries and training runs. It can also be used to create synthetic data to use in machine learning. “That means you don’t need to worry about budget tracking because you use your budget to generate one data set and people can do whatever queries they want,” Bird explained.
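The synthetic data idea can be sketched without reference to SmartNoise's own API, which is not shown here. The example assumes a single categorical column whose possible values are fixed in advance: it spends the budget once on a noisy histogram, then samples as many synthetic records as needed from that histogram. The function names and the operating-system data are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthesize(records, categories, epsilon, n_synthetic, rng):
    """Build a noisy histogram once, then sample synthetic records from it.

    `categories` must be fixed in advance rather than read from the data.
    One person lands in exactly one bin, so each count gets Laplace noise
    of scale 1 / epsilon.
    """
    noisy_counts = {}
    for category in categories:
        true_count = sum(1 for record in records if record == category)
        # Clamp at zero so the counts can be used as sampling weights.
        noisy_counts[category] = max(0.0, true_count + laplace_noise(1.0 / epsilon, rng))
    # Assumes at least one noisy count is positive; otherwise choices() raises.
    return rng.choices(list(noisy_counts), weights=list(noisy_counts.values()), k=n_synthetic)


# Hypothetical private data: the operating system reported by each device.
rng = random.Random(7)
private = ["win10"] * 6000 + ["win11"] * 3000 + ["other"] * 1000
synthetic = synthesize(private, ["win10", "win11", "other"], 1.0, 10000, rng)

# Any number of queries can now run against `synthetic` at no extra privacy cost.
share_win10 = synthetic.count("win10") / len(synthetic)
print(f"win10 share in synthetic data: {share_win10:.3f}")
```

Only the noisy histogram ever touches the private records, so everything done with the synthetic set afterwards is ordinary data processing and needs no further budget tracking.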
“If we have open-source tools, we’re going to be able to accelerate the adoption of differential privacy, because we’ll make it easier for people to use it, but also because we’ll make it easier for people to create things that other people can use, and advance the state of the art that way,” Bird said. Some users are small organisations that want to work at even higher scales than the amount of data collected as Windows telemetry, so Microsoft has done more work optimising the algorithms to run efficiently. “It’s very grounding and helping us really figure out what it’s going to take to make this technology really work.”
Even with SmartNoise, which reduces the amount of expertise and development work required, organisations still need a lot of data science expertise to choose the algorithm and settings (especially figuring out the right epsilon value for a data set).
If what you’re trying to do is similar to a way that differential privacy has already been used, Bird suggested that teams with data scientists and developers would be able to use the toolkit successfully on their own. Others reach out to the SmartNoise team on GitHub, which has led to a more formal early adoption programme where Microsoft is helping organisations like Humana and the Educational Results Partnership build differential privacy into research programmes looking at health and education data. “It’s everything from new startups that want to build around differential privacy to non-profits that want to use this for education,” Bird explained. “Hopefully in about six months we will have several more production use cases of differential privacy in the world.”
Microsoft has also used differential privacy to share US broadband usage data (originally collected for the FCC) with researchers looking at how connectivity has affected access to education during the pandemic.
Differential privacy at Microsoft
Microsoft is now using differential privacy in Office, and at LinkedIn, where it’s used for advertiser queries.
The new feature in Outlook that suggests replies to emails you receive is built using differential privacy, so none of the suggestions can include personal information. “You don’t want it revealing long-tail answers that it’s learned, like autocompleting ‘my social security number is’,” Bird explained. “Differential privacy protects you from learning those individual answers.” (Differential privacy is used elsewhere in Office, but Microsoft hasn’t started talking about those other uses yet.)
The manager dashboard in Workplace Analytics needs to give managers information about how their team is working, but not reveal details about specific people. “You want a manager to be able to look at the health and productivity and success of the team, but not learn anything about individual employees,” Bird said.
Differential privacy is particularly successful where there’s a fixed set of known queries or known analyses that can be optimised in a differentially private way.
The LinkedIn advertiser queries are ‘top k’ queries, looking for the most frequent results. “They’re all essentially the same structure,” Bird explained. “In Windows telemetry, it’s the same type of data and analysis coming over and over and over and over again. Work done once is heavily reused. For operational analytics like telemetry, you’re allowing more people to leverage data with privacy guarantees. In machine learning, [it’s useful] where it’s worth the effort to spend longer training the model or more carefully featurise, to have that privacy guarantee.”
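A 'top k' query can be made differentially private in several ways, and the article does not say which one LinkedIn uses. The sketch below shows one textbook approach, sometimes called peeling: pick the noisy maximum, remove it, and repeat k times, with the budget split evenly across the rounds. It assumes each member contributes to one item, so each count has sensitivity 1, and that the list of candidate items is fixed in advance. All names and numbers are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_k(counts, k, epsilon, rng):
    """Select k items by repeating 'report noisy max' k times.

    Each round spends epsilon / k, so the Laplace scale per round is
    k / epsilon. Only the winning item names are released, not the counts.
    """
    if k > len(counts):
        raise ValueError("k cannot exceed the number of candidate items")
    remaining = dict(counts)
    scale = k / epsilon
    winners = []
    for _ in range(k):
        # Fresh noise on every remaining item in every round.
        noisy = {item: count + laplace_noise(scale, rng) for item, count in remaining.items()}
        best = max(noisy, key=noisy.get)
        winners.append(best)
        del remaining[best]
    return winners


# Hypothetical counts of how often members list each job title.
rng = random.Random(3)
title_counts = {"engineer": 10000, "manager": 8000, "analyst": 6000,
                "falconer": 50, "lamplighter": 10}
top_three = noisy_top_k(title_counts, k=3, epsilon=1.0, rng=rng)
print(top_three)
```

Because every query of this kind has the same shape, the choice of mechanism and noise scale only has to be worked out once, which matches Bird's point that work done once is heavily reused.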
Similarly, generating synthetic data with differential privacy is most useful if you know the questions you want to ask the data, so you can generate data that successfully answers those questions and preserves those properties in the original data set. “If you’re going to release this dataset and you have no idea of the kind of questions researchers are going to ask the data, it’s very difficult to guarantee that the synthetic data is going to uphold the true properties,” Bird noted.
Eventually, Bird hopes that differential privacy will extend to allowing researchers to make dynamic queries against data sets “to advance the state of the art for society but not reveal private information.” That’s the most challenging scenario, however.
“You need to be able to optimise the queries automatically and find the right point in the trade-off space between accuracy and privacy and computational efficiency. Then you also need dynamic budget tracking governance around who gets how much of what budget, and do you actually retire the data set?” she said.
“That’s the vision where we very much want to go — and in practice, we’re succeeding at pieces of that. That’s all the more reason to encourage more people to be using the technology now, because we need a lot of people working on it to help advance the state to a point where we can get to that ultimate vision.”
Microsoft customers who don’t have the data science expertise to work with the SmartNoise toolkit will eventually see differential privacy as a data-processing option in platforms like Power BI and Azure Data Share, Bird suggested. Instead of simply sharing a view of a database, you could share a differentially private view or allow differential privacy queries, or get differentially private results from Power BI analytics.
There’s still more work to be done on how to implement that, she said: “We need to know, when you’re generating dashboards in Power BI, here’s the queries, here’s the parameters that work for most cases or here’s how you adjust them. We’re not quite there yet, but that’s the future I see where this actually gets used at scale.”