Document title
Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter ϵ as low as 0.5. The relationship between the disclosure risk and ϵ is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
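The first method described in the abstract above (noise added to the full cross-tabulation, followed by a multinomial sample) can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the synthpop implementation, which is written in R. The function names `laplace_noise` and `dp_synthesise`, the choice of Laplace noise with scale 1/ϵ per cell, and the clipping of negative noisy counts to zero are all assumptions made for illustration; the paper's exact mechanism may differ.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthesise(records, levels, epsilon, n_synth, seed=0):
    """Hypothetical helper: noisy cross-tabulation plus multinomial sampling.

    records : list of tuples, one categorical value per variable
    levels  : list of tuples giving the possible categories of each variable
    epsilon : privacy parameter (smaller means more noise)
    n_synth : number of synthetic records to draw
    """
    rng = random.Random(seed)
    # Every cell of the full cross-tabulation, including empty ones.
    cells = list(itertools.product(*levels))
    counts = Counter(records)
    # Assumption: Laplace noise with scale 1/epsilon on each cell count,
    # with negative noisy counts clipped to zero.
    noisy = [max(counts[cell] + laplace_noise(1.0 / epsilon, rng), 0.0)
             for cell in cells]
    # Fall back to uniform weights if every noisy count was clipped to zero.
    weights = noisy if sum(noisy) > 0 else [1.0] * len(cells)
    # Drawing n_synth cells with these weights is a multinomial sample
    # from the normalised noisy table.
    return rng.choices(cells, weights=weights, k=n_synth)


# Toy data: two categorical variables, 100 records.
levels = [("female", "male"), ("low", "mid", "high")]
records = ([("female", "low")] * 40 + [("male", "high")] * 35
           + [("female", "mid")] * 25)
synthetic = dp_synthesise(records, levels, epsilon=0.5, n_synth=100, seed=1)
print(Counter(synthetic).most_common(3))
```

Because noise is added to every cell, including cells that are empty in the original data, the number of cells grows quickly with the number of variables. That is consistent with the abstract's finding that this method did not give synthetic data of adequate quality.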
synthpop: Bespoke creation of synthetic data in R
Nowok, Raab, Dibben
Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11
synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control
This is a shorter version of the paper above that might be an easier starting point for someone new to this area. The same caveats about its referring to an older version of the package apply to this.
Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland
synthpop: Bespoke creation of synthetic data in R
Raab, Nowok, Dibben
Available as a package vignette on the CRAN website
This is a slightly amended version of the paper from the Journal of Statistical Software.
Practical data synthesis for large samples
Raab, Nowok, Dibben
Journal of Privacy and Confidentiality, 7(3):67-97
This paper gives a brief description of the motivation for developing synthpop, but it also includes theoretical work that allows inferences from fully synthetic data to be carried out with much less effort than the previous literature had suggested. In particular, the new methods do not require multiple synthetic data sets to be produced for making inferences to populations, thus reducing disclosure risk.
Inference from fitted models in synthpop
Raab, Nowok
Preprint currently available as a package vignette on the CRAN website
Describes how the methods for inference from synthetic data, including those in the paper above, and those proposed by others, are implemented in synthpop.
Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R
Nowok, Raab, Dibben
Statistical Journal of the IAOS, 33(3):785-796; DOI: 10.3233/SJI-150153
Describes how synthpop is used in the Scottish Longitudinal Study, and presents an example of the analysis of survey data that is available as part of the synthpop package.
General and specific utility measures for synthetic data
Snoke, Raab, Nowok, Dibben, Slavkovic
Journal of the Royal Statistical Society: Series A; DOI: 10.1111/rssa.12358
Derives a general utility measure that is available in synthpop as the function utility.gen() and illustrates its use on examples. When the published version is online (soon we hope) there will be a link to sample code that can be used.
Guidelines for producing useful synthetic data
Gives practical advice on how to create synthetic data and also introduces a utility measure for comparing tables between synthesised data and the original. This is implemented in synthpop as
Raab, Nowok, Dibben
Earlier version presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016.
Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team
Report 2015-2, Cathie Marsh Centre for Census and Survey Research (CCSR)
Introduction to synthetic data produced with the synthpop package
Internal report
An internal report summarising the methods used in synthpop and focussing, particularly, on issues of disclosure control.
Utility of synthetic microdata generated using tree-based methods
Compares different tree-based methods, including bagging and random forests, as methods for modelling conditional distributions in synthpop.
A slightly amended version of the paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland, 5-7 October 2015
Paper presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016
Recognising real people in synthetic microdata: risk mitigation and impact on utility
Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Skopje, North Macedonia, 20-22 September 2017
Putting synthetic people in place: creating synthetic data for spatial analysis at the individual level
Nowok, Dibben
Technical report for the QCumber-EnvHealth project
Assessing, visualizing and improving the utility of synthetic data
Raab, Nowok,
Paper presented at the Joint UNECE/Eurostat expert meeting on statistical data confidentiality; Poznań, Poland, 1-3 December 2021. Available as a package vignette on the CRAN website
A review and comparison of utility measures available in the synthpop package
If you are looking for a specific presentation which is not listed below, please get in touch and we will send it to you.
Presentation title
Synthetic data in Scotland and beyond: lessons learned and future directions
Cathie Marsh Institute for Social Research (CMI) Afternoon Seminar, Manchester
Facilitating access to administrative records with synthetic data
Dealing with Data 2017 Conference, University of Edinburgh
Synthetic data in practice: software, applications and challenges
Royal Statistical Society (RSS) 2017 Conference,
Course details
Learning to create useful synthetic data
Date: 6th September 2022
Place: Edinburgh, workshop at the IDPLN conference
Presenters: G Raab & B Nowok, supported by L Adair
Generating synthetic data with the synthpop package for R
Date: 20 June 2018
Place: Belfast, International Conference for Administrative Data Research
Presenters: Nowok & Raab
Session 1: Introducing data synthesis and synthpop
A brief overview of the history of proposals for synthetic data generation and how these have been used in practice. In particular, how synthetic data sets are being made available to users of the Scottish Longitudinal Study. A brief introduction to synthpop and a simple example of data synthesis.
Session 2: Using synthpop
Details of the various functionalities of the synthpop package for R. Real data examples showing how to run default and customized synthesis and how to evaluate the quality of synthetic data by visualisation, formal utility measures and comparisons of the results of analyses based on the original observed data and on their synthesised versions. Some practical advice on synthesising problematic variables.
