Document title
Authors
Year
Source
Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter ϵ as low as 0.5. The relationship between the disclosure risk and ϵ is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
Raab
2022
synthpop: Bespoke creation of synthetic data in R
This is a shorter version of the paper above that might be an easier starting point for someone new to this area. The same caveats about its referring to an older version of the package apply to this.
Nowok, Raab, Dibben
2016
Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11
synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control
This is a shorter version of the paper above that might be an easier starting point for someone new to this area. The same caveats about its referring to an older version of the package apply to this.
Nowok
2015
Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland
synthpop: Bespoke creation of synthetic data in R
Raab, Nowok, Dibben
2016
Available as a package vignette on the CRAN website
This is a slightly amended version of the paper from the Journal of Statistical Software.
Practical data synthesis for large samples
Raab, Nowok, Dibben
2017
Journal of Privacy and Confidentiality, 7(3):67-97
This paper gives a brief description of the motivation for developing synthpop, but it also includes theoretical work which allows inferences from fully synthetic data to be carried out with much less effort than the previous literature had suggested. In particular, the new methods do not require multiple synthetic data sets to be produced for making inferences to populations, thus reducing disclosure risk.
Inference from fitted models in synthpop
Raab, Nowok
2017
Preprint currently available as a package vignette on the CRAN website
Describes how the methods for inference from synthetic data, including those in the paper above, and those proposed by others, are implemented in synthpop.
Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R
Nowok, Raab, Dibben
2017
Statistical Journal of the IAOS, 33(3):785-796; DOI: 10.3233/SJI-150153
Describes how synthpop is used in the Scottish Longitudinal Study, and presents an example of the analysis of survey data that is available as part of the synthpop package.
General and specific utility measures for synthetic data
Snoke, Raab, Nowok, Dibben, Slavkovic
2018
Journal of the Royal Statistical Society: Series A; DOI: 10.1111/rssa.12358
Derives general and specific utility measures and illustrates their use on examples. The general utility measure is available in synthpop as the function utility.gen(). When the published version is online (soon, we hope) there will be a link to sample code that can be used.
Guidelines for producing useful synthetic data
Gives practical advice on how to create synthetic data and also introduces a utility measure for comparing tables between synthesised data and the original. This is implemented in synthpop as utility.tab().
Raab, Nowok, Dibben
2016
Earlier version presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016.
Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team
Elliot
2014
Report 2015-2, Cathie Marsh Centre for Census and Survey Research (CCSR)
Introduction to synthetic data produced with the synthpop package
Raab
2018
Internal report
An internal report summarising the methods used in synthpop, focusing particularly on issues of disclosure control.
Utility of synthetic microdata generated using tree-based methods
Compares different tree-based methods, including bagging and random forests, for modelling conditional distributions in synthpop.
A slightly amended version of the paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland, 5-7 October 2015
Nowok
2015
Paper presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016
Recognising real people in synthetic microdata: risk mitigation and impact on utility
Nowok
2017
Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Skopje, North Macedonia, 20-22 September 2017
Putting synthetic people in place: creating synthetic data for spatial analysis at the individual level
Nowok, Dibben
2017
Technical report for the QCumber-EnvHealth project
Assessing, visualizing and improving the utility of synthetic data
Raab, Nowok, Dibben
2021
Paper presented at the Joint UNECE/Eurostat expert meeting on statistical data confidentiality; Poznań, Poland, 1-3 December 2021. Available as a package vignette on the CRAN website
A review and comparison of utility measures available in the synthpop package
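The first method described in the abstract of "Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data" (noise added to a cross-tabulation of all the variables, followed by a multinomial sample from the resulting probabilities) can be illustrated with a minimal Python sketch. This is not the synthpop implementation. The function names, the choice of Laplace noise with scale 1/ϵ (which assumes neighbouring data sets differ by adding or removing one record), the clipping of negative noisy counts to zero, and the fallback to uniform weights are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_crosstab_synth(records, levels, epsilon, n_synth, seed=None):
    """Hypothetical helper: noisy full cross-tabulation, then a multinomial sample.

    records : list of tuples, one categorical value per variable
    levels  : list of tuples giving every possible category of each variable
    epsilon : privacy parameter; smaller values mean more noise
    n_synth : number of synthetic records to draw
    """
    rng = random.Random(seed)

    # Every cell of the cross-tabulation receives noise, including empty cells.
    cells = list(product(*levels))
    counts = Counter(records)

    # Assumption: Laplace noise with scale 1/epsilon on each cell count.
    scale = 1.0 / epsilon
    noisy = [max(counts[cell] + laplace_noise(scale, rng), 0.0) for cell in cells]

    # Assumption: if every noisy count was clipped to zero, sample uniformly.
    if sum(noisy) == 0.0:
        noisy = [1.0] * len(cells)

    # Sampling cells with replacement, weighted by the noisy counts,
    # is a multinomial sample from the normalised probabilities.
    return rng.choices(cells, weights=noisy, k=n_synth)


levels = [("no", "yes"), ("low", "mid", "high")]
original = [("no", "low")] * 30 + [("yes", "high")] * 20
synthetic = dp_crosstab_synth(original, levels, epsilon=1.0, n_synth=len(original), seed=1)
```

Because noise is spread over every cell of the full table, the number of noisy cells grows quickly with the number of variables, which is consistent with the abstract's finding that this method gave poor utility.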
If you are looking for a specific presentation which is not listed below, please get in touch and we will send it to you.
Presentation title
Presenter
Date
Event
Synthetic data in Scotland and beyond: lessons learned and future directions
Nowok
2018/06/12
Cathie Marsh Institute for Social Research (CMI) Afternoon Seminar, Manchester
Facilitating access to administrative records with synthetic data
Raab
2017/11/22
Dealing with Data 2017 Conference, University of Edinburgh
Synthetic data in practice: software, applications and challenges
Nowok
2017/09/07
Royal Statistical Society (RSS) 2017 Conference, Glasgow
Course details
Files
Learning to create useful synthetic data
Date: 6th September 2022
Place: Edinburgh, workshop at the IDPLN conference
Presenters: G Raab & B Nowok, supported by L Adair
Generating synthetic data with the synthpop package for R
Date: 20 June 2018
Place: Belfast, International Conference for Administrative Data Research
Presenters: Nowok & Raab
Session 1: Introducing data synthesis and synthpop
A brief overview of the history of proposals for synthetic data generation and how these have been used in practice, in particular how synthetic data sets are being made available to users of the Scottish Longitudinal Study. A brief introduction to synthpop and a simple example of data synthesis.
Session 2: Using synthpop
Details of the various functionalities of the synthpop package for R. Real data examples showing how to run default and customized synthesis and how to evaluate the quality of synthetic data by visualisation, formal utility measures, and comparisons of analysis results based on the original observed data and their synthesised versions. Some practical advice on synthesising problematic variables.
Document title
Authors, year, source
Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data
Raab, 2022, PSD 2022, Paris
synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control
Nowok, 2015, Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland, 5-7 October 2015
Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team
Elliot, 2014, Report 2015-2, Cathie Marsh Centre for Census and Survey Research (CCSR)
Tested data produced by synthpop for disclosure risk.