Publications

Document title

Authors

Year

Source

Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data

show/hide description
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter ϵ as low as 0.5. The relationship between the disclosure risk and ϵ is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.

get document

Raab

2022

Raab, Gillian M. Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data. Lecture Notes in Computer Science. vol. 13463. Cham: Springer International Publishing, 2022. 250--265.
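The first method described in the entry above (noise added to the full cross-tabulation, followed by a multinomial sample) can be illustrated with a short sketch. This is not the synthpop implementation, which is written in R. It assumes Laplace noise, a sensitivity of 1 (one record added or removed changes one cell count by 1), and clipping of negative noisy counts to zero. The names `laplace` and `dp_synthesise` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_synthesise(records, levels, epsilon, n_synth, seed=0):
    """Sketch of DP synthesis from a noisy cross-tabulation.

    records: list of tuples, one categorical value per variable
    levels:  list of the allowed categories for each variable
    epsilon: differential privacy parameter (assumed > 0)
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)

    # Full cross-tabulation, including cells with a zero count.
    counts = Counter(records)
    cells = list(product(*levels))

    # Assumption: sensitivity 1, so the Laplace scale is 1 / epsilon.
    # Negative noisy counts are clipped to zero (post-processing).
    noisy = [max(counts[c] + laplace(1 / epsilon, rng), 0.0) for c in cells]

    # If every noisy count was clipped to zero, fall back to uniform weights.
    if sum(noisy) == 0:
        noisy = [1.0] * len(cells)

    # Multinomial sample: draw cells with probability proportional to
    # the noisy counts.
    return rng.choices(cells, weights=noisy, k=n_synth)


if __name__ == "__main__":
    levels = [("a", "b"), ("x", "y", "z")]
    records = [("a", "x")] * 50 + [("b", "z")] * 30
    synthetic = dp_synthesise(records, levels, epsilon=0.5, n_synth=10)
    print(synthetic)
```

With many variables the number of cells grows multiplicatively while most true counts stay at zero, so the noise can dominate the table. This is consistent with the paper's finding that the method did not give data of adequate quality.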

synthpop: Bespoke creation of synthetic data in R

get document

R code

Nowok, Raab, Dibben

2016

Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11

synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control

show/hide description
This is a shorter version of the paper above that might be an easier starting point for someone new to this area. The same caveats about its referring to an older version of the package apply to this.

get document

R code

Nowok

2015

Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland

synthpop: Bespoke creation of synthetic data in R

Raab, Nowok, Dibben

2016

Available as a package vignette on the CRAN website

show/hide description
This is a slightly amended version of the paper from the Journal of Statistical Software.

get document

Practical data synthesis for large samples

Raab, Nowok, Dibben

2017

Journal of Privacy and Confidentiality, 7(3):67-97

show/hide description
This paper gives a brief description of the motivation for developing synthpop, but it also includes theoretical work which allows inferences from fully synthetic data to be carried out with much less effort than the previous literature had suggested. In particular, the new methods do not require multiple synthetic data sets to be produced for making inferences to populations, thus reducing disclosure risk.

get document

Inference from fitted models in synthpop

Raab, Nowok

2017

Preprint currently available as a package vignette on the CRAN website

show/hide description
Describes how the methods for inference from synthetic data, including those in the paper above, and those proposed by others, are implemented in synthpop.

get document

Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R

Nowok, Raab, Dibben

2017

Statistical Journal of the IAOS, 33(3):785-796; DOI: 10.3233/SJI-150153

show/hide description
Describes how synthpop is used in the Scottish Longitudinal Study, and presents an example of the analysis of survey data that is available as part of the synthpop package.

get document

General and specific utility measures for synthetic data

Snoke, Raab, Nowok, Dibben, Slavkovic

2018

Journal of the Royal Statistical Society: Series A; DOI: 10.1111/rssa.12358

show/hide description
Derives general utility measures, available in synthpop through the function utility.gen(), and illustrates their use on examples. When the published version is online (soon we hope) there will be a link to sample code that can be used.

get document

Guidelines for producing useful synthetic data

show/hide description
Gives practical advice on how to create synthetic data and also introduces a utility measure for comparing tables between synthesised data and the original. This is implemented in synthpop as utility.tab().

get document

Raab, Nowok, Dibben

2016

Earlier version presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016.

Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team

Elliot

2014

Report 2015-2, Cathie Marsh Centre for Census and Survey Research (CCSR)

show/hide description
Tested data produced by synthpop for disclosure risk.

get document

Introduction to synthetic data produced with the synthpop package

Raab

2018

Internal report

show/hide description
An internal report summarising the methods used in synthpop and focussing, particularly, on issues of disclosure control.

get document

Utility of synthetic microdata generated using tree-based methods

show/hide description
Compares different tree-based methods, including bagging and random forests, for modelling conditional distributions in synthpop.
A slightly amended version of the paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Helsinki, Finland, 5-7 October 2015

get document

Nowok

2015

Paper presented at the Privacy in Statistical Databases Conference 2016; Dubrovnik, Croatia, 14-16 September 2016

Recognising real people in synthetic microdata: risk mitigation and impact on utility

show/hide description
...

get document

Nowok

2017

Paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality; Skopje, North Macedonia, 20-22 September 2017

Putting synthetic people in place: creating synthetic data for spatial analysis at the individual level

Nowok, Dibben

2017

Technical report for the QCumber-EnvHealth project

show/hide description
...

get document

Assessing, visualizing and improving the utility of synthetic data

Raab, Nowok, Dibben

2021

Paper presented at the Joint UNECE/Eurostat expert meeting on statistical data confidentiality; Poznań, Poland, 1-3 December 2021. Available as a package vignette on the CRAN website

show/hide description
A review and comparison of utility measures available in the synthpop package

get document

Presentations (selected)

If you are looking for a specific presentation which is not listed below, please get in touch and we will send it to you.

Presentation title

Presenter

Date

Event

Synthetic data in Scotland and beyond: lessons learned and future directions

pdf

Nowok

2018/06/12

Cathie Marsh Institute for Social Research (CMI) Afternoon Seminar, Manchester

Facilitating access to administrative records with synthetic data

pdf

Raab

2017/11/22

Dealing with Data 2017 Conference, University of Edinburgh

Synthetic data in practice: software, applications and challenges

pdf

Nowok

2017/09/07

Royal Statistical Society (RSS) 2017 Conference, Glasgow

Courses

Course details

Files

Learning to create useful synthetic data

Date: 6th September 2022

Place: Edinburgh, workshop at the IDPLN conference

Presenters: G Raab & B Nowok, supported by L Adair

Presentations, data sets and instructions for practicals can be accessed here

Generating synthetic data with the synthpop package for R

Date: 20 June 2018

Place: Belfast, International Conference for Administrative Data Research

Presenters: Nowok & Raab

Notes with overview and instructions for practicals

Session 1: Introducing data synthesis and synthpop

A brief overview of the history of proposals for synthetic data generation and how these have been used in practice. In particular, how synthetic data sets are being made available to users of the Scottish Longitudinal Study. A brief introduction to synthpop and a simple example of data synthesis.

Presentation 1: Introduction and background

Presentation 2: Introduction to synthpop

Practical 1: Sample code

Session 2: Using synthpop

Details of the various functionalities of the synthpop package for R. Real data examples showing how to run default and customized synthesis and how to evaluate quality of synthetic data by visualisation, formal utility measures and comparisons of results of analysis based on original observed data and their synthesised version. Some practical advice on synthesising problematic variables.

Presentation 1: Going over first practical

Presentation 2: Synthesising larger datasets

Practical 2: Sample code

I-CeM data

I-CeM codebook

I-CeM sample code
