The synthpop package makes a synthetic version of individual-level data. If you want to have a quick look at synthpop in action then please use our Shiny app. If you want to start running synthesis in R and you are new to R, you will need to install R and learn the basics of R. Then you should read your data into R as an R data frame, as explained below, and carry out some exploratory analyses to understand their structure.
As far as possible your R data frame should consist of the variables that might be used by an analyst who will be producing data summaries, such as tables or fits to statistical models from the data. Each variable in the data frame should have an appropriate data type (e.g. numeric, factor). The synthpop package provides some tools to help you check this (see below).
In order to run synthpop, you must have R installed on your computer. If you need to install R, you can download it from the CRAN website. If you have never used R before you should access some of the resources for getting started with R.
In addition to R, you might also want to install RStudio. It is an integrated development environment (IDE) for R that will make your experience with R much more enjoyable. If you want to install RStudio, get it from the RStudio download page.
Open R or RStudio and install synthpop by typing the following code into the console:
install.packages("synthpop")
You will only need to do this once. It will install synthpop and all the other packages it uses from the CRAN website.
If you are in a secure setting, without internet access, you will need to download the appropriate zip files from CRAN, import them into your secure environment and install packages from those local zip files.
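For the offline case, a minimal sketch is given here. It assumes a Windows machine, and the folder name `C:/transfer/cran_zips` is hypothetical: replace it with wherever you imported the zip files for synthpop and the packages it depends on. It relies only on the base R function `install.packages()` with `repos = NULL`, which installs from local files instead of CRAN.

```r
# Hypothetical folder holding the zip files copied into the secure environment.
pkg_dir <- "C:/transfer/cran_zips"

# Collect every zip file in that folder (synthpop plus its dependencies).
zips <- list.files(pkg_dir, pattern = "\\.zip$", full.names = TRUE)

# Install from the local files; repos = NULL stops R from contacting CRAN.
# type = "win.binary" assumes Windows binary packages.
if (length(zips) > 0) {
  install.packages(zips, repos = NULL, type = "win.binary")
}
```

With `repos = NULL`, R does not fetch missing dependencies for you, so every package that synthpop uses needs its own zip file in the folder.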
To start using the package you will need to load it using the library() function:
library(synthpop)
You will have to repeat this step every time you open R or RStudio and want to run synthpop.
You can get a list of all the synthpop functions using the command
help(package = synthpop)
To quickly access a help file for a specific function, e.g. the main synthpop function syn(), you can type its name preceded by a question mark:
?syn
You will be working with your own data, but to help you we have provided an R script that uses the data SD2011 that is supplied as part of the synthpop package. Get the sample R script here.
Read the data you want to synthesise into R, if it is not there already. You can use the synthpop function read.obs() to read it in from other formats (check the help file).
We strongly advise you to start creating synthetic data from an example with only a modest number of variables (say between 8 and 12 variables) so you can understand synthpop. If your data have more variables than this then make a selection. The synthpop package is intended for large data sets. We do not recommend using it for data sets with fewer than around 500 observations because a small data set will not provide enough information about relationships between many variables.
Now examine your data. Perhaps check the first or last few lines with head() and tail() and any other R functions you know. You can also use the synthpop function codebook.syn() to examine the features that will be relevant to synthesising.
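As an illustration of selecting a modest set of variables and examining them, here is a short sketch using the SD2011 data supplied with synthpop. The choice of eight variables is only an example, and the `tab` component of the `codebook.syn()` result is used as described in its help file; check `?codebook.syn` for the version you have installed.

```r
library(synthpop)

# Pick a modest set of variables from the SD2011 data supplied with synthpop.
# This particular selection is illustrative only.
vars <- c("sex", "age", "edu", "marital", "income", "ls", "smoke", "wkabint")
mydata <- SD2011[, vars]

dim(mydata)    # number of observations and number of variables
head(mydata)   # first few rows
tail(mydata)   # last few rows

# Summarise class, missing values and distinct values for each variable.
cb <- codebook.syn(mydata)
cb$tab
```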
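Use the output to do the following things to make your data ready to be synthesised:
Convert any character variables to factors and rerun codebook.syn() after this. The syn() function will do this conversion for you but it is better that you do it first.
Note any missing values that are not coded as the R missing value NA. For example the value -9 often signifies missing data for positive items like income. These can be identified to the syn() function via the cont.na parameter.
Note any rules that variables should obey. These can be passed to the syn() function via the rules and rvalues parameters, and the syn() function will warn you if the rule is not obeyed in the observed data.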
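The type conversion and the search for numeric missing-value codes can be sketched in base R as follows. The toy data frame, its column names and the code -9 are all hypothetical stand-ins for your own data.

```r
# Toy data frame standing in for your own data; names and values are hypothetical.
mydata <- data.frame(
  sex    = c("F", "M", "F", "F", "M", "M"),
  age    = c(34, 51, 17, 46, 29, 62),
  income = c(1200, -9, 800, 2500, -9, 1500),
  stringsAsFactors = FALSE
)

# Convert character columns to factors before calling syn().
is_chr <- vapply(mydata, is.character, logical(1))
mydata[is_chr] <- lapply(mydata[is_chr], factor)

# Find numeric columns that contain the missing-value code -9,
# so they can be declared to syn() rather than treated as real values.
has_code <- vapply(
  mydata,
  function(x) is.numeric(x) && any(x == -9, na.rm = TRUE),
  logical(1)
)
names(mydata)[has_code]
```

Columns flagged this way are candidates for the cont.na argument of syn(), for example `cont.na = list(income = -9)` for this toy data.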
If your data have more than 12 variables or if you have any factors with a large number of levels (say more than 20) you should create a smaller and simpler data frame that will be easier to synthesise for your first attempt. Omit or recode factors with many levels and select fewer variables. It would be a good idea to select a set of variables you might be interested in analysing.
You are now ready to do your first synthesis, e.g.
mysyn <- syn(mydata, cont.na = list(income = -8))
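For a version you can run without your own data, the sketch below applies the same call to a selection of SD2011 variables. The variable selection and the seed value are arbitrary, and the rule that anyone under 18 is SINGLE is only an illustration of the rules and rvalues arguments, not a recommendation for your data.

```r
library(synthpop)

# Illustrative selection of variables from the SD2011 data supplied with synthpop.
vars <- c("sex", "age", "edu", "marital", "income", "ls", "smoke", "wkabint")
mydata <- SD2011[, vars]

mysyn <- syn(
  mydata,
  cont.na = list(income = -8),           # treat -8 as a missing-data code for income
  rules   = list(marital = "age < 18"),  # illustrative rule: anyone under 18 ...
  rvalues = list(marital = "SINGLE"),    # ... is given the value SINGLE
  seed    = 2024                         # arbitrary seed so the run can be repeated
)
```

Setting seed makes the synthesis reproducible, which helps when you are comparing the effect of different settings.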
You have created a synthetic data object mysyn of class synds. See the help file for syn() for details. To get an overview, use the summary() function for a synds object, e.g.
summary(mysyn)
You will see a list of variables with their synthesis methods in the order in which they were synthesised. As you used the default values for most syn() parameters you will see that the data have been synthesised in the order in the data frame and all except the first method used "cart" (classification and regression trees).
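The methods and the synthesis order can also be read directly from the synthetic data object. This sketch uses four SD2011 variables (an arbitrary choice, to keep the run short) and assumes the method and visit.sequence components documented in the syn() help file.

```r
library(synthpop)

# Small illustrative synthesis with default settings.
mydata <- SD2011[, c("sex", "age", "edu", "marital")]
mysyn <- syn(mydata, seed = 2024, print.flag = FALSE)

# Synthesis method used for each variable.
mysyn$method

# Order in which the variables were synthesised.
mysyn$visit.sequence
```

With the defaults, the first variable is drawn by random sampling from the observed values, which is why its method differs from the rest.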
To do an initial comparison of the original and synthetic data as tables and histograms use the compare() function, e.g.
compare(mysyn, mydata, stat = "counts")
We hope that these will indicate similar distributions for the original and synthetic data.
If you want to export your synthetic data to analyse in other programs you can use the synthpop write.syn() function, e.g.
write.syn(mysyn, file = "mysyn.sav", filetype = "SPSS")
Now that you have managed your first synthesis, you could read our paper in the Journal of Statistical Software and explore the other resources on our website, which explain different features of synthpop in more detail.