% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/createDat.R, R/recordSwap.R
\name{createDat}
\alias{createDat}
\alias{recordSwap}
\alias{recordSwap.sdcMicroObj}
\alias{recordSwap.default}
\title{Dummy Dataset for Record Swapping}
\usage{
createDat(N = 10000)

recordSwap(data, ...)

\method{recordSwap}{sdcMicroObj}(data, ...)

\method{recordSwap}{default}(
  data,
  hid,
  hierarchy,
  similar,
  swaprate = 0.05,
  risk = NULL,
  risk_threshold = 0,
  k_anonymity = 3,
  risk_variables = NULL,
  carry_along = NULL,
  return_swapped_id = FALSE,
  log_file_name = "TRS_logfile.txt",
  seed = NULL,
  ...
)
}
\arguments{
\item{N}{integer, number of households to generate}

\item{data}{must be either a micro data set in the form of a
`data.table` or `data.frame`, or an `sdcMicroObj`, see
\link[sdcMicro]{createSdcObj}.}

\item{...}{parameters passed to `recordSwap.default()`}

\item{hid}{column index or column name in `data` which refers
to the household identifier.}

\item{hierarchy}{column indices or column names of variables in
`data` which refer to the geographic hierarchy in the micro data
set. For instance county > municipality > district.}

\item{similar}{vector or list of integer vectors or column names
containing similarity profiles, see details for more explanations.}

\item{swaprate}{double between 0 and 1 defining the proportion of
households which should be swapped, see details for more explanations.}

\item{risk}{either column indices or column names in `data`, or a
`data.table`, `data.frame` or `matrix` indicating the risk of each record
at each hierarchy level. If a `risk` matrix is supplied, the swapping
procedure will not use the k-anonymity rule but the values found in this
matrix for swapping.
ATTENTION: This is NOT fully implemented yet and is currently ignored by the
underlying C++ functions until tested properly.}

\item{risk_threshold}{single numeric value indicating when a household is
considered "high risk", i.e. when this household must be swapped. It is only
used when `risk` is not `NULL`.
ATTENTION: This is NOT fully implemented yet and is currently ignored by the
underlying C++ functions until tested properly.}

\item{k_anonymity}{integer defining the threshold for high risk households
(counts < k) when the k-anonymity rule is used.}

\item{risk_variables}{column indices or column names of variables in `data`
which will be considered for estimating the risk. Only used when the
k-anonymity rule is applied.}

\item{carry_along}{column indices or column names of additional variables to
swap besides the hierarchy variables. These variables do not interfere with
the procedure of finding a record to swap with or calculating risk. This
parameter is only used at the end of the procedure when swapping the
hierarchies.}

\item{return_swapped_id}{boolean, if `TRUE` the output includes an
additional column showing the `hid` with which a record was swapped.
The new column will have the name `paste0(hid,"_swapped")`.}

\item{log_file_name}{character, path for writing a log file. The log
file contains a list of household IDs (`hid`) which could not be
swapped and is only created if any such households exist.}

\item{seed}{integer defining the seed for the random number generator, for
reproducibility. If `NULL`, a random seed will be set using `sample(1e5,1)`.}
}
\value{
`data.table` containing dummy data

`data.table` with swapped records.
}
\description{
`createDat()` returns dummy data to illustrate
targeted record swapping. The generated data contain
household ids (`hid`), geographic variables
(`nuts1`, `nuts2`, `nuts3`, `lau2`) as well as some
other household or personal variables.

`recordSwap()` applies targeted record swapping to micro data, considering
the identification risk of each record as well as the geographic topology.
}
\details{
The procedure accepts a `data.frame` or `data.table`
containing all necessary information for the record swapping, e.g. the
columns referenced by the parameters `hid`, `similar`, `hierarchy`, etc.
First, the micro data in `data` is ordered by `hid` and the identification
risk is calculated for each record in each hierarchy level. As of right
now, only counts are used as identification risk and the inverse of the
counts is used as sampling probability.
NOTE: It will be possible to supply an identification risk for each record
and hierarchy level which will be passed down to the C++ function. This
is, however, not fully implemented yet.

With the parameter `k_anonymity` a k-anonymity rule is applied to define
risky households in each hierarchy level. A household is considered risky
if counts < k_anonymity in any hierarchy level, and such a household needs
to be swapped across this hierarchy level.
For instance, given a geographic hierarchy of NUTS1 > NUTS2 > NUTS3, the
counts are calculated for each geographic variable and the defined
`risk_variables`. If the count for a record falls below `k_anonymity`
at the NUTS2 level, then this record needs to be swapped across NUTS2 regions.
Setting `k_anonymity = 0` disables this feature and no risky households
are defined.

After that, the targeted record swapping is applied starting from the highest
to the lowest hierarchy level and cycling through all possible geographic
areas at each hierarchy level, e.g. every county, every municipality in
every county, etc.

At each geographic area a set of records to be swapped is created. In all
but the lowest hierarchy level this set is made up ONLY of records which
do not fulfill k-anonymity and have not already been swapped. Those
records are swapped with records not belonging to the same geographic
area, which have not already been swapped beforehand.
Swapping refers to the interchange of geographic variables defined in
`hierarchy`. When a record is swapped, all other records containing the
same `hid` are swapped as well.

At the lowest hierarchy level in every geographic area the set of records to
be swapped is made up of all records which do not fulfill k-anonymity
as well as the remaining number of records such that the proportion of
swapped records of the geographic area is consistent with the `swaprate`.
If, due to the k-anonymity condition, more records have already been swapped
in this geographic area, then only the records which do not fulfill
k-anonymity are swapped.

Using the parameter `similar` one can define similarity profiles.
`similar` needs to be a list of vectors with each list entry containing
column indices or column names of `data`. These entries are used when
searching for donor households, meaning that for a specific record the set
of all donor records is made up of records which have the same values in
`similar[[1]]`. It is important to note that these variables
can only be variables related to households (not persons!). If no suitable
donor can be found, the next similarity profile, `similar[[2]]`, is used and
the set of all donors is then made up of all records which have the
same values in the columns in `similar[[2]]`. This procedure
continues until a donor record is found or all the similarity profiles
have been used.

`swaprate` sets the proportion of households to be swapped, where a single
swap counts as swapping 2 households, the sampled household and the
corresponding donor. Prior to the procedure the swaprate is applied on
the lowest hierarchy level to determine the target number of swapped
households in each of the lowest hierarchies. If the target numbers are not
integers they will randomly be rounded up or down such that the
number of households swapped in total is consistent with the swaprate.
}
\examples{
# generate 10000 dummy households
library(data.table)
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05 # 5\%
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"
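
# Illustration only, not the internal implementation: a rough view of the
# count-based risk described in the details. Records are counted per
# combination of one hierarchy level (here nuts1) and `risk_variables`.
# The cut-off of 3 mirrors the default `k_anonymity`; the unit that the
# C++ code counts may differ from this sketch.
counts <- dat[, .N, by = c("nuts1", risk_variables)]
counts[N < 3]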

# apply record swapping
dat_s <- recordSwap(
  data = dat,
  hid = hid,
  hierarchy = hier,
  similar = similar,
  swaprate = swaprate,
  k_anonymity = k_anonymity,
  risk_variables = risk_variables,
  carry_along = NULL,
  return_swapped_id = TRUE,
  seed = seed
)

# number of swapped households
dat_s[hid != hid_swapped, uniqueN(hid)]
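
# Sketch: compare the realised share of swapped households with `swaprate`.
# The share may exceed `swaprate` if the k-anonymity rule forces
# additional swaps.
n_swapped <- dat_s[hid != hid_swapped, uniqueN(hid)]
n_total <- dat_s[, uniqueN(hid)]
n_swapped / n_total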

# nuts3 and lau2 are not part of `hier`, so they are not swapped and
# the hierarchies are not consistently swapped
dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]

# use parameter carry_along to swap nuts3 and lau2 together with the hierarchy
dat_s <- recordSwap(
  data = dat,
  hid = hid,
  hierarchy = hier,
  similar = similar,
  swaprate = swaprate,
  k_anonymity = k_anonymity,
  risk_variables = risk_variables,
  carry_along = c("nuts3", "lau2"),
  return_swapped_id = TRUE,
  seed = seed
)

dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
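
# Sketch of several similarity profiles: donors are first searched among
# households with the same `hsize` and `htype`; if none is found, the
# second profile (`hsize` only) is used.
# Assumption: the dummy data contain a household-level column `htype`;
# the call is skipped if that column does not exist.
if (is.element("htype", names(dat))) {
  dat_s2 <- recordSwap(
    data = dat,
    hid = hid,
    hierarchy = hier,
    similar = list(c("hsize", "htype"), "hsize"),
    swaprate = swaprate,
    k_anonymity = k_anonymity,
    risk_variables = risk_variables,
    return_swapped_id = TRUE,
    seed = seed
  )
  dat_s2[hid != hid_swapped, uniqueN(hid)]
}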

}
\seealso{
recordSwap
}
\author{
Johannes Gussenbauer
}
