Computation of the Transition Probabilities
# note:
# all parameters except for maximum noise D and variance V have default values
ptab <- create_cnt_ptable(D = 2, V = 1)
The parameters that must be specified are:
D: the maximum noise/perturbation (a positive integer), and
V: the noise or perturbation variance (a positive number; it need not be an integer, as the later example with V = 0.9 shows).
The result of the above function is an object of class “ptable” which
contains the following slots:
str(ptab)
#> Formal class 'ptable' [package "ptable"] with 8 slots
#> ..@ tMatrix : num [1:3, 1:5] 1 0.3665 0.0638 0 0.3665 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:3] "0" "1" "2"
#> .. .. ..$ : chr [1:5] "0" "1" "2" "3" ...
#> ..@ pClasses : num [1:3] 0 1 2
#> ..@ pTable :Classes 'data.table' and 'data.frame': 10 obs. of 7 variables:
#> .. ..$ i : num [1:10] 0 1 1 1 1 2 2 2 2 2
#> .. ..$ j : num [1:10] 0 0 1 2 3 0 1 2 3 4
#> .. ..$ p : num [1:10] 1 0.3665 0.3665 0.1676 0.0995 ...
#> .. ..$ v : num [1:10] 0 -1 0 1 2 -2 -1 0 1 2
#> .. ..$ p_int_lb: num [1:10] 0 0 0.366 0.733 0.901 ...
#> .. ..$ p_int_ub: num [1:10] 1 0.366 0.733 0.901 1 ...
#> .. ..$ type : chr [1:10] "all" "all" "all" "all" ...
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "intervals")= chr "default"
#> ..@ empResults:Classes 'data.table' and 'data.frame': 3 obs. of 6 variables:
#> .. ..$ i : num [1:3] 0 1 2
#> .. ..$ p_mean: num [1:3] 0 0 0
#> .. ..$ p_var : num [1:3] 0 0.932 1
#> .. ..$ p_sum : num [1:3] 1 1 1
#> .. ..$ p_stay: num [1:3] 1 0.366 0.383
#> .. ..$ iter : int [1:3] 0 20 1
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr "i"
#> ..@ pParams :Formal class 'ptable_params' [package "ptable"] with 12 slots
#> .. .. ..@ D : int 2
#> .. .. ..@ V : num 1
#> .. .. ..@ js : int 0
#> .. .. ..@ ncat : int 2
#> .. .. ..@ pstay: num [1:2] NA NA
#> .. .. ..@ optim: int [1:2] 1 1
#> .. .. ..@ mono : logi [1:2] TRUE TRUE
#> .. .. ..@ table: chr "cnts"
#> .. .. ..@ icat : int [1:2] 1 2
#> .. .. ..@ step : int 1
#> .. .. ..@ type : chr "all"
#> .. .. ..@ label: chr "D2V100"
#> ..@ tStamp : chr "20230301164637"
#> ..@ type : chr "all"
#> ..@ table : chr "cnts"
The most important slots of the object are:
tMatrix: A transition matrix that describes the perturbation probabilities from one state (original frequency count) to another (target frequency count).
pTable: A data table needed for the lookup step of an SDC tool that can apply random noise to statistical tables (e.g. the cellKey package or the software Tau-Argus). The following sections explain how the tables can be used or exported with auxiliary functions.
pParams: The input parameters that result from the preceding function pt_create_pParams().
empResults: A data frame for output checking of the constraints.
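The lookup step can be illustrated without any SDC tool. The following is a minimal base-R sketch, not the cellKey or Tau-Argus implementation: it rebuilds cumulative intervals analogous to p_int_lb and p_int_ub for the original count i = 1 from the probabilities in the example output, and maps a uniform random number u to a noise value v. The helper name lookup_noise, the use of u as a stand-in for a cell key, and the half-open interval convention [lb, ub) are assumptions made for illustration.

```r
# Probabilities and noise values for the original count i = 1
# (copied from the example output in this document).
p <- c(0.36648551, 0.3664855, 0.1675725, 0.09945652)
v <- c(-1, 0, 1, 2)

# Cumulative intervals, analogous to the columns p_int_lb and p_int_ub.
# Dividing by sum(p) absorbs the rounding of the printed probabilities.
p_int_ub <- cumsum(p) / sum(p)
p_int_lb <- c(0, head(p_int_ub, -1))

# Hypothetical helper: return the noise whose interval [lb, ub) contains u.
lookup_noise <- function(u) {
  stopifnot(u >= 0, u < 1)
  v[findInterval(u, p_int_lb)]
}

lookup_noise(0.10)  # u lies in the first interval:  noise -1
lookup_noise(0.50)  # u lies in the second interval: noise  0
lookup_noise(0.95)  # u lies in the last interval:   noise +2
```

Because the interval widths equal the transition probabilities, a uniformly distributed u selects each noise value with exactly the probability given in the table.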
The Transition Matrix
Let’s have a look at the transition matrix (i.e. at the slot @tMatrix) of the object ptab:
# note: to look at a specific slot, just name the object and add the
# corresponding slot with a leading "@"
ptab@tMatrix
#> 0 1 2 3 4
#> 0 1.00000000 0.0000000 0.0000000 0.00000000 0.00000000
#> 1 0.36648551 0.3664855 0.1675725 0.09945652 0.00000000
#> 2 0.06382714 0.2446915 0.3829628 0.24469145 0.06382714
Each row of the transition matrix represents the noise distribution
for an original frequency count. The probability that an original
frequency count of 1 becomes a 3 (i.e. the 1 is perturbed with a noise
value of +2) is 0.0994565.
diag(ptab@tMatrix)
#> 0 1 2
#> 1.0000000 0.3664855 0.3829628
The main diagonal shows the preservation probabilities. These are the
probabilities that the original frequencies remain unchanged. In this
instance, the probability that an original frequency 2 remains unchanged
is 0.3829628.
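To make the reading of a row concrete, here is a small base-R sketch that does not require the ptable package. It copies the printed matrix into a plain matrix and converts one row into a table of target counts j, noise values v = j - i, and probabilities p. The helper name noise_dist is hypothetical and only serves this illustration.

```r
# Transition matrix copied from the printed output of ptab@tMatrix.
tm <- matrix(
  c(1.00000000, 0.0000000, 0.0000000, 0.00000000, 0.00000000,
    0.36648551, 0.3664855, 0.1675725, 0.09945652, 0.00000000,
    0.06382714, 0.2446915, 0.3829628, 0.24469145, 0.06382714),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(0:2), as.character(0:4))
)

# Hypothetical helper: noise distribution of an original count i.
# Only target counts with a positive probability are returned.
noise_dist <- function(tm, i) {
  j <- as.integer(colnames(tm))
  p <- tm[as.character(i), ]
  data.frame(j = j, v = j - i, p = unname(p))[p > 0, ]
}

noise_dist(tm, 1)  # target counts 0 to 3, i.e. noise values -1 to +2
tm["1", "3"]       # probability that a 1 becomes a 3 (noise +2)
diag(tm)           # preservation probabilities for i = 0, 1, 2
```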
Symmetry - and what does it mean in the context of perturbation tables?
As you may have recognized, the transition matrix has a finite number
of rows (that represent original frequency counts) and columns (that
represent the target frequency counts).
# let's have a look at the number of different original positive frequency
# counts that will be treated
params <- slot(ptab, "pParams")
params@ncat
#> [1] 2
# adding 1 for the zero count gives the number of rows of the transition matrix
params@ncat + 1
#> [1] 3
The number depends on both the maximum perturbation D and the threshold value js. The last row of the transition matrix is a symmetric distribution. This distribution is applied to every original frequency larger than the frequency of that last row.
# the object @pClasses shows all original frequencies
# that have their own distribution
ptab@pClasses
#> [1] 0 1 2
# symmetry is achieved for the original frequency count i=...
max(ptab@pClasses)
#> [1] 2
# or
ptab@pClasses[params@ncat + 1]
#> [1] 2
In the given example, each frequency count larger than 2 will be
perturbed according to the distribution for i=2. For example, for
i=3 the distribution reads as follows:
#> 1 2 3 4 5
#> 0.06382714 0.24469145 0.38296282 0.24469145 0.06382714
or in case of i=12326
#> 12324 12325 12326 12327 12328
#> 0.06382714 0.24469145 0.38296282 0.24469145 0.06382714
Given this symmetry, the transition matrix can be displayed in reduced
form: no rows are needed beyond the first row whose distribution is
symmetric.
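The reduced form can be expanded on demand. The following base-R sketch, which again works on a copy of the printed matrix rather than on the ptab object, returns the distribution for any original count i by reusing the last row and shifting the target counts. The helper name dist_for is hypothetical.

```r
# Transition matrix copied from the printed output of ptab@tMatrix.
tm <- matrix(
  c(1.00000000, 0.0000000, 0.0000000, 0.00000000, 0.00000000,
    0.36648551, 0.3664855, 0.1675725, 0.09945652, 0.00000000,
    0.06382714, 0.2446915, 0.3829628, 0.24469145, 0.06382714),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(0:2), as.character(0:4))
)

# Hypothetical helper: distribution of target counts for any original count i.
# Counts above the last row reuse that row, shifted by the difference.
dist_for <- function(tm, i) {
  i_max <- max(as.integer(rownames(tm)))
  i_row <- min(i, i_max)
  p <- tm[as.character(i_row), ]
  j <- as.integer(colnames(tm)) + as.integer(i - i_row)
  keep <- p > 0
  setNames(unname(p[keep]), j[keep])
}

dist_for(tm, 3)      # target counts 1 to 5
dist_for(tm, 12326)  # target counts 12324 to 12328, same probabilities
```

The shift only changes the labels of the target counts; the noise values and their probabilities are identical for every i at or above the symmetric row.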
Output Checking and Troubleshooting
Next, we will check the empirical results that can be used for
troubleshooting:
ptab@empResults
#> i p_mean p_var p_sum p_stay iter
#> 1: 0 0 0.00000 1 1.00000 0
#> 2: 1 0 0.93188 1 0.36649 20
#> 3: 2 0 1.00000 1 0.38296 1
The table has the following columns:
i: the original frequency to which the remaining columns refer
p_mean: the bias of the perturbation of an original frequency (should be zero: unbiasedness)
p_var: the noise variance of an original frequency (Note: If p_var differs from the chosen input parameter V, as it does in the example above, the fixed variance condition is violated. In that case, we have to set a different variance parameter or change other parameters!)
p_sum: the sum of the transition probabilities for an original frequency count (must equal 1)
p_stay: the preservation probability, i.e. the corresponding entry on the diagonal of the transition matrix
iter: any value other than 1 points to discrepancies, e.g. a violation of the fixed variance parameter (in the example, the unperturbed zero count shows 0)
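Most of these columns can be recomputed directly from the transition matrix, which is a useful cross-check. The sketch below uses base R and a copy of the printed matrix; the helper name check_row is hypothetical, and the column iter is not reproduced because it is not derivable from the matrix.

```r
# Transition matrix copied from the printed output of ptab@tMatrix.
tm <- matrix(
  c(1.00000000, 0.0000000, 0.0000000, 0.00000000, 0.00000000,
    0.36648551, 0.3664855, 0.1675725, 0.09945652, 0.00000000,
    0.06382714, 0.2446915, 0.3829628, 0.24469145, 0.06382714),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(0:2), as.character(0:4))
)

# Hypothetical helper: recompute the check columns for one original count i.
check_row <- function(tm, i) {
  j <- as.integer(colnames(tm))
  p <- tm[as.character(i), ]
  v <- j - i                          # noise values
  p_mean <- sum(v * p)                # bias, should be 0
  data.frame(
    i      = i,
    p_mean = p_mean,
    p_var  = sum((v - p_mean)^2 * p), # noise variance
    p_sum  = sum(p),                  # should be 1
    p_stay = unname(p[as.character(i)])
  )
}

checks <- do.call(rbind, lapply(0:2, check_row, tm = tm))
round(checks, 5)  # compare with ptab@empResults
```

Small deviations in the last digits are expected, because the printed matrix is rounded.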
As can be seen in the example, the preset variance of V=1 does not hold for the original frequency i=1. To fulfill the condition of a fixed variance, we re-run the computation with a different variance parameter. Let’s try a variance parameter V=0.9:
ptab_new <- create_cnt_ptable(D = 2, V = 0.9)
ptab_new@empResults
#> i p_mean p_var p_sum p_stay iter
#> 1: 0 0 0.0 1 1.00000 0
#> 2: 1 0 0.9 1 0.36250 1
#> 3: 2 0 0.9 1 0.40928 1
The new computation with the updated variance parameter results in a
perturbation table that fulfills all conditions. Of course, the
resulting transition matrix now differs from the first one:
ptab_new@tMatrix
#> 0 1 2 3 4
#> 0 1.00000000 0.0000000 0.0000000 0.0000000 0.00000000
#> 1 0.36250000 0.3625000 0.1875000 0.0875000 0.00000000
#> 2 0.05154609 0.2438156 0.4092765 0.2438156 0.05154609