fabricatr provides convenient helper functions to generate discrete random variables far more easily than using R’s built-in data generation mechanisms. Below we introduce you to the types of data you can generate using fabricatr.
The simplest possible type of data is a binary random variable (also called a bernoulli random variable). Generating a binary random variable requires only one parameter prob
which specifies the probability that outcomes drawn from this variable are equal to 1. By default, draw_binary()
will generate N = length(prob)
draws. N
can also be specified explicitly. Consider these examples:
draw_binary_ex <- fabricate(
N = 3, p = c(0, .5, 1),
binary_1 = draw_binary(prob = p),
binary_2 = draw_binary(N = 3, prob = 0.5)
)
In addition to binary variables, you can make data from repeated Bernoulli trials (“binomial” data). This requires using the draw_binomial()
function and specifying an argument trials
, equal to the number of trials.
binomial_ex <- fabricate(
N = 3,
freethrows = draw_binomial(N = N, prob = 0.5, trials = 10)
)
Some researchers may be interested in specifying probabilities through a “link function”. This can be done in any of your data generating functions through the link
argument. The default link function is “identity”, but we also support “logit”, and “probit”. These link functions transform continuous and unbounded latent data into probabilities of a positive outcome.
bernoulli_probit <- fabricate(
N = 3, x = 10 * rnorm(N),
binary = draw_binary(prob = x, link = "probit")
)
Some researchers may be interested in generating ordered outcomes – for example, Likert scale outcomes. You can do this with the draw_ordered()
function. Ordered variables require a vector of breakpoints, supplied as the argument breaks
– points at which the underlying latent variable switches from category to category. The first break should always be below the lower bound of the data, while the final break should always be above the upper bound of the data – if breaks do not cover the data, draw_ordered()
will attempt to correct this by adding breaks where appropriate.
In the following example, each of three observations has a latent variable x
which is continuous and unbounded. The variable ordered
transforms x
into three numeric categories: 1, 2, and 3. All values of x
below -1 result in ordered
1; all values of x
between -1 and 1 result in ordered
2; all values of x
above 1 result in ordered
3:
ordered_example <- fabricate(
N = 3,
x = 5 * rnorm(N),
ordered = draw_ordered(x, breaks = c(-Inf, -1, 1, Inf))
)
Ordered data also supports link functions including “logit” or “probit”:
ordered_probit_example <- fabricate(
N = 3,
x = 5 * rnorm(N),
ordered = draw_ordered(
x, breaks = c(-Inf, -1, 1, Inf),
link = "probit"
)
)
Likert variables are a special case of ordered variables. Users can use draw_ordered()
with properly specified breaks and break labels to generate Likert data, or use the draw_likert()
function as a convenient alias:
survey_data <- fabricate(
N = 100,
Q1 = draw_likert(x = rnorm(N)),
Q2 = draw_likert(x = rnorm(N)),
Q3 = draw_likert(x = rnorm(N))
)
draw_likert()
takes one compulsory argument (x
, which represents the latent variable being transformed into ordered data). By default, draw_likert()
provides a 7-item Likert scale with breaks at [-\(\infty\), -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, \(\infty\)]. Users can explicitly specify the type
argument to use other types of Likert data. Supported types are 4, 5, and 7. Default breaks for 5-item Likert scales are [-\(\infty\), -1.5, -0.5, 0.5, 1.5, \(\infty\)]. Default breaks for 4-item Likert scales are [-\(\infty\), -1, 0, 1, \(\infty\)].
Optionally, users can specify their own breaks. These will override the type
command and scale type will be detected based on the length of the break
argument. As above, a break
argument with 8 values will produce a 7-item Likert scale, one with 6 values will produce a 5-item Likert scale, and one with 5 values will produce a 4-item Likert scale.
Labels are automatically provided by draw_likert()
. The default 7-item Likert scale uses the labels [“Strongly Disagree”, “Disagree”, “Lean Disagree”, “Don’t Know / Neutral”, “Lean Agree”, “Agree”, “Strongly Agree”].
Examples of how users might use the function are available below:
survey_data <- fabricate(
N = 100,
Q1 = draw_likert(x = rnorm(N), type = 7),
Q2 = draw_likert(x = rnorm(N), type = 5),
Q3 = draw_likert(x = rnorm(N), type = 4),
Q4 = draw_likert(x = rnorm(N), breaks = c(-Inf, -0.8, 0, 1, 2, Inf))
)
table(survey_data$Q2)
Strongly Disagree | Disagree | Don’t Know / Neutral | Agree | Strongly Agree |
---|---|---|---|---|
9 | 18 | 45 | 19 | 9 |
This function is a convenient, quick alias for creating likert variables with these labels. Users who want more flexibility with respect to break labels or number of breaks should use draw_ordered()
and specify breaks and break labels explicitly.
draw_count()
allows you to create Poisson-distributed count outcomes. These require that the user specify the parameter mean
, equal to the Poisson distribution mean (often referred to as lambda
in statistical formulations of count data).
count_outcome_example = fabricate(N = 3,
x = c(0, 5, 100),
count = draw_count(mean = x))
draw_categorical()
can generate non-ordered, categorical data. Users must provide a vector of probabilities for each category (or a matrix, if each observation should have separate probabilities).
If probabilities do not sum to exactly one, they will be normalized, but negative probabilities will cause an error.
In the first example, each unit has a different set of probabilities and the probabilities are provided as a matrix:
categorical_example <- fabricate(
N = 6,
p1 = runif(N, 0, 1),
p2 = runif(N, 0, 1),
p3 = runif(N, 0, 1),
cat = draw_categorical(N = N, prob = cbind(p1, p2, p3))
)
In the second example, each unit has the same probability of getting a given category. draw_categorical()
will issue a warning to remind you that it is interpreting the vector in this way.
warn_draw_cat_example <- fabricate(
N = 6,
cat = draw_categorical(N = N, prob = c(0.2, 0.4, 0.4))
)
## Warning in draw_categorical(N = N, prob = c(0.2, 0.4, 0.4)): For a
## categorical (multinomial) distribution, a matrix of probabilities should
## be provided. The data below is generated by interpreting the vector of
## category probabilities you provided as identical for each observation.
“categorical” variables can also use link functions, for example to generate multinomial probit data.
If you are interested in reading more about how to generate specific variables with fabricatr, you can read our tutorial on common social science variables, or learn how to use other data-generating packages with fabricatr.
If you are interested in learning how to import or build data, you can read our introduction to building and importing data. More advanced users can read our tutorial on generating panel or cross-classified data. You can also learn about bootstrapping and resampling hierarchical data.
When generating binary data with a fixed ICC, we use this formula, where \(i\) is a cluster and \(j\) is a unit in a cluster:
\[ \begin{aligned} z_i &\sim \text{Bern}(p_i) \\ u_{ij} &\sim \text{Bern}(\sqrt{\rho}) \\ x_{ij} &= \begin{cases} x_{ij} \sim \text{Bern}(p_i) & \quad \text{if } u_{ij} = 1 \\ z_i & \quad \text{if } u_{ij} = 0 \end{cases} \end{aligned} \]
In expectation, this guarantees an intra-cluster correlation of \(\rho\) and a cluster proportion of \(p_i\). This approach derives from Hossain, Akhtar and Chakraborti, Hrishikesh. “ICCBin: Facilitates Clustered Binary Data Generation, and Estimation of Intracluster Correlation Coefficient (ICC) for Binary Data”, available on https://cran.r-project.org/web/packages/ICCbin/index.html or https://github.com/akhtarh/ICCbin
When generating normal data with a fixed ICC, we follow this formula, again with \(i\) as a cluster and \(j\) as a unit in the cluster:
\[ \begin{aligned} \sigma^2_{\alpha i} &= \frac{(\rho * \sigma^2_{\epsilon i})}{(1 - \rho)} \\ \alpha_i &\sim \mathcal{N}(0, \sigma^2_{\alpha i}) \\ \mu_{ij} &\sim \mathcal{N}(\mu_i, \sigma^2_{\epsilon i}) \\ x_{ij} &= \mu_{ij} + \alpha_i \end{aligned} \]
In expectation, this approach guarantees an intra-cluster correlation of \(\rho\), a cluster mean of \(\mu_{i}\), and a cluster-level variance in error terms of \(\sigma^2_{\epsilon i}\). This approach is specified on https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc.