This CRAN task view contains a list of packages that include
methods typically used in official statistics and survey methodology.
Many packages provide functionality for more than one of the topics listed
below. Therefore, this list is not a strict categorization, and packages can be
listed more than once. Data import/export facilities for commonly used statistical software tools
like SPSS, SAS or Stata are mentioned at the end of the task view.
Complex Survey Design: Sampling and Sample Size Calculation
-
Package
sampling
includes many different algorithms (Brewer, Midzuno, pps, systematic, Sampford, balanced
(cluster or stratified) sampling via the cube method, etc.) for
drawing survey samples and calibrating the design weights.
-
R package
surveyplanning
includes tools for sample survey planning, including sample size calculation, estimation of expected precision for the estimates of totals, and calculation of optimal sample size allocation.
-
Package
simFrame
includes a fast (compiled C code) version of
Midzuno sampling.
-
The
pps
package contains functions to select samples using pps
sampling. Stratified simple random sampling is also possible, as is the
computation of joint inclusion probabilities for Sampford's method of pps sampling.
-
Package
stratification
allows univariate stratification of survey
populations with a generalisation of the Lavallee-Hidiroglou method.
-
Package
SamplingStrata
offers an approach for choosing the best
stratification of a sampling frame in a multivariate and multidomain setting,
where the sample sizes in each stratum are determined in order to satisfy accuracy
constraints on target estimates.
To evaluate the distribution of target variables in different strata, information from the sampling frame,
or data from previous rounds of the same survey, may be used.
-
The package
BalancedSampling
selects balanced and spatially balanced probability samples in multi-dimensional spaces with any prescribed inclusion probabilities. It also includes the local pivot method, the cube and local cube method and a few more methods.
-
Package
gridsample
selects PSUs within user-defined strata using gridded population data, given desired numbers of sampled households within each PSU. The population densities used to create PSUs are drawn from rasters.
-
Package
PracTools
contains functions for sample size calculation for survey samples using stratified or clustered one-, two-, and three-stage sample designs as well as functions to compute variance components for multistage designs and sample sizes in two-phase designs.
-
Package
samplesize4surveys
computes the required sample size for estimation of totals, means and proportions under complex sampling designs.
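The idea of optimal sample size allocation mentioned above can be illustrated with a minimal base R sketch. It does not use any of the packages listed; the strata, their sizes N_h, their standard deviations S_h and the total sample size n are hypothetical values chosen for illustration. Neyman allocation assigns n_h proportional to N_h * S_h, and the sketch compares the resulting variance of the stratified mean with the variance under proportional allocation.

```r
# Hypothetical strata: population sizes and standard deviations of the target variable
N_h <- c(urban = 5000, suburban = 3000, rural = 2000)
S_h <- c(urban = 12, suburban = 8, rural = 20)
n   <- 500  # total sample size (assumed)

# Proportional allocation: n_h proportional to N_h
n_prop <- n * N_h / sum(N_h)

# Neyman allocation: n_h proportional to N_h * S_h
n_neyman <- n * N_h * S_h / sum(N_h * S_h)

# Variance of the stratified mean estimator, including the finite population correction
var_strat_mean <- function(n_h) {
  W_h <- N_h / sum(N_h)
  sum(W_h^2 * S_h^2 / n_h * (1 - n_h / N_h))
}

round(n_neyman)
var_strat_mean(n_prop)
var_strat_mean(n_neyman)
```

In practice the stratum standard deviations are unknown and have to be estimated, for example from earlier survey rounds, and packages such as surveyplanning or PracTools may handle details (rounding, minimum stratum sizes, multistage designs) that this sketch ignores.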
Complex Survey Design: Point and Variance Estimation and Model Fitting
-
Package
survey
works with survey samples. It allows specifying a complex survey design (stratified sampling design, cluster sampling, multi-stage sampling and pps
sampling with or without replacement). Once
the given survey design is specified within the function
svydesign(), point and variance estimates can be computed.
The resulting object can be used to estimate (Horvitz-Thompson-) totals, means,
ratios and quantiles for domains or the whole survey sample, and to apply
regression models. Variance estimation for means, totals and ratios can be
done either by Taylor linearization or resampling (BRR, jackknife, bootstrap
or user-defined).
-
The methods from the
survey
package are called from package
srvyr
using the dplyr syntax, i.e., piping, verbs like
group_by
and
summarize, and other dplyr-inspired
syntactic style when calculating summary statistics on survey data.
-
Package
convey
extends package
survey
-- see the topic about indicators below.
-
Package
laeken
provides functions to estimate certain Laeken
indicators (at-risk-of-poverty rate, quintile share ratio, relative median
risk-of-poverty gap, Gini coefficient) including their variance for domains
and strata using a calibrated bootstrap.
-
Package
simFrame
allows comparing (user-defined) point and
variance estimators in a simulation environment. It provides a framework for comparing
different point and variance estimators under different survey designs as
well as different conditions regarding missing values, representative and
non-representative outliers.
-
The
lavaan.survey
package provides a wrapper function for packages
survey
and
lavaan.
It can be used for
fitting structural equation models (SEM) on samples from complex designs. Using
the design object functionality from package
survey, lavaan objects are re-fit
(corrected) with the
lavaan.survey()
function of package
lavaan.survey.
This allows for the incorporation of clustering, stratification, sampling weights,
and finite population corrections into a SEM analysis.
lavaan.survey()
also accommodates
replicate weights and multiply imputed datasets.
-
Package
vardpoor
allows the linearisation of several nonlinear population statistics, variance estimation for sample surveys by the ultimate cluster method, and variance estimation for longitudinal and cross-sectional measures and measures of change for cluster sampling designs with any number of stages.
-
The package
rpms
fits a linear model to survey data in each node obtained by recursively partitioning the data. The algorithm
accounts for one stage of stratification and clustering as well as unequal probabilities of selection.
-
Package
svyPVpack
extends package
survey. This package deals with data
that stem from survey designs and was created to handle data from large-scale
assessments such as PISA and PIAAC.
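The workflow described above for package survey (specify the design once, then compute point and variance estimates from the design object) might look as follows. This is a minimal sketch that assumes the survey package is installed and uses its bundled api example data; the variable names dnum, pw, fpc, api00, enroll and stype belong to that example data set and are not part of this task view.

```r
library(survey)

# Example data shipped with the survey package (one-stage cluster sample of schools)
data(api)

# Specify the design: clusters (dnum), sampling weights (pw), finite population correction (fpc)
dclus1 <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Point estimates with standard errors based on Taylor linearization
m_lin <- svymean(~api00, dclus1)
svytotal(~enroll, dclus1)

# Domain estimates: mean of api00 by school type
svyby(~api00, ~stype, dclus1, svymean)

# Resampling alternative: convert to a replicate-weights (jackknife) design
rclus1 <- as.svrepdesign(dclus1)
m_jk <- svymean(~api00, rclus1)

m_lin
m_jk
```

The two calls to svymean() give the same point estimate; only the way the standard error is obtained differs.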
Complex Survey Design: Calibration
-
Package
survey
allows for post-stratification, generalized
raking/calibration, GREG estimation and trimming of weights.
-
The
calib()
function in package
sampling
allows
calibrating for nonresponse (with response homogeneity groups) for stratified
samples.
-
The
calibWeights()
function in package
laeken
is a
possibly faster (depending on the example) implementation of parts of
calib()
from package
sampling.
-
The
calibSample()
function in package
simPop
is potentially faster than the two previously mentioned functions and is more user-friendly.
calibVars()
can be used to construct a matrix of binary variables for calibration.
calibPop()
is used to calibrate population data on persons within households using a simulated annealing approach.
-
Package
icarus
focuses on calibration and reweighting in survey sampling and was designed to provide a familiar setting in R for users of the SAS macro
Calmar
.
-
Package
reweight
allows for calibration of survey weights for
categorical survey data so that the marginal distributions of certain
variables fit more closely to those from a given population, but does not
allow complex sampling designs.
-
The package
CalibrateSSB
includes a function to calculate weights and estimates for panel data with non-response.
-
Package
Frames2
allows point and interval estimation in dual frame surveys. When two probability samples (one from each frame) are drawn, the information collected is suitably combined to obtain estimators of the parameter of interest.
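To make the raking idea behind several of the calibration tools above more concrete, here is a minimal base R sketch. It is not the implementation used by any of the listed packages; the sample, the initial weights and the population totals are hypothetical, and the helper rake() is introduced only for illustration. The weights are adjusted alternately to each margin until they stop changing.

```r
# Hypothetical sample with two categorical variables and equal design weights
sex <- c("f", "f", "f", "m", "m", "m")
age <- c("young", "young", "old", "young", "old", "old")
w0  <- rep(10, 6)

# Hypothetical known population totals for each margin (both sum to 60)
pop_sex <- c(f = 35, m = 25)
pop_age <- c(old = 20, young = 40)

# Raking: rescale the weights to match each margin in turn until convergence
rake <- function(w, g1, t1, g2, t2, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    w_old <- w
    s1 <- tapply(w, g1, sum)                       # current weighted totals, margin 1
    w  <- w * as.numeric(t1[g1]) / as.numeric(s1[g1])
    s2 <- tapply(w, g2, sum)                       # current weighted totals, margin 2
    w  <- w * as.numeric(t2[g2]) / as.numeric(s2[g2])
    if (max(abs(w - w_old)) < tol) break
  }
  w
}

w_cal <- rake(w0, sex, pop_sex, age, pop_age)

# Calibrated weights reproduce both sets of population totals
tapply(w_cal, sex, sum)
tapply(w_cal, age, sum)
```

The dedicated functions listed above additionally support bounds on the weight adjustments, other distance functions and complex designs, which this sketch leaves out.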
Editing and Visual Inspection of Microdata
Editing tools:
-
Package
validate
includes rule management and data validation, and package
validatetools
checks and simplifies sets of validation rules.
-
Package
errorlocate
includes error localisation based on the principle of Fellegi and Holt. It supports categorical and/or numeric data and linear equalities, inequalities and conditional rules. The package includes a configurable backend for MIP-based error localization.
-
Package
editrules
converts readable linear (in)equalities into matrix form.
-
Package
deducorrect
depends on package
editrules
and applies deductive correction of simple rounding, typing and
sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled. To determine which values are changed, the Levenshtein metric is applied.
-
The package
rspa
implements functions to minimally
adjust numerical records so they obey (in)equality restrictions.
-
Package
SeleMix
can be used for selective editing of continuous data.
A mixture model (Gaussian contamination model) based on response(s) y and a dependent set of covariates is fit to the data to
quantify the impact of errors on the estimates.
-
Package
rrcovNA
provides robust location and scatter estimation and robust
principal component analysis with high breakdown point for
incomplete data. It is therefore
suitable for finding representative and non-representative outliers.
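The rule-based checking that the editing tools above build on can be sketched in a few lines of base R. This is only an illustration of the principle and not the interface of validate or editrules; the data set and the two edit rules are hypothetical. Each rule is stored as an unevaluated expression and evaluated record by record, so that violations and undecidable cases (missing values) can be counted separately.

```r
# Hypothetical microdata with some implausible records
d <- data.frame(
  age    = c(34, -2, 51, 17),
  income = c(2500, 1800, NA, 900),
  tax    = c(500, 300, 700, 1200)
)

# Hypothetical edit rules, stored as unevaluated R expressions
rules <- list(
  nonneg_age = quote(age >= 0),
  tax_le_inc = quote(tax <= income)
)

# Evaluate every rule for every record: TRUE = passed, FALSE = violated, NA = undecidable
checks <- sapply(rules, function(r) eval(r, d))

violations <- colSums(!checks, na.rm = TRUE)   # records failing each rule
undecided  <- colSums(is.na(checks))           # records where the rule cannot be evaluated

violations
undecided
```

Error localisation in the sense of Fellegi and Holt goes a step further and determines the smallest set of fields to change so that all rules can be satisfied; that step is not shown here.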
Visual tools:
-
Package
VIM
is designed to visualize missing values
using suitable plot methods. It can be used to analyse the structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots in which the
information on missing values
from specified variables is highlighted in selected variables.
It also comes with a graphical user interface.
-
Package
tabplot
provides the tableplot visualization method, which is used to profile or explore large statistical datasets.
Up to a dozen variables are shown column-wise as bar charts (numeric variables) or stacked bar charts (factors).
Key aspects of the analysis with tableplots are the smoothness of a data distribution,
the selective occurrence of missing values, and the distribution of correlated variables.
-
Package
treemap
provides treemaps. A treemap is a space-filling visualization of aggregates of data with
hierarchical structures. Colors can be used to highlight differences between comparable aggregates.
Imputation
A distinction between iterative model-based methods, k-nearest neighbor methods
and miscellaneous methods is made. However, often the criteria for using a
method depend on the scale of the data, which in official statistics are
typically a mixture of continuous, semi-continuous, binary, categorical and
count variables. In addition, measurement errors may corrupt non-robust imputation methods.
Note that only a few imputation methods can deal with mixed types of variables and only a few methods account for robustness issues.
EM-based Imputation Methods:
-
Package
mi
provides iterative EM-based multiple Bayesian
regression imputation of missing values and model checking of the regression
models used. The regression models for each variable can also be
user-defined. The data set may consist of continuous, semi-continuous,
binary, categorical and/or count variables.
-
Package
mice
provides iterative EM-based multiple regression
imputation. The data set may consist of continuous, binary, categorical
and/or count variables.
-
Package
mitools
provides tools to perform analyses and combine
results from multiply-imputed datasets.
-
Package
Amelia
provides multiple imputation where first bootstrap
samples with the same dimensions as the original data are drawn, and then
used for EM-based imputation. It is also possible to impute longitudinal
data. The package in addition comes with a graphical user interface.
-
Package
VIM
provides EM-based multiple imputation (function
irmi()) using robust estimation, which allows it to
deal adequately with data including outliers. It can handle data consisting of
continuous, semi-continuous, binary, categorical and/or count variables.
-
Single imputation methods are included or called from other packages by the package
simputation. It supports regression (standard, M-estimation, ridge/lasso/elasticnet), hot-deck methods (powered by VIM), randomForest, EM-based, and iterative randomForest imputation.
-
Package
mix
provides iterative EM-based multiple regression
imputation. The data set may consist of continuous, binary or categorical
variables, but methods for semi-continuous variables are missing.
-
Package
pan
provides multiple imputation for multivariate panel or
clustered data.
-
Package
norm
provides EM-based multiple imputation for
multivariate normal data.
-
Package
cat
provides EM-based multiple imputation for multivariate
categorical data.
-
Package
MImix
provides tools to combine results for
multiply-imputed data using mixture approximations.
-
Package
robCompositions
provides iterative model-based imputation
for compositional data (function
impCoda()).
-
Package
missForest
uses the functionality of the randomForest package to impute missing values in an iterative single-imputation fashion. It can deal with almost any kind of variable except semi-continuous ones. Although the underlying bootstrap approach of random forests means that multiple runs yield multiple imputations, the additional uncertainty of imputation is only considered when choosing the random forest method of package
mice.
Nearest Neighbor Imputation Methods
-
Package
VIM
provides an implementation of the popular
sequential and random (within a domain) hot-deck algorithm.
-
VIM
also provides a fast k-nearest neighbor (knn) algorithm which can be used for large data sets.
It uses a modification of the Gower Distance for numerical, categorical, ordered, continuous and semi-continuous variables.
-
Package
yaImpute
performs popular nearest neighbor routines for
imputation of continuous variables where different metrics and methods can be
used for determining the distance between observations.
-
Package
robCompositions
provides knn imputation for
compositional data (function
impKNNa()) using the Aitchison
distance and adjustment of the nearest neighbor.
-
Package
rrcovNA
provides an algorithm for (robust) sequential imputation (function
impSeq()
and
impSeqRob())
by minimizing the determinant of the covariance of the augmented data matrix. Its application is limited to continuous data.
-
Package
impute
on Bioconductor provides knn imputation of continuous
variables.
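The random hot-deck within a domain mentioned above can be sketched in base R as follows. This is a simplified illustration and not the algorithm implemented in VIM; the data, the domain variable region and the helper hotdeck_within() are hypothetical. Each missing value is replaced by a randomly drawn observed value (donor) from the same domain.

```r
set.seed(1)

# Hypothetical data: income has missing values, region defines the imputation domains
d <- data.frame(
  region = c("A", "A", "A", "B", "B", "B", "B"),
  income = c(1000, NA, 1200, 2000, 2100, NA, NA)
)

# Random hot-deck: draw donors with replacement within each domain
hotdeck_within <- function(x, domain) {
  for (g in unique(domain)) {
    idx    <- which(domain == g)
    miss   <- idx[is.na(x[idx])]
    donors <- x[idx][!is.na(x[idx])]
    if (length(miss) > 0 && length(donors) > 0) {
      # sample.int() avoids the pitfall of sample() when only one donor is available
      x[miss] <- donors[sample.int(length(donors), length(miss), replace = TRUE)]
    }
  }
  x
}

d$income_imp <- hotdeck_within(d$income, d$region)
d
```

Domains without any donor are left unchanged by this sketch; a production implementation would need a fallback, for example collapsing domains.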
Copula-based Imputation Methods:
-
The S4 class package
CoImp
imputes multivariate missing data by using conditional copula functions. The imputation procedure is semiparametric: the margins are non-parametrically estimated through local likelihood of low-degree polynomials while a range of different parametric models for the copula can be selected by the user. The missing values are imputed by drawing observations from the conditional density functions by means of the Hit or Miss Monte Carlo method. It works either for a matrix of continuous scaled variables or a matrix of discrete distributions.
Miscellaneous Imputation Methods:
-
Package
missMDA
allows imputing incomplete continuous variables
by principal component analysis (PCA) or categorical variables by multiple
correspondence analysis (MCA).
-
Package
mice
(function
mice.impute.pmm()) and
package
Hmisc
(function
aregImpute()) allow
predictive mean matching imputation.
-
Package
VIM
visualizes the structure of missing values
using suitable plot methods. It also comes with a graphical user interface.
Statistical Disclosure Control
Data from statistical agencies and other institutions are, in their raw form,
mostly confidential, and data providers have to ensure confidentiality by
modifying the original data so that no statistical units can be
re-identified while keeping the information loss to a minimum.
-
Package
sdcMicro
can be used for the generation of confidential
(micro)data, i.e. for the generation of public- and scientific-use files.
The package also comes with a graphical user interface.
-
Package
sdcTable
can be used to provide confidential (hierarchical) tabular data. It includes the HITAS and the HYPERCUBE technique and uses linear programming packages (Rglpk and lpSolveAPI) for solving (a large amount of) linear programs.
-
An interface to the package
sdcTable
is provided by package
easySdcTable.
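A basic building block of microdata protection is counting how many records share the same combination of identifying variables. The following base R sketch illustrates this frequency count and a simple k-anonymity check; it is not the interface of sdcMicro, and the data, the choice of key variables and the threshold k are hypothetical.

```r
# Hypothetical microdata with three categorical key (quasi-identifying) variables
d <- data.frame(
  sex       = c("f", "f", "m", "m", "f", "m"),
  age_group = c("20-29", "20-29", "30-39", "30-39", "30-39", "20-29"),
  region    = c("N", "N", "S", "S", "S", "N")
)

# Sample frequency of each record's key combination
fk <- ave(rep(1L, nrow(d)), d$sex, d$age_group, d$region, FUN = length)

# Records violating k-anonymity for an assumed threshold k = 2
k <- 2
d$at_risk <- fk < k
cbind(d, fk)
```

Records flagged as at risk would then be candidates for recoding or local suppression; for survey samples the sampling weights also matter for risk estimation, which this sketch does not cover.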
Seasonal Adjustment and Forecasting
For a more general view on time series methodology we refer to the
TimeSeries
task view. Only very specialized time series packages related to complex surveys are discussed here.
-
Decomposition of time series can be done with the function
decompose(), or in a more advanced way with the function
stl(), both from the base stats package.
Decomposition is also possible with the
StructTS()
function,
which can also be found in the stats package.
-
Many powerful tools can be accessed via packages
x12
and
x12GUI
and package
seasonal.
x12
provides a wrapper function for the X12 binaries, which have to be installed first. It uses
an S4-class interface for batch processing of multiple time series.
x12GUI
provides a graphical user interface for
the X12-ARIMA seasonal adjustment software.
Less functionality, but with support for the SEATS spec, is offered by package
seasonal.
-
Given the large pool of individual forecasts in survey-type forecasting, forecast combination techniques from package
GeomComb
can be useful. It can also handle missing values in the time series.
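The three decomposition functions from the stats package mentioned above can be tried out without installing anything. The sketch below uses the AirPassengers series that ships with R; the choice of a multiplicative decomposition, the log transformation and the periodic seasonal window are assumptions made for this example, not recommendations.

```r
x <- AirPassengers  # monthly series shipped with base R

# Classical decomposition by moving averages (multiplicative seasonality assumed)
dec <- decompose(x, type = "multiplicative")
sa_dec <- x / dec$seasonal            # seasonally adjusted series

# Loess-based decomposition on the log scale with a fixed (periodic) seasonal pattern
fit_stl <- stl(log(x), s.window = "periodic")
sa_stl <- exp(log(x) - fit_stl$time.series[, "seasonal"])

# Basic structural model (level, slope and seasonal components)
fit_bsm <- StructTS(log(x), type = "BSM")

plot(fit_stl)
fit_bsm
```

For production-quality seasonal adjustment with calendar effects and outlier handling, the X-12-ARIMA based packages listed above are the more appropriate tools.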
Statistical Matching and Record Linkage
-
Package
StatMatch
provides functions to perform statistical
matching between two data sources sharing a number of common variables. It
creates a synthetic data set after matching of two data sources via a
likelihood approach or via hot-deck.
-
Package
RecordLinkage
provides functions for linking and
deduplicating data sets.
-
Package
MatchIt
allows nearest neighbor matching, exact matching, optimal matching and full matching amongst
other matching methods. If two data sets have to be matched, the data must come as one data frame including a factor
variable which includes information about the membership of each observation.
-
Package
stringdist
can calculate various string distances based on edits (damerau-levenshtein, hamming, levenshtein, optimal string alignment), qgrams (q-gram, cosine, jaccard distance) or heuristic metrics (jaro, jaro-winkler).
-
Package
XBRL
allows the extraction of business financial information from XBRL Documents.
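Edit-based string distances, as used for linking records on names, can be illustrated with the base R function adist(), which computes a generalized Levenshtein distance. The sketch below uses it as a stand-in for the richer set of metrics in stringdist; the two name lists are hypothetical. Each entry of the first list is matched to its closest entry in the second list.

```r
# Hypothetical name lists from two data sources
a <- c("Jon Smith", "Maria Garcia", "Wei Chen")
b <- c("John Smith", "Maria Garcia", "Wei Chan", "Mario Garcia")

# Matrix of generalized Levenshtein distances (rows: a, columns: b)
d <- adist(a, b)
dimnames(d) <- list(a, b)
d

# For each name in a, pick the closest candidate in b
best <- apply(d, 1, which.min)
data.frame(a = a, match = b[best], distance = d[cbind(seq_along(a), best)])
```

A real linkage would usually combine several fields, block on some of them to reduce the number of comparisons, and apply a threshold or a probabilistic model instead of simply taking the nearest candidate.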
Small Area Estimation
-
Package
rsae
provides functions to estimate the parameters of the basic unit-level
small area estimation (SAE) model (aka nested error regression model)
by means of maximum likelihood (ML) or robust ML. On the basis of the estimated parameters, robust predictions of the area-specific
means are computed (incl. MSE estimates; parametric bootstrap).
The current version (rsae 0.4-x) does not allow for categorical independent variables.
-
Package
nlme
provides facilities to fit Gaussian linear and nonlinear mixed-effects models and
lme4
provides facilities to fit linear and generalized linear mixed-effects models, both used in
small area estimation.
-
The
hbsae
package provides functions to compute small area estimates based on a basic area or unit-level model.
The model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. Auxiliary information can be either
counts resulting from categorical variables or means from continuous population information.
-
With package
JoSAE
point and variance estimation for the generalized regression (GREG) and unit-level
empirical best linear unbiased prediction (EBLUP) estimators can be carried out at domain level. It basically provides wrapper functions to the
nlme
package
that is used to fit the basic random effects models.
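As a rough illustration of the basic unit-level model (nested error regression) referred to above, the following sketch fits a random-intercept model with nlme on simulated data and forms EBLUP-type predictions of area means. Everything in the data-generating step is hypothetical, and the area sample means of x are used as a stand-in for the population means that would be required in a real application. No MSE estimation is attempted.

```r
library(nlme)
set.seed(42)

# Simulated data: 20 areas with 10 sampled units each (hypothetical parameters)
m <- 20
n_i <- 10
area_id <- rep(seq_len(m), each = n_i)
x <- runif(m * n_i, 0, 10)
u <- rnorm(m, sd = 2)                                   # area-level random effects
y <- 5 + 1.5 * x + u[area_id] + rnorm(m * n_i, sd = 3)  # unit-level errors
d <- data.frame(y = y, x = x, area = factor(area_id))

# Nested error regression model: random intercept per area, fitted by REML
fit <- lme(y ~ x, random = ~ 1 | area, data = d, method = "REML")

# Stand-in for known population means of x per area (assumption: sample means)
xbar_pop <- tapply(d$x, d$area, mean)

# EBLUP-type prediction: fixed part at the area mean of x plus predicted area effect
beta  <- fixef(fit)
re    <- ranef(fit)
u_hat <- setNames(re[[1]], rownames(re))[names(xbar_pop)]
eblup <- beta[["(Intercept)"]] + beta[["x"]] * xbar_pop + u_hat

head(eblup)
```

The SAE packages listed above wrap this kind of model fit and add the pieces that matter in practice, in particular MSE estimation and robust variants.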
Indices, Indicators, Tables and Visualisation of Indicators
-
Package
laeken
provides functions to estimate popular
risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile
share ratio, relative median risk-of-poverty gap, Gini coefficient).
In addition, standard and robust methods for tail modeling of Pareto
distributions are provided for semi-parametric estimation of indicators
from continuous univariate distributions such as income variables.
-
Package
convey
estimates variances of indicators of income concentration and poverty, such as the Gini coefficient, Atkinson index, at-risk-of-poverty threshold, and more than a dozen others, using familiar linearized and replication-based designs created by the
survey
package.
-
Package
ineq
computes various inequality measures (Gini, Theil,
entropy, among others), concentration measures (Herfindahl, Rosenbluth), and poverty
measures (Watts, Sen, SST, and Foster). It also computes and draws empirical and theoretical
Lorenz curves as well as Pen's parade. It is not designed to deal with sampling weights directly
(these could only be emulated via
rep(x, weights)).
-
Package
IC2
includes three inequality indices:
extended Gini, Atkinson and Generalized Entropy. It can deal with sampling weights and
subgroup decomposition is supported.
-
Function
priceIndex()
from package
micEconIndex
can be used to
estimate the Paasche, Fisher and Laspeyres price indices. For estimating quantity indices (of goods, for example), function
quantityIndex()
can be used.
-
Package
tmap
offers a layer-based way to make thematic maps, like choropleths and bubble maps.
-
Package
rworldmap
outlines how to map country-referenced data and
supports users in visualising their own data. Examples are given, e.g., maps for the World Bank and UN. It also provides new ways to visualise maps.
-
Package
rrcov3way
provides robust methods for multiway data analysis, applicable also for compositional data.
-
Package
robCompositions
provides methods for compositional tables, including statistical tests.
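The Laspeyres, Paasche and Fisher price indices mentioned above follow directly from their textbook definitions. The base R sketch below computes them by hand for two goods and two periods; the prices and quantities are hypothetical, and the code does not use or reproduce the interface of micEconIndex.

```r
# Hypothetical prices (p) and quantities (q) for two goods in base period 0 and period 1
p0 <- c(bread = 1.0, milk = 2.0)
q0 <- c(bread = 10,  milk = 5)
p1 <- c(bread = 1.5, milk = 2.0)
q1 <- c(bread = 8,   milk = 6)

# Laspeyres: current prices valued at base-period quantities
laspeyres <- sum(p1 * q0) / sum(p0 * q0)

# Paasche: current prices valued at current-period quantities
paasche <- sum(p1 * q1) / sum(p0 * q1)

# Fisher: geometric mean of Laspeyres and Paasche
fisher <- sqrt(laspeyres * paasche)

c(laspeyres = laspeyres, paasche = paasche, fisher = fisher)
```

With these numbers the Laspeyres index is 25/20 = 1.25 and the Paasche index is 24/20 = 1.2, so the Fisher index lies between the two.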
Microsimulation
-
Using package
simPop
one can simulate populations from surveys based on auxiliary data with model-based methods or synthetic reconstruction methods. Hierarchical and cluster structures (such as households) can be considered, and the methods take into account samples collected under complex sample designs. Calibration tools (iterative proportional fitting, iterative proportional updating) and combinatorial optimization tools (simulated annealing) are also available. The code is optimized for fast computations. The package is based on an S4 class implementation. The simulated population can serve as base data for microsimulation studies.
-
The
MicSim
package includes methods for microsimulations. Given an initial population, mortality rates, divorce rates, marriage rates, education changes, etc. and their transition matrix can be defined and included for the simulation of future states of the population. The package does not contain compiled code but functionality to run the microsimulation in parallel is provided.
-
Package
sms
provides facilities to simulate micro-data from given area-based macro-data. Simulated annealing is used to best satisfy the available description of an area.
For computational issues, the calculations can be run in parallel mode.
-
Package
synthpop
uses regression tree methods to simulate synthetic data from given data. It is suitable for producing synthetic data when the data have no hierarchical or cluster structure (such as households) and were not collected with a complex sampling design.
-
Package
saeSim
provides tools for the simulation of data in the context of small area estimation.
Additional Packages and Functionalities
Various additional packages are available that provide certain functionality useful in official statistics and survey methodology.
-
The
questionr
package contains a set of functions to make the processing and analysis of surveys easier. It provides interactive shiny apps and addins for data recoding, contingency tables, dataset metadata handling, and several convenience functions.
Data Import and Export:
-
Package
SAScii
imports ASCII files directly into R using only a SAS input script, which
is parsed and converted into arguments for a read.fwf call. This is useful whenever SAS scripts for importing data
are already available.
-
The
foreign
package includes tools for reading data from SAS Xport (function
read.xport()), Stata (function
read.dta()), SPSS (function
read.spss()) and various other formats. It also provides facilities to write files in various formats; see function
write.foreign().
-
Also the package
haven
imports and exports SAS, Stata and SPSS (function
read_spss()) files. The package is more efficient for loading large data sets, and it handles the labelling of variables and values in an advanced manner.
-
Also the package
Hmisc
provides tools to read data sets from SPSS (function
spss.get()) or Stata (function
stata.get()).
-
The
pxR
package provides a set of functions for reading
and writing PC-Axis files, used by different statistical
organizations around the globe for dissemination of their (multidimensional) tables.
-
With package
prevR
and its function
import.dhs()
it is possible to directly import
data from the Demographic and Health Surveys (DHS).
-
Function
describe()
from package
questionr
describes the variables of a dataset that might include labels imported with the foreign or memisc packages.
-
Package
OECD
searches and extracts data from the OECD.
-
Access to Finnish open government data is provided by package
sorvi.
-
Tools to download data from the Eurostat database together with search and manipulation utilities are included in package
eurostat.
-
Package
acs
downloads, manipulates, and presents the American Community Survey and decennial data from the US Census.
-
Access to data published by INEGI, Mexico's official statistics agency, is supported by package
inegiR.
-
Package
cbsodataR
provides access to Statistics Netherlands' (CBS) open data API.
-
A wrapper for the U.S. Census Bureau APIs that returns data frames of Census data and metadata is implemented in package
censusapi.
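As a small illustration of the import and export facilities described above, the following sketch writes a data frame to a Stata file with write.dta() from package foreign and reads it back with read.dta(). The data frame is hypothetical, and a temporary file is used so that nothing is left behind; the sketch assumes the foreign package, which is normally distributed with R, is available.

```r
library(foreign)

# Hypothetical survey extract with an integer, a numeric and a factor variable
d <- data.frame(
  id     = 1:3,
  income = c(1200.5, 980, 1500),
  region = factor(c("north", "south", "north"))
)

# Write to a temporary Stata file and read it back
f <- tempfile(fileext = ".dta")
write.dta(d, f)
back <- read.dta(f)

str(back)
unlink(f)  # clean up the temporary file
```

Note that write.dta() targets older Stata file versions; for files written by recent Stata releases, package haven is usually the more suitable reader.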
Misc:
-
Package
samplingbook
includes sampling procedures from the book
'Stichproben. Methoden und praktische Umsetzung mit R' by Goeran Kauermann
and Helmut Kuechenhoff (2010).
-
Package
SDaA
is designed to reproduce results from Lohr, S. (1999)
'Sampling: Design and Analysis', Duxbury, and includes the data sets from this
book.
-
The main contributions of
samplingVarEst
are jackknife alternatives for variance estimation
under unequal probability sampling with one- or two-stage designs.
-
Package
TeachingSampling
includes functionality for sampling
designs and parameter estimation in finite populations.
-
Package
memisc
includes tools for the management of survey data,
graphics and simulation.
-
Package
odfWeave.survey
provides support for
odfWeave
for the
survey
package.
-
Package
spsurvey
includes facilities for spatial survey design and
analysis for equal and unequal probability (stratified) sampling.
-
The
FFD
package is designed to calculate optimal sample sizes of a population of animals
living in herds for surveys to substantiate freedom from disease.
The criteria for estimating the sample sizes take the herd-level clustering of
diseases as well as imperfect diagnostic tests into account and select the samples
based on a two-stage design. Inclusion probabilities are not considered in the estimation.
The package provides a graphical user interface as well.
-
mipfp
provides multidimensional iterative proportional fitting to calibrate n-dimensional arrays given target marginal tables.
-
Package
MBHdesign
provides spatially balanced designs from a set of (contiguous) potential sampling locations
in a study region.
-
Package
quantification
provides different functions for quantifying qualitative survey data. It supports the Carlson-Parkin method, the regression approach, the balance approach and the conditional expectations method.
-
BIFIEsurvey
includes tools for survey statistics in educational assessment
including data with replication weights (e.g. from bootstrap).
-
surveybootstrap
includes tools for using different kinds of bootstrap for estimating sampling variation using complex survey data.
-
Package
surveyoutliers
winsorizes values of a variable of interest.
-
The package
univOutl
includes various methods for detecting univariate outliers, e.g. the Hidiroglou-Berthelot method.
-
Package
extremevalues
is designed to detect univariate outliers based on modeling the bulk distribution.
-
Package
RRTCS
includes randomized response techniques for complex surveys.
-
Package
panelaggregation
aggregates business tendency survey data (and other qualitative surveys) to time series at various aggregation levels.
-
Package
surveydata
makes it easy to keep track of metadata from surveys, and to easily extract columns with specific questions.
-
RcmdrPlugin.sampling
includes tools for sampling in official statistical surveys. It includes tools for calculating sample sizes and selecting samples using various sampling designs.
-
Package
mapStats
does automated calculation and visualization of survey data statistics on a color-coded map.