Descriptive statistics are used to summarize data. They enable us to present the data in a more meaningful way and to discern any patterns in it. They can be useful for two purposes:
This document introduces a basic set of functions that describe data. A second vignette covers the functions that help visualize statistical distributions.
We have modified the mtcars data to create a new data set, mtcarz. The only difference between the two data sets is the variable types: in mtcarz, the variables cyl, vs, am, gear and carb are stored as factors.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
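The conversion itself is not shown here. A minimal base R sketch of how such a data set could be built from mtcars follows, assuming the five factor columns in the structure above are the only change. The name factor_cols is illustrative.

```r
# Assumption: mtcarz differs from mtcars only in that these five
# columns are stored as factors, as the structure above suggests.
factor_cols <- c("cyl", "vs", "am", "gear", "carb")

mtcarz <- mtcars
mtcarz[factor_cols] <- lapply(mtcarz[factor_cols], as.factor)

# Inspect the resulting column types.
str(mtcarz)
```

The numeric columns are left untouched, so any analysis of mpg, disp or hp gives the same result on either data set.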
The ds_screener() function screens a data set and returns the following:
- Column/variable names
- Data type
- Levels (in case of categorical data)
- Number of missing observations
- % of missing observations
## -----------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## -----------------------------------------------------------------------
## | mpg | numeric | NA | 0 | 0 |
## | cyl | factor | 4 6 8 | 0 | 0 |
## | disp | numeric | NA | 0 | 0 |
## | hp | numeric | NA | 0 | 0 |
## | drat | numeric | NA | 0 | 0 |
## | wt | numeric | NA | 0 | 0 |
## | qsec | numeric | NA | 0 | 0 |
## | vs | factor | 0 1 | 0 | 0 |
## | am | factor | 0 1 | 0 | 0 |
## | gear | factor | 3 4 5 | 0 | 0 |
## | carb | factor |1 2 3 4 6 8| 0 | 0 |
## -----------------------------------------------------------------------
##
## Overall Missing Values 0
## Percentage of Missing Values 0 %
## Rows with Missing Values 0
## Columns With Missing Values 0
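To make the contents of the screening report concrete, the same information can be collected with base R. This is a rough sketch and not how ds_screener() is implemented; the names dat, fct and screen are illustrative, and dat is assumed to be equivalent to mtcarz.

```r
# Rebuild a factor-typed copy of mtcars (assumed equivalent to mtcarz).
dat <- mtcars
fct <- c("cyl", "vs", "am", "gear", "carb")
dat[fct] <- lapply(dat[fct], as.factor)

# One row per column: type, factor levels (NA otherwise), missing counts.
screen <- data.frame(
  column      = names(dat),
  type        = vapply(dat, function(x) class(x)[1], character(1)),
  levels      = vapply(dat, function(x) {
    if (is.factor(x)) paste(levels(x), collapse = " ") else NA_character_
  }, character(1)),
  missing     = vapply(dat, function(x) sum(is.na(x)), integer(1)),
  missing_pct = vapply(dat, function(x) 100 * mean(is.na(x)), numeric(1)),
  row.names   = NULL
)
print(screen)

# Data-set level counts, corresponding to the footer of the report.
total_missing <- sum(is.na(dat))
rows_missing  <- sum(!complete.cases(dat))
cols_missing  <- sum(colSums(is.na(dat)) > 0)
```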
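The ds_summary_stats() function returns a comprehensive set of statistics for a continuous variable: measures of location, variation and shape, followed by quantiles and the five lowest and highest values. The output below is for mpg.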
## Univariate Analysis
##
## N 32.00 Variance 36.32
## Missing 0.00 Std Deviation 6.03
## Mean 20.09 Range 23.50
## Median 19.20 Interquartile Range 7.38
## Mode 10.40 Uncorrected SS 14042.31
## Trimmed Mean 19.95 Corrected SS 1126.05
## Skewness 0.67 Coeff Variation 30.00
## Kurtosis -0.02 Std Error Mean 1.07
##
## Quantiles
##
## Quantile Value
##
## Max 33.90
## 99% 33.44
## 95% 31.30
## 90% 30.09
## Q3 22.80
## Median 19.20
## Q1 15.43
## 10% 14.34
## 5% 12.00
## 1% 10.40
## Min 10.40
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 15 10.4 20 33.9
## 16 10.4 18 32.4
## 24 13.3 19 30.4
## 7 14.3 28 30.4
## 17 14.7 26 27.3
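Several of the figures in this report can be checked directly in base R. The sketch below recomputes a subset of them for mpg; skewness, kurtosis and the trimmed mean are left out because their exact definitions vary between packages and are not stated here. The name stats is illustrative.

```r
x <- mtcars$mpg
n <- length(x)

stats <- c(
  n              = n,
  mean           = mean(x),
  median         = median(x),
  variance       = var(x),
  std_dev        = sd(x),
  range          = diff(range(x)),
  iqr            = IQR(x),
  uncorrected_ss = sum(x^2),              # sum of squared values
  corrected_ss   = sum((x - mean(x))^2),  # squared deviations from the mean
  coeff_var      = 100 * sd(x) / mean(x),
  std_error_mean = sd(x) / sqrt(n)
)
round(stats, 2)

# Quartiles, using R's default quantile definition (type 7).
quantile(x, probs = c(0.25, 0.5, 0.75))
```

The corrected sum of squares equals the variance multiplied by n - 1, which is a quick consistency check on the two columns of the report.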
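The ds_cross_table() function creates a two-way table of two categorical variables. Each cell reports the frequency, the share of all observations, the share of its row and the share of its column, all expressed as proportions.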
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
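The four numbers in each cell correspond to a count and three different normalizations of it. A base R sketch of the same quantities for cyl and gear, using table() and prop.table() rather than the package function:

```r
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

addmargins(counts)                        # frequencies with row/column totals
round(prop.table(counts), 3)              # share of all observations
round(prop.table(counts, margin = 1), 2)  # share within each row (cyl)
round(prop.table(counts, margin = 2), 2)  # share within each column (gear)
```

Row shares sum to 1 across each row and column shares sum to 1 down each column, which is why the same count can map to quite different proportions.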
A plot method has also been defined for the object returned by ds_cross_table().
The ds_freq_table() function creates frequency tables for categorical variables.
## Variable: cyl
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 4 | 11 | 11 | 34.38 | 34.38 |
## |--------------------------------------------------------------------------|
## | 6 | 7 | 18 | 21.88 | 56.25 |
## |--------------------------------------------------------------------------|
## | 8 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
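The columns of this table are simple running totals of the level counts. A minimal base R sketch for cyl, with illustrative column names:

```r
freq <- as.vector(table(mtcars$cyl))
n    <- sum(freq)

freq_table <- data.frame(
  level       = sort(unique(mtcars$cyl)),
  frequency   = freq,
  cum_freq    = cumsum(freq),
  percent     = round(100 * freq / n, 2),
  cum_percent = round(100 * cumsum(freq) / n, 2)
)
freq_table
```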
The ds_freq_cont() function creates frequency tables for continuous variables by splitting their range into equal-width intervals. The default number of intervals is 5; the output below uses 4.
## Variable: mpg
## |---------------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |---------------------------------------------------------------------------|
## | 10.4 - 16.3 | 10 | 10 | 31.25 | 31.25 |
## |---------------------------------------------------------------------------|
## | 16.3 - 22.1 | 13 | 23 | 40.62 | 71.88 |
## |---------------------------------------------------------------------------|
## | 22.1 - 28 | 5 | 28 | 15.62 | 87.5 |
## |---------------------------------------------------------------------------|
## | 28 - 33.9 | 4 | 32 | 12.5 | 100 |
## |---------------------------------------------------------------------------|
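The binning can be reproduced in base R with seq() and cut(). The sketch below assumes equal-width bins spanning the observed range and uses four of them to match the output above; n_bins and the other names are illustrative, and the package may treat interval boundaries differently.

```r
x <- mtcars$mpg
n_bins <- 4  # assumption: chosen to match the four bins in the output above

# Equal-width break points from the minimum to the maximum.
breaks <- seq(min(x), max(x), length.out = n_bins + 1)
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)

freq <- as.vector(table(bins))
bin_table <- data.frame(
  bin         = levels(bins),
  frequency   = freq,
  cum_freq    = cumsum(freq),
  percent     = 100 * freq / length(x),
  cum_percent = 100 * cumsum(freq) / length(x)
)
bin_table
```

Setting include.lowest = TRUE keeps the minimum value in the first bin; without it the smallest observation would fall outside every interval.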
The ds_group_summary() function returns descriptive statistics of a continuous variable for each level of a categorical variable. The output below summarizes mpg by cyl.
## mpg by cyl
## -----------------------------------------------------------------------------------------
## | Statistic/Levels| 4| 6| 8|
## -----------------------------------------------------------------------------------------
## | Obs| 11| 7| 14|
## | Minimum| 21.4| 17.8| 10.4|
## | Maximum| 33.9| 21.4| 19.2|
## | Mean| 26.66| 19.74| 15.1|
## | Median| 26| 19.7| 15.2|
## | Mode| 22.8| 21| 10.4|
## | Std. Deviation| 4.51| 1.45| 2.56|
## | Variance| 20.34| 2.11| 6.55|
## | Skewness| 0.35| -0.26| -0.46|
## | Kurtosis| -1.43| -1.83| 0.33|
## | Uncorrected SS| 8023.83| 2741.14| 3277.34|
## | Corrected SS| 203.39| 12.68| 85.2|
## | Coeff Variation| 16.91| 7.36| 16.95|
## | Std. Error Mean| 1.36| 0.55| 0.68|
## | Range| 12.5| 3.6| 8.8|
## | Interquartile Range| 7.6| 2.35| 1.85|
## -----------------------------------------------------------------------------------------
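A grouped summary of this shape is a split-apply-combine operation. The base R sketch below covers a subset of the statistics, with statistics as rows and levels as columns as in the table above; the name group_stats is illustrative.

```r
# Split mpg into one vector per level of cyl.
groups <- split(mtcars$mpg, mtcars$cyl)

# Apply the same set of statistics to each group; sapply() binds the
# results into a matrix with one column per level.
group_stats <- sapply(groups, function(v) {
  c(obs      = length(v),
    min      = min(v),
    max      = max(v),
    mean     = mean(v),
    median   = median(v),
    sd       = sd(v),
    variance = var(v),
    range    = diff(range(v)))
})
round(group_stats, 2)
```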
The ds_multi_stats() function generates descriptive statistics for several continuous variables in a data frame or tibble at once, returning one row per variable.
## # A tibble: 3 x 16
## vars min max mean t_mean median mode range variance stdev skew
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 disp 71.1 472 231 228 196 276 401 15361 124 0.420
## 2 hp 52.0 335 147 144 123 110 283 4701 68.6 0.799
## 3 mpg 10.4 33.9 20.1 20.0 19.2 10.4 23.5 36.3 6.03 0.672
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>, q1 <dbl>,
## # q3 <dbl>, iqrange <dbl>
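The one-row-per-variable layout can be sketched in base R by computing the statistics for each column and stacking the results. This covers only some of the sixteen columns and returns a plain data frame rather than a tibble; vars and multi are illustrative names.

```r
vars <- c("disp", "hp", "mpg")

# Build a one-row data frame per variable, then stack them.
multi <- do.call(rbind, lapply(vars, function(v) {
  x <- mtcars[[v]]
  data.frame(
    vars     = v,
    min      = min(x),
    max      = max(x),
    mean     = mean(x),
    median   = median(x),
    range    = diff(range(x)),
    variance = var(x),
    stdev    = sd(x),
    q1       = quantile(x, 0.25, names = FALSE),
    q3       = quantile(x, 0.75, names = FALSE),
    iqrange  = IQR(x)
  )
}))
multi
```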
The ds_oway_tables() function creates a frequency table for each categorical variable in a data frame.
## Variable: cyl
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 4 | 11 | 11 | 34.38 | 34.38 |
## |--------------------------------------------------------------------------|
## | 6 | 7 | 18 | 21.88 | 56.25 |
## |--------------------------------------------------------------------------|
## | 8 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: vs
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 0 | 18 | 18 | 56.25 | 56.25 |
## |--------------------------------------------------------------------------|
## | 1 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: am
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 0 | 19 | 19 | 59.38 | 59.38 |
## |--------------------------------------------------------------------------|
## | 1 | 13 | 32 | 40.62 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: gear
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 3 | 15 | 15 | 46.88 | 46.88 |
## |--------------------------------------------------------------------------|
## | 4 | 12 | 27 | 37.5 | 84.38 |
## |--------------------------------------------------------------------------|
## | 5 | 5 | 32 | 15.62 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: carb
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 1 | 7 | 7 | 21.88 | 21.88 |
## |--------------------------------------------------------------------------|
## | 2 | 10 | 17 | 31.25 | 53.12 |
## |--------------------------------------------------------------------------|
## | 3 | 3 | 20 | 9.38 | 62.5 |
## |--------------------------------------------------------------------------|
## | 4 | 10 | 30 | 31.25 | 93.75 |
## |--------------------------------------------------------------------------|
## | 6 | 1 | 31 | 3.12 | 96.88 |
## |--------------------------------------------------------------------------|
## | 8 | 1 | 32 | 3.12 | 100 |
## |--------------------------------------------------------------------------|
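Producing one table per categorical variable amounts to selecting the factor columns and applying the same tabulation to each. A base R sketch, assuming a factor-typed copy of mtcars equivalent to mtcarz; dat, is_cat and one_way are illustrative names.

```r
# Rebuild a factor-typed copy of mtcars (assumed equivalent to mtcarz).
dat <- mtcars
fct <- c("cyl", "vs", "am", "gear", "carb")
dat[fct] <- lapply(dat[fct], as.factor)

# Keep only the factor columns and tabulate each one.
is_cat  <- vapply(dat, is.factor, logical(1))
one_way <- lapply(dat[is_cat], function(f) {
  counts <- as.vector(table(f))
  data.frame(
    level       = levels(f),
    frequency   = counts,
    cum_freq    = cumsum(counts),
    percent     = round(100 * counts / length(f), 2),
    cum_percent = round(100 * cumsum(counts) / length(f), 2)
  )
})

# The result is a named list with one table per variable.
one_way$carb
```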
The ds_tway_tables() function creates a cross table for each unique pair of categorical variables in a data frame. With five categorical variables, that gives ten tables.
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## cyl vs vs
## -------------------------------------------------------------
## | | vs |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 1 | 10 | 11 |
## | | 0.031 | 0.312 | |
## | | 0.09 | 0.91 | 0.34 |
## | | 0.06 | 0.71 | |
## -------------------------------------------------------------
## | 6 | 3 | 4 | 7 |
## | | 0.094 | 0.125 | |
## | | 0.43 | 0.57 | 0.22 |
## | | 0.17 | 0.29 | |
## -------------------------------------------------------------
## | 8 | 14 | 0 | 14 |
## | | 0.438 | 0 | |
## | | 1 | 0 | 0.44 |
## | | 0.78 | 0 | |
## -------------------------------------------------------------
## | Column Total | 18 | 14 | 32 |
## | | 0.563 | 0.437 | |
## -------------------------------------------------------------
##
##
## cyl vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 3 | 8 | 11 |
## | | 0.094 | 0.25 | |
## | | 0.27 | 0.73 | 0.34 |
## | | 0.16 | 0.62 | |
## -------------------------------------------------------------
## | 6 | 4 | 3 | 7 |
## | | 0.125 | 0.094 | |
## | | 0.57 | 0.43 | 0.22 |
## | | 0.21 | 0.23 | |
## -------------------------------------------------------------
## | 8 | 12 | 2 | 14 |
## | | 0.375 | 0.062 | |
## | | 0.86 | 0.14 | 0.44 |
## | | 0.63 | 0.15 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.406 | |
## -------------------------------------------------------------
##
##
## cyl vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
##
##
## cyl vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | cyl | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 4 | 5 | 6 | 0 | 0 | 0 | 0 | 11 |
## | | 0.156 | 0.188 | 0 | 0 | 0 | 0 | |
## | | 0.45 | 0.55 | 0 | 0 | 0 | 0 | 0.34 |
## | | 0.71 | 0.6 | 0 | 0 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 6 | 2 | 0 | 0 | 4 | 1 | 0 | 7 |
## | | 0.062 | 0 | 0 | 0.125 | 0.031 | 0 | |
## | | 0.29 | 0 | 0 | 0.57 | 0.14 | 0 | 0.22 |
## | | 0.29 | 0 | 0 | 0.4 | 1 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 8 | 0 | 4 | 3 | 6 | 0 | 1 | 14 |
## | | 0 | 0.125 | 0.094 | 0.188 | 0 | 0.031 | |
## | | 0 | 0.29 | 0.21 | 0.43 | 0 | 0.07 | 0.44 |
## | | 0 | 0.4 | 1 | 0.6 | 0 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.218 | 0.313 | 0.094 | 0.313 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## vs vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | vs | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 0 | 12 | 6 | 18 |
## | | 0.375 | 0.188 | |
## | | 0.67 | 0.33 | 0.56 |
## | | 0.63 | 0.46 | |
## -------------------------------------------------------------
## | 1 | 7 | 7 | 14 |
## | | 0.219 | 0.219 | |
## | | 0.5 | 0.5 | 0.44 |
## | | 0.37 | 0.54 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.407 | |
## -------------------------------------------------------------
##
##
## vs vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | vs | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 12 | 2 | 4 | 18 |
## | | 0.375 | 0.062 | 0.125 | |
## | | 0.67 | 0.11 | 0.22 | 0.56 |
## | | 0.8 | 0.17 | 0.8 | |
## ----------------------------------------------------------------------------
## | 1 | 3 | 10 | 1 | 14 |
## | | 0.094 | 0.312 | 0.031 | |
## | | 0.21 | 0.71 | 0.07 | 0.44 |
## | | 0.2 | 0.83 | 0.2 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.374 | 0.156 | |
## ----------------------------------------------------------------------------
##
##
## vs vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | vs | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 0 | 0 | 5 | 3 | 8 | 1 | 1 | 18 |
## | | 0 | 0.156 | 0.094 | 0.25 | 0.031 | 0.031 | |
## | | 0 | 0.28 | 0.17 | 0.44 | 0.06 | 0.06 | 0.56 |
## | | 0 | 0.5 | 1 | 0.8 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 1 | 7 | 5 | 0 | 2 | 0 | 0 | 14 |
## | | 0.219 | 0.156 | 0 | 0.062 | 0 | 0 | |
## | | 0.5 | 0.36 | 0 | 0.14 | 0 | 0 | 0.44 |
## | | 1 | 0.5 | 0 | 0.2 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.312 | 0.094 | 0.312 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## am vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | am | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 15 | 4 | 0 | 19 |
## | | 0.469 | 0.125 | 0 | |
## | | 0.79 | 0.21 | 0 | 0.59 |
## | | 1 | 0.33 | 0 | |
## ----------------------------------------------------------------------------
## | 1 | 0 | 8 | 5 | 13 |
## | | 0 | 0.25 | 0.156 | |
## | | 0 | 0.62 | 0.38 | 0.41 |
## | | 0 | 0.67 | 1 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.375 | 0.156 | |
## ----------------------------------------------------------------------------
##
##
## am vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | am | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 0 | 3 | 6 | 3 | 7 | 0 | 0 | 19 |
## | | 0.094 | 0.188 | 0.094 | 0.219 | 0 | 0 | |
## | | 0.16 | 0.32 | 0.16 | 0.37 | 0 | 0 | 0.6 |
## | | 0.43 | 0.6 | 1 | 0.7 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 1 | 4 | 4 | 0 | 3 | 1 | 1 | 13 |
## | | 0.125 | 0.125 | 0 | 0.094 | 0.031 | 0.031 | |
## | | 0.31 | 0.31 | 0 | 0.23 | 0.08 | 0.08 | 0.41 |
## | | 0.57 | 0.4 | 0 | 0.3 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.313 | 0.094 | 0.313 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## gear vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | gear | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 3 | 3 | 4 | 3 | 5 | 0 | 0 | 15 |
## | | 0.094 | 0.125 | 0.094 | 0.156 | 0 | 0 | |
## | | 0.2 | 0.27 | 0.2 | 0.33 | 0 | 0 | 0.47 |
## | | 0.43 | 0.4 | 1 | 0.5 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 4 | 4 | 4 | 0 | 4 | 0 | 0 | 12 |
## | | 0.125 | 0.125 | 0 | 0.125 | 0 | 0 | |
## | | 0.33 | 0.33 | 0 | 0.33 | 0 | 0 | 0.38 |
## | | 0.57 | 0.4 | 0 | 0.4 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 5 | 0 | 2 | 0 | 1 | 1 | 1 | 5 |
## | | 0 | 0.062 | 0 | 0.031 | 0.031 | 0.031 | |
## | | 0 | 0.4 | 0 | 0.2 | 0.2 | 0.2 | 0.16 |
## | | 0 | 0.2 | 0 | 0.1 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.312 | 0.094 | 0.312 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
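The set of unique pairs can be enumerated with combn(). The base R sketch below builds a frequency-only cross table for each pair, again assuming a factor-typed copy of mtcars equivalent to mtcarz; var_pairs and two_way are illustrative names.

```r
# Rebuild a factor-typed copy of mtcars (assumed equivalent to mtcarz).
dat <- mtcars
fct <- c("cyl", "vs", "am", "gear", "carb")
dat[fct] <- lapply(dat[fct], as.factor)

# Every unordered pair of factor columns, in column order.
is_cat    <- vapply(dat, is.factor, logical(1))
var_pairs <- combn(names(dat)[is_cat], 2, simplify = FALSE)

# One contingency table per pair, labelled like the output above.
two_way <- lapply(var_pairs, function(p) {
  table(dat[[p[1]]], dat[[p[2]]], dnn = p)
})
names(two_way) <- vapply(var_pairs, paste, character(1), collapse = " vs ")

names(two_way)
two_way[["gear vs carb"]]
```

Five variables give choose(5, 2) = 10 pairs, and each pair appears once because the order of the two variables is not distinguished.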