The tableby function

Beth Atkinson, Ethan Heinzen, Jason Sinnwell, Shannon McDonnell and Greg Dougherty

02 February, 2018

Introduction

One of the most common tables in medical literature includes summary statistics for a set of variables, often stratified by some group (e.g. treatment arm). Locally at Mayo, the SAS macros %table and %summary were written to create summary tables with a single call. With the increasing interest in R, we have developed the function tableby to create similar tables within the R environment.

In developing the tableby() function, the goal was to bring the best features of these macros into an R function. However, the task was not simply to duplicate all the functionality, but rather to make use of R’s strengths (modeling, method dispersion, flexibility in function definition and output format) and make a tool that fits the needs of R users. Additionally, the results needed to fit within the general reproducible research framework so the tables could be displayed within an R markdown report.

This report provides step-by-step directions for using the functions associated with tableby(). All functions presented here are available within the arsenal package. An assumption is made that users are somewhat familiar with R Markdown documents. For those who are new to the topic, a good initial resource is available at rmarkdown.rstudio.com.

Simple Example

The first step when using the tableby function is to load the arsenal package. All the examples in this report use a dataset called mockstudy made available by Paul Novotny which includes a variety of types of variables (character, numeric, factor, ordered factor, survival) to use as examples.

require(arsenal)
require(knitr)
require(survival)
data(mockstudy) ##load data
dim(mockstudy)  ##look at how many subjects and variables are in the dataset 
## [1] 1499   14
# help(mockstudy) ##learn more about the dataset and variables
str(mockstudy) ##quick look at the data
## 'data.frame':    1499 obs. of  14 variables:
##  $ case       : int  110754 99706 105271 105001 112263 86205 99508 90158 88989 90515 ...
##  $ age        : atomic  67 74 50 71 69 56 50 57 51 63 ...
##   ..- attr(*, "label")= chr "Age in Years"
##  $ arm        : atomic  F: FOLFOX A: IFL A: IFL G: IROX ...
##   ..- attr(*, "label")= chr "Treatment Arm"
##  $ sex        : Factor w/ 2 levels "Male","Female": 1 2 2 2 2 1 1 1 2 1 ...
##  $ race       : atomic  Caucasian Caucasian Caucasian Caucasian ...
##   ..- attr(*, "label")= chr "Race"
##  $ fu.time    : int  922 270 175 128 233 120 369 421 387 363 ...
##  $ fu.stat    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ ps         : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ hgb        : num  11.5 10.7 11.1 12.6 13 10.2 13.3 12.1 13.8 12.1 ...
##  $ bmi        : atomic  25.1 19.5 NA 29.4 26.4 ...
##   ..- attr(*, "label")= chr "Body Mass Index (kg/m^2)"
##  $ alk.phos   : int  160 290 700 771 350 569 162 152 231 492 ...
##  $ ast        : int  35 52 100 68 35 27 16 12 25 18 ...
##  $ mdquality.s: int  NA 1 1 1 NA 1 1 1 1 1 ...
##  $ age.ord    : Ord.factor w/ 8 levels "10-19"<"20-29"<..: 6 7 4 7 6 5 4 5 5 6 ...

To create a simple table stratified by treament arm, use a formula statement to specify the variables that you want summarized. The example below uses age (a continuous variable) and sex (a factor).

tab1 <- tableby(arm ~ sex + age, data=mockstudy)

If you want to take a quick look at the table, you can use summary() on your tableby object and the table will print out as text in your R console window. If you use summary() without any options you will see a number of \(\&nbsp;\) statements which translates to “space” in HTML.

Pretty text version of table

If you want a nicer version in your console window then add the text=TRUE option.

summary(tab1, text=TRUE)
## 
## 
##                A: IFL (N=428)    F: FOLFOX (N=691)   G: IROX (N=380)   Total (N=1499)     p value
## -------------  ----------------  ------------------  ----------------  ----------------  --------
## sex                                                                                         0.190
## -  Male        277 (64.7%)       411 (59.5%)         228 (60.0%)       916 (61.1%)               
## -  Female      151 (35.3%)       280 (40.5%)         152 (40.0%)       583 (38.9%)               
## Age in Years                                                                                0.614
## -  Mean (SD)   59.673 (11.365)   60.301 (11.632)     59.763 (11.499)   59.985 (11.519)           
## -  Range       27.000 - 88.000   19.000 - 88.000     26.000 - 85.000   19.000 - 88.000

Pretty Rmarkdown version of table

In order for the report to look nice within an R markdown (knitr) report, you just need to specify results="asis" when creating the r chunk. This changes the layout slightly (compresses it) and bolds the variable names.

summary(tab1)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
sex 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age in Years 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

Data frame version of table

If you want a data.frame version, simply use as.data.frame.

as.data.frame(tab1)
##   variable   term        label variable.type              A: IFL           F: FOLFOX
## 1      sex    sex          sex   categorical                                        
## 2      sex   Male         Male   categorical 277.00000, 64.71963 411.00000, 59.47902
## 3      sex Female       Female   categorical 151.00000, 35.28037 280.00000, 40.52098
## 4      age    age Age in Years       numeric                                        
## 5      age meansd    Mean (SD)       numeric  59.67290, 11.36454  60.30101, 11.63225
## 6      age  range        Range       numeric              27, 88              19, 88
##              G: IROX              Total                       test   p.value
## 1                                       Pearson's Chi-squared test 0.1904388
## 2            228, 60  916.0000, 61.1074 Pearson's Chi-squared test 0.1904388
## 3            152, 40  583.0000, 38.8926 Pearson's Chi-squared test 0.1904388
## 4                                               Linear Model ANOVA 0.6143859
## 5 59.76316, 11.49930 59.98532, 11.51877         Linear Model ANOVA 0.6143859
## 6             26, 85             19, 88         Linear Model ANOVA 0.6143859

Summaries using standard R code

## base R frequency example
tmp <- table(Gender=mockstudy$sex, "Study Arm"=mockstudy$arm)
tmp
##         Study Arm
## Gender   A: IFL F: FOLFOX G: IROX
##   Male      277       411     228
##   Female    151       280     152
# Note: The continuity correction is applied by default in R (not used in %table)
chisq.test(tmp)
## 
##  Pearson's Chi-squared test
## 
## data:  tmp
## X-squared = 3.3168, df = 2, p-value = 0.1904
## base R numeric summary example
tapply(mockstudy$age, mockstudy$arm, summary)
## $`A: IFL`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   53.00   61.00   59.67   68.00   88.00 
## 
## $`F: FOLFOX`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0    52.0    61.0    60.3    69.0    88.0 
## 
## $`G: IROX`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   52.00   61.00   59.76   68.00   85.00
summary(aov(age ~ arm, data=mockstudy))
##               Df Sum Sq Mean Sq F value Pr(>F)
## arm            2    129    64.7   0.487  0.614
## Residuals   1496 198628   132.8

Modifying Output

Add labels

In the above example, age is shown with a label (Age in Years), but sex is listed “as is” with lower case letters. This is because the data was created in SAS and in the SAS dataset, age had a label but sex did not. The label is stored as an attribute within R.

## Look at one variable's label
attr(mockstudy$age,'label')
## [1] "Age in Years"
## See all the variables with a label
unlist(lapply(mockstudy,'attr','label'))
##                        age                        arm                       race 
##             "Age in Years"            "Treatment Arm"                     "Race" 
##                        bmi 
## "Body Mass Index (kg/m^2)"
# Can also use labels(mockstudy)

If you want to add labels to other variables, there are a couple of options. First, you could add labels to the variables in your dataset.

attr(mockstudy$sex,'label')  <- 'Gender'

tab1 <- tableby(arm ~ sex + age, data=mockstudy)
summary(tab1)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age in Years 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

You can also use the built-in data.frame method for labels<-:

labels(mockstudy)  <- c(age = 'Age, yrs', sex = "Gender")

tab1 <- tableby(arm ~ sex + age, data=mockstudy)
summary(tab1)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

Another option is to add labels after you have created the table

mylabels <- list(sex = "SEX", age = "Age, yrs")
summary(tab1, labelTranslations = mylabels)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
SEX 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

Alternatively, you can check the variable labels and manipulate them with a function called labels, which works on the tableby object.

labels(tab1)
##             arm             sex             age 
## "Treatment Arm"        "Gender"      "Age, yrs"
labels(tab1) <- c(arm="Treatment Assignment", age="Baseline Age (yrs)")
labels(tab1)
##                    arm                    sex                    age 
## "Treatment Assignment"               "Gender"   "Baseline Age (yrs)"
summary(tab1)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Baseline Age (yrs) 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

Change summary statistics globally

Currently the default behavior is to summarize continuous variables with: Number of missing values, Mean (SD), 25th - 75th quantiles, and Minimum-Maximum values with an ANOVA (t-test with equal variances) p-value. For categorical variables the default is to show: Number of missing values and count (column percent) with a chi-square p-value. This behavior can be modified using the tableby.control function. In fact, you can save your standard settings and use that for future tables. Note that test=FALSE and total=FALSE results in the total column and p-value column not being printed.

mycontrols  <- tableby.control(test=FALSE, total=FALSE,
                               numeric.test="kwt", cat.test="chisq",
                               numeric.stats=c("N", "median", "q1q3"),
                               cat.stats=c("countpct"),
                               stats.labels=list(N='Count', median='Median', q1q3='Q1,Q3'))
tab2 <- tableby(arm ~ sex + age, data=mockstudy, control=mycontrols)
summary(tab2)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380)
Gender
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%)
Age, yrs
   Count 428 691 380
   Median 61.000 61.000 61.000
   Q1,Q3 53.000, 68.000 52.000, 69.000 52.000, 68.000

You can also change these settings directly in the tableby call.

tab3 <- tableby(arm ~ sex + age, data=mockstudy, test=FALSE, total=FALSE, 
                numeric.stats=c("median","q1q3"), numeric.test="kwt")
summary(tab3)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380)
Gender
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%)
Age, yrs
   median 61.000 61.000 61.000
   Q1, Q3 53.000, 68.000 52.000, 69.000 52.000, 68.000

Change summary statistics within the formula

In addition to modifying summary options globally, it is possible to modify the test and summary statistics for specific variables within the formula statement. For example, both the kwt (Kruskal-Wallis rank-based) and anova (asymptotic analysis of variance) tests apply to numeric variables and we can use one for the variable “age” and another for the variable “ast”. A list of all the options is shown at the end of the vignette.

The tests function can do a quick check on what tests were performed on each variable in tableby.

tab.test <- tableby(arm ~ kwt(age) + anova(bmi) + kwt(ast), data=mockstudy)
tests(tab.test)
##                     Variable    p.value                       Method
## age                 Age, yrs 0.63906143 Kruskal-Wallis rank sum test
## bmi Body Mass Index (kg/m^2) 0.89165522           Linear Model ANOVA
## ast                      ast 0.03902803 Kruskal-Wallis rank sum test
summary(tab.test)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age, yrs 0.639
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000
Body Mass Index (kg/m^2) 0.892
   N-Miss 9 20 4 33
   Mean (SD) 27.290 (5.552) 27.210 (5.173) 27.106 (5.751) 27.206 (5.432)
   Range 14.053 - 53.008 16.649 - 49.130 15.430 - 60.243 14.053 - 60.243
ast 0.039
   N-Miss 69 141 56 266
   Mean (SD) 37.292 (28.036) 35.202 (26.659) 35.670 (25.807) 35.933 (26.843)
   Range 10.000 - 205.000 7.000 - 174.000 5.000 - 176.000 5.000 - 205.000

Summary statistics for any individual variable can also be modified, but it must be done as secondary arguments to the test function. The function names must be strings that are functions already written for tableby, built-in R functions like mean and range, or user-defined functions.

tab.test <- tableby(arm ~ kwt(ast, "Nmiss2","median") + anova(age, "N","mean") +
                    kwt(bmi, "Nmiss","median"), data=mockstudy)
summary(tab.test)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
ast 0.039
   N-Miss 69 141 56 266
   median 29.000 25.500 27.000 27.000
Age, yrs 0.614
   N 428 691 380 1499
   mean 59.7 60.3 59.8 60
Body Mass Index (kg/m^2) 0.641
   N-Miss 9 20 4 33
   median 26.234 26.525 25.978 26.325

Controlling Options for Categorical Tests (Chisq and Fisher’s)

The formal tests for categorical variables against the levels of the by variable, chisq and fe, have options to simulate p-values. We show how to turn on the simulations for these with 500 replicates for the Fisher’s test (fe).

set.seed(100)
tab.catsim <- tableby(arm ~ sex + race, cat.test="fe", simulate.p.value=TRUE, B=500, data=mockstudy)
tests(tab.catsim)
 Variable   p.value

sex Gender 0.2195609 race Race 0.3093812 Method sex Fisher’s Exact Test for Count Data with simulated p-value(based on 500 replicates) race Fisher’s Exact Test for Count Data with simulated p-value(based on 500 replicates)

The chis-square test on 2x2 tables applies Yates’ continuity correction by default, so we provide an option to turn off the correction. We show the results with and without the correction that is applied to treatment arm by sex, if we use subset to ignore one of the three treatment arms.

cat.correct <- tableby(arm ~ sex + race, cat.test="chisq", subset = !grepl("^F", arm), data=mockstudy)
tests(cat.correct)
 Variable   p.value                                                       Method

sex Gender 0.1900861 Pearson’s Chi-squared test with Yates’ continuity correction race Race 0.8108543 Pearson’s Chi-squared test

cat.nocorrect <- tableby(arm ~ sex + race, cat.test="chisq", subset = !grepl("^F", arm),
     chisq.correct=FALSE, data=mockstudy)
tests(cat.nocorrect)
 Variable   p.value                     Method

sex Gender 0.1666280 Pearson’s Chi-squared test race Race 0.8108543 Pearson’s Chi-squared test

Modifying the look & feel in Word documents

You can easily create Word versions of tableby output via an Rmarkdown report and the default options will give you a reasonable table in Word - just select the “Knit Word” option in RStudio.

The functionality listed in this next paragraph is coming soon but needs an upgraded version of RStudio If you want to modify fonts used for the table, then you’ll need to add an extra line to your header at the beginning of your file. You can take the WordStylesReference01.docx file and modify the fonts (storing the format preferences in your project directory). To see how this works, run your report once using WordStylesReference01.docx and then WordStylesReference02.docx.

output: word_document
  reference_docx: /projects/bsi/gentools/R/lib320/arsenal/doc/WordStylesReference01.docx 

For more informating on changing the look/feel of your Word document, see the Rmarkdown documentation website.

Additional Examples

Here are multiple examples showing how to use some of the different options.

1. Summarize without a group/by variable

tab.noby <- tableby(~ bmi + sex + age, data=mockstudy)
summary(tab.noby)
Overall (N=1499)
Body Mass Index (kg/m^2)
   N-Miss 33
   Mean (SD) 27.206 (5.432)
   Range 14.053 - 60.243
Gender
   Male 916 (61.1%)
   Female 583 (38.9%)
Age, yrs
   Mean (SD) 59.985 (11.519)
   Range 19.000 - 88.000

2. Display footnotes indicating which “test” was used

summary(tab.test) #, pfootnote=TRUE)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
ast 0.039
   N-Miss 69 141 56 266
   median 29.000 25.500 27.000 27.000
Age, yrs 0.614
   N 428 691 380 1499
   mean 59.7 60.3 59.8 60
Body Mass Index (kg/m^2) 0.641
   N-Miss 9 20 4 33
   median 26.234 26.525 25.978 26.325

3. Summarize an ordered factor

When comparing groups of ordered data there are a couple of options. The default uses a general independence test available from the coin package. For two-group comparisons, this is essentially the Armitage trend test. The other option is to specify the Kruskal Wallis test. The example below shows both options.

mockstudy$age.ordnew <- ordered(c("a",NA,as.character(mockstudy$age.ord[-(1:2)])))
table(mockstudy$age.ord, mockstudy$sex)
##        
##         Male Female
##   10-19    1      0
##   20-29    8     11
##   30-39   37     30
##   40-49  127     83
##   50-59  257    179
##   60-69  298    170
##   70-79  168    101
##   80-89   20      9
table(mockstudy$age.ordnew, mockstudy$sex)
##        
##         Male Female
##   10-19    1      0
##   20-29    8     11
##   30-39   37     30
##   40-49  127     83
##   50-59  257    179
##   60-69  297    170
##   70-79  168    100
##   80-89   20      9
##   a        1      0
class(mockstudy$age.ord)
## [1] "ordered" "factor"
summary(tableby(sex ~ age.ordnew, data = mockstudy)) #, pfootnote = TRUE)
Male (N=916) Female (N=583) Total (N=1499) p value
age.ordnew 0.040
   N-Miss 0 1 1
   10-19 1 (0.1%) 0 (0.0%) 1 (0.1%)
   20-29 8 (0.9%) 11 (1.9%) 19 (1.3%)
   30-39 37 (4.0%) 30 (5.2%) 67 (4.5%)
   40-49 127 (13.9%) 83 (14.3%) 210 (14.0%)
   50-59 257 (28.1%) 179 (30.8%) 436 (29.1%)
   60-69 297 (32.4%) 170 (29.2%) 467 (31.2%)
   70-79 168 (18.3%) 100 (17.2%) 268 (17.9%)
   80-89 20 (2.2%) 9 (1.5%) 29 (1.9%)
   a 1 (0.1%) 0 (0.0%) 1 (0.1%)
summary(tableby(sex ~ kwt(age.ord), data = mockstudy)) #) #, pfootnote = TRUE)
Male (N=916) Female (N=583) Total (N=1499) p value
age.ord 0.067
   10-19 1 (0.1%) 0 (0.0%) 1 (0.1%)
   20-29 8 (0.9%) 11 (1.9%) 19 (1.3%)
   30-39 37 (4.0%) 30 (5.1%) 67 (4.5%)
   40-49 127 (13.9%) 83 (14.2%) 210 (14.0%)
   50-59 257 (28.1%) 179 (30.7%) 436 (29.1%)
   60-69 298 (32.5%) 170 (29.2%) 468 (31.2%)
   70-79 168 (18.3%) 101 (17.3%) 269 (17.9%)
   80-89 20 (2.2%) 9 (1.5%) 29 (1.9%)

4. Summarize a survival variable

First look at the information that is presented by the survfit() function, then see how the same results can be seen with tableby. The default is to show the median survival (time at which the probability of survival = 50%).

survfit(Surv(fu.time, fu.stat)~sex, data=mockstudy)
## Call: survfit(formula = Surv(fu.time, fu.stat) ~ sex, data = mockstudy)
## 
##              n events median 0.95LCL 0.95UCL
## sex=Male   916    829    550     515     590
## sex=Female 583    527    543     511     575
survdiff(Surv(fu.time, fu.stat)~sex, data=mockstudy)
## Call:
## survdiff(formula = Surv(fu.time, fu.stat) ~ sex, data = mockstudy)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   916      829      830  0.000370  0.000956
## sex=Female 583      527      526  0.000583  0.000956
## 
##  Chisq= 0  on 1 degrees of freedom, p= 0.975
summary(tableby(sex ~ Surv(fu.time, fu.stat), data=mockstudy))
Male (N=916) Female (N=583) Total (N=1499) p value
Surv(fu.time, fu.stat) 0.975
   Events 829 527 1356
   Median Survival 550.000 543.000 546.000

It is also possible to obtain summaries of the % survival at certain time points (say the probability of surviving 1-year).

summary(survfit(Surv(fu.time/365.25, fu.stat)~sex, data=mockstudy), times=1:5)
## Call: survfit(formula = Surv(fu.time/365.25, fu.stat) ~ sex, data = mockstudy)
## 
##                 sex=Male 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    626     286   0.6870  0.0153       0.6576       0.7177
##     2    309     311   0.3437  0.0158       0.3142       0.3761
##     3    152     151   0.1748  0.0127       0.1516       0.2015
##     4     57      61   0.0941  0.0104       0.0759       0.1168
##     5     24      16   0.0628  0.0095       0.0467       0.0844
## 
##                 sex=Female 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    380     202   0.6531  0.0197       0.6155        0.693
##     2    190     189   0.3277  0.0195       0.2917        0.368
##     3     95      90   0.1701  0.0157       0.1420        0.204
##     4     51      32   0.1093  0.0133       0.0861        0.139
##     5     18      12   0.0745  0.0126       0.0534        0.104
summary(tableby(sex ~ Surv(fu.time/365.25, fu.stat), data=mockstudy, times=1:5, surv.stats=c("NeventsSurv","NriskSurv")))
Male (N=916) Female (N=583) Total (N=1499) p value
Surv(fu.time/365.25, fu.stat) 0.975
   time = 1 286 (68.7) 202 (65.3) 488 (67.4)
   time = 2 597 (34.4) 391 (32.8) 988 (33.7)
   time = 3 748 (17.5) 481 (17.0) 1229 (17.3)
   time = 4 809 (9.4) 513 (10.9) 1322 (10.1)
   time = 5 825 (6.3) 525 (7.4) 1350 (6.8)
   time = 1 626 380 1006
   time = 2 309 190 499
   time = 3 152 95 247
   time = 4 57 51 108
   time = 5 24 18 42

5. Summarize date variables

Date variables by default are summarized with the number of missing values, the median, and the range. For example purposes we’ve created a random date. Missing values are introduced for impossible February dates.

set.seed(100)
N <- nrow(mockstudy)
mockstudy$dtentry <- mdy.Date(month=sample(1:12,N,replace=T), day=sample(1:29,N,replace=T), 
                              year=sample(2005:2009,N,replace=T))
summary(tableby(sex ~ dtentry, data=mockstudy))
Male (N=916) Female (N=583) Total (N=1499) p value
dtentry 0.554
   N-Miss 3 2 5
   median 2007-06-16 2007-06-15 2007-06-15
   Range 2005-01-03 - 2009-12-27 2005-01-01 - 2009-12-28 2005-01-01 - 2009-12-28

6. Summarize multiple variables without typing them out

Often one wants to summarize a number of variables. Instead of typing by hand each individual variable, an alternative approach is to create a formula using the paste command with the collapse="+" option.

## create a vector specifying the variable names
myvars <- names(mockstudy)

## select the 8th through the last variables
## paste them together, separated by the + sign
RHS <- paste(myvars[8:10], collapse="+")
RHS

[1] “ps+hgb+bmi”

## create a formula using the as.formula function
as.formula(paste('arm ~ ', RHS))

arm ~ ps + hgb + bmi

## use the formula in the tableby function
summary(tableby(as.formula(paste('arm ~', RHS)), data=mockstudy))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
ps 0.903
   N-Miss 69 141 56 266
   Mean (SD) 0.529 (0.597) 0.547 (0.595) 0.537 (0.606) 0.539 (0.598)
   Range 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000
hgb 0.639
   N-Miss 69 141 56 266
   Mean (SD) 12.276 (1.686) 12.381 (1.763) 12.373 (1.680) 12.348 (1.719)
   Range 9.060 - 17.300 9.000 - 18.200 9.000 - 17.000 9.000 - 18.200
Body Mass Index (kg/m^2) 0.892
   N-Miss 9 20 4 33
   Mean (SD) 27.290 (5.552) 27.210 (5.173) 27.106 (5.751) 27.206 (5.432)
   Range 14.053 - 53.008 16.649 - 49.130 15.430 - 60.243 14.053 - 60.243

These steps can also be done using the formulize function.

## The formulize function does the paste and as.formula steps
tmp <- formulize('arm',myvars[8:10])
tmp

arm ~ ps + hgb + bmi

## More complex formulas could also be written using formulize
tmp2 <- formulize('arm',c('ps','hgb^2','bmi'))

## use the formula in the tableby function
summary(tableby(tmp, data=mockstudy))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
ps 0.903
   N-Miss 69 141 56 266
   Mean (SD) 0.529 (0.597) 0.547 (0.595) 0.537 (0.606) 0.539 (0.598)
   Range 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000
hgb 0.639
   N-Miss 69 141 56 266
   Mean (SD) 12.276 (1.686) 12.381 (1.763) 12.373 (1.680) 12.348 (1.719)
   Range 9.060 - 17.300 9.000 - 18.200 9.000 - 17.000 9.000 - 18.200
Body Mass Index (kg/m^2) 0.892
   N-Miss 9 20 4 33
   Mean (SD) 27.290 (5.552) 27.210 (5.173) 27.106 (5.751) 27.206 (5.432)
   Range 14.053 - 53.008 16.649 - 49.130 15.430 - 60.243 14.053 - 60.243

7. Subset the dataset used in the analysis

Here are two ways to get the same result (limit the analysis to subjects age>5 and in the F: FOLFOX treatment group).

newdata <- subset(mockstudy, subset=age>50 & arm=='F: FOLFOX', select = c(sex,ps:bmi))
dim(mockstudy)
## [1] 1499   16
table(mockstudy$arm)
## 
##    A: IFL F: FOLFOX   G: IROX 
##       428       691       380
dim(newdata)
## [1] 557   4
names(newdata)
## [1] "sex" "ps"  "hgb" "bmi"
summary(tableby(sex ~ ., data=newdata))
Male (N=333) Female (N=224) Total (N=557) p value
ps 0.652
   N-Miss 64 44 108
   Mean (SD) 0.554 (0.600) 0.528 (0.602) 0.543 (0.600)
   Range 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000
hgb < 0.001
   N-Miss 64 44 108
   Mean (SD) 12.720 (1.925) 12.063 (1.395) 12.457 (1.760)
   Range 9.000 - 18.200 9.100 - 15.900 9.000 - 18.200
bmi 0.650
   N-Miss 9 6 15
   Mean (SD) 27.539 (4.780) 27.337 (5.508) 27.458 (5.081)
   Range 17.927 - 47.458 16.649 - 49.130 16.649 - 49.130
summary(tableby(sex ~ ps + hgb + bmi, subset=age>50 & arm=="F: FOLFOX", data=mockstudy))
Male (N=333) Female (N=224) Total (N=557) p value
ps 0.652
   N-Miss 64 44 108
   Mean (SD) 0.554 (0.600) 0.528 (0.602) 0.543 (0.600)
   Range 0.000 - 2.000 0.000 - 2.000 0.000 - 2.000
hgb < 0.001
   N-Miss 64 44 108
   Mean (SD) 12.720 (1.925) 12.063 (1.395) 12.457 (1.760)
   Range 9.000 - 18.200 9.100 - 15.900 9.000 - 18.200
Body Mass Index (kg/m^2) 0.650
   N-Miss 9 6 15
   Mean (SD) 27.539 (4.780) 27.337 (5.508) 27.458 (5.081)
   Range 17.927 - 47.458 16.649 - 49.130 16.649 - 49.130

8. Create combinations of variables on the fly

## create a variable combining the levels of mdquality.s and sex
with(mockstudy, table(interaction(mdquality.s,sex)))
## 
##   0.Male   1.Male 0.Female 1.Female 
##       77      686       47      437
summary(tableby(arm ~ interaction(mdquality.s,sex), data=mockstudy))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
interaction(mdquality.s, sex) 0.493
   N-Miss 55 156 41 252
   0.Male 29 (7.8%) 31 (5.8%) 17 (5.0%) 77 (6.2%)
   1.Male 214 (57.4%) 285 (53.3%) 187 (55.2%) 686 (55.0%)
   0.Female 12 (3.2%) 21 (3.9%) 14 (4.1%) 47 (3.8%)
   1.Female 118 (31.6%) 198 (37.0%) 121 (35.7%) 437 (35.0%)
## create a new grouping variable with combined levels of arm and sex
summary(tableby(interaction(mdquality.s, sex) ~  age + bmi, data=mockstudy, subset=arm=="F: FOLFOX"))
0.Male (N=31) 1.Male (N=285) 0.Female (N=21) 1.Female (N=198) Total (N=535) p value
Age, yrs 0.190
   Mean (SD) 63.065 (11.702) 60.653 (11.833) 60.810 (10.103) 58.924 (11.366) 60.159 (11.612)
   Range 41.000 - 82.000 19.000 - 88.000 42.000 - 81.000 29.000 - 83.000 19.000 - 88.000
Body Mass Index (kg/m^2) 0.894
   N-Miss 0 6 1 5 12
   Mean (SD) 26.633 (5.094) 27.387 (4.704) 27.359 (4.899) 27.294 (5.671) 27.307 (5.100)
   Range 20.177 - 41.766 17.927 - 47.458 19.801 - 39.369 16.799 - 44.841 16.799 - 47.458

9. Transform variables on the fly

Certain transformations need to be surrounded by I() so that R knows to treat it as a variable transformation and not some special model feature. If the transformation includes any of the symbols / - + ^ * then surround the new variable by I().

trans <- tableby(arm ~ I(age/10) + log(bmi) + factor(mdquality.s, levels=0:1, labels=c('N','Y')),
                 data=mockstudy)
summary(trans)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age, yrs 0.614
   Mean (SD) 5.967 (1.136) 6.030 (1.163) 5.976 (1.150) 5.999 (1.152)
   Range 2.700 - 8.800 1.900 - 8.800 2.600 - 8.500 1.900 - 8.800
Body Mass Index (kg/m^2) 0.811
   N-Miss 9 20 4 33
   Mean (SD) 3.287 (0.197) 3.286 (0.183) 3.279 (0.200) 3.285 (0.192)
   Range 2.643 - 3.970 2.812 - 3.894 2.736 - 4.098 2.643 - 4.098
factor(mdquality.s, levels = 0:1, labels = c(“N”, “Y”)) 0.694
   N-Miss 55 156 41 252
   N 41 (11.0%) 52 (9.7%) 31 (9.1%) 124 (9.9%)
   Y 332 (89.0%) 483 (90.3%) 308 (90.9%) 1123 (90.1%)

The labels for these variables isn’t exactly what we’d like so we can change modify those after the fact. Instead of typing out the very long variable names you can modify specific labels by position.

labels(trans)
##                                                           arm 
##                                               "Treatment Arm" 
##                                                     I(age/10) 
##                                                    "Age, yrs" 
##                                                      log(bmi) 
##                                    "Body Mass Index (kg/m^2)" 
##       factor(mdquality.s, levels = 0:1, labels = c("N", "Y")) 
## "factor(mdquality.s, levels = 0:1, labels = c(\"N\", \"Y\"))"
labels(trans)[2:4] <- c('Age per 10 yrs', 'log(BMI)', 'MD Quality')
labels(trans)
##                                                     arm 
##                                         "Treatment Arm" 
##                                               I(age/10) 
##                                        "Age per 10 yrs" 
##                                                log(bmi) 
##                                              "log(BMI)" 
## factor(mdquality.s, levels = 0:1, labels = c("N", "Y")) 
##                                            "MD Quality"
summary(trans)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age per 10 yrs 0.614
   Mean (SD) 5.967 (1.136) 6.030 (1.163) 5.976 (1.150) 5.999 (1.152)
   Range 2.700 - 8.800 1.900 - 8.800 2.600 - 8.500 1.900 - 8.800
log(BMI) 0.811
   N-Miss 9 20 4 33
   Mean (SD) 3.287 (0.197) 3.286 (0.183) 3.279 (0.200) 3.285 (0.192)
   Range 2.643 - 3.970 2.812 - 3.894 2.736 - 4.098 2.643 - 4.098
MD Quality 0.694
   N-Miss 55 156 41 252
   N 41 (11.0%) 52 (9.7%) 31 (9.1%) 124 (9.9%)
   Y 332 (89.0%) 483 (90.3%) 308 (90.9%) 1123 (90.1%)

Note that if we had not changed mdquality.s to a factor, it would have been summarized as though it were a continuous variable.

class(mockstudy$mdquality.s)

[1] “integer”

summary(tableby(arm~mdquality.s, data=mockstudy))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
mdquality.s 0.695
   N-Miss 55 156 41 252
   Mean (SD) 0.890 (0.313) 0.903 (0.297) 0.909 (0.289) 0.901 (0.299)
   Range 0.000 - 1.000 0.000 - 1.000 0.000 - 1.000 0.000 - 1.000

Another option would be to specify the test and summary statistics. In fact, if I had a set of variables coded 0/1 and that was all I was summarizing, then I could change the global option for continuous variables to use the chi-square test and show countpct.

summary(tableby(arm ~ chisq(mdquality.s, "Nmiss","countpct"), data=mockstudy))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
mdquality.s 0.694
   N-Miss 55 156 41 252
   0 41 (11.0%) 52 (9.7%) 31 (9.1%) 124 (9.9%)
   1 332 (89.0%) 483 (90.3%) 308 (90.9%) 1123 (90.1%)

10. Change the ordering of the variables or delete a variable

mytab <- tableby(arm ~ sex + alk.phos + age, data=mockstudy)
mytab2 <- mytab[c('age','sex','alk.phos')]
summary(mytab2)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
alk.phos 0.226
   N-Miss 69 141 56 266
   Mean (SD) 175.577 (128.608) 161.984 (121.978) 173.506 (138.564) 168.969 (128.492)
   Range 11.000 - 858.000 10.000 - 1014.000 7.000 - 982.000 7.000 - 1014.000
summary(mytab[c('age','sex')], digits = 2)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age, yrs 0.614
   Mean (SD) 59.67 (11.36) 60.30 (11.63) 59.76 (11.50) 59.99 (11.52)
   Range 27.00 - 88.00 19.00 - 88.00 26.00 - 85.00 19.00 - 88.00
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
summary(mytab[c(3,1)], digits = 3)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)

11. Merge two tableby objects together

It is possible to combine two tableby objects so that they print out together.

## demographics
tab1 <- tableby(arm ~ sex + age, data=mockstudy,
                control=tableby.control(numeric.stats=c("Nmiss","meansd"), total=FALSE))
## lab data
tab2 <- tableby(arm ~ hgb + alk.phos, data=mockstudy,
                control=tableby.control(numeric.stats=c("Nmiss","median","q1q3"),
                                        numeric.test="kwt", total=FALSE))
names(tab1$x)

[1] “sex” “age”

names(tab2$x)

[1] “hgb” “alk.phos”

tab12 <- merge(tab1,tab2)
class(tab12)

[1] “tableby”

names(tab12$x)

[1] “sex” “age” “hgb” “alk.phos”

summary(tab12) #, pfootnote=TRUE)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) p value
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%)
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499)
hgb 0.570
   N-Miss 69 141 56
   median 12.100 12.200 12.400
   Q1, Q3 11.000, 13.450 11.100, 13.600 11.175, 13.625
alk.phos 0.104
   N-Miss 69 141 56
   median 133.000 116.000 122.000
   Q1, Q3 89.000, 217.000 85.000, 194.750 87.750, 210.250

12. Add a title to the table

When creating a pdf the tables are automatically numbered and the title appears below the table. In Word and HTML, the titles appear un-numbered and above the table.

t1 <- tableby(arm ~ sex + age, data=mockstudy)
summary(t1, title='Demographics')
Demographics
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Gender 0.190
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age, yrs 0.614
   Mean (SD) 59.673 (11.365) 60.301 (11.632) 59.763 (11.499) 59.985 (11.519)
   Range 27.000 - 88.000 19.000 - 88.000 26.000 - 85.000 19.000 - 88.000

13. Modify how missing values are displayed

Depending on the report you are writing you have the following options:

## look at how many missing values there are for each variable
apply(is.na(mockstudy),2,sum)
##        case         age         arm         sex        race     fu.time     fu.stat          ps 
##           0           0           0           0           7           0           0         266 
##         hgb         bmi    alk.phos         ast mdquality.s     age.ord  age.ordnew     dtentry 
##         266          33         266         266         252           0           1           5
## Show how many subjects have each variable (non-missing)
summary(tableby(sex ~ ast + age, data=mockstudy,
                control=tableby.control(numeric.stats=c("N","median"), total=FALSE)))
Male (N=916) Female (N=583) p value
ast 0.921
   N 754 479
   median 27.000 27.000
Age, yrs 0.048
   N 916 583
   median 61.000 60.000
## Always list the number of missing values
summary(tableby(sex ~ ast + age, data=mockstudy,
                control=tableby.control(numeric.stats=c("Nmiss2","median"), total=FALSE)))
Male (N=916) Female (N=583) p value
ast 0.921
   N-Miss 162 104
   median 27.000 27.000
Age, yrs 0.048
   N-Miss 0 0
   median 61.000 60.000
## Only show the missing values if there are some (default)
summary(tableby(sex ~ ast + age, data=mockstudy, 
                control=tableby.control(numeric.stats=c("Nmiss","mean"),total=FALSE)))
Male (N=916) Female (N=583) p value
ast 0.921
   N-Miss 162 104
   mean 35.9 36
Age, yrs 0.048
   mean 60.5 59.2
## Don't show N at all
summary(tableby(sex ~ ast + age, data=mockstudy, 
                control=tableby.control(numeric.stats=c("mean"),total=FALSE)))
Male (N=916) Female (N=583) p value
ast 0.921
   mean 35.9 36
Age, yrs 0.048
   mean 60.5 59.2

One might also consider the use of includeNA() to include NAs in the counts and percents for categorical variables.

mockstudy$ps.cat <- factor(mockstudy$ps)
attr(mockstudy$ps.cat, "label") <- "ps"
summary(tableby(sex ~ includeNA(ps.cat), data = mockstudy, cat.stats = "countpct"))
Male (N=916) Female (N=583) Total (N=1499) p value
ps 0.354
   0 391 (42.7%) 244 (41.9%) 635 (42.4%)
   1 329 (35.9%) 202 (34.6%) 531 (35.4%)
   2 34 (3.7%) 33 (5.7%) 67 (4.5%)
   (Missing) 162 (17.7%) 104 (17.8%) 266 (17.7%)

14. Modify the number of digits used

Within tableby.control function there are 4 options for controlling the number of significant digits shown.

summary(tableby(arm ~ sex + age + fu.time, data=mockstudy), digits=4, digits.p=2, digits.pct=1)
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Gender 0.19
   Male 277 (64.7%) 411 (59.5%) 228 (60.0%) 916 (61.1%)
   Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%)
Age, yrs 0.61
   Mean (SD) 59.6729 (11.3645) 60.3010 (11.6323) 59.7632 (11.4993) 59.9853 (11.5188)
   Range 27.0000 - 88.0000 19.0000 - 88.0000 26.0000 - 85.0000 19.0000 - 88.0000
fu.time < 0.01
   Mean (SD) 553.5841 (419.6065) 731.2460 (487.7443) 607.2421 (435.5092) 649.0841 (462.5109)
   Range 9.0000 - 2170.0000 0.0000 - 2472.0000 17.0000 - 2118.0000 0.0000 - 2472.0000

15. Create a user-defined summary statistic

For purposes of this example, the code below creates a trimmed mean function (trims 10%) and use that to summarize the data. Note the use of the ... which tells R to pass extra arguments on - this is required for user-defined functions. In this case, na.rm=T is passed to myfunc. The weights argument is also required, even though it isn’t passed on to the internal function in this particular example.

myfunc <- function(x, weights=rep(1,length(x)), ...){
  mean(x, trim=.1, ...)
}

summary(tableby(sex ~ hgb, data=mockstudy, 
                control=tableby.control(numeric.stats=c("Nmiss","myfunc"), numeric.test="kwt",
                    stats.labels=list(Nmiss='Missing values', myfunc="Trimmed Mean, 10%"))))
Male (N=916) Female (N=583) Total (N=1499) p value
hgb < 0.001
   Missing values 162 104 266
   Trimmed Mean, 10% 12.6 11.9 12.3

16. Use case-weights for creating summary statistics

When comparing groups, they are often unbalanced when it comes to nuisances such as age and sex. The tableby function allows you to create weighted summary statistics. If this option us used then p-values are not calculated (test=FALSE).

##create fake group that is not balanced by age/sex 
set.seed(200)
mockstudy$fake_arm <- ifelse(mockstudy$age>60 & mockstudy$sex=='Female',sample(c('A','B'),replace=T, prob=c(.2,.8)),
                            sample(c('A','B'),replace=T, prob=c(.8,.4)))

mockstudy$agegp <- cut(mockstudy$age, breaks=c(18,50,60,70,90), right=FALSE)

## create weights based on agegp and sex distribution
tab1 <- with(mockstudy,table(agegp, sex))
tab2 <- with(mockstudy, table(agegp, sex, fake_arm))
tab2
## , , fake_arm = A
## 
##          sex
## agegp     Male Female
##   [18,50)   73     62
##   [50,60)  128     94
##   [60,70)  139      7
##   [70,90)  102      0
## 
## , , fake_arm = B
## 
##          sex
## agegp     Male Female
##   [18,50)   79     48
##   [50,60)  130     84
##   [60,70)  156    166
##   [70,90)  109    122
gpwts <- rep(tab1, length(unique(mockstudy$fake_arm)))/tab2
gpwts[gpwts>50] <- 30

## apply weights to subjects
index <- with(mockstudy, cbind(as.numeric(agegp), as.numeric(sex), as.numeric(as.factor(fake_arm)))) 
mockstudy$wts <- gpwts[index]

## show weights by treatment arm group
tapply(mockstudy$wts,mockstudy$fake_arm, summary)
## $A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.774   1.894   2.069   2.276   2.082  24.714 
## 
## $B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.042   1.924   1.677   1.985   2.292
orig <- tableby(fake_arm ~ age + sex + Surv(fu.time/365, fu.stat), data=mockstudy, test=FALSE)
summary(orig, title='No Case Weights used')
No Case Weights used
A (N=605) B (N=894) Total (N=1499)
Age, yrs
   Mean (SD) 57.413 (11.618) 61.726 (11.125) 59.985 (11.519)
   Range 22.000 - 85.000 19.000 - 88.000 19.000 - 88.000
Gender
   Male 442 (73.1%) 474 (53.0%) 916 (61.1%)
   Female 163 (26.9%) 420 (47.0%) 583 (38.9%)
Surv(fu.time/365, fu.stat)
   Events 554 802 1356
   Median Survival 1.504 1.493 1.496
tab1 <- tableby(fake_arm ~ age + sex + Surv(fu.time/365, fu.stat), data=mockstudy, weights=wts)
summary(tab1, title='Case Weights used')
Case Weights used
A (N=605) B (N=894) Total (N=1499)
Age, yrs
   Mean (SD) 58.009 (10.925) 60.151 (11.428) 59.126 (11.235)
   Range 22.000 - 85.000 19.000 - 88.000 19.000 - 88.000
Gender
   Male 916 (66.5%) 916 (61.1%) 1832 (63.7%)
   Female 461 (33.5%) 583 (38.9%) 1044 (36.3%)
Surv(fu.time/365, fu.stat)
   Events 1252 1348 2599
   Median Survival 1.534 1.496 1.532

17. Create your own p-value and add it to the table

When using weighted summary statistics, it is often desirable to then show a p-value from a model that corresponds to the weighted analysis. It is possible to add your own p-value and modify the column title for that new p-value. Another use for this would be to add standardized differences or confidence intervals instead of a p-value.

To add the p-value you simply need to create a data frame and use the function modpval.tableby. The first 2 columns in the dataframe are required and are the variable name and the new p-value. The third column can be used to indicate what method was used to calculate the p-value. If you specify use.pname=TRUE then the column name indicating the p-value will be also be used in the tableby summary.

mypval <- data.frame(variable=c('age','sex','Surv(fu.time/365, fu.stat)'), 
                     adj.pvalue=c(.953,.811,.01), 
                     method=c('Age/Sex adjusted model results'))
tab2 <- modpval.tableby(tab1, mypval, use.pname=TRUE)
summary(tab2, title='Case Weights used, p-values added') #, pfootnote=TRUE)
Case Weights used, p-values added
A (N=605) B (N=894) Total (N=1499) adj.pvalue
Age, yrs 0.953
   Mean (SD) 58.009 (10.925) 60.151 (11.428) 59.126 (11.235)
   Range 22.000 - 85.000 19.000 - 88.000 19.000 - 88.000
Gender 0.811
   Male 916 (66.5%) 916 (61.1%) 1832 (63.7%)
   Female 461 (33.5%) 583 (38.9%) 1044 (36.3%)
Surv(fu.time/365, fu.stat) 0.010
   Events 1252 1348 2599
   Median Survival 1.534 1.496 1.532

18. For two-level categorical variables, only display one level.

If the cat.simplify option is set to TRUE then only the second level of the group. In the example below sex has the levels and “Female” is the second level, hence only the %female is shown in the table. Similarly, “mdquality.s” was turned to a factor and “1” is the second level, hence

levels(mockstudy$sex)

[1] “Male” “Female”

table2 <- tableby(arm~sex + factor(mdquality.s), data=mockstudy, cat.simplify=TRUE)
summary(table2, labelTranslations=c(sex="Female", "factor(mdquality.s)"="MD Quality"))
A: IFL (N=428) F: FOLFOX (N=691) G: IROX (N=380) Total (N=1499) p value
Female 151 (35.3%) 280 (40.5%) 152 (40.0%) 583 (38.9%) 0.190
MD Quality 0.694
   N-Miss 55 156 41 252
   0 41 (11.0%) 52 (9.7%) 31 (9.1%) 124 (9.9%)
   1 332 (89.0%) 483 (90.3%) 308 (90.9%) 1123 (90.1%)

19. Use tableby within an Sweave document

For those users who wish to create tables within an Sweave document, the following code seems to work.

\documentclass{article}

\usepackage{longtable}
\usepackage{pdfpages}

\begin{document}

\section{Read in Data}
<<echo=TRUE>>=
require(arsenal)
require(knitr)
require(rmarkdown)
data(mockstudy)

tab1 <- tableby(arm~sex+age, data=mockstudy)
@

\section{Convert Summary.Tableby to LaTeX}
<<echo=TRUE, results='hide', message=FALSE>>=
capture.output(summary(tab1), file="Test.md")

## Convert R Markdown Table to LaTeX
render("Test.md", pdf_document(keep_tex=TRUE))
@ 

\includepdf{Test.pdf}

\end{document}

20. Export tableby object to a .CSV file

When looking at multiple variables it is sometimes useful to export the results to a csv file. The as.data.frame function creates a data frame object that can be exported or further manipulated within R.

tab1 <- tableby(arm~sex+age, data=mockstudy)
as.data.frame(tab1)
##   variable   term     label variable.type              A: IFL           F: FOLFOX
## 1      sex    sex    Gender   categorical                                        
## 2      sex   Male      Male   categorical 277.00000, 64.71963 411.00000, 59.47902
## 3      sex Female    Female   categorical 151.00000, 35.28037 280.00000, 40.52098
## 4      age    age  Age, yrs       numeric                                        
## 5      age meansd Mean (SD)       numeric  59.67290, 11.36454  60.30101, 11.63225
## 6      age  range     Range       numeric              27, 88              19, 88
##              G: IROX              Total                       test   p.value
## 1                                       Pearson's Chi-squared test 0.1904388
## 2            228, 60  916.0000, 61.1074 Pearson's Chi-squared test 0.1904388
## 3            152, 40  583.0000, 38.8926 Pearson's Chi-squared test 0.1904388
## 4                                               Linear Model ANOVA 0.6143859
## 5 59.76316, 11.49930 59.98532, 11.51877         Linear Model ANOVA 0.6143859
## 6             26, 85             19, 88         Linear Model ANOVA 0.6143859
# write.csv(tmp, '/my/path/here/mymodel.csv')

21. Write tableby object to a separate Word or HTML file

## write to an HTML document
tab1 <- tableby(arm ~ sex + age, data=mockstudy)
write2html(tab1, "~/trash.html")

## write to a Word document
write2word(tab1, "~/trash.doc", title="My table in Word")

Available Function Options

Summary statistics

The default summary statistics, by varible type, are:

Any summary statistics standardly defined in R (e.g. mean, median, sd, med, range) can be specified, however there are a number of extra functions defined specifically for the tableby function.

Testing options

The tests used to calculate p-values differ by the variable type, but can be specified explicitly in the formula statement or in the control function.

The following tests are accepted:

tableby.control settings

A quick way to see what arguments are possible to utilize in a function is to use the args() command. Settings involving the number of digits can be set in tableby.control or in summary.tableby.

args(tableby.control)
## function (test = TRUE, total = TRUE, test.pname = NULL, cat.simplify = FALSE, 
##     numeric.test = "anova", cat.test = "chisq", ordered.test = "trend", 
##     surv.test = "logrank", date.test = "kwt", numeric.stats = c("Nmiss", 
##         "meansd", "range"), cat.stats = c("Nmiss", "countpct"), 
##     ordered.stats = c("Nmiss", "countpct"), surv.stats = c("Nevents", 
##         "medSurv"), date.stats = c("Nmiss", "median", "range"), 
##     stats.labels = list(Nmiss = "N-Miss", Nmiss2 = "N-Miss", 
##         meansd = "Mean (SD)", medianq1q3 = "Median (Q1, Q3)", 
##         q1q3 = "Q1, Q3", range = "Range", countpct = "Count (Pct)", 
##         Nevents = "Events", medSurv = "Median Survival", medTime = "Median Follow-Up", 
##         rangeTime = "Range of Follow-Up"), digits = 3L, digits.count = 0L, 
##     digits.pct = 1L, digits.p = 3L, format.p = TRUE, chisq.correct = TRUE, 
##     simulate.p.value = FALSE, B = 2000, ...) 
## NULL

summary.tableby settings

The summary.tableby function has options that modify how the table appears (such as adding a title or modifying labels).

args(arsenal:::summary.tableby)
## function (object, ..., labelTranslations = NULL, text = FALSE, 
##     title = NULL, pfootnote = FALSE) 
## NULL