Consider the pigs
dataset provided with the package (help("pigs")
provides details). These data come from an unbalanced experiment where pigs are given different percentages of protein (percent
) from different sources (source
) in their diet, and later we measure the concentration (conc
) of leucine. Here’s an interaction plot showing the mean conc
at each combination of the other factors.
par(mar = .1 + c(4, 4, 1, 1)) # reduce head space
with(pigs, interaction.plot(percent, source, conc))
This plot suggests that with each source
, conc
tends to go up with percent
, but that the mean differs with each source
.
Now, suppose that we want to assess, numerically, the marginal results for percent
. The natural thing to do is to obtain the marginal means:
with(pigs, tapply(conc, percent, mean))
## 9 12 15 18
## 32.70000 38.01111 40.12857 39.94000
Looking at the plot, it seems a bit surprising that the last three means are all about the same, with the one for 15 percent being the largest.
Hmmmm, so let’s try another approach – actually averaging together the values we see in the plot. First, we need the means that are shown there:
cell.means <- matrix(with(pigs,
tapply(conc, interaction(source, percent), mean)),
nrow = 3)
cell.means
## [,1] [,2] [,3] [,4]
## [1,] 25.75000 30.93333 31.15000 32.33333
## [2,] 34.63333 39.63333 39.23333 42.90000
## [3,] 35.40000 43.46667 50.45000 59.80000
Confirm that the rows of this matrix match the plotted values for fish, soy, and skim, respectively. Now, average each column:
apply(cell.means, 2, mean)
## [1] 31.92778 38.01111 40.27778 45.01111
These results are decidedly different from the ordinary marginal means we obtained earlier. What’s going on? The answer is that some observations were lost, making the data unbalanced:
with(pigs, table(source, percent))
## percent
## source 9 12 15 18
## fish 2 3 2 3
## soy 3 3 3 1
## skim 3 3 2 1
We can reproduce the marginal means by weighting the cell means with these frequencies. For example, in the last column:
sum(c(3, 1, 1) * cell.means[, 4]) / 5
## [1] 39.94
The big discrepancy between the ordinary mean for percent = 18
and the marginal mean from cell.means
is due to the fact that the lowest value receives 3 times the weight as the other two values.
The point is that the marginal means of cell.means
give equal weight to each cell. In many situations (especially with experimental data), that is a much fairer way to compute marginal means, in that they are not biased by imbalances in the data. we are, in a sense, estimating what the marginal means would be, had the experiment been balanced. Estimated marginal means (EMMs) serve that need.
All this said, there are certainly situations where equal weighting is not appropriate. Suppose, for example, we have data on sales of a product given different packaging and features. The data could be unbalanced because customers are more attracted to some combinations than others. If our goal is to understand scientifically what packaging and features are inherently more profitable, then equally weighted EMMs may be appropriate; but if our goal is to predict or maximize profit, the ordinary marginal means provide better estimates of what we can expect in the marketplace.
Estimated marginal means are based on a model – not directly on data. The basis for them is what we call the reference grid for a given model. To obtain the reference grid, consider all the predictors in the model. Here are the default rules for constructing the reference grid
The reference grid is then a regular grid of all combinations of these reference levels.
As a simple example, consider again the pigs
dataset (see help("fiber")
for details). Examination of residual plots from preliminary models suggests that it is a good idea to work in terms of log concentration.
If we treat the predictor percent
as a factor, we might fit the following model:
pigs.lm1 <- lm(log(conc) ~ source + factor(percent), data = pigs)
The reference grid for this model can be found via the ref_grid
function:
ref_grid(pigs.lm1)
## 'emmGrid' object with variables:
## source = fish, soy, skim
## percent = 9, 12, 15, 18
## Transformation: "log"
Both predictors are factors, and the reference grid consists of the \(3\times4 = 12\) combinations of these factor levels. It can be seen explicitly by looking at the grid
slot of this object:
ref_grid(pigs.lm1) @ grid
## source percent .wgt.
## 1 fish 9 2
## 2 soy 9 3
## 3 skim 9 3
## 4 fish 12 3
## 5 soy 12 3
## 6 skim 12 3
## 7 fish 15 2
## 8 soy 15 3
## 9 skim 15 2
## 10 fish 18 3
## 11 soy 18 1
## 12 skim 18 1
Note that other information is retained in the reference grid, e.g., the transformation used on the response, and the cell counts as the .wgt.
column.
Now, suppose instead that we treat percent
as a numeric predictor. This leads to a different model – and a different reference grid.
pigs.lm2 <- lm(log(conc) ~ source + percent, data = pigs)
ref_grid(pigs.lm2)
## 'emmGrid' object with variables:
## source = fish, soy, skim
## percent = 12.931
## Transformation: "log"
This reference grid has the levels of source
, but only one percent
value, its average. Thus, the grid has only three elements:
ref_grid(pigs.lm2) @ grid
## source percent .wgt.
## 1 fish 12.93103 10
## 2 soy 12.93103 10
## 3 skim 12.93103 9
Once the reference grid is established, we can consider using the model to estimate the mean at each point in the reference grid. (Curiously, the convention is to call this “prediction” rather than “estimation”). For pigs.lm1
, we have
pigs.pred1 <- matrix(predict(ref_grid(pigs.lm1)), nrow = 3)
pigs.pred1
## [,1] [,2] [,3] [,4]
## [1,] 3.220292 3.399846 3.437691 3.520141
## [2,] 3.493060 3.672614 3.710459 3.792909
## [3,] 3.622569 3.802124 3.839968 3.922419
Estimated marginal means (EMMs) are defined as equally weighted means of these predictions at specified margins:
apply(pigs.pred1, 1, mean) ### EMMs for source
## [1] 3.394492 3.667260 3.796770
apply(pigs.pred1, 2, mean) ### EMMs for percent
## [1] 3.445307 3.624861 3.662706 3.745156
For the other model, pigs.lm2
, we have only one point in the reference grid for each source
level; so the EMMs for source
are just the predictions themselves:
predict(ref_grid(pigs.lm2))
## [1] 3.379865 3.652693 3.783120
These are slightly different from the previous EMMs for source
, emphasizing the fact that EMMs are model-dependent. In models with covariates, EMMs are often called adjusted means.
The emmeans
function computes EMMs, accompanied by standard errors and confidence intervals. For example,
emmeans(pigs.lm1, "percent")
## percent emmean SE df lower.CL upper.CL
## 9 3.445307 0.04088810 23 3.360723 3.529890
## 12 3.624861 0.03837600 23 3.545475 3.704248
## 15 3.662706 0.04372996 23 3.572244 3.753168
## 18 3.745156 0.05296030 23 3.635599 3.854713
##
## Results are averaged over the levels of: source
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
In these examples, all the results are presented on the log(conc)
scale (and the annotations in the output warn of this). It is possible to convert them back to the conc
scale by back-transforming. This topic is discussed in the vignette on transformations.
An additional note: There is an exception to the definition of EMMs given here. If the model has a nested structure in the fixed effects, then averaging is performed separately in each nesting group. See the section on nesting in the “messy-data” vignette for an example.
It is possible to alter the reference grid. We might, for example, want to define a reference grid for pigs.lm2
that is comparable to the one for pigs.lm1
.
ref_grid(pigs.lm2, cov.reduce = FALSE)
## 'emmGrid' object with variables:
## source = fish, soy, skim
## percent = 9, 12, 15, 18
## Transformation: "log"
Using cov.reduce = FALSE
specifies that, instead of using the mean, the reference grid should use all the unique values of each covariate. Be careful with this: it can create quite a mess if there is a covariate that was measured rather than experimentally varied. Another option is to give a function; e.g.,
ref_grid(pigs.lm2, cov.reduce = range)
## 'emmGrid' object with variables:
## source = fish, soy, skim
## percent = 9, 18
## Transformation: "log"
Perhaps more common is to use the at
argument. Consider this model for the built-in cars
dataset:
mtcars.lm <- lm(mpg ~ disp * cyl, data = mtcars)
ref_grid(mtcars.lm)
## 'emmGrid' object with variables:
## disp = 230.72
## cyl = 6.1875
Since both predictors are numeric, the default reference grid has only one point. For purposes of describing the fitted model, you might want to obtain predictions at a grid of points, like this:
mtcars.rg <- ref_grid(mtcars.lm,
at = list(disp = c(100, 200, 300), cyl = c(4, 6, 8)))
… which will create a 3 x 3 grid. (We’ll look at this model again shortly.) Another use of at
is to focus on only some of the levels of a factor. Note that at
does not need to specify every predictor; those not mentioned in at
are handled by cov.reduce
or the default methods.
The results of ref_grid()
or emmeans()
(these are objects of class emmGrid
) may be plotted in two different ways. One is an interaction-style plot, using emmip()
. In the following, let’s use it to compare the predictions from pigs.lm1
and pigs.lm2
:
emmip(pigs.lm1, source ~ percent)
emmip(ref_grid(pigs.lm2, cov.reduce = FALSE), source ~ percent)
Notice that emmip()
may also be used on a fitted model. The formula specification needs the x variable on the right-hand side and the “trace” factor (what is used to define the different curves) on the left. This is a good time to yet again emphasize that EMMs are based on a model. Neither of these plots is an interaction plot of the data; they are interaction plots of model predictions; and since both models do not include an interaction, no interaction at all is evident in the plots.
The other graphics option offered is the plot()
method for emmGrid
objects. In the following, we display the estimates and 95% confidence intervals for mtcars.rg
in separate panels for each disp
.
plot(mtcars.rg, by = "disp")
This plot illustrates, as much as anything else, how silly it is to try to predict mileage for a 4-cylinder car having high displacement, or an 8-cylinder car having low displacement. The widths of the intervals give us a clue that we are extrapolating. A better idea is to acknowledge that displacement largely depends on the number of cylinders. So here is yet another way to use cov.reduce
to modify the reference grid:
mtcars.rg_d.c <- ref_grid(mtcars.lm, at = list(cyl = c(4,6,8)),
cov.reduce = disp ~ cyl)
mtcars.rg_d.c @ grid
## disp cyl .wgt.
## 1 93.78673 4 1
## 2 218.98458 6 1
## 3 344.18243 8 1
The ref_grid
call specifies that disp
depends on cyl
; so a linear model is fitted with the given formula and its fitted values are used as the disp
values – only one for each cyl
. If we plot this grid, the results are sensible, reflecting what the model predicts for typical cars with each number of cylinders:
plot(mtcars.rg_d.c)
Wizards with the ggplot2 package can further enhance these plots if they like. For example, we can add the data to an interaction plot – this time we opt to include confidence intervals and put the three sources in separate panels:
require("ggplot2")
## Loading required package: ggplot2
emmip(pigs.lm1, ~ percent | source, CIs = TRUE) +
geom_point(aes(x = factor(percent), y = log(conc)), data = pigs, pch = 2, color = "blue")
It is possible to override the equal-weighting method for computing EMMs. Using weights = "cells"
in the call will weight the predictions according to their cell frequencies (recall this information is retained in the reference grid). This produces results comparable to ordinary marginal means:
emmeans(pigs.lm1, "percent", weights = "cells")
## percent emmean SE df lower.CL upper.CL
## 9 3.473434 0.04070389 23 3.389231 3.557636
## 12 3.624861 0.03837600 23 3.545475 3.704248
## 15 3.669528 0.04351429 23 3.579512 3.759544
## 18 3.655150 0.05148680 23 3.548642 3.761659
##
## Results are averaged over the levels of: source
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
Note that, as in the ordinary means in the motivating example, the highest estimate is for percent = 15
rather than percent = 18
. It is interesting to compare this with the results for a model that includes only percent
as a predictor.
pigs.lm3 <- lm(log(conc) ~ factor(percent), data = pigs)
emmeans(pigs.lm3, "percent")
## percent emmean SE df lower.CL upper.CL
## 9 3.473434 0.07313042 25 3.322819 3.624049
## 12 3.624861 0.06894802 25 3.482860 3.766862
## 15 3.669528 0.07817971 25 3.508514 3.830542
## 18 3.655150 0.09250348 25 3.464636 3.845665
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
The EMMs in these two tables are identical, but their standard errors are considerably different. That is because the model pigs.lm1
accounts for variations due to source
. The lesson here is that it is possible to obtain statistics comparable to ordinary marginal means, while still accounting for variations due to the factors that are being averaged over.
The emmeans package supports various multivariate models. When there is a multivariate response, the dimensions of that response are treated as if they were levels of a factor. For example, the MOats
dataset provided in the package has predictors Block
and Variety
, and a four-dimensional response yield
giving yields observed with varying amounts of nitrogen added to the soil. Here is a model and reference grid:
MOats.lm <- lm (yield ~ Block + Variety, data = MOats)
ref_grid (MOats.lm, mult.name = "nitro")
## 'emmGrid' object with variables:
## Block = VI, V, III, IV, II, I
## Variety = Golden Rain, Marvellous, Victory
## nitro = multivariate response levels: 0, 0.2, 0.4, 0.6
So, nitro
is in essence a factor having 4 levels corresponding to the 4 dimensions of yield
. We can subsequently obtain EMMs for any of the factors Block
, Variety
, nitro
, or combinations thereof. The argument mult.name = "nitro"
is optional; if it had been excluded, the multivariate levels would have been named rep.meas
.
at
or cov.reduce
; the latter may be logical, a function, or a formula.emmeans()
results may be plotted via plot()
(for parallel confidence intervals) or emmip()
(for an interaction-style plot).The reader is referred to other vignettes for more details and advanced use. The strings linked below are the names of the vignettes; i.e., they can also be accessed via vignette("
name", "emmeans")
emmGrid
objects: “utilities”