An R
dplyr
user must, to some extent, be able to know if a dplyr
expression is well-formed. The dplyr
implementation makes such decisions, so to program dplyr
one has to be able to anticipate a least a reasonable subset of these determinations.
Take the following dplyr
statement as an example:
dplyr::mutate(x = 1,
v1 = x,
x = x + 1,
v2 = x,
x = x + 1,
v3 = x)
Would the above be, in the context of a larger dplyr
pipeline, be a valid statement?
My personal opinion is: it is not.
The issues include: the statement is (unsafely) using values in the same block they are created, and also re-using variable names across terms. I don’t recall these issues being discussed in dplyr
mutate()
documentation, and indeed the statement appears to work when used with the in-memory (or on data.frame
) version of dplyr
.
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.7.4'
d <- data.frame(z = 1)
d %>%
mutate(x = 1,
v1 = x,
x = x + 1,
v2 = x,
x = x + 1,
v3 = x) %>%
knitr::kable()
z | x | v1 | v2 | v3 |
---|---|---|---|---|
1 | 3 | 1 | 2 | 3 |
However “in memory” is not the only place dplyr
is promoted and used. dplyr
is also often used to talk to SQL
back ends (such as Spark
and databases). When we try the same mutate()
in a database context we, currently, get a different (and wrong) result. This establishes the example mutate()
statement is at best unsafe.
packageVersion("dbplyr")
#> [1] '1.2.1'
packageVersion("DBI")
#> [1] '0.7'
packageVersion("RSQLite")
#> [1] '2.0'
db <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
d2 <- dplyr::copy_to(db, d, 'd2')
d2 %>%
mutate(x = 1,
v1 = x,
x = x + 1,
v2 = x,
x = x + 1,
v3 = x) %>%
collect() %>%
knitr::kable()
z | x | v1 | v2 | v3 |
---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
The above is an incorrect result, with no warning and no error signaled.
Essentially this is a silent result corruption.
Looking further we can see the issue is likely in the SQL
generation path:
d2 %>%
mutate(x = 1,
v1 = x,
x = x + 1,
v2 = x,
x = x + 1,
v3 = x) %>%
show_query()
<SQL>
SELECT `z`, `x`, `v1`, `v2`, `x` AS `v3`
FROM (SELECT `z`, 1.0 AS `x`, `v1`, `x` AS `v2`
FROM (SELECT `z`, 1.0 AS `x`, `x` AS `v1`
FROM (SELECT `z`, 1.0 AS `x`
FROM `d2`)))
We have a function in development, seplyr::factor_mutate()
which inspects the terms one is considering placing in a dplyr::mutate()
(or even a dplyr::sumarize()
or dplyr::transmute()
) and:
# # seplyr::factor_mutate() currently
# # requires the dev-version of seplyr
# # version 0.5.2 or better.
# devtools::install_github("WinVector/seplyr")
stmts <- seplyr::factor_mutate(
x = 1,
v1 = x,
x = x + 1,
v2 = x,
x = x + 1,
v3 = x
)
#> Warning in seplyr::factor_mutate(x = 1, v1 = x, x = x + 1, v2 = x, x = x
#> + : Mutate should be split into more than one stage.
cat(stmts)
mutate(x = 1) %>%
mutate(v1 = x) %>%
mutate(x = x + 1) %>%
mutate(v2 = x) %>%
mutate(x = x + 1) %>%
mutate(v3 = x)
seplyr::factor_mutate()
issues a warning. It also builds statement source code that is safe in the sense that:
mutate()
stages.mutate()
stage can be evaluated in any order (without changing results).The statement is partitioned into a minimal (subject to some order constraints) number of new statements (though it is not obvious from this example). In particular statements that can be left alone are left alone (and no warning is issued).
And in fact this code returns the same result on the database as we did on the in-memory example.
eval(parse(text = paste("d2 %>%", stmts))) %>%
collect() %>%
knitr::kable()
z | x | v1 | v2 | v3 |
---|---|---|---|---|
1 | 3 | 1 | 2 | 3 |
We don’t suggest calling eval()
in production code. Instead we suggest using seplyr::factor_mutate()
to inspect code as you are writing, and to incorporate its advice in your best practices.
The important distinction is: the in-memory code seems to have (possibly undocumented) strict sequential semantics (terms are guaranteed to be evaluated in order in a single mutate()
), and database implementations do not have such a property. Thus we need to organize code to be completely unambiguous even with terms re-ordered with a single mutate()
stage. This is a service that the seplyr::factor_mutate()
method attempts to heuristically supply, and something central to the design of rquery
(a SQL
-only data processing system).
DBI::dbDisconnect(db)