A cornerstone feature of prt is the ability to load a (small) subset of
rows (or columns) from a much larger tabular dataset. To specify such a
subset, an implementation of the base R S3 generic function subset() is
provided, driving the non-standard evaluation (NSE) of an expression within
the context of the data (with semantics similar to those of the base R
implementation for data.frames).
# S3 method for prt
subset(x, subset, select, part_safe = FALSE, drop = FALSE, ...)

subset_quo(
  x,
  subset = NULL,
  select = NULL,
  part_safe = FALSE,
  env = parent.frame()
)
x: Object to be subsetted.
subset: Logical expression indicating elements or rows to keep; missing values are taken as false.
select: Expression indicating columns to select from a data frame.
part_safe: Logical flag indicating whether the subset expression can safely be applied to individual partitions.
drop: Passed on to the [ indexing operator.
...: Further arguments to be passed to or from other methods.
env: The environment in which subset and select are evaluated. This environment is not applicable for quosures because they have their own environments.
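As a point of reference for the base R semantics that the prt method mirrors, the following sketch uses only base R subset() on a plain data.frame, not on a prt object. The data frame df and its columns are made up for illustration. It shows that rows for which the condition evaluates to NA are dropped and that select picks columns by name.

```r
# Hypothetical toy data; not part of the prt package or its examples
df <- data.frame(id = 1:4, score = c(10, NA, 30, 5))

# score > 8 evaluates to TRUE, NA, TRUE, FALSE: the NA row is treated as FALSE
kept <- subset(df, score > 8, select = id)
kept_ids <- kept$id
```

Only the rows with id 1 and 3 remain, and the result contains the id column alone.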
The functions powering NSE are rlang::enquo(), which quotes the subset and
select arguments, and rlang::eval_tidy(), which evaluates the expressions.
This allows some rlang-specific features to be used, such as the .data/.env
pronouns or the double-curly brace forwarding operator. For example code,
please refer to vignette("prt", package = "prt").
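The quote-then-evaluate pattern described above can be sketched on a plain data.frame. The helpers filter_rows() and above() are hypothetical names introduced for illustration and are not the prt implementation; only rlang::enquo() and rlang::eval_tidy() are taken from the text.

```r
library(rlang)

# Hypothetical helper mimicking the quote-then-evaluate pattern on a plain
# data.frame; this is NOT how prt is implemented internally
filter_rows <- function(df, cond) {
  cond <- enquo(cond)                 # capture the expression and its environment
  keep <- eval_tidy(cond, data = df)  # columns of df mask variables of the same name
  df[!is.na(keep) & keep, , drop = FALSE]
}

cyl <- 6
# .data$cyl refers to the column, .env$cyl to the variable defined above
six <- filter_rows(mtcars, .data$cyl == .env$cyl)

# Forwarding a column argument through a wrapper with the {{ }} operator
above <- function(df, col, lower) {
  filter_rows(df, {{ col }} > lower)
}
strong <- above(mtcars, hp, 200)
```

Because the expression is captured as a quosure, it carries its own environment, which is what lets the pronouns distinguish a column from a variable of the same name.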
While the function subset() quotes the arguments passed as subset and
select, the function subset_quo() can be used to operate on already quoted
expressions. A final noteworthy departure from the base R interface is the
part_safe argument: this logical flag indicates whether it is safe to
evaluate the expression on partitions individually or whether dependencies
between partitions prevent this from yielding correct results. As it is not
straightforward to determine from the expression alone whether such
dependencies exist, the default is FALSE, which in many cases results in a
less efficient resolution of the row selection; it is up to the user to
enable this optimization.
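Why some expressions are partition-safe and others are not can be illustrated with base R alone. In this sketch, mtcars is split into two halves as a stand-in for a two-partition prt; the split and the helper per_part() are assumptions made for illustration and do not reflect how prt stores or evaluates partitions.

```r
# Hypothetical stand-in for a two-partition prt: mtcars split into two halves
parts <- split(mtcars, rep(1:2, each = 16))

# Evaluate a row selection on each partition and concatenate the results
per_part <- function(f) unlist(lapply(parts, f), use.names = FALSE)

# Row-wise condition: each row is judged on its own, so partitions are independent
rowwise_parts <- per_part(function(p) p$cyl == 6)
rowwise_full  <- mtcars$cyl == 6

# Aggregate condition: mean(qsec) of a partition differs from the overall mean
agg_parts <- per_part(function(p) p$qsec > mean(p$qsec))
agg_full  <- mtcars$qsec > mean(mtcars$qsec)

same_rowwise <- identical(rowwise_parts, rowwise_full)
same_agg     <- identical(agg_parts, agg_full)
```

Here same_rowwise is TRUE while same_agg is FALSE: a purely row-wise comparison is a candidate for part_safe = TRUE, whereas an expression involving an aggregate over all rows is not.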
dat <- as_prt(mtcars, n_chunks = 2L)
subset(dat, cyl == 6)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 4: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 5: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 6: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 7: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
subset(dat, cyl == 6 & hp > 110)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
#> 2: 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
#> 3: 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
colnames(subset(dat, select = mpg:hp))
#> [1] "mpg" "cyl" "disp" "hp"
colnames(subset(dat, select = -c(vs, am)))
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "gear" "carb"
sub_6 <- subset(dat, cyl == 6)
thresh <- 6
identical(subset(dat, cyl == thresh), sub_6)
#> [1] TRUE
identical(subset(dat, cyl == .env$thresh), sub_6)
#> [1] TRUE
cyl <- 6
# both sides of cyl == cyl resolve to the column, so every row is kept
identical(subset(dat, cyl == cyl), data.table::as.data.table(dat))
#> [1] TRUE
identical(subset(dat, cyl == !!cyl), sub_6)
#> [1] TRUE
identical(subset(dat, .data$cyl == .env$cyl), sub_6)
#> [1] TRUE
expr <- quote(cyl == 6)
# passing a quoted expression to subset() will yield an error
if (FALSE) {
  subset(dat, expr)
}
identical(subset_quo(dat, expr), sub_6)
#> Evaluating row subsetting over the entire `prt` at once. If applicable consider the `part_safe` argument.
#> This message is displayed once every 8 hours.
#> [1] TRUE
# mean(qsec) depends on all rows, so per-partition evaluation gives a different result
identical(
  subset(dat, qsec > mean(qsec), part_safe = TRUE),
  subset(dat, qsec > mean(qsec), part_safe = FALSE)
)
#> [1] FALSE