NSE subsetting

A cornerstone feature of prt is the ability to load a (small) subset of rows (or columns) from a much larger tabular dataset. To specify such a subset, prt provides a method for the base R S3 generic function subset(), which applies non-standard evaluation (NSE) to an expression in the context of the data, with semantics similar to those of the base R method for data.frames.

# S3 method for prt
subset(x, subset, select, part_safe = FALSE, drop = FALSE, ...)

subset_quo(
  x,
  subset = NULL,
  select = NULL,
  part_safe = FALSE,
  env = parent.frame()
)

Arguments

x: object to be subsetted.
subset: logical expression indicating elements or rows to keep: missing values are taken as false.
select: expression, indicating columns to select from a data frame.
part_safe: Logical flag indicating whether the subset expression can safely be applied to individual partitions.
drop: passed on to [ indexing operator.
...: further arguments to be passed to or from other methods.
env: The environment in which subset and select are evaluated. It is ignored for quosures, which carry their own environments.
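
As a minimal sketch of the env argument, assuming that a plain quoted expression (as opposed to a quosure) has its free variables looked up in env: the names threshold_env and thresh are hypothetical and only serve the illustration.

```r
library(prt)

dat <- as_prt(mtcars, n_chunks = 2L)

# Hypothetical environment holding the threshold; `thresh` is defined only here
threshold_env <- new.env()
threshold_env$thresh <- 6

# quote() returns a bare expression, not a quosure, so `env` decides where
# `thresh` is resolved, while `cyl` is taken from the data
res <- subset_quo(dat, quote(cyl == thresh), env = threshold_env)

# Compare against the equivalent NSE call
identical(res, subset(dat, cyl == 6))
```

Had the expression been a quosure, its own environment would be used and env would be ignored.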

Details

NSE is powered by rlang::enquo(), which quotes the subset and select arguments, and rlang::eval_tidy(), which evaluates the resulting expressions. This allows some rlang-specific features to be used, such as the .data/.env pronouns or the double-curly brace forwarding operator. For some example code, please refer to vignette("prt", package = "prt").
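
The forwarding operator mentioned above is mainly useful when subset() is called from inside another function. The following is a rough sketch only: keep_equal() is a hypothetical helper, not part of prt, and it assumes that the column name is passed unquoted.

```r
library(prt)

dat <- as_prt(mtcars, n_chunks = 2L)

# Hypothetical helper (not part of prt): `col` is forwarded unquoted to
# subset() with {{ }}, and `.env$value` makes explicit that `value` is a
# function argument rather than a column of the data
keep_equal <- function(x, col, value) {
  subset(x, {{ col }} == .env$value)
}

res <- keep_equal(dat, cyl, 6)

# Should agree with the direct call
identical(res, subset(dat, cyl == 6))
```

Without the double-curly braces, the expression would refer to a column literally named col instead of the column supplied by the caller.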

While the function subset() quotes the arguments passed as subset and select, the function subset_quo() can be used to operate on already quoted expressions. A further noteworthy departure from the base R interface is the part_safe argument: this logical flag indicates whether it is safe to evaluate the expression on partitions individually or whether dependencies between partitions prevent this from yielding correct results. As it is not straightforward to determine from the expression alone whether such dependencies exist, the default is FALSE. In many cases this results in a less efficient resolution of the row selection, and it is up to the user to enable the optimization.
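
To illustrate what a dependency between partitions means, the following base R sketch mimics partition-wise evaluation on plain data.frames. The split into two blocks of 16 rows is an assumption made for illustration and does not claim to reproduce how prt partitions or evaluates data internally.

```r
# Two row chunks standing in for partitions (an assumed split)
chunks <- list(mtcars[1:16, ], mtcars[17:32, ])

# Helper that applies a row filter to each chunk and stacks the results
filter_chunks <- function(chunks, keep) {
  do.call(rbind, lapply(chunks, function(chunk) chunk[keep(chunk), ]))
}

# An aggregate such as mean() depends on all rows: evaluated per chunk,
# every chunk is compared against its own mean rather than the global one
above_mean <- function(d) d$qsec > mean(d$qsec)
global    <- mtcars[above_mean(mtcars), ]
per_chunk <- filter_chunks(chunks, above_mean)
nrow(global)
nrow(per_chunk)

# A purely row-wise condition has no such dependency, so both routes
# select the same rows
six_cyl      <- function(d) d$cyl == 6
global_rw    <- mtcars[six_cyl(mtcars), ]
per_chunk_rw <- filter_chunks(chunks, six_cyl)
identical(rownames(global_rw), rownames(per_chunk_rw))
```

Expressions of the second kind are the ones for which part_safe = TRUE is appropriate. Expressions of the first kind should keep the default.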

Examples

dat <- as_prt(mtcars, n_chunks = 2L)

subset(dat, cyl == 6)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 4: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 5: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 6: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 7: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
subset(dat, cyl == 6 & hp > 110)
#>     mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> 1: 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
#> 2: 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
#> 3: 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6

colnames(subset(dat, select = mpg:hp))
#> [1] "mpg"  "cyl"  "disp" "hp"  
colnames(subset(dat, select = -c(vs, am)))
#> [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "gear" "carb"

sub_6 <- subset(dat, cyl == 6)

thresh <- 6
identical(subset(dat, cyl == thresh), sub_6)
#> [1] TRUE
identical(subset(dat, cyl == .env$thresh), sub_6)
#> [1] TRUE

cyl <- 6
identical(subset(dat, cyl == cyl), data.table::as.data.table(dat))
#> [1] TRUE
identical(subset(dat, cyl == !!cyl), sub_6)
#> [1] TRUE
identical(subset(dat, .data$cyl == .env$cyl), sub_6)
#> [1] TRUE

expr <- quote(cyl == 6)
# passing a quoted expression to subset() will yield an error
if (FALSE) {
  subset(dat, expr)
}
identical(subset_quo(dat, expr), sub_6)
#> Evaluating row subsetting over the entire `prt` at once. If applicable consider the `part_safe` argument.
#> This message is displayed once every 8 hours.
#> [1] TRUE

identical(
  subset(dat, qsec > mean(qsec), part_safe = TRUE),
  subset(dat, qsec > mean(qsec), part_safe = FALSE)
)
#> [1] FALSE