Building on data.frame serialization provided by fst, prt offers an interface for working with partitioned data.frames, saved as individual fst files.
You can install the development version of prt from GitHub by running
source("https://install-github.me/nbenn/prt")
Alternatively, if you have the remotes package available, the latest release can be installed by calling install_github() as
# install.packages("remotes")
remotes::install_github("nbenn/prt@*release")
Creating a prt object can be done either by calling new_prt() on a list of previously created fst files or by coercing a data.frame object to prt using as_prt().
tmp <- tempfile()
dir.create(tmp)

flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)

print(flights)
#> # A prt: 336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1        2013     1     1      517            515         2      830
#> 2        2013     1     1      533            529         4      850
#> 3        2013     1     1      542            540         2      923
#> 4        2013     1     1      544            545        -1     1004
#> 5        2013     1     1      554            600        -6      812
#> …
#> 336,772  2013     9    30       NA           1455        NA       NA
#> 336,773  2013     9    30       NA           2200        NA       NA
#> 336,774  2013     9    30       NA           1210        NA       NA
#> 336,775  2013     9    30       NA           1159        NA       NA
#> 336,776  2013     9    30       NA            840        NA       NA
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
When a prt object is created from a data.frame, the specified number of files is written to the directory of choice (a newly created directory within tempdir() by default).
list.files(tmp)
#> [1] "1.fst" "2.fst"
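The other route mentioned above, calling new_prt() on fst files that already exist, might look roughly like the following. This is a minimal sketch and not taken from the package documentation: it assumes that new_prt() accepts a character vector of file paths and that nrow() works on the result. The directory name dir_parts and the use of mtcars are purely illustrative.

```r
library(prt)

# Hypothetical scratch directory for this example
dir_parts <- tempfile()
dir.create(dir_parts)

# Split a small data.frame into two halves and write each as an fst file
halves <- split(mtcars, rep(1:2, each = nrow(mtcars) / 2))
files <- file.path(dir_parts, paste0(seq_along(halves), ".fst"))

for (i in seq_along(halves)) {
  fst::write_fst(halves[[i]], files[i])
}

# Assumption: new_prt() takes the paths of the partition files
cars <- new_prt(files)
n_rows <- nrow(cars)

unlink(dir_parts, recursive = TRUE)
```

Each file becomes one partition, so the order of files determines the row order of the resulting object.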
Subsetting and printing are closely modeled after tibble, and behavior that deviates from that of tibble will most likely be considered a bug (please report). Some design choices that do set a prt object apart from a tibble include the use of data.tables for any result of a subsetting operation and the complete disregard for row.names.
In addition to standard subsetting operations involving the functions `[`(), `[[`() and `$`(), the base generic function subset() is implemented for the prt class, enabling subsetting operations using non-standard evaluation. Combined with random access to tables stored as fst files, this can make data access more efficient in cases where only a subset of the data is of interest.
jan <- flights[flights$month == 1, ]
identical(jan, subset(flights, month == 1))
#> [1] TRUE

print(jan)
#>        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>     1: 2013     1   1      517            515         2      830            819
#>     2: 2013     1   1      533            529         4      850            830
#>     3: 2013     1   1      542            540         2      923            850
#>     4: 2013     1   1      544            545        -1     1004           1022
#>     5: 2013     1   1      554            600        -6      812            837
#>    ---
#> 27000: 2013     1  31       NA           1325        NA       NA           1505
#> 27001: 2013     1  31       NA           1200        NA       NA           1430
#> 27002: 2013     1  31       NA           1410        NA       NA           1555
#> 27003: 2013     1  31       NA           1446        NA       NA           1757
#> 27004: 2013     1  31       NA            625        NA       NA            934
#>        arr_delay carrier flight tailnum origin dest air_time distance hour
#>     1:        11      UA   1545  N14228    EWR  IAH      227     1400    5
#>     2:        20      UA   1714  N24211    LGA  IAH      227     1416    5
#>     3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5
#>     4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5
#>     5:       -25      DL    461  N668DN    LGA  ATL      116      762    6
#>    ---
#> 27000:        NA      MQ   4475  N730MQ    LGA  RDU       NA      431   13
#> 27001:        NA      MQ   4658  N505MQ    LGA  ATL       NA      762   12
#> 27002:        NA      MQ   4491  N734MQ    LGA  CLE       NA      419   14
#> 27003:        NA      UA    337    <NA>    LGA  IAH       NA     1416   14
#> 27004:        NA      UA   1497    <NA>    LGA  IAH       NA     1416    6
#>        minute           time_hour
#>     1:     15 2013-01-01 05:00:00
#>     2:     29 2013-01-01 05:00:00
#>     3:     40 2013-01-01 05:00:00
#>     4:     45 2013-01-01 05:00:00
#>     5:      0 2013-01-01 06:00:00
#>    ---
#> 27000:     25 2013-01-31 13:00:00
#> 27001:      0 2013-01-31 12:00:00
#> 27002:     10 2013-01-31 14:00:00
#> 27003:     46 2013-01-31 14:00:00
#> 27004:     25 2013-01-31 06:00:00
A subsetting operation on a prt object yields a data.table. If the full table is of interest, a prt-specific implementation of the as.data.table() generic is available.
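Reading all partitions into a single in-memory table could then be sketched as below. The example is self-contained and uses mtcars so that it runs without nycflights13; the names tmp_full and cars_prt are illustrative only, and the sketch assumes the as.data.table() method is dispatched through the data.table generic.

```r
library(prt)
library(data.table)

# Hypothetical scratch directory for this example
tmp_full <- tempfile()
dir.create(tmp_full)

# Write mtcars as two fst partitions
cars_prt <- as_prt(mtcars, n_chunks = 2L, dir = tmp_full)

# Read every partition and combine into one data.table
full <- as.data.table(cars_prt)
is.data.table(full)
dim(full)

unlink(tmp_full, recursive = TRUE)
```

Since this loads the complete data set into memory, it is mainly useful when the table is known to fit; otherwise subsetting first keeps the memory footprint smaller.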
unlink(tmp, recursive = TRUE)