Take each requested (group and) version in an archive, run a computation (e.g., forecast)
Source: R/methods-epi_archive.R (epix_slide.Rd)
... and collect the results. This is useful for more accurately simulating
how a forecaster, nowcaster, or other algorithm would have behaved in real
time, factoring in reporting latency and data revisions; see
vignette("backtesting", package = "epipredict") for a walkthrough.
Usage
epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)
Arguments
- .x
An epi_archive or grouped_epi_archive object. If ungrouped, all data in .x will be treated as part of a single data group.
- .f
Function, formula, or missing; together with ..., specifies the computation. The computation will be run on each requested group-version combination, with a time window filter applied if .before is supplied.
If .f is a function, it must have the form function(x, g, v) or function(x, g, v, <additional configuration args>), where
  - `x` is an `epi_df` with the same column names as the archive's `DT`, minus the `version` column. (Or, if `.all_versions = TRUE`, an `epi_archive` with the requested partial version history.)
  - `g` is a one-row tibble containing the values of the grouping variables for the associated group.
  - `v` (length-1) is the associated `version` (one of the requested `.versions`).
  - `<additional configuration args>` are optional; you can add such arguments to your function and set them by passing them through the `...` argument to `epix_slide()`.
If a formula, .f can operate directly on columns accessed via .x$var or .$var, as in ~ mean(.x$var) to compute a mean of a column var for each group-version combination. The group key can be accessed via .y or .group_key, and the reference time value (that is, the version) can be accessed via .z, .version, or .ref_time_value. If .f is missing, then ... will specify the computation.
- ...
Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then the ... is interpreted as a "data-masking" expression or expressions for tidy evaluation; in addition to referring to columns directly by name, the expressions have access to the .data and .env pronouns as in dplyr verbs, and can also refer to .x (not the same as the input epi_archive), .group_key, and .version/.ref_time_value. See details for more.
- .before
Optional; applies a time_value filter before running each computation. The default is not to apply a time_value filter. If provided, it should be a single integer or difftime that is compatible with the time_type of the time_value column. If an integer, then the minimum possible time_value included will be that many time steps (according to the time_type) before each requested .version. This window endpoint is inclusive. For example, if .before = 14, the time_type in the archive is "day", and the requested .version is January 15, then the smallest possible time_value in the snapshot will be January 1. Note that this does not mean that there will be 14 or 15 distinct time_values actually appearing in the data; for most reporting streams, reporting as of January 15 won't include time_values all the way through January 14, due to reporting latency. Unlike epi_slide(), epix_slide() won't fill in any missing time_values in this window.
- .versions
Requested versions on which to run the computation. Each requested .version also serves as the anchor point from which the time_value window specified by .before is drawn. If .versions is missing, it will be set to a regularly-spaced sequence of values set to cover the range of versions in the DT plus the versions_end; the spacing of values will be guessed (using the GCD of the skips between values).
- .new_col_name
Either NULL or a string indicating the name of the new column that will contain the derived values. The default, NULL, will use the name "slide_value" unless your slide computations output data frames, in which case they will be unpacked into the constituent columns and the data frame's column names will be used instead. If the resulting column name(s) overlap with the column names used for labeling the computations, which are group_vars(x) and "version", then the values for these columns must be identical to the labels we assign.
- .all_versions
(Not the same as the .all_rows parameter of epi_slide.) If .all_versions = TRUE, then the slide computation will be passed the version history (all versions <= .version, where .version is one of the requested .versions), in epi_archive format. Otherwise, the slide computation will be passed only the most recent version for every unique time_value, in epi_df format. Default is FALSE.
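To make the .before window arithmetic described above concrete, here is a minimal base R sketch. It is an illustration only and not the package's implementation: the helper window_start() is hypothetical, the year 2021 is assumed (the text gives only "January 15"), and the 3-day reporting latency is made up.

```r
# Hypothetical helper: earliest time_value kept for a given version,
# assuming a "day" time_type and an inclusive window endpoint.
window_start <- function(version, before) {
  version - before
}

version <- as.Date("2021-01-15")  # year assumed for illustration
start <- window_start(version, 14)
start  # January 1

# Toy snapshot: with an assumed 3 days of reporting latency, the latest
# time_value available as of January 15 is January 12.
snapshot_time_values <- seq(as.Date("2020-12-20"), as.Date("2021-01-12"), by = "1 day")

# Apply the window filter; nothing is filled in for January 13 and 14.
kept <- snapshot_time_values[snapshot_time_values >= start]
length(kept)  # 12 distinct time_values, not 14 or 15
```

The window only sets a lower bound on time_value; how many time_values actually fall inside it depends on what had been reported as of the requested version.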
Value
A tibble whose columns are: the grouping variables (if any), version, containing the requested versions (reference time values) for the slide computation, and a column named according to the .new_col_name argument, containing the slide values (or, when the computations output data frames, the unpacked columns of those data frames). It will be grouped by the grouping variables.
Details
This is similar to looping over versions and calling epix_as_of, but has
some conveniences such as working naturally with grouped_epi_archives,
optional time windowing, and syntactic sugar to make things shorter to write.
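The "loop over versions and take a snapshot" idea can be sketched in base R on a toy version history. This is a conceptual illustration, not how epix_slide or epix_as_of are implemented: toy_updates, toy_versions, and snapshot_as_of() are hypothetical names, and all values are made up.

```r
# Toy version history: each row is the value reported for `time_value`
# in data version `version` (all values made up for illustration).
toy_updates <- data.frame(
  time_value = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  version    = as.Date(c("2020-01-02", "2020-01-04", "2020-01-03", "2020-01-04")),
  value      = c(10, 12, 20, 25)
)

# Hypothetical stand-in for an "as of" snapshot: among rows with
# version <= v, keep the most recent version of each time_value.
snapshot_as_of <- function(updates, v) {
  known <- updates[updates$version <= v, , drop = FALSE]
  known <- known[order(known$time_value, known$version), , drop = FALSE]
  latest <- !duplicated(known$time_value, fromLast = TRUE)
  known[latest, c("time_value", "value"), drop = FALSE]
}

toy_versions <- as.Date(c("2020-01-02", "2020-01-03", "2020-01-04"))

# Loop over versions, run a computation on each snapshot, collect the results.
results <- do.call(rbind, lapply(seq_along(toy_versions), function(i) {
  v <- toy_versions[i]
  snap <- snapshot_as_of(toy_updates, v)
  data.frame(version = v, total = sum(snap$value))
}))
results
```

Each row of results is labeled with the version it was computed for, and the totals change across versions because later versions both add new time_values and revise earlier ones.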
A few key distinctions between the current function and epi_slide():
- In .f functions for epix_slide, one should not assume that the input data will contain any rows with time_value matching the computation's .version, due to reporting latency; for typical epidemiological surveillance data, observations pertaining to a particular time period (time_value) are first reported as_of some instant after that time period has ended. Unlike in epi_slide(), no time window completion is performed.
- The input class and columns are similar but different: epix_slide (with the default .all_versions=FALSE) keeps all columns and the epi_df-ness of the first argument to each computation; epi_slide only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the essential geo_value column. (With .all_versions=TRUE, epix_slide will provide an epi_archive rather than an epi_df to each computation.)
- The output class and columns are similar but different: epix_slide() returns a tibble containing only the grouping variables, version, and the new column(s) from the slide computations, whereas epi_slide() returns an epi_df with all original variables plus the new columns from the slide computations. (Both will mirror the grouping or ungroupedness of their input, with one exception: epi_archives can have trivial (zero-variable) groupings, but these will be dropped in epix_slide results as they are not supported by tibbles.)
- There are no size stability checks or element/row recycling to maintain size stability in epix_slide, unlike in epi_slide. (epix_slide is roughly analogous to dplyr::group_modify, while epi_slide is roughly analogous to dplyr::mutate.)
- .all_rows is not supported in epix_slide; since the slide computations are allowed more flexibility in their outputs than in epi_slide, we can't guess a good representation for missing computations for excluded group-.ref_time_value pairs.
- The .versions default for epix_slide is based on making an evenly-spaced sequence out of the versions in the DT plus the versions_end, rather than all unique time_values.
- epix_slide() computations can refer to the current element of .versions as either .version or .ref_time_value, while epi_slide() computations refer to the current element of .ref_time_values with .ref_time_value.
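The evenly-spaced .versions default mentioned above (spacing guessed from the GCD of the skips between values) can be illustrated with a small base R sketch. This is one plausible reading of that description, assuming a "day" time type; it is not the package's actual code, and gcd2(), observed_versions, and the dates are hypothetical.

```r
# Euclid's algorithm for two non-negative integers.
gcd2 <- function(a, b) {
  while (b != 0) {
    remainder <- a %% b
    a <- b
    b <- remainder
  }
  a
}

# Toy versions appearing in the data (made up), plus a toy versions_end.
observed_versions <- as.Date(c("2020-06-01", "2020-06-08", "2020-06-22", "2020-06-29"))
versions_end <- as.Date("2020-07-06")

# Skips between consecutive values, in days: 7, 14, 7, 7.
all_versions <- sort(unique(c(observed_versions, versions_end)))
skips <- as.integer(diff(all_versions))

# Guess the spacing as the GCD of the skips, then build a regular sequence
# covering the full range.
spacing <- Reduce(gcd2, skips)
default_versions <- seq(min(all_versions), max(all_versions), by = spacing)
default_versions
```

In this toy case the guessed spacing is 7 days, so the sequence includes 2020-06-15 even though no update was recorded with that version.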
Apart from the above distinctions, the interfaces of epix_slide() and
epi_slide() are the same.
Examples
library(dplyr)
# Request only a small set of versions, for example's sake:
requested_versions <-
  seq(as.Date("2020-09-02"), as.Date("2020-09-15"), by = "1 day")
# Investigate reporting lag of `percent_cli` signal (though normally we'd
# probably work off of the dedicated `revision_summary()` function instead):
archive_cases_dv_subset %>%
  epix_slide(
    geowide_percent_cli_max_time = max(time_value[!is.na(percent_cli)]),
    geowide_percent_cli_rpt_lag = .version - geowide_percent_cli_max_time,
    .versions = requested_versions
  )
#> # A tibble: 14 × 3
#> version geowide_percent_cli_max_time geowide_percent_cli_rpt_lag
#> * <date> <date> <drtn>
#> 1 2020-09-02 2020-08-30 3 days
#> 2 2020-09-03 2020-08-31 3 days
#> 3 2020-09-04 2020-09-01 3 days
#> 4 2020-09-05 2020-09-02 3 days
#> 5 2020-09-06 2020-09-03 3 days
#> 6 2020-09-07 2020-09-04 3 days
#> 7 2020-09-08 2020-09-05 3 days
#> 8 2020-09-09 2020-09-06 3 days
#> 9 2020-09-10 2020-09-07 3 days
#> 10 2020-09-11 2020-09-08 3 days
#> 11 2020-09-12 2020-09-09 3 days
#> 12 2020-09-13 2020-09-10 3 days
#> 13 2020-09-14 2020-09-11 3 days
#> 14 2020-09-15 2020-09-12 3 days
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    percent_cli_max_time = max(time_value[!is.na(percent_cli)]),
    percent_cli_rpt_lag = .version - percent_cli_max_time,
    .versions = requested_versions
  )
#> # A tibble: 56 × 4
#> # Groups: geo_value [4]
#> geo_value version percent_cli_max_time percent_cli_rpt_lag
#> * <chr> <date> <date> <drtn>
#> 1 ca 2020-09-02 2020-08-30 3 days
#> 2 fl 2020-09-02 2020-08-30 3 days
#> 3 ny 2020-09-02 2020-08-30 3 days
#> 4 tx 2020-09-02 2020-08-30 3 days
#> 5 ca 2020-09-03 2020-08-31 3 days
#> 6 fl 2020-09-03 2020-08-31 3 days
#> 7 ny 2020-09-03 2020-08-31 3 days
#> 8 tx 2020-09-03 2020-08-31 3 days
#> 9 ca 2020-09-04 2020-09-01 3 days
#> 10 fl 2020-09-04 2020-09-01 3 days
#> # ℹ 46 more rows
# Backtest a forecaster "pseudoprospectively" (i.e., faithfully with respect
# to the data version history):
case_death_rate_archive %>%
  epix_slide(
    .versions = as.Date(c("2021-10-01", "2021-10-08")),
    function(x, g, v) {
      epipredict::arx_forecaster(
        x,
        outcome = "death_rate",
        predictors = c("death_rate_7d_av", "case_rate_7d_av")
      )$predictions
    }
  )
#> Registered S3 method overwritten by 'epipredict':
#> method from
#> print.step_naomit recipes
#> # A tibble: 112 × 6
#> version geo_value .pred .pred_distn forecast_date target_date
#> * <date> <chr> <dbl> <qtls(7)> <date> <date>
#> 1 2021-10-01 ak 1.95 [1.95] 2021-09-30 2021-10-07
#> 2 2021-10-01 al 1.36 [1.36] 2021-09-30 2021-10-07
#> 3 2021-10-01 ar 0.572 [0.572] 2021-09-30 2021-10-07
#> 4 2021-10-01 as 0.0128 [0.0128] 2021-09-30 2021-10-07
#> 5 2021-10-01 az 0.537 [0.537] 2021-09-30 2021-10-07
#> 6 2021-10-01 ca 0.260 [0.26] 2021-09-30 2021-10-07
#> 7 2021-10-01 co 0.308 [0.308] 2021-09-30 2021-10-07
#> 8 2021-10-01 ct 0.406 [0.406] 2021-09-30 2021-10-07
#> 9 2021-10-01 dc 0.147 [0.147] 2021-09-30 2021-10-07
#> 10 2021-10-01 de 0.382 [0.382] 2021-09-30 2021-10-07
#> # ℹ 102 more rows
# See `vignette("backtesting", package="epipredict")` for a full walkthrough
# on backtesting forecasters, including plots, etc.
# --- Advanced: ---
# `epix_slide` with `.all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `.all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = TRUE,
    .versions = requested_versions
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)