load_observations() returns too many deaths #96

nikosbosse · 2021-02-18T16:04:09Z

The function load_observations(), which reads the case data from here:
obs <- fread(here("models", "rt", "data", "summary", target_date, "reported_cases.csv"))
returns too many deaths for the US as a whole.

The numbers in the data look different from the ones on Our World in Data and also differ from the ones returned by get_us_deaths(). They do, however, look similar to the ones shown on Google.

seabbs · 2021-02-18T16:09:55Z

Have you tried tracing the data back? The summary you're drawing from is made using get_us_deaths, so it should be very possible to track down an issue.

nikosbosse · 2021-02-18T16:27:15Z

working on it :)

seabbs · 2021-02-18T16:28:17Z

Linked to: #93

nikosbosse · 2021-02-18T16:31:42Z

it's taking me a while to track down - do you know off the top of your head where the epinow regional data comes from?

seabbs · 2021-02-18T17:00:02Z

covid-us-forecasts/models/rt/update-rt.R

Line 19 in ab05dee

deaths <- get_us_deaths(data = "daily")

nikosbosse · 2021-02-18T18:57:34Z

I'm very confused... three different ways to get the data, three different results. Presumably I'm just tired and it is really obvious what's going on...

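# packages used below: dplyr (pipe and verbs), data.table, lubridate (weeks()), here
library(dplyr)
library(data.table)
library(lubridate)
library(here)

# forecast_date and target_date are assumed to be defined already,
# e.g. "2021-02-15" as in option 3

# option 1

source(here::here("utils", "get-us-data.R"))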

weekly_deaths_state <- get_us_deaths(data = "daily") %>%
  filter(date >= (as.Date(forecast_date) - 8 * 4)) %>%
  group_by(state, epiweek) %>%
  summarise(deaths = sum(deaths), 
            target_end_date = max(date),
            .groups = "drop_last") %>%
  dplyr::select(-epiweek) %>%
  dplyr::ungroup()

weekly_deaths_national <- weekly_deaths_state %>%
  group_by(target_end_date) %>%
  summarise(deaths = sum(deaths), .groups = "drop_last") %>%
  mutate(state = "US")

# combine and only keep complete epiweeks, marked by the day 'Saturday'
obs2 <- dplyr::bind_rows(weekly_deaths_state, weekly_deaths_national) %>%
  dplyr::filter(weekdays(target_end_date) == "Saturday", 
                target_end_date <= as.Date(forecast_date))




# option 2

deaths <- get_us_deaths(data = "daily")
deaths <- as.data.table(deaths)
deaths <- deaths[, .(region = state, date = as.Date(date), 
                     confirm = deaths)]
us_deaths <- copy(deaths)[, .(confirm = sum(confirm, na.rm = TRUE)), by = "date"]
us_deaths <- us_deaths[, region := "US"]
deaths <- rbindlist(list(us_deaths, deaths), use.names = TRUE)
deaths <- deaths[date <= as.Date(target_date)]
deaths <- deaths[date >= (as.Date(target_date) - weeks(12))]
setorder(deaths, region, date)

obs <- deaths[, .(date, state = region, value = confirm)]

state_codes <- readRDS(here("data", "state_codes.rds"))
obs <- obs[state_codes, on = "state"]

source(here("utils", "dates-to-epiweek.R"))
obs <- dates_to_epiweek(obs)
obs <- obs[epiweek_full == TRUE, .(value = sum(value), date = max(date)), 
           by = .(location, state, epiweek)][, epiweek := NULL]


# option 3

obs3 <- load_observations("2021-02-15")


tail(obs2)
tail(obs)
tail(obs3)

seabbs · 2021-02-19T08:29:36Z

load_observations exists to load the truth data used for modelling which is stored by EpiNow2 (hence the file path). This means we can evaluate and ensemble against the correct data rather than using data that is updated retrospectively. Once anomaly correction is added this function needs to draw from another folder in which non-adjusted truth data is stored by date.

seabbs · 2021-02-19T08:36:57Z

The reason data input is different in the time series is that they were written by different people and standardisation was difficult.

I don't know why load-observations gives a different result but it needs more investigation. I'd suggest graphing it.

In general the use of data here has always been quite disjointed and a little messy. It would be good to rationalise it, aside from this potential bug.

nikosbosse · 2021-02-19T11:16:07Z

ok it seems like at least the first two do agree (only for some reason the green one has a week of data more when filtering for the same period).

Difference apparently mostly comes from Ohio. But for some reason the US curves don't agree even if almost all of the state curves do agree.

This is, I assume, because the data we download with get_us_deaths() gets corrected?

Questions are then:

which of these data streams should I use to generate the historic baseline forecasts? (the Rt data?)
which of these to use for future baseline forecasts (this shouldn't matter, right? except to avoid the dependency, so the non-Rt-data?)
do we want to externalise this to a separate package? I had started working on something here https://github.com/epiforecasts/forecasthubutils a while ago, but it never really got used
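
One way to pin down which state drives a discrepancy like the Ohio one mentioned in this comment is to join two streams on state and date and total the absolute differences per state. The sketch below is only an illustration in base R: compare_streams() is a hypothetical helper, the column names (state, date, value) are assumed to match the obs objects built in the earlier comment, and the numbers are made up.

```r
# Hypothetical helper (not part of this repo): rank states by how much
# two weekly data streams disagree.
compare_streams <- function(a, b) {
  # Keep rows present in either stream; unmatched rows get NA on one side
  both <- merge(a, b, by = c("state", "date"),
                suffixes = c("_a", "_b"), all = TRUE)
  both$abs_diff <- abs(both$value_a - both$value_b)
  # aggregate() with a formula drops rows where abs_diff is NA,
  # i.e. weeks that exist in only one of the streams
  by_state <- aggregate(abs_diff ~ state, data = both, FUN = sum)
  by_state[order(-by_state$abs_diff), ]
}

# Toy stand-ins for two streams (values are invented for illustration)
dates <- as.Date(c("2021-02-06", "2021-02-13"))
stream_a <- data.frame(state = rep(c("Ohio", "Texas"), each = 2),
                       date = rep(dates, 2),
                       value = c(500, 2000, 1200, 1100))
stream_b <- data.frame(state = rep(c("Ohio", "Texas"), each = 2),
                       date = rep(dates, 2),
                       value = c(500, 600, 1200, 1100))

diffs <- compare_streams(stream_a, stream_b)
diffs
```

Because weeks missing from one stream are dropped rather than counted, a stream with an extra week of data would need a separate check on the date ranges.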

seabbs · 2021-02-22T11:35:12Z

This sounds like it is potentially due to the internal anomaly handling in EpiNow2, but I am not totally convinced, as all that does is set days with 0 cases to a local average.

I am not sure a split out package is required to handle data only processing? Though potentially for some of the processing tasks.

We need:

Raw data downloading and storage in a dated csv file.
A single processing of data from state level to US level that is then used everywhere.
Raw data processing with a basic anomaly correction (i.e. something like setting a value to the moving average if it is outside 3 standard deviations in the data)
Flag on when anomaly correction has occurred and some kind of reporting process for this.
Use anomaly corrected dated data for all downstream modelling work.
Use raw dated data for plotting.

Anomaly correction in EpiNow2.

https://github.com/epiforecasts/EpiNow2/blob/d2b2aa6e76190000d5aad37e66f132f7c44d4644/R/create.R#L34
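
As a rough illustration of the "basic anomaly correction" and flagging items in the list above, here is a minimal base R sketch. The function name flag_and_correct(), the window of three days either side, and the use of the whole-series mean and standard deviation are all assumptions made for this example; it is not the EpiNow2 code linked above and not code from this repository.

```r
# Hypothetical helper (not from this repo): flag outliers in a daily count
# series and replace them with a local moving average.
flag_and_correct <- function(x, half_window = 3, n_sd = 3) {
  # Average of the neighbours on either side, excluding the day itself,
  # so that a spike does not contaminate its own replacement value
  weights <- c(rep(1, half_window), 0, rep(1, half_window)) / (2 * half_window)
  local_avg <- as.numeric(stats::filter(x, weights, sides = 2))

  # Flag values more than n_sd standard deviations from the series mean
  is_anomaly <- abs(x - mean(x, na.rm = TRUE)) > n_sd * sd(x, na.rm = TRUE)
  # Only correct where a local average exists (it is NA near the edges)
  is_anomaly <- !is.na(is_anomaly) & is_anomaly & !is.na(local_avg)

  corrected <- x
  corrected[is_anomaly] <- local_avg[is_anomaly]
  data.frame(raw = x, corrected = corrected, anomaly = is_anomaly)
}

# Made-up daily counts with one obvious reporting spike on day 8
daily <- c(10, 12, 11, 9, 10, 11, 10, 500, 10, 12,
           9, 11, 10, 12, 11, 10, 9, 11, 12, 10)
res <- flag_and_correct(daily)
res[res$anomaly, ]
```

Keeping raw, corrected and anomaly side by side would cover three of the listed needs at once: raw values for plotting, corrected values for modelling, and a flag for reporting. One caveat is that a spike inflates the standard deviation it is tested against, so on short series a robust measure (for example one based on the median) may be a better choice.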

nikosbosse · 2021-02-24T19:33:15Z

Sounds very reasonable. Should we have a quick chat at some point to discuss how to move forward and divide up work?

As this is related to #88 (that one isn't merged because the plotting hasn't happened yet): how should I proceed with the PR? Keep it open until we have solved the data issues, then do all the past plots there and then merge?

seabbs · 2021-02-24T19:51:23Z

Sounds like a good idea.

Not sure why this is blocking #88? I'd prefer to keep PRs modular if possible.

nikosbosse · 2021-02-24T20:03:48Z

What is missing from #88 is an update of past plots with all models. However, I'm unsure what data to use for the plotting.

seabbs · 2021-02-24T22:14:36Z

Can you use the structure as present and we can fix the underlying data it draws from later?

nikosbosse · 2021-02-25T08:11:21Z

👍 did that. PR can be merged now I think and then we can address this issue

seabbs mentioned this issue Feb 22, 2021

Data handling #97

Open
