Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_observations() returns too many deaths #96

Open
nikosbosse opened this issue Feb 18, 2021 · 15 comments
Open

load_observations() returns too many deaths #96

nikosbosse opened this issue Feb 18, 2021 · 15 comments

Comments

@nikosbosse
Copy link
Contributor

nikosbosse commented Feb 18, 2021

the function load_observations() which reads the case data from here
obs <- fread(here("models", "rt", "data", "summary", target_date, "reported_cases.csv"))
returns too many deaths for the US as a whole.

Numbers in the data look different from the ones on Ourworldindata and also differ from the ones returned by get_us_deaths() They do, however, look similar to the ones shown on Google.

@seabbs
Copy link
Contributor

seabbs commented Feb 18, 2021

have you tried tracing the data back? The summary your drawing from is made using get_us_deaths so it should be very possible to track down an issue.

@nikosbosse
Copy link
Contributor Author

working on it :)

@seabbs
Copy link
Contributor

seabbs commented Feb 18, 2021

Linked to: #93

@nikosbosse
Copy link
Contributor Author

it's taking me a while to track down - do you know from the top of your head where the epinow regional data comes from?

@seabbs
Copy link
Contributor

seabbs commented Feb 18, 2021

deaths <- get_us_deaths(data = "daily")

@nikosbosse
Copy link
Contributor Author

I'm very confused... three different versions to get data, three different results. Presumably I'm just tired and it is really obvious what's going on...

# option 1

source(here::here("utils", "get-us-data.R"))

weekly_deaths_state <- get_us_deaths(data = "daily") %>%
  filter(date >= (as.Date(forecast_date) - 8 * 4)) %>%
  group_by(state, epiweek) %>%
  summarise(deaths = sum(deaths), 
            target_end_date = max(date),
            .groups = "drop_last") %>%
  dplyr::select(-epiweek) %>%
  dplyr::ungroup()

weekly_deaths_national <- weekly_deaths_state %>%
  group_by(target_end_date) %>%
  summarise(deaths = sum(deaths), .groups = "drop_last") %>%
  mutate(state = "US")

# combine and only keep complete epiweeks, marked by the day 'Saturday'
obs2 <- dplyr::bind_rows(weekly_deaths_state, weekly_deaths_national) %>%
  dplyr::filter(weekdays(target_end_date) == "Saturday", 
                target_end_date <= as.Date(forecast_date))




# option 2

deaths <- get_us_deaths(data = "daily")
deaths <- as.data.table(deaths)
deaths <- deaths[, .(region = state, date = as.Date(date), 
                     confirm = deaths)]
us_deaths <- copy(deaths)[, .(confirm = sum(confirm, na.rm = TRUE)), by = "date"]
us_deaths <- us_deaths[, region := "US"]
deaths <- rbindlist(list(us_deaths, deaths), use.names = TRUE)
deaths <- deaths[date <= as.Date(target_date)]
deaths <- deaths[date >= (as.Date(target_date) - weeks(12))]
setorder(deaths, region, date)

obs <- deaths[, .(date, state = region, value = confirm)]

state_codes <- readRDS(here("data", "state_codes.rds"))
obs <- obs[state_codes, on = "state"]

source(here("utils", "dates-to-epiweek.R"))
obs <- dates_to_epiweek(obs)
obs <- obs[epiweek_full == TRUE, .(value = sum(value), date = max(date)), 
           by = .(location, state, epiweek)][, epiweek := NULL]


# option 3

obs3 <- load_observations("2021-02-15")


tail(obs2)
tail(obs)
tail(obs3)

image

@seabbs
Copy link
Contributor

seabbs commented Feb 19, 2021

load_observations exists to load the truth data used for modelling which is stored by EpiNow2 (hence the file path). This means we can evaluate and ensemble against the correct data rather than using data that is updated retrospectively. Once anomaly correction is added this function needs to draw from another folder in which non-adjusted truth data is stored by date.

@seabbs
Copy link
Contributor

seabbs commented Feb 19, 2021

The reason data input is different in the time series is that they were written by different people and standardisation was difficult.

I don't know why load-observations gives a different result but it needs more investigation. I'd suggest graphing it.

In general the use of data here had always been quite disjointed and a little messy. It would be good to rationalise aside from this potential bug.

@nikosbosse
Copy link
Contributor Author

ok it seems like at least the first two do agree (only for some reason the green one has a week of data more when filtering for the same period).
image

Difference apparently mostly comes from Ohio. But for some reason the US curves don't agree even if almost all of the state curves do agree.
image

This is I assume because the data we download with get_us_deaths() gets corrected?

Questions are then:

  • which of these data streams should I use to generate the historic baseline forceasts? (the Rt data?)
  • which of these to use for future baseline forecasts (this shouldn't matter, right? except to avoid the dependency, so the non-Rt-data?)
  • do we want to externalise this to a separate package? I had started working on something here https://github.com/epiforecasts/forecasthubutils a while ago, but it never really got used

@seabbs
Copy link
Contributor

seabbs commented Feb 22, 2021

This sounds like potentially it is due to the internal anomaly handling in EpiNow2 but I am not totally convinced as all that is is setting days with 0 cases to a local average.

I am not sure a split out package is required to handle data only processing? Though potentially for some of the processing tasks.

We need:

  1. Raw data downloading and storage in a dated csv file.
  2. A single processing of data from state level to US level that is then used everywhere.
  3. Raw data processing with a basic anomaly correct (i.e something like set to moving average if outside 3 standard deviations in the data)
  4. Flag on when anomaly correction has occurred and some kind of reporting process for this.
  5. Use anomaly corrected dated data for all downstream modelling work.
  6. Use raw dated data for plotting.

Anomaly correction in EpiNow2.

https://github.com/epiforecasts/EpiNow2/blob/d2b2aa6e76190000d5aad37e66f132f7c44d4644/R/create.R#L34

@seabbs seabbs mentioned this issue Feb 22, 2021
@nikosbosse
Copy link
Contributor Author

Sounds very reasonable. Should we have a quick chat at some point to discuss how to move forward and divide up work?

as this is related to #88 (that one isn't merged because the plotting hasn't happened yet): how should I proceed with the PR? Keep it open until we solved the data issues, then do all the past plots there and then merge?

@seabbs
Copy link
Contributor

seabbs commented Feb 24, 2021

Sounds like a good idea.

Nite sure why this is blocking #88? I'd prefer to keep PRs modular if possible.

@nikosbosse
Copy link
Contributor Author

nikosbosse commented Feb 24, 2021 via email

@seabbs
Copy link
Contributor

seabbs commented Feb 24, 2021

Can you use the structure as present and we can fix the underling data it draws from later?

@nikosbosse
Copy link
Contributor Author

👍 did that. PR can be merged now I think and then we can address this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants