Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sf_ - to_list #70

Closed
14 tasks done
dcooley opened this issue Mar 23, 2020 · 17 comments
Closed
14 tasks done

sf_ - to_list #70

dcooley opened this issue Mar 23, 2020 · 17 comments

Comments

@dcooley
Copy link
Owner

dcooley commented Mar 23, 2020

the reverse of #69 - keep all the rows and make it a list, where each list element is the length of the sfg_id

list_columns can be R_NilValue (or NULL).

I don't want to pass another value through all the sf_() functions. So maybe I need to make an sfheaders::api:: which will switch on the type, and will return to it before the create_df() call, so I can then pass in the list_columns...

TODO

  • sf_point
  • sf_pt
  • sf_linestring
  • sf_line
  • sf_multilinestring
  • sf_mline
  • sf_polygon
  • sf_mpoly
  • sf_multipolygon
  • sf_mpoly
  • everywhere that uses sfheaders:::rcpp_sf_ needs to change to sfheaders:::rcpp_to_sf (including tests)
  • works without specifying an id column
  • works on data.frame and matrices, with names and using numeric indices
  • calling unlist on a non-list column works (or doesn't error) - this has to error? If it's not a list, how do we know how to handle it?
SymbolixAU pushed a commit that referenced this issue Mar 24, 2020
SymbolixAU pushed a commit that referenced this issue Mar 27, 2020
dcooley added a commit that referenced this issue Mar 27, 2020
dcooley added a commit that referenced this issue Mar 27, 2020
@dcooley
Copy link
Owner Author

dcooley commented Mar 27, 2020

@mdsumner I've added a new argument to the sf_ functions called list_columns, which you use to specify any columns of the df where you want to keep all the values

df <- data.frame(
  id = c(1,1,2,2)
  , x = 1:4
  , y = 1:4
  , val = letters[1:4]
  , stringsAsFactors = FALSE
)

sf_linestring(
  obj = df
  , x = "x"
  , y = "y"
  , linestring_id = "id"
  , keep = TRUE
  , list_columns = "val"
)

#   id  val   geometry
# 1  1 a, b 1, 2, 1, 2
# 2  2 c, d 3, 4, 3, 4

vs the current behaviour

sf_linestring(
  obj = df
  , x = "x"
  , y = "y"
  , linestring_id = "id"
  , keep = TRUE
)

## here in 'val', only the first row of the `df$val` is kept
#   id val   geometry
# 1  1   a 1, 2, 1, 2
# 2  2   c 3, 4, 3, 4

Do you foresee any issues if I add this arg to your sf-helper functions?

@mdsumner
Copy link
Contributor

mdsumner commented Mar 28, 2020

That's interesting. I don't think it can be a problem, but I'd need to try it with a few things.

It's strange one, effectively you are allow vertex attributes beyond x, y, z, m - by storing them one-to-one "elsewhere" (which is what every spatial system does when butting up against the limitations of SF).

This is very complementary to PATH in silicate, it tries to keep these vertex attributes and in theory allows for any set to be stored in the $vertex table - but it's only really functional for track data (with date-time, depth, temperature, whatever was measured) and not mature. The issue is $vertex is normalized (deduplicated) in some geometry xy, or xyz, or xyt etc and so other attributes (all in a higher dimensionality along with x,y that has no inherent duplication) end up on the $path_link_vertex table. It's powerful I think but not developed.

(Heh, sorry, but this might help bridge that model).

I'll try to have a look soonish. Thanks for the heads up!

@mdsumner
Copy link
Contributor

Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data is, and (not a great example but relevant) the walrus818 in trip:

d <- tibble::as_tibble(trip::walrus818)

# A tibble: 10,558 x 6
   Deployment DataDT                Wet Forage X_AED170_70 Y_AED170_70
        <int> <dttm>              <int>  <int>       <dbl>       <dbl>
 1        353 2009-09-15 04:00:00     1      0      281017       22532
 2        353 2009-09-15 05:00:00     0      0      281399       22392
...

sf_linestring(d, x = "X_AED170_70", y = "Y_AED170_70", z = "DataDT", linestring_id = "Deployment")

Here list_columns would be c("Wet", "Forage").

Q: does keep remain? So we don't have to specify the columns, "just do it"?

@mdsumner
Copy link
Contributor

crs for Walrus818 fwiw:

crsmeta::crs_proj(trip::walrus818)
[1] "+proj=aeqd +ellps=WGS84 +lon_0=-170 +lat_0=70"

@mdsumner
Copy link
Contributor

Q: does keep remain? So we don't have to specify the columns, "just do it"?

Oh I see, list-columns are separate, and compatible with keep - very good. ( at first I though you only added one column with 1 or more vectors in each list)

@dcooley
Copy link
Owner Author

dcooley commented Mar 28, 2020

Oh, do you have a real-world example data set that's relevant? Pretty much any tracking data

No, nothing specific, but my use-case is the multi-coloured path and polygon vertices, and multi-coloured trips (e.g. to represent loads on public transport)

effectively you are allow vertex attributes beyond x, y, z, m

yep, exactly.

@mdsumner
Copy link
Contributor

All good! Had a bit of refresh on this concept and what I've tried before, this will be a sweet mechanism, with lots of implications across the tools I'm working on.

@mdsumner
Copy link
Contributor

Oh, so one thing to consider - should the "extra" attributes be bundled into a single list column, where each element is a dataframe? Then it's only a do.call(rbind, df$listcol) away and no name handling to do: dplyr::bind_rows(df$listcol, .id = "feature_id") also does the same and makes id-ing the feature easier.

I'm not of a strong opinion, possibly I'd prefer data frames in each element (even if only one column) than a raw list.

@dcooley
Copy link
Owner Author

dcooley commented Mar 28, 2020

should the "extra" attributes be bundled into a single list column, where each element is a dataframe?

Do you have an example of this, I can't quite visualise what you mean?

@mdsumner
Copy link
Contributor

mdsumner commented Mar 28, 2020

Took a bit more than I thought, but this is what I mean:

 library(sfheaders)
 d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)
 
 d <- rbind(d0, 2 * d0)
 d$temp <- sort(c(runif(nrow(d)/2),
                  runif(nrow(d)/2, 2, 5)))
 d$fac <- sample(5:15, nrow(d), replace = TRUE)
 d$time <- Sys.time() + seq_len(nrow(d)) * 2
 
 datcols <- c("temp", "fac", "time")
 ## current scheme for list_columns = c("temp", "fac", "time")
 x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id", 
                    keep = TRUE,
                    list_columns = datcols)
 ## each are separate lists
 x$temp
#> [[1]]
#>  [1] 0.1012976 0.1808249 0.3301957 0.3396338 0.3452928 0.4666439 0.5471586
#>  [8] 0.5992981 0.7777070 0.7850789
#> 
#> [[2]]
#>  [1] 2.435679 2.778072 2.801986 2.979986 3.047628 3.705985 3.831921 4.083824
#>  [9] 4.682265 4.910941
 x$fac 
#> [[1]]
#>  [1]  6  8  9 10 15 15 15  9  9  9
#> 
#> [[2]]
#>  [1]  7  7 15  6 10  8 13  9 15  9
 x$time
#> [[1]]
#>  [1] "2020-03-29 09:49:23 AEDT" "2020-03-29 09:49:25 AEDT"
#>  [3] "2020-03-29 09:49:27 AEDT" "2020-03-29 09:49:29 AEDT"
#>  [5] "2020-03-29 09:49:31 AEDT" "2020-03-29 09:49:33 AEDT"
#>  [7] "2020-03-29 09:49:35 AEDT" "2020-03-29 09:49:37 AEDT"
#>  [9] "2020-03-29 09:49:39 AEDT" "2020-03-29 09:49:41 AEDT"
#> 
#> [[2]]
#>  [1] "2020-03-29 09:49:43 AEDT" "2020-03-29 09:49:45 AEDT"
#>  [3] "2020-03-29 09:49:47 AEDT" "2020-03-29 09:49:49 AEDT"
#>  [5] "2020-03-29 09:49:51 AEDT" "2020-03-29 09:49:53 AEDT"
#>  [7] "2020-03-29 09:49:55 AEDT" "2020-03-29 09:49:57 AEDT"
#>  [9] "2020-03-29 09:49:59 AEDT" "2020-03-29 09:50:01 AEDT"



 ## but 1) could be a nested data frame
 x2 <- x
 x2$data <- split(d[datcols], d$linestring_id)[unique(d$linestring_id)]

 ## (or 2) a list, closer to current scheme)
 #x2$data <- lapply(split(d[datcols], d$linestring_id)[unique(d$linestring_id)], as.list)
 x2[datcols] <- NULL

## each separate column is bundled in the one list-column
 
x2$data
#> $`1`
#>         temp fac                time
#> 1  0.1012976   6 2020-03-29 09:49:23
#> 2  0.1808249   8 2020-03-29 09:49:25
#> 3  0.3301957   9 2020-03-29 09:49:27
#> 4  0.3396338  10 2020-03-29 09:49:29
#> 5  0.3452928  15 2020-03-29 09:49:31
#> 6  0.4666439  15 2020-03-29 09:49:33
#> 7  0.5471586  15 2020-03-29 09:49:35
#> 8  0.5992981   9 2020-03-29 09:49:37
#> 9  0.7777070   9 2020-03-29 09:49:39
#> 10 0.7850789   9 2020-03-29 09:49:41
#> 
#> $`2`
#>        temp fac                time
#> 11 2.435679   7 2020-03-29 09:49:43
#> 12 2.778072   7 2020-03-29 09:49:45
#> 13 2.801986  15 2020-03-29 09:49:47
#> 14 2.979986   6 2020-03-29 09:49:49
#> 15 3.047628  10 2020-03-29 09:49:51
#> 16 3.705985   8 2020-03-29 09:49:53
#> 17 3.831921  13 2020-03-29 09:49:55
#> 18 4.083824   9 2020-03-29 09:49:57
#> 19 4.682265  15 2020-03-29 09:49:59
#> 20 4.910941   9 2020-03-29 09:50:01

Created on 2020-03-29 by the reprex package (v0.3.0)

@mdsumner
Copy link
Contributor

mdsumner commented Mar 28, 2020

fwiw, it's what nest() does

library(tidyverse)
 d %>% select(-x, -y) %>% group_by(linestring_id) %>% nest(data = datcols)
# A tibble: 2 x 2
# Groups:   linestring_id [2]
  linestring_id data             
          <dbl> <list>           
1             1 <tibble [10 × 3]>
2             2 <tibble [10 × 3]>

And it's what unjoin() does, but by separating and de-duplicating the coords if posssible (a core of silicate):

unjoin::unjoin(d, x, y, linestring_id)

@SymbolixAU
Copy link
Contributor

SymbolixAU commented Mar 29, 2020

Ok I see what you're suggesting.

However, my argument for keeping them separate is so you can run colourvalues over each individual list.

olourvalues::colour_values(x$temp)
[[1]]
 [1] "#440154FF" "#45065AFF" "#460A5DFF" "#471265FF" "#48176AFF" "#482273FF" "#472D7BFF" "#46337FFF" "#463480FF"
[10] "#433D85FF"

[[2]]
 [1] "#26818EFF" "#25858EFF" "#20928CFF" "#1F988BFF" "#1E9D89FF" "#57C766FF" "#B2DD2DFF" "#B3DD2DFF" "#C8E020FF"
[10] "#FDE725FF"

Which means it can go straight into mapdeck ;)

@mdsumner
Copy link
Contributor

Ah ok, no worries. Appreciate the discussion has been clarifying :)

@SymbolixAU
Copy link
Contributor

And thinking about it more, I see things like 'nesting' as very much in the realm of other packages (dplyr, data.table), and I think it would be a fair bit of effort to get it "right", when they already do it.

library(data.table)

dt <- as.data.table( d )
dt[
  , .(
    geometry = sfheaders::sfc_linestring(obj = dt, x = "x", y = "y")
    , data = list(.SD)
  )
  , .SDcols = setdiff(names(dt), c("x","y"))
  , by = .(linestring_id)
]
#    linestring_id                           geometry         data
# 1:             1 LINESTRING (1 1, 2 2, 3 3, ...,... <data.table>
# 2:             2 LINESTRING (1 1, 2 2, 3 3, ...,... <data.table>

Out of interest, what would be the dplyr equivalent of this data.table code?

I don't really speak dplyr, so I've got as far as

library(tidyverse)
d %>% 
  group_by(linestring_id) %>% 
  summarise(
    geometry = sfheaders::sfc_linestring(., "x", "y")
    )

@SymbolixAU
Copy link
Contributor

But thinking some more... maybe it wouldn't be so hard...

We could have a list_data argument, as well as list_columns, which will combine the lists into your suggested data ...

@SymbolixAU
Copy link
Contributor

yeah ok, let's do it

@mdsumner
Copy link
Contributor

haha can't keep up, this does it (but is a bit complicated atm with upcoming dplyr/tibble changes)

  library(sfheaders)
d0 <- data.frame(x = 1:10, y = 1:10, linestring_id = 1)

d <- rbind(d0, 2 * d0)
d$temp <- sort(c(runif(nrow(d)/2),
                 runif(nrow(d)/2, 2, 5)))
d$fac <- sample(5:15, nrow(d), replace = TRUE)
d$time <- Sys.time() + seq_len(nrow(d)) * 2

datcols <- c("temp", "fac", "time")
## current scheme for list_columns = c("temp", "fac", "time")
x <- sf_linestring(d, x = "x", y = "y", linestring_id = "linestring_id", 
                   keep = TRUE,
                   list_columns = datcols)


library(tidyverse)
d %>% 
  group_by(linestring_id) %>% 
  summarise(
    geometry = sfheaders::sfc_linestring(., "x", "y"), 
    data  = list(select(ungroup(.), datcols))
  )
#> # A tibble: 2 x 3
#>   linestring_id geometry data             
#>           <dbl> <list>   <list>           
#> 1             1 <XY>     <tibble [20 × 3]>
#> 2             2 <XY>     <tibble [20 × 3]>

Created on 2020-03-30 by the reprex package (v0.3.0)

@dcooley dcooley closed this as completed in fb4c40e Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants