forked from rich-iannone/DiagrammeR-docs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04-selections.Rmd
395 lines (322 loc) · 16.7 KB
/
04-selections.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
```{r load_dgr_04, include=FALSE, results=FALSE}
library(DiagrammeR)
```
```{r make_simple_property_graph, include=FALSE, results=FALSE}
property_graph <-
create_random_graph(
n = 4,
m = 6,
directed = TRUE,
set_seed = 23) %>%
select_nodes() %>%
set_node_attrs(
node_attr = "label",
values = c("sjd", "qhd", "i0p", "235")) %>%
set_node_attrs(
node_attr = "type",
values = c("A", "A", "B", "B")) %>%
select_edges() %>%
set_edge_attrs(
edge_attr = "rel",
values = c("X", "X", "Y", "Y", "Z", "Z")) %>%
set_edge_attrs(
edge_attr = "weight",
values = c(3.7, 8.3, 12.5, 1.3, 2.9, 2.4)) %>%
clear_selection()
```
# Selections {#selections}
Occasionally, you'll want to operate on a select group of nodes or edges. Some functions affect a single node or edge while others (or, sometimes, the same functions) operate on all nodes or edges in a graph. Selections allow you to target specified nodes or edges and then apply specialized functions to operate on just those selected entities. Most of the selection functions support rudimentary set operations across several calls of the selection functions (i.e., for the union, intersection, or difference between selected sets of nodes or edges).
## Setup for this Chapter
Ensure that the latest development build of **DiagrammeR** is installed. If the **devtools** package is not available in your **R** library, install it and then use the `install_github()` function to get the latest **DiagrammeR** build.
```{r install_packages_dgr_chp_04, eval=FALSE, include=TRUE, results=FALSE}
install.packages("devtools")
devtools::install_github("rich-iannone/DiagrammeR")
```
```{r library_chp_04, eval=FALSE, include=TRUE, results=FALSE}
library(DiagrammeR)
```
For this chapter, the graph `property_graph` has been created and it's a simple property graph of 4 nodes and 6 edges. Numerical values are assigned to the nodes as the `value` attribute and to the edges as the `weight` attribute. Nodes have unique `label` values, and, `type` group labels are available for nodes (as `A` and `B`) and `rel` values are set for the edges (as `X`, `Y`, and `Z`). This schematic provides a visual representation of the graph as well as the attributes and data associated with each node and edge.
<img src="diagrams/property_graph_chp_04.png">
This graph adheres to the concept of a property graph in that:
- all nodes have an assigned, non-`NA` `type` value
- all edges have an assigned, non-`NA` `rel` value
- the graph is a *directed* graph
We can always verify whether a graph satisfies these conditions with the `is_property_graph()` function, which either returns a `TRUE` or `FALSE` logical result.
```{r is_property_graph}
is_property_graph(property_graph)
```
Great! We'll use this graph in most of the chapter examples to demonstrate how selections work. Working with a property graph in such examples is especially useful for showing how more complicated selections can be accomplished (i.e., using set operations and/or conditional statements).
## The Selection Functions
The following table provides a summary of all the available `select_...()` functions available in **DiagrammeR**.
```{r table_select_functions, echo=FALSE, results='asis'}
library(knitr)
kable(
data.frame(
Function =
c(
"`select_nodes()`",
"`select_nodes_by_id()`",
"`select_last_nodes_created()`",
"`select_nodes_by_degree()`",
"`select_nodes_in_neighbourhood()`",
"`select_edges()`",
"`select_edges_by_edge_id()`",
"`select_edges_by_node_id()`",
"`select_last_edges_created()`",
"`select_rev_edges_ws()`"
),
Description =
c(
"Select nodes graph using filtering conditions.",
"Select nodes by their ID values.",
"Select the last group of nodes created in the graph.",
"Select nodes on the basis of node degree.",
"Select nodes based on a walk distance from a specified node.",
"Select nodes graph using filtering conditions.",
"Select edges by their ID values.",
"Select edges associated with specified node ID values.",
"Select the last group of edges created in the graph.",
"Select any reverse edges from a selection of edges."
)))
```
When any selection is performed using a `select_...()` function, the selection is stored in the graph object. We can always use `get_selection()` to verify this:
```{r select_nodes_get_selection}
# Select nodes `1` and `4` of
# `property_graph` and then return
# the node IDs for the selection
property_graph %>%
select_nodes(nodes = c(1, 4)) %>%
get_selection()
```
And indeed the selection we get back is the selection we asked for, nodes `1` and `4`. We can similarly get a selection of edge ID values and return a vector of edge IDs with `get_selection()`:
```{r select_nodes_get_selection_2}
# Select edges `1` and `4` of
# `property_graph` and then return
# the node IDs for the selection
property_graph %>%
select_edges_by_edge_id(edges = c(1, 4)) %>%
get_selection()
```
## Creating a Node Selection
Let's begin our in-depth look at how to select graph nodes using the `select_nodes()` function.
Selecting nodes in a graph with `select_nodes()` can be done in multiple ways:
- providing only a set of node ID values
- by providing one or more filtering statements to `conditions`
- doing both of the above, where each set of values/statements work toward filtering the nodes that will comprise the node selection
The first example using `select_nodes()` will pertain to the first case, where you know which node ID values should make up the node selection. To show the node selection with all associated metadata as a node data frame, we'll use 2 functions after the selection: `create_subgraph_ws()` (creates a subgraph, or subset, with only the selected nodes), and, `get_node_df()` (returns a node data frame for all nodes in the graph, now limited by the nodes in the selection).
```{r select_nodes_2_ids, eval=FALSE, include=TRUE}
# From `property_graph`, select nodes
# `1` and `3`; show the ndf based on
# the node selection
property_graph %>%
select_nodes(nodes = c(1, 3)) %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_2_ids_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(nodes = c(1, 3)) %>%
create_subgraph_ws() %>%
get_node_df()
)
```
The `select_nodes_by_id()` function serves as a plug-in replacement for `select_nodes()` when used exactly in this way. Its `nodes` argument simply takes a vector of node ID values. Passing in the same node ID values will result in the same node data frame as in the previous example.
```{r select_nodes_by_id_2_ids, eval=FALSE, include=TRUE}
# Select nodes `1` and `3` using the
# `select_nodes_by_id()` function
property_graph %>%
select_nodes_by_id(nodes = c(1, 3)) %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_by_id_2_ids_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes_by_id(nodes = c(1, 3)) %>%
create_subgraph_ws() %>%
get_node_df()
)
```
If you don't know the node ID values that should be part of a selection - and this is often the case in practice - you can use a vector of filtering statements as `conditions`. Some of these can be `conditions = "type == 'z'"` (selecting nodes where the `type` value is `z`), or `conditions = "value > 3.0"` (node selection where the `value` attribute is greater than `3.0`), or, `conditions = c("value < 2.0", "type %in% c('a', 'b', 'd')")` (all nodes with `value` less than `2.0` and having a `type` value of `a`, `b`, or `d`). Here's an example using an `==` operator to get all nodes of `type` `A`:
```{r select_nodes_type_A, eval=FALSE, include=TRUE}
# From our `property_graph`, select all
# nodes of type `A`
property_graph %>%
select_nodes(
conditions = "type == 'A'") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_type_A_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = "type == 'A'") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
Need more than one type? Use the familiar `%in%` operator to get nodes with a `type` of `A` or `B`.
```{r select_nodes_types_A_B, eval=FALSE, include=TRUE}
# From our `property_graph`, select all
# nodes of type `A`
property_graph %>%
select_nodes(
conditions = "type %in% c('A', 'B')") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_types_A_B_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = "type %in% c('A', 'B')") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
Regular expressions can also be used. Suppose we want nodes in this graph where the label only consists of 3 letters, and no numerals. To do this, we could use the **R** `grepl()` function with the regular expression (`^[a-z]{3}$`) first and then the attribute where matches are sought (in this case, `label`).
```{r select_nodes_cond_regex, eval=FALSE, include=TRUE}
property_graph %>%
select_nodes(conditions = "grepl('^[a-z]{3}$', label)") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_cond_regex_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(conditions = "grepl('^[a-z]{3}$', label)") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
Comparison operators such as `>`, `>=`, `==`, `<=`, `<`, `!=` could be used with numerical attributes to filter nodes. Here, we make a selection of nodes where any `value` is greater than `4.0`.
```{r select_nodes_cond_gt_numeric, eval=FALSE, include=TRUE}
# Now, get all nodes with a `value`
# greater than 4.0
property_graph %>%
select_nodes(
conditions = "value > 4.0") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_cond_gt_numeric_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = "value > 4.0") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
We can extend any conditions string to contain multiple filtering steps, linked by `&` (`AND`) or `|` (`OR`) operators. To select nodes that have a `value` less than `3.0` or a `value` greater than `7.0`, the following will work:
```{r select_nodes_or_cond, eval=FALSE, include=TRUE}
property_graph %>%
select_nodes(
conditions = "value < 3.0 | value > 7.0") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_or_cond_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = "value < 3.0 | value > 7.0") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
To get a selection of nodes that must satisfy all conditions, link expressions with `&`. The following example will filter nodes with a `value` greater than `4.0`, and, with a `type` of `A`:
```{r select_nodes_and_cond, eval=FALSE, include=TRUE}
property_graph %>%
select_nodes(
conditions = "value > 4.0 & type == 'A'") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_and_cond_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = "value > 4.0 & type == 'A'") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
A vector of `condition` statements creates a set of `AND` conditions, where all conditions need to be fulfilled. The above example could be reworked using a vector of `conditions` statements:
```{r select_nodes_and_cond_vector, eval=FALSE, include=TRUE}
property_graph %>%
select_nodes(
conditions = c("value > 4.0", "type == 'A'")) %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_and_cond_vector_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes(
conditions = c("value > 4.0", "type == 'A'")) %>%
create_subgraph_ws() %>%
get_node_df()
)
```
This retrieves the same selection and is useful for `AND`ing a sequence of filter statements.
The situation may arise when an even more specialized selection of nodes needs to be made. Or, you'd like to make a selection, do something with it, then modify the selection (and then do yet another thing). This is where the `set_op` argument (short for *set operation*) becomes useful. Essentially the first selection will disregard `set_op`--it's just making an initial selection--but the next selection will modify the previous one and how it does so depends on what's given for `set_op`.
These set operations are:
* `union` — creates a union of selected nodes in consecutive operations that create a selection of nodes (this is the default option)
* `intersect` — modifies the list of selected nodes such that only those nodes common to both consecutive node selection operations will retained
* `difference` — modifies the list of selected nodes such that the only nodes retained are those that are different in the second node selection operation compared to the first
This schematic describes what's happening in 2 consecutive `select_...()` function calls for each of the aforementioned set operations.
<img src="diagrams/set_ops_chp_04.png">
These set operations behave in exactly the same way as the base **R** functions `union()`, `intersect()`, and `setdiff()`. Furthermore, most of the `select_...()` functions contain the `set_op` argument, so, they behave the same way with regard to modifying the node or edge selection in a series of consecutive selection operations.
For our small property graph with 4 nodes (with IDs `1` through `4`), creating an initial selection of nodes with `select_nodes()` will result in the selection of all 4 nodes in the graph. A subsequent call of `select_nodes_by_id()` specifying `nodes = c(1, 3)` and `set_op = "difference"` will result in a selection of nodes `2` and `4` (the difference of removing `1` and `3` from the complete set). Let's verify the final selection by finishing with `get_selection()` this time, which outputs node ID values for the selected nodes.
```{r select_nodes_2x_w_difference}
property_graph %>%
select_nodes() %>%
select_nodes_by_id(
nodes = c(1, 3),
set_op = "difference") %>%
get_selection()
```
For a somewhat more realistic application, where the node IDs are not known, imagine a situation where you would want to select nodes of a certain degree but only those having a certain value range. Here, we do something to that effect, where we want to select nodes with an *outdegree* (number of outward edges from a given node) greater than `1` with the added condition of having a `value` greater than `4.0`. In this case, we want the intersection of these selections.
```{r select_nodes_by_degree_and_value, eval=FALSE, include=TRUE}
property_graph %>%
select_nodes_by_degree(
expressions = "outdeg > 1") %>%
select_nodes(
conditions = "value > 4.0",
set_op = "intersect") %>%
create_subgraph_ws() %>%
get_node_df()
```
```{r select_nodes_by_degree_and_value_kable, echo=FALSE}
knitr::kable(
property_graph %>%
select_nodes_by_degree(
expressions = "outdeg > 1") %>%
select_nodes(
conditions = "value > 4.0",
set_op = "intersect") %>%
create_subgraph_ws() %>%
get_node_df()
)
```
This returns a single value. Just so you know, using `set_op == "union"` would result in 3 nodes in the selection (`1`, `2`, and `4`), and, using `set_op == "difference"` would yield a selection with the node `2`.
Most often, selections will be performed without knowing the node ID values for the nodes to be selected but, instead, some attributes of the nodes to be selected and perhaps the nodes to exclude from that selection. And this sort of makes sense when you have thousands, hundreds of thousands, and maybe millions of nodes: you're not often going to visualize the entire graph at this point (much like you wouldn't usually display a table of even hundreds of rows).
## Creating an Edge Selection
We'll run through selections of edges in perhaps an abbreviated fashion since many of the concepts that apply to node selections also apply to edge selections. First off, recall that edges have ID values! This is great because edges can also be defined by their attachments to nodes. So an edge defined as `1->2` in a directed graph (where `1` and `2` are node ID values) can also be defined as an edge with ID `1` (in this hypothetical case). Let's have a look at the internal edge data frame within `property_graph` to remind ourselves of the edge ID values available in this graph:
```{r get_edge_df_all_edges, eval=FALSE, include=TRUE}
property_graph %>%
get_edge_df()
```
```{r get_edge_df_all_edges_kable, echo=FALSE}
knitr::kable(
property_graph %>%
get_edge_df()
)
```
The first column in the edge data frame is the `id` attribute. The next two are `from` and `to`, and `rel` is always the 4th attribute column. We can select edges using IDs (of both the edge and node varieties) with three different functions:
- `select_edges()`
- `select_edges_by_edge_id()`
- `select_edges_by_node_id()`