Replies: 3 comments 1 reply
-
Interesting, yes. I'd like to know how 'generic' we can get here, i.e.: for as large a number of (tabular) input data sets out in the wild as possible, what sort of input would we typically need from a user so we can reliably convert (qualify) those tables into edge, edge-attribute, and node-attribute data? The example here is very specific to a particular data set: the user would need to qualify what the unique node id is (solder_id), and which attributes connect nodes and in what way (unit_id & date range). The latter would also translate into edge attributes, I guess. Anyway, something to look for in our collected example datasets. It might turn out we will need one SQL query per dataset, but hopefully it'll be more like us finding 4 or 5 general scenarios that cover 80% of the data we can typically expect.
-
As an example: if a dataset has an obvious, unique node identifier as well as a start and end date, we could always create network data where the edges are overlaps in time between two nodes. Or, if we had longitude/latitude, we should be able to create network data where the edges are derived from how close two nodes were (with maybe a cut-off for when two nodes are not considered 'close' -- this could also go into the edge weight for such a graph). How much sense that makes, in most cases, is another question of course, and one we should also strive to get a handle on.
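To make the first scenario concrete, here is a minimal sketch of the overlap-in-time idea, assuming only a node id plus start/end dates (the record values and field layout are made up for illustration): every pair of nodes whose date ranges overlap gets an edge, with the overlap length as the edge weight.

```python
from itertools import combinations
from datetime import date

# Hypothetical records: (node_id, start, end) -- names and values are assumptions.
nodes = [
    ("a", date(2020, 1, 1), date(2020, 3, 1)),
    ("b", date(2020, 2, 1), date(2020, 4, 1)),
    ("c", date(2020, 5, 1), date(2020, 6, 1)),
]

def overlap_days(s1, e1, s2, e2):
    """Number of days two [start, end] ranges overlap (0 if disjoint)."""
    return max((min(e1, e2) - max(s1, s2)).days, 0)

# One edge per pair of nodes with overlapping ranges; overlap length as weight.
edges = [
    (u[0], v[0], overlap_days(u[1], u[2], v[1], v[2]))
    for u, v in combinations(nodes, 2)
    if overlap_days(u[1], u[2], v[1], v[2]) > 0
]
# edges == [("a", "b", 29)] -- only a and b overlap (Feb 1 to Mar 1, 2020)
```

The same pairwise pattern would work for the longitude/latitude case, with a distance function and cut-off in place of `overlap_days`.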
-
How do existing tools help with this? Does anyone have any examples?
-
Extracting edge data from one or several tables can be part of data wrangling or preprocessing before starting a graph creation workflow. Not all data comes in the perfectly parsed and structured edge-list format that is usually required to create a graph with a network analysis library (NetworkX, igraph, NetworKit, ...).
One of the Jupyter notebooks for the network analysis workflow already details edge extraction from a denormalized database output: Network Analysis with NetworkX. There, the extraction is done using groupby (to extract unique source-target pairs) and an aggregate function (to count the weights).
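In pandas terms, that groupby-and-aggregate step can be sketched roughly as follows (the column names and toy data are assumptions, not taken from the notebook):

```python
import pandas as pd

# Toy denormalized table: one row per observed source-target occurrence.
df = pd.DataFrame({
    "source": ["a", "a", "a", "b"],
    "target": ["b", "b", "c", "c"],
})

# Unique source-target pairs; the occurrence count becomes the edge weight.
edges = (
    df.groupby(["source", "target"])
      .size()
      .reset_index(name="weight")
)
# resulting rows: (a, b, 2), (a, c, 1), (b, c, 1)
```

The result is a standard weighted edge list that e.g. `networkx.from_pandas_edgelist` can consume directly.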
Sometimes an extraction operation is more complex, and in such cases it might be more useful to use SQL. (See also this discussion: #10)
The following case could serve as a template for a variety of extraction problems. In this example, a new network between the items in the first column (here: participants) is created based on a common observation in another column (a common event) and an overlapping time range (start_date : end_date).
Orig. table
New table
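The same template can be expressed as a self-join in pandas: join the table to itself on the common event, then keep only the pairs whose date ranges overlap. This is a sketch under assumed column names (`participant`, `event`, `start_date`, `end_date`) and toy data, not the actual table from the example.

```python
import pandas as pd

# Toy version of the original table (column names and values are assumptions).
obs = pd.DataFrame({
    "participant": ["p1", "p2", "p3"],
    "event":       ["e1", "e1", "e1"],
    "start_date":  pd.to_datetime(["2020-01-01", "2020-02-01", "2020-05-01"]),
    "end_date":    pd.to_datetime(["2020-03-01", "2020-04-01", "2020-06-01"]),
})

# Self-join on the common event; suffixes distinguish the two sides of each pair.
pairs = obs.merge(obs, on="event", suffixes=("_a", "_b"))

# Keep each unordered pair once (drops self-pairs and mirrored duplicates).
pairs = pairs[pairs["participant_a"] < pairs["participant_b"]]

# Two ranges overlap iff each starts before the other ends.
overlap = (pairs["start_date_a"] <= pairs["end_date_b"]) & \
          (pairs["start_date_b"] <= pairs["end_date_a"])
edges = pairs.loc[overlap, ["participant_a", "participant_b", "event"]]
# one edge: p1 -- p2 via e1 (p3's range does not overlap the others)
```

An SQL version would follow the same shape: a self-join on the event column with the two range conditions in the WHERE clause.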
Example cases that partly fit this schema can be found here:
https://stackoverflow.com/questions/64636430/projecting-dynamic-bi-partite-two-mode-network-where-only-edges-overlapping-in-t
https://stackoverflow.com/questions/53824502/how-can-i-select-the-employees-who-worked-the-longest-time-together-on-one-proje
https://stackoverflow.com/questions/7486144/efficient-projection-of-a-bipartite-graph-in-python-using-networkx
https://stackoverflow.com/questions/58396361/find-overlap-time-ranges
Python solution
SQL solution (Markus)