Replies: 5 comments
-
Since aggregating always means potentially losing data, I think we should not do that by default, but only on demand. So we keep the original data as intact as possible, and provide modules that can do aggregations that make sense for a specific data set. One thing that could make sense would be to create that 'auto-weight' column in the edges table.
-
I'm implementing this now, as a separate "flatten.network_data" module (module name open for suggestions). It turns out that actually implementing this is fairly simple; in most cases we can stick with a simple SQL query that uses a source/target 'group by', like:
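A minimal sketch of such a source/target 'group by', using Python's sqlite3 and assumed column names (source, target, weight — not kiara's actual schema), could look like:

```python
import sqlite3

# Hypothetical edges table with parallel edges; the column names are
# assumptions for illustration, not the module's real schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (source TEXT, target TEXT, weight INTEGER)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("a", "b", 1), ("a", "b", 2), ("b", "c", 5)],
)

# Collapse parallel edges: one row per source/target pair,
# aggregating the 'weight' column with SUM.
rows = con.execute(
    """
    SELECT source, target, SUM(weight) AS weight
    FROM edges
    GROUP BY source, target
    ORDER BY source, target
    """
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 5)]
```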
That assumes that the way we want to flatten/aggregate columns is available as an SQL function. The problem I'm having is figuring out how to ask the user to specify what they want to do. For example, the module could take as input a list of attribute names the user wants to keep, and we'd SUM a column up if its data type is numeric, and otherwise use LIST (if it's a string). That would be easiest in terms of input schema, and I assume it would be what users would most likely want. But it leaves out a lot of those edge cases @yaslena listed in this discussion thread. So, the other option would be a dict, where the column name to keep would be the key, and what to do with it the value, something like:
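To illustrate the dict-style input, here is a sketch; the key/value names and the helper below are hypothetical, not the module's actual schema:

```python
# Hypothetical per-column aggregation spec: column name -> aggregation function.
# The function names mirror SQL aggregates; 'LIST' would collect all values.
aggregation_spec = {
    "weight": "SUM",   # numeric column: sum up the values of parallel edges
    "label": "LIST",   # string column: keep all values as a list
}

def default_aggregation(dtype: type) -> str:
    """Pick a default when the user gives no explicit choice:
    SUM for numeric columns, LIST for everything else."""
    return "SUM" if dtype in (int, float) else "LIST"

print(default_aggregation(int))  # SUM
print(default_aggregation(str))  # LIST
```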
This would cover more use-cases, but would still fall short in some others. I have to think more about this, but if anyone has any ideas, I'm open to suggestions...
-
Ok, I think for the
-
So, bit of an update: after talking with @yaslena we decided to implement a module called 'network_data.redefine_edges', which can be used to:
The available aggregation functions are:
On the commandline, the module can be used like:
which would aggregate and 'move over' the 'weight' and 'time' edge attribute columns, applying the default aggregations, depending on the column type ('SUM' for numerical types, 'LIST' for others). An example with more control over what happens would be:
Here, we create a new edge attribute 'new_weight' by applying the 'COUNT' function against the old 'weight' column. The old weight column will not be moved over to the new network_data instance.
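In SQL terms, that 'new_weight' aggregation corresponds to something like the following sketch (column names assumed for illustration, not the module's actual query):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (source TEXT, target TEXT, weight INTEGER)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("a", "b", 1), ("a", "b", 4), ("a", "b", 2), ("b", "c", 7)],
)

# 'new_weight' counts the parallel edges per source/target pair;
# the old 'weight' column is simply not carried over.
rows = con.execute(
    """
    SELECT source, target, COUNT(weight) AS new_weight
    FROM edges
    GROUP BY source, target
    ORDER BY source, target
    """
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 1)]
```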
-
I just realized that it might be a good idea for this operation to also be able to 'flatten' a directed graph into an undirected one, by having an option to merge edges that not only have the same source/target combinations, but also target/source. If that option is enabled, the aggregations would apply to all of those, not just the ones with the same direction. @yaslena @CBurge95, would that be a use case you have encountered in the past? I'd imagine that for data that is undirected in nature, it would not make sense to have aggregations that depend on the order of source/target.
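A sketch of that direction-insensitive merging, assuming plain Python tuples for edges: normalize each source/target pair so that a→b and b→a land in the same group, then aggregate.

```python
from collections import defaultdict

edges = [("a", "b", 1), ("b", "a", 2), ("b", "c", 5)]

# Group by a direction-insensitive key: sorting the pair makes
# ('a', 'b') and ('b', 'a') identical.
groups = defaultdict(list)
for source, target, weight in edges:
    key = tuple(sorted((source, target)))
    groups[key].append(weight)

# Aggregate (here: SUM) across each merged group.
undirected = {key: sum(weights) for key, weights in groups.items()}
print(undirected)  # {('a', 'b'): 3, ('b', 'c'): 5}
```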
-
Removing parallel edges and data loss
Handling edge data is part of a larger issue related to re-structuring network data as a preprocessing step.
The kiara function "get_property_data" (called as "kiara data explain -p " on the command line) gives an overview of the number of edges in a graph depending on the graph type (undirected, directed, multi-directed) chosen.
In more complex cases, users need to decide what should happen to the edge data. The default behaviour in NetworkX is to keep the data of an arbitrary edge when parallel edges are discarded (when converting a multigraph into a directed or undirected graph). This is not always the desirable solution, because data is lost in an uncontrolled way.
No edge data
In the case that there is no edge data in the original dataset (only source & target columns), we could also offer to automatically calculate weights as "auto_weight".
If this is the input-table:
The new table would look like this:
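As a sketch of that 'auto_weight' idea in Python (counting parallel edges per source/target pair; the values are illustrative):

```python
from collections import Counter

# Input table: only source and target columns, with parallel edges.
edges = [("a", "b"), ("a", "b"), ("a", "c")]

# auto_weight: the number of parallel edges per source/target pair.
weighted = [
    (source, target, count)
    for (source, target), count in Counter(edges).items()
]
print(weighted)  # [('a', 'b', 2), ('a', 'c', 1)]
```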
Has edge data (int, str, datetime)
In the case when there is edge data in the original dataset, several solutions would be possible, and user input is required to decide what should happen to the other columns. Depending on the data type (int, str, datetime), some standard solutions could be offered.
Discard all parallel edges, keep all edge data
Users could decide to keep all the string values and aggregate the int values:
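A sketch of that combination (SUM for the int column, LIST for the string column; the column names and values are assumptions for illustration):

```python
from collections import defaultdict

# Edge rows: source, target, an int column and a string column.
edges = [
    ("a", "b", 1, "letter"),
    ("a", "b", 2, "email"),
    ("b", "c", 5, "letter"),
]

grouped = defaultdict(list)
for source, target, amount, kind in edges:
    grouped[(source, target)].append((amount, kind))

# SUM the int values, LIST the string values per source/target pair.
flattened = [
    (source, target, sum(a for a, _ in rows), [k for _, k in rows])
    for (source, target), rows in grouped.items()
]
print(flattened)  # [('a', 'b', 3, ['letter', 'email']), ('b', 'c', 5, ['letter'])]
```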
Keep some parallel edges based on edge data
Or they could decide that one of the columns contains a relevant edge distinction (string column):
The edges would then be aggregated when they contain the same value across the three columns Source, Target and string:
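A sketch of aggregating on all three columns (illustrative values): parallel edges are merged only when source, target and the string value all match.

```python
from collections import Counter

# Edges with a distinguishing string column.
edges = [
    ("a", "b", "letter"),
    ("a", "b", "letter"),
    ("a", "b", "email"),
]

# Grouping on the full (source, target, string) triple keeps
# 'letter' and 'email' edges as separate rows.
aggregated = [
    (source, target, kind, count)
    for (source, target, kind), count in Counter(edges).items()
]
print(aggregated)  # [('a', 'b', 'letter', 2), ('a', 'b', 'email', 1)]
```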
Discard parallel edges, deal with datetime values
A common solution would be to take the earliest timeA and the latest timeB from the A-B pair when aggregating rows that contain datetime values.
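A sketch of that earliest-timeA/latest-timeB aggregation, assuming Python date values and illustrative column names:

```python
from collections import defaultdict
from datetime import date

# Each edge carries a timeA/timeB pair, as described above.
edges = [
    ("a", "b", date(1800, 1, 1), date(1800, 6, 1)),
    ("a", "b", date(1799, 3, 1), date(1801, 2, 1)),
]

grouped = defaultdict(list)
for source, target, time_a, time_b in edges:
    grouped[(source, target)].append((time_a, time_b))

# Earliest timeA (MIN), latest timeB (MAX) per source/target pair.
merged = [
    (s, t, min(a for a, _ in rows), max(b for _, b in rows))
    for (s, t), rows in grouped.items()
]
print(merged)  # [('a', 'b', datetime.date(1799, 3, 1), datetime.date(1801, 2, 1))]
```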