Replies: 5 comments
-
Since aggregating always means potentially losing data, I think we should not do that by default, but only on demand. So we keep the original data as intact as possible, and provide modules that can do aggregations that make sense for a specific data set. One thing that could make sense would be to create that 'auto-weight' column in the edges table.
-
I'm implementing this now, as a separate "flatten.network_data" module (module name open for suggestions). It turns out that actually implementing this is fairly simple; in most cases we can stick with a simple SQL query that uses a source/target 'group by', like:
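A minimal sketch of such a source/target 'group by', using Python's sqlite3 and assumed column names (source, target, weight — not kiara's actual schema), could look like:

```python
import sqlite3

# Hypothetical edges table with parallel edges; the column names are
# assumptions for illustration, not the module's real schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (source TEXT, target TEXT, weight INTEGER)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("a", "b", 1), ("a", "b", 2), ("b", "c", 5)],
)

# Collapse parallel edges: one row per source/target pair,
# aggregating the 'weight' column with SUM.
rows = con.execute(
    """
    SELECT source, target, SUM(weight) AS weight
    FROM edges
    GROUP BY source, target
    ORDER BY source, target
    """
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 5)]
```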
That assumes that the way we want to flatten/aggregate columns is available as an SQL function. The problem I'm having is figuring out how to ask the user to specify what they want to do. For example, the module could take as input a list of attribute names the user wants to keep, and we'd SUM a column up if its data type is numeric, and otherwise use LIST (if it's a string). That would be easiest in terms of input schema, and I assume it would be what users would most likely want. But it leaves out a lot of those edge cases @yaslena listed in this discussion thread. So, the other option would be a dict, where the column name to keep would be the key, and what to do with it the value, something like:
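To illustrate the dict-style input, here is a sketch; the key/value names and the helper below are hypothetical, not the module's actual schema:

```python
# Hypothetical per-column aggregation spec: column name -> aggregation function.
# The function names mirror SQL aggregates; 'LIST' would collect all values.
aggregation_spec = {
    "weight": "SUM",   # numeric column: sum up the values of parallel edges
    "label": "LIST",   # string column: keep all values as a list
}

def default_aggregation(dtype: type) -> str:
    """Pick a default when the user gives no explicit choice:
    SUM for numeric columns, LIST for everything else."""
    return "SUM" if dtype in (int, float) else "LIST"

print(default_aggregation(int))  # SUM
print(default_aggregation(str))  # LIST
```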
This would cover more use-cases, but would still fall short in some others. I have to think more about this, but if anyone has any ideas, I'm open to suggestions...
-
Ok, I think for the
-
So, bit of an update: after talking with @yaslena we decided to implement a module called 'network_data.redefine_edges', which can be used to:
The available aggregation functions are:
On the commandline, the module can be used like:
which would aggregate and 'move over' the 'weight' and 'time' edge attribute columns, applying the default aggregations, depending on the column type ('SUM' for numerical types, 'LIST' for others). An example with more control over what happens would be:
Here, we create a new edge attribute 'new_weight' by applying the 'COUNT' function against the old 'weight' column. The old weight column will not be moved over to the new network_data instance.
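In SQL terms, that 'new_weight' aggregation corresponds to something like the following sketch (column names assumed for illustration, not the module's actual query):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (source TEXT, target TEXT, weight INTEGER)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("a", "b", 1), ("a", "b", 4), ("a", "b", 2), ("b", "c", 7)],
)

# 'new_weight' counts the parallel edges per source/target pair;
# the old 'weight' column is simply not carried over.
rows = con.execute(
    """
    SELECT source, target, COUNT(weight) AS new_weight
    FROM edges
    GROUP BY source, target
    ORDER BY source, target
    """
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 1)]
```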
-
I just realized that it might be a good idea for this operation to also be able to 'flatten' a directed graph into an undirected one, by having an option to merge edges that not only have the same source/target combinations, but also target/source. If that option is enabled, the aggregations would apply to all of those, not just the ones with the same direction. @yaslena @CBurge95, would that be a use case you have encountered in the past? I'd imagine that for data that is undirected in nature, it would not make sense to have aggregations that depend on the order of source/target.
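A sketch of that direction-insensitive merging, assuming plain Python tuples for edges: normalize each source/target pair so that a→b and b→a land in the same group, then aggregate.

```python
from collections import defaultdict

edges = [("a", "b", 1), ("b", "a", 2), ("b", "c", 5)]

# Group by a direction-insensitive key: sorting the pair makes
# ('a', 'b') and ('b', 'a') identical.
groups = defaultdict(list)
for source, target, weight in edges:
    key = tuple(sorted((source, target)))
    groups[key].append(weight)

# Aggregate (here: SUM) across each merged group.
undirected = {key: sum(weights) for key, weights in groups.items()}
print(undirected)  # {('a', 'b'): 3, ('b', 'c'): 5}
```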
-
Removing parallel edges and data loss
Handling edge data is part of a larger issue related to re-structuring network data as a preprocessing step.
The kiara function "get_property_data" (called as "kiara data explain -p " on the command line) gives an overview of the number of edges in a graph depending on the graph type (undirected, directed, multi-directed) chosen.
In more complex cases, users need to decide what should happen to the edge data. The default behaviour in NetworkX is to keep the data of an arbitrary edge when parallel edges are discarded (when converting a multigraph into a directed or undirected graph). This is not always the desirable solution, because data is lost in an uncontrolled way.
No edge data
In the case that there is no edge data in the original dataset (only source & target columns), we could also offer to automatically calculate weights as "auto_weight".
If this is the input-table:
The new table would look like this:
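As a sketch of that 'auto_weight' idea in Python (counting parallel edges per source/target pair; the values are illustrative):

```python
from collections import Counter

# Input table: only source and target columns, with parallel edges.
edges = [("a", "b"), ("a", "b"), ("a", "c")]

# auto_weight: the number of parallel edges per source/target pair.
weighted = [
    (source, target, count)
    for (source, target), count in Counter(edges).items()
]
print(weighted)  # [('a', 'b', 2), ('a', 'c', 1)]
```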
Has edge data (int, str, datetime)
In the case when there is edge data in the original dataset, several solutions would be possible, and user input is required to decide what should happen to the other columns. Depending on the data type (int, str, datetime), some standard solutions could be offered.
Discard all parallel edges, keep all edge data
Users could decide to keep all the string values and aggregate the int values:
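A sketch of that combination (SUM for the int column, LIST for the string column; the column names and values are assumptions for illustration):

```python
from collections import defaultdict

# Edge rows: source, target, an int column and a string column.
edges = [
    ("a", "b", 1, "letter"),
    ("a", "b", 2, "email"),
    ("b", "c", 5, "letter"),
]

grouped = defaultdict(list)
for source, target, amount, kind in edges:
    grouped[(source, target)].append((amount, kind))

# SUM the int values, LIST the string values per source/target pair.
flattened = [
    (source, target, sum(a for a, _ in rows), [k for _, k in rows])
    for (source, target), rows in grouped.items()
]
print(flattened)  # [('a', 'b', 3, ['letter', 'email']), ('b', 'c', 5, ['letter'])]
```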
Keep some parallel edges based on edge data
Or they could decide that one of the columns contains a relevant edge distinction (string column):
The edges would then be aggregated when they contain the same value across the three columns Source, Target and string:
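A sketch of aggregating on all three columns (illustrative values): parallel edges are merged only when source, target and the string value all match.

```python
from collections import Counter

# Edges with a distinguishing string column.
edges = [
    ("a", "b", "letter"),
    ("a", "b", "letter"),
    ("a", "b", "email"),
]

# Grouping on the full (source, target, string) triple keeps
# 'letter' and 'email' edges as separate rows.
aggregated = [
    (source, target, kind, count)
    for (source, target, kind), count in Counter(edges).items()
]
print(aggregated)  # [('a', 'b', 'letter', 2), ('a', 'b', 'email', 1)]
```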
Discard parallel edges, deal with datetime values
A common solution would be to take the earliest timeA and the latest timeB from the A-B pair when aggregating rows that contain datetime values.
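A sketch of that earliest-timeA/latest-timeB aggregation, assuming Python date values and illustrative column names:

```python
from collections import defaultdict
from datetime import date

# Each edge carries a timeA/timeB pair, as described above.
edges = [
    ("a", "b", date(1800, 1, 1), date(1800, 6, 1)),
    ("a", "b", date(1799, 3, 1), date(1801, 2, 1)),
]

grouped = defaultdict(list)
for source, target, time_a, time_b in edges:
    grouped[(source, target)].append((time_a, time_b))

# Earliest timeA (MIN), latest timeB (MAX) per source/target pair.
merged = [
    (s, t, min(a for a, _ in rows), max(b for _, b in rows))
    for (s, t), rows in grouped.items()
]
print(merged)  # [('a', 'b', datetime.date(1799, 3, 1), datetime.date(1801, 2, 1))]
```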