Replies: 6 comments
-
General Remarks:
Degree Centrality:
Betweenness Module:
Eigenvector Module:
Closeness Module:
Extra Thoughts:
-
Preparing for Berlin:
The only issue that needs adjustment in code before the Berlin workshop is the directed/undirected graph type parameter for betweenness, closeness, and eigenvector centralities. I think our solution to offer calculations only for the directed graph type makes sense at this point, because it matches the properties of the sample data we are going to use at the workshop.
Default parameters:
In contrast, if we were to work with default parameters, I would go for undirected graph type input. One can almost always argue that the undirected version is the most abstract and simplified version of the original directed graph or multigraph, but this does not work the other way round. Values calculated on a directed version of a graph that is undirected in nature will always be just wrong. Not optional, or alternative, or "difficult to interpret", but simply wrong.
Naming and grouping modules:
We will be changing the modules' names, and how they can be chained or grouped together, a lot before we are finished. The more modules we write, the more we will have to do this. Calling something a "centrality module" will raise certain expectations among users who are familiar with the terminology. Proper naming will also make it easier for everyone to find the desired methods.
Documentation vs. user guidance:
Neither Gephi nor networkX has functionality to advise users to run algorithms and functions in a certain sequence. A lot of the remarks above are written with this guidance feature in mind. We could always write better documentation for our modules, but making notes on how the modules can be meaningfully chained will help us later when we build the GUI. That's why the "determine graph type property" function is mentioned everywhere, as well as the "extract largest component" module in connection with the closeness centrality module. They would be good choices to run before the centrality modules.
-
Ok, just to confirm I understand degree and degree_centrality right. Currently, every
Directed or undirected should not matter in this instance, right? Am I right in understanding that those values equal the 'unweighted degree' value you are talking about above? Since it can be calculated when creating the network_data instance, should I also add a '_degree_centrality' and a '_degree_centrality_multi' column to every network_data nodes table, which takes the max number in the respective count column and calculates the fraction for each row? We also have
For weighted degree we need a separate module, because we need user input to tell us which attribute contains the 'weight'. But I guess we would also compute the values for both the non-multigraph and multigraph interpretations of the data?
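To make the multi vs. non-multi distinction concrete, here is a minimal sketch assuming networkX (the library used elsewhere in this thread); the toy edge list and node names are invented:

```python
import networkx as nx

# The same edge list counted under a multigraph interpretation
# vs. a simple (non-multi) directed graph interpretation.
edges = [('a', 'b'), ('a', 'b'), ('b', 'c')]

mg = nx.MultiDiGraph(edges)
sg = nx.DiGraph(edges)  # parallel 'a'->'b' edges collapse into one

print(mg.out_degree('a'))  # counts both parallel edges -> 2
print(sg.out_degree('a'))  # parallel edges collapsed -> 1
```

Both counts are derived from the same stored edges, which is why they can in principle be computed at network_data creation time.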
-
I have some initial thoughts and questions about these updates. Is it necessary to calculate the edge count at the network data creation stage? To me this seems questionable, at least in terms of assigning it as a node attribute, given a) that there are a number of decisions that need to be made around calculating degree, and b) this begins to look like 'automated' decisions being made on behalf of the researcher, without them actively selecting it. Degree and weighted degree will both look different depending on whether the graph is directed or not, a multigraph or not, the edge merge strategy, etc. Assigning 'scores' for all of these possibilities at the start would be both redundant and unnecessarily expensive computationally.
Degree (unweighted degree) counts the total number of edges going in or out of a node, irrespective of graph type. Obviously the result may differ depending on which graph type is selected, but the 'algorithm' is the same. In a directed/multi-directed network, this will also return in-degree and out-degree.
Weighted degree sums the weights assigned to the edges attached to a node. Each edge is automatically assigned a weight of 1; if there are no other means of assigning weight to an edge, weighted degree will be the same as degree. Weights may result from pre-assigned values (which should be there at import) or from the edge merge strategy, if there are parallel edges and this has been selected. 'Normalisation' would ideally be offered, but not auto-calculated or (especially for degree) returned in place of the raw numbers.
In terms of a UI (@caro401), it makes most sense for the network graph selection to happen at the beginning, along with an edge merge strategy, which can then be carried forward into the rest of the centralities. This means all decisions about graph type/edge weight can be stored as a decision (in the metadata of network_data?) that is actively made and logged in the lineage, and makes for less repeated user input. We can make the modules work for every type of graph, but not run/return all of this information when it doesn't make sense for a research question or data type.
If a user wants to change the graph type and run the calculations again, it would make sense for this to become a different 'branch' of the lineage, as these are significant decisions to re-make. It is also a means of avoiding the re-running and overwriting of material: anything that is changed wouldn't replace what existed but would 'branch' off in terms of lineage. However, this is also a point where the notes would be really helpful, to encourage users to document what decisions they are making and why!
-
Yes, this is something @yaslena and I talked about in depth, and that is the solution we came up with. We specifically don't call this column 'degree' or whatever, but
There is no decision involved, and those 2 (6 if you count the
The reasons those columns are always there and auto-computed for each
I think this lines up with all the columns I create, but I'll have to re-check the code. It's been a while...
Yes, since weight needs user input, there is no way we could meaningfully auto-compute it anyway. That's why there'll be a separate module to do that.
If one of the counts I auto-calculate matches up with unweighted degree (again, unsure there), then I can also auto-calculate the normalisation at
For weighted degree, I'd still rather add 2 columns (raw & normalized) instead of just one, unless there is an obvious reason why that is not wanted? Again, we touch the data anyway, so there is no reason not to attach 2 columns instead of one. It's not expensive, those columns aren't huge in terms of byte size, and we potentially save the user from needing the same module with 2 different inputs. The result columns would be named something like
As I said, the (current)
This whole thing is a good example of the difficulty surrounding the creation of interfaces for modules and data types I was talking about; a lot of the issues that crop up are really hard to predict without having a good number of use cases and starting to work on some of the modules. And it's definitely a different style of programming than one would do in a Jupyter notebook. I can elaborate more on why I took the decisions I took in this instance if you want. I also expect some more development and iterating over
I'm not sure I understand, but this is not how lineage works. Lineage only contains one specific, static set of decisions: those that led to the data that contains it. It does not contain a set of alternate decisions you could have made instead of some of them. Those decisions would be included in a totally separate lineage tree contained in another value (one that was created using that different set of decisions). So in the sense you use the word, there are no 'branches' like that in the lineage tree (I think).
-
Most things have been addressed already, but I will try to refer specifically to the questions above:
Since you are counting incoming and outgoing edges separately in those columns, directed or undirected does indeed not matter here.
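The point about separate incoming and outgoing counts can be sketched as follows, assuming networkX (the toy graph is invented; the actual column names in network_data are not specified here):

```python
import networkx as nx

G = nx.MultiDiGraph([('a', 'b'), ('a', 'b'), ('b', 'c')])

# Incoming and outgoing edges are counted separately per node, so the
# directed/undirected question does not enter into these raw counts.
in_counts = dict(G.in_degree())    # {'a': 0, 'b': 2, 'c': 1}
out_counts = dict(G.out_degree())  # {'a': 2, 'b': 1, 'c': 0}
```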
No, that is not exactly true. As Caitlin writes above, weighted degree can result from pre-assigned values or from merging parallel edges. Your
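A minimal illustration of weighted degree with pre-assigned weights, assuming networkX (toy graph invented; edges without a weight attribute count as 1):

```python
import networkx as nx

G = nx.Graph()
G.add_edge('a', 'b', weight=3)  # pre-assigned weight (e.g. present at import)
G.add_edge('a', 'c')            # no weight attribute -> counts as 1

print(G.degree('a'))                   # plain degree: 2
print(G.degree('a', weight='weight'))  # weighted degree: 3 + 1 = 4
```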
This is not how normalization works in graphs, so if you calculate that number, it should not be called '_degree_centrality', because that term is reserved for a different kind of normalization. While it is true that degree centrality 'is the fraction of nodes it is connected to', in graphs we usually don't take the max existing degree as the reference but the max possible number of edges a node could have, which can be much higher than the max existing (especially in very large graphs, where it is unrealistic that the graph would be 'full').
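The difference between the two normalizations can be seen on a small example, assuming networkX (whose degree_centrality divides by n - 1, the maximum possible degree):

```python
import networkx as nx

G = nx.path_graph(4)  # nodes 0-1-2-3, degrees: 1, 2, 2, 1

# Standard degree centrality: divide by the maximum *possible* degree (n - 1).
dc = nx.degree_centrality(G)
print(dc[1])  # 2 / 3, about 0.667

# Dividing by the maximum *observed* degree gives a different number,
# which should not be called degree centrality.
max_observed = max(d for _, d in G.degree())
print(G.degree(1) / max_observed)  # 2 / 2 = 1.0
```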
I'm not sure about that. You can always calculate fractions based on the numbers in a row, but that might just make the attributes table larger and add more confusion, because it is not a standard network analysis operation.
-
General remarks (performance & modularity @makkus )
All of the modules have in common that, for multigraphs, they aggregate parallel edges, calculate weights, and then compute the weighted value (weighted degree, weighted betweenness) with those values. This may not be the most efficient way to write these modules, since the operation is repeated for every single centrality value. One option would be to aggregate edges and calculate weights in a separate module (only once!) and then use different inputs for the weighted and unweighted calculations in the centrality modules.
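The proposed one-off aggregation step could look roughly like this, assuming networkX (the function name and toy data are hypothetical, not part of any existing module):

```python
import networkx as nx

def aggregate_parallel_edges(mg: nx.MultiGraph) -> nx.Graph:
    """Collapse parallel edges into a simple weighted graph, summing
    per-edge weights (edges without a weight attribute count as 1),
    so downstream centrality modules can reuse it instead of
    re-aggregating for every measure."""
    G = nx.Graph()
    G.add_nodes_from(mg.nodes(data=True))
    for u, v, data in mg.edges(data=True):
        w = data.get('weight', 1)
        if G.has_edge(u, v):
            G[u][v]['weight'] += w
        else:
            G.add_edge(u, v, weight=w)
    return G

mg = nx.MultiGraph([('a', 'b'), ('a', 'b'), ('b', 'c')])
G = aggregate_parallel_edges(mg)
print(G['a']['b']['weight'])  # 2
```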
Degree Ranking module
Although the module is listed among the centralities, it does not in fact compute degree centrality but only node degree and weighted node degree, i.e. the values are not normalized as in the other measures. Normalization is not well defined for multigraphs and graphs with self-loops (because the maximum possible degree of a node is not bounded by the number of nodes in these graphs).
Compare networkX documentation on degree and degree centrality.
We could just add degree centrality to the current module or move degree and weighted degree calculations somewhere else for more consistency.
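The normalization problem for multigraphs can be demonstrated directly, assuming networkX (toy graph invented): with parallel edges, the usual division by n - 1 can produce values above 1.

```python
import networkx as nx

# Three parallel edges between two nodes: degree of 'a' is 3,
# but n - 1 is only 1, so the "normalized" value exceeds 1.
mg = nx.MultiGraph([('a', 'b'), ('a', 'b'), ('a', 'b')])
print(nx.degree_centrality(mg)['a'])  # 3.0
```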
Betweenness module
In this module betweenness is calculated for nodes of the undirected graph and weighted betweenness is calculated for nodes of the directed version of the input graph. This is problematic for all graph types. If the input graph is undirected, then there is no reason to convert it to directed and calculate weighted betweenness values on the directed version. The weighted betweenness values will be very misleading in this case, because edge directionality massively affects shortest path calculations which are the basis for this centrality measure. (See this sample data from a related discussion as an example. The Anna node should always have a betweenness score of around 0.7/0.8 in this undirected network. However, if the graph is converted to directed, all of Anna's edges will be interpreted as "outgoing" connections. As a result, there will be no shortest paths passing through this node at all! It will be assigned the lowest weighted betweenness value (= 0) in the entire network.)
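The effect described here can be reproduced with a minimal star network, assuming networkX (this is an invented toy graph, not the sample data from the linked discussion, so the exact scores differ):

```python
import networkx as nx

# A tiny undirected star: Anna connects B, C, and D, so every shortest
# path between the leaves runs through her.
G = nx.Graph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(nx.betweenness_centrality(G)['Anna'])  # maximal (close to 1.0)

# Interpreting the same edges as directed turns them all into outgoing
# connections from Anna: no shortest path passes through her any more.
D = nx.DiGraph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(nx.betweenness_centrality(D)['Anna'])  # 0.0
```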
The converse is also true: if the input graph is directed, then the module will first convert it to undirected and then calculate betweenness values for all nodes. Those values will not represent shortest-path betweenness for the original directed graph and may be very misleading.
The same holds true for closeness centrality which also relies on shortest paths.
One solution would be to add graph type as an input for this module. Ideally, we could prompt the user to determine graph type first by using kiara's properties function of the network data.
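A hypothetical sketch of what a betweenness module with an explicit graph type input might look like, assuming networkX (the function name, parameter name, and toy data are all invented, not the actual module interface):

```python
import networkx as nx

def betweenness_module(G: nx.DiGraph, graph_type: str) -> dict:
    """Compute betweenness for the graph type the user explicitly chose,
    instead of silently converting between directed and undirected."""
    if graph_type == 'undirected':
        G = G.to_undirected()
    elif graph_type != 'directed':
        raise ValueError(f"unknown graph type: {graph_type!r}")
    return nx.betweenness_centrality(G)

D = nx.DiGraph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(betweenness_module(D, 'directed')['Anna'])    # 0.0
print(betweenness_module(D, 'undirected')['Anna'])  # maximal (close to 1.0)
```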
Eigenvector module
For directed graphs, this module assigns high eigenvector centrality to nodes connected to nodes that have high in-degree values.
There is an option in networkX to base eigenvector centrality on out-degree values instead, by reversing the edge direction.
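The reversal option can be sketched like this, assuming networkX (the toy graph is invented and chosen to be strongly connected so that the power iteration converges):

```python
import networkx as nx

# A small strongly connected directed graph (not the workshop data).
G = nx.DiGraph([('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('a', 'c')])

# Default: eigenvector centrality driven by incoming edges.
in_based = nx.eigenvector_centrality(G, max_iter=1000)

# Reversing edge direction bases the scores on outgoing edges instead.
out_based = nx.eigenvector_centrality(G.reverse(), max_iter=1000)
```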
Closeness module
Some of these algorithms were not originally designed to work on disconnected graphs. In networkX's closeness centrality there is logic that allows values to be calculated on disconnected graphs: the algorithm computes the closeness centrality for each connected part separately, scaled by that part's size. However, this should be made explicit in the module. The user might also be prompted to investigate graph connectivity before running this module and to use the largest component instead of the whole network.
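The per-component scaling and the largest-component alternative can be seen on a small example, assuming networkX (toy graph invented):

```python
import networkx as nx

# Two components: networkX scores each node within its own component,
# scaled by that component's share of the whole graph.
G = nx.Graph([('a', 'b'), ('b', 'c'), ('x', 'y')])
print(nx.closeness_centrality(G)['b'])  # 0.5 (scaled down by component size)

# Running only on the largest connected component removes that scaling.
largest = max(nx.connected_components(G), key=len)
print(nx.closeness_centrality(G.subgraph(largest))['b'])  # 1.0
```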