Replies: 6 comments
-
General Remarks:
Degree Centrality:
Betweenness Module:
Eigenvector Module:
Closeness Module:
Extra Thoughts:
-
Preparing for Berlin:
The only issue that needs adjustment in code before the Berlin workshop is the directed/undirected graph type parameter for betweenness, closeness, and eigenvector centralities. I think our solution to offer calculations only for the directed graph type makes sense at this point, because it matches the properties of the sample data we are going to use at the workshop.
Default parameters:
In contrast, if we were to work with default parameters, I would go for undirected graph type input. One can almost always argue that the undirected version is the most abstract and simplified version of the original directed graph or multigraph, but this does not work the other way round. Values calculated on a directed version of a graph that is undirected in nature will always be just wrong. Not optional, or alternative, or "difficult to interpret", but simply wrong.
Naming and grouping modules:
We will be changing the modules' names, and how they can be chained or grouped together, a lot before we are finished. The more modules we write, the more we will have to do this. Calling something a "centrality module" will raise certain expectations among users who are familiar with the terminology. Proper naming will also make it easier for everyone to find the desired methods.
Documentation vs. user guidance:
Neither Gephi nor networkX has functionality to advise users to run algorithms and functions in a certain sequence. A lot of the remarks above are written with this guidance feature in mind. We could always write better documentation for our modules, but making notes on how the modules can be meaningfully chained will help us later when we build the GUI. That's why the "determine graph type property" function is mentioned everywhere, as well as the "extract largest component" module in connection with the closeness centrality module. They would be good choices to run before the centrality modules.
-
Ok, just to confirm I understand degree and degree_centrality right. Currently, every
Directed or undirected should not matter in this instance, right? Am I right in understanding that those values equal the 'unweighted degree' value you are talking about above? Since it can be calculated when creating the network_data instance, should I also add a '_degree_centrality' and a '_degree_centrality_multi' column to every network_data nodes table, which takes the max number in the respective count column and calculates the fraction for each row? We also have
For weighted degree we need a separate module, because we need user input to tell us which attribute contains the 'weight'. But I guess we would also compute the values for both the non-multigraph and multigraph interpretations of the data?
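To make the multi vs. non-multi distinction concrete, here is a minimal sketch assuming networkX (the library used elsewhere in this thread); the toy edge list and node names are invented:

```python
import networkx as nx

# The same edge list counted under a multigraph interpretation
# vs. a simple (non-multi) directed graph interpretation.
edges = [('a', 'b'), ('a', 'b'), ('b', 'c')]

mg = nx.MultiDiGraph(edges)
sg = nx.DiGraph(edges)  # parallel 'a'->'b' edges collapse into one

print(mg.out_degree('a'))  # counts both parallel edges -> 2
print(sg.out_degree('a'))  # parallel edges collapsed -> 1
```

Both counts are derived from the same stored edges, which is why they can in principle be computed at network_data creation time.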
-
I have some initial thoughts and questions about these updates. Is it necessary to calculate the edge count at the network data creation stage? To me this seems questionable, at least in terms of assigning it as a node attribute, given a) that there are a number of decisions that need to be made around calculating degree, and b) this begins to look like 'automated' decisions being made on behalf of the researcher, without them actively selecting it. Degree and weighted degree will both look different depending on whether the graph is directed or not, a multigraph or not, the edge merge strategy, etc. Assigning 'scores' for all of these possibilities at the start would be both redundant and unnecessarily expensive computationally.
Degree (unweighted degree) counts the total number of edges going in or out of a node, irrespective of graph type. Obviously the result may differ depending on which graph type is selected, but the 'algorithm' is the same. In a directed/multi-directed network, this will also return in-degree and out-degree.
Weighted degree sums the weights assigned to the edges attached to a node. Each edge is automatically assigned a weight of 1; if there are no other means of assigning weight to an edge, weighted degree will be the same as degree. Weights may result from pre-assigned values (which should be there at import) or from the edge merge strategy, if there are parallel edges and this has been selected. 'Normalisation' would ideally be offered, but not auto-calculated or (especially for degree) returned in place of the raw numbers.
In terms of a UI (@caro401), it makes most sense for the network graph selection to happen at the beginning, along with an edge merge strategy, which can then be carried forward into the rest of the centralities. This means all decisions about graph type/edge weight can be stored as a decision (in the metadata of network_data?) that is actively made and logged in the lineage, and makes for less repeated user input. We can make the modules work for every type of graph, but not run/return all of this information when it doesn't make sense for a research question or data type.
If a user wants to change the graph type and run the calculations again, it would make sense for this to become a different 'branch' of the lineage, as these are significant decisions to re-make. It is also a means of avoiding the re-running and overwriting of material: anything that is changed wouldn't replace what existed but would 'branch' off in terms of lineage. However, this is also a point where the notes would be really helpful, to encourage users to document what decisions they are making and why!
-
Yes, this is something @yaslena and I talked about in depth, and that is the solution we came up with. We specifically don't call this column 'degree' or whatever, but
There is no decision involved, and those 2 (6 if you count the
The reasons those columns are always there and auto-computed for each
I think this lines up with all the columns I create, but I'll have to re-check the code. It's been a while...
Yes, since weight needs user input, there is no way we could meaningfully auto-compute it anyway. That's why there'll be a separate module to do that.
If one of the counts I auto-calculate matches up with unweighted degree (again, unsure there), then I can also auto-calculate the normalisation at
For weighted degree, I'd still rather add 2 columns (raw & normalized) instead of just one, unless there is an obvious reason why that is not wanted? Again, we touch the data anyway, so there is no reason not to attach 2 columns instead of one. It's not expensive, those columns aren't huge in terms of byte size, and we potentially save the user from needing the same module with 2 different inputs. The result columns would be named something like
As I said, the (current)
This whole thing is a good example of the difficulty surrounding the creation of interfaces for modules and data types I was talking about; a lot of the issues that crop up are really hard to predict without having a good number of use cases and starting to work on some of the modules. And it's definitely a different style of programming than one would do in a Jupyter notebook. I can elaborate more on why I took the decisions I took in this instance if you want. I also expect some more development and iterating over
I'm not sure I understand, but this is not how lineage works. Lineage only contains one specific, static set of decisions: those that led to the data that contains it. It does not contain a set of alternate decisions you could have made instead of some of them. Those decisions would be included in a totally separate lineage tree contained in another value (one that was created using that different set of decisions). So in the sense you use the word, there are no 'branches' like that in the lineage tree (I think).
-
Most things have been addressed already, but I will try to refer specifically to the questions above:
Since you are counting incoming and outgoing edges separately in those columns, directed or undirected does indeed not matter here.
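The point about separate incoming and outgoing counts can be sketched as follows, assuming networkX (the toy graph is invented; the actual column names in network_data are not specified here):

```python
import networkx as nx

G = nx.MultiDiGraph([('a', 'b'), ('a', 'b'), ('b', 'c')])

# Incoming and outgoing edges are counted separately per node, so the
# directed/undirected question does not enter into these raw counts.
in_counts = dict(G.in_degree())    # {'a': 0, 'b': 2, 'c': 1}
out_counts = dict(G.out_degree())  # {'a': 2, 'b': 1, 'c': 0}
```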
No, that is not exactly true. As Caitlin writes above, weighted degree can result from pre-assigned values or from merging parallel edges. Your
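A minimal illustration of weighted degree with pre-assigned weights, assuming networkX (toy graph invented; edges without a weight attribute count as 1):

```python
import networkx as nx

G = nx.Graph()
G.add_edge('a', 'b', weight=3)  # pre-assigned weight (e.g. present at import)
G.add_edge('a', 'c')            # no weight attribute -> counts as 1

print(G.degree('a'))                   # plain degree: 2
print(G.degree('a', weight='weight'))  # weighted degree: 3 + 1 = 4
```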
This is not how normalization works in graphs, so if you calculate that number, it should not be called '_degree_centrality', because that term is reserved for a different kind of normalization. While it is true that degree centrality 'is the fraction of nodes it is connected to', in graphs we usually don't take the max existing degree as the reference but the max possible number of edges a node could have, which can be much higher than the max existing (especially in very large graphs, where it is unrealistic that the graph would be 'full').
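The difference between the two normalizations can be seen on a small example, assuming networkX (whose degree_centrality divides by n - 1, the maximum possible degree):

```python
import networkx as nx

G = nx.path_graph(4)  # nodes 0-1-2-3, degrees: 1, 2, 2, 1

# Standard degree centrality: divide by the maximum *possible* degree (n - 1).
dc = nx.degree_centrality(G)
print(dc[1])  # 2 / 3, about 0.667

# Dividing by the maximum *observed* degree gives a different number,
# which should not be called degree centrality.
max_observed = max(d for _, d in G.degree())
print(G.degree(1) / max_observed)  # 2 / 2 = 1.0
```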
I'm not sure about that. You can always calculate fractions based on the numbers in a row, but that might just make the attributes table larger and add more confusion, because it is not a standard network analysis operation.
-
General remarks (performance & modularity @makkus )
All of the modules have in common that, for multigraphs, they aggregate parallel edges, calculate weights, and then compute the weighted value (weighted degree, weighted betweenness) with those values. This may not be the most efficient way to write these modules, since the operation is repeated for every single centrality value. One option would be to aggregate edges and calculate weights in a separate module (only once!) and then use different inputs for the weighted and unweighted calculations in the centrality modules.
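The proposed one-off aggregation step could look roughly like this, assuming networkX (the function name and toy data are hypothetical, not part of any existing module):

```python
import networkx as nx

def aggregate_parallel_edges(mg: nx.MultiGraph) -> nx.Graph:
    """Collapse parallel edges into a simple weighted graph, summing
    per-edge weights (edges without a weight attribute count as 1),
    so downstream centrality modules can reuse it instead of
    re-aggregating for every measure."""
    G = nx.Graph()
    G.add_nodes_from(mg.nodes(data=True))
    for u, v, data in mg.edges(data=True):
        w = data.get('weight', 1)
        if G.has_edge(u, v):
            G[u][v]['weight'] += w
        else:
            G.add_edge(u, v, weight=w)
    return G

mg = nx.MultiGraph([('a', 'b'), ('a', 'b'), ('b', 'c')])
G = aggregate_parallel_edges(mg)
print(G['a']['b']['weight'])  # 2
```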
Degree Ranking module
Although the module is listed among the centralities, it does not in fact compute degree centrality but only node degree and weighted node degree, i.e. the values are not normalized as in the other measures. Normalization is not well defined for multigraphs and graphs with self-loops (because the maximum possible degree of a node is not bounded by the number of nodes in these graphs).
Compare networkX documentation on degree and degree centrality.
We could just add degree centrality to the current module or move degree and weighted degree calculations somewhere else for more consistency.
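The normalization problem for multigraphs can be demonstrated directly, assuming networkX (toy graph invented): with parallel edges, the usual division by n - 1 can produce values above 1.

```python
import networkx as nx

# Three parallel edges between two nodes: degree of 'a' is 3,
# but n - 1 is only 1, so the "normalized" value exceeds 1.
mg = nx.MultiGraph([('a', 'b'), ('a', 'b'), ('a', 'b')])
print(nx.degree_centrality(mg)['a'])  # 3.0
```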
Betweenness module
In this module betweenness is calculated for nodes of the undirected graph and weighted betweenness is calculated for nodes of the directed version of the input graph. This is problematic for all graph types. If the input graph is undirected, then there is no reason to convert it to directed and calculate weighted betweenness values on the directed version. The weighted betweenness values will be very misleading in this case, because edge directionality massively affects shortest path calculations which are the basis for this centrality measure. (See this sample data from a related discussion as an example. The Anna node should always have a betweenness score of around 0.7/0.8 in this undirected network. However, if the graph is converted to directed, all of Anna's edges will be interpreted as "outgoing" connections. As a result, there will be no shortest paths passing through this node at all! It will be assigned the lowest weighted betweenness value (= 0) in the entire network.)
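The effect described here can be reproduced with a minimal star network, assuming networkX (this is an invented toy graph, not the sample data from the linked discussion, so the exact scores differ):

```python
import networkx as nx

# A tiny undirected star: Anna connects B, C, and D, so every shortest
# path between the leaves runs through her.
G = nx.Graph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(nx.betweenness_centrality(G)['Anna'])  # maximal (close to 1.0)

# Interpreting the same edges as directed turns them all into outgoing
# connections from Anna: no shortest path passes through her any more.
D = nx.DiGraph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(nx.betweenness_centrality(D)['Anna'])  # 0.0
```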
The converse is also true: if the input graph is directed, then the module will first convert it to undirected and then calculate betweenness values for all nodes. Those values will not represent shortest-path betweenness for the original directed graph and may be very misleading.
The same holds true for closeness centrality which also relies on shortest paths.
One solution would be to add graph type as an input for this module. Ideally, we could prompt the user to determine graph type first by using kiara's properties function of the network data.
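A hypothetical sketch of what a betweenness module with an explicit graph type input might look like, assuming networkX (the function name, parameter name, and toy data are all invented, not the actual module interface):

```python
import networkx as nx

def betweenness_module(G: nx.DiGraph, graph_type: str) -> dict:
    """Compute betweenness for the graph type the user explicitly chose,
    instead of silently converting between directed and undirected."""
    if graph_type == 'undirected':
        G = G.to_undirected()
    elif graph_type != 'directed':
        raise ValueError(f"unknown graph type: {graph_type!r}")
    return nx.betweenness_centrality(G)

D = nx.DiGraph([('Anna', 'B'), ('Anna', 'C'), ('Anna', 'D')])
print(betweenness_module(D, 'directed')['Anna'])    # 0.0
print(betweenness_module(D, 'undirected')['Anna'])  # maximal (close to 1.0)
```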
Eigenvector module
For directed graphs, this module assigns high eigenvector centrality to nodes connected to nodes that have high in-degree values.
There is an option in networkX to base eigenvector centrality on out-degree values instead, by reversing the edge direction.
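The reversal option can be sketched like this, assuming networkX (the toy graph is invented and chosen to be strongly connected so that the power iteration converges):

```python
import networkx as nx

# A small strongly connected directed graph (not the workshop data).
G = nx.DiGraph([('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('a', 'c')])

# Default: eigenvector centrality driven by incoming edges.
in_based = nx.eigenvector_centrality(G, max_iter=1000)

# Reversing edge direction bases the scores on outgoing edges instead.
out_based = nx.eigenvector_centrality(G.reverse(), max_iter=1000)
```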
Closeness module
Some of these algorithms were not originally designed to work on disconnected graphs. In networkX's closeness centrality there is logic that allows values to be calculated on disconnected graphs: the algorithm computes the closeness centrality for each connected part separately, scaled by that part's size. However, this should be made explicit in the module. The user might also be prompted to investigate graph connectivity before running this module and to use the largest component instead of the whole network.
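The per-component scaling and the largest-component alternative can be seen on a small example, assuming networkX (toy graph invented):

```python
import networkx as nx

# Two components: networkX scores each node within its own component,
# scaled by that component's share of the whole graph.
G = nx.Graph([('a', 'b'), ('b', 'c'), ('x', 'y')])
print(nx.closeness_centrality(G)['b'])  # 0.5 (scaled down by component size)

# Running only on the largest connected component removes that scaling.
largest = max(nx.connected_components(G), key=len)
print(nx.closeness_centrality(G.subgraph(largest))['b'])  # 1.0
```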