Draft how-to for getting nodes and edges tables from network #23

66 changes: 66 additions & 0 deletions developer/how-to-view-the-data-in-networks.md
@@ -0,0 +1,66 @@
# How to view the data in a NetworkData type
Collaborator

One comment in advance: personally, I'd probably have a section in the docs that deals with tabular data, and only explain here how to get to the tables, and then link to the more generic documentation re: querying and other things to do with it.


This article assumes you have `kiara_plugin.tabular` installed at version `~0.5.1`, and `kiara_plugin.network_analysis` at version `~0.5.1`.

Quite often, you'll want to inspect the raw contents of the nodes and/or edges tables which contain the data behind a `NetworkData` value. This might be to get an overview of what's in your network, or to look at the values of centrality measures you've just calculated and applied to the network.

The nodes and edges tables can be accessed from a `NetworkData` value by calling its `get_table` method with the appropriate table name, `"nodes"` or `"edges"`, as the argument. The resulting value is a `KiaraTable`, which in turn is backed by a `pyarrow.Table` from [Apache Arrow](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). The Arrow table contains the raw data, and can be accessed via the `arrow_table` property on a `KiaraTable`.
Collaborator

After the refactoring we talked about, the data type is now called `NetworkGraph`, but I tried to keep the interface the same as much as possible. `get_table` would still work, but a user could also just access the `edges` and `nodes` attributes and get the same `KiaraTable` instance they would get with `get_table`.


## View the entire contents of the nodes or edges table

To view the data contained in the Arrow table, you'll need to convert it into a different data format. The `pyarrow.Table` data type provides a few options for this, for example `to_pandas()` to get a pandas DataFrame, and `to_pydict()` and `to_pylist()` to get plain Python data types, which you can then manipulate as you choose.
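
To make the two plain-Python shapes concrete, here is a small sketch that does not use pyarrow at all: `rows` mimics the list-of-dicts shape that `to_pylist()` returns, and `columns` mimics the dict-of-lists shape that `to_pydict()` returns. The sample values and the helper `rows_to_columns` are made up for illustration.

```python
# Row-oriented shape, like the result of `to_pylist()`: one dict per row.
# These sample rows are invented for illustration.
rows = [
    {"_edge_id": 0, "_source": 886, "_target": 49},
    {"_edge_id": 1, "_source": 886, "_target": 12},
]


def rows_to_columns(rows):
    """Hypothetical helper: pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented shape, like the result of `to_pydict()`: one list per column.
columns = rows_to_columns(rows)
print(columns["_source"])
```

The row-oriented shape is convenient for looking at individual nodes or edges; the column-oriented shape is convenient when you want all values of one column at once.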

Be aware that doing any of these data transformations means your whole nodes or edges table will be loaded into memory on your computer. If your tables are really big, this could cause your code to run slowly and use a lot of memory (RAM).
Collaborator

I guess here it would make sense to point out that using the arrow data directly is considered best practice overall, unless you are writing custom code that is not going to get re-used, or where you know for sure you won't have to deal with unexpectedly large amounts of data.

For frontend developers that would mean using the Arrow JS library, and ideally either sending/receiving 'unserialized' Arrow format, or even better trying to get a pointer to the data in memory for zero-copy style access (not always possible). For Jupyter users it would mean using polars or duckdb (or any of the modules that use it internally, like the `query.table` one you point out below).


Here's an example of how to get the edges table from an imagined existing `NetworkData` value, and print it out as a list of Python dictionaries. Each dictionary represents one row of the table, with the column names as keys and the row's data as values.

```python
from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()
# some code here to get a NetworkData value loaded into kiara
# let's call it my_network_data

# get the edges table for the network, as a `KiaraTable`
edges_kiara_table = my_network_data.get_table("edges")
# if you're in a jupyter context, printing edges_kiara_table will give you a preview of the data

# get all the data via the underlying Arrow table
edges_data = edges_kiara_table.arrow_table.to_pylist()
```

Collaborator

Personally, I have never needed to use the `to_pylist` or `to_pydict` methods. I think a much more common use-case (at least for Jupyter users) would be the pandas export, since there is a high likelihood they are using pandas anyway. If there is indeed a valid use-case for frontend devs to use this over 'pure' Arrow access, I'd say we can probably assume frontend devs have more programming background, and can figure things out themselves with a few links we could provide. Long story short, I would tend to document the pandas code, and not `to_pylist`.

```python
# print just the first edge in that table
print(edges_data[0])
# will look something like this
# {'_edge_id': 0, '_source': 886, '_target': 49, '_count_dup_directed': 1, '_idx_dup_directed': 1, '_count_dup_undirected': 1, '_idx_dup_undirected': 1}
```
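
Once the rows are plain dictionaries, ordinary Python tools apply. As a minimal sketch, assuming `edges_data` has the shape shown above (the three sample rows here are invented), this counts how many edges start at each source node:

```python
from collections import Counter

# Invented sample rows in the same shape as the `to_pylist()` output above.
edges_data = [
    {"_edge_id": 0, "_source": 886, "_target": 49},
    {"_edge_id": 1, "_source": 886, "_target": 12},
    {"_edge_id": 2, "_source": 49, "_target": 12},
]

# Count outgoing edges per source node id.
out_degree = Counter(row["_source"] for row in edges_data)
print(out_degree.most_common(1))
```

This works on the full in-memory list, so the memory caveat above still applies for large tables.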

## View a specific subset of the nodes or edges table

If instead you want to see a subset of one of your tables, using a SQL query to select some of the data, you can use the `query.table` operation from `kiara_plugin.tabular`. First extract the nodes or edges table using `get_table`, then query the resulting `KiaraTable` value using `query.table`. The result of this is also a `KiaraTable`, so you may wish to extract all the data in the same way as above.

```python
from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()
# some code here to get a NetworkData value loaded into kiara
# let's call it my_network_data

# get the nodes table for the network, as a `KiaraTable`
nodes_kiara_table = my_network_data.get_table("nodes")
```

Collaborator

This is probably not a good idea, because if you do it like this you break the lineage of the result value. It depends of course on whether that matters in your particular circumstances or not, but I guess it's better not to confuse people by documenting a practice that would only make sense for some sort of frontend-preview scenario, but would be ill-advised within a Jupyter/Python research workflow.

Up until now, for all the network analysis examples where there was a use case like this, the querying always happened on the source tables (before they became network_data/network_graph). We can easily support this scenario too; all it takes is adding a module `network_graph.pick.table` (or something like that), which takes a network graph and either an 'edges' or 'nodes' string as input, and returns a table as result. I can easily add that, and will have it ready in the 'tropy' plugin in the next few days.

Anyway, the result (of type 'table') can subsequently be used in the code below, and lineage will be intact in the result of that.

```python
# Let's get the first 5 nodes with names a bit like "Johan"
# the table name in your SQL query must be "data"
query_inputs = {
    "table": nodes_kiara_table,
    "query": "SELECT _node_id, id FROM data WHERE id LIKE '%Johan%' LIMIT 5",
}
sql_results = kiara.run_job(
    "query.table",
    inputs=query_inputs,
)["query_result"]
# if you're in a jupyter context, printing sql_results will give you a preview of the resulting data

# if not, you can get the raw data from Arrow in the same way
full_query_data = sql_results.data.arrow_table.to_pylist()
print(full_query_data)
# A list of up to 5 nodes, something like
# [{'_node_id': 9, 'id': 'Aerssen, Johan, 1579-1654'},
# {'_node_id': 33, 'id': 'Antonides van der Goes, Johannes, 1647-1684'}, .....]
```
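
If you want to try out the shape of such a query before running it through kiara, the sketch below runs the same `SELECT` against an in-memory SQLite table named `data`, using only Python's standard library. SQLite is a stand-in here, not what `query.table` uses internally, so SQL dialect details (for example the case sensitivity of `LIKE`) may differ. The third sample row is invented, and an `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

# In-memory stand-in for the nodes table; the table must be called "data"
# to match the query used with `query.table` above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (_node_id INTEGER, id TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        (9, "Aerssen, Johan, 1579-1654"),
        (33, "Antonides van der Goes, Johannes, 1647-1684"),
        (40, "Example, Pieter"),  # made-up row that should not match
    ],
)

# Same filter as above, with ORDER BY added for a stable result order.
rows = conn.execute(
    "SELECT _node_id, id FROM data "
    "WHERE id LIKE '%Johan%' ORDER BY _node_id LIMIT 5"
).fetchall()
conn.close()
print(rows)
```

Note that `'%Johan%'` also matches "Johannes", which is why the comment in the example above describes the names as "a bit like" Johan.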