Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define primary junctions #20

Open
lucventurini opened this issue Aug 17, 2016 · 0 comments
Open

Define primary junctions #20

lucventurini opened this issue Aug 17, 2016 · 0 comments
Assignees

Comments

@lucventurini
Copy link
Collaborator

lucventurini commented Aug 17, 2016

The problem arises by having multiple junctions which are separated by small distances of acceptors and donors. A possible example would be:

1 Chr1 100 400
2 Chr1 95 403
3 Chr1 100 402
4 Chr1 109 410
5 Chr1 115 420
6 Chr1 100 560

In the example above, 1-4 are just shifts around the (here assumed real) junction 100-400, 1-5 is lumped into the group only because of no. 4 but would not be otherwise, and no. 6 should not be considered part of the group.

In graph form, this looks like this:

1 ----- 2
| \ / |
| / \ |
3-----4 ------- 5

6

Ideally we want to select 1, 5 and 6, and discard 2-4 as minor variations on 1.

The way I would do it with a Mikado-like algorithm:

1- lump together junctions by their distance. So create a super-group of 1-6. These are our nodes.
2- Define the edges. Two junctions are connected if their acceptors are within 10 bps of each other, and their donors are within 10 bps of each other.
3- Define the subgraphs, using an algorithm to find connected components in a graph. This is an implementation in Python:
http://networkx.readthedocs.io/en/networkx-1.11/_modules/networkx/algorithms/components/connected.html#connected_components
4- In this case, we have two subgraphs: (1,2,3,4,5) and (6). Consider them separately (this effectively solves the problem of potentially excluding 6).
5- Find the cliques in each subgraph: there are various algorithms but I am banal and I like the very popular Bron-Kerbosh solution. A possible implementation is here:
http://networkx.readthedocs.io/en/networkx-1.11/_modules/networkx/algorithms/clique.html#find_cliques
In this case, the algorithm yields two cliques: (1,2,3,4) and (4,5)
6- Define the scoring. Let us assume it is as follows: 1 > 2==3 > 5 > 4.
7- The first node to consider is 1, as it is the one with the greatest score.
8- Assign all the nodes which are in a clique shared by (1) to be secondary to the (1) group. Remove them from further consideration.
9- This leaves you with disconnected components inside the subgraph. Repeat steps 5-8 until all nodes are exhausted.
10- In our case, step 5-9 yield two "primary" nodes: 1 and 5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants