
Computational optimization #10

Open
1 of 3 tasks
joandre opened this issue Apr 2, 2016 · 20 comments

@joandre
Owner

joandre commented Apr 2, 2016

  • Include the normalization and pruning processes in the inflation operation so that those operations are computed locally (per node)
  • Improve the pruning approach
  • Optimize Spark distributed matrix operations
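A minimal pure-Python sketch of the first task – inflation with normalization and pruning fused into one local pass over a single column – could look like this (the function name, inflation power, and threshold are illustrative, not the repo's actual Scala code):

```python
# Hypothetical sketch: MCL inflation with normalization and pruning fused
# into a single local pass over one matrix column (stored as a sparse dict
# row -> value), so no extra distributed steps are needed.

def inflate_column(col, r=2.0, prune_threshold=1e-4):
    """Raise entries to power r, renormalize to sum 1, drop tiny entries."""
    inflated = {k: v ** r for k, v in col.items()}
    total = sum(inflated.values())
    normalized = {k: v / total for k, v in inflated.items()}
    pruned = {k: v for k, v in normalized.items() if v >= prune_threshold}
    # renormalize once more so the column still sums to 1 after pruning
    total = sum(pruned.values())
    return {k: v / total for k, v in pruned.items()}

col = {0: 0.5, 1: 0.4999, 2: 0.0001}
print(inflate_column(col))  # entry 2 is pruned, the rest sums to 1
```

Doing all three steps in one pass over each column is what keeps the operation local to a node, instead of requiring separate distributed normalization and pruning jobs.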
@akaltsikis

Hey,
I am also working on implementing MCL in Apache Spark and I have some ideas/problems that I would like to discuss with someone who has some experience.
Can you send me an email to discuss this further? (the same username that I have, at gmail dot com)
Thanks

@uzadude
Contributor

uzadude commented Jul 31, 2016

Hi joandre,

akaltsikis approached me also regarding the MCL implementation over Spark.
I wanted to recap my experience from the last few months trying to run your MCL code on a relatively big graph – 50M vertices and 2B edges (the adjacency matrix is about 2B*8 bytes ~= 16GB).
MCL’s main step is the “expansion” - multiplying the sparse block matrices.
Spark’s implementation for the multiplication has two stages:

  1. Simulate multiply – a preparation step before the real multiplication, to determine which pairs of blocks need to be multiplied.
  2. The actual block multiplications and summation.

This implementation has some problems:

  1. A relatively trivial bug I fixed in the first step that caused the process to be very slow [SPARK-16469].
  2. Even after the bug fix, if there are too many blocks the simulate multiply takes a very long time and replicates the data many times.
  3. Spark supports multiplication only with dense matrices. Thus, it converts a sparse matrix into a dense one before the multiplication.
  4. For summing the result Spark uses Breeze’s CSC matrix operations – the problem here is that it is very inefficient to update a CSC matrix at a position that is currently zero, because the insertion has to shift the column’s stored entries.
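To illustrate point 4, here is a toy single-column CSC-like structure in plain Python (not Breeze's actual code) showing why writing to a currently-zero position is expensive:

```python
# Toy illustration of CSC update cost: a column is stored as a sorted array
# of row indices plus a parallel array of values. Overwriting an existing
# nonzero is cheap; inserting at a position that currently holds a zero
# forces every later entry in both arrays to shift.

import bisect

class ToyCSC:
    """Single-column CSC-like storage: sorted row indices + parallel values."""
    def __init__(self):
        self.row_indices = []  # kept sorted
        self.values = []

    def update(self, row, value):
        i = bisect.bisect_left(self.row_indices, row)
        if i < len(self.row_indices) and self.row_indices[i] == row:
            self.values[i] = value           # O(1): entry already exists
        else:
            self.row_indices.insert(i, row)  # O(n): shifts all later entries
            self.values.insert(i, value)

col = ToyCSC()
for r in [5, 1, 3]:
    col.update(r, 1.0)
print(col.row_indices)  # [1, 3, 5]
```

Summing sparse blocks produces many such fill-ins, so each one pays that shifting cost in the real CSC representation too.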

That means that with many blocks (the default block size is 1024) – in our case 50M/1024 ~= 50K blocks per side – the simulate multiply will effectively never finish, or will generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger block size, e.g. 100K, we get an OutOfMemoryException in the “toDense” method of the multiply. We have worked around that by implementing both the sparse multiplication and the addition ourselves in a very naïve way – but at least it works.
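The back-of-envelope arithmetic above can be checked in a few lines (the 8 bytes per entry and the block sizes come from the comment; the worst case assumes simulate multiply replicates the data once per block along the shared dimension):

```python
# Back-of-envelope check of the block-count arithmetic in the comment above.
# The 2B edges are the nonzeros of the adjacency matrix, 8 bytes each.

n_vertices = 50_000_000
n_edges = 2_000_000_000
bytes_per_entry = 8

matrix_bytes = n_edges * bytes_per_entry
print(f"sparse data size: {matrix_bytes / 2**30:.0f} GiB")  # ~15 GiB

block_size = 1024
blocks_per_side = -(-n_vertices // block_size)  # ceiling division
print(f"blocks per side: {blocks_per_side}")    # ~49K

# Worst case: each block of data replicated once per block along the
# shared dimension of the multiplication.
worst_case = blocks_per_side * matrix_bytes
print(f"worst-case shuffle: {worst_case / 2**40:.0f} TiB")
```

The result lands in the same ballpark as the "~1000TB" estimate, which is why the shuffle never finishes at the default block size.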
I am going to open some JIRAs in the Spark project, both on sparse matrix multiplication in general and on implementing MCL (have you already tried?).

Ohad.

@joandre
Owner Author

joandre commented Jul 31, 2016

Amazing job Ohad,

That is exactly what I wanted to do for months without finding the time!

No, I confirm I have not tried. I was waiting to do what you have done: a complete inventory on big graphs. Actually, the last issue I found talking about MCL [SPARK-5832] was about which algorithm to choose between AP and MCL. I think it is indeed time to propose an implementation of MCL since it is widely used (especially in bioinformatics).

One last point: I was also thinking about using the GraphX library instead of Spark matrices to implement it, so I could compare the performance of both approaches. That remained an idea in the back of my mind.

Joan

@uzadude
Contributor

uzadude commented Jul 31, 2016

Funny that you say that,
I have already implemented the GraphX solution for the specific case of expansion=2. It works nicely, but I haven't yet tried to extend it further.
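For reference, expansion with e=2 is just squaring the column-stochastic matrix; a minimal sparse sketch in plain Python (this shows the operation itself, not the GraphX implementation):

```python
# Minimal sparse illustration of MCL "expansion" with e=2, i.e. squaring
# the column-stochastic matrix. The matrix is a dict mapping
# (row, col) -> value, storing only nonzeros.

from collections import defaultdict

def expand(m):
    """Return m @ m for a sparse {(i, j): v} matrix."""
    by_row = defaultdict(dict)  # index the left factor by row
    for (i, j), v in m.items():
        by_row[i][j] = v
    out = defaultdict(float)
    for (k, j), v in m.items():      # right-factor entry at (k, j)
        for i, u in by_row.items():  # left-factor rows containing column k
            if k in u:
                out[(i, j)] += u[k] * v
    return dict(out)

m = {(0, 0): 0.5, (1, 0): 0.5, (1, 1): 1.0}
print(expand(m))  # {(0, 0): 0.25, (1, 0): 0.75, (1, 1): 1.0}
```

In the graph view this is a two-hop random-walk step, which is why it maps naturally onto GraphX message passing for the e=2 case.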

@joandre
Owner Author

joandre commented Aug 15, 2016

Hi Ohad,

Just to let you know, I have published the repo as a Spark package (MCL: https://spark-packages.org/package/joandre/MCL_spark).

Joan

@uzadude
Contributor

uzadude commented Aug 17, 2016

Great news!


@icklerly

Hi Joan,
I tried your MCL on a dataset from SNAP (https://snap.stanford.edu/data/com-Amazon.html), which has 334,863 nodes and 925,872 edges.
Unfortunately, the program always crashes with a java.lang.OutOfMemoryError: Java heap space.

I changed only the input creation of the program:
```scala
// read input from file
val edgesRDD01: RDD[Edge[Double]] = sc.textFile(input).map { line =>
  val parts = line.split('\t')
  Edge(parts(0).toLong - 1, parts(1).toLong - 1, 1.0)
}
// build the symmetric (reversed) edges
val edgesRDD02: RDD[Edge[Double]] = edgesRDD01.map { e =>
  Edge(e.dstId, e.srcId, 1.0)
}
// unite both sets
val edgesRDD: RDD[Edge[Double]] = edgesRDD01.union(edgesRDD02)
val graph = Graph.fromEdges(edgesRDD, 1.0)
```

Is the input already too large for the algorithm to process?

@akaltsikis

@icklerly I am going to post a version of the MCL algorithm for Apache Spark which runs on BIG and sparse graphs (much bigger than your dataset) in the following days, so please keep an eye on my profile.

@joandre
Owner Author

joandre commented Jan 24, 2017

Hi all,

Nice to hear, akaltsikis. Indeed, my implementation suffers from uzadude's remark above (Spark uses dense block matrices by default). I will publish a more workable version soon.

@uzadude: have you finally tried to push some commits to Spark on that topic?

Joan

@akaltsikis

Hello dear @joandre,
With the great help of @uzadude, I managed to get my implementation almost production-ready.
I am working on it as we speak. As of now, the version @uzadude implemented is still too naive to be included in MLlib; however, IT WORKS :-)

Angelos

@joandre
Owner Author

joandre commented Jan 25, 2017

Great! I will add a link to your repo in the README file.

@joandre
Owner Author

joandre commented Jan 25, 2017

@uzadude would you be OK with proposing a pull request so we can introduce your implementation in this version? Otherwise I will end up redoing pretty much the same work you already did.

@uzadude
Contributor

uzadude commented Jan 27, 2017

Hi guys!
Well, I'll let @akaltsikis post my version with all the "hacky" sparse matrix operations. I agree it is not good enough yet to push those changes into MLlib.
I also just found another bug in Spark when I was trying to move to Spark 2.1 ([SPARK-19368]) and have a workaround.
I think I'll organize all the sparse-matrix-related changes I had to make and will try to push them to the main project.
Meanwhile we'll update the MCL package so others can run it too.
@icklerly – I downloaded the Amazon dataset and it looks like it runs fine. I think you can try @akaltsikis's version (https://github.com/akaltsikis/MCL) or wait a few days till we merge the fixes into the MCL package.

@icklerly

Hi @uzadude,
Thanks for testing the dataset! I will try it again.
Unfortunately, the link to @akaltsikis version ends in a dead end...
Could you give me a working link?

@akaltsikis

@icklerly @uzadude As of now the project is private; I need to clean up the code and then it will go public. I will post in this thread so you receive an email when it's available. Do you need to run the dataset now, or can you wait a couple of days?

@icklerly

icklerly commented Jan 27, 2017

@akaltsikis
No rush, I can wait a couple of days!
But the runtime would be interesting to know, @uzadude!

@akaltsikis

akaltsikis commented Jan 27, 2017

@icklerly I ran it on my shitty laptop and it took 511 seconds, finished after 33 iterations, and output 68,930 clusters. :-)

@icklerly

@akaltsikis
Thanks for the information! :)
Currently I am writing a paper on distributed graph clustering using Apache Flink. Your MCL Spark implementation seems perfect to compare against! Therefore, it would be cool to test your code and generate runtimes on my own...
Is it possible to get early access to your code?

@icklerly

Hi all,
I just wanted to check if there is any news about a release date?
Thanks in advance!

@joandre joandre mentioned this issue Feb 20, 2017
@joandre
Owner Author

joandre commented Apr 12, 2017

Hi @icklerly,

Sorry I missed your message. As akaltsikis said, he will publish a new version; it is in progress. No release date is scheduled right now.
