
Computational optimization #10

Open
1 of 3 tasks
joandre opened this issue Apr 2, 2016 · 20 comments

@joandre
Owner

joandre commented Apr 2, 2016

  • Include the normalization and pruning processes in the inflation operation so that those operations are computed locally (per node)
  • Improve the pruning approach
  • Optimize Spark distributed matrix operations
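A minimal pure-Python sketch of the first task – inflation with normalization and pruning fused into one local pass over a single column – could look like this (the function name, inflation power, and threshold are illustrative, not the repo's actual Scala code):

```python
# Hypothetical sketch: MCL inflation with normalization and pruning fused
# into a single local pass over one matrix column (stored as a sparse dict
# row -> value), so no extra distributed steps are needed.

def inflate_column(col, r=2.0, prune_threshold=1e-4):
    """Raise entries to power r, renormalize to sum 1, drop tiny entries."""
    inflated = {k: v ** r for k, v in col.items()}
    total = sum(inflated.values())
    normalized = {k: v / total for k, v in inflated.items()}
    pruned = {k: v for k, v in normalized.items() if v >= prune_threshold}
    # renormalize once more so the column still sums to 1 after pruning
    total = sum(pruned.values())
    return {k: v / total for k, v in pruned.items()}

col = {0: 0.5, 1: 0.4999, 2: 0.0001}
print(inflate_column(col))  # entry 2 is pruned, the rest sums to 1
```

Doing all three steps in one pass over each column is what keeps the operation local to a node, instead of requiring separate distributed normalization and pruning jobs.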
@akaltsikis

Hey,
I am also working on implementing MCL in Apache Spark and I have some ideas/problems that I would like to discuss with someone who has some experience.
Can you send me an email to discuss this further? (the same username that I have, at gmail dot com)
Thanks

@uzadude
Contributor

uzadude commented Jul 31, 2016

Hi joandre,

akaltsikis approached me also regarding the MCL implementation over Spark.
I wanted to recap my experience from the last few months trying to run your MCL code on a relatively big graph – 50M vertices and 2B edges (the adjacency matrix is about 2B*8 bytes ~= 16GB).
MCL’s main step is the “expansion” - multiplying the sparse block matrices.
Spark’s implementation for the multiplication has two stages:

  1. Simulate multiply – a preparation step before the real multiplication, to determine which pairs of blocks need to be multiplied.
  2. The actual block multiplications and summation.

This implementation has some problems:

  1. A relatively trivial bug I fixed in the first step that caused the process to be very slow [SPARK-16469].
  2. Even after the bug fix, if there are too many blocks the simulate multiply takes a very long time and replicates the data many times.
  3. Spark supports multiplication only with dense matrices. Thus, it converts a sparse matrix into a dense one before the multiplication.
  4. For summing the result Spark uses Breeze’s CSC matrix operations – the problem here is that it is very inefficient to update a CSC matrix at a position that is currently zero, because the insertion has to shift the column’s stored entries.
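To illustrate point 4, here is a toy single-column CSC-like structure in plain Python (not Breeze's actual code) showing why writing to a currently-zero position is expensive:

```python
# Toy illustration of CSC update cost: a column is stored as a sorted array
# of row indices plus a parallel array of values. Overwriting an existing
# nonzero is cheap; inserting at a position that currently holds a zero
# forces every later entry in both arrays to shift.

import bisect

class ToyCSC:
    """Single-column CSC-like storage: sorted row indices + parallel values."""
    def __init__(self):
        self.row_indices = []  # kept sorted
        self.values = []

    def update(self, row, value):
        i = bisect.bisect_left(self.row_indices, row)
        if i < len(self.row_indices) and self.row_indices[i] == row:
            self.values[i] = value           # O(1): entry already exists
        else:
            self.row_indices.insert(i, row)  # O(n): shifts all later entries
            self.values.insert(i, value)

col = ToyCSC()
for r in [5, 1, 3]:
    col.update(r, 1.0)
print(col.row_indices)  # [1, 3, 5]
```

Summing sparse blocks produces many such fill-ins, so each one pays that shifting cost in the real CSC representation too.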

That means that with many blocks (the default block size is 1024) – in our case 50M/1024 ~= 50K blocks per side – the simulate multiply will effectively never finish, or will generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger block size, e.g. 100K, we get an OutOfMemoryException in the “toDense” method of the multiply. We have worked around that by implementing both the sparse multiplication and the addition ourselves in a very naïve way – but at least it works.
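The back-of-envelope arithmetic above can be checked in a few lines (the 8 bytes per entry and the block sizes come from the comment; the worst case assumes simulate multiply replicates the data once per block along the shared dimension):

```python
# Back-of-envelope check of the block-count arithmetic in the comment above.
# The 2B edges are the nonzeros of the adjacency matrix, 8 bytes each.

n_vertices = 50_000_000
n_edges = 2_000_000_000
bytes_per_entry = 8

matrix_bytes = n_edges * bytes_per_entry
print(f"sparse data size: {matrix_bytes / 2**30:.0f} GiB")  # ~15 GiB

block_size = 1024
blocks_per_side = -(-n_vertices // block_size)  # ceiling division
print(f"blocks per side: {blocks_per_side}")    # ~49K

# Worst case: each block of data replicated once per block along the
# shared dimension of the multiplication.
worst_case = blocks_per_side * matrix_bytes
print(f"worst-case shuffle: {worst_case / 2**40:.0f} TiB")
```

The result lands in the same ballpark as the "~1000TB" estimate, which is why the shuffle never finishes at the default block size.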
I am going to open some JIRAs in the Spark project, both on sparse matrix multiplication in general and on implementing MCL (have you already tried?).

Ohad.

@joandre
Owner Author

joandre commented Jul 31, 2016

Amazing job Ohad,

That is exactly what I wanted to do for months without finding the time!

No, I confirm I have not tried. I was waiting to do what you have done: a complete inventory on big graphs. Actually, the last issue I found talking about MCL [SPARK-5832] was about which algorithm to choose between AP and MCL. I think it is indeed time to propose an implementation of MCL since it is widely used (especially in bioinformatics).

One last point: I was also thinking about using the GraphX library instead of Spark matrices to implement it, so I could compare the performance of both approaches. That remained an idea in the back of my mind.

Joan

@uzadude
Contributor

uzadude commented Jul 31, 2016

Funny that you say that,
I have already implemented the GraphX solution for the specific case of expansion=2. It works nicely, but I haven't yet tried to extend it further.
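For reference, expansion with e=2 is just squaring the column-stochastic matrix; a minimal sparse sketch in plain Python (this shows the operation itself, not the GraphX implementation):

```python
# Minimal sparse illustration of MCL "expansion" with e=2, i.e. squaring
# the column-stochastic matrix. The matrix is a dict mapping
# (row, col) -> value, storing only nonzeros.

from collections import defaultdict

def expand(m):
    """Return m @ m for a sparse {(i, j): v} matrix."""
    by_row = defaultdict(dict)  # index the left factor by row
    for (i, j), v in m.items():
        by_row[i][j] = v
    out = defaultdict(float)
    for (k, j), v in m.items():      # right-factor entry at (k, j)
        for i, u in by_row.items():  # left-factor rows containing column k
            if k in u:
                out[(i, j)] += u[k] * v
    return dict(out)

m = {(0, 0): 0.5, (1, 0): 0.5, (1, 1): 1.0}
print(expand(m))  # {(0, 0): 0.25, (1, 0): 0.75, (1, 1): 1.0}
```

In the graph view this is a two-hop random-walk step, which is why it maps naturally onto GraphX message passing for the e=2 case.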

@joandre
Owner Author

joandre commented Aug 15, 2016

Hi Ohad,

Just to let you know, I have published the repo as a Spark package (MCL: https://spark-packages.org/package/joandre/MCL_spark).

Joan

@uzadude
Contributor

uzadude commented Aug 17, 2016

Great news!


@icklerly

Hi Joan,
I tried your MCL on a dataset from SNAP (https://snap.stanford.edu/data/com-Amazon.html), which has 334,863 nodes and 925,872 edges.
Unfortunately, the program always crashes with a java.lang.OutOfMemoryError: Java heap space.

I changed only the input creation of the program:
```scala
// read input from file
val edgesRDD01: RDD[Edge[Double]] = sc.textFile(input).map { line =>
  val parts = line.split('\t')
  Edge(parts(0).toLong - 1, parts(1).toLong - 1, 1.0)
}
// build the symmetric (reversed) edges
val edgesRDD02: RDD[Edge[Double]] = edgesRDD01.map { e =>
  Edge(e.dstId, e.srcId, 1.0)
}
// unite both sets
val edgesRDD: RDD[Edge[Double]] = edgesRDD01.union(edgesRDD02)
val graph = Graph.fromEdges(edgesRDD, 1.0)
```

Is the input already too large for the algorithm to process?

@akaltsikis

@icklerly I am going to post a version of the MCL algorithm for Apache Spark which runs on BIG and sparse graphs (much bigger than your dataset) in the following days, so please keep an eye on my profile.

@joandre
Owner Author

joandre commented Jan 24, 2017

Hi all,

Nice to hear, akaltsikis. Indeed, my implementation suffers from uzadude's remark above (Spark uses dense block matrices by default). I will publish a more workable version soon.

@uzadude: have you finally tried to push some commits to Spark on that topic?

Joan

@akaltsikis

Hello dear @joandre,
With the great help of @uzadude, I managed to get my implementation almost production-ready.
I am working on it as we speak. As of now, the version @uzadude implemented is still too naive to be included in MLlib; however, IT WORKS :-)

Angelos

@joandre
Owner Author

joandre commented Jan 25, 2017

Great! I will add a link to your repo in the README file.

@joandre
Owner Author

joandre commented Jan 25, 2017

@uzadude would you be OK with proposing a pull request so we can introduce your implementation in this version? Otherwise I will end up redoing pretty much the same work you already did.

@uzadude
Contributor

uzadude commented Jan 27, 2017

Hi guys!
Well, I'll let @akaltsikis post my version with all the "hacky" sparse matrix operations. I agree it is not good enough yet to push those changes into MLlib.
I also just found another bug in Spark when I was trying to move to Spark 2.1 ([SPARK-19368]) and have a workaround.
I think I'll organize all the sparse-matrix-related changes I had to make and will try to push them to the main project.
Meanwhile we'll update the MCL package so others can run it too.
@icklerly – I downloaded the Amazon dataset and it looks like it runs fine. I think you can try @akaltsikis's version (https://github.com/akaltsikis/MCL) or wait a few days till we merge the fixes into the MCL package.

@icklerly

Hi @uzadude,
Thanks for testing the dataset! I will try it again.
Unfortunately, the link to @akaltsikis version ends in a dead end...
Could you give me a working link?

@akaltsikis

@icklerly @uzadude As of now the project is private; I need to clean up the code and then it will go public. I will post in this thread so you receive an email when it's available. Do you need to run the dataset now, or can you wait a couple of days?

@icklerly

icklerly commented Jan 27, 2017

@akaltsikis
No rush, I can wait a couple of days!
But the runtime would be interesting to know, @uzadude!

@akaltsikis

akaltsikis commented Jan 27, 2017

@icklerly I ran it on my shitty laptop and it took 511 seconds, finished after 33 iterations, and output 68,930 clusters. :-)

@icklerly

@akaltsikis
Thanks for the information! :)
Currently I am writing a paper on distributed graph clustering using Apache Flink. Your MCL Spark implementation seems perfect to compare against! Therefore, it would be cool to test your code and generate runtimes on my own...
Is it possible to get early access to your code?

@icklerly

Hi all,
I just wanted to check if there is any news about a release date?
Thanks in advance!

@joandre joandre mentioned this issue Feb 20, 2017
@joandre
Owner Author

joandre commented Apr 12, 2017

Hi @icklerly,

Sorry I missed your message. As akaltsikis said, he will publish a new version; it is in progress. No release date is scheduled right now.
