-
Notifications
You must be signed in to change notification settings - Fork 0
Experiment: Store all edges in MapDB
Referencing ticket #1504.
I've been looking into MapDB as a possible bucket for graph information. Possible advantages: very short graph startup, outsourcing the persistence layer. Possible disadvantage: slower. So, if it isn't slower, it's a big win.
The data I worked with is a street graph based on an OSM extract of every road in New York State. First, I wrote a custom minimal graph with hard-wired object references between graph vertices, and populated it with OSM data.
I wrote a pretty minimal dijkstra algorithm using Java's built-in priority queue. The algorithm didn't do any key decreasing - it just holds a pqueue of edges and any time popped edge leads to somewhere already in the SPT it just ignores it and pops the next off the queue. I ran that on the NYS data to get a baseline of how long this particular dumb toy dijkstra implementation takes to exhaust an SPT on a very fast in-memory graph. Here's the first 8 retries in milliseconds {3250, 1082, 1000, 987, 957, 986, 974, 977}. That is to say, it takes about 1.00 seconds. This is the performance of the algorithm without MapDB.
Second I pushed the entire NYS dataset into MapDB and called get() on every key in arbitrary order. Note this MapDB was created using .mmapFileEnable().cacheSize(32768*8) options. The first 8 retries in milliseconds: {6702, 435, 434, 433, 441, 439, 432, 432, 437, 462}. 0.44 seconds is a reasonable mode. This is the performance of MapDB without the pathfinding algorithm.
The MapDB version of the graph is simple. An object "SimpleVertex" contains an array of "SimpleEdge" objects, which each have a distance and a string member that points to the label of the destination vertex. Each vertex is stored in MapDB under its string label.
The toy dijkstra algorithm was modified to run on the MapDB graph, and then SPTs were created on the entire New York State dataset. Using MapDB right out the box with no performance enhancement the running times, in milliseconds, were: {14614, 4910, 3890, 4193, 4104, 6414, 5200, 8220}. Mean time after the first: 5276 ms.
{15833, 4085, 2344, 2338, 3094, 2253, 2946, 2186} Mean, after the first: 2749 ms.
{14283, 2193, 2024, 2144, 2032, 2070, 3053, 3474} Mean, after the first: 2427 ms.
{8710, 2167, 2261, 2068, 1951, 1997, 1996, 2052} Mean, after the first: 2070 ms.
{48379, 38081, 34330, 34526, 33249} I think a MapDB cache size larger than Java's heap size may have resulted in a lot of swapping.
{8832, 2225, 2175, 2093, 2068, 2112, 2081, 2081} Mean: 2119 ms. Not better than simply cacheSize(32768*8).
Note this requires a MapDB graph rebuild.
{7304, 3385, 3320, 3321, 4121, 3079, 3011, 3095}
Mean: 3333 ms. Not a winner.
{6962, 2066, 2068, 2139, 2089, 2035, 2107, 2128} Mean: 2090ms. Tied with cacheSize(32768*8).
{7794, 2119, 3074, 2054, 1998, 1989, 1983, 2021} Mean: 2177 ms. Median: 2021ms. I mention median here because if it wasn't for that one 3074ms time the performance would be pretty good.
The best performance simply comes from increasing the cache to some large fraction of the size of the graph, at which point we get SPT exhaustion times around 2.0 seconds.
I gave each edge a 256-byte payload of random bits. The following performance tests are from the New York State street graph with padded edges. It's about 2gb on disk. For these tests I ran Java with the argument "-Xmx1000m". I made and used the database with the ".mmapFileEnable()" option, with the default cache size.
When attempting to access every item by key, it stops at about the 760000th iteration with "Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded".
When attempting to build an exhaustive SPT, it stops at about the 1470000th iteration with the same error message.
Increasing memory using "-Xmx8000m":
Accessing every item: 8 tries, in milliseconds: {6238, 3277, 1362, 1349, 1302, 1253, 1297, 1243, 1305, 1306}.
Exhaustive SPTs: elapsed: {14700, 5216, 4664, 3602, 3689, 3444, 3403, 3338}
Our best running times in MapDB were about 2.0 seconds, of which 1.0 seconds was spent outside MapDB. So, while creating an exhaustive SPT over the dataset, MapDB spent about 1.0 seconds. We know that accessing every object in MapDB by key "in order" takes about 0.44 seconds, indicating that approximately half the time, 0.56 seconds, is a performance hit resulting from the particular order that the dijkstra algorithm accessed objects in MapDB.
These running times are decent, but if the goal is to achieve routing times across the eastern seaboard in under 1.0 seconds it may not be possible unless the search is pruned to access a small fraction of the vertices in the graph.
It's important to understand how MapDB would work in a truly memory constrained scenario, where the operating system is not able to completely cache the underlying database file. On a OS X laptop with 8gb memory, this experiment was never truly memory constrained; never went into a swap. We'd need to set up an actual test if we want to know the performance characteristics of MapDB in a memory constrained environment.