Severe performance degradation when working with large datasets #7
Hi @mproffitt, thank you very much for pointing out the issue, and apologies for coming back to you with such a delay. That performance problem has been known to me for a while, as you guessed, and it is only for lack of time that I haven't fixed it yet. I didn't know, however, what the magnitude of the bottleneck was, and from what you're saying it is evident that I should start working on it ASAP.

The problem there is that index creation is a costly operation. Off the top of my head I can't remember how pandas implements the merge, but it is quite possible that in most cases it will be significantly faster than creating a bunch of indices on all columns. This also leads me to suggest, if you haven't already done so: specifying the join column explicitly to pyupset, or indeed passing it minimal dataframes with all and only the information to intersect, helps improve performance.

As for the memory issues, using pandas and dataframes as the underlying representation makes it inevitable that some operations will be slow and hit memory limits. This prompted me to have a quick look at what alternatives exist for the fast computation of set intersections. Most of what I saw refers to the intersection of two sets, which in pyupset's case may be problematic, as we need to compute all intersections. It may well be, on the other hand, that even doing many set intersections in sequence ends up being faster than relying on indices or merges, in which case I'll change the underlying representation of the sets. I need to do some experiments, though. I think one would still need to deal with the original dataframes throughout the whole process, as things like subset highlighting in graphs require accessing all the initial information jointly.
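For reference, a minimal sketch of what "passing minimal dataframes" might look like, assuming the datasets are loaded from CSV files and that gene_name is the join column; the file names and column names here are illustrative, not part of pyupset:

import pandas as pd
import pyupset as pyu

# Hypothetical file names; keep only the join column plus the columns
# that actually need to be intersected or plotted.
files = {'set_a': 'a.csv', 'set_b': 'b.csv', 'set_c': 'c.csv', 'set_d': 'd.csv'}
datasets = {
    name: pd.read_csv(path, usecols=['gene_name', 'start', 'end'])
    for name, path in files.items()
}

# The join column is passed explicitly so pyupset does not have to
# consider every column of every dataframe when computing intersections.
pyu.plot(datasets, unique_keys=['gene_name'])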
Hi @ImSoErgodic, thank you for getting back on this issue, and no problem about the delay. I think it's a case of swings versus roundabouts over whether a merge is faster than creating a bunch of indexes. On small datasets indexes may be faster, but as the datasets become larger, merging definitely offers performance enhancements until you hit the memory ceiling. The code below details how I'm calling the library:

if self._graph is not None:
    frame = pyu.plot(
        datasets,
        unique_keys=[dataset.group_by],
        additional_plots=[
            {
                'kind': 'scatter',
                'data_quantities': {'x': self._graph.xdata, 'y': self._graph.ydata},
                'graph_properties': {'alpha': .8, 'lw': .4, 'edgecolor': 'w', 's': 50}
            },
            {
                'kind': 'hist',
                'data_quantities': {'x': self._graph.xdata},
                'graph_properties': {'bins': 50}
            }
        ],
        query=in_sets
    )
else:
    frame = pyu.plot(
        datasets,
        unique_keys=[dataset.group_by],
        query=in_sets
    )

The below details a bit more about what I'm doing prior to calling the graph, as I feel there is scope within this to drastically improve performance by offering a path around the bottleneck. Each original dataframe starts out with 19 columns * n rows, out of which I initially select only 3 columns, one of which is unique. Once selected, the 4 resulting datasets are merged into a single dataframe containing 5 columns: the unique gene_name plus one value column per set.
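To make that pre-processing concrete, here is a rough sketch under assumptions drawn from the rest of this thread: four raw 19-column dataframes (raw_a … raw_d, placeholder names), a shared gene_name key, and a start value column kept per set, giving the 5-column merged frame described above:

from functools import reduce
import pandas as pd

# Hypothetical raw inputs: four 19-column dataframes sharing a gene_name key.
raw = {'file1': raw_a, 'file2': raw_b, 'file3': raw_c, 'file4': raw_d}

# Keep only the join column plus one value column, renamed per set so the
# merged frame ends up as gene_name + one start_<set> column per set.
reduced = [
    df[['gene_name', 'start']].rename(columns={'start': 'start_%s' % name})
    for name, df in raw.items()
]

# Single full outer join across the four reduced dataframes on the unique key.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='gene_name', how='outer'),
    reduced,
)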
Within my sample set, this gives me the 22GiB dataframe I mentioned earlier. In this instance memory isn't too much of a problem: I'm building on a server with 128GiB RAM and pandas completes the merge in 65GiB. However, before I can pass it in to py-upset, the single dataframe needs splitting again (fan in, fan out), only to be re-joined back inside py-upset on gene_name as the unique key. At this point it becomes very costly, and even attempting to carry out a second merge cost over 128GiB RAM. One possible approach, whilst it wouldn't solve all of the memory issues, would solve some of the performance issues: allow the previously merged dataframe to be passed in directly, in place of the dictionary of separate datasets. Performance can then be improved by working column-wise across a single dataframe rather than set-wise across multiple dataframes. This can be achieved by looping over the combinations of set names and then selecting a row count from the dataframe for each combination, as sketched below.
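A minimal sketch of that column-wise counting, continuing from the merged frame built in the sketch above (set and column names remain placeholders); each combination is counted where its columns are non-null and the remaining columns are null, which is the kind of exclusive intersection an UpSet plot typically reports:

from itertools import combinations

# Hypothetical mapping from set name to its value column in the merged frame.
set_columns = {
    'file1': 'start_file1',
    'file2': 'start_file2',
    'file3': 'start_file3',
    'file4': 'start_file4',
}

counts = {}
for r in range(1, len(set_columns) + 1):
    for combo in combinations(set_columns, r):
        in_cols = [set_columns[name] for name in combo]
        out_cols = [col for name, col in set_columns.items() if name not in combo]
        # Rows present in every set of the combination...
        mask = merged[in_cols].notna().all(axis=1)
        # ...and absent from every set outside it.
        if out_cols:
            mask &= merged[out_cols].isna().all(axis=1)
        counts[combo] = int(mask.sum())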
Of course, the capability of merging / indexing would still need to exist inside py-upset for those instances where a user doesn't wish to carry out a full merge ahead of the call to plot, but the onus would then be put back on the user to handle memory exceptions as they occur, rather than risking an index taking several days as is presently the case. Looking at your concern regarding subset highlighting, I think this too can be achieved without relying on the full datasets in most instances where a single merge table is used, by treating the non-unique column names as the dataset names, or again by allowing the set names to be specified via an API call or as a dictionary element, e.g.

pyu.plot(
    {'dictionary': dataset, 'names': ['file1', 'file2', 'file3', 'file4']},
    unique_keys=[column_id_or_name],
    ...
)

Regards
Has there been any fix for the performance? I need to plot data for data sources containing over 200 million records. Is there any workaround for this?
It looks like no one is interested in actively updating this repository.
When plotting data sets with 2m or more rows, the performance of PyUpset is severely degraded.
During testing, each data set has 2 columns: the first is a text field containing a name, used as the joining column; the second is a numeric value.
These datasets were generated from a full outer join on 4 other datasets, which took approximately 1m 30s to create. By contrast, pyupset took > 26 hours counting indexes before being killed.
The issue seems to come down to the method extract_intersection_data (visualisation.py, lines 564-574), which currently uses pd.Index to check for combinations. Unless I've misunderstood the comments at the bottom of the file, these lines are already marked for refactoring to use a merge? Do you have any details of how you were planning this merge (left, right, inner, outer) and when such an amendment might be available?
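For illustration only (this is not pyupset's actual code), a merge-based check of a single pairwise intersection could look like the sketch below, with set_a, set_b and the key column name as placeholders:

import pandas as pd

def intersection_size(set_a, set_b, key='gene_name'):
    # Inner merge on de-duplicated keys keeps only the keys present in both
    # dataframes, so the row count of the result is the intersection size.
    left = set_a[[key]].drop_duplicates()
    right = set_b[[key]].drop_duplicates()
    return len(pd.merge(left, right, on=key, how='inner'))

Extending this across every combination of sets is where the cost grows, which is presumably why the choice of merge type matters.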
Memory warning
It should be noted that even for medium-sized datasets, a merge will likely run into memory issues. For example, a full outer join on the 4 datasets (details below) used to produce the row count above resulted in memory utilisation of >64GB across 12 columns, with the resulting 5-column dataframe utilising 21GB of memory.
In the counts below, each dataset had 3 columns:

name | start | end

Resulting dataframe = name, (start * 4) x 2816085 rows == 22121.8MB

With the caveat of memory not being an issue, one way to overcome the performance degradation is to create an outer merge for each set combination and then query where the value of the non-unique columns != np.nan. Thoughts?
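As a sketch of that suggestion, under the assumption that each dataset has a name join column and a start value column (placeholder names throughout): build an outer merge per set combination, then count the rows where no value column is NaN, i.e. rows present in every set of the combination:

from functools import reduce
from itertools import combinations
import pandas as pd

# Hypothetical input sets, each with a 'name' join column and a 'start' value column.
datasets = {'file1': df1, 'file2': df2, 'file3': df3, 'file4': df4}

counts = {}
for r in range(2, len(datasets) + 1):
    for combo in combinations(datasets, r):
        # Outer merge the chosen sets on the join column, suffixing value
        # columns with the set name so they stay distinguishable.
        merged = reduce(
            lambda left, right: pd.merge(left, right, on='name', how='outer'),
            (datasets[name].add_suffix('_' + name)
                           .rename(columns={'name_' + name: 'name'})
             for name in combo),
        )
        value_cols = [c for c in merged.columns if c != 'name']
        # Rows where no value column is NaN occur in every set of the combination.
        counts[combo] = int(merged[value_cols].notna().all(axis=1).sum())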