Apply UMAP to the SAE features #1
Comments
Sure. I'm currently occupied for the next couple of days, but I can share the snippets for the implementation. If you're constrained by resources, I would recommend training a UMAP projector on, say, 15% of your data and using that to project the rest of the vectors.
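The subsample-and-project pattern suggested above can be sketched as follows. This is a minimal, dependency-free illustration: the UMAP projector is stood in for by a toy Gaussian random projection (NumPy only), since any fit-then-transform reducer follows the same shape; with cuML you would instead call `UMAP.fit` on the subsample and `UMAP.transform` on the remainder. The sizes and the 15% fraction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in the real setting this would be the SAE feature matrix.
n_samples, dim, out_dim = 10_000, 64, 2
X = rng.standard_normal((n_samples, dim)).astype(np.float32)

class RandomProjector:
    """Toy fit/transform reducer standing in for a UMAP projector."""
    def fit(self, X, out_dim=2, rng=None):
        rng = rng or np.random.default_rng()
        # A real UMAP fit learns structure from X; here we only record
        # a random projection matrix of matching input width.
        self.W = rng.standard_normal((X.shape[1], out_dim)).astype(X.dtype)
        return self
    def transform(self, X):
        return X @ self.W

# 1) Fit the projector on ~15% of the data ...
subset = rng.choice(n_samples, size=int(0.15 * n_samples), replace=False)
projector = RandomProjector().fit(X[subset], out_dim=out_dim, rng=rng)

# 2) ... then project the full dataset, in chunks to bound peak memory.
reduced = np.concatenate(
    [projector.transform(chunk) for chunk in np.array_split(X, 10)]
)
print(reduced.shape)  # (10000, 2)
```

The key point is that only the fit step sees the subsample; the (cheaper) transform step can then stream over the full matrix in chunks.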
Hi! I came across this issue due to the cuML / UMAP reference. I work on accelerated data science at NVIDIA. We recently significantly improved both the performance and scalability of GPU-accelerated UMAP in RAPIDS cuML. As long as you can fit the full dataset within CPU memory, you should now be able to use our new (opt-in) clustering-based batching technique to process datasets that would otherwise cause your GPU to OOM. Happy to provide more info if interested.
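For reference, enabling the opt-in batched path looks roughly like the sketch below. The parameter names (`build_algo`, `build_kwds` with `nnd_n_clusters`, and `data_on_host`) follow the RAPIDS cuML release blog and should be treated as assumptions to verify against your installed cuML version; the import is guarded so the sketch reads (and runs) without a GPU environment.

```python
# Sketch of cuML's opt-in batched UMAP (parameter names per the RAPIDS
# release blog; verify against your installed cuML version).
try:
    from cuml.manifold import UMAP
except ImportError:
    UMAP = None  # cuML requires an NVIDIA GPU environment

params = dict(
    n_neighbors=15,
    build_algo="nn_descent",            # batched path uses NN-descent
    build_kwds={"nnd_n_clusters": 16},  # more clusters -> smaller GPU batches
)

if UMAP is not None:
    umap = UMAP(**params)
    # data_on_host=True keeps the full matrix in CPU RAM; only one
    # cluster at a time is copied to the GPU:
    # embedding = umap.fit_transform(X, data_on_host=True)
```

The design intent is that CPU RAM, not GPU VRAM, bounds the dataset size, as long as each individual cluster still fits on the GPU.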
@RE-N-Y Thanks for your response. I worked on the sparse sae_feature with a shape of (2091432, 1536*32), using an RTX 3090 GPU to perform UMAP.

@beckernick Thanks for your comments. I came across the NVIDIA blog about the cuML/UMAP batch solution at this link. However, with the following configuration:

I can fit the full dataset within CPU memory, but I still encountered an OOM error during the UMAP fit. Any advice would be appreciated.
Sorry to hear that! Each individual cluster needs to fit in GPU memory, which might be an issue here if I'm correctly interpreting your dataset size as ~400 GB (2091432 * 1536 * 32 elements * 4 bytes per element (fp32)). The NN-descent algorithm doesn't support sparse inputs (and we use it in the batched approach), and using only 4 clusters for 400 GB would overwhelm the 3090 GPU. Could you file a cuML issue and include your system info (CPU/GPU memory, etc.), total dataset size, and your Python environment?
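The back-of-envelope arithmetic behind the ~400 GB estimate in the comment above:

```python
# Dense fp32 footprint of a (2091432, 1536*32) feature matrix.
rows = 2_091_432
cols = 1536 * 32          # 49,152 features
bytes_per_elem = 4        # fp32

total_bytes = rows * cols * bytes_per_elem
print(f"{total_bytes / 1e9:.0f} GB")     # ~411 GB (decimal)
print(f"{total_bytes / 2**30:.0f} GiB")  # ~383 GiB (binary)
# Either way, this is far beyond the 24 GB of an RTX 3090, so each
# batch/cluster must cover only a small fraction of the data.
```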
@lc82111 Here's the snippet for UMAP:

```python
import cuml
from cuml import UMAP

umap = UMAP(n_neighbors=15, n_components=2, metric='cosine', min_dist=0.05)
reduced = umap.fit_transform(embeddings)
```

and here's the one for HDBSCAN:

```python
import cuml
from cuml import HDBSCAN, UMAP

scan = HDBSCAN()
clusters = scan.fit_predict(embeddings)
clusters.max()
```

On a sidenote @beckernick, thank you sooo much for
@RE-N-Y Thanks!
Since the issue doesn't seem specific to the SAE implementation, I'll close it for now. Feel free to re-open anytime if needed.
UMAP Implementation for Large-Scale SAE Features
Thank you for sharing the SAE training codebase. I noticed that the cuML UMAP implementation for dimensionality reduction is missing.
Issue Description
When working with extremely large SAE feature matrices (N samples × over-complete features), I encounter GPU out-of-memory errors during UMAP processing.
Request
Could you please share the UMAP implementation code that handles large-scale feature matrices efficiently?