
Apply UMAP to the SAEs features #1

Closed
lc82111 opened this issue Dec 5, 2024 · 7 comments

lc82111 commented Dec 5, 2024

UMAP Implementation for Large-Scale SAE Features

Thank you for sharing the SAE training codebase. I noticed that the cuML UMAP implementation used for dimensionality reduction is missing from the repo.

Issue Description

When working with extremely large SAE feature matrices (N samples × over-complete features), I encounter GPU out-of-memory errors during UMAP processing.

Request

Could you please share the UMAP implementation code that handles large-scale feature matrices efficiently?

RE-N-Y (Owner) commented Dec 5, 2024

Sure. I'm occupied for the next couple of days, but I can share the snippets for the implementation.
Out of curiosity, how large are your matrices? In my case, the standard cuML UMAP + HDBSCAN implementation was already performant. What GPU are you running this on? I've used a single H100 to do the processing.

If you're constrained by resources, I would recommend training a UMAP projector on, say, 15% of your data and using it to project the rest of the vectors.
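
A minimal sketch of that subsample-then-project approach (the array size and names below are illustrative, not from the repo; assumes a dense float32 matrix):

import numpy as np
from cuml import UMAP

# Illustrative stand-in for the real SAE feature matrix.
embeddings = np.random.rand(100_000, 256).astype(np.float32)

# Fit the projector on a random ~15% of the rows...
rng = np.random.default_rng(0)
subset = rng.choice(embeddings.shape[0], size=int(0.15 * embeddings.shape[0]), replace=False)
umap = UMAP(n_neighbors=15, n_components=2)
umap.fit(embeddings[subset])

# ...then project every vector with the trained model.
reduced = umap.transform(embeddings)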

beckernick commented Dec 6, 2024

> When working with extremely large SAE feature matrices (N samples × over-complete features), I encounter GPU out-of-memory errors during UMAP processing.

Hi! I came across this issue due to the cuML / UMAP reference. I work on accelerated data science at NVIDIA.

We recently made significant improvements to both the performance and scalability of GPU-accelerated UMAP in RAPIDS cuML.

As long as you can fit the full dataset within CPU memory, you should now be able to use our new (opt-in) clustering-based batching technique to process datasets that would otherwise cause your GPU to OOM.

Happy to provide more info, if interested.

lc82111 (Author) commented Dec 6, 2024

@RE-N-Y Thanks for your response. I'm working with a sparse sae_feature matrix of shape (2091432, 1536*32), running UMAP on an RTX 3090 GPU.

@beckernick Thanks for your comments. I came across the NVIDIA blog post about the cuML UMAP batching solution. However, with the following configuration:

# Batched NN Descent with 4 clusters (opt-in batching)
from cuml import UMAP

umap = UMAP(n_neighbors=16, build_algo="nn_descent",
            build_kwds={"nnd_do_batch": True, "nnd_n_clusters": 4})
emb = umap.fit_transform(data, data_on_host=True)  # keep the full dataset in host memory

I can fit the full dataset in CPU memory, but I still hit an OOM error during the UMAP fit. Any advice would be appreciated.

beckernick commented Dec 6, 2024

Sorry to hear that! Each individual cluster needs to fit in GPU memory, which might be an issue here if I'm correctly interpreting your dataset size as ~400GB (2091432 * 1536 * 32 elements * 4 bytes per element (fp32)).

The nn-descent algorithm doesn't support sparse inputs (and we use it in the batched approach), and using only 4 clusters for ~400GB would overwhelm the 3090: each cluster would still be roughly 100GB, far more than the card's 24GB of VRAM.
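
As a quick back-of-the-envelope check of those numbers (assuming a dense fp32 matrix and the RTX 3090's 24GB of VRAM):

# Dense fp32 footprint of the feature matrix vs. the per-cluster GPU budget.
n_samples, n_features = 2_091_432, 1536 * 32
total_bytes = n_samples * n_features * 4    # fp32 = 4 bytes per element
print(total_bytes / 1e9)                    # ~411 GB in total
print(total_bytes / 4 / 1e9)                # ~103 GB per cluster with 4 clusters, vs. 24 GB of VRAM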

Could you file a cuML issue and also include your system info (CPU/GPU memory, etc.), total dataset size, and your Python environment in the issue?

RE-N-Y (Owner) commented Dec 7, 2024

@lc82111
Here's the exact snippet I've used

from cuml import UMAP

# Project to 2-D with cosine distance.
umap = UMAP(n_neighbors=15, n_components=2, metric='cosine', min_dist=0.05)
reduced = umap.fit_transform(embeddings)

and here's the one for HDBSCAN

from cuml import HDBSCAN

scan = HDBSCAN()
clusters = scan.fit_predict(embeddings)
clusters.max()  # highest cluster label: labels run 0..k-1, noise points are -1

where embeddings is a num_samples × dimension array.

On a side note, @beckernick, thank you so much for cuML. My visualizations would take a decade if it weren't for you guys.

lc82111 (Author) commented Dec 8, 2024

@RE-N-Y Thanks!
@beckernick The issue has been filed, thanks.

RE-N-Y (Owner) commented Dec 9, 2024

Since the issue doesn't seem specific to the SAE implementation, I'll close it for now. Feel free to re-open anytime if needed.
