Apply UMAP to the SAE features #1
Comments
Sure. I'm currently occupied for the next couple of days, but I can share the snippets for the implementation. If you're constrained by resources, I would recommend training a UMAP projector on, say, 15% of your data and using that to project the rest of the vectors.
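The subsample-and-project pattern suggested above can be sketched as follows. This is a minimal, dependency-free illustration: the UMAP projector is stood in for by a toy Gaussian random projection (NumPy only), since any fit-then-transform reducer follows the same shape; with cuML you would instead call `UMAP.fit` on the subsample and `UMAP.transform` on the remainder. The sizes and the 15% fraction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in the real setting this would be the SAE feature matrix.
n_samples, dim, out_dim = 10_000, 64, 2
X = rng.standard_normal((n_samples, dim)).astype(np.float32)

class RandomProjector:
    """Toy fit/transform reducer standing in for a UMAP projector."""
    def fit(self, X, out_dim=2, rng=None):
        rng = rng or np.random.default_rng()
        # A real UMAP fit learns structure from X; here we only record
        # a random projection matrix of matching input width.
        self.W = rng.standard_normal((X.shape[1], out_dim)).astype(X.dtype)
        return self
    def transform(self, X):
        return X @ self.W

# 1) Fit the projector on ~15% of the data ...
subset = rng.choice(n_samples, size=int(0.15 * n_samples), replace=False)
projector = RandomProjector().fit(X[subset], out_dim=out_dim, rng=rng)

# 2) ... then project the full dataset, in chunks to bound peak memory.
reduced = np.concatenate(
    [projector.transform(chunk) for chunk in np.array_split(X, 10)]
)
print(reduced.shape)  # (10000, 2)
```

The key point is that only the fit step sees the subsample; the (cheaper) transform step can then stream over the full matrix in chunks.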
Hi! I came across this issue due to the cuML / UMAP reference. I work on accelerated data science at NVIDIA. We recently significantly improved both the performance and scalability of GPU-accelerated UMAP in RAPIDS cuML. As long as you can fit the full dataset within CPU memory, you should now be able to use our new (opt-in) clustering-based batching technique to process datasets that would otherwise cause your GPU to OOM. Happy to provide more info if interested.
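For reference, enabling the opt-in batched path looks roughly like the sketch below. The parameter names (`build_algo`, `build_kwds` with `nnd_n_clusters`, and `data_on_host`) follow the RAPIDS cuML release blog and should be treated as assumptions to verify against your installed cuML version; the import is guarded so the sketch reads (and runs) without a GPU environment.

```python
# Sketch of cuML's opt-in batched UMAP (parameter names per the RAPIDS
# release blog; verify against your installed cuML version).
try:
    from cuml.manifold import UMAP
except ImportError:
    UMAP = None  # cuML requires an NVIDIA GPU environment

params = dict(
    n_neighbors=15,
    build_algo="nn_descent",            # batched path uses NN-descent
    build_kwds={"nnd_n_clusters": 16},  # more clusters -> smaller GPU batches
)

if UMAP is not None:
    umap = UMAP(**params)
    # data_on_host=True keeps the full matrix in CPU RAM; only one
    # cluster at a time is copied to the GPU:
    # embedding = umap.fit_transform(X, data_on_host=True)
```

The design intent is that CPU RAM, not GPU VRAM, bounds the dataset size, as long as each individual cluster still fits on the GPU.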
@RE-N-Y Thanks for your response. I worked on the sparse sae_feature with a shape of (2091432, 1536*32), using an RTX 3090 GPU to perform UMAP.

@beckernick Thanks for your comments. I came across the NVIDIA blog about the cuML/UMAP batch solution at this link. However, with the following configuration:

I can fit the full dataset within CPU memory, but I still encountered an OOM error during the UMAP fit. Any advice would be appreciated.
Sorry to hear that! Each individual cluster needs to fit in GPU memory, which might be an issue here if I'm correctly interpreting your dataset size as ~400 GB (2091432 * 1536 * 32 elements * 4 bytes per element (fp32)). The NN-descent algorithm doesn't support sparse inputs (and we use it in the batched approach), and using only 4 clusters for 400 GB would overwhelm the 3090 GPU. Could you file a cuML issue and include your system info (CPU/GPU memory, etc.), total dataset size, and your Python environment?
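The back-of-envelope arithmetic behind the ~400 GB estimate in the comment above:

```python
# Dense fp32 footprint of a (2091432, 1536*32) feature matrix.
rows = 2_091_432
cols = 1536 * 32          # 49,152 features
bytes_per_elem = 4        # fp32

total_bytes = rows * cols * bytes_per_elem
print(f"{total_bytes / 1e9:.0f} GB")     # ~411 GB (decimal)
print(f"{total_bytes / 2**30:.0f} GiB")  # ~383 GiB (binary)
# Either way, this is far beyond the 24 GB of an RTX 3090, so each
# batch/cluster must cover only a small fraction of the data.
```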
@lc82111 Here's the snippet for UMAP:

```python
import cuml
from cuml import UMAP

umap = UMAP(n_neighbors=15, n_components=2, metric='cosine', min_dist=0.05)
reduced = umap.fit_transform(embeddings)
```

and here's the one for HDBSCAN:

```python
import cuml
from cuml import HDBSCAN, UMAP

scan = HDBSCAN()
clusters = scan.fit_predict(embeddings)
clusters.max()
```

On a sidenote @beckernick, thank you sooo much for
@RE-N-Y Thanks!
Since the issue doesn't seem specific to the SAE implementation, I'll close it for now. Feel free to re-open anytime if needed.
UMAP Implementation for Large-Scale SAE Features
Thank you for sharing the SAE training codebase. I noticed that the cuML UMAP implementation for dimensionality reduction is missing.
Issue Description
When working with extremely large SAE feature matrices (N samples × over-complete features), I encounter GPU out-of-memory errors during UMAP processing.
Request
Could you please share the UMAP implementation code that handles large-scale feature matrices efficiently?