
Update UMAP notebook, test precomputed_knn, write data to parquet #813

Merged: 10 commits into NVIDIA:branch-24.12 on Dec 31, 2024

Conversation

@rishic3 (Collaborator) commented Dec 25, 2024:

  • Updated the UMAP notebook to run with PySpark + Jupyter, aligning with the documented instructions.
  • Added a test and functionality for precomputed_knn.
  • Updated the model writer/reader to save array attributes to parquet via a Spark DataFrame (sketched below).
  • Lowered error thresholds to catch perf regressions; based on empirical tests, thresholds are now set to 3 standard deviations above the mean error.
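
The parquet write path stores each array attribute through a Spark DataFrame. A minimal sketch of the single-row approach for a dense array, with illustrative names (the merged code may differ):

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType,
    FloatType,
    IntegerType,
    StructField,
    StructType,
)

spark = SparkSession.builder.getOrCreate()

def write_dense_array(array: np.ndarray, df_path: str) -> None:
    # Flatten the array into a single row, keeping the shape so the
    # reader can restore the original dimensions.
    schema = StructType(
        [
            StructField("data", ArrayType(FloatType(), False), False),
            StructField("shape", ArrayType(IntegerType(), False), False),
        ]
    )
    data_df = spark.createDataFrame(
        [(array.ravel().tolist(), list(array.shape))], schema=schema
    )
    data_df.write.parquet(df_path)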

@rishic3 marked this pull request as ready for review December 25, 2024 07:50
@rishic3 (Collaborator Author) commented Dec 25, 2024:

build

@rishic3 (Collaborator Author) commented Dec 26, 2024:

build

StructField("shape", ArrayType(IntegerType(), False), False),
]
)
data_df = spark.createDataFrame(
A Collaborator commented on this excerpt:

By putting all the data in a single row, what are the limits on data sizes, here and below for the dense case? These limits should be added to the documentation.

An alternative is to create one or more regular DataFrames, save them across multiple files/subfolders, and add a column to order/sort by so the original order can be restored (sketched below).
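
A hedged sketch of that alternative, with illustrative names: one array row per Spark row, plus an explicit row_id column so the order survives any shuffling.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_dense_array_rows(array: np.ndarray, df_path: str) -> None:
    # One Spark row per array row; row_id records the original order.
    rows = [(i, row.tolist()) for i, row in enumerate(array)]
    spark.createDataFrame(rows, schema=["row_id", "data"]).write.parquet(df_path)

def read_dense_array_rows(df_path: str) -> np.ndarray:
    # Sorting by row_id undoes any shuffle introduced on write or read.
    df = spark.read.parquet(df_path).orderBy("row_id")
    return np.array([r["data"] for r in df.collect()])

This avoids any single-row size limit, since each Spark row holds only one row of the array.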

Excerpt from the diff under review (UMAP constructed with a precomputed kNN):

    num_workers=gpu_number,
    metric="sqeuclidean",
    random_state=random_state,
    precomputed_knn=precomputed_knn,
A Collaborator commented on this excerpt:

Can we convert back to NumPy here, to emphasize that NumPy is fine at this point, in keeping with our goal of not requiring a GPU on the driver?
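
A sketch of that suggestion, under the assumption that precomputed_knn accepts a pair of (indices, distances) host arrays; parameter names mirror the excerpt above, and the tiny kNN arrays are stand-ins for the output of a real neighbor search:

import numpy as np
from spark_rapids_ml.umap import UMAP

def to_numpy(arr):
    # CuPy arrays expose .get() to copy device memory to host;
    # NumPy arrays pass through unchanged.
    return arr.get() if hasattr(arr, "get") else np.asarray(arr)

# Stand-ins for a prior (possibly GPU) kNN computation.
knn_indices = np.array([[0, 1], [1, 0], [2, 1]])
knn_dists = np.array([[0.0, 0.5], [0.0, 0.5], [0.0, 0.7]])

umap = UMAP(
    num_workers=1,
    metric="sqeuclidean",
    random_state=42,
    precomputed_knn=(to_numpy(knn_indices), to_numpy(knn_dists)),
)

Passing host-side NumPy arrays makes it explicit that the driver does not need a GPU.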

@rishic3 (Collaborator Author) commented Dec 31, 2024:

build

Excerpt from the diff under review (sparse-array writer):

def write_sparse_array(array: scipy.sparse.spmatrix, df_dir: str) -> None:
    indptr_df = spark.createDataFrame(array.indptr, schema=["indptr"])
    indices_data_df = spark.createDataFrame(
        pd.DataFrame({"indices": array.indices, "data": array.data})
A Collaborator commented on this excerpt:

Better to add the row_id here, in the pandas DataFrame, in case rows are shuffled when the Spark DataFrame is created (see the sketch below).
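
A sketch of that fix, reusing array and spark from the excerpt above; the row_id column name is illustrative:

import numpy as np
import pandas as pd

# Tag each CSR entry with an explicit row_id so the original order can be
# restored even if rows are shuffled while building the Spark DataFrame.
indices_data_df = spark.createDataFrame(
    pd.DataFrame(
        {
            "row_id": np.arange(len(array.indices)),
            "indices": array.indices,
            "data": array.data,
        }
    )
)
# On read, .orderBy("row_id") restores the original ordering.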

@rishic3 (Collaborator Author) commented Dec 31, 2024:

build

@eordentlich (Collaborator) left a review:

👍

@rishic3 merged commit 9c5df37 into NVIDIA:branch-24.12 on Dec 31, 2024
3 checks passed