Update UMAP notebook, test precomputed_knn, write data to parquet #813
Conversation
rishic3 commented on Dec 25, 2024
- Updated the UMAP notebook to run with PySpark + Jupyter, aligning with the documented instructions.
- Added a test and functionality for precomputed_knn.
- Updated the model writer/reader to save array attributes to parquet via a Spark DataFrame.
- Lowered the error thresholds used to track performance regressions, based on empirical tests (set to +3 standard deviations above the mean error; see the sketch after the description).
Signed-off-by: Rishi Chandra <[email protected]>
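A minimal sketch of how a +3σ regression threshold could be derived from repeated runs. The error values below are made up for illustration; this is not the PR's actual test code:

```python
import numpy as np

# Hypothetical per-run errors collected from repeated empirical tests.
errors = np.array([0.021, 0.019, 0.023, 0.020, 0.022])

# Regression threshold: mean error plus three standard deviations.
threshold = errors.mean() + 3 * errors.std()
```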
python/src/spark_rapids_ml/umap.py (Outdated)
StructField("shape", ArrayType(IntegerType(), False), False), | ||
] | ||
) | ||
data_df = spark.createDataFrame( |
By putting all the data in a single row, what are the limits on data sizes? (Here and below for the dense case.) These limits should be added to the documentation.
An alternative is to create one or more regular DataFrames, save them in multiple files/subfolders, and add a column to order/sort by so the original order can be restored, as in the sketch below.
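A rough sketch of that alternative for the dense case. Function names and the exact layout are illustrative, not the PR's code; `spark` is an active SparkSession:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_dense_array(array: np.ndarray, df_dir: str) -> None:
    # One output row per array row; row_id preserves the original order
    # even if Spark shuffles rows when writing/reading parquet.
    rows = [(i, row.tolist()) for i, row in enumerate(array)]
    spark.createDataFrame(rows, schema=["row_id", "data"]).write.parquet(df_dir)

def read_dense_array(df_dir: str) -> np.ndarray:
    # Sort by row_id on read to restore the original row order.
    df = spark.read.parquet(df_dir).sort("row_id")
    return np.array([r["data"] for r in df.collect()])
```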
```python
    num_workers=gpu_number,
    metric="sqeuclidean",
    random_state=random_state,
    precomputed_knn=precomputed_knn,
```
Can we convert back to numpy here, to emphasize that numpy is OK and in keeping with our design of not requiring a GPU on the driver? (See the sketch below.)
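A hypothetical snippet of what that conversion could look like, assuming the precomputed knn is a (distances, indices) pair of cupy arrays; the random inputs are stand-ins for illustration:

```python
import cupy as cp
import numpy as np

# Hypothetical stand-ins for a GPU-computed knn graph: a (distances, indices)
# pair of cupy arrays, e.g. as produced by a GPU nearest-neighbors run.
knn_dists = cp.random.rand(100, 15).astype(cp.float32)
knn_indices = cp.random.randint(0, 100, size=(100, 15))

# Convert back to host numpy arrays before passing to the estimator,
# so no GPU is needed on the driver.
precomputed_knn = (cp.asnumpy(knn_dists), cp.asnumpy(knn_indices))
assert all(isinstance(a, np.ndarray) for a in precomputed_knn)
```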
python/src/spark_rapids_ml/umap.py (Outdated)
```python
def write_sparse_array(array: scipy.sparse.spmatrix, df_dir: str) -> None:
    indptr_df = spark.createDataFrame(array.indptr, schema=["indptr"])
    indices_data_df = spark.createDataFrame(
        pd.DataFrame({"indices": array.indices, "data": array.data})
```
Better to add the row_id here in the pandas DataFrame, in case rows are shuffled when creating the Spark DataFrame; a sketch follows.
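A minimal sketch of that suggestion, illustrative only; the random CSR matrix and variable names are assumptions, not the PR's code:

```python
import numpy as np
import pandas as pd
import scipy.sparse
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical CSR matrix standing in for the sparse model attribute.
array = scipy.sparse.random(100, 50, density=0.1, format="csr")

# Attach an explicit row_id before handing the pandas DataFrame to Spark,
# so the original element order survives any shuffle.
pdf = pd.DataFrame(
    {
        "row_id": np.arange(len(array.indices)),
        "indices": array.indices,
        "data": array.data,
    }
)
indices_data_df = spark.createDataFrame(pdf)

# On read, sort by row_id to restore the order of indices/data.
restored = indices_data_df.sort("row_id").toPandas()
```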
👍