
Update UMAP notebook, test precomputed_knn, write data to parquet #813

Merged: 10 commits into NVIDIA:branch-24.12 on Dec 31, 2024

Conversation

@rishic3 (Collaborator) commented Dec 25, 2024:

  • Updated the UMAP notebook to run with PySpark + Jupyter, aligning with the documented instructions.
  • Added a test and functionality for precomputed_knn.
  • Updated the model writer/reader to save array attributes to parquet via a Spark DataFrame (sketched below).
  • Lowered error thresholds to catch perf regressions; based on empirical tests, thresholds are now set to 3 standard deviations above the mean error.
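
The parquet write path stores each array attribute through a Spark DataFrame. A minimal sketch of the single-row approach for a dense array, with illustrative names (the merged code may differ):

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType,
    FloatType,
    IntegerType,
    StructField,
    StructType,
)

spark = SparkSession.builder.getOrCreate()

def write_dense_array(array: np.ndarray, df_path: str) -> None:
    # Flatten the array into a single row, keeping the shape so the
    # reader can restore the original dimensions.
    schema = StructType(
        [
            StructField("data", ArrayType(FloatType(), False), False),
            StructField("shape", ArrayType(IntegerType(), False), False),
        ]
    )
    data_df = spark.createDataFrame(
        [(array.ravel().tolist(), list(array.shape))], schema=schema
    )
    data_df.write.parquet(df_path)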

@rishic3 marked this pull request as ready for review December 25, 2024 07:50
@rishic3 (Collaborator Author) commented Dec 25, 2024:

build

@rishic3 (Collaborator Author) commented Dec 26, 2024:

build

StructField("shape", ArrayType(IntegerType(), False), False),
]
)
data_df = spark.createDataFrame(
A Collaborator commented on this excerpt:

By putting all the data in a single row, what are the limits on data sizes, here and below for the dense case? These limits should be added to the documentation.

An alternative is to create one or more regular DataFrames, save them across multiple files/subfolders, and add a column to order/sort by so the original order can be restored (sketched below).
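
A hedged sketch of that alternative, with illustrative names: one array row per Spark row, plus an explicit row_id column so the order survives any shuffling.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_dense_array_rows(array: np.ndarray, df_path: str) -> None:
    # One Spark row per array row; row_id records the original order.
    rows = [(i, row.tolist()) for i, row in enumerate(array)]
    spark.createDataFrame(rows, schema=["row_id", "data"]).write.parquet(df_path)

def read_dense_array_rows(df_path: str) -> np.ndarray:
    # Sorting by row_id undoes any shuffle introduced on write or read.
    df = spark.read.parquet(df_path).orderBy("row_id")
    return np.array([r["data"] for r in df.collect()])

This avoids any single-row size limit, since each Spark row holds only one row of the array.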

Excerpt from the diff under review (UMAP constructed with a precomputed kNN):

    num_workers=gpu_number,
    metric="sqeuclidean",
    random_state=random_state,
    precomputed_knn=precomputed_knn,
A Collaborator commented on this excerpt:

Can we convert back to NumPy here, to emphasize that NumPy is fine at this point, in keeping with our goal of not requiring a GPU on the driver?
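
A sketch of that suggestion, under the assumption that precomputed_knn accepts a pair of (indices, distances) host arrays; parameter names mirror the excerpt above, and the tiny kNN arrays are stand-ins for the output of a real neighbor search:

import numpy as np
from spark_rapids_ml.umap import UMAP

def to_numpy(arr):
    # CuPy arrays expose .get() to copy device memory to host;
    # NumPy arrays pass through unchanged.
    return arr.get() if hasattr(arr, "get") else np.asarray(arr)

# Stand-ins for a prior (possibly GPU) kNN computation.
knn_indices = np.array([[0, 1], [1, 0], [2, 1]])
knn_dists = np.array([[0.0, 0.5], [0.0, 0.5], [0.0, 0.7]])

umap = UMAP(
    num_workers=1,
    metric="sqeuclidean",
    random_state=42,
    precomputed_knn=(to_numpy(knn_indices), to_numpy(knn_dists)),
)

Passing host-side NumPy arrays makes it explicit that the driver does not need a GPU.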

@rishic3 (Collaborator Author) commented Dec 31, 2024:

build

Excerpt from the diff under review (sparse-array writer):

def write_sparse_array(array: scipy.sparse.spmatrix, df_dir: str) -> None:
    indptr_df = spark.createDataFrame(array.indptr, schema=["indptr"])
    indices_data_df = spark.createDataFrame(
        pd.DataFrame({"indices": array.indices, "data": array.data})
A Collaborator commented on this excerpt:

Better to add the row_id here, in the pandas DataFrame, in case rows are shuffled when the Spark DataFrame is created (see the sketch below).
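
A sketch of that fix, reusing array and spark from the excerpt above; the row_id column name is illustrative:

import numpy as np
import pandas as pd

# Tag each CSR entry with an explicit row_id so the original order can be
# restored even if rows are shuffled while building the Spark DataFrame.
indices_data_df = spark.createDataFrame(
    pd.DataFrame(
        {
            "row_id": np.arange(len(array.indices)),
            "indices": array.indices,
            "data": array.data,
        }
    )
)
# On read, .orderBy("row_id") restores the original ordering.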

@rishic3 (Collaborator Author) commented Dec 31, 2024:

build

@eordentlich (Collaborator) left a review:

👍

@rishic3 merged commit 9c5df37 into NVIDIA:branch-24.12 on Dec 31, 2024
3 checks passed