
DEGA-106-Segmentation-Metrics #36

Open · wants to merge 25 commits into main from DEGA-106-Segmentation-Metrics
Commits (25) · changes shown from 24 commits
7d4e472  setting up boiler plate (cornhundred, Oct 31, 2024)
d9a7241  metrics addition + pre-process nb corrections (jaspreetishar, Nov 15, 2024)
9158935  remove comments (jaspreetishar, Nov 15, 2024)
03b913a  column name fix (jaspreetishar, Nov 15, 2024)
7e85a6c  clean up (jaspreetishar, Nov 15, 2024)
1d4459a  added extra metrics info as comments (jaspreetishar, Nov 19, 2024)
4e39919  included gene specific metrics (jaspreetishar, Nov 20, 2024)
95956a1  remove image intensity calc, clean up segmentation_qc.ipynb (jaspreetishar, Nov 20, 2024)
4e8f4ae  qc module name update (jaspreetishar, Nov 20, 2024)
172a1fa  xenium-default support, remove cells per um2 metric, concise metric n… (jaspreetishar, Nov 22, 2024)
9fd6060  output address fix, notebook changes (jaspreetishar, Nov 26, 2024)
f95d06b  working on custom segmentation (jaspreetishar, Nov 26, 2024)
de4ca6e  Merge branch 'main' into DEGA-106-Segmentation-Metrics (jaspreetishar, Nov 26, 2024)
cf5a68e  disjoint marker script addition in qc (jaspreetishar, Dec 19, 2024)
8eb29bf  segmentation_qc notebook clean up (jaspreetishar, Dec 19, 2024)
92bf616  more qc functions for default xenium results (jaspreetishar, Jan 3, 2025)
5bdd4bc  pulling changes from main to segmentation metrics branch (jaspreetishar, Jan 3, 2025)
2767e74  update branch with main (jaspreetishar, Jan 15, 2025)
57ec0bb  remove redundant functions (jaspreetishar, Jan 16, 2025)
af66df4  notebook check run (jaspreetishar, Jan 16, 2025)
c4f6b91  update with main (jaspreetishar, Jan 16, 2025)
82d95cc  remove algorithm specific, cell type specific lines (jaspreetishar, Jan 17, 2025)
3b65e0e  changes after review (jaspreetishar, Jan 23, 2025)
31a8709  notebook changes (jaspreetishar, Jan 23, 2025)
066a6ac  more review changes (jaspreetishar, Jan 24, 2025)
@@ -2128,7 +2128,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.9.20"
 },
 "widgets": {
 "application/vnd.jupyter.widget-state+json": {
436 changes: 436 additions & 0 deletions notebooks/Segmentation_QC.ipynb
Contributor:
@jaspreetishar These are minor but can enhance readability, simplicity, and, most importantly, reproducibility.

1. For a string that needs to be repeated multiple times (say, more than 2-3 times), it is better to make it a variable so you only need to change it in one place later if needed.
2. Additionally, use a for loop where possible.

For example:

cell_polygon_metadata_files = [
    f"data/segmentation_metrics_data/inputs/{directories[0]}/cell_metadata.parquet",
    f"data/segmentation_metrics_data/inputs/{directories[1]}/cell_metadata.parquet",
    f"data/segmentation_metrics_data/inputs/{directories[2]}/cell_metadata.parquet",
    None,
]

and

paths_output_cell_metrics = [
    f"data/segmentation_metrics_data/outputs/cell_specific_metrics_{dataset_name}-{segmentation_approaches[0]}.csv",
    f"data/segmentation_metrics_data/outputs/cell_specific_metrics_{dataset_name}-{segmentation_approaches[1]}.csv",
    f"data/segmentation_metrics_data/outputs/cell_specific_metrics_{dataset_name}-{segmentation_approaches[2]}.csv",
    f"data/segmentation_metrics_data/outputs/cell_specific_metrics_{dataset_name}-{segmentation_approaches[3]}.csv",
]

can be simplified as:

cell_polygon_metadata_files = [f'{seg_qc_dir}/inputs/{d}/cell_metadata.parquet' for d in directories[:3]] + [None]

paths_output_cell_metrics = [f'{seg_qc_dir}/outputs/cell_specific_metrics_{dataset_name}-{sa}.csv' for sa in segmentation_approaches]
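For illustration, a self-contained sketch of the suggested pattern; the seg_qc_dir value and the dataset, approach, and directory names below are hypothetical placeholders, not values from the notebook:

seg_qc_dir = "data/segmentation_metrics_data"
dataset_name = "skin"  # hypothetical
segmentation_approaches = ["xenium-default", "cellpose", "baysor", "custom"]  # hypothetical
directories = ["skin_xenium_default", "skin_cellpose", "skin_baysor", "skin_custom"]  # hypothetical

# The first three approaches provide a cell_metadata.parquet file; the fourth does not.
cell_polygon_metadata_files = [
    f"{seg_qc_dir}/inputs/{d}/cell_metadata.parquet" for d in directories[:3]
] + [None]

# One output CSV per segmentation approach.
paths_output_cell_metrics = [
    f"{seg_qc_dir}/outputs/cell_specific_metrics_{dataset_name}-{sa}.csv"
    for sa in segmentation_approaches
]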

Contributor:
Thanks for providing the example data. I am getting an error during the 4th iteration of the for loop, presumably from skin_xenium_default. Do you have any idea?

[Screenshot: error traceback, 2025-01-21 3:51 PM]

Contributor:
The error was due to a faulty transformation matrix. I have fixed it and will share it shortly.

Contributor:

Thanks for making the effort to simplify the code. In line with reproducibility, the variable directories can be computed as follows, so you don't need to spell the entries out, and the next person working on a different dataset only needs to change the value of dataset_name instead of four strings: directories = [f"{dataset_name.lower()}_{d.lower().replace('-','_')}" for d in segmentation_approaches]. This one is minor, so you don't have to change the code, but it is a good pattern to get familiar with.
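A quick illustration of that derivation (the dataset and approach names here are hypothetical placeholders):

dataset_name = "Skin"  # hypothetical
segmentation_approaches = ["Xenium-Default", "Cellpose", "Baysor", "Custom"]  # hypothetical
directories = [f"{dataset_name.lower()}_{d.lower().replace('-','_')}" for d in segmentation_approaches]
print(directories)  # ['skin_xenium_default', 'skin_cellpose', 'skin_baysor', 'skin_custom']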

jaspreetishar marked this conversation as resolved.


3 changes: 2 additions & 1 deletion pyproject.toml
@@ -21,7 +21,8 @@ dependencies = [
 "polars~=1.10.0",
 "libpysal~=4.8.1",
 "spatialdata~=0.2.6",
-"spatialdata_io~=0.1.6"
+"spatialdata_io~=0.1.6",
+"alphashape~=1.3.1"
 ]
 readme = "README.md"
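Since the PR pins a new alphashape dependency, a minimal sanity-check sketch of its documented entry point (the points and alpha value here are hypothetical):

import alphashape
from shapely.geometry import Polygon

# Hypothetical points; alpha=0.0 falls back to the convex hull.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.5, 0.25)]
shape = alphashape.alphashape(points, 0.0)
assert isinstance(shape, Polygon)
print(shape.wkt)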
1 change: 1 addition & 0 deletions src/celldega/__init__.py
@@ -2,6 +2,7 @@

 from celldega.viz import Landscape, Matrix
 from celldega.pre import landscape
+from celldega.qc import qc_segmentation
 from celldega.nbhd import alpha_shape

 from celldega.clust import Network
243 changes: 122 additions & 121 deletions src/celldega/pre/boundary_tile.py
@@ -1,4 +1,3 @@
-
 import numpy as np
 import pandas as pd
 import os
@@ -7,6 +6,123 @@
 import geopandas as gpd
 from shapely.geometry import Polygon, MultiPolygon

def numpy_affine_transform(coords, matrix):
"""Apply affine transformation to numpy coordinates."""
# Homogeneous coordinates for affine transformation
coords = np.hstack([coords, np.ones((coords.shape[0], 1))])
transformed_coords = coords @ matrix.T
return transformed_coords[:, :2] # Drop the homogeneous coordinate

def batch_transform_geometries(geometries, transformation_matrix, scale):
"""
Batch transform geometries using numpy for optimized performance.
"""
# Extract affine transformation parameters into a 3x3 matrix for numpy
affine_matrix = np.array([
[transformation_matrix[0, 0], transformation_matrix[0, 1], transformation_matrix[0, 2]],
[transformation_matrix[1, 0], transformation_matrix[1, 1], transformation_matrix[1, 2]],
[0, 0, 1]
])

transformed_geometries = []

for polygon in geometries:
# Extract coordinates and transform them
if isinstance(polygon, MultiPolygon):
polygon = next(polygon.geoms) # Use the first geometry

# Transform the exterior of the polygon
exterior_coords = np.array(polygon.exterior.coords)

# Apply the affine transformation and scale
transformed_coords = numpy_affine_transform(exterior_coords, affine_matrix) / scale

# Append the result to the transformed_geometries list
transformed_geometries.append([transformed_coords.tolist()])

return transformed_geometries

def filter_and_save_fine_boundary(coarse_tile, fine_i, fine_j, fine_tile_x_min, fine_tile_x_max, fine_tile_y_min, fine_tile_y_max, path_output):
cell_ids = coarse_tile.index.values

tile_filter = (
(coarse_tile["center_x"] >= fine_tile_x_min) & (coarse_tile["center_x"] < fine_tile_x_max) &
(coarse_tile["center_y"] >= fine_tile_y_min) & (coarse_tile["center_y"] < fine_tile_y_max)
)
filtered_indices = np.where(tile_filter)[0]

keep_cells = cell_ids[filtered_indices]
fine_tile_cells = coarse_tile.loc[keep_cells, ["GEOMETRY"]]
fine_tile_cells = fine_tile_cells.assign(name=fine_tile_cells.index)

if not fine_tile_cells.empty:
filename = f"{path_output}/cell_tile_{fine_i}_{fine_j}.parquet"
fine_tile_cells.to_parquet(filename)

def process_fine_boundaries(coarse_tile, i, j, coarse_tile_x_min, coarse_tile_x_max, coarse_tile_y_min, coarse_tile_y_max, tile_size, path_output, x_min, y_min, n_fine_tiles_x, n_fine_tiles_y, max_workers):
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = []
for fine_i in range(n_fine_tiles_x):
fine_tile_x_min = x_min + fine_i * tile_size
fine_tile_x_max = fine_tile_x_min + tile_size

if not (fine_tile_x_min >= coarse_tile_x_min and fine_tile_x_max <= coarse_tile_x_max):
continue

for fine_j in range(n_fine_tiles_y):
fine_tile_y_min = y_min + fine_j * tile_size
fine_tile_y_max = fine_tile_y_min + tile_size

if not (fine_tile_y_min >= coarse_tile_y_min and fine_tile_y_max <= coarse_tile_y_max):
continue

futures.append(executor.submit(
filter_and_save_fine_boundary, coarse_tile, fine_i, fine_j, fine_tile_x_min, fine_tile_x_max, fine_tile_y_min, fine_tile_y_max, path_output
))

for future in futures:
future.result()

def get_cell_polygons(technology, path_cell_boundaries, transformation_matrix, path_output=None, image_scale=1, path_meta_cell_micron=None):

# Load cell boundary data based on the technology
if technology == "MERSCOPE":
df_meta = pd.read_parquet(f"{path_output.replace('cell_segmentation','cell_metadata.parquet')}")
entity_to_cell_id_dict = pd.Series(df_meta.index.values, index=df_meta.EntityID).to_dict()
cells_orig = gpd.read_parquet(path_cell_boundaries)
cells_orig['cell_id'] = cells_orig['EntityID'].map(entity_to_cell_id_dict)
cells_orig = cells_orig[cells_orig["ZIndex"] == 1]

# Correct cell_id issues with meta_cell
meta_cell = pd.read_csv(path_meta_cell_micron)
meta_cell['cell_id'] = meta_cell['EntityID'].map(entity_to_cell_id_dict)
cells_orig.index = meta_cell[meta_cell["cell_id"].isin(cells_orig['cell_id'])].index

# Correct 'MultiPolygon' to 'Polygon'
cells_orig["geometry"] = cells_orig["Geometry"].apply(
lambda x: list(x.geoms)[0] if isinstance(x, MultiPolygon) else x
)

cells_orig.set_index('cell_id', inplace=True)

elif technology == "Xenium":
xenium_cells = pd.read_parquet(path_cell_boundaries)
grouped = xenium_cells.groupby("cell_id")[["vertex_x", "vertex_y"]].agg(lambda x: x.tolist())
grouped["geometry"] = grouped.apply(lambda row: Polygon(zip(row["vertex_x"], row["vertex_y"])), axis=1)
cells_orig = gpd.GeoDataFrame(grouped, geometry="geometry")[["geometry"]]

elif technology == "custom":
cells_orig = gpd.read_parquet(path_cell_boundaries)

# Transform geometries
cells_orig["GEOMETRY"] = batch_transform_geometries(cells_orig["geometry"], transformation_matrix, image_scale)

# Convert transformed geometries to polygons and calculate centroids
cells_orig["polygon"] = cells_orig["GEOMETRY"].apply(lambda x: Polygon(x[0]))
gdf_cells = gpd.GeoDataFrame(geometry=cells_orig["polygon"])
gdf_cells["GEOMETRY"] = cells_orig["GEOMETRY"]

return gdf_cells

def make_cell_boundary_tiles(
technology,
Expand Down Expand Up @@ -55,131 +171,16 @@ def make_cell_boundary_tiles(
None
"""

def numpy_affine_transform(coords, matrix):
"""Apply affine transformation to numpy coordinates."""
# Homogeneous coordinates for affine transformation
coords = np.hstack([coords, np.ones((coords.shape[0], 1))])
transformed_coords = coords @ matrix.T
return transformed_coords[:, :2] # Drop the homogeneous coordinate

def batch_transform_geometries(geometries, transformation_matrix, scale):
"""
Batch transform geometries using numpy for optimized performance.
"""
# Extract affine transformation parameters into a 3x3 matrix for numpy
affine_matrix = np.array([
[transformation_matrix[0, 0], transformation_matrix[0, 1], transformation_matrix[0, 2]],
[transformation_matrix[1, 0], transformation_matrix[1, 1], transformation_matrix[1, 2]],
[0, 0, 1]
])

transformed_geometries = []

for polygon in geometries:
# Extract coordinates and transform them
if isinstance(polygon, MultiPolygon):
polygon = next(polygon.geoms) # Use the first geometry

# Transform the exterior of the polygon
exterior_coords = np.array(polygon.exterior.coords)

# Apply the affine transformation and scale
transformed_coords = numpy_affine_transform(exterior_coords, affine_matrix) / scale

# Append the result to the transformed_geometries list
transformed_geometries.append([transformed_coords.tolist()])

return transformed_geometries


def filter_and_save_fine_boundary(coarse_tile, fine_i, fine_j, fine_tile_x_min, fine_tile_x_max, fine_tile_y_min, fine_tile_y_max, path_output):
cell_ids = coarse_tile.index.values

tile_filter = (
(coarse_tile["center_x"] >= fine_tile_x_min) & (coarse_tile["center_x"] < fine_tile_x_max) &
(coarse_tile["center_y"] >= fine_tile_y_min) & (coarse_tile["center_y"] < fine_tile_y_max)
)
filtered_indices = np.where(tile_filter)[0]

keep_cells = cell_ids[filtered_indices]
fine_tile_cells = coarse_tile.loc[keep_cells, ["GEOMETRY"]]
fine_tile_cells = fine_tile_cells.assign(name=fine_tile_cells.index)

if not fine_tile_cells.empty:
filename = f"{path_output}/cell_tile_{fine_i}_{fine_j}.parquet"
fine_tile_cells.to_parquet(filename)

def process_fine_boundaries(coarse_tile, i, j, coarse_tile_x_min, coarse_tile_x_max, coarse_tile_y_min, coarse_tile_y_max, tile_size, path_output, x_min, y_min, n_fine_tiles_x, n_fine_tiles_y):
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = []
for fine_i in range(n_fine_tiles_x):
fine_tile_x_min = x_min + fine_i * tile_size
fine_tile_x_max = fine_tile_x_min + tile_size

if not (fine_tile_x_min >= coarse_tile_x_min and fine_tile_x_max <= coarse_tile_x_max):
continue

for fine_j in range(n_fine_tiles_y):
fine_tile_y_min = y_min + fine_j * tile_size
fine_tile_y_max = fine_tile_y_min + tile_size

if not (fine_tile_y_min >= coarse_tile_y_min and fine_tile_y_max <= coarse_tile_y_max):
continue

futures.append(executor.submit(
filter_and_save_fine_boundary, coarse_tile, fine_i, fine_j, fine_tile_x_min, fine_tile_x_max, fine_tile_y_min, fine_tile_y_max, path_output
))

for future in futures:
future.result()

tile_size_x = tile_size
tile_size_y = tile_size

transformation_matrix = pd.read_csv(path_transformation_matrix, header=None, sep=" ").values

# Load cell boundary data based on the technology
if technology == "MERSCOPE":
df_meta = pd.read_parquet(f"{path_output.replace('cell_segmentation','cell_metadata.parquet')}")
entity_to_cell_id_dict = pd.Series(df_meta.index.values, index=df_meta.EntityID).to_dict()
cells_orig = gpd.read_parquet(path_cell_boundaries)
cells_orig['cell_id'] = cells_orig['EntityID'].map(entity_to_cell_id_dict)
cells_orig = cells_orig[cells_orig["ZIndex"] == 1]

# Correct cell_id issues with meta_cell
meta_cell = pd.read_csv(path_meta_cell_micron)
meta_cell['cell_id'] = meta_cell['EntityID'].map(entity_to_cell_id_dict)
cells_orig.index = meta_cell[meta_cell["cell_id"].isin(cells_orig['cell_id'])].index

# Correct 'MultiPolygon' to 'Polygon'
cells_orig["geometry"] = cells_orig["Geometry"].apply(
lambda x: list(x.geoms)[0] if isinstance(x, MultiPolygon) else x
)

cells_orig.set_index('cell_id', inplace=True)

elif technology == "Xenium":
xenium_cells = pd.read_parquet(path_cell_boundaries)
grouped = xenium_cells.groupby("cell_id")[["vertex_x", "vertex_y"]].agg(lambda x: x.tolist())
grouped["geometry"] = grouped.apply(lambda row: Polygon(zip(row["vertex_x"], row["vertex_y"])), axis=1)
cells_orig = gpd.GeoDataFrame(grouped, geometry="geometry")[["geometry"]]

elif technology == "custom":
cells_orig = gpd.read_parquet(path_cell_boundaries)
# Ensure the output directory exists
if not os.path.exists(path_output):
os.makedirs(path_output)

# Transform geometries
cells_orig["GEOMETRY"] = batch_transform_geometries(cells_orig["geometry"], transformation_matrix, image_scale)
gdf_cells = get_cell_polygons(technology, path_cell_boundaries, transformation_matrix, path_output, image_scale, path_meta_cell_micron)

# Convert transformed geometries to polygons and calculate centroids
cells_orig["polygon"] = cells_orig["GEOMETRY"].apply(lambda x: Polygon(x[0]))
gdf_cells = gpd.GeoDataFrame(geometry=cells_orig["polygon"])
gdf_cells["center_x"] = gdf_cells.geometry.centroid.x
gdf_cells["center_y"] = gdf_cells.geometry.centroid.y
gdf_cells["GEOMETRY"] = cells_orig["GEOMETRY"]

# Ensure the output directory exists
if not os.path.exists(path_output):
os.makedirs(path_output)

# Calculate tile bounds and fine/coarse tiles
x_min, x_max = tile_bounds["x_min"], tile_bounds["x_max"]
@@ -203,4 +204,4 @@ def process_fine_boundaries(coarse_tile, i, j, coarse_tile_x_min, coarse_tile_x_
(gdf_cells["center_y"] >= coarse_tile_y_min) & (gdf_cells["center_y"] < coarse_tile_y_max)
]
if not coarse_tile.empty:
-        process_fine_boundaries(coarse_tile, i, j, coarse_tile_x_min, coarse_tile_x_max, coarse_tile_y_min, coarse_tile_y_max, tile_size, path_output, x_min, y_min, n_fine_tiles_x, n_fine_tiles_y)
+        process_fine_boundaries(coarse_tile, i, j, coarse_tile_x_min, coarse_tile_x_max, coarse_tile_y_min, coarse_tile_y_max, tile_size, path_output, x_min, y_min, n_fine_tiles_x, n_fine_tiles_y, max_workers)
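As a standalone reference for the homogeneous-coordinate math that numpy_affine_transform implements, a minimal sketch with a hypothetical affine matrix (an editor's illustration, not part of this diff):

import numpy as np

coords = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0]])  # a polygon exterior

# Hypothetical affine matrix: scale x by 2, then translate by (3, 4).
matrix = np.array([
    [2.0, 0.0, 3.0],
    [0.0, 1.0, 4.0],
    [0.0, 0.0, 1.0],
])

# Append a column of ones so translation becomes part of a single matrix
# multiply, exactly as numpy_affine_transform does above.
homogeneous = np.hstack([coords, np.ones((coords.shape[0], 1))])
transformed = (homogeneous @ matrix.T)[:, :2]
print(transformed)  # [[ 3.  4.] [23.  4.] [23.  9.]]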