
Dbit #95 (Merged)

99 commits merged into scverse:main from lillux/dbit on Feb 9, 2024. Changes shown are from 91 commits.

Commits
af2a72f
Added draft of DBiT reader
lillux Dec 1, 2023
c50fc92
Hardcoded variable for grid scaling
lillux Dec 1, 2023
faeb59d
Merge branch 'main' of https://github.com/lillux/spatialdata-io
lillux Dec 1, 2023
1666707
Added comments
lillux Dec 1, 2023
61b71a0
Ignore spyder data
lillux Dec 1, 2023
7dec791
Remove spyder data
lillux Dec 1, 2023
22c25e4
Removed metadata
lillux Dec 1, 2023
8c35aef
Add symbol export spec in DBiT
lillux Dec 4, 2023
d944741
Fixed typos
lillux Dec 4, 2023
b8fd2a6
Added _constants for DBiT, and related check
lillux Dec 4, 2023
d12c518
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2023
6ea5614
Open .h5ad with anndata instead of scanpy
lillux Dec 5, 2023
b41451d
Merge branch 'dbit' of https://github.com/lillux/spatialdata-io into …
lillux Dec 5, 2023
ed34628
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2023
fb55943
Fixed import
lillux Dec 5, 2023
697064e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2023
f2960aa
fixes from pre-commit.ci
lillux Dec 5, 2023
1c493e3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2023
63a916f
fixed exceptions and type hints
lillux Dec 5, 2023
3234bba
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2023
9f862da
Fix typing
lillux Dec 5, 2023
d7dc6e2
fixed variable names
lillux Dec 5, 2023
24bb806
Merge branch 'scverse:main' into dbit
lillux Jan 9, 2024
48fca73
Merge branch 'scverse:main' into dbit
lillux Jan 10, 2024
fbe6c82
code revision started
lillux Jan 17, 2024
3821757
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 17, 2024
8d4ae3e
Code revision part 2
lillux Jan 17, 2024
92df11d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 17, 2024
6c9c748
code revision fix docstring
lillux Jan 17, 2024
2c62b45
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 17, 2024
fbd48b3
Code revision small fixes
lillux Jan 18, 2024
aef5275
Code review small fixes to docstrings
lillux Jan 18, 2024
96efc11
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
d0e3023
Code review inject_docs
lillux Jan 18, 2024
6adfcf4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
0544e46
Code review remove inject_doc
lillux Jan 18, 2024
d5ad8f8
Code review docs api.md
lillux Jan 18, 2024
5d50168
Code review docstring fix
lillux Jan 18, 2024
568566a
Code review Typo
lillux Jan 18, 2024
3bb717f
Typo
lillux Jan 18, 2024
e3a1a66
Docs fix
lillux Jan 18, 2024
a4abc3a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
f5c34db
Typo docstring
lillux Jan 18, 2024
1f7d88b
Typo
lillux Jan 18, 2024
06177c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
f76c89b
Searching error...
lillux Jan 18, 2024
912948b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
950b2b1
Typo fix
lillux Jan 18, 2024
8fcd5b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 18, 2024
158b0bb
Fixing docstring
lillux Jan 18, 2024
b1d35ad
Docstring done. Fixing typo
lillux Jan 18, 2024
d40a00b
Fixed dbit function to remove duplcated code
lillux Jan 23, 2024
2f4e59a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2024
090d4f7
Docstring fix
lillux Jan 23, 2024
cc2add1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2024
c97196f
Docstring fix
lillux Jan 23, 2024
e326942
Fix docstring typing
lillux Jan 23, 2024
82dea42
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2024
583eb81
Typing stuff
lillux Jan 23, 2024
0795475
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2024
1426f39
Typing ...
lillux Jan 23, 2024
69191f0
typing
lillux Jan 23, 2024
0ed986d
Fix naming
lillux Jan 23, 2024
187d1dc
Typing
lillux Jan 23, 2024
ce53ad7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2024
1254a89
Typing
lillux Jan 23, 2024
82119de
Working on type hints in docstring
lillux Jan 24, 2024
8e8e400
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2024
efd4805
Fixed typo
lillux Jan 24, 2024
f214968
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2024
88b7197
Tyiping
lillux Jan 24, 2024
4372901
Fix typo
lillux Jan 24, 2024
76b228b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2024
5e3c33e
Mypy fix
lillux Jan 24, 2024
632cb22
Mypy
lillux Jan 24, 2024
c5785d2
Mypy complains
lillux Jan 24, 2024
1813285
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2024
7681f0e
Test for mypy
lillux Jan 24, 2024
1edaaff
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2024
6cddf83
Fix mypy
lillux Jan 24, 2024
aa4cd2f
Typo docstring and fix
lillux Jan 25, 2024
d68969e
Fix mypy
lillux Jan 25, 2024
c2227ab
Mypy
lillux Jan 25, 2024
74ca806
Fix mypy
lillux Jan 25, 2024
7578970
fix mypy
lillux Jan 25, 2024
ee959b8
Docstring fix
lillux Jan 25, 2024
c04ba2a
mypy errors fix
lillux Jan 29, 2024
1b95a97
mypy errors
lillux Jan 29, 2024
fd6d9cc
ignore mypy
lillux Jan 29, 2024
aa58adc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 29, 2024
a965e32
Merge branch 'scverse:main' into dbit
lillux Jan 29, 2024
c49c5d5
Now the dimension of the DBiT grid depends on the number of barcodes.
lillux Feb 6, 2024
ffd8a9d
Docstrings fix
lillux Feb 6, 2024
b1173c2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 6, 2024
2ec3f7a
Added DBiT-seq reader to changelog
lillux Feb 6, 2024
955a52c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 6, 2024
90ba226
Merge branch 'scverse:main' into dbit
lillux Feb 7, 2024
b9ddfb0
Merge branch 'main' into dbit
lillux Feb 8, 2024
cb94f9b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -23,6 +23,7 @@ __pycache__/
# IDEs
/.idea/
/.vscode/
/.spyproject/

# docs
/docs/generated/
1 change: 1 addition & 0 deletions docs/api.md
@@ -20,4 +20,5 @@ I/O for the `spatialdata` project.
steinbock
merscope
mcmicro
dbit
```
12 changes: 2 additions & 10 deletions src/spatialdata_io/__init__.py
@@ -3,21 +3,13 @@
from spatialdata_io.readers.codex import codex
from spatialdata_io.readers.cosmx import cosmx
from spatialdata_io.readers.curio import curio
from spatialdata_io.readers.dbit import dbit
from spatialdata_io.readers.mcmicro import mcmicro
from spatialdata_io.readers.merscope import merscope
from spatialdata_io.readers.steinbock import steinbock
from spatialdata_io.readers.visium import visium
from spatialdata_io.readers.xenium import xenium

__all__ = [
"curio",
"visium",
"xenium",
"codex",
"cosmx",
"mcmicro",
"steinbock",
"merscope",
]
__all__ = ["curio", "visium", "xenium", "codex", "cosmx", "mcmicro", "steinbock", "merscope", "dbit"]

__version__ = version("spatialdata-io")
12 changes: 12 additions & 0 deletions src/spatialdata_io/_constants/_constants.py
@@ -197,3 +197,15 @@ class MerscopeKeys(ModeEnum):
GLOBAL_Z = "global_z"
Z_INDEX = "ZIndex"
REGION_KEY = "cells_region"


@unique
class DbitKeys(ModeEnum):
"""Keys for DBiT formatted dataset."""

# files and directories
COUNTS_FILE = ".h5ad"
Review comment (Member):
just out of curiosity, but which pipeline/preprocessing output an h5ad directly?

Reply (dawe):
One thing is that DBiT-seq can be used for both scATAC and scRNA data. We currently use a simple script to process the fastq files so that they can be fed to any scRNA/scATAC processing pipeline. From that point on, it is just SOTA processing producing h5ad files. The obs_name in the h5ad will then be used to map texels onto the surface.

Reply (Member):
great, thank you for the clarification @dawe !
# barcodes_file
BARCODE_POSITION = "barcode_list"
# image
IMAGE_LOWRES_FILE = "tissue_lowres_image.png"
328 changes: 328 additions & 0 deletions src/spatialdata_io/readers/dbit.py
@@ -0,0 +1,328 @@
from __future__ import annotations

import os
import re
from pathlib import Path
from re import Pattern
from typing import Optional, Union

import anndata as ad
import numpy as np
import pandas as pd
import shapely
import spatialdata as sd
from dask_image.imread import imread
from numpy.typing import NDArray
from spatialdata import SpatialData
from spatialdata._logging import logger
from xarray import DataArray

from spatialdata_io._constants._constants import DbitKeys
from spatialdata_io._docs import inject_docs

__all__ = ["dbit"]


def _check_path(
path: Path,
pattern: Pattern[str],
key: DbitKeys,
path_specific: Optional[str | Path] = None,
optional_arg: bool = False,
) -> tuple[Union[Path, None], bool]:
"""
Check that the path is valid and matches a regex pattern.

Parameters
----------
path :
The main directory in which to search for the file.
path_specific :
Explicit path to the file, used if no match is found in the main directory.
pattern :
Compiled regex pattern that the file name must match.
key :
The key to search for, used in log and error messages.
optional_arg :
Whether the file is optional. If False (the default), a FileNotFoundError is raised when the file is not found;
if True, only a warning is logged.

Raises
------
FileNotFoundError
The error is raised if no match is found in the given paths and optional_arg=False.

Returns
-------
tuple(pathlib.Path, bool)
The resolved file path (or None), and a flag indicating whether one of the supplied path arguments points to a file matching the supplied key.
"""
flag = False
file_path = None
try:
checked_file = [i for i in os.listdir(path) if pattern.match(i)][0] # this is the filename
file_path = Path.joinpath(path, checked_file)
flag = True
except IndexError:
# handle case in which the searched file is not in the same directory as path
if path_specific is not None:
if os.path.isfile(path_specific):
file_path = Path(path_specific)
flag = True
else:
if optional_arg:
logger.warning(f"{path_specific} is not a valid path for {key}. No {key} will be used.")
else:
raise FileNotFoundError(f"{path_specific} is not a valid path for a {key} file.")
else:
if optional_arg:
logger.warning(f"No file containing {key} found in folder {path}. No {key} will be used.")
else:
raise FileNotFoundError(f"No file containing {key} found in folder {path}.")
return file_path, flag


def _barcode_check(barcode_file: Path) -> pd.DataFrame:
"""
Check that the barcode file is formatted as expected.

The expected format is a tab-separated, headerless file with 2 columns:
column 0 : str, the letter "A" or "B" followed by a number, like 'A6', 'A22', 'B7'
column 1 : str of 8 characters representing nucleotides, like 'AACTGCTA'

Parameters
----------
barcode_file :
The path to the barcode file.

Raises
------
ValueError :
Raised if a field of the file does not comply with the expected pattern,
with an error message describing the problem.

Returns
-------
pd.DataFrame :
A pandas.DataFrame with 2 columns, named 'A' and 'B', with the barcode as row index.
Columns 'A' and 'B' each contain an int, the spatial coordinates of the barcode, sorted in ascending order.

"""
df = pd.read_csv(barcode_file, header=None, sep="\t")
# check if there are 2 columns
if len(df.columns) != 2:
raise ValueError(
f"The barcode file you passed at {barcode_file} does not have 2 columns.\nYour file has to be formatted with 2 columns, the first for positions, the second for the barcode, as follows:\n\nA1 AACCTTGG\nA2 GGCATGTA\nA3 GCATATGC\n..."
)
# check if the data in the columns are correct.
# Pattern 1: the letter A or B at the start, followed by 1 or 2 digits at the end. Case sensitive.
patt_position = re.compile(r"^[AB]\d{1,2}$")
# Pattern 2: a string of exactly 8 nucleotide characters. Case insensitive.
patt_barcode = re.compile(r"^[ATCGatcg]{8}$")
# dict, used to collect data after matching
bc_positions: dict[str, dict[str, str]] = {}
# line[0]: row index, line[1] row values. line[1][0] : barcode coordinates, line[1][1] : barcode
for line in df.iterrows():
if not bool(patt_position.fullmatch(line[1][0])):
raise ValueError(
f"Row {line[0]} has an incorrect positional id: {line[1][0]}.\nThe correct pattern for the position is the letter A or B (case sensitive) followed by one or two digits, e.g. 'A6' or 'B7'."
)
if not bool(patt_barcode.fullmatch(line[1][1])):
raise ValueError(
f"Row {line[0]} has an incorrect barcode: {line[1][1]}.\nThe correct pattern for a barcode is a string of 8 nucleotides, each character one of A, T, C, G (case insensitive)."
)
barcode = line[1][1]
letter = line[1][0][0]
try:
bc_positions[barcode][letter] = line[1][0][1:]
except KeyError:
bc_positions[barcode] = {}
bc_positions[barcode][letter] = line[1][0][1:]
# return pandas.DataFrame, in (pseudo)long form
return pd.DataFrame(bc_positions).transpose()


def _xy2edges(xy: list[int], scale: float = 1.0, border: bool = True, border_scale: float = 1) -> NDArray[np.double]:
"""
Construct vertex coordinate of a square from the barcode coordinates.

The constructed square has a scalable border.

Parameters
----------
xy :
coordinate of the spot identified by its barcode.
scale :
Resize the square.
border :
If True, the square is shrunk toward its center, leaving an empty border.
border_scale :
The factor by which the border is scaled.
The default of 1 corresponds to a border width of 0.125 * the length of the square's edge.

Returns
-------
ndarray
The vertex coordinates of the (scaled, optionally inset) square for the given spot.

"""
# unpack coordinates
x, y = xy
# create border grid
border_grid = np.array([[0.125, 0.125], [0.125, -0.125], [-0.125, -0.125], [-0.125, 0.125], [0.125, 0.125]])
# create square grid
square = np.array([[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]])
# position the square in the right place following the barcode coordinates
# barcode coordinates start at 1, so we subtract 1 to start at 0
square[:, 0] += x - 1
square[:, 1] += y - 1
# check if user wants border, and scale accordingly
if border:
return (np.array(square) * scale) + (border_grid * border_scale)

return np.array(square) * scale


@inject_docs(vx=DbitKeys)
def dbit(
path: str | Path,
anndata_path: Optional[str] = None,
barcode_position: Optional[str] = None,
image_path: Optional[str] = None,
dataset_id: Optional[str] = None,
border: bool = True,
border_scale: float = 1,
) -> SpatialData:
"""
Read DBiT experiment data (Deterministic Barcoding in Tissue).

This function reads the following files:

- ``{vx.COUNTS_FILE!r}`` : Counts matrix.
- ``{vx.BARCODE_POSITION!r}`` : Barcode file.
- ``{vx.IMAGE_LOWRES_FILE!r}`` : Histological image (optional).

.. seealso::

- `High-Spatial-Resolution Multi-Omics Sequencing via Deterministic Barcoding in Tissue <https://www.cell.com/cell/fulltext/S0092-8674(20)31390-8/>`_.

Parameters
----------
path :
Path to the directory containing the data.
anndata_path :
path to the counts and metadata file.
barcode_position :
path to the barcode coordinates file.
dataset_id :
Dataset identifier used to name the constructed `SpatialData` elements.
If not given, the counts file name (stripped of its extension) is used as dataset_id.
image_path :
Path to the low resolution image.
The image is expected to be already correctly cropped and transformed.
border :
Passed internally to _xy2edges().
If True, each square is shrunk toward its center, leaving an empty border.
border_scale :
Passed internally to _xy2edges().
The factor by which the border is scaled.
The default of 1 corresponds to a border width of (0.125 * length of the square's edge).

Returns
-------
:class:`spatialdata.SpatialData`.
"""
path = Path(path)
# if path is invalid, raise error
if not os.path.isdir(path):
raise FileNotFoundError(
f"The path you have passed: {path} has not been found. A correct path to the data directory is needed."
)
# compile regex patterns (with literals escaped) to find file names in path, according to _constants.DbitKeys
patt_h5ad = re.compile(f".*{re.escape(DbitKeys.COUNTS_FILE)}")
patt_barcode = re.compile(f".*{re.escape(DbitKeys.BARCODE_POSITION)}.*")
patt_lowres = re.compile(f".*{re.escape(DbitKeys.IMAGE_LOWRES_FILE)}")

# search for files paths. Gives priority to files matching the pattern found in path.
anndata_path_checked = _check_path(
path=path, path_specific=anndata_path, pattern=patt_h5ad, key=DbitKeys.COUNTS_FILE
)[0]
barcode_position_checked = _check_path(
path=path, path_specific=barcode_position, pattern=patt_barcode, key=DbitKeys.BARCODE_POSITION
)[0]
image_path_checked, hasimage = _check_path(
path=path,
path_specific=image_path,
pattern=patt_lowres,
key=DbitKeys.IMAGE_LOWRES_FILE,
optional_arg=True,
)

# read annData.
adata = ad.read_h5ad(anndata_path_checked)
# Read barcode.
bc_df = _barcode_check(barcode_file=barcode_position_checked) # type: ignore

# add barcode positions to annData.
# A and B naming follow original publication and protocol
adata.obs["array_A"] = [int(bc_df.loc[x[8:16], "A"]) for x in adata.obs_names]
adata.obs["array_B"] = [int(bc_df.loc[x[:8], "B"]) for x in adata.obs_names]
# sort annData by barcode position. Barcode A first, then Barcode B
adata.obs.sort_values(by=["array_A", "array_B"], inplace=True)

# populate annData
if dataset_id is None: # if no dataset_id, use file name as id.
logger.warning("No dataset_id received as input.")
dataset_id = ".".join(
anndata_path_checked.name.split(".")[:-1] # type: ignore
) # this is the filename stripped from the file extension
logger.warning(f"{dataset_id} is used as dataset_id.")

adata.obs["region"] = dataset_id
adata.obs["sample"] = dataset_id
adata.obs["region"] = adata.obs["region"].astype("category")
# assignment of pixel id follow the row ordering of adata.obs
adata.obs["pixel_id"] = np.arange(len(adata.obs_names))
adata.uns["spatialdata_attrs"] = {"region_key": "region", "region": dataset_id, "instance_key": "pixel_id"}
# the array that we will create below has 2 columns without header,
# so it is up to the user to know that:
# the first column is 'array_A',
# the second column is 'array_B'.
adata.obsm["spatial"] = adata.obs[["array_A", "array_B"]].values
# parse data from annData using SpatialData parser
table_data = sd.models.TableModel.parse(adata)

# read and convert image for SpatialData
# check if image exist, and has been passed as argument
if hasimage:
image = imread(image_path_checked).squeeze().transpose(2, 0, 1) # channel, y, x
image = DataArray(image, dims=("c", "y", "x"), name=dataset_id)
image_sd = sd.models.Image2DModel.parse(image)
# calculate the scale factor of the grid with respect to the histological image.
# this is needed because we want to maintain the original histological image
# dimensions, and scale the grid accordingly, in such a way that the grid
# overlaps with the histological image.
# We assume that the grid is a square, and indeed it is if we follow the DBiT protocol.
grid_length = 50  # hardcoded number of barcode lines per side
# we are passing an image, that is supposed to be perfectly
# cropped to match the tissue part from which the data are collected.
# We assume the user has already cropped and transformed everything correctly.
if hasimage:
scale_factor = max(image_sd.shape) / grid_length
else:
scale_factor = 1
# construct the polygon grid from the grid coordinates
xy = adata.obs[["array_A", "array_B"]].values.astype(int)
f = np.array(
[_xy2edges(x, scale=scale_factor, border=border, border_scale=scale_factor * border_scale) for x in xy]
)
ra = shapely.to_ragged_array([shapely.Polygon(x) for x in f])
grid = sd.models.ShapesModel.parse(ra[1], geometry=ra[0], offsets=ra[2], index=adata.obs["pixel_id"].copy())
# create SpatialData object!
sdata = sd.SpatialData(table=table_data, shapes={dataset_id: grid})
if hasimage:
imgname = dataset_id + "_image"
sdata.images[imgname] = image_sd
return sdata
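The barcode-file format validated by `_barcode_check` above can be sketched with a minimal, self-contained example; the sample rows below are hypothetical:

```python
import re

# Patterns equivalent to those used in _barcode_check:
# position: the letter A or B followed by 1-2 digits; barcode: 8 nucleotides.
patt_position = re.compile(r"^[AB]\d{1,2}$")
patt_barcode = re.compile(r"^[ATCGatcg]{8}$")

# Hypothetical rows of a tab-separated, headerless barcode file
rows = [("A1", "AACCTTGG"), ("A2", "GGCATGTA"), ("B12", "gcatatgc")]

coords = {}
for pos, bc in rows:
    assert patt_position.fullmatch(pos), f"bad position id: {pos}"
    assert patt_barcode.fullmatch(bc), f"bad barcode: {bc}"
    # split e.g. 'B12' into its axis letter 'B' and integer coordinate 12
    coords[bc] = (pos[0], int(pos[1:]))

print(coords["gcatatgc"])  # ('B', 12)
```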
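The square-construction logic of `_xy2edges` can be exercised in isolation; the sketch below re-implements the same geometry for illustration:

```python
import numpy as np

def xy2edges(xy, scale=1.0, border=True, border_scale=1.0):
    """Sketch mirroring _xy2edges: vertices of a scaled, optionally inset unit square."""
    x, y = xy
    # inward offsets applied to each vertex of the closed ring
    border_grid = np.array([[0.125, 0.125], [0.125, -0.125],
                            [-0.125, -0.125], [-0.125, 0.125], [0.125, 0.125]])
    # unit square as a closed ring (first vertex repeated at the end)
    square = np.array([[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]], dtype=float)
    square[:, 0] += x - 1  # barcode coordinates are 1-based
    square[:, 1] += y - 1
    if border:
        return square * scale + border_grid * border_scale
    return square * scale

verts = xy2edges([2, 3], scale=10.0, border=True, border_scale=10.0)
print(verts[0])  # lower-left vertex of spot (2, 3), inset by the border: [11.25 21.25]
```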
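The way the reader maps each AnnData observation name to grid coordinates (the first 8 nucleotides index barcode B, the next 8 index barcode A) can be sketched as follows; the barcodes and coordinates here are hypothetical:

```python
# Hypothetical lookup tables: barcode -> coordinate on axis A or B,
# standing in for the DataFrame returned by _barcode_check
bc_a = {"AACCTTGG": 7}   # A-axis barcodes
bc_b = {"GCATATGC": 12}  # B-axis barcodes

obs_name = "GCATATGCAACCTTGG"  # 16 nt: barcode B first, then barcode A

array_b = bc_b[obs_name[:8]]    # first 8 nt -> B coordinate
array_a = bc_a[obs_name[8:16]]  # next 8 nt -> A coordinate
print(array_a, array_b)  # 7 12
```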