MRscHiC

This repository has moved to 23AIBox.

The three-dimensional genome structure plays a key role in cellular function and gene regulation. Single-cell Hi-C technology captures genome structure at the level of individual cells, which makes it possible to study how genome structure varies among cell types. However, few methods are well suited to single-cell Hi-C clustering, because the data are highly sparse, noisy and heterogeneous. We propose MRscHiC, a multi-scale representation learning framework for single-cell Hi-C data representation and clustering. MRscHiC has two main parts: imputation at the bin level, and feature extraction at the chromosome and cell levels. In our evaluation, MRscHiC outperforms existing state-of-the-art approaches on both human and mouse datasets.

MRscHiC

The environment of MRscHiC

Linux OS
Python 3.8.5
PyTorch 1.4.0

Data

The Flyamer dataset is downloaded from GSE80006.
The Ramani dataset is downloaded from GSE84920.
The 4DN dataset is downloaded from https://noble.gs.washington.edu/proj/schic-topic-model.

Contact matrix

Single-cell Hi-C experimental data are generally stored as a two-dimensional contact matrix built from high-throughput sequencing reads. A Hi-C experiment records interactions between any two sites in the genome, usually as the number of read-pairs (interaction pairs) that link them. These counts are stored in the contact matrix, which is symmetric. For a chromosome contact matrix M, each element Mij is the number of read-pairs that support the interaction between chromosome fragments i and j. Table 1 shows part of the contact matrix of chromosome 1 of an oocyte cell. The value 62 in the first row and first column is the number of read-pairs that support interactions within the fragment chr1:3000000-4000000, and the value 12 in the first row and second column is the number of read-pairs that support the interaction between the fragments chr1:3000000-4000000 and chr1:4000000-5000000.

Table 1 Part of the contact matrix.

bin	3000000-4000000	4000000-5000000	5000000-6000000	6000000-7000000	7000000-8000000	8000000-9000000
3000000-4000000	62.00	12.00	2.00	1.00	0.00	2.00
4000000-5000000	12.00	132.00	8.00	10.00	0.00	6.00
5000000-6000000	2.00	8.00	86.00	16.00	21.00	16.00
6000000-7000000	1.00	10.00	16.00	172.00	24.00	10.00
7000000-8000000	0.00	0.00	21.00	24.00	104.00	8.00
8000000-9000000	2.00	6.00	16.00	10.00	8.00	58.00
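Because the contact matrix is symmetric, only the upper triangle needs to be stored. The following is a minimal sketch, assuming NumPy is available, that rebuilds the part of the matrix shown in Table 1 from its upper-triangle entries. The helper name `build_symmetric` is hypothetical and is not part of MRscHiC.

```python
import numpy as np

# Upper-triangle entries (i, j, count) copied from Table 1.
# Bin 0 is chr1:3000000-4000000, bin 1 is chr1:4000000-5000000, and so on.
upper = [
    (0, 0, 62), (0, 1, 12), (0, 2, 2), (0, 3, 1), (0, 4, 0), (0, 5, 2),
    (1, 1, 132), (1, 2, 8), (1, 3, 10), (1, 4, 0), (1, 5, 6),
    (2, 2, 86), (2, 3, 16), (2, 4, 21), (2, 5, 16),
    (3, 3, 172), (3, 4, 24), (3, 5, 10),
    (4, 4, 104), (4, 5, 8),
    (5, 5, 58),
]


def build_symmetric(entries, n_bins):
    """Fill an n_bins x n_bins matrix so that M[i, j] == M[j, i]."""
    m = np.zeros((n_bins, n_bins))
    for i, j, count in entries:
        m[i, j] = count
        m[j, i] = count
    return m


M = build_symmetric(upper, 6)
print(M[0, 0], M[0, 1])  # the two values discussed in the text
```

Here `M[0, 1]` and `M[1, 0]` both hold the count for the pair chr1:3000000-4000000 and chr1:4000000-5000000.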

Contact matrix preprocessing

The raw data downloaded from GSE80006 are shown in Table 2. For example, the second data row indicates that one read-pair supports the interaction between the two chromosome fragments chr1:3000000-3200000 and chr1:3200000-3400000.

Table 2 Raw data of GSE80006.

chrom1	chrom2	start1	end1	start2	end2	count
1	1	3000000	3200000	3000000	3200000	4
1	1	3000000	3200000	3200000	3400000	1
1	1	3000000	3200000	4000000	4200000	1
1	1	3000000	3200000	4400000	4600000	1
1	1	3000000	3200000	4600000	4800000	1
1	1	3000000	3200000	5200000	5400000	1

chrom1: chromosome name for fragment 1.
chrom2: chromosome name for fragment 2.
start1: start location for fragment 1.
end1: end location for fragment 1.
start2: start location for fragment 2.
end2: end location for fragment 2.
count: count number or normalized weight for the interaction.
(Note that only intra-chromosomal reads are used in MRscHiC.)
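The raw records described above can be read and filtered as follows. This is a minimal sketch that assumes the file is tab-separated and starts with the header row shown in Table 2; the actual GSE80006 files may differ. The function name `read_intra` is hypothetical, and the third sample row is an invented inter-chromosomal record added only to show the filter.

```python
import csv
import io

# First two data rows are from Table 2; the last row (chrom2 = 2) is a
# made-up inter-chromosomal record used only to demonstrate filtering.
RAW = """chrom1\tchrom2\tstart1\tend1\tstart2\tend2\tcount
1\t1\t3000000\t3200000\t3000000\t3200000\t4
1\t1\t3000000\t3200000\t3200000\t3400000\t1
1\t2\t3000000\t3200000\t4000000\t4200000\t1
"""


def read_intra(handle):
    """Return (chrom, start1, start2, count) for intra-chromosomal rows only."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["chrom1"] != row["chrom2"]:
            continue  # skip inter-chromosomal contacts
        records.append(
            (row["chrom1"], int(row["start1"]), int(row["start2"]), float(row["count"]))
        )
    return records


records = read_intra(io.StringIO(RAW))
print(records)
```

The count is parsed as a float because the column may hold either a raw count or a normalized weight.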

Input files

Contact matrix file

We convert the raw data in Table 2 into the format shown in Table 3, which is the input to the script. The input file has three tab-separated columns. The first column is the start position of the first fragment divided by the resolution, the second column is the start position of the second fragment divided by the resolution, and the third column is the number of read-pairs that support the interaction between the two fragments. The end position of a fragment is its start position plus the resolution. For example, the second data row of Table 3 indicates that, at 200-kb resolution, 78 read-pairs support the interaction between the two chromosome fragments chr1:3000000-3200000 and chr1:3200000-3400000.

Table 3 Part of contact matrix file.

chrom1_start	chrom2_start	count
15	15	237
15	16	78
15	17	22
15	18	24
15	19	11
15	20	14
15	21	8
15	22	10
15	23	9

chrom1_start: start location for fragment 1, divided by the resolution.
chrom2_start: start location for fragment 2, divided by the resolution.
count: count number or normalized weight for the interaction.
(Note that chrom1_start and chrom2_start are calculated in the following way: chrom1_start = (start location for fragment 1) / resolution, and likewise for chrom2_start.)
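The conversion between genomic start positions and the values in the contact matrix file can be sketched as below. This is an illustration under the 200-kb resolution used in the example above, not the repository's own preprocessing code; `to_bin`, `bin_to_range` and `format_row` are hypothetical names.

```python
RESOLUTION = 200_000  # 200-kb bins, as in the Table 3 example


def to_bin(start, resolution=RESOLUTION):
    """Start position divided by the resolution (integer division)."""
    return start // resolution


def bin_to_range(index, resolution=RESOLUTION):
    """Recover the fragment: the end is the start plus the resolution."""
    start = index * resolution
    return start, start + resolution


def format_row(start1, start2, count, resolution=RESOLUTION):
    """One tab-separated line of the contact matrix file."""
    return f"{to_bin(start1, resolution)}\t{to_bin(start2, resolution)}\t{count}"


line = format_row(3_000_000, 3_200_000, 78)
print(line)
print(bin_to_range(16))
```

With these definitions the fragment pair chr1:3000000-3200000 and chr1:3200000-3400000 maps to the indices 15 and 16, and index 16 maps back to the range 3200000-3400000.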

Cell list file

This file lists the locations of the cells that are passed to the script, one cell per line:

1mbres/Oocyte/NSN/NSN_1  
1mbres/Oocyte/NSN/NSN_4  
1mbres/Oocyte/NSN/NSN_5  
1mbres/Oocyte/NSN/NSN_6
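A cell list in this form can be read with a few lines of Python. The sketch below is not taken from MRscHiC_run.py. It assumes one path per line and, purely as an illustration, treats the parent directory name as a cell-type label; whether the scripts derive labels this way is an assumption. `read_cell_list` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# The four entries from the example cell list above.
CELL_LIST = """1mbres/Oocyte/NSN/NSN_1
1mbres/Oocyte/NSN/NSN_4
1mbres/Oocyte/NSN/NSN_5
1mbres/Oocyte/NSN/NSN_6
"""


def read_cell_list(text):
    """Return one path per non-empty line, ignoring surrounding whitespace."""
    return [PurePosixPath(line.strip()) for line in text.splitlines() if line.strip()]


cells = read_cell_list(CELL_LIST)
# Assumption for illustration: the parent directory names the cell type.
labels = [cell.parent.name for cell in cells]
print([cell.name for cell in cells], labels)
```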

Usage

1. Run MRscHiC_run.py to convert the input files (contact matrix file and cell list file) into a feature matrix.

$ python MRscHiC_run.py

2. Run ftrain.py (Flyamer dataset) or rtrain.py (Ramani and 4DN datasets) to convert the feature matrix into the cell embedding.

$ python ftrain.py
or
$ python rtrain.py

3. Use the cell embedding for downstream analysis.
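As one example of downstream analysis, the cell embedding can be clustered. The sketch below assumes the embedding is available as a NumPy array with one row per cell; how ftrain.py and rtrain.py actually store their output is not specified here. The embedding used is synthetic, and the small `kmeans` function is a plain Lloyd's-algorithm illustration rather than the clustering method used by the authors. In practice a library implementation would normally be preferred.

```python
import numpy as np


def kmeans(x, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic initialisation."""
    # Initial centers: k rows spread evenly through the data.
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Distance from every cell to every center, then nearest center.
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic stand-in for a cell embedding: 10 cells, 8 dimensions, 2 groups.
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 8)),
    rng.normal(3.0, 0.1, size=(5, 8)),
])
labels = kmeans(embedding, k=2)
print(labels)
```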

Acknowledgments

We thank Zhou et al. for releasing the source code of scHiCluster as open source.
