Releases: zktuong/dandelion
v0.3.2
What's Changed
Mainly to fix compatibility with dependencies.
- minor wording/renaming tweaks in tutorial by @ktpolanski in #252
- ensure additional column names are present in strict mode by @zktuong in #253
- fix weekly tests by @zktuong in #255
- fix singularity preprocessing for org by @zktuong in #256
- Easy docs2 by @zktuong in #260
- Update _network.py by @zktuong in #258
- fix requirements by @zktuong in #262
- Create dependabot.yml by @zktuong in #263
- pip prod(deps): update pandas requirement from <1.5.0,>=1.0.3 to >=1.0.3,<2.1.0 by @dependabot in #264
- getting rid of CI warnings by @zktuong in #265
- update actions by @zktuong in #266
- barplot bug by @zktuong in #267
- fix query behaviour for merge by @zktuong in #268
- Update colab doc by @zktuong in #271
New Contributors
- @dependabot made their first contribution in #264
Full Changelog: v0.3.1...v0.3.2
v0.3.1
What's Changed
Just to update PyPI, plus some bug fixes to accompany the revision.
This doesn't affect the container image (but a tag should be added on Sylabs to also call it 0.3.1, just to be consistent).
- Enforce pickle priorty to be 4 by @zktuong in #216
- minor doc aesthetic updates by @zktuong in #218
- add citation to preprint by @zktuong in #221
- Does this help? by @zktuong in #222
- add missing function to api by @zktuong in #223
- update-docstrings by @zktuong in #224
- Minor behavior update by @zktuong in #228
- fix re-indexing issue by @zktuong in #229
- parse main calls by @zktuong in #231
- add mouse-preprocess by @zktuong in #237
- vdj mapping causing an issue? by @zktuong in #241
- revert awk change by @zktuong in #242
- fix macos tests by @zktuong in #243
- add logic for jmultimap checking by @zktuong in #244
- Revert "add logic for jmultimap checking" by @zktuong in #245
- fix empty columns by @zktuong in #246
- Update _core.py by @zktuong in #247
- increased transparency tutorial by @ktpolanski in #248
- update email by @zktuong in #249
- Return gamma delta notebook by @zktuong in #250
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
This release adds a number of new features and minor restructuring to accompany Dandelion's manuscript (uploading soon). Kudos to @suochenqu and @ktpolanski!
- data strategy to handle non-productive contigs, partial contigs and 'J multi-mappers'
- new V(D)J pseudotime trajectory inference!
- revamped tutorials and documents
Detailed PRs
- multimappers by @zktuong in #165
- Update environment.yml by @zktuong in #168
- fix-typo by @zktuong in #172
- add J multimap to BCR workflow by @zktuong in #178
- fix pandas dependency by @zktuong in #181
- fix unreference variable 182 by @zktuong in #183
- add trajectory utils by @suochenqu in #185
- select left most J call in multimappers by @zktuong in #186
- update container definitions by @zktuong in #187
- fix the column names by @zktuong in #188
- Update _trajectory.py by @zktuong in #189
- Pseudobulking improvements by @ktpolanski in #193
- Further tweaks by @ktpolanski in #194
- Update api.rst by @zktuong in #197
- save calculate threshold plot by @zktuong in #200
- Adjust toggling of productive/non-productive filtering for setup pseudobulk by @zktuong in #201
- compute_pseudobulk_gex by @zktuong in #202
- change update_metadata to always reinitialise by @zktuong in #203
- two quick pseudobulking fixes by @ktpolanski in #204
- Customise setup pseudobulk by @zktuong in #205
- refactor pseudobulking, update gex by @ktpolanski in #206
- add tests by @zktuong in #208
- singularity changeo pipeline by @ktpolanski in #207
- quickstart tutorial by @ktpolanski in #209
- place nxviz as an external submodule by @zktuong in #210
- update notebooks by @zktuong in #211
- Update external by @zktuong in #212
- fix api documentation styling by @zktuong in #214
New Contributors
- @suochenqu made their first contribution in #185
Full Changelog: v0.2.4...v0.3.0
v0.2.4
What's Changed
- slicing and check contigs by @zktuong in #159
- add new functions and rework github actions by @zktuong in #161
New features
slicing functionality
- The `Dandelion` object can now be sliced like an `AnnData` or a pandas `DataFrame`!

```
>>> vdj[vdj.data['productive'] == 'T']
Dandelion class object with n_obs = 38 and n_contigs = 94
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'v_call_gdT_VDJ', 'd_call_gdT_VDJ', 'j_call_gdT_VDJ', 'v_call_gdT_VJ', 'j_call_gdT_VJ', 'productive_gdT_VDJ', 'productive_gdT_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'duplicate_count_gdT_VDJ', 'duplicate_count_gdT_VJ', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

>>> vdj[vdj.metadata['productive_VDJ'] == 'T']
Dandelion class object with n_obs = 17 and n_contigs = 36
    (data and metadata slots as above)

>>> vdj[vdj.metadata_names.isin(['cell1', 'cell2', 'cell3', 'cell4', 'cell5'])]
Dandelion class object with n_obs = 5 and n_contigs = 20
    (data and metadata slots as above)

>>> vdj[vdj.data_names.isin(['contig1', 'contig2', 'contig3', 'contig4', 'contig5'])]
Dandelion class object with n_obs = 2 and n_contigs = 5
    (data and metadata slots as above)
```
- Not sure whether implementing it like `adata[:, adata.var.something]` makes sense, as it's not really row information in the data slot.
- Also, the base slot in `Dandelion` is `.data`, so it doesn't make sense for `.metadata` to be the 'row'. Maybe scverse/scirpy#327 will come up with a better strategy that can be adopted later on.
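For intuition, the boolean-mask slicing above can be mimicked with plain pandas. This is a hypothetical sketch of filtering the contig-level table and propagating the result to cell-level rows, not dandelion's actual `__getitem__` implementation:

```python
import pandas as pd

# Hypothetical stand-ins for the .data (contig-level) and .metadata (cell-level) slots.
data = pd.DataFrame(
    {
        "sequence_id": ["c1", "c2", "c3", "c4"],
        "cell_id": ["cellA", "cellA", "cellB", "cellC"],
        "productive": ["T", "F", "T", "F"],
    }
).set_index("sequence_id")
metadata = pd.DataFrame(index=["cellA", "cellB", "cellC"])

# Slice contigs with a boolean mask, then keep only cells that still have contigs,
# mimicking vdj[vdj.data['productive'] == 'T'].
sliced_data = data[data["productive"] == "T"]
sliced_metadata = metadata[metadata.index.isin(sliced_data["cell_id"])]

print(sliced_data.index.tolist())      # remaining contigs
print(sliced_metadata.index.tolist())  # remaining cells
```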
`ddl.pp.check_contigs`
- Created a new function, `ddl.pp.check_contigs`, as a way to just check whether contigs are ambiguous, rather than outright removing them. This will eventually replace the `simple` mode in `ddl.pp.filter_contigs`.
- New column in `.data`: `ambiguous`, T/F to indicate whether a contig is considered ambiguous (different from cell-level ambiguity).
- The `.metadata` slot and several other functions ignore any contigs marked `T`, to maintain the same behaviour as before.
- The largest difference between `ddl.pp.check_contigs` and `ddl.pp.filter_contigs` is that with `check_contigs` the onus is on the user to remove any 'bad' cells from the GEX data (illustrated in the tutorial), whereas this happens semi-automatically with `filter_contigs`.
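The flag-not-filter idea can be illustrated with plain pandas. The flagging rule and column contents here are made up for illustration; they are not the package's actual logic:

```python
import pandas as pd

contigs = pd.DataFrame(
    {
        "sequence_id": ["c1", "c2", "c3"],
        "cell_id": ["cellA", "cellA", "cellB"],
        "umi": [10, 1, 7],
    }
)

# check_contigs-style behaviour: mark suspect contigs instead of dropping them.
# Here the (made-up) rule flags any contig with a very low UMI count.
contigs["ambiguous"] = (contigs["umi"] < 2).map({True: "T", False: "F"})

# Downstream summaries only consider contigs not marked ambiguous,
# so all original rows are retained in the table.
usable = contigs[contigs["ambiguous"] == "F"]
print(usable["sequence_id"].tolist())
```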
`ddl.update_metadata`
- `ddl.update_metadata` now comes with a `by_celltype` option.
- This brings a new feature: B cell, alpha-beta T cell and gamma-delta T cell associated columns for the V, D, J, C and productive columns!
- This is achieved through a new `.retrieve_celltype` subfunction in the `Query` class, which breaks up retrieval into the three major groups when `by_celltype = True`.
- There is no longer a need to guess which column belongs to which cell type, and it allows for easy slicing! This does cause a bit of `.obs` bloating.
- This leads to the removal of `constant_status_VDJ`, `constant_status_VJ`, `productive_status_VDJ` and `productive_status_VJ`, as the metadata was getting bloated with the slight rework of the Dandelion metadata slot to account for the new B/abT/gdT columns.
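The splitting-by-celltype idea can be sketched with pandas. The locus-to-group mapping follows the standard AIRR locus names; the column naming and the pivot itself are illustrative, not the `Query` class internals:

```python
import pandas as pd

# Map AIRR locus names to the three major celltype groups.
LOCUS_GROUP = {
    "IGH": "B", "IGK": "B", "IGL": "B",
    "TRA": "abT", "TRB": "abT",
    "TRG": "gdT", "TRD": "gdT",
}

contigs = pd.DataFrame(
    {
        "cell_id": ["cellA", "cellB", "cellC"],
        "locus": ["IGH", "TRB", "TRD"],
        "v_call": ["IGHV1-2", "TRBV7-9", "TRDV1"],
    }
)
contigs["group"] = contigs["locus"].map(LOCUS_GROUP)

# Spread the V calls into celltype-specific columns,
# e.g. v_call_B_VDJ / v_call_abT_VDJ / v_call_gdT_VDJ.
wide = contigs.pivot(index="cell_id", columns="group", values="v_call")
wide = wide.rename(columns=lambda g: f"v_call_{g}_VDJ").fillna("None")
print(wide)
```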
`tl.productive_ratio`
- Calculates a cell-level representation of productive vs non-productive contigs.
- Plotting is achieved through `pl.productive_ratio`.
`tl.vj_usage_pca`
- Computes a PCA on a cell-level representation of V/J gene usage across designated groupings.
- Uses `scanpy.pp.pca` internally.
- Plotting can be achieved through `scanpy.pl.pca`.
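Conceptually, this builds a groups-by-genes usage matrix and runs PCA on it. A minimal numpy/pandas sketch under that assumption (PCA done via SVD here rather than `scanpy.pp.pca`, and not the actual implementation):

```python
import numpy as np
import pandas as pd

# Per-cell V gene calls with a grouping (e.g. sample or celltype).
cells = pd.DataFrame(
    {
        "group": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "v_call": ["TRBV7", "TRBV7", "TRBV9", "TRBV9", "TRBV9", "TRBV12"],
    }
)

# Group-level V gene usage frequencies (each row sums to 1).
usage = pd.crosstab(cells["group"], cells["v_call"], normalize="index")

# PCA via SVD on the mean-centred usage matrix.
X = usage.to_numpy() - usage.to_numpy().mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * S  # principal-component coordinates, one row per group
print(pcs.shape)
```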
Bug fixes
- Fixed a cell-ordering issue (scverse/scirpy#347).
- Small refactor of `ddl.pp.filter_contigs`:
  - Moved some of the repetitive loops into callable functions.
  - Deprecated the `filter_vj_chains` argument and replaced it with `filter_extra_vdj_chains` and `filter_extra_vj_chains` to hopefully enable more interpretable behaviour. Fixes #158.
  - The UMI adjustment step was buggy, but its behaviour is now consistent with how it functions in `ddl.pp.check_contigs`.
- `rearrangement_status_VDJ` and `rearrangement_status_VJ` (renamed from `rearrangement_VDJ_status` and `rearrangement_VJ_status`) now give a single value indicating whether a chimeric rearrangement occurred, e.g. TRDV pairing with TRAJ and TRAC as in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267242/
- Fixed issues with progress bars getting out of hand.
- Fixed an issue with `ddl.tl.find_clones` crashing if more than one type of locus is found in the data.
  - A `B`, `abT` or `gdT` prefix will now be appended to BCR/TR-ab/TR-gd clones.
- `check_contigs`, `find_clones` and `define_clones` were removing non-productive contigs even though there was no need to. This may cause issues with `filter_contigs`, but that's a problem for next time.
- Fixed an issue with `min_size` in the network not behaving as intended; switched to using connected components to find which nodes to trim.
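The connected-components trimming described in the last bullet can be sketched in pure Python. The graph representation and threshold are illustrative, not dandelion's internals:

```python
from collections import defaultdict

def connected_components(edges, nodes):
    """Return connected components of an undirected graph via iterative traversal."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

# Keep only nodes in components of at least min_size cells.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c")]  # one component {a, b, c} plus singletons {d}, {e}
min_size = 2
kept = {n for comp in connected_components(edges, nodes) for n in comp if len(comp) >= min_size}
print(sorted(kept))
```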
Other changes
- New column `chain_status`, to summarise the reworked `locus_status` column.
  - Should contain values like `ambiguous`, `Orphan VDJ`, `Single pair` etc., similar to `chain_pairing` in scirpy.
- Also fixed the ordering of metadata to make it more presentable, instead of just randomly slotting into the...
v0.2.3
Same as v0.2.2, but the upload to PyPI was botched, so trying again.
What's Changed
- try and add youtube video to docs by @zktuong in #148
- testing_rpy2_update by @zktuong in #150
- Speed upgrade - Refactor generate network by @zktuong in #152
- remove nxviz from requirements by @zktuong in #157
Bug fixes and Improvements
- Speed up `generate_network`
  - Pairwise Hamming distance is now calculated per clone/clonotype, and only if more than one cell is assigned to that clone/clonotype.
  - The `.distance` slot is removed; distances are now stored in/converted directly from the `.graph` slot.
  - New options:
    - `compute_layout: bool = True`. If the dataset is too large, `compute_layout` can be switched to `False`, in which case only the `networkx` graph is returned. The data can still be visualised later with `scirpy`'s plotting method (see below).
    - `layout_method: Literal['sfdp', 'mod_fr'] = 'sfdp'`. The new default uses the ultra-fast, C++-implemented `sfdp_layout` algorithm in `graph-tool` to generate the final layout. `sfdp` stands for Scalable Force-Directed Placement.
      - A minor caveat is that the repulsion is not as good: when there are a lot of singleton nodes, they do not separate well unless you work out which `sfdp_layout` parameters to tweak to produce an effective separation; changing `gamma` alone does not seem to do much.
      - The original layout can still be generated by specifying `layout_method = 'mod_fr'`. This requires a separate installation of `graph-tool` via conda (not managed by pip), as it has several C++ dependencies.
      - pytest on macOS may also stall because a different backend is called; this is solved by reordering tests that call `generate_network` to run last.
  - Added steps to reduce memory hogging.
  - `min_size` was previously doing the opposite of what was intended; this is now fixed. #155
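The per-clone Hamming-distance optimisation can be sketched like this (illustrative data and storage layout, not the package's internals): distances are only computed within clones that have more than one member, so singleton clonotypes cost nothing.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Junction sequences keyed by clonotype, then by cell.
clones = {
    "clone_1": {"cellA": "CASSL", "cellB": "CASSV"},
    "clone_2": {"cellC": "CARDY"},  # singleton: skipped entirely
}

distances = {}
for clone, members in clones.items():
    if len(members) < 2:
        continue  # no pairwise distances needed for singleton clonotypes
    for (c1, s1), (c2, s2) in combinations(members.items(), 2):
        distances[(clone, c1, c2)] = hamming(s1, s2)
print(distances)
```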
- Speed up `transfer`
  - Found a faster way to create the connectivity matrix.
  - `transfer` now also passes along a dictionary that `scirpy` can use to generate its plots (scverse/scirpy#286).
- Fix #153
  - Renamed `productive` to `productive_status`.
- Fix #154
  - Reordered the if-else statements.
- Speed up `filter_contigs`
  - Tree construction is simplified, and for-loops were replaced with dictionary updates.
- Speed up `initialise_metadata`
  - `Dandelion` should now initialise and read faster.
  - Removed an unnecessary data-sanitisation step when loading data.
  - `load_data` now renames `umi_count` to `duplicate_count`.
- Speed up `Query`
  - Tree construction is simplified, and for-loops were replaced with dictionary updates.
  - An AIRR validator is no longer used, as it slowed things down.
- Data initialised by `Dandelion` is now ordered by productive status first, then by UMI count (largest to smallest).
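The renaming and ordering behaviour described above can be mimicked with pandas (an illustrative three-row frame; the real `load_data` handles full AIRR tables):

```python
import pandas as pd

airr = pd.DataFrame(
    {
        "sequence_id": ["c1", "c2", "c3"],
        "productive": ["F", "T", "T"],
        "umi_count": [50, 3, 9],
    }
)

# load_data-style harmonisation: umi_count -> duplicate_count.
airr = airr.rename(columns={"umi_count": "duplicate_count"})

# Order contigs by productive first ('T' before 'F'), then by UMI count,
# largest to smallest.
airr = airr.sort_values(["productive", "duplicate_count"], ascending=[False, False])
print(airr["sequence_id"].tolist())
```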
Breaking Changes
`initialise_metadata`/`update_metadata`/`Dandelion`
- The for-loops used to initialise the object have been vectorised, resulting in a minor speed upgrade.
- This reduces some `.metadata` columns, which were probably bloated and unused:
  - `vdj_status` and `vdj_status_summary` removed and replaced with `rearrangement_VDJ_status` and `rearrangement_VJ_status`.
  - `constant_status` and `constant_summary` removed and replaced with `constant_VDJ_status` and `constant_VJ_status`.
  - `productive` and `productive_summary` combined and replaced with `productive_status`.
  - `locus_status` and `locus_status_summary` combined and replaced with `locus_status`.
  - `isotype_summary` replaced with `isotype_status`.
- Where `.metadata` previously held `unassigned` or `''`, it now holds the string `None`.
  - Not changed to `NoneType`, as quite a bit of internal text processing would get messed up if swapped. `No_contig` will still be populated after transfer to `AnnData`, to reflect cells with no TCR/BCR info.
- Deprecated use of nxviz < 0.7.4.
  - Reworked code to use the updated version at https://github.com/zktuong/nxviz/tree/custom_color_mapping_circos_nodes_and_edges
Minor changes
- Renamed and deprecated `read_h5`/`write_h5`. Use of `read_h5ddl`/`write_h5ddl` will be enforced in the next update.
Full Changelog: v0.2.2...v0.2.3
v0.2.2
What's Changed
The notes are identical to those under v0.2.3 above; v0.2.3 was a re-upload of this release after a failed PyPI upload.
Full Changelog: v0.2.1...v0.2.2
v0.2.1
0.2.0
What's Changed
- add hiconf to pipeline by @ktpolanski in #130
- Singularity ci by @zktuong in #131
- v0.1.13 by @zktuong in #132
- add db-all file by @zktuong in #134
- fix bcr strict option by @zktuong in #136
- clarify and add new detail by @ktpolanski in #137
- fix container workflow by @zktuong in #138
- Internal and external preprocessing steps (igblastn, blastn) and the `Query` classes were also simplified and sped up.
- AIRR-format sanitisation is also enforced, to prevent issues during I/O.
- The full distance slot is no longer automatically saved, to reduce I/O times.
Full Changelog: v0.1.12...v0.2.0