Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when trying to train deepmod2 #33

spoweekkk · 2025-01-16T22:03:36Z

When I tried to train the model with the following pipeline:
PREDICTION_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
INPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2/
OUTPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
DORADO_PATH="/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/dorado/bin"
DeepMod2_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2

${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 > ${INPUT_DIR}/ac_train/bam/ac_train.bam
${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/control_train.pod5 > ${INPUT_DIR}/ac_train/bam/control_train.bam

python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/ac_train.bam --input ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/mod/ --window 10 --motif C 0 --motif_label 1
python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/control_train.bam --input ${INPUT_DIR}/ac_train/pod5/control_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/can/ --window 10 --motif C 0 --motif_label 0

python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_transformer/ --epochs 10 --batch_size 128 --model_type transformer --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --include_ref --l2_coef 0.01 --seed 0
#train model using modified and canonical base samples
python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_bilstm/ --epochs 10 --batch_size 128 --model_type bilstm --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --window 10 --include_ref --l2_coef 0.01 --seed 0

I got the following error:
Traceback (most recent call last):
File "/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2/train/train_models.py", line 426, in <module>
valid_data, window, norm_type, strides_per_base, model_depth = check_training_files(mixed_training_dataset, can_training_dataset,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs1/jdhan_pkuhpc/zhoushibo/Software/DeepMod2/train/utils.py", line 62, in check_training_files
strides_per_base=[int(np.load(file)['strides_per_base']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/lustre2/jdhan_pkuhpc/common/mamba/envs/deepmod2/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py", line 263, in __getitem__
raise KeyError(f"{key} is not a file in the archive")
KeyError: 'strides_per_base is not a file in the archive'

What does this error mean? Could you offer some help? Thanks a lot!
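
For readers hitting the same message: NumPy raises this KeyError when an .npz archive is indexed with a name that was never saved into it. The sketch below reproduces that behaviour in isolation. The file name, the helper `missing_key_message`, and the value `"mad"` for norm_type are illustrative assumptions, not taken from DeepMod2.

```python
import os
import tempfile

import numpy as np


def missing_key_message(key="strides_per_base"):
    """Save an .npz that lacks `key`, then try to read `key` back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "features.npz")  # hypothetical file name
        # Only 'window' and 'norm_type' are stored here, so any other key is absent.
        np.savez(path, window=10, norm_type="mad")  # "mad" is a placeholder value
        with np.load(path) as archive:
            try:
                archive[key]
            except KeyError as err:
                return str(err)
    return None


message = missing_key_message()
print(message)
```

The printed text names the key that the reader asked for but the writer never stored, which is the situation described in the traceback above.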

spoweekkk · 2025-01-16T22:32:37Z

I noticed that generate_features.py writes the features to the npz file with the following code:

def write_to_npz(output_file_path, mat, base_qual, base_seq, ref_seq, label, ref_coordinates, read_name, ref_name, window, norm_type):
    np.savez(output_file_path, mat=mat, base_qual=base_qual, base_seq=base_seq, ref_seq=ref_seq, label=label, ref_coordinates=ref_coordinates, read_name=read_name, ref_name=ref_name, window=window, norm_type=norm_type)

Neither strides_per_base nor model_depth is written when the npz files are created. However, the code that reads the features looks like this:

def check_training_files(mixed_training_dataset, can_training_dataset,
                         mod_training_dataset, validation_dataset):
    norm_type=[str(np.load(file)['norm_type']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]
    window=[int(np.load(file)['window']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]
    strides_per_base=[int(np.load(file)['strides_per_base']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]

    model_depth=[int(np.load(file)['model_depth']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]

Where should "strides_per_base" and "model_depth" be set when creating the npz files?
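
One way to confirm the mismatch between writer and reader is to list which expected keys each feature file lacks. This is a rough sketch only: `find_missing_keys` and `REQUIRED_KEYS` are hypothetical names, the key list is taken from the `check_training_files` excerpt above, and it assumes the feature files sit directly in the directory as `*.npz`.

```python
import glob
import os
import tempfile

import numpy as np

# Keys that the check_training_files excerpt reads from every file.
REQUIRED_KEYS = ("norm_type", "window", "strides_per_base", "model_depth")


def find_missing_keys(feature_dir, required=REQUIRED_KEYS):
    """Map each .npz path in feature_dir to the required keys it lacks."""
    report = {}
    for path in sorted(glob.glob(os.path.join(feature_dir, "*.npz"))):
        with np.load(path) as archive:
            # .files lists the array names stored in the archive.
            missing = [key for key in required if key not in archive.files]
        if missing:
            report[path] = missing
    return report


# Demonstration on throwaway files; real use would pass a features directory.
with tempfile.TemporaryDirectory() as tmp:
    np.savez(os.path.join(tmp, "old.npz"), window=10, norm_type="mad")
    np.savez(os.path.join(tmp, "new.npz"), window=10, norm_type="mad",
             strides_per_base=1, model_depth=9)
    demo_report = {os.path.basename(p): m
                   for p, m in find_missing_keys(tmp).items()}

print(demo_report)
```

An empty report would mean every file already carries the keys the training script asks for.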

umahsn · 2025-01-17T02:28:08Z

Hi,

I have fixed the issue in commit 203673d on the main branch. I changed the training method to work with some generalized training strategies that are not yet published, but forgot to update the earlier code. If you generate the features again, strides_per_base, model_depth and another variable, "full_signal", will be created and saved in the npz files. As a result, you also no longer need to provide the --window parameter during training after this update. I have updated the training documentation as well: https://github.com/WGLab/DeepMod2/blob/main/docs/Training.md

If you don't want to regenerate the training files, you can still work with the current files. You can set these parameters manually by replacing lines L62-L67 in train/utils.py with the values below:

strides_per_base=1
model_depth=9
full_signal=False

Let me know if this solves the issue.
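
An alternative to editing train/utils.py, not confirmed in this thread, would be to write the same three values into copies of the existing feature files. The sketch below assumes the training code reads "full_signal" from the archive the same way it reads the other keys, and that the archives contain no object arrays (otherwise `np.load` would need `allow_pickle=True`). `add_missing_keys` and `DEFAULTS` are hypothetical names.

```python
import os
import tempfile

import numpy as np

# Values suggested above for feature files generated before the fix.
DEFAULTS = {"strides_per_base": 1, "model_depth": 9, "full_signal": False}


def add_missing_keys(src_path, dst_path, defaults=DEFAULTS):
    """Copy an .npz to dst_path, adding any default key it does not contain."""
    with np.load(src_path) as archive:
        arrays = {name: archive[name] for name in archive.files}
    for key, value in defaults.items():
        # Existing keys are left untouched; only absent ones are filled in.
        arrays.setdefault(key, np.asarray(value))
    np.savez(dst_path, **arrays)
    return sorted(arrays)


# Demonstration on a throwaway file that mimics an old-style archive.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "old.npz")
    dst = os.path.join(tmp, "patched.npz")
    np.savez(src, window=10, norm_type="mad", mat=np.zeros((2, 3)))
    patched_keys = add_missing_keys(src, dst)
    with np.load(dst) as patched:
        patched_values = (int(patched["strides_per_base"]),
                          int(patched["model_depth"]),
                          bool(patched["full_signal"]),
                          int(patched["window"]))

print(patched_keys, patched_values)
```

Writing to a separate destination keeps the original files intact, so the patched copies can be discarded if regenerating the features turns out to be the better route.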

Best,
Umair

spoweekkk · 2025-01-17T08:07:45Z

Thanks for your quick reply, it works!

spoweekkk changed the title ~~Miss Error "KeyError: 'strides_per_base is not a file in the archive'"~~ Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when tring to train deepmod2 Jan 16, 2025

spoweekkk changed the title ~~Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when tring to train deepmod2~~ Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when trying to train deepmod2 Jan 16, 2025

umahsn closed this as completed Jan 28, 2025
