Error "KeyError: 'strides_per_base is not a file in the archive'" when trying to train DeepMod2 #33
Comments
I notice that generate_features.py writes the features to the npz file with this function:
def write_to_npz(output_file_path, mat, base_qual, base_seq, ref_seq, label, ref_coordinates, read_name, ref_name, window, norm_type):
Nothing about strides_per_base or model_depth is stored when the npz files are created. However, the code that reads the features starts with:
def check_training_files(mixed_training_dataset, can_training_dataset,
Where should "strides_per_base" and "model_depth" be set when creating the npz files?
Hi, I have fixed the issue in the 203673d commit of the main branch. I made some changes to the training method to support some generalized training strategies that are not yet published, but forgot to update the earlier code. You should now be able to regenerate features so that strides_per_base, model_depth and another variable, "full_signal", are created and saved in the npz files. As a result, you also no longer need to provide the --window parameter during training after this update. I have updated the training document as well: https://github.com/WGLab/DeepMod2/blob/main/docs/Training.md
If you don't want to regenerate the training files, you can still work with the current files. You can manually set these parameters by replacing lines L62-L67 in train/utils.py with the values below:
Let me know if this solves the issue. Best,
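To decide whether existing feature files need regenerating, a small check along the following lines may help. This is only a sketch and not part of DeepMod2: the helper name find_files_missing_keys is hypothetical, it assumes the features are stored as .npz files somewhere under the given directory, and the key names are taken from the reply above (strides_per_base, model_depth, full_signal). The values written in the demo are placeholders, not the values DeepMod2 would store.

```python
import tempfile
from pathlib import Path

import numpy as np

# Key names as mentioned in the reply above; treat them as an assumption.
REQUIRED_KEYS = ("strides_per_base", "model_depth", "full_signal")


def find_files_missing_keys(feature_dir, required=REQUIRED_KEYS):
    """Map each .npz file name under feature_dir to the required keys it lacks."""
    missing = {}
    for path in sorted(Path(feature_dir).rglob("*.npz")):
        # NpzFile.files lists the array names stored in the archive.
        with np.load(path) as archive:
            absent = [key for key in required if key not in archive.files]
        if absent:
            missing[path.name] = absent
    return missing


if __name__ == "__main__":
    # Demo on throwaway files with placeholder contents.
    with tempfile.TemporaryDirectory() as tmp:
        np.savez(Path(tmp) / "old.npz", mat=np.zeros((2, 3)), window=10)
        np.savez(
            Path(tmp) / "new.npz",
            mat=np.zeros((2, 3)),
            window=10,
            strides_per_base=1,  # placeholder value
            model_depth=4,       # placeholder value
            full_signal=False,   # placeholder value
        )
        print(find_files_missing_keys(tmp))
```

Any file reported by such a check would trigger the KeyError during training unless the features are regenerated or the lines in train/utils.py are edited as described.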
Thanks for your quick reply, it works!
When I tried to train the model with the following pipeline:
PREDICTION_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
INPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2/
OUTPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
DORADO_PATH="/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/dorado/bin"
DeepMod2_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2
${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 > ${INPUT_DIR}/ac_train/bam/ac_train.bam
${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/control_train.pod5 > ${INPUT_DIR}/ac_train/bam/control_train.bam
python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/ac_train.bam --input ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/mod/ --window 10 --motif C 0 --motif_label 1
python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/control_train.bam --input ${INPUT_DIR}/ac_train/pod5/control_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/can/ --window 10 --motif C 0 --motif_label 0
python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_transformer/ --epochs 10 --batch_size 128 --model_type transformer --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --include_ref --l2_coef 0.01 --seed 0
#train model using modified and canonical base samples
python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_bilstm/ --epochs 10 --batch_size 128 --model_type bilstm --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --window 10 --include_ref --l2_coef 0.01 --seed 0
I got the following error:
Traceback (most recent call last):
File "/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2/train/train_models.py", line 426, in <module>
valid_data, window, norm_type, strides_per_base, model_depth = check_training_files(mixed_training_dataset, can_training_dataset,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs1/jdhan_pkuhpc/zhoushibo/Software/DeepMod2/train/utils.py", line 62, in check_training_files
strides_per_base=[int(np.load(file)['strides_per_base']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/lustre2/jdhan_pkuhpc/common/mamba/envs/deepmod2/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py", line 263, in __getitem__
raise KeyError(f"{key} is not a file in the archive")
KeyError: 'strides_per_base is not a file in the archive'
What does this error mean? Could you offer some help? Thanks a lot!
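As a rough illustration of what the message means: NumPy raises this KeyError when the requested name was never saved into the .npz archive. The sketch below reproduces it on a throwaway file. The array names and values are placeholders chosen for the example, not DeepMod2's actual feature layout.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npz")
    # Save an archive that does not contain 'strides_per_base'
    # (placeholder contents, for illustration only).
    np.savez(path, mat=np.zeros((2, 3)), window=10)

    with np.load(path) as archive:
        # List the names that were actually stored in the archive.
        print(archive.files)
        try:
            archive["strides_per_base"]
        except KeyError as err:
            # Same kind of failure as in the traceback above.
            message = str(err)
            print("KeyError:", message)
```

In other words, the feature files being loaded were written without a strides_per_base entry, so the lookup in check_training_files fails.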