Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when trying to train deepmod2 #33

Closed
spoweekkk opened this issue Jan 16, 2025 · 3 comments

Comments

@spoweekkk

When I tried to train the model with the following pipeline:
PREDICTION_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
INPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2/
OUTPUT_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/lustre2/nanopore/nanopore_20241122ac1/no_sample/20241226ac/deepmod2
DORADO_PATH="/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/dorado/bin"
DeepMod2_DIR=/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2

${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 > ${INPUT_DIR}/ac_train/bam/ac_train.bam
${DORADO_PATH}/dorado basecaller --emit-moves --reference ${INPUT_DIR}/ac_reference.fa ${DORADO_PATH}/models/[email protected] ${INPUT_DIR}/ac_train/pod5/control_train.pod5 > ${INPUT_DIR}/ac_train/bam/control_train.bam

python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/ac_train.bam --input ${INPUT_DIR}/ac_train/pod5/ac_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/mod/ --window 10 --motif C 0 --motif_label 1
python ${DeepMod2_DIR}/train/generate_features.py --bam ${INPUT_DIR}/ac_train/bam/control_train.bam --input ${INPUT_DIR}/ac_train/pod5/control_train.pod5 --ref ${INPUT_DIR}/ac_reference.fa --file_type pod5 --seq_type dna --threads 16 --output ${OUTPUT_DIR}/features/can/ --window 10 --motif C 0 --motif_label 0

python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_transformer/ --epochs 10 --batch_size 128 --model_type transformer --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --include_ref --l2_coef 0.01 --seed 0
#train model using modified and canonical base samples
python ${DeepMod2_DIR}/train/train_models.py --can_training_dataset ${OUTPUT_DIR}/features/can/ --mod_training_dataset ${OUTPUT_DIR}/features/mod/ --validation_type split --validation_fraction 0.5 --model_save_path ${OUTPUT_DIR}/ac_can_mod_bilstm/ --epochs 10 --batch_size 128 --model_type bilstm --num_layers 2 --num_fc 32 --dim_feedforward 32 --lr 0.01 --window 10 --include_ref --l2_coef 0.01 --seed 0

I got the following error:
Traceback (most recent call last):
  File "/home/jdhan_pkuhpc/profiles/zhoushibo/gpfs1/Software/DeepMod2/train/train_models.py", line 426, in <module>
    valid_data, window, norm_type, strides_per_base, model_depth = check_training_files(mixed_training_dataset, can_training_dataset,
                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs1/jdhan_pkuhpc/zhoushibo/Software/DeepMod2/train/utils.py", line 62, in check_training_files
    strides_per_base=[int(np.load(file)['strides_per_base']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                          ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/lustre2/jdhan_pkuhpc/common/mamba/envs/deepmod2/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py", line 263, in __getitem__
    raise KeyError(f"{key} is not a file in the archive")
KeyError: 'strides_per_base is not a file in the archive'

What does this error mean? Could you offer some help, thanks a lot!

@spoweekkk spoweekkk changed the title Miss Error "KeyError: 'strides_per_base is not a file in the archive'" Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when tring to train deepmod2 Jan 16, 2025
@spoweekkk spoweekkk changed the title Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when tring to train deepmod2 Miss Error "KeyError: 'strides_per_base is not a file in the archive'" when trying to train deepmod2 Jan 16, 2025
@spoweekkk
Author

I noticed that generate_features.py writes the features to the npz file with the following code:

def write_to_npz(output_file_path, mat, base_qual, base_seq, ref_seq, label, ref_coordinates, read_name, ref_name, window, norm_type):
    np.savez(output_file_path, mat=mat, base_qual=base_qual, base_seq=base_seq, ref_seq=ref_seq, label=label, ref_coordinates=ref_coordinates, read_name=read_name, ref_name=ref_name, window=window, norm_type=norm_type)

Neither strides_per_base nor model_depth is written when the npz files are created. However, the code that reads the features expects both:

def check_training_files(mixed_training_dataset, can_training_dataset,
                         mod_training_dataset, validation_dataset):
    norm_type=[str(np.load(file)['norm_type']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]
    window=[int(np.load(file)['window']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]
    strides_per_base=[int(np.load(file)['strides_per_base']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]

    model_depth=[int(np.load(file)['model_depth']) for file in itertools.chain.from_iterable([mixed_training_dataset, can_training_dataset,
                                           mod_training_dataset, validation_dataset])]

Where should "strides_per_base" and "model_depth" be set when creating the npz files?
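
To confirm the mismatch on a given feature file, the keys stored in the archive can be listed and compared with the ones the training code reads. The following is a minimal sketch, not part of DeepMod2: the helper name missing_keys and the REQUIRED_KEYS tuple are assumptions for illustration, with the key names taken from the check_training_files code quoted above.

```python
import numpy as np

# Keys that check_training_files reads from every feature file
# (taken from the code quoted above; this list is an assumption
# and may not be complete for newer versions of the repository).
REQUIRED_KEYS = ("window", "norm_type", "strides_per_base", "model_depth")


def missing_keys(npz_path, required=REQUIRED_KEYS):
    """Return the required keys that are absent from one .npz archive."""
    # NpzFile.files lists the stored array names without loading the arrays.
    with np.load(npz_path) as archive:
        present = set(archive.files)
    return [key for key in required if key not in present]
```

Calling missing_keys on a file written by the older write_to_npz should report strides_per_base and model_depth, which matches the KeyError in the traceback.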

@umahsn
Collaborator

umahsn commented Jan 17, 2025

Hi,

I have fixed the issue in commit 203673d on the main branch. I changed the training method to support some generalized training strategies that are not yet published, but forgot to update the earlier code. If you generate the features again, the npz files will now contain strides_per_base, model_depth, and another variable, "full_signal". As a result, you no longer need to provide the --window parameter during training after this update. I have updated the training document as well: https://github.com/WGLab/DeepMod2/blob/main/docs/Training.md

If you don't want to regenerate the training files, you can still work with the current files. You can set these parameters manually by replacing lines L62-L67 in train/utils.py with the values below:

strides_per_base=1
model_depth=9
full_signal=False
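
As an alternative to hard-coding the values, a reader could fall back to them only when a file lacks the key. This is a rough sketch under stated assumptions, not the repository's code: read_param and LEGACY_DEFAULTS are hypothetical names, and the default values are the ones given above for feature files created before the fix.

```python
import numpy as np

# Values suggested above for feature files generated before the fix.
LEGACY_DEFAULTS = {"strides_per_base": 1, "model_depth": 9, "full_signal": False}


def read_param(npz_path, key, defaults=LEGACY_DEFAULTS):
    """Read a scalar parameter from an .npz file, else use a legacy default."""
    with np.load(npz_path) as archive:
        if key in archive.files:
            # Scalars saved with np.savez come back as 0-d arrays;
            # .item() converts them to a plain Python value.
            return archive[key].item()
    if key in defaults:
        return defaults[key]
    raise KeyError(f"{key} is missing from {npz_path} and has no default")
```

With this approach, older and regenerated feature files can be read by the same code path, at the cost of silently assuming the legacy values for any file that lacks them.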

Let me know if this solves the issue.

Best,
Umair

@spoweekkk
Author

Thanks for your quick reply, it works!

@umahsn umahsn closed this as completed Jan 28, 2025