Hello, could you explain why `prepare_rotary_positional_embeddings` in lora_trainer.py is written as follows:
```python
def prepare_rotary_positional_embeddings(
    self,
    height: int,
    width: int,
    num_frames: int,
    transformer_config: Dict,
    vae_scale_factor_spatial: int,
    device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor]:
    grid_height = height // (vae_scale_factor_spatial * transformer_config.patch_size)
    grid_width = width // (vae_scale_factor_spatial * transformer_config.patch_size)

    if transformer_config.patch_size_t is None:
        base_num_frames = num_frames
    else:
        base_num_frames = (num_frames + transformer_config.patch_size_t - 1) // transformer_config.patch_size_t

    freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
        embed_dim=transformer_config.attention_head_dim,
        crops_coords=None,
        grid_size=(grid_height, grid_width),
        temporal_size=base_num_frames,
        grid_type="slice",
        max_size=(grid_height, grid_width),
        device=device,
    )

    return freqs_cos, freqs_sin
```
while at inference time, in pipeline_cogvideox_image2video.py, it is instead:
```python
def _prepare_rotary_positional_embeddings(
    self,
    height: int,
    width: int,
    num_frames: int,
    device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor]:
    grid_height = height // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
    grid_width = width // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)

    p = self.transformer.config.patch_size
    p_t = self.transformer.config.patch_size_t

    base_size_width = self.transformer.config.sample_width // p
    base_size_height = self.transformer.config.sample_height // p

    if p_t is None:
        # CogVideoX 1.0
        grid_crops_coords = get_resize_crop_region_for_grid(
            (grid_height, grid_width), base_size_width, base_size_height
        )
        freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
            embed_dim=self.transformer.config.attention_head_dim,
            crops_coords=grid_crops_coords,
            grid_size=(grid_height, grid_width),
            temporal_size=num_frames,
            device=device,
        )
    else:
        # CogVideoX 1.5
        base_num_frames = (num_frames + p_t - 1) // p_t
        freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
            embed_dim=self.transformer.config.attention_head_dim,
            crops_coords=None,
            grid_size=(grid_height, grid_width),
            temporal_size=base_num_frames,
            grid_type="slice",
            max_size=(base_size_height, base_size_width),
            device=device,
        )

    return freqs_cos, freqs_sin
```
Which one is correct, the training version or the inference version?
The code in the pipeline is somewhat verbose for historical reasons; we recommend using the training code as the reference. Although the implementations differ, the final results are the same.
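As a side note, here is a minimal sketch (not from the repository) of how one might check that claim numerically at the model's native resolution. It assumes a recent diffusers release that exposes `get_3d_rotary_pos_embed` in `diffusers.models.embeddings` and `get_resize_crop_region_for_grid` in the CogVideoX pipeline module; the head dim, patch size, VAE scale factor, resolution, and latent frame count below are illustrative values, not read from any checkpoint.

```python
# Compare the RoPE frequencies from the training-style call (grid_type="slice",
# no crop coords) with the inference-style CogVideoX 1.0 call (crop coords from
# the base grid). All config numbers here are illustrative assumptions.
import torch
from diffusers.models.embeddings import get_3d_rotary_pos_embed
from diffusers.pipelines.cogvideox.pipeline_cogvideox import get_resize_crop_region_for_grid

attention_head_dim = 64
patch_size = 2
vae_scale_factor_spatial = 8
height, width, num_frames = 480, 720, 13  # pixel resolution and latent frame count

grid_height = height // (vae_scale_factor_spatial * patch_size)  # 30
grid_width = width // (vae_scale_factor_spatial * patch_size)    # 45

# At the native resolution the base grid equals the requested grid.
base_size_height, base_size_width = grid_height, grid_width

# Training-style path (lora_trainer.py): "slice" grid with max_size.
train_cos, train_sin = get_3d_rotary_pos_embed(
    embed_dim=attention_head_dim,
    crops_coords=None,
    grid_size=(grid_height, grid_width),
    temporal_size=num_frames,
    grid_type="slice",
    max_size=(grid_height, grid_width),
)

# Inference-style path (pipeline, CogVideoX 1.0 branch): crop region over the base grid.
grid_crops_coords = get_resize_crop_region_for_grid(
    (grid_height, grid_width), base_size_width, base_size_height
)
infer_cos, infer_sin = get_3d_rotary_pos_embed(
    embed_dim=attention_head_dim,
    crops_coords=grid_crops_coords,
    grid_size=(grid_height, grid_width),
    temporal_size=num_frames,
)

# If the two constructions coincide at this resolution, this prints: True True
print(torch.allclose(train_cos, infer_cos), torch.allclose(train_sin, infer_sin))
```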
May I ask whether you used the training code or the inference code when you originally trained the pretrained weights for the current CogVideoX-5b-I2V and CogVideoX1.5-5B-I2V?