Steps to reproduce
select some languages in the Sources tab and press Proceed to Finetune
on the Finetune tab, press Run Filter
the filter stops with:
Status: failed
Error: 'utf-8' codec can't decode byte 0xf4 in position 6992: invalid continuation byte
This, while helpful, does not narrow the issue down to a particular file.
On large repos (10k+ files, 10+ languages) it is hard to tell which file caused the problem.
Additional info
I've checked the server logs and haven't found the file name in them.
I've checked the Rejected files; they clearly had no encoding issues and were rejected due to their small size (one-line files). None of them was the culprit.
Deselecting some languages does not seem to affect the issue; there could be comments written in a non-UTF-8 encoding among the "good" sources.
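As a stopgap for locating the culprit by hand, a short standalone scan can list every file that fails strict UTF-8 decoding, along with the byte offset that the error message reports. This is only a sketch that runs outside the tool: the function name `find_undecodable_files` is hypothetical, and it assumes the sources are available as a plain directory tree on disk.

```python
from pathlib import Path


def find_undecodable_files(root):
    """Return (path, byte offset, reason) for each file under root
    that is not valid UTF-8. Hypothetical helper, not part of the tool."""
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Strict decoding raises on the first invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte,
            # i.e. the "position N" shown in the error message.
            problems.append((str(path), exc.start, exc.reason))
    return problems


if __name__ == "__main__":
    import sys

    for file_path, offset, reason in find_undecodable_files(sys.argv[1]):
        print(f"{file_path}: byte offset {offset}: {reason}")
```

Run against the repository checkout, it prints one line per offending file, which can then be matched against the position in the filter error.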
Possible solution
While finetuning on a file that cannot be parsed is clearly impossible, it would make sense to add it to the Rejected list, maybe with a remark that the file could not be read.
This would
allow finetuning the repository as-is, since a few files would hardly affect the quality of the LoRA
make it possible to plan fixes to a Rejected source if it is a legitimate file that was saved incorrectly
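The proposed behavior could look roughly like the sketch below: instead of aborting on the first decode error, the filter records the unreadable file together with a reason and moves on. This is an illustration only, not the project's actual filter code; `filter_sources` and the shape of the returned lists are assumptions made for the example.

```python
from pathlib import Path


def filter_sources(paths):
    """Split paths into (accepted, rejected).

    Hypothetical sketch: rejected entries are (path, reason) pairs so the
    UI could show why a file was skipped.
    """
    accepted, rejected = [], []
    for path in paths:
        try:
            Path(path).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError) as exc:
            # Keep going; remember which file failed and why.
            rejected.append((str(path), f"could not be read: {exc}"))
        else:
            accepted.append(str(path))
    return accepted, rejected
```

With this shape, one badly encoded file costs a single entry in the Rejected list rather than the whole filtering run, and the reason string carries the same decoder message that currently appears without a file name.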
Just tested the latest image 5390a6d7598f: the same error is now reported in the Rejected list as
'utf-8' codec can't decode byte 0xb3 in position 374: invalid start byte path/to/file/with/chinese-characters.cpp
and filtering finished successfully.
I do not see any false positives in that list.
Consider the issue closed.
An attempt to finetune a repo with files containing "strange" (non-decodable) characters is stopped at the filter stage.
Environment
NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
version 24.0.7, build afdd53b
b3bfadaff5e6