refact refuses to finetune if finds weird bytes in files #267

worldemar · 2024-01-13T15:42:22Z

An attempt to finetune a repo with files containing "strange" (non-decodable) characters fails at the filter stage.

Environment

GPU: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
SMI: NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
Docker: version 24.0.7, build afdd53b
smallcloud/refact_self_hosting_enterprise:nightly (image b3bfadaff5e6)
Repository: large, with non-English text in it

Steps to reproduce

run refact
add repos in web UI
select some languages in Sources tab and press Proceed to Finetune
when on Finetune tab press Run Filter
filter stops with:
- Status: failed
- Error: 'utf-8' codec can't decode byte 0xf4 in position 6992: invalid continuation byte
  This message, while helpful, does not narrow the issue down to a particular file.
  On large repos (10k+ files, 10+ languages) it is hard to tell which file caused the problem.

Additional info

I've checked the server logs and did not find the file name in them.
I've checked the Rejected files: they clearly had no encoding issues and were rejected for being too small (one-line files), so none of them was the culprit.
Deselecting some languages does not seem to affect the issue; there could be non-UTF-8 comments among otherwise "good" sources.
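
Until the filter reports the file name itself, the offending files can be located outside of refact. The following is a minimal sketch, not part of refact: the function name `find_undecodable_files` and the choice to scan every file under a directory are assumptions made for illustration. It reads each file as bytes, tries to decode it as UTF-8, and records the path together with the byte offset and reason taken from the `UnicodeDecodeError`.

```python
import os


def find_undecodable_files(root, encoding="utf-8"):
    """Walk `root` and return (path, byte_offset, reason) for every file
    whose contents cannot be decoded with `encoding`.

    Hypothetical helper for illustration; refact does not ship this function.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    data = f.read()
                data.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first undecodable byte
                problems.append((path, exc.start, exc.reason))
            except OSError:
                # unreadable files (permissions, broken symlinks) are skipped
                continue
    return problems


if __name__ == "__main__":
    import sys

    for path, offset, reason in find_undecodable_files(sys.argv[1]):
        print(f"{path}: byte offset {offset}: {reason}")
```

Run against a checkout of the repository, this prints one line per file that is not valid UTF-8, which should include the file the filter fails on.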

Possible solution

While finetuning on a file that cannot be parsed is clearly impossible, it would make sense to add such a file to the Rejected list, maybe with a remark that the file could not be read.
This would

make it possible to finetune the repository as-is; a few files would hardly affect the quality of the LoRA
make it possible to plan fixes for a Rejected source file if it is legitimate but was saved incorrectly
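
The proposed behaviour can be sketched as follows. This is an illustration under stated assumptions, not refact's actual filter code: the function `filter_decodable` and the `(path, reason)` shape of the rejected entries are hypothetical. The idea is to catch the decode error per file and record it as a rejection reason instead of letting it abort the whole filter run.

```python
def filter_decodable(paths, encoding="utf-8"):
    """Split `paths` into files that decode cleanly and files that do not.

    Returns (accepted, rejected), where rejected is a list of
    (path, reason) tuples. Hypothetical sketch; not refact's filter.
    """
    accepted, rejected = [], []
    for path in paths:
        try:
            with open(path, "r", encoding=encoding) as f:
                f.read()
        except (UnicodeDecodeError, OSError) as exc:
            # Keep going: one unreadable file should not fail the whole run.
            rejected.append((path, str(exc)))
        else:
            accepted.append(path)
    return accepted, rejected
```

With this approach the rejection reason carries both the decoder message and the path, so a user can see which file to fix or re-save.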


olegklimov · 2024-01-15T13:46:54Z

Thanks for reporting @worldemar !

worldemar · 2024-01-16T13:38:58Z

Just tested the latest image 5390a6d7598f; the same error is now reported as
'utf-8' codec can't decode byte 0xb3 in position 374: invalid start byte path/to/file/with/chinese-characters.cpp
in the Rejected list, and filtering finished successfully.

I do not see any false positives in that list.
Consider issue closed.

JegernOUTT self-assigned this Jan 15, 2024

JegernOUTT added the 🛑 blocker label Jan 15, 2024

JegernOUTT linked a pull request Jan 15, 2024 that will close this issue

Files filtering fix #269

Merged

worldemar closed this as completed Jan 16, 2024
