refact refuses to finetune if finds weird bytes in files #267

Closed
worldemar opened this issue Jan 13, 2024 · 2 comments · Fixed by #269

@worldemar
Contributor

An attempt to finetune on a repo that contains files with "strange" (non-decodable) bytes fails at the filter stage.

Environment

  • GPU: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
  • SMI: NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
  • Docker: version 24.0.7, build afdd53b
  • smallcloud/refact_self_hosting_enterprise:nightly (image b3bfadaff5e6)
  • large repository with non-english text in it

Steps to reproduce

  • run refact
  • add repos in web UI
  • select some languages in Sources tab and press Proceed to Finetune
  • when on Finetune tab press Run Filter
  • filter stops with:
    • Status: failed
    • Error: 'utf-8' codec can't decode byte 0xf4 in position 6992: invalid continuation byte
      This message, while helpful, does not narrow the issue down to a particular file.
      On large repos (10k+ files, 10+ languages) it is hard to tell which file caused the problem.
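Until the filter reports the file name itself, the offending files can be located outside of refact. The following is a minimal sketch of such a workaround, not part of refact: the helper name `find_undecodable_files` is hypothetical, and it assumes every file under the repository root is expected to be valid UTF-8.

```python
import os
from pathlib import Path


def find_undecodable_files(root):
    """Return (path, message) pairs for files under root that are not valid UTF-8.

    Hypothetical helper for illustration; it is not part of refact.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                # Decode strictly so the first invalid byte raises an error.
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as exc:
                # str(exc) has the same form as the message shown by the filter.
                problems.append((str(path), str(exc)))
            except OSError as exc:
                problems.append((str(path), f"unreadable: {exc}"))
    return problems
```

Calling `find_undecodable_files("path/to/repo")` and printing each pair gives the file name next to the codec error, which is the piece of information missing from the filter status.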

Additional info

  • I've checked the server logs and did not find the file name in them
  • I've checked the Rejected files: they clearly had no encoding issues and were rejected for being too small (one-line files), so none of them was the culprit
  • Deselecting some languages does not seem to affect the issue; there could be non-UTF-8 comments among otherwise "good" sources

Possible solution

While finetuning on a file that cannot be parsed is clearly impossible, it would make sense to add such a file to the Rejected list, perhaps with a remark that the file could not be read.
This would

  • allow finetuning on the repository as-is, since a few files would hardly affect the quality of the LoRA
  • make it possible to plan fixes for a rejected source file if it is legitimate but was saved incorrectly
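The proposed behavior can be sketched as follows. This is only an illustration under stated assumptions, not refact's actual filter code: the function name `split_accepted_rejected` and the shape of the returned lists are hypothetical.

```python
from pathlib import Path


def split_accepted_rejected(paths):
    """Split files into accepted (path, text) and rejected (path, reason) lists.

    Hypothetical sketch: a file that cannot be decoded is rejected with a
    reason instead of aborting the whole filter run.
    """
    accepted = []
    rejected = []
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError) as exc:
            # Keep the file name next to the error so the user can fix the file.
            rejected.append((str(path), f"could not be read: {exc}"))
            continue
        accepted.append((str(path), text))
    return accepted, rejected
```

With this structure a single badly encoded file costs one entry in the rejected list, and the remaining files still reach the finetune stage.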
@JegernOUTT self-assigned this Jan 15, 2024
@JegernOUTT linked a pull request Jan 15, 2024 that will close this issue
@olegklimov
Contributor

Thanks for reporting, @worldemar!

@worldemar
Contributor Author

Just tested the latest image 5390a6d7598f. The same kind of error is now reported in the Rejected list as
'utf-8' codec can't decode byte 0xb3 in position 374: invalid start byte path/to/file/with/chinese-characters.cpp
and filtering finished successfully.

I do not see any false positives in that list.
Consider the issue closed.
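For readers comparing the two messages in this thread ("invalid continuation byte" for 0xf4 and "invalid start byte" for 0xb3), here is a small sketch of how Python's strict UTF-8 decoder arrives at each. The helper name `describe_decode_failure` and the sample byte strings are made up for illustration; they are not taken from the affected files.

```python
def describe_decode_failure(data):
    """Return (position, byte, reason) for the first UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte, exc.reason the short cause.
        return exc.start, hex(data[exc.start]), exc.reason
    return None


# 0xb3 is in the continuation-byte range, so it cannot begin a character.
start_case = describe_decode_failure(b"ab\xb3\xd4")

# 0xf4 may begin a 4-byte sequence, but the next byte here is plain ASCII.
continuation_case = describe_decode_failure(b"ab\xf4cd")
```

Both cases mean the same thing in practice: the file contains bytes that are not UTF-8, typically because it was saved in a different encoding.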
