No clear indication why finetune GPU filter has stopped #126

Closed

hazratisulton opened this issue Sep 25, 2023 · 5 comments

Labels: bug (Something isn't working), release-enterprise

Comments
@hazratisulton

Finetune stops with retcode 1
[screenshot]

update /perm_storage/uploaded-files/nextjs-starter-prismic-blog repo => success
stats: 67 good, 1 too large, 1 generated, 2 vendored
marking 71 files from nextjs-starter-prismic-blog to which_set="train", to_db=False
total files 71
dedup...
after dedup 63 files
Reading /perm_storage/cfg/sources_filetypes.cfg
Will not overwrite '/tmp/unpacked-files/train_set.jsonl' because it is exactly the same as the current output
Will not overwrite '/tmp/unpacked-files/test_set.jsonl' because it is exactly the same as the current output
Will not overwrite '/tmp/unpacked-files/database_set.jsonl' because it is exactly the same as the current output
Retrieving dataset length per epoch, it may take a while...
Dataset length per epoch = 3
Lora parameters heuristic avg_loss=0.00, ds_len=3 => complexity score=0
Selected the schedule by heuristics ds_len=3:
Total training steps: 50

Selected low_gpu_mem_mode=True by total gpu memory

Freeze exceptions: ['wte', 'lm_head', 'lora']
Lora config:  lora_target_modules ['qkv', 'out', 'mlp']
Lora config:               lora_r 64
Lora config:           lora_alpha 128
Lora config:         lora_dropout 0.01
Lora config:      lora_init_scale 0.01
creating model...
/model 14850.1ms
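
For reference, the logged LoRA settings map roughly onto a PEFT-style config like the sketch below. This is only an illustration of what the numbers mean; refact uses its own LoRA implementation, and `lora_init_scale` has no direct PEFT equivalent.

```python
# Hedged sketch: the logged LoRA hyperparameters expressed as a
# peft-style LoraConfig, for illustration only (refact's finetune
# has its own LoRA code; module names are taken from the log).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                                  # lora_r
    lora_alpha=128,                        # lora_alpha
    lora_dropout=0.01,                     # lora_dropout
    target_modules=["qkv", "out", "mlp"],  # lora_target_modules
)
# lora_init_scale=0.01 from the log has no direct peft counterpart.
```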
@olegklimov
Contributor

python ~/code/refact/refact_data_pipeline/finetune/finetune_train.py

Please remove all models, run this, and post the logs.
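
If it dies without a traceback again, a small wrapper that records the exit code and the tail of the output can help (a sketch; the script path is the one from the command above):

```python
# Hedged sketch: run finetune_train.py as a subprocess and keep its
# exit code plus the tail of stdout/stderr, so a silent death still
# leaves evidence behind.
import os
import subprocess

script = os.path.expanduser(
    "~/code/refact/refact_data_pipeline/finetune/finetune_train.py")
proc = subprocess.run(["python", script], capture_output=True, text=True)
print("exit code:", proc.returncode)
print("stdout tail:", proc.stdout[-2000:])
print("stderr tail:", proc.stderr[-2000:])
```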

@olegklimov olegklimov moved this to In Progress in Self-hosted / Enterprise Oct 3, 2023
@olegklimov olegklimov changed the title from "finetune stops suddenly with status 'failed'" to "No clear indication why finetune GPU filter has stopped" Oct 3, 2023
@hazratisulton
Author

King Dave from Discord:
Finetune just fails silently:
[screenshot]

I started the container from scratch, edited the file to adjust the T value, and then did the filtering via the GUI.
After that finished, I ran:
docker exec -it refact python /usr/local/lib/python3.8/dist-packages/refact_data_pipeline/finetune/finetune_train.py
Below is the output of the script; it just suddenly stops.

20231003 20:34:32 FTUNE NumExpr defaulting to 4 threads.
20231003 20:34:32 FTUNE STATUS working
20231003 20:34:32 FTUNE starting finetune at /perm_storage/loras/lora-20231003-203432
Retrieving dataset length per epoch, it may take a while...
Dataset length per epoch = 2
Lora parameters heuristic avg_loss=0.65, ds_len=2 => complexity score=0
Selected the schedule by heuristics ds_len=2:
Total training steps: 50

Selected low_gpu_mem_mode=True by total gpu memory

Freeze exceptions: ['wte', 'lm_head', 'lora']
Lora config:  lora_target_modules ['qkv', 'out', 'mlp']
Lora config:               lora_r 64
Lora config:           lora_alpha 128
Lora config:         lora_dropout 0.01
Lora config:      lora_init_scale 0.01
creating model...
/model 6387.2ms
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
20231003 20:34:39 FTUNE Added key: store_based_barrier_key:1 to store for rank: 0
20231003 20:34:39 FTUNE Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
20231003 20:34:39 FTUNE Added key: store_based_barrier_key:2 to store for rank: 0
20231003 20:34:39 FTUNE Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[tokenizers fork warning repeated]
[2023-10-03 20:34:40,701] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[tokenizers fork warning repeated ×8]
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
[tokenizers fork warning repeated ×3]
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[tokenizers fork warning repeated]
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.503849983215332 seconds
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
[tokenizers fork warning repeated ×3]
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[tokenizers fork warning repeated]
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.19377613067626953 seconds
Rank: 0 partition count [1] and sizes[(265814016, False)]
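
A stop this silent, right after DeepSpeed's optimizer partition line, often means the process was killed from outside (for example by the kernel OOM killer) rather than by a Python exception. One way to get more evidence from inside the process is Python's faulthandler (a sketch, not part of the refact code; a SIGKILL from the OOM killer cannot be caught, so also check `dmesg` on the host):

```python
# Hedged sketch: dump a Python traceback on fatal signals instead of
# dying silently. Note: SIGKILL (e.g. from the OOM killer) cannot be
# intercepted by the process itself.
import faulthandler
import signal

faulthandler.enable()                  # SIGSEGV, SIGFPE, SIGABRT, SIGBUS
faulthandler.register(signal.SIGTERM)  # also dump on SIGTERM
```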

@JegernOUTT
Member

These logs are completely uninformative.
I guess we need to fix this by enabling some extra debug logs.
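
For illustration, extra logging of roughly this shape would surface the failure reason (a sketch only; `train()` is a placeholder, not the actual refact entry point or the eventual fix):

```python
# Hedged sketch: a top-level guard that logs a full traceback before
# the process exits, instead of dying with a bare retcode 1.
import logging
import sys
import traceback

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s FTUNE %(message)s")

def train():
    raise RuntimeError("placeholder for the real finetune entry point")

def main():
    try:
        train()
    except BaseException:  # also catches KeyboardInterrupt/SystemExit
        logging.error("finetune failed:\n%s", traceback.format_exc())
        sys.exit(1)

if __name__ == "__main__":
    main()
```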

@JegernOUTT JegernOUTT moved this from In Progress to TODO in Self-hosted / Enterprise Oct 5, 2023
@klink klink added the bug Something isn't working label Oct 17, 2023
@JegernOUTT
Member

#199

@olegklimov
Contributor

Should be fixed.

@github-project-automation github-project-automation bot moved this from TODO to Released in Docker Nightly in Self-hosted / Enterprise Nov 26, 2023
@olegklimov olegklimov moved this from Released in Docker Nightly to Released in Docker V1.2 in Self-hosted / Enterprise Nov 26, 2023