No clear indication why finetune GPU filter has stopped #126

Closed

hazratisulton opened this issue Sep 25, 2023 · 5 comments

Labels: bug (Something isn't working), release-enterprise

Comments
@hazratisulton

Finetune stops with retcode 1
[screenshot]

update /perm_storage/uploaded-files/nextjs-starter-prismic-blog repo => success
stats: 67 good, 1 too large, 1 generated, 2 vendored
marking 71 files from nextjs-starter-prismic-blog to which_set="train", to_db=False
total files 71
dedup...
after dedup 63 files
Reading /perm_storage/cfg/sources_filetypes.cfg
Will not overwrite '/tmp/unpacked-files/train_set.jsonl' because it is exactly the same as the current output
Will not overwrite '/tmp/unpacked-files/test_set.jsonl' because it is exactly the same as the current output
Will not overwrite '/tmp/unpacked-files/database_set.jsonl' because it is exactly the same as the current output
Retrieving dataset length per epoch, it may take a while...
Dataset length per epoch = 3
Lora parameters heuristic avg_loss=0.00, ds_len=3 => complexity score=0
Selected the schedule by heuristics ds_len=3:
Total training steps: 50

Selected low_gpu_mem_mode=True by total gpu memory

Freeze exceptions: ['wte', 'lm_head', 'lora']
Lora config:  lora_target_modules ['qkv', 'out', 'mlp']
Lora config:               lora_r 64
Lora config:           lora_alpha 128
Lora config:         lora_dropout 0.01
Lora config:      lora_init_scale 0.01
creating model...
/model 14850.1ms
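
For reference, the logged LoRA settings map roughly onto a PEFT-style config like the sketch below. This is only an illustration of what the numbers mean; refact uses its own LoRA implementation, and `lora_init_scale` has no direct PEFT equivalent.

```python
# Hedged sketch: the logged LoRA hyperparameters expressed as a
# peft-style LoraConfig, for illustration only (refact's finetune
# has its own LoRA code; module names are taken from the log).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                                  # lora_r
    lora_alpha=128,                        # lora_alpha
    lora_dropout=0.01,                     # lora_dropout
    target_modules=["qkv", "out", "mlp"],  # lora_target_modules
)
# lora_init_scale=0.01 from the log has no direct peft counterpart.
```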
@olegklimov
Contributor

python ~/code/refact/refact_data_pipeline/finetune/finetune_train.py

Please remove all models, run this, and post the logs.
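
If it dies without a traceback again, a small wrapper that records the exit code and the tail of the output can help (a sketch; the script path is the one from the command above):

```python
# Hedged sketch: run finetune_train.py as a subprocess and keep its
# exit code plus the tail of stdout/stderr, so a silent death still
# leaves evidence behind.
import os
import subprocess

script = os.path.expanduser(
    "~/code/refact/refact_data_pipeline/finetune/finetune_train.py")
proc = subprocess.run(["python", script], capture_output=True, text=True)
print("exit code:", proc.returncode)
print("stdout tail:", proc.stdout[-2000:])
print("stderr tail:", proc.stderr[-2000:])
```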

@olegklimov olegklimov moved this to In Progress in Self-hosted / Enterprise Oct 3, 2023
@olegklimov olegklimov changed the title from "finetune stops suddenly with status 'failed'" to "No clear indication why finetune GPU filter has stopped" Oct 3, 2023
@hazratisulton
Author

King Dave from Discord:
Finetune just fails silently:
[screenshot]

I started the container from scratch, edited the file to adjust the T value, and then did the filtering via the GUI.
After that finished, I ran:
docker exec -it refact python /usr/local/lib/python3.8/dist-packages/refact_data_pipeline/finetune/finetune_train.py
Below is the output of the script; it just suddenly stops.

20231003 20:34:32 FTUNE NumExpr defaulting to 4 threads.
20231003 20:34:32 FTUNE STATUS working
20231003 20:34:32 FTUNE starting finetune at /perm_storage/loras/lora-20231003-203432
Retrieving dataset length per epoch, it may take a while...
Dataset length per epoch = 2
Lora parameters heuristic avg_loss=0.65, ds_len=2 => complexity score=0
Selected the schedule by heuristics ds_len=2:
Total training steps: 50

Selected low_gpu_mem_mode=True by total gpu memory

Freeze exceptions: ['wte', 'lm_head', 'lora']
Lora config:  lora_target_modules ['qkv', 'out', 'mlp']
Lora config:               lora_r 64
Lora config:           lora_alpha 128
Lora config:         lora_dropout 0.01
Lora config:      lora_init_scale 0.01
creating model...
/model 6387.2ms
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
20231003 20:34:39 FTUNE Added key: store_based_barrier_key:1 to store for rank: 0
20231003 20:34:39 FTUNE Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
20231003 20:34:39 FTUNE Added key: store_based_barrier_key:2 to store for rank: 0
20231003 20:34:39 FTUNE Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[tokenizers fork warning repeated]
[2023-10-03 20:34:40,701] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[tokenizers fork warning repeated ×8]
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
[tokenizers fork warning repeated ×3]
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[tokenizers fork warning repeated]
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.503849983215332 seconds
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
[tokenizers fork warning repeated ×3]
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[tokenizers fork warning repeated]
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.19377613067626953 seconds
Rank: 0 partition count [1] and sizes[(265814016, False)]
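
A stop this silent, right after DeepSpeed's optimizer partition line, often means the process was killed from outside (for example by the kernel OOM killer) rather than by a Python exception. One way to get more evidence from inside the process is Python's faulthandler (a sketch, not part of the refact code; a SIGKILL from the OOM killer cannot be caught, so also check `dmesg` on the host):

```python
# Hedged sketch: dump a Python traceback on fatal signals instead of
# dying silently. Note: SIGKILL (e.g. from the OOM killer) cannot be
# intercepted by the process itself.
import faulthandler
import signal

faulthandler.enable()                  # SIGSEGV, SIGFPE, SIGABRT, SIGBUS
faulthandler.register(signal.SIGTERM)  # also dump on SIGTERM
```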

@JegernOUTT
Member

These logs are completely uninformative.
I guess we need to fix this by enabling some extra debug logs.
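
For illustration, extra logging of roughly this shape would surface the failure reason (a sketch only; `train()` is a placeholder, not the actual refact entry point or the eventual fix):

```python
# Hedged sketch: a top-level guard that logs a full traceback before
# the process exits, instead of dying with a bare retcode 1.
import logging
import sys
import traceback

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s FTUNE %(message)s")

def train():
    raise RuntimeError("placeholder for the real finetune entry point")

def main():
    try:
        train()
    except BaseException:  # also catches KeyboardInterrupt/SystemExit
        logging.error("finetune failed:\n%s", traceback.format_exc())
        sys.exit(1)

if __name__ == "__main__":
    main()
```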

@JegernOUTT JegernOUTT moved this from In Progress to TODO in Self-hosted / Enterprise Oct 5, 2023
@klink klink added the bug Something isn't working label Oct 17, 2023
@JegernOUTT
Member

#199

@olegklimov
Contributor

Should be fixed.

@github-project-automation github-project-automation bot moved this from TODO to Released in Docker Nightly in Self-hosted / Enterprise Nov 26, 2023
@olegklimov olegklimov moved this from Released in Docker Nightly to Released in Docker V1.2 in Self-hosted / Enterprise Nov 26, 2023