quickstart train on cpu with m1? #1718

Open
tyoc213 opened this issue Feb 2, 2025 · 0 comments
Labels
bug Something isn't working

tyoc213 commented Feb 2, 2025

Environment

composer_collect_env
Collecting system information...
---------------------------------
System Environment Report        
Created: 2025-02-02 12:32:57 CST
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.5
Libc version: N/A

Python version: 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] numpy==2.1.3
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.5.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==1.6.0
[pip3] torchvision==0.20.1
[conda] Could not collect


Composer information
--------------------
Composer Version: 0.28.0
Composer Commit Hash: None
CPU Model: Apple M1
CPU Count: 8
Number of Nodes: 1
GPU Model: N/A
GPUs per Node: 0
GPU Count: 1
CUDA Device Count: 0

To reproduce

Steps to reproduce the behavior:

  1. Install from source, without a GPU.
  2. Follow the quickstart: step 1 works, but the training step does not (a possible workaround is sketched after the log output below):
composer train/train.py \
  train/yamls/pretrain/mpt-125m.yaml \
  variables.data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m \
  model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32



2025-02-02 11:16:09,562: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py:351: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
/Users/devworks/github.com/uv-llm/llmfoundry/utils/config_utils.py:525: UserWarning: Using `cfg.model.init_device='meta'` is only valid when using FSDP! Reverting to `cfg.model.init_device='cpu'`.
  warnings.warn(
2025-02-02 11:16:09,567: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,935: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
2025-02-02 11:16:09,936: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: Instantiating an MPTForCausalLM model from /Users/devworks/github.com/uv-llm/llmfoundry/models/mpt/modeling_mpt.py
2025-02-02 11:16:10,611: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.
2025-02-02 11:16:12,196: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: MPTModel(
  (wte): SharedEmbedding(50368, 768)
  (wpe): Embedding(2048, 768)
  (emb_drop): Dropout(p=0.0, inplace=False)
  (blocks): ModuleList(
    (0-11): 12 x MPTBlock(
      (norm_1): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): MultiheadAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm_2): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): MPTMLP(
        (up_proj): Linear(in_features=768, out_features=3072, bias=True)
        (down_proj): Linear(in_features=3072, out_features=768, bias=True)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_ffn_dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (norm_f): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2025-02-02 11:16:12,197: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
2025-02-02 11:16:12,298: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
2025-02-02 11:16:12,299: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,301: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Run name: 1738516572-ginger-mushroom
/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2025-02-02 11:16:12,380: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
variables:
  data_local: my-copy-c4
  data_remote: null
  max_seq_len: 2048
  global_seed: 17
  run_name: null
max_seq_len: 2048
run_name: null
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 768
  n_heads: 12
  n_layers: 12
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50368
  attn_config:
    attn_impl: torch
  loss_fn: torch_crossentropy
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: 2048
train_loader:
  name: text
  dataset:
    local: my-copy-c4
    remote: null
    split: train_small
    shuffle: true
    max_seq_len: 2048
    shuffle_seed: 17
  drop_last: true
  num_workers: 8
eval_loader:
  name: text
  dataset:
    local: my-copy-c4
    remote: null
    split: val_small
    shuffle: false
    max_seq_len: 2048
    shuffle_seed: 17
  drop_last: false
  num_workers: 8
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0006
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 10ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256
seed: 17
device_eval_batch_size: 16
device_train_microbatch_size: 16
precision: fp32
fsdp_config: null
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_folder: mpt-125m
n_gpus: 1
device_train_batch_size: 256
device_train_grad_accum: 16
merge: true
tp_config: null
n_params: 125311488
n_active_params: 125311488
n_trainable_params: 125311488

2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.FP32
******************************
Config:
algorithms:
  gradient_clipping:
    clipping_threshold: 1.0
    clipping_type: norm
callbacks:
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
  speed_monitor:
    window_size: 10
composer_commit_hash: None
composer_version: 0.28.0
console_log_interval: 1ba
device_eval_batch_size: 16
device_train_batch_size: 256
device_train_grad_accum: 16
device_train_microbatch_size: 16
enabled_algorithms/GradientClipping: true
eval_first: false
eval_interval: 0
eval_loader:
  dataset:
    local: my-copy-c4
    max_seq_len: 2048
    remote: null
    shuffle: false
    shuffle_seed: 17
    split: val_small
  drop_last: false
  name: text
  num_workers: 8
eval_subset_num_batches: -1
fsdp_config: null
global_train_batch_size: 256
log_to_console: true
max_duration: 10ba
max_seq_len: 2048
merge: true
model:
  attn_config:
    attn_impl: torch
  d_model: 768
  expansion_ratio: 4
  init_device: meta
  loss_fn: torch_crossentropy
  max_seq_len: 2048
  n_heads: 12
  n_layers: 12
  name: mpt_causal_lm
  vocab_size: 50368
n_active_params: 125311488
n_gpus: 1
n_params: 125311488
n_trainable_params: 125311488
node_name: unknown because NODENAME environment variable not set
num_cpus_per_node: 1
num_nodes: 1
optimizer:
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  lr: 0.0006
  name: decoupled_adamw
  weight_decay: 0.0
precision: fp32
progress_bar: false
rank_zero_seed: 17
run_name: null
save_folder: mpt-125m
scheduler:
  alpha_f: 0.1
  name: cosine_with_warmup
  t_warmup: 100ba
seed: 17
time/remaining_estimate_unit: hours
tokenizer:
  kwargs:
    model_max_length: 2048
  name: EleutherAI/gpt-neox-20b
tp_config: null
train_loader:
  dataset:
    local: my-copy-c4
    max_seq_len: 2048
    remote: null
    shuffle: true
    shuffle_seed: 17
    split: train_small
  drop_last: true
  name: text
  num_workers: 8
variables:
  data_local: my-copy-c4
  data_remote: null
  global_seed: 17
  max_seq_len: 2048
  run_name: null

******************************
2025-02-02 11:16:12,497: rank0[2908][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
[rank0]: Traceback (most recent call last):
[rank0]:   File "/Users/devworks/github.com/uv-llm/scripts/train/train.py", line 9, in <module>
[rank0]:     train_from_yaml(yaml_path, args_list)
[rank0]:   File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]:     return train(yaml_cfg)
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 643, in train
[rank0]:     trainer.fit()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2297, in fit
[rank0]:     self._train_loop()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2447, in _train_loop
[rank0]:     self._spin_dataloaders_to_cur_epoch()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2381, in _spin_dataloaders_to_cur_epoch
[rank0]:     for _ in dataloader:
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 479, in __iter__
[rank0]:     self._iterator = self._get_iterator()
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 415, in _get_iterator
[rank0]:     return _MultiProcessingDataLoaderIter(self)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1138, in __init__
[rank0]:     w.start()
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/process.py", line 121, in start
[rank0]:     self._popen = self._Popen(self)
[rank0]:                   ^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 224, in _Popen
[rank0]:     return _default_context.get_context().Process._Popen(process_obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
[rank0]:     return Popen(process_obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]:     super().__init__(process_obj)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]:     self._launch(process_obj)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
[rank0]:     reduction.dump(process_obj, fp)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
[rank0]:     ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'
2025-02-02 11:16:12,512: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Engine closed.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 2908) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 2908) exited with code 1
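
For reference, the traceback shows the crash happening as soon as the train DataLoader starts its worker processes (num_workers: 8 in the config above); on macOS those workers are started with the spawn method, which has to pickle the loader's callables. So a possible workaround, untested on my side and using only the num_workers keys already present in the logged config, might be to force single-process data loading:

composer train/train.py \
  train/yamls/pretrain/mpt-125m.yaml \
  variables.data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  train_loader.num_workers=0 \
  eval_loader.num_workers=0 \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m \
  model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32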

Expected behavior

The training step of the quickstart should complete successfully.

Additional context

The overrides added to the composer training command (model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32) are there to allow it to run on the M1 CPU.

I'm not sure how to go about this error, since everything before it in the log looks OK:

[rank0]:     ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'

ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
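
If it helps: this error message is what Python's pickler reports when asked to serialize a function that was defined inside another function. On macOS, DataLoader workers are started with the spawn method, so the loader's callables get pickled at worker startup; Linux defaults to fork, which skips that pickling, which would explain why this only shows up here. A minimal sketch of the failure (this is not code from the repo; the names just mirror the traceback and the body is a placeholder):

import pickle

def get_tokens_per_batch_func():
    # Stand-in for the real helper; the only point is that the returned
    # function is defined locally, inside another function.
    def get_num_tokens_in_batch(batch):
        return sum(len(sample) for sample in batch)
    return get_num_tokens_in_batch

fn = get_tokens_per_batch_func()
try:
    # Spawn-based DataLoader worker startup effectively has to do this.
    pickle.dumps(fn)
except (AttributeError, pickle.PicklingError) as err:
    # On Python 3.12 this prints an AttributeError very much like the one above:
    # Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'
    print(type(err).__name__, err)

If that is indeed the cause, the fix presumably needs the token-counting helper to be defined at module level (or otherwise be picklable), or the loaders to run with num_workers=0 on macOS.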