Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU Memory Issue #775

Open
stringing opened this issue May 30, 2024 · 10 comments
Open

GPU Memory Issue #775

stringing opened this issue May 30, 2024 · 10 comments
Labels
FS-LLM Federated learning in LLM

Comments

@stringing
Copy link

The gpu memory usage continues to increase after each round while finetuning LLM with an adapter. The gpu memory increment after each round was approximately the same. I speculate it's because that there are new clients joining in each round and there would be new model parameters. I've already set share_local_model and llm.adapter.mv_to_cpu to True, it should move the adapter to cpu after each round but why would the gpu memory still increase? I'd appreaciate it if anyone could help me with this issue. Thanks in advance!

@rayrayraykk
Copy link
Collaborator

rayrayraykk commented May 31, 2024

Please provide a YAML file to help us reproduce the issue, thanks.
Also, please try to set cfg.eval.count_flops to False. When enabling this, Torch's garbage collection mechanism may not be timely resulting in OOM. If you wan to get an exact GPU memory usage, please use torch.cuda.empty_cache().

@rayrayraykk rayrayraykk added the FS-LLM Federated learning in LLM label May 31, 2024
@stringing
Copy link
Author

Please provide a YAML file to help us reproduce the issue, thanks. Also, please try to set cfg.eval.count_flops to False. When enabling this, Torch's garbage collection mechanism may not be timely resulting in OOM.

Thank you for your response, this is my YAML configuration:

use_gpu: True

device: 0


early_stop:
  patience: 10


federate:
  mode: standalone
  client_num: 100
  sample_client_num: 5
  total_round_num: 10
  save_to: "/root/autodl-tmp/model/finetuned_codellama13b/codellama13b.ckpt"
  share_local_model: True
  online_aggr: False


data:
  root: /root/autodl-tmp/data/TutorCode
  type: 'TutorCode.json@llm'
  splits: [0.98,0.01,0.01]
  splitter: 'iid'


dataloader:
  batch_size: 1


llm:
  tok_len: 2048
  chat:
    max_len: 2048
  adapter:
    use: True
    args: [ { 'adapter_package': 'peft', 'adapter_method': 'qlora', 'r': 32, 'lora_alpha': 32, 'lora_dropout': 0.05, 'load_in_4bit': True, 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_compute_dtype': True, 'bnb_4bit_use_double_quant': True, 'module_int4': True, } ] # Quantized LoRA hyperparameter
    mv_to_cpu: True


model:
  type: 'codellama/CodeLlama-13b-Instruct-hf@huggingface_llm'
  


train:
  local_update_steps: 30
  batch_or_epoch: batch
  is_enable_half: False
  
  optimizer:
    type: 'AdamW'
    lr: 0.0001
    weight_decay: 0.00



criterion:
  type: CrossEntropyLoss


trainer:
  type: llmtrainer


eval:
  freq: 20
  metrics: ['loss']
  count_flops: False

This is the federated/llm/model/model_builder.py and adapter_builder.py that I modified to enable quantized LoRA:

from federatedscope.llm.model.adapter_builder import AdapterModel


def get_model_from_huggingface(model_name, config):
    """
    Load a causal language model from HuggingFace transformers library.

    Args:
        model_name (str): The name of the pre-trained model to load.
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        AutoModelForCausalLM: A causal language model object.
    """
    from transformers import AutoModelForCausalLM

    kwargs = {}
    if len(config.llm.cache.model):
        kwargs['cache_dir'] = config.llm.cache.model
    
    args = config.llm.adapter.args[0]
    if args.get('adapter_method', 'lora') in ['qlora', 'qlora-pissa']:
        from transformers import BitsAndBytesConfig
        import torch
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=args.pop('load_in_4bit', True),
            bnb_4bit_quant_type=args.pop('bnb_4bit_quant_type', 'nf4'),
            bnb_4bit_compute_dtype=torch.bfloat16 if args.pop('bnb_4bit_compute_dtype', True) else torch.float32,
            bnb_4bit_use_double_quant=args.pop('bnb_4bit_use_double_quant', True),
        )
        kwargs['quantization_config'] = bnb_config
        if args.get('bnb_4bit_compute_dtype', True):
            kwargs['torch_dtype'] = torch.bfloat16
    kwargs['device_map'] = f'cuda:{config.device}'

    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)


def get_model_from_modelscope(model_name, config):
    """
    Load a causal language model from ModelScope models library.

    Args:
        model_name (str): The name of the pre-trained model to load.
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        Model: A causal language model object.
    """
    from modelscope import AutoModelForCausalLM

    kwargs = {}
    if len(config.llm.cache.model):
        kwargs['cache_dir'] = config.llm.cache.model

    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)


def get_llm(config):
    """
    Get a causal language model based on the configuration.

    Args:
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        AdapterModel: A causal language model object with optional adapter
            layers.
    """
    from federatedscope.llm.dataloader import get_tokenizer

    model_config = config.model
    model_name, model_hub = model_config.type.split('@')
    if model_hub == 'huggingface_llm':
        model = get_model_from_huggingface(model_name=model_name,
                                           config=config)
    elif model_hub == 'modelscope_llm':
        model = get_model_from_modelscope(model_name=model_name, config=config)
    else:
        raise NotImplementedError(f'Not support LLM {model_name} in'
                                  f' {model_hub}.')

    # Resize LLM model based on settings
    tokenizer, num_new_tokens = \
        get_tokenizer(model_name, config.data.root, config.llm.tok_len,
                      model_hub)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

    args = config.llm.adapter.args[0] if len(
        config.llm.adapter.args[0]) > 0 else {}
    
    model = AdapterModel(model, use_adapter=config.llm.adapter.use, **args)

    return model

import torch
import torch.nn as nn
from collections import OrderedDict
from federatedscope.llm.MyUtils.qlora_util import find_all_linear_names


def enable_adapter(model, package, adapter, **kwargs):
    """
    Enables an adapter for a given model and package.

    Args:
        model: A pre-trained model from HuggingFace Transformers library.
        package: A string indicating the name of the package that provides
            the adapter. Currently, only 'peft' and 'adapterhub' is supported.
        adapter: A string indicating the name of the adapter to enable. The
            available adapters depend on the package.
        **kwargs: Additional keyword arguments that are passed to the
            adapter configuration.

    Returns:
        A model object that has the adapter enabled.

    Raises:
        NotImplementedError: If the package or the adapter is not supported.
    """
    adapter = adapter.lower()
    if package == 'peft':
        """
        PEFT: https://github.com/huggingface/peft
        Support methods:
            LoRA
            Prefix Tuning
            P-Tuning
            Prompt Tuning
            AdaLoRA
        """
        from peft import get_peft_model, TaskType
        if adapter == 'lora':
            from peft import LoraConfig
            peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'prefix':
            from peft import PrefixTuningConfig
            peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM,
                                             **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'prompt':
            from peft import PromptTuningConfig
            peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM,
                                             **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'p-tuning':
            from peft import PromptEncoderConfig
            peft_config = PromptEncoderConfig(task_type=TaskType.CAUSAL_LM,
                                              **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'qlora':
            from peft import prepare_model_for_kbit_training, LoraConfig
            model = prepare_model_for_kbit_training(model)
            peft_config = LoraConfig(
                    r=kwargs.pop('r', 32),
                    lora_alpha=kwargs.pop('lora_alpha', 16),
                    target_modules=find_all_linear_names(model, int4=kwargs.pop('module_int4', True)),
                    lora_dropout=kwargs.pop('lora_dropout', 0.05),
                    bias="none",
                    task_type=TaskType.CAUSAL_LM,
                )
            model = get_peft_model(model, peft_config)
        elif adapter == 'qlora-pissa':
            from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
            model = prepare_model_for_kbit_training(model)
            model = PeftModel.from_pretrained(model, kwargs.pop('adapter_path', ''), subfolder=kwargs.pop('subfolder', 'pissa_init'), is_trainable=True)
            for name, params in model.named_parameters():
                if "embed_tokens" in name or "lm_head" in name:
                    params.requires_grad=False
        else:
            raise NotImplementedError
        model.print_trainable_parameters()

    elif package == 'adapterhub':
        """
        AdapterHub: https://docs.adapterhub.ml/model_overview.html
        Support methods:
            Bottleneck Adapters
            Prefix Tuning
            LoRA
            Compacter
            Adapter Fusion
            Invertible Adapters
            Parallel block
        """
        # TODO:  After supporting adapterhub, we will move the following
        #   parameters in yaml file for users' convenient
        if adapter == 'lora':
            from transformers.adapters import LoRAConfig

            config = LoRAConfig(r=8, alpha=16)
            model.add_adapter("lora_adapter", config=config)
            model.train_adapter(['lora_adapter'])
        elif adapter == 'bottleneck':
            from transformers.adapters import AdapterConfig

            config = AdapterConfig(mh_adapter=True,
                                   output_adapter=True,
                                   reduction_factor=16,
                                   non_linearity="relu")
            model.add_adapter("bottleneck_adapter", config=config)
            model.train_adapter(['bottleneck_adapter'])
        elif adapter == 'lang':
            from transformers.adapters import PfeifferInvConfig

            config = PfeifferInvConfig()
            model.add_adapter("lang_adapter", config=config)
            model.train_adapter(['lang_adapter'])
        elif adapter == 'prefix':
            from transformers.adapters import PrefixTuningConfig

            config = PrefixTuningConfig(flat=False, prefix_length=30)
            model.add_adapter("prefix_tuning", config=config)
            model.train_adapter(['prefix_tuning'])
        elif adapter == 'compacter':
            from transformers.adapters import CompacterConfig

            config = CompacterConfig()
            model.add_adapter("dummy", config=config)
            model.train_adapter(['dummy'])
        elif adapter == 'ia_3':
            from transformers.adapters import IA3Config

            config = IA3Config()
            model.add_adapter("ia3_adapter", config=config)
            model.train_adapter(['ia3_adapter'])
        elif adapter == 'union':
            from transformers.adapters import AdapterConfig, ConfigUnion

            # TODO: configure these args in cfg
            config = ConfigUnion(
                AdapterConfig(mh_adapter=True,
                              output_adapter=False,
                              reduction_factor=16,
                              non_linearity="relu"),
                AdapterConfig(mh_adapter=False,
                              output_adapter=True,
                              reduction_factor=2,
                              non_linearity="relu"),
            )
            model.add_adapter("union_adapter", config=config)
            model.train_adapter(['union_adapter'])
        elif adapter == 'mam':
            from transformers.adapters import \
                ConfigUnion, ParallelConfig, PrefixTuningConfig

            config = ConfigUnion(
                PrefixTuningConfig(bottleneck_size=800),
                ParallelConfig(),
            )
            model.add_adapter("mam_adapter", config=config)
            model.train_adapter(['mam_adapter'])
        else:
            raise NameError(
                f"There is no adapter named {adapter} in {package}")
    else:
        raise NotImplementedError
    return model


class AdapterModel(nn.Module):
    """
    A wrapper class for a model that can use adapters for fine-tuning.

    This class inherits from torch.nn.Module and implements a wrapper for a
    model that can optionally use adapters for fine-tuning. Adapters are small
    modules that can be inserted between the layers of a pretrained model and
    trained on a specific task, while keeping the original parameters frozen.
    This class can use different adapter packages and methods, such as PEFT
    and LoRA. It also provides methods for saving and loading the model state
    dict, as well as generating text using the model.

    Attributes:
        model: A torch.nn.Module object that represents the original or
            adapted model.

    """
    def __init__(self, model, use_adapter=False, *args, **kwargs):
        """
        Initializes the wrapper with the given model and arguments.

        Args:
            model: A torch.nn.Module object that represents the original model.
            use_adapter: A boolean indicating whether to use adapters for
                fine-tuning. Default is False.
            *args: Additional positional arguments to pass to the adapter
                package or method.
            **kwargs: Additional keyword arguments to pass to the adapter
                package or method. These may include adapter_package,
                adapter_method, etc.
        """
        super().__init__()

        self.model = None
        if use_adapter:
            adapter_package = kwargs.pop('adapter_package', 'peft')
            adapter_method = kwargs.pop('adapter_method', 'lora')

            self.model = enable_adapter(model, adapter_package, adapter_method,
                                        **kwargs)
        else:
            self.model = model

    def forward(self, *args, **kwargs):
        """
        Calls the forward method of the wrapped model.

        Args:
            *args: Positional arguments to pass to the model's forward method.
            **kwargs: Keyword arguments to pass to the model's forward method.

        Returns:
            The output of the model's forward method.
        """
        return self.model.forward(*args, **kwargs)

    def generate(self, *args, **kwargs):
        """
        Calls the generate method of the wrapped model.

        Args:
            *args: Positional arguments to pass to the model's generate method.
            **kwargs: Keyword arguments to pass to the model's generate method.

        Returns:
            The output of the model's generate method.
        """
        try:
            res = self.model.generate(*args, **kwargs)
        except RuntimeError as e:
            # When does evaluation in HELM,
            # half precision will cause RuntimeError,
            # the following solves it
            if 'do_sample' in kwargs.keys():
                del kwargs['do_sample']
                res = self.model.generate(*args, **kwargs)
            else:
                raise RuntimeError(e)
        return res

    def state_dict(self, return_trainable=True, *args, **kwargs):
        """
        Returns the state dict of the wrapped model.

        Args:
            return_trainable: A boolean indicating whether to return only the
                trainable parameters of the model. Default is True.
            *args: Additional positional arguments to pass to the model's
                state_dict method.
            **kwargs: Additional keyword arguments to pass to the model's
                state_dict method.

        Returns:
            A dictionary containing the state dict of the model. If
            return_trainable is True, only the parameters that require grad are
            included. Otherwise, all parameters are included.
        """
        if return_trainable:
            return self.get_trainable_state_dict()
        else:
            return self.model.state_dict(*args, **kwargs)

    def load_state_dict(self, state_dict, strict=False):
        """
        Loads the state dict into the wrapped model.

        Args:
            state_dict: A dictionary containing the state dict to load into
                the model.
            strict: A boolean indicating whether to strictly enforce that the
                keys in state_dict match the keys returned by this module’s
                state_dict() function. Default is False.
        """
        return self.model.load_state_dict(state_dict, strict=False)

    def get_trainable_state_dict(self):
        """
        Returns only the trainable parameters of the wrapped model.

        This method can be used to get only the parameters that require grad,
        such as adapters or task-specific layers.

        Returns:
            A dictionary containing the state dict of the trainable parameters
            of the model.
        """
        grad_params = []
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                grad_params.append(name)
        model_state_dict = self.model.state_dict()
        new_state_dict = OrderedDict()
        for k, v in model_state_dict.items():
            if k in grad_params:
                new_state_dict[k] = v
        return new_state_dict

    def save_model(self, path, state=0):
        """
        Saves the model state dict and the current round to a file.

        Args:
            path: A string representing the file path to save the model to.
            state: An integer representing the current round of training or
                evaluation. Default is 0.

        """
        ckpt = {'cur_round': state, 'model': self.model.state_dict()}
        torch.save(ckpt, path)

    # TODO: Fix `__getattr__`
    # def __getattr__(self, item):
    #     return getattr(self.model, item)
    
    def save_pretrained(self, path):
        """
        Saves the pretrained model to a directory.
        
        Args:
            path: A string representing the directory path to save the model to.
        """
        self.model.save_pretrained(path)

By the way, the "save_to" attribute seems not able to really save the model when using adapter, the save_model function in federated/core/aggregators/aggregator.py only saves the weights:

def save_model(self, path, cur_round=-1):
        assert self.model is not None

        ckpt = {'cur_round': cur_round, 'model': self.model.state_dict()}
        torch.save(ckpt, path)

probably it is better to use self.model.save_pretrained in this case since it saves the finetuned adapter weights and the configuration files.

Thank you very much!

@stringing stringing reopened this May 31, 2024
@stringing
Copy link
Author

sorry, just accidentally clicked the close issue button...

@fourcake
Copy link

I encountered the same problem. On the dolly dataset, each time a client trained, the video memory usage increased by about 2.5GB, resulting in my 24GB 4090 machine only being able to support about 8 clients training, and even unable to complete a round

@fourcake
Copy link

Algorithm 1 in the appendix of your article indicates that the client calculates the latest model by accumulating (seed-gradient) values ​​at the beginning of each round of training. However, in the code you provided, the model update is performed by the server after the (seed-gradient) values ​​of each client are collected. When each client is ready to train, server.model is deepcopied as a parameter to the client, and then the client executes "self.model = pulled_model self.model.to(self.device)". Is this the reason for the excessive use of video memory? Is the code version you provided incorrect?

@fourcake
Copy link

However, when I set the number of clients to 1 and the sampling rate to 1, the memory usage does not fluctuate much regardless of how many rounds of training I perform.

@stringing
Copy link
Author

I encountered the same problem. On the dolly dataset, each time a client trained, the video memory usage increased by about 2.5GB, resulting in my 24GB 4090 machine only being able to support about 8 clients training, and even unable to complete a round

Yes, especially for larger models. I've tried torch.cuda.empty_cache() and gc.collect(), it did help in a way but not really solved the problem. Like you mentioned, it might have something to do with the training process. Please let me know if you figure out how to fix it. Thanks! ^^

@fourcake
Copy link

I encountered the same problem. On the dolly dataset, each time a client trained, the video memory usage increased by about 2.5GB, resulting in my 24GB 4090 machine only being able to support about 8 clients training, and even unable to complete a round

Yes, especially for larger models. I've tried torch.cuda.empty_cache() and gc.collect(), it did help in a way but not really solved the problem. Like you mentioned, it might have something to do with the training process. Please let me know if you figure out how to fix it. Thanks! ^^

I even think this is related to pytorch’s memory release mechanism, or maybe it’s related to the machine.If you find a good solution, please share it, thank you.

@fourcake
Copy link

By chance, I ran the code on another 4-GPU 4090 machine, and I found that the video memory problem disappeared, the garbage video memory was recycled in time, and the model was trained normally. Therefore, this problem has nothing to do with the code, but may be related to hardware settings, driver versions, etc. Good luck.

@stringing
Copy link
Author

By chance, I ran the code on another 4-GPU 4090 machine, and I found that the video memory problem disappeared, the garbage video memory was recycled in time, and the model was trained normally. Therefore, this problem has nothing to do with the code, but may be related to hardware settings, driver versions, etc. Good luck.

That's great. But I got this problem on 4090. Which driver version are you using? I'll have a try :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FS-LLM Federated learning in LLM
Projects
None yet
Development

No branches or pull requests

3 participants