Memory leak is observed when using the AutoTokenizer and AutoModel with Python 3.10.* #1706

Open

KhoiTrant68 opened this issue Dec 27, 2024 · 4 comments

@KhoiTrant68
KhoiTrant68 commented Dec 27, 2024

System Info

A memory leak is observed when using the AutoTokenizer and AutoModel classes with Python 3.10.*. The same code does not exhibit the leak when running on Python 3.8.11. The issue may stem from differences in how Python 3.10.* handles memory allocation and deallocation, or from its compatibility with the libraries used.


Setup:

  1. Environment:

    • Python 3.8.11 (No memory leak observed)
    • Python 3.10.* (Memory leak occurs)
  2. Dependencies:

    • tokenizers==0.20.3
    • torch==2.0.1+cu117
    • torchvision==0.15.2+cu117
    • tqdm==4.67.0
    • transformers==4.46.0

Attempts to Resolve:
We tried various strategies to address the memory leak, but none were successful. These include:

  1. Explicit Garbage Collection:
    • Used gc.collect() to manually invoke garbage collection after each batch.
  2. Variable Deletion:
    • Explicitly deleted intermediate variables with del to release memory.
  3. CUDA Cache Management:
    • Used torch.cuda.empty_cache() to free up GPU memory.
  4. Library Versions:
    • Tried multiple versions of the tokenizers and transformers libraries, but observed no improvement.

Despite these efforts, the memory leak persisted in Python 3.10.*.
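
For reference, a minimal sketch of the per-batch cleanup we applied (the helper name cleanup_after_batch is only illustrative, not part of our actual code):

import gc
import torch

def cleanup_after_batch(batch_dict, outputs):
    # Explicitly drop references to intermediate objects.
    del batch_dict, outputs
    # Force a CPython garbage collection pass.
    gc.collect()
    # Release cached CUDA blocks back to the driver (frees VRAM only, not CPU RAM).
    if torch.cuda.is_available():
        torch.cuda.empty_cache()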


Call for Assistance: We have exhausted our efforts to identify and resolve the memory leak. If anyone with expertise in Python memory management, PyTorch, or Hugging Face Transformers can assist, we would greatly appreciate your help.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import gc
import json
import random
from typing import Mapping

import numpy as np
import torch
from faker import Faker
from memory_profiler import profile
from transformers import AutoModel, AutoTokenizer

# Create a Faker instance with the Japanese locale
fake = Faker('ja_JP')

# Generate random Japanese text
def generate_random_japanese_text():
    return fake.text()
def move_to_cuda(sample):
    if len(sample) == 0:
        return {}

    def _move_to_cuda(maybe_tensor):
        if torch.is_tensor(maybe_tensor):
            return maybe_tensor.cuda(non_blocking=True)
        elif isinstance(maybe_tensor, dict):
            return {key: _move_to_cuda(value) for key, value in maybe_tensor.items()}
        elif isinstance(maybe_tensor, list):
            return [_move_to_cuda(x) for x in maybe_tensor]
        elif isinstance(maybe_tensor, tuple):
            return tuple(_move_to_cuda(x) for x in maybe_tensor)
        elif isinstance(maybe_tensor, Mapping):
            return type(maybe_tensor)({k: _move_to_cuda(v) for k, v in maybe_tensor.items()})
        else:
            return maybe_tensor

    return _move_to_cuda(sample)
        
def create_batch_dict(tokenizer, input_texts, max_length: int = 512):
    return tokenizer(
        input_texts,
        max_length=max_length,
        padding=True,
        pad_to_multiple_of=8,
        return_token_type_ids=False,
        truncation=True,
        return_tensors='pt'
    )

def pool(last_hidden_states, attention_mask, pool_type: str):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)

    if pool_type == "avg":
        emb = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    elif pool_type == "weightedavg":  # position-weighted mean pooling from SGPT (https://arxiv.org/abs/2202.08904)
        attention_mask *= attention_mask.cumsum(dim=1)  # [0,1,1,1,0,0] -> [0,1,2,3,0,0]
        s = torch.sum(last_hidden * attention_mask.unsqueeze(-1).float(), dim=1)
        d = attention_mask.sum(dim=1, keepdim=True).float()
        emb = s / d
    elif pool_type == "cls":
        emb = last_hidden[:, 0]
    elif pool_type == "last":
        left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
        if left_padding:
            emb = last_hidden[:, -1]
        else:
            sequence_lengths = attention_mask.sum(dim=1) - 1
            batch_size = last_hidden.shape[0]
            emb = last_hidden[torch.arange(batch_size, device=last_hidden.device), sequence_lengths]
    else:
        raise ValueError(f"pool_type {pool_type} not supported")

    return emb

class KVEmbedding:
    def __init__(self, device):
        self.device = device

        # Load tokenizer and model from pretrained multilingual-e5-small
        self.tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
        self.model = AutoModel.from_pretrained("intfloat/multilingual-e5-small").to(self.device)

        self.model.eval()  # Set model to evaluation mode

    def average_pool(self, last_hidden_states, attention_mask):
        # Apply mask to hidden states, set masked positions to 0
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        # Average the hidden states along the sequence dimension
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    
    @profile
    def embedding(self, l_transcription, batch_size=32):
        # Tokenize input transcriptions
        batch_dict = self.tokenizer(
            l_transcription,
            max_length=512,
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.device)

        return batch_dict
    def _do_encode(self, input_texts) -> np.ndarray:
        encoded_embeds = []
        batch_size = 64 
        for start_idx in range(0, len(input_texts), batch_size):
            batch_input_texts = input_texts[start_idx: start_idx + batch_size]

            batch_dict = create_batch_dict(self.tokenizer, batch_input_texts)
            batch_dict = move_to_cuda(batch_dict)
        return encoded_embeds


# Lists of Japanese characters
hiragana = ["あ", "い", "う", "え", "お", "か", "き", "く", "け", "こ", "さ", "し", "す", "せ", "そ", "た", "ち", "つ", "て", "と", "な", "に", "ぬ", "ね", "の", "は", "ひ", "ふ", "へ", "ほ", "ま", "み", "む", "め", "も", "や", "ゆ", "よ", "ら", "り", "る", "れ", "ろ", "わ", "を", "ん"]
katakana = ["ア", "イ", "ウ", "エ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス", "セ", "ソ", "タ", "チ", "ツ", "テ", "ト", "ナ", "ニ", "ヌ", "ネ", "ノ", "ハ", "ヒ", "フ", "ヘ", "ホ", "マ", "ミ", "ム", "メ", "モ", "ヤ", "ユ", "ヨ", "ラ", "リ", "ル", "レ", "ロ", "ワ", "ヲ", "ン"]
kanji = ["日", "本", "語", "学", "校", "生", "時", "間", "人", "大", "小", "中", "山", "川", "口", "目", "耳", "手", "足", "力", "男", "女", "子", "父", "母"]

# Combine all character sets
all_characters = hiragana + katakana + kanji

# Generate random Japanese text
def generate_random_japanese(length):
    return ''.join(random.choices(all_characters, k=length))

def remove_invalid_characters(valid_chars, text):
    """
    Removes all invalid characters from the given text, keeping only characters present in valid_chars.

    Args:
        valid_chars (set): Set of valid characters.
        text (str): Input text string.

    Returns:
        str: Text string containing only valid characters.
    """
    filtered_text = ''.join(c for c in text if c in valid_chars)
    return filtered_text


if __name__ == "__main__":
    print("Start app ...")
    with open("multilingual-e5-small/tokenizer.json", 'r') as file:
        character_info = json.load(file)
    character_dict = {}
    print("Vocab is loading ...")
    for data in character_info["model"]["vocab"]:
        character_dict[data[0]] = data[1]
    valid_chars = set(character_dict.keys())
    print("Start loading model")
    kv_embedding = KVEmbedding('cuda')
    print("Loading model: Done!!!")
    for i in range(7500):
        print(f"============{i}==============")
        length = random.randint(600, 1000)
        # print(length)
        input_texts = []
        for s in range(length):
            text_length = random.randint(1, 10000)
            
            random_text = generate_random_japanese(text_length)
            
            # before = len(random_text)
            random_text = remove_invalid_characters(valid_chars, random_text)
            # after = len(random_text)
            # if after != before:
            #     print(before, after)
            random_text = random_text[:450]
            input_texts.append(random_text)
    
        filter_output = input_texts[:512]
        
        del input_texts

        # print(len(filter_output))

        output = kv_embedding.embedding(filter_output)

Logs

(Attached image: plot of memory usage over time.)

============4==============
Filename: test_kv_embed.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    92   2293.9 MiB   2293.9 MiB           1       @profile
    93                                             def embedding(self, l_transcription, batch_size=32):
    94                                                 # Tokenize input transcriptions
    95   2295.7 MiB      1.8 MiB           3           batch_dict = self.tokenizer(
    96   2293.9 MiB      0.0 MiB           1               l_transcription,
    97   2293.9 MiB      0.0 MiB           1               max_length=512,
    98   2293.9 MiB      0.0 MiB           1               padding=True,
    99   2293.9 MiB      0.0 MiB           1               truncation=True,
   100   2293.9 MiB      0.0 MiB           1               return_tensors="pt",
   101   2295.7 MiB      0.0 MiB           1           ).to(self.device)
   102                                         
   103   2295.7 MiB      0.0 MiB           1           return batch_dict


============5==============
Filename: test_kv_embed.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    92   2295.7 MiB   2295.7 MiB           1       @profile
    93                                             def embedding(self, l_transcription, batch_size=32):
    94                                                 # Tokenize input transcriptions
    95   2296.5 MiB      0.8 MiB           3           batch_dict = self.tokenizer(
    96   2295.7 MiB      0.0 MiB           1               l_transcription,
    97   2295.7 MiB      0.0 MiB           1               max_length=512,
    98   2295.7 MiB      0.0 MiB           1               padding=True,
    99   2295.7 MiB      0.0 MiB           1               truncation=True,
   100   2295.7 MiB      0.0 MiB           1               return_tensors="pt",
   101   2296.5 MiB      0.0 MiB           1           ).to(self.device)
   102                                         
   103   2296.5 MiB      0.0 MiB           1           return batch_dict

Expected behavior

No memory leaks occur on Python 3.10.*.

@ArthurZucker
Collaborator

Hey, which version of tokenizers are you using? 🤗

@KhoiTrant68
Author

KhoiTrant68 commented Jan 3, 2025 via email

@Narsil
Collaborator

Narsil commented Jan 9, 2025

Hi, we cannot reproduce the leak you are mentioning.

You don't even say what kind of leak you're referring to; the numbers suggest a CPU RAM leak, but you're putting all your tensors on the GPU.
You suggest a leak, yet your graph over 5 hours only shows a small bump at one point in time (around 13:00).

Please clarify very precisely what problem you've identified. Make sure it's not your profiling code that's "leaking" (for instance, by simply measuring RAM usage without any profiling).
Also, if Python 3.10 is an issue, why not just try 3.11, 3.12 or any other version? There shouldn't be any reason for a difference, but it's worth ruling out.
Also, a small bump in RAM usage is quite normal with torch's caching allocator: if a single loop for any reason requires more memory, torch will simply allocate another chunk (potentially much larger than what it actually needs) and keep it around. Even if you ask it to free memory, the allocator may not release a chunk simply because its pages still contain a single bit of live data.
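
As a rough illustration of that kind of measurement (a sketch only; run_one_batch is a placeholder, not code from this thread), record the plain process RSS and the CUDA caching-allocator counters once per iteration instead of profiling inside the loop:

import psutil
import torch

proc = psutil.Process()
for i in range(1000):
    run_one_batch()  # placeholder: tokenize + encode one batch
    rss = proc.memory_info().rss / 2**20               # CPU RAM actually used by the process
    reserved = torch.cuda.memory_reserved() / 2**20    # blocks held by torch's caching allocator
    allocated = torch.cuda.memory_allocated() / 2**20  # memory occupied by live tensors
    print(f"iter {i}: rss={rss:.1f} MiB, reserved={reserved:.1f} MiB, allocated={allocated:.1f} MiB")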

@KhoiTrant68
Author

KhoiTrant68 commented Jan 21, 2025

Memory Leak Issue with the tokenizers and transformers Libraries

Dear Narsil,

Thank you for your response and support. I would like to report and clarify some concerns regarding a memory leak observed when using the tokenizers and transformers libraries:

Python Version Limitation:

Due to integration requirements, our system is currently constrained to Python 3.10. Upgrading to Python 3.11 or 3.12 is not an option for us at this time.

Reproduction Example:

The example I've included is a simplified illustration of the memory leak that occurs when using the libraries mentioned above.
I have attached an HTML file generated by "memray" to visualize memory usage; it shows clear evidence of a memory leak when the libraries are integrated into our module.
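
For context, a minimal sketch of how such a memray capture can be produced (the file name and main_loop are illustrative placeholders, not our actual module):

import memray

# Trace allocations for the reproduction loop and write them to a capture file.
with memray.Tracker("kv_embed_profile.bin"):
    main_loop()  # placeholder for the for-loop in the reproduction script above

# The HTML flame graph is then rendered from the capture file, e.g.:
#   python -m memray flamegraph kv_embed_profile.bin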

CPU RAM Leak Observed:

This issue appears to affect CPU RAM only. Whether we run the code on a GPU or a CPU device, GPU memory usage does not change; the leak is consistently observed in CPU RAM.

Memory Management Attempts:

We understand that small increases in RAM usage are expected with Python libraries like PyTorch, and we have employed common techniques such as gc.collect() and torch.cuda.empty_cache() to free memory.
However, the memory leak persists even after applying these techniques, indicating a deeper issue.

To support this report, I have attached the "memray" HTML file and relevant images showing the memory behavior during execution.

We hope these details help identify and address the issue. Please let us know if additional information or further testing is needed.

Best regards,
KhoiTran
