[BUG]: Memory leak(?) when using batching with large dataset (>450K items) #706
Comments
How much memory does your machine have? I fear this OOM error might be “real” in the sense that it is legitimately not enough memory for doing the calculations up to this maxsize and in this many processes (with this large a dataset) at once. Keep in mind that even with batching, it will still periodically evaluate and compute gradients on the entire dataset (although perhaps in the future we could aggregate over batches rather than all at once…). I think it may be trending towards more complex expressions which is why the memory usage increases later in training. Although I don’t want to rule out a memory leak issue, it could still be the case. Maybe you could try with fewer processes, but the same 200MB heap size limit, and see if the OOM still occurs? Thanks for the detailed report! |
I have 32GB with around 16GB free when I'm testing. I have tried running this with procs=8 with the same behavior.
I just ran it with procs=4, same settings as above with This seems to be independent of procs based on this testing.
Is it continuously updating this memory with new expressions? If it could offload to storage feasibly that would be nice. (I have TBs of storage). The loss seems to slowly go down, so I'm interested in having this run forever. |
Thanks, that's interesting. Let me walk you through my thought process on debugging. So a vector of length 460k with Now, one tricky part is that PySR can also use multithreading during evaluations, at this part of the code: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/cd23a6e25c64d00565c3ae3905d06dc3c63033ed/src/SingleIteration.jl#L112.
    @threads_if !(options.deterministic) for j in 1:(pop.n)
        if options.should_simplify
            tree = pop.members[j].tree
            tree = simplify_tree!(tree, options.operators)
            if tree isa Node
                tree = combine_operators(tree, options.operators)
            end
            pop.members[j].tree = tree
        end
        if options.should_optimize_constants && do_optimization[j]
            # TODO: Might want to do full batch optimization here?
            pop.members[j], array_num_evals[j] = optimize_constants(
                dataset, pop.members[j], options
            )
        end
    end
By default threading is enabled (unless you set the environment variable this means if each process has like 14 threads (which might be automatically set if you have 14 cores), at worst, assuming all the processes hit this part of the code at the same time, there is 16.8 GB of memory usage. That is the very worst case though. And since Julia uses garbage collection, memory is not deallocated immediately (unless you were to use Maybe that could be something to try first. Does It's weird that using fewer processes doesn't help. Maybe you could also try
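The worst-case reasoning in this comment can be sketched as a back-of-the-envelope calculation. This is only a rough illustration: the function name and the `buffers_per_eval` parameter are hypothetical, and the real number of temporary arrays allocated per evaluation depends on the expression being evaluated and on SymbolicRegression.jl internals, so the value 20 below is a made-up placeholder rather than a measurement.

```python
def worst_case_bytes(n_rows, bytes_per_value, buffers_per_eval,
                     threads_per_proc, procs):
    """Upper bound on memory if every thread of every process
    evaluates an expression over the full dataset at the same time.

    buffers_per_eval is a hypothetical count of dataset-length
    temporary arrays that are alive during one evaluation.
    """
    per_eval = n_rows * bytes_per_value * buffers_per_eval
    return per_eval * threads_per_proc * procs


# One dataset-length vector for 460K rows, in MB.
print(worst_case_bytes(460_000, 4, 1, 1, 1) / 1e6)  # 32-bit floats: 1.84
print(worst_case_bytes(460_000, 8, 1, 1, 1) / 1e6)  # 64-bit floats: 3.68

# Assumed scenario: 14 processes with 14 threads each and a
# placeholder of 20 live buffers per evaluation.
total = worst_case_bytes(460_000, 4, 20, 14, 14)
print(f"{total / 1e9:.2f} GB")
```

The point of the sketch is that the bound scales with the product of threads and processes, so reducing either one lowers the worst case proportionally.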
https://astroautomata.com/PySR/api/ This says default is 32.
I ran that test overnight and it indeed runs without issues. I ran it for over 10 hours and it never went over 2GB of memory. I'd probably need to play with settings though as the loss kind of stopped lowering it seemed. (It was above the levels I got when using procs=14 and such I assume due to less variation or something).
OOM at 5 hours and 51 minutes. That does appear to help. I'll try running it with lower procs and bumper to see what happens. |
Sorry you are totally right about the precision, oops! Really interesting that bumper didn’t help fix it. If that has no effect then I am a bit confused as to the cause. Do you have another machine to test this on, preferably one that isn’t Windows? I know that Windows can sometimes have issues with multiprocessing in Julia. Maybe the garbage collection is having issues. It would also be interesting to test this on Julia 1.11 (not yet released) as I know they’ve improved the garbage collection in a certain way. Perhaps it can help with issues like this. Though I admit I’m confused as to the actual source of the memory usage here since the bumper option didn’t change it. Lastly, have you also tried multithreading instead of multiprocessing, and how did it change things? |
Oh wait it sounds like bumper actually did help, since the OOM happened 4x later? I guess it’s just the remaining cause of the continual increase. Are you also able to monitor per-process memory consumption? My current guess is that the head process is using all of the memory while the children processes use something close to that requested (200 MB). This is because JuliaCall does not yet give a way to size a heap size hint: JuliaPy/PythonCall.jl#546 which would be required to set it for the head process (children processes are set up from within the backend, so should work fine). |
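One way to check per-process memory on Linux or WSL, as a minimal sketch using only the standard library: it reads the `VmRSS` field from `/proc/<pid>/status`. The helper names are hypothetical, and matching processes by the substrings "python" and "julia" is an assumption about how the head and worker processes are named on a given system.

```python
import os


def parse_vmrss_kib(status_text):
    """Return the resident set size in KiB from the text of a
    /proc/<pid>/status file, or None if no VmRSS line is present."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def rss_by_process(name_fragments=("python", "julia")):
    """Map pid -> (command name, RSS in KiB) for matching processes."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
            if not any(frag in name for frag in name_fragments):
                continue
            with open(f"/proc/{entry}/status") as f:
                rss = parse_vmrss_kib(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not readable
        if rss is not None:
            result[int(entry)] = (name, rss)
    return result


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, (name, rss) in sorted(rss_by_process().items()):
        print(f"{pid:>8} {name:<16} {rss / 1024:.1f} MiB")
```

Running this periodically during a search would show whether the growth sits in the head process or in the workers.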
Since multithreading=True by default, it seems I've been using a single process this whole time. As seen here: When using the above procs=14 setups, WSL is using 9.8GB at 3 hours. So I've been using threads. I tested multiprocessing and it failed after 32 minutes, outputting this:
My program:
(The ncycles_per_iteration is high because lower values had the head worker at 99%). |
Ah, the This parameter is basically used to prevent OOM errors. When the memory usage gets close to the heap size hint, Julia's garbage collection gets much more aggressive. However, in PySR, this parameter only gets set in new processes. This is because Julia is not aware of the memory usage of other Julia processes, so it can help to set it in advance for newly spawned processes. For a single process, it's usually not needed. I guess it is here because the garbage collection isn't working hard enough on WSL (?). Right now you can't set the heap size hint on the main Julia process if using PythonCall.jl. So, let me prioritise JuliaPy/PythonCall.jl#546 and try to add this, and we can see if it helps at all. |
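Since the heap size hint is only applied to newly spawned processes, a configuration that actually exercises it has to use multiprocessing rather than the default threads. Below is a minimal sketch of such settings. The helper name is hypothetical, the parameter names are those of the PySR 0.19 series used in this report (check the API reference for other versions), and interpreting "200MB" as 200 * 1024**2 bytes is an assumption.

```python
def pysr_memory_kwargs(procs, heap_size_hint_mb, batch_size):
    """Build keyword arguments for PySRRegressor (hypothetical helper).

    multithreading=False makes PySR spawn `procs` worker processes,
    which is the case in which heap_size_hint_in_bytes is applied.
    """
    return dict(
        procs=procs,
        multithreading=False,
        heap_size_hint_in_bytes=heap_size_hint_mb * 1024**2,
        batching=True,
        batch_size=batch_size,
    )


kwargs = pysr_memory_kwargs(procs=4, heap_size_hint_mb=200, batch_size=50)
print(kwargs["heap_size_hint_in_bytes"])  # 209715200

# With PySR installed, the settings would be passed like this:
#   from pysr import PySRRegressor
#   model = PySRRegressor(**kwargs)
```

The hint does not cap memory; as described above, it only makes each worker's garbage collector more aggressive as usage approaches that size.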
Ok I made a PR to PythonCall.jl here: JuliaPy/PythonCall.jl#547. Once that gets in you can see if that fixes it (will be available via the |
See JuliaLang/julia#56759 and JuliaLang/julia#56801. Basically there was a real memory leak inside Julia which is now fixed. This should be fixed in Julia 1.11.3 when it is released: JuliaLang/julia#56741. I will also see if they can backport to 1.10. Let me know if this is still an issue in the latest Julia when it is released. |
What happened?
It seems that when PySR is left running indefinitely, it eventually hits an OOM error, but only when using a large dataset. I can watch its memory usage grow steadily.
I used the following setup in each of my tests, changing the sample_size 10000 value to a specific value, or commenting the three lines out to run the whole dataset:
My dataset has 460K records, which I know isn't advised, but it's for a niche problem. The memory issue appears to happen only when running on a dataset over a certain size.
I saw a comment about heap_size_hint_in_bytes needing to be set and I've played with values for it, but it doesn't appear to change the behavior. I've set it to 1.2 GB and also 200MB for instance. I've tried smaller batch sizes like 50 and it doesn't appear to change the behavior either. None of the other settings appear to change things either. I've tried procs=8 and smaller populations and population_size, and smaller ncycles_per_iteration.
100K random records: WSL starts at 3.85 GB. 10 minutes: 4 GB. 50 minutes: 4.3 GB. 1 hour: 3.2 GB. 1.5 hours: 3 GB. No issues.
200K random records: WSL starts at ~4 GB. 20 minutes: 4.2 GB. 1 hour: 3.6 GB. 2 hours 15 minutes: 3.5 GB. No issues.
300K random records: WSL starts at ~4.3 GB. 20 minutes: 5.3 GB. Grew to 5.9 GB, then dropped to 5.2 GB at 30 minutes. 1 hour: 6 GB. 1 hour 15 minutes: 5.4 GB. 1 hour 30 minutes: 4.7 GB. 1 hour 32 minutes: 4.4 GB. No issues.
400K random records: WSL starts at ~4.2 GB. 3 minutes: 4.5 GB. 8 hours 30 minutes: 7.7 GB.
460K records: WSL starts at ~4.8 GB. 2 minutes: 5.2 GB. 30 minutes: 9.6 GB. 40 minutes: 12.3 GB. 55 minutes: 14.5 GB. 1 hour: 15.1 GB. 1 hour 10 minutes: 15.4 GB. 1 hour 19 minutes: 15.5 GB. 1 hour 26 minutes: OOM. I ran this also using:
Just to be sure, and it failed at 1 hour 23 minutes, so no difference.
I've attached my test2.json file.
test2.json
Version
0.19.4
Operating System
Linux
Package Manager
pip
Interface
IPython Terminal
Relevant log output
No response
Extra Info
No response