Feature Request: Better error message when usage limit arrived. (HTTPTooManyRequests) #1893

Open
orena1 opened this issue Dec 23, 2024 · 7 comments · May be fixed by #1895

Comments

@orena1

orena1 commented Dec 23, 2024

Is your feature request related to a problem? Please describe.

I encountered frequent errors like the following when running Neptune with NeMo:

Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: HTTPTooManyRequests

Once this error occurred, the job would hang and not progress, eventually resulting in a cu10 error message. The only solution I found was to disable Neptune entirely.

After contacting support, I learned this issue is caused by reaching the default workspace usage limit. Here is their response:

Hi Oren,
Thanks for reaching out. Yes, it looks like you're reaching the default workspace usage limit

It would be much better if the error message directly indicated the actual problem, such as:
Experiencing connection interruptions with Neptune. It appears you are reaching the default workspace usage limit. Please review your workspace limits or contact support for assistance.

This would have saved me time (it took 4 hours to diagnose the issue) and prevented frustration. If I had not reached out to support, our company might have abandoned the idea of using Neptune entirely.

Additionally, it would be beneficial if the job did not fail or freeze in cases where usage limits are exceeded. A graceful handling of such situations would improve the user experience.

Additional context:

Where can I check whether I have indeed reached the usage limit? The dashboard currently only shows storage limits, not connection limits. Clarifying this in the UI or documentation would also be helpful.

Thank you!

@orena1 orena1 changed the title Feature Request: Better error message when usage limit arrived. Feature Request: Better error message when usage limit arrived. (HTTPTooManyRequests) Dec 23, 2024
@SiddhantSadangi SiddhantSadangi self-assigned this Dec 24, 2024
@SiddhantSadangi SiddhantSadangi added this to the 1.14 milestone Dec 24, 2024
@SiddhantSadangi
Member

Hey @orena1 👋

Thank you for the detailed feature request. We do have a page in the docs that deals with this error: https://docs.neptune.ai/help/reducing_requests/, but I'll add more details, as you've mentioned, with a link to this page in the error message itself 📝

Additionally, it would be beneficial if the job did not fail or freeze in cases where usage limits are exceeded. A graceful handling of such situations would improve the user experience.

The job doesn't actually freeze. Neptune's Lightning integration (on which NeMo's Neptune integration is built) calls a wait() internally to ensure all logging calls have reached the server before proceeding with execution. When already rate-limited, this wait can make it seem as if the training has frozen, when it hasn't. If you check the Neptune WebApp, you should be able to see monitoring metrics being updated (unless there's a large file, like model checkpoint, being uploaded).
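As a rough illustration of why a blocking wait can look like a freeze, here is a toy model of a client whose queued logging operations are sent at a capped rate. It is not Neptune's implementation; the class `RateLimitedSync`, its methods, and the "tick" time unit are all hypothetical names chosen for this sketch.

```python
from collections import deque


class RateLimitedSync:
    """Toy model (hypothetical): queued operations are sent at a capped rate."""

    def __init__(self, max_sends_per_tick: int) -> None:
        if max_sends_per_tick < 1:
            raise ValueError("max_sends_per_tick must be at least 1")
        self.max_sends_per_tick = max_sends_per_tick
        self.pending = deque()
        self.sent = []

    def log(self, item) -> None:
        # Logging only enqueues the operation; nothing is sent yet.
        self.pending.append(item)

    def _tick(self) -> None:
        # Send at most `max_sends_per_tick` queued operations in one time step.
        for _ in range(min(self.max_sends_per_tick, len(self.pending))):
            self.sent.append(self.pending.popleft())

    def wait(self) -> int:
        """Block until the queue is empty; return how many ticks that took."""
        ticks = 0
        while self.pending:
            self._tick()
            ticks += 1
        return ticks


sync = RateLimitedSync(max_sends_per_tick=2)
for step in range(10):
    sync.log({"step": step})

ticks_blocked = sync.wait()
print(ticks_blocked)  # 5
```

In this model the caller is blocked for a time proportional to the backlog divided by the allowed rate, so a large backlog under a tight limit can stall the training loop for a long while even though operations are still being sent.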

Where can I check whether I have indeed reached the usage limit? The dashboard currently only shows storage limits, not connection limits. Clarifying this in the UI or documentation would also be helpful.

Currently, this information is only available on the back end. I'll pass this feedback on to the product team to see if we can include it on the dashboard somehow 📝

@SiddhantSadangi SiddhantSadangi linked a pull request Dec 24, 2024 that will close this issue
2 tasks
@SiddhantSadangi SiddhantSadangi removed this from the 1.14 milestone Dec 24, 2024
@SiddhantSadangi
Member

@orena1 - We have a PR to add a more descriptive error message, complete with links to the docs and who to contact for support.

Can you install this version of neptune from the source to check if this works for you?

pip install git+https://github.com/neptune-ai/neptune-client-scale.git@ss/1.x/HTTPTooManyRequests
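For readers who want a picture of what such a message could look like, here is a minimal sketch. It is not the code from the linked PR: the function `describe_connection_error` and the constants are hypothetical, and the wording is assembled from the warning text and the suggestion quoted earlier in this thread.

```python
# Docs link taken from this thread; all names below are illustrative only.
DOCS_URL = "https://docs.neptune.ai/help/reducing_requests/"

BASE_MESSAGE = (
    "Experiencing connection interruptions. "
    "Will try to reestablish communication with Neptune."
)


def describe_connection_error(exception_name: str) -> str:
    """Return a warning string, adding a hint when the error is a rate limit."""
    if exception_name == "HTTPTooManyRequests":
        return (
            f"{BASE_MESSAGE} It appears you are reaching the workspace usage "
            f"limit. See {DOCS_URL} or contact support for assistance."
        )
    # Any other error keeps the generic form of the warning.
    return f"{BASE_MESSAGE} Internal exception was: {exception_name}"


print(describe_connection_error("HTTPTooManyRequests"))
```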

@orena1
Author

orena1 commented Dec 24, 2024

Thanks @SiddhantSadangi, that is much more informative! I can't really test it, as these errors have stopped for now.

@GeorgePearse

This is definitely freezing my training, which, as with the other poster, makes me tempted to ditch Neptune.

@GeorgePearse

You can't even see that you're hitting any usage limits via the UI?

image

@GeorgePearse

GeorgePearse commented Jan 10, 2025

This was definitely blocking my training; paying unblocked it (but nothing made it obvious that this was the problem).

Just:

[neptune] [warning] Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: HTTPTooManyRequests

@SiddhantSadangi
Member

Hey @GeorgePearse 👋

As mentioned in a previous comment, the training doesn't really freeze, but depending on the sync backlog, it can be throttled enough that it appears to have frozen, especially if large files are being uploaded to Neptune 👇

The job doesn't actually freeze. Neptune's Lightning integration (on which NeMo's Neptune integration is built) calls a wait() internally to ensure all logging calls have reached the server before proceeding with execution. When already rate-limited, this wait can make it seem as if the training has frozen, when it hasn't. If you check the Neptune WebApp, you should be able to see monitoring metrics being updated (unless there's a large file, like model checkpoint, being uploaded).
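One generic way to keep the backlog small is to make fewer logging calls in the first place. The sketch below assumes that logging only every N-th step is acceptable for your use case; `make_throttled_logger` and `log_fn` are hypothetical names, and `log_fn` stands in for whatever call actually sends a value to the tracker.

```python
from typing import Callable, List, Tuple


def make_throttled_logger(
    log_fn: Callable[[int, float], None], every_n_steps: int
) -> Callable[[int, float], bool]:
    """Wrap `log_fn` so that it is only called on every N-th step."""
    if every_n_steps < 1:
        raise ValueError("every_n_steps must be at least 1")

    def maybe_log(step: int, value: float) -> bool:
        # Forward the value only when the step is a multiple of N.
        if step % every_n_steps == 0:
            log_fn(step, value)
            return True
        return False

    return maybe_log


# Stand-in for a real tracker call: collect what would have been sent.
sent: List[Tuple[int, float]] = []
maybe_log = make_throttled_logger(
    lambda step, value: sent.append((step, value)), every_n_steps=50
)

for step in range(200):
    maybe_log(step, 1.0 / (step + 1))

print(len(sent))  # 4
```

Here 200 training steps produce 4 logging calls instead of 200, which reduces both the request count and the amount of work a later blocking wait has to flush.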

The lack of context in the HTTPTooManyRequests error and the absence of usage-limit information in the UI are very valid points 💯
We have a version of neptune pending merge and release that adds more context to the HTTPTooManyRequests error, and UX changes that will show this and other usage limits in the UI are in the backlog.

These changes are, however, currently low priority, as we are working on an entirely new version of Neptune, both the API and UI, built from the ground up, where these issues don't exist. It is currently in private beta, but you can sign up for early access here.

4 participants