S3 filesystem connection pool depleted when using Rubix Caching #3524
Comments
This has been fixed in the latest version (0.3.8). We noticed that a lot of connections were left in the TIME_WAIT state by the OS, leading to depletion of sockets. It has been fixed by pooling connections in Rubix and reusing them. cc @harmandeeps |
This might not be it; the part that this solves would not apply to the embedded mode used here. I will see if I can repro this while checking #3494 |
@losipiuk is it possible to repeat your experiment with parallel warmup disabled? |
Sure, I can try to run it. I am not sure if this one is easily reproducible, though. I just saw it once. |
FYI: With |
I have been trying to repro this, but with no luck yet. @losipiuk, can you help with a few details:
|
I was running those sequentially.
I do not have the cluster anymore. What I observed is that the issues started to occur at one of the queries and persisted through the rest of the queries that were run (each taking a long time due to timeouts). I did not observe a healthy cluster after the issues started to occur.
I looked at the configuration logged at Presto startup for |
OK, thanks @losipiuk, I will continue the repro efforts. |
Update on this: I was able to repro it with queries running in a loop, with cancellations on every other query. It got into the error state quite early, within half an hour. I also tried it on the branch with qubole/rubix#368, as it changes a lot of existing code; there I could not repro it, and the same tests ran fine for hours. |
Found the root cause for this. Explained it in detail in qubole/rubix#375. |
We are not seeing it with qubole/rubix#368, not just because of fewer reads from the remote filesystem, but because we moved to positional reads in Rubix master, and this race doesn't happen with positional reads. |
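For readers unfamiliar with the term, here is a minimal, generic sketch of how a positional read differs from a seek-then-read on a shared stream. It is not Rubix or Presto code, and it does not reproduce the actual race described in qubole/rubix#375; the class and method names (`PositionalReadSketch`, `seekThenRead`, `positionalRead`) are made up for illustration, and a local temp file stands in for the remote object. The point is only that a positional read takes the offset as an argument and leaves the shared position untouched, so there is no intermediate state for another caller to disturb.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {
    // Stateful read: first moves the channel's shared position, then reads from it.
    // If two callers interleave the position() and read() steps on one channel,
    // a caller can end up reading from an offset set by the other.
    static String seekThenRead(FileChannel ch, long pos, int len) throws IOException {
        ch.position(pos);
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining() && ch.read(buf) >= 0) {
            // keep reading until the buffer is full or EOF is reached
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
    }

    // Positional read: the offset is passed with every call and the channel's
    // own position is never modified.
    static String positionalRead(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position());
            if (n < 0) {
                break; // EOF
            }
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pread-sketch", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                System.out.println(positionalRead(ch, 4, 3)); // bytes at offsets 4..6
                System.out.println(ch.position());            // shared position unchanged
                System.out.println(seekThenRead(ch, 4, 3));   // same bytes
                System.out.println(ch.position());            // shared position has moved
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both helpers return the same bytes when called from a single thread; the difference only matters when the underlying stream is shared or reused across readers.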
@stagraqubole does qubole/rubix#368 fix the issue? |
qubole/rubix@87ae6e1 has fixed it for Presto. This will not occur after the Rubix upgrade. |
Fixed by: #3772 (as it upgrades Rubix so that it has qubole/rubix#368). |
I exercised Rubix Caching in the same setup as in #3494
After ~20 queries had run, the following ones started failing with
io.prestosql.spi.PrestoException: Error reading from s3://<redacted>/tpch-sf1000-ORC/lineitem/20180106_235637_00251_ecxdi_19e8c24a-7fc9-4512-a296-555ab4efc0c2 at position 732097590
Exception stacktrace from worker log:
cc: @stagraqubole