-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix WebSocket memory free to be thread safe. #209
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to create a test for the wrong behavior, or at least test on my own server.
-- Release a socket retrieved with Get_Socket above, this socket will be | ||
-- then available again. | ||
|
||
procedure Free (WebSocket : in out Object_Class); | ||
-- Free WebSocket immediately if not taken by another tasK, otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"tasK" -> "task"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a confusion in the API because the websocket itself has a primitive operation Free (aws-net-websocket.adb:215) which does an immediate deallocation. Shouldn't the latter internally call Release_Socket
instead, and perform the actual deallocation later as needed ? As it is, the user might still free the socket directly, breaking the code. Perhaps you have an idea how to make this clearer ?
At the very least, it seems we should document that behavior, and perhaps rename AWS.Net.WebSocket.Registry.DB.Free to something like Deferred_Free
or something to clear possible confusion in the AWS code itself ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about the Deferred_Free
naming, not bad but confusing in the other direction as the implementation does free immediately if the socket is not currently handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Free_Or_Defer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me! I'll do the change. Thanks.
Unchecked_Free (WebSocket); | ||
end if; | ||
end Free; | ||
|
||
---------------- | ||
-- Get_Socket -- | ||
---------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't Get_Socket also avoid sockets that are marked with To_Free=True ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because we do return only socket not in Sending container and in this case if it was to be free it would have been already done in Release_Socket.
-- record this socket to be freed as soon as it is released | ||
-- (Release_Socket) call. | ||
|
||
if Sending.Contains (WebSocket.Id) then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking in the code why Sending is a set. It seems to be it might just as well be a flag in the Object itself, rather than an independent set. The difficulty is to make sure it is only accessed via the protected object (though it could also be made atomic for instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe but could be done as a clean-up and does not seem do be the issue at hand.
Also noticed aws-net-websocket-registry.adb:445 which does a direct call to Unchecked_Free on the socket. Should this one be a call to Release_Socket instead ? |
(in my own patch, I had replaced Unchecked_Free with a call to |
Similar call in |
I ran my test multiple times with this patch, and did not notice the same thread-sanitizer report on use-after-free. There are however a few data races (two in the context of AWS, and one already reported to AdaCore in the runtime), which makes things a bit harder to read. I'll need some time to analyze. |
Unfortunately I think I just had the same issue with thread-sanitizer:
I double-checked that the patch is correctly applied. Since we are not using a protected type in message_reader, I think we can have this error at any time if it happens that we free the socket at the wrong time. |
Here is a test similar to the one I am using. It shows thread-sanitizer reports for data races, but I haven't yet seen the same use-after-free error. I do recompile the AWS file (and the runtime files) with the same -fsanitize=thread switch, which might be needed to reproduce some of the errors.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a typo in commit message "to ne thread"
I'd like your input on the testcase above: you'll notice is calls the primitive |
70de059
to
265967f
Compare
Yep, now fixed. |
A slightly different take on the take thread-sanitizer report I sent. This one doesn't involve free, but still shows a lack of mutex somewhere. One of the tasks in my server is writing to the websocket. This apparently doesn't involve any mutex (maybe my code is responsible for getting one, but this isn't clear from the API what such a mutex/protected object would be). Later, the same socket is written to by message_reader also without a mutex. So it is possible that both write would happen at the same time, leading to garbage of course.
|
Not sure if it will help but I have a new version of this PR. All Unchecked_Free in the registry are now done in the DB.Free() and so only possibly a deferred free. One question, do you confirm that if the comment out the Unchecked_Free in the registry there is no crash ? At least this will make it clear that the only issue on the AWS WebSocket implementation is the WebSocket memory handling. EDIT: forgot that in the last implementation I skip all WebSocket having a deferred free state in Create_Set. |
|
Yes, users should not call AWS.Net.WebSocket.Free directly. This is on the spec to be able to override it for SSL implementation (as described in the comment). But this routine is needed as the garbage collection is done by Finalize(). |
Can't An alternative could be to have |
I thought about this, but how can it be possible. Free() call Free_Or_Defer() which does the Unchecked_Free() which calls the Free() because the object is finalized. Or maybe you meant something else ? EDIT: I see, a special flag... well will be hard to implement correctly I fear! |
I don't know precisely what I want, except it would be nice to have a safe API where people cannot easily shoot themselves in the foot by calling I think we both agree that as soon as someone frees memory in an uncontrolled fashion, there is no way the registry and others will work correctly. In your sentence, there is ambiguity between the primitive
That seems to be working as expected, right ? To me, the important is that Unchecked_Free only occurs when the websocket is no longer in use (either in the registry, or in the queue of messages read by Maybe we need to somehow formalize the rules and ownership here...
Going on a tangent here: perhaps the idea with the type-with-discriminant is nice: users can no longer store the websocket in their own code, but always need to go via the registry. We then provide a procedure like:
while the callback executes, AWS garantees that memory is not freed for the websocket. That simplifies lifetime management significantly since only AWS is responsible for it and user has no way around that. This is the same kind of API we use for refcounted types like GNATCOLL.Strings. This This breaks the current API slightly, but it prevents very dangerous issues. The current API needs some improvements anyway as shown in Maxim's report in #138. |
This makes sense indeed. I'll work on this soon and propose a new patch where the release of WebSocket object is fully controlled by the registry and done only when the WebSocket is not used anymore. |
Don't you like my proposal with the |
The issue is that will break the API ! You meant concurrent write as the other issue ? Because the Send and Free should be fixed by the WIP we've been discussing. |
I just realized the implementation of But even that is not enough anyway. Instead, At it currently is implemented, it is even worse, since both tasks will receive the socket, and the first one will release it, potentially triggering a free. The second one is then accessing dangling pointers... |
Well the "fix" to use a boolean in WebSocket object is wrong because the state (boolean) is not shared across all copies of WebSocket object. Either we should use an access to a boolean or revert to using a container. |
Or move To_Free and Sending boolean into the Internal_State record which was made for this ! |
Continued work for U504-028.
Not sure I followed that. The socket uses the same UID everywhere right, so the container for Sending (which maps from this UID to the boolean) would have the same issue ? That's a part of the code I did not look into, so I'll certainly trust you there. Still, that would not solve the need to make |
Also, what copies of the socket ? Internally, you use Object_Class, so this is a pointer to a single |
Continued work for U504-028.
@briot : With the latest change I feel that we are really in better position now. My feeling is that we could merge this now and see if some other fixes are to be applied. Is that ok with you ? |
I have to review your change (missed it when you committed it). As I remember there were a number of points of discussion still pending, but things have getting a bit confusing in this PR indeed. I'll review and test the patch tomorrow, will let you know |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really comfortable with all the "exception when others => null" which simply ignore exceptions. I wonder what these will hide...
There are a number of cases where we do DB.Unregister(...); DB.Free_Or_Defer(...)
I thought the latter already handle de-registration, so it seems the former call is unneeded (and inefficient since we need to lock the PO).
I still would like to argue for adding a Use_Socket
function that will automatically take care of the Get_Socket+Release_Socket. It will cleanup the AWS code a bit, but more importantly it allows users to do things safely in their own code. In my case, I have a queue of messages I want to send to users, but this is done asynchronously. So the queue has the socket UID and the message to send. The only way to be task-safe is to use get_socket+release_socket, but those are currently hidden to users. A Use_Socket that encapsulate that is also safer to use...
The code looks like
procedure Use_Socket
(Id : UID;
Callback : not null access procedure (WS : in out Object'Class))
is
WS : Object_Class;
begin
DB.Get_Socket (Id, WS);
if WS /= null then
begin
Callback (WS.all);
exception
when others =>
DB.Release_Socket (WS);
raise;
end;
DB.Release_Socket (WS);
end if;
end Use_Socket;
Thanks for the clean up of Sending
.
I am testing the patch (merging takes me a very long time since I have a number of additional patches, mostly from #159, so I have a local copy of the AWS files to which I backport the changes. That doesn't look like the most efficient way to do things though. Once we merge this one and perhaps #159, I'll try to get back to an easier setup.
@@ -225,10 +235,9 @@ package body AWS.Net.WebSocket.Registry is | |||
Signal : Boolean := False; -- Transient signal, release Not_Emtpy | |||
S_Signal : Boolean := False; -- Shutdown is in progress | |||
New_Pending : Boolean := False; -- New pending socket | |||
New_State : Boolean := False; -- A sokcet has been released |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sokcet => socket
end; | ||
|
||
else | ||
-- The socket is not registerred anymore, just leave now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerred => registered
My testsuite never completed, full of
This is the call to the entry |
The implementation of
The proper implementation is to replace New_State with |
And this really shows the AWS testsuite for websocket is lacking... I believe the code I sent in #209 (comment) shows this issue |
Hi Pascal, any news on this patch ? did you manage to fix the issue with requeue ? |
Hi Manu, no progress so far but work is restarting on this issue. |
Hi Manu, with your reproducer we cannot reproduce issues. I'm wondering if you are using some specific options (we use only -fsanitize=thread) ? Or maybe we are not looking at the right place or for the right report ? Can you describe more your procedure ? TIA. |
We use the same simple switch that Dmitriy proposed in #214, namely The latest error I reported in I am assuming you tried the test I sent (#209 (comment)) without any of the patches in this discussion (which did improve things, no mistake, even if not perfect yet). I'll try double-check again |
Hi Pascal and Dmitry Any news on this issue ? We are almost there, the patch is almost done and definitely improves things. I think mostly the implementation of |
I did not see the prove that current AWS sources is not thread safe. Manu, if I miss something, could you show reproducer ? |
Agreed, so may I propose to merge this part and see where we are and get to last issue if any. At this point the discussion is long, the path is not trivial, so I think moving forward by step will certainly help converging. How does that sound to you Manu? |
Agreed the discussion is indeed too long, Github is not quite up to handling such long patches it seems. I'll open a separate issue afterwards. |
I don't think we can merge this pull request. I see a lot of |
Here is a reproducer that shows issues in thread sanitizer. To be honest, I am not sure whether those are related to this issue (#209) or to the one reported by Maxim in #138, though both need to be fixed anyway. Dmitry, just take the time to run the demos/websockets under any sanitizer/valgrind tool you want, and you'll see the issue reported by Maxim in #138. This might need to be fixed before the issues in #209 are apparent (I have run locally with workarounds for #138 for a few years now, so indeed those are no longer an issue for me) As the long discussion here shows, there are a lot of concurrency issues in the current websocket code that we can see just by reading the code. Pascal's patch fixes a lot of them (except the one in Get_Socket I reported in May -- I kind of lost the context from then, but we'll get to that after Pascal merges the initial set of cleanups) |
(I have deleted my example, I think it contained errors -- still working on a reproducer) |
So we have to do some clean-up first indeed. But still I'd like to move forward on this. Many improvements have been made in this PR. |
An actual reproducer now. It shows some sanitizer errors in close_socket, which I think are related to this thread. It also shows PROTOCOL_ERROR, which I think are likely errors in my test, though I must admit it isn't obvious to me right now.
The project file is
And here is how I compiled a fresh clone of AWS:
|
Manu, your example gives strange output at the end even without thread sanitizer raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 35HTTP/1.1 101 Switching Protocols raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 35HTTP/1.1 101 Switching Protocols raised AWS.CLIENT.PROTOCOL_ERROR : Invalid accept from server raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 37HTTP/1.1 101 Switching Protocols double free or corruption (out) |
Maybe you have an idea what's wrong. I did not find anything obvious in my code, so this might be early freed memory in AWS for instance, which is kind of what we are looking for here |
Ok. I'll try to understand. |
For U504-028.