Fix WebSocket memory free to be thread safe. #209

Draft
TurboGit wants to merge 10 commits into master from po/fix-free-ws

TurboGit
Collaborator

@TurboGit TurboGit commented May 4, 2021

For U504-028.

@TurboGit TurboGit marked this pull request as draft May 4, 2021 18:28
@TurboGit TurboGit added the bug label May 4, 2021
@TurboGit TurboGit requested a review from anisimkov May 4, 2021 18:28
Contributor

@briot briot left a comment

I will try to create a test for the wrong behavior, or at least test on my own server.

-- Release a socket retrieved with Get_Socket above, this socket will be
-- then available again.

procedure Free (WebSocket : in out Object_Class);
-- Free WebSocket immediately if not taken by another tasK, otherwise
Contributor

"tasK" -> "task"

Contributor

There is a confusion in the API because the websocket itself has a primitive operation Free (aws-net-websocket.adb:215) which does an immediate deallocation. Shouldn't the latter internally call Release_Socket instead, and perform the actual deallocation later as needed ? As it is, the user might still free the socket directly, breaking the code. Perhaps you have an idea how to make this clearer ?
At the very least, it seems we should document that behavior, and perhaps rename AWS.Net.WebSocket.Registry.DB.Free to something like Deferred_Free or something to clear possible confusion in the AWS code itself ?

Collaborator Author

I thought about the Deferred_Free naming, not bad but confusing in the other direction as the implementation does free immediately if the socket is not currently handled.

Contributor

Free_Or_Defer ?

Collaborator Author

Works for me! I'll do the change. Thanks.
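
To make the "free now, or defer until released" semantics discussed above concrete, here is a minimal, self-contained sketch. It is not the AWS implementation: the Socket record, the Take procedure and the demo main are hypothetical stand-ins, and only the names Free_Or_Defer, Release_Socket, Sending and To_Free are borrowed from this thread.

```ada
with Ada.Containers.Ordered_Sets;
with Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure Free_Or_Defer_Demo is

   --  Hypothetical stand-in for AWS.Net.WebSocket.Object
   type Socket is record
      Id      : Positive;
      To_Free : Boolean := False;  --  set when a free has been deferred
   end record;
   type Socket_Access is access Socket;

   procedure Unchecked_Free is
     new Ada.Unchecked_Deallocation (Socket, Socket_Access);

   package Id_Sets is new Ada.Containers.Ordered_Sets (Positive);

   protected DB is
      procedure Take (S : Socket_Access);
      --  Mark S as in use (plays the role of Get_Socket here)

      procedure Release_Socket (S : in out Socket_Access);
      --  S is no longer in use; run the deferred free if one is pending

      procedure Free_Or_Defer (S : in out Socket_Access);
      --  Free S now if nobody is using it, otherwise defer the free
   private
      Sending : Id_Sets.Set;
   end DB;

   protected body DB is
      procedure Take (S : Socket_Access) is
      begin
         Sending.Include (S.Id);
      end Take;

      procedure Release_Socket (S : in out Socket_Access) is
      begin
         Sending.Exclude (S.Id);
         if S.To_Free then
            Unchecked_Free (S);
         end if;
      end Release_Socket;

      procedure Free_Or_Defer (S : in out Socket_Access) is
      begin
         if Sending.Contains (S.Id) then
            S.To_Free := True;   --  in use: only record the request
         else
            Unchecked_Free (S);  --  not in use: free immediately
         end if;
      end Free_Or_Defer;
   end DB;

   S : Socket_Access := new Socket'(Id => 1, To_Free => False);
begin
   DB.Take (S);
   DB.Free_Or_Defer (S);   --  socket is in use, so this only marks it
   Ada.Text_IO.Put_Line ("still allocated: " & Boolean'Image (S /= null));
   DB.Release_Socket (S);  --  the deferred free happens here
   Ada.Text_IO.Put_Line ("freed: " & Boolean'Image (S = null));
end Free_Or_Defer_Demo;
```

Because every decision is taken inside the protected object, the "is it in use?" check and the deallocation cannot interleave with a concurrent release. The sketch does nothing about a user calling a public Free directly, which is the concern raised in the comment above.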

Unchecked_Free (WebSocket);
end if;
end Free;

----------------
-- Get_Socket --
----------------
Contributor

Shouldn't Get_Socket also avoid sockets that are marked with To_Free=True ?

Collaborator Author

No, because we only return sockets that are not in the Sending container, and in that case, if the socket was to be freed, this would already have been done in Release_Socket.

-- record this socket to be freed as soon as it is released
-- (Release_Socket) call.

if Sending.Contains (WebSocket.Id) then
Contributor

I was looking in the code to see why Sending is a set. It seems to me it might just as well be a flag in the Object itself, rather than an independent set. The difficulty is to make sure it is only accessed via the protected object (though it could also be made atomic, for instance).

Collaborator Author

Yes, maybe, but that could be done as a clean-up and does not seem to be the issue at hand.

@briot
Contributor

briot commented May 5, 2021

Also noticed aws-net-websocket-registry.adb:445 which does a direct call to Unchecked_Free on the socket. Should this one be a call to Release_Socket instead ?

@briot
Contributor

briot commented May 5, 2021

(In my own patch, I had replaced Unchecked_Free with a call to AWS.Net.WebSocket.Free (WebSocket); but it was a long time ago and I forgot why it was needed.)

@briot
Contributor

briot commented May 5, 2021

There is a similar call in Close and On_Close: it does Unchecked_Free directly, and it is not clear to me whether it should be Release_Socket instead. I had also replaced this one with AWS.Net.WebSocket.Free in my patch.

@briot
Contributor

briot commented May 5, 2021

I ran my test multiple times with this patch, and did not notice the same thread-sanitizer report on use-after-free. There are however a few data races (two in the context of AWS, and one already reported to AdaCore in the runtime), which makes things a bit harder to read. I'll need some time to analyze.
At first sight the patch looks good

@briot
Contributor

briot commented May 5, 2021

Unfortunately I think I just had the same issue with thread-sanitizer:

WARNING: ThreadSanitizer: heap-use-after-free
  Read of size 8 at 0x7b60000c6010 by thread T30:
    #0 aws__net__websocket__registry__message_reader__read_message aws-net-websocket.adb:469
    #1 aws__net__websocket__registry__message_readerTB aws-net-websocket-registry.adb:367
    #2 system__tasking__stages__task_wrapper s-tassta.adb:1201

  Previous write of size 8 at 0x7b60000c6010 by thread T15 (mutexes: write M457):
    #0 free tsan_interceptors_posix.cpp:707
    #1 __gnat_free s-memory.adb:122
    #2 system__pool_global__deallocate s-pooglo.adb:134
    #3 system__storage_pools__subpools__deallocate_any_controlled s-stposu.adb:470
    #4 aws__net__websocket__free__P149b__2.90 <null>
    #5 aws__net__websocket__free__2 aws-net-websocket.adb:199
    #6 aws__net__websocket__registry__db__closeN__2 aws-net-websocket-registry.adb:491
    #7 aws__net__websocket__registry__db__closeP__2 aws-net-websocket-registry.adb:469
    #8 aws__net__websocket__registry__close__3 aws-net-websocket-registry.adb:1150
    #9 gui__senders__gs_send_ws__process_message__do_send gui-senders.adb:168
    #10 gnatcoll__strings__access_string__dispatch gnatcoll-strings_impl.adb:694
    #11 gnatcoll__strings__access_string__do_access gnatcoll-strings_impl.adb:673
    #12 gnatcoll__strings__access_string gnatcoll-strings_impl.adb:699
    #13 gui__senders__gs_send_ws__process_message gui-senders.adb:178
    #14 gui__senders__gs_send_wsTB gui-senders.adb:190
    #15 system__tasking__stages__task_wrapper s-tassta.adb:1201

I double-checked that the patch is correctly applied. Since we are not using a protected type in message_reader, I think we can have this error at any time if it happens that we free the socket at the wrong time.

@briot
Contributor

briot commented May 5, 2021

Here is a test similar to the one I am using. It shows thread-sanitizer reports for data races, but I haven't yet seen the same use-after-free error. I do recompile the AWS file (and the runtime files) with the same -fsanitize=thread switch, which might be needed to reproduce some of the errors.

with Ada.Unchecked_Deallocation;
with AWS.Config.Set;
with AWS.Net.WebSocket.Registry.Control;
with AWS.Server;
with AWS.Services.Dispatchers.URI;
with AWS.Status;
with Client;

procedure Main is

   procedure Unchecked_Free is new Ada.Unchecked_Deallocation
      (Client.Client_Type, Client.Client_Access);

   function Create_Socket
     (Request : AWS.Status.Data) return AWS.Net.WebSocket.Object_Class is
   begin
      return new AWS.Net.WebSocket.Object;  --  server-side socket
   end Create_Socket;

   Arr     : Client.Client_Array (1 .. 60);
   Conf    : AWS.Config.Object := AWS.Config.Get_Current;
   Handler : AWS.Services.Dispatchers.URI.Handler;
begin
   AWS.Config.Set.Reuse_Address (Conf, True);
   AWS.Net.WebSocket.Registry.Register
      ("/streams", Create_Socket'Unrestricted_Access);
   AWS.Server.Start
      (Web_Server => Client.Server,
       Config     => Conf,
       Dispatcher => Handler);
   AWS.Net.WebSocket.Registry.Control.Start;

   for A in Arr'Range loop
      Arr (A) := new Client.Client_Type (Id => A);
   end loop;

   delay 20.0;

   Client.Finish := True;

   for A in Arr'Range loop
      Arr (A).Terminated;

      --  Uncomment this to get thread-sanitizer error in runtime
--      Unchecked_Free (Arr (A));
   end loop;

   AWS.Net.WebSocket.Registry.Control.Shutdown;
   AWS.Server.Shutdown (Client.Server);
end Main;


with AWS.Server;
package Client is
   task type Client_Type (Id : Positive) is
      entry Terminated;
   end Client_Type;

   type Client_Access is access Client_Type;
   type Client_Array is array (Natural range <>) of Client_Access;

   Finish : Boolean := False with Atomic;
   --  Request termination of tasks

   Server : AWS.Server.HTTP;

end Client;



with Ada.Exceptions;
with Ada.Numerics.Float_Random;
with Ada.Text_IO;
with AWS.Net.WebSocket;
with AWS.Server.Status;
package body Client is

   task body Client_Type is
      use AWS.Net.WebSocket;
      Sock : AWS.Net.WebSocket.Object_Class;
      Gen  : Ada.Numerics.Float_Random.Generator;
   begin
      Ada.Numerics.Float_Random.Reset (Gen);
      while not Finish loop
         Sock := new AWS.Net.WebSocket.Object;  --  client-side socket
         AWS.Net.WebSocket.Connect
            (Sock.all,
             AWS.Server.Status.Local_URL (Server) & "/streams");

         Sock.Send ("some message");
         delay 3.0 * Duration (Ada.Numerics.Float_Random.Random (Gen));

         Sock.Close ("no longer needed");
         --  AWS.Net.WebSocket.Free (Sock);
      end loop;

      accept Terminated;

   exception
      when E : others =>
         Ada.Text_IO.Put_Line (Ada.Exceptions.Exception_Information (E));
         accept Terminated;
   end Client_Type;
end Client;



with "aws.gpr";
project Default is
   for Main use ("main.adb");

   package Compiler is
      for Switches ("Ada") use
         ("-O2", "-fsanitize=thread", "-fno-omit-frame-pointer");
   end Compiler;

   package Linker is
      for Switches ("Ada") use ("-fsanitize=thread", "-fuse-ld=gold");
   end Linker;
end Default;

Contributor

@anisimkov anisimkov left a comment

Looks like a typo in commit message "to ne thread"

@briot
Contributor

briot commented May 5, 2021

I'd like your input on the testcase above: you'll notice it calls the primitive Close directly on the socket. Should it instead call AWS.Net.WebSocket.Registry.Close ?
Also, I have commented out the call to Unchecked_Free for the socket, because it seems like user code should not free memory directly, but I am not sure how to free memory then.

@TurboGit TurboGit force-pushed the po/fix-free-ws branch 2 times, most recently from 70de059 to 265967f Compare May 5, 2021 12:46
@TurboGit TurboGit changed the title Fix WebSocket memory free to ne thread safe. Fix WebSocket memory free to be thread safe. May 5, 2021
@TurboGit
Collaborator Author

TurboGit commented May 5, 2021

Looks like a typo in commit message "to ne thread"

Yep, now fixed.

@briot
Contributor

briot commented May 5, 2021

A slightly different take on the thread-sanitizer report I sent. This one doesn't involve free, but still shows a lack of mutex somewhere. One of the tasks in my server is writing to the websocket. This apparently doesn't involve any mutex (maybe my code is responsible for getting one, but it isn't clear from the API what such a mutex/protected object would be). Later, the same socket is written to by message_reader, also without a mutex. So it is possible that both writes would happen at the same time, leading to garbage of course.

WARNING: ThreadSanitizer: data race
  Write of size 1 at 0x7b7400501e10 by thread T30:
    #0 memmove sanitizer_common_interceptors.inc:789
    #1 memmove sanitizer_common_interceptors.inc:787
    #2 aws__net__buffered__write aws-net-buffered.adb:448
    #3 aws__net__websocket__protocol__rfc6455__send_frame_header aws-net-websocket-protocol-rfc6455.adb:767
    #4 aws__net__websocket__protocol__rfc6455__send_frame aws-net-websocket-protocol-rfc6455.adb:677
    #5 aws__net__websocket__protocol__rfc6455__receive aws-net-websocket-protocol-rfc6455.adb:501
    #6 aws__net__websocket__receive__2 aws-net-websocket.adb:549
    #7 aws__net__websocket__registry__message_reader__read_message aws-net-websocket.adb:469
    #8 aws__net__websocket__registry__message_readerTB aws-net-websocket-registry.adb:367
    #9 system__tasking__stages__task_wrapper s-tassta.adb:1201

  Previous read of size 8 at 0x7b7400501e10 by thread T17:
    #0 sendto sanitizer_common_interceptors.inc:6446
    #1 gnat__sockets__thin__c_sendto g-socthi.adb:362
    #2 gnat__sockets__send_socket g-socket.adb:2458
    #3 aws__net__std__send aws-net-std__gnat.adb:682
    #4 aws__net__websocket__send__5 aws-net-websocket.adb:574
    #5 aws__net__send aws-net.adb:474
    #6 aws__net__buffered__flush aws-net-buffered.adb:75
    #7 aws__net__websocket__protocol__rfc6455__send aws-net-websocket-protocol-rfc6455.adb:652
    #8 aws__net__websocket__send__4 aws-net-websocket.adb:612
    #9 aws__net__websocket__send__2 aws-net-websocket.adb:590
    #10 gui__senders__send gui-senders-send.adb:18
    #11 gui__senders__gs_send_ws__process_message__do_send gui-senders.adb:155
    #12 gnatcoll__strings__access_string__dispatch gnatcoll-strings_impl.adb:694
    #13 gnatcoll__strings__access_string__do_access gnatcoll-strings_impl.adb:673
    #14 gnatcoll__strings__access_string gnatcoll-strings_impl.adb:699
    #15 gui__senders__gs_send_ws__process_message gui-senders.adb:178
    #16 gui__senders__gs_send_wsTB gui-senders.adb:190
    #17 system__tasking__stages__task_wrapper s-tassta.adb:1201

@TurboGit
Collaborator Author

TurboGit commented May 5, 2021

Not sure if it will help but I have a new version of this PR. All Unchecked_Free in the registry are now done in the DB.Free() and so only possibly a deferred free.

One question: do you confirm that if you comment out the Unchecked_Free in the registry there is no crash ? At least this will make it clear that the only issue in the AWS WebSocket implementation is the WebSocket memory handling.

EDIT: forgot that in the last implementation I skip all WebSocket having a deferred free state in Create_Set.

@briot
Contributor

briot commented May 5, 2021

  • don't we still have the possible issue of users doing a free on their own (AWS.Net.WebSocket.Free), or is that also handed out to DB.Free ?
  • I do not know about crashes. My server crashes every few days, but I do not have a systematic reproducer (which of course heavily hints at invalid memory usage and race conditions). The reports I did (which the reproducer I sent earlier today will also show) are by using the thread-sanitizer tool which detects potential issues, even if they happen to have no influence during the specific build. You have to run such tests multiple times, because sanitizer might detect different things depending on the timing of the tasks.
  • I actually do not believe freeing is the only issue, as highlighted by my last sanitizer report (#209 (comment)), which shows possible data races when writing to the socket. This one might not be fatal to an application, but might end up sending corrupted data.

@TurboGit
Collaborator Author

TurboGit commented May 5, 2021

* don't we still have the possible issue of users doing a free on their own (`AWS.Net.WebSocket.Free`), or is that also handed out to `DB.Free` ?

Yes, users should not call AWS.Net.WebSocket.Free directly. This is on the spec to be able to override it for SSL implementation (as described in the comment). But this routine is needed as the garbage collection is done by Finalize().

@briot
Contributor

briot commented May 5, 2021

Can't Free be in the private part, if it needs to be overridden by specific parts of AWS, but not called by users ?
Come to think of it, the subprogram you need to override is the one that receives an Object, but the one that uses an Object_Class can probably be hidden altogether in some private part or even the body ? At least users would need to instantiate Unchecked_Deallocation on their own, which might raise some kind of warning in their head. It seems to me that only the one that takes the Object_Class is dangerous. Otherwise, we still have a valid pointer, even though the socket itself might have become unusable (but then it can be checked).

An alternative could be to have Free call Free_Or_Defer in general, unless some special flag is set on the socket before (from Finalize or other places that legitimately need to free things). This defers checks to run time as opposed to compile time, but might be safer nonetheless...

@TurboGit
Collaborator Author

TurboGit commented May 5, 2021

An alternative could be to have Free call Free_Or_Defer in general,

I thought about this, but how can it be possible ? Free() calls Free_Or_Defer(), which does the Unchecked_Free(), which calls Free() because the object is finalized. Or maybe you meant something else ?

EDIT: I see, a special flag... well will be hard to implement correctly I fear!

@briot
Contributor

briot commented May 5, 2021

I don't know precisely what I want, except it would be nice to have a safe API where people cannot easily shoot themselves in the foot by calling AWS.Net.WebSocket.Free (whose name is kind of tempting...)

I think we both agree that as soon as someone frees memory in an uncontrolled fashion, there is no way the registry and others will work correctly. In your sentence, there is ambiguity between the primitive Free and the one that applies to Object_Class. So let me rewrite it:

Free(Object_Class) calls Free_Or_Defer, which does the Unchecked_Free (if possible), which calls Free(Object) because the object is finalized.

That seems to be working as expected, right ?

To me, the important point is that Unchecked_Free only occurs when the websocket is no longer in use (either in the registry, or in the queue of messages read by Messages_Reader, or in user code). The latter is the major problem, because you do not know when the user is manipulating the websocket.

Maybe we need to somehow formalize the rules and ownership here...

  • we hide Free(Object_Class), users should not call this directly; Perhaps harder: remove Object_Class altogether in the public API (still needed internally of course), or use a type with access discriminant. Then users cannot free memory...
  • Users should call Close(Object_Class) when they are done with the websocket. The effect is to close the actual C socket, and remove the socket from the registry. Perhaps also free all pending messages in the queue. Close needs to call the user's On_Close
  • AWS itself might be calling Close on error, or when the client closes its end of the socket. In this case too, On_Close is called.
  • After On_Close was called, users should no longer use the socket, which might get freed at any time convenient to AWS.
  • Since the socket was unregistered, no more messages will be queued for it, so messages_reader will not try to manipulate it either.
  • We need additional mutexes to protect socket operations. In my last sanitizer report, we saw that there could be several tasks writing to the socket concurrently. With this proposal, it might also be the case that we are closing the socket while the background messages_reader task is using the socket (hence the need to defer the actual freeing).

Going on a tangent here: perhaps the idea with the type-with-discriminant is nice: users can no longer store the websocket in their own code, but always need to go via the registry. We then provide a procedure like:

    procedure Use_Socket (Id : UID; Callback : not null access procedure (WebSocket: in out Type_With_Discriminant));
    --  or a generic procedure if we prefer

While the callback executes, AWS guarantees that memory is not freed for the websocket. That simplifies lifetime management significantly, since only AWS is responsible for it and the user has no way around that. This is the same kind of API we use for refcounted types like GNATCOLL.Strings. This Use_Socket is also a good place to have the mutexes in place to prevent multiple concurrent writers on the socket.

This breaks the current API slightly, but it prevents very dangerous issues. The current API needs some improvements anyway as shown in Maxim's report in #138.
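
As an illustration of the "additional mutexes to protect socket operations" point above, here is a minimal sketch of serializing writers with a per-socket lock. Everything in it is hypothetical (the Mutex type, the Socket record, this Send procedure and the Writer tasks); it only shows the general idea and says nothing about where AWS would actually place such a lock.

```ada
with Ada.Text_IO;

procedure Serialized_Send_Demo is

   --  A simple binary lock built on a protected entry
   protected type Mutex is
      entry Seize;
      procedure Release;
   private
      Locked : Boolean := False;
   end Mutex;

   protected body Mutex is
      entry Seize when not Locked is
      begin
         Locked := True;
      end Seize;

      procedure Release is
      begin
         Locked := False;
      end Release;
   end Mutex;

   --  Hypothetical stand-in for a WebSocket carrying its own write lock
   type Socket is limited record
      Write_Lock : Mutex;
      Sent       : Natural := 0;  --  shared state updated by Send
   end record;

   procedure Send (S : in out Socket; Message : String) is
   begin
      S.Write_Lock.Seize;
      begin
         --  Real code would write the frame to the socket here
         S.Sent := S.Sent + Message'Length;
      exception
         when others =>
            S.Write_Lock.Release;  --  never leave the lock held
            raise;
      end;
      S.Write_Lock.Release;
   end Send;

   Sock : Socket;

   task type Writer;

   task body Writer is
   begin
      for I in 1 .. 1_000 loop
         Send (Sock, "ping");
      end loop;
   end Writer;

begin
   declare
      A, B : Writer;
   begin
      null;  --  the block exits once both writers have terminated
   end;
   Ada.Text_IO.Put_Line ("bytes sent:" & Natural'Image (Sock.Sent));
end Serialized_Send_Demo;
```

With the lock in place the two writers never interleave inside Send, so the counter ends at 8000 (2 tasks, 1000 messages, 4 bytes each). Seize is called from ordinary task code, not from inside a protected action, so it is a legal blocking call.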

@TurboGit
Collaborator Author

TurboGit commented May 6, 2021

That seems to be working as expected, right ?

This makes sense indeed. I'll work on this soon and propose a new patch where the release of WebSocket object is fully controlled by the registry and done only when the WebSocket is not used anymore.

@briot
Contributor

briot commented May 6, 2021

Don't you like my proposal with the Use_Socket procedure, and users only manipulate UID objects ? I think this is the best way to protect against concurrent calls to Send and Free. Free is not the only issue here, we also need to address the issue with write.

@TurboGit
Collaborator Author

TurboGit commented May 6, 2021

Don't you like my proposal with the Use_Socket procedure, and users only manipulate UID objects ? I think this is the best way to protect against concurrent calls to Send and Free. Free is not the only issue here, we also need to address the issue with write.

The issue is that will break the API ! You meant concurrent write as the other issue ? Because the Send and Free should be fixed by the WIP we've been discussing.

@briot
Contributor

briot commented May 17, 2021

I just realized the implementation of Get_Socket is wrong. I had mentioned before that it should not return the socket if it is already in use (Sending is true). I could not find your answer again, but I don't think this is done automatically.

But even that is not enough anyway. Instead, Get_Socket needs to be blocking, and wait until the socket is available (see Pat Rogers's recent blog post on the standard pattern using requeue for that if you were not familiar with it already). The reason is that otherwise, if we have two tasks calling Use_Socket concurrently (or the equivalent low-level Get_Socket/Release_Socket), then only one of them will see the socket (the other will get null and not perform any work on it).

As it is currently implemented, it is even worse, since both tasks will receive the socket, and the first one will release it, potentially triggering a free. The second one is then accessing dangling pointers...

@TurboGit
Collaborator Author

I just realized the implementation of Get_Socket is wrong. I had mentioned before that it should not return the socket if it is already in use (Sending is true). I could not find your answer again, but I don't think this is done automatically.

Well the "fix" to use a boolean in WebSocket object is wrong because the state (boolean) is not shared across all copies of WebSocket object. Either we should use an access to a boolean or revert to using a container.

@TurboGit
Collaborator Author

Or move To_Free and Sending boolean into the Internal_State record which was made for this !

@briot
Contributor

briot commented May 18, 2021

Not sure I followed that. The socket uses the same UID everywhere right, so the container for Sending (which maps from this UID to the boolean) would have the same issue ? That's a part of the code I did not look into, so I'll certainly trust you there.

Still, that would not solve the need to make Get_Socket blocking

@briot
Contributor

briot commented May 18, 2021

Also, what copies of the socket ? Internally, you use Object_Class, so this is a pointer to a single Object. This object must be the same all the time, since it might be a user-defined type with its own data (that's more related to the second PR, but we should not break that in this PR).

Continued work for U504-028.
@TurboGit
Collaborator Author

@briot : With the latest change I feel that we are really in better position now. My feeling is that we could merge this now and see if some other fixes are to be applied. Is that ok with you ?

@briot
Contributor

briot commented May 26, 2021

I have to review your change (I missed it when you committed it). As I remember, there were a number of points of discussion still pending, but things have been getting a bit confusing in this PR indeed. I'll review and test the patch tomorrow and will let you know.

Contributor

@briot briot left a comment

I am not really comfortable with all the "exception when others => null" which simply ignore exceptions. I wonder what these will hide...

There are a number of cases where we do DB.Unregister(...); DB.Free_Or_Defer(...). I thought the latter already handles de-registration, so it seems the former call is unneeded (and inefficient, since we need to lock the PO).

I still would like to argue for adding a Use_Socket procedure that will automatically take care of the Get_Socket+Release_Socket. It will clean up the AWS code a bit, but more importantly it allows users to do things safely in their own code. In my case, I have a queue of messages I want to send to users, but this is done asynchronously. So the queue has the socket UID and the message to send. The only way to be task-safe is to use Get_Socket+Release_Socket, but those are currently hidden from users. A Use_Socket that encapsulates that is also safer to use...
The code looks like

   procedure Use_Socket
      (Id       : UID;
       Callback : not null access procedure (WS : in out Object'Class))
   is
      WS : Object_Class;
   begin
      DB.Get_Socket (Id, WS);
      if WS /= null then
         begin
            Callback (WS.all);
         exception
            when others =>
               DB.Release_Socket (WS);
               raise;
         end;

         DB.Release_Socket (WS);
      end if;
   end Use_Socket;

Thanks for the clean up of Sending.

I am testing the patch (merging takes me a very long time since I have a number of additional patches, mostly from #159, so I have a local copy of the AWS files to which I backport the changes). That doesn't look like the most efficient way to do things though. Once we merge this one and perhaps #159, I'll try to get back to an easier setup.

@@ -225,10 +235,9 @@ package body AWS.Net.WebSocket.Registry is
Signal : Boolean := False; -- Transient signal, release Not_Emtpy
S_Signal : Boolean := False; -- Shutdown is in progress
New_Pending : Boolean := False; -- New pending socket
New_State : Boolean := False; -- A sokcet has been released
Contributor

sokcet => socket

end;

else
-- The socket is not registerred anymore, just leave now
Contributor

registerred => registered

@briot
Contributor

briot commented May 27, 2021

My testsuite never completed, full of

14:59:19.561  [GUI.WEBSOCKETS@ERROR/00001458AA649000] On_Close PROGRAM_ERROR System.Tasking.Protected_Objects.Operations.Protected_Entry_Call: potentially blocking operation
0x1d7f09e System.Tasking.Protected_Objects.Operations.Protected_Entry_Call at s-tpobop.adb:512
0xe6cf0a Aws.Net.Websocket.Registry.Use_Socket at aws-net-websocket-registry.adb:1663
0x1b1422a Gui.Websockets.Active_Websockets.RemoveN at gui-websockets.adb:597
0x1b14465 Gui.Websockets.Do_Close at gui-websockets.adb:586
0xe66420 Aws.Net.Websocket.Registry.Message_Reader.On_Ws at aws-net-websocket.adb:482
0xe6cf2c Aws.Net.Websocket.Registry.Use_Socket at aws-net-websocket-registry.adb:1666
0xe6d087 Aws.Net.Websocket.Registry.Message_ReaderT at aws-net-websocket-registry.adb:421
0x1d7bdad System.Tasking.Stages.Task_Wrapper at s-tassta.adb:1201
[/lib/x86_64-linux-gnu/libpthread.so.0]
0x1458b1dec6d9
[/lib/x86_64-linux-gnu/libc.so.6]

This is the call to the entry Get_Socket (it used to be a protected procedure). As I mentioned before, I think making this an entry was correct though, so the logic error is in my code. That however will delay the testing... I will come back to you shortly. Worth noting that this change might impact users though (again, I think the new behavior is the correct one).

@briot
Contributor

briot commented May 27, 2021

The implementation of Get_Socket is wrong I think (blocking forever).
It has a barrier on the New_State entry, which you immediately reset to False. So imagine the following scenario:

  • task 1 gets the socket 1
  • task 2 also gets the socket 1, and blocks (it is Sending)
  • task 3 gets the socket 2
  • task 4 also gets the socket 2, and blocks (it is Sending)
  • then task 3 releases the socket 2. This sets New_State to True, which unblocks either task 2 or task 4. Let's assume task 2. It starts executing Get_Socket, sets New_State to False, then discovers socket 1 is still Sending, so goes back to sleep. But task 4 cannot enter Get_Socket, since New_State is now False !

The proper implementation is to replace New_State with Available : Natural. In Release_Socket, you set Available to Get_Socket'Count so that all currently blocked tasks will get a chance to unblock. Again, see Pat's latest blog https://blog.adacore.com/on-the-benefits-of-families
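
The scenario above can be sketched as a small stand-alone program. This is one possible shape of the counter-based wake-up, under stated assumptions: it uses a private Retry entry and requeue rather than putting the barrier on Get_Socket itself, and Registry, Retry, In_Use, Socket_Id and Worker are hypothetical names. Only Get_Socket, Release_Socket and Available come from the discussion. It is not the code in this PR.

```ada
with Ada.Text_IO;

procedure Get_Socket_Demo is

   type Socket_Id is range 1 .. 2;
   type Flags is array (Socket_Id) of Boolean;

   protected Registry is
      entry Get_Socket (Id : Socket_Id);
      --  Blocks until socket Id is free, then marks it as in use

      procedure Release_Socket (Id : Socket_Id);
   private
      entry Retry (Id : Socket_Id);
      In_Use    : Flags := (others => False);
      Available : Natural := 0;
      --  Number of waiters still allowed to re-check after a release
   end Registry;

   protected body Registry is

      entry Get_Socket (Id : Socket_Id) when True is
      begin
         if In_Use (Id) then
            requeue Retry;       --  wait for some socket to be released
         else
            In_Use (Id) := True;
         end if;
      end Get_Socket;

      entry Retry (Id : Socket_Id) when Available > 0 is
      begin
         Available := Available - 1;
         if In_Use (Id) then
            requeue Retry;       --  not our socket: go back to waiting
         else
            In_Use (Id) := True;
         end if;
      end Retry;

      procedure Release_Socket (Id : Socket_Id) is
      begin
         In_Use (Id) := False;
         --  Give every task queued right now exactly one chance to
         --  re-check, instead of a single Boolean that the first
         --  awakened task would reset.
         Available := Retry'Count;
      end Release_Socket;

   end Registry;

   task type Worker (Id : Socket_Id);

   task body Worker is
   begin
      for I in 1 .. 100 loop
         Registry.Get_Socket (Id);
         delay 0.001;            --  pretend to send something
         Registry.Release_Socket (Id);
      end loop;
   end Worker;

begin
   declare
      W1, W2 : Worker (1);       --  two tasks competing for socket 1
      W3, W4 : Worker (2);       --  two tasks competing for socket 2
   begin
      null;  --  the block exits once all four workers have terminated
   end;
   Ada.Text_IO.Put_Line ("all workers completed");
end Get_Socket_Demo;
```

With the default FIFO entry queuing, a requeued caller goes to the back of the Retry queue, so each of the tasks counted by Retry'Count is examined once per release and the counter then drops back to zero. A single Boolean barrier, as described above, can leave the task that actually wants the released socket asleep.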

@briot
Contributor

briot commented May 27, 2021

And this really shows the AWS testsuite for websocket is lacking... I believe the code I sent in #209 (comment) shows this issue

@briot
Contributor

briot commented Jun 16, 2021

Hi Pascal, any news on this patch ? did you manage to fix the issue with requeue ?

@TurboGit
Collaborator Author

Hi Manu, no progress so far but work is restarting on this issue.

@TurboGit
Collaborator Author

Hi Manu, with your reproducer we cannot reproduce issues. I'm wondering if you are using some specific options (we use only -fsanitize=thread) ? Or maybe we are not looking at the right place or for the right report ? Can you describe more your procedure ? TIA.

@briot
Contributor

briot commented Jun 25, 2021

We use the same simple switch that Dmitriy proposed in #214, namely -fsanitize=thread.
One big difference is our use of the configuration pragma Detect_Blocking which raises errors on potentially blocking operations done from protected objects (not sure it plays a role here).

The latest error I reported in Get_Socket might not have been detected by the test. I discovered it when I was looking at why things are executing much more slowly now, and then it was by reading the code. The scenario I described then could be used to write another dedicated test. It should show up if you run it long enough with enough tasks, but it isn't clear how to detect from an automatic test that things are blocking for too long.

I am assuming you tried the test I sent (#209 (comment)) without any of the patches in this discussion (which did improve things, no mistake, even if not perfect yet).

I'll try to double-check again.
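
For readers unfamiliar with it, Detect_Blocking is a standard Ada configuration pragma (Annex H of the reference manual). A minimal sketch of how it is commonly enabled with GNAT follows. The file name gnat.adc is GNAT's default configuration-pragma file; that the server discussed here sets the pragma in that file is an assumption, since the thread does not say where it is set.

```ada
--  gnat.adc: configuration pragmas applied to the whole build.
--  (Assumed location; the thread does not say where the pragma is set.)
pragma Detect_Blocking;
```

With this pragma active, a potentially blocking operation performed during a protected action, such as an entry call, raises Program_Error. That matches the "potentially blocking operation" trace reported on May 27, where Get_Socket had become an entry.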

@briot
Contributor

briot commented Sep 15, 2021

Hi Pascal and Dmitry

Any news on this issue? We are almost there, the patch is almost done and definitely improves things. I think it is mostly the implementation of Get_Socket that is wrong, based on my message from May 27.

@anisimkov
Contributor

I have not seen proof that the current AWS sources are not thread safe. Manu, if I am missing something, could you show a reproducer?

@TurboGit
Collaborator Author

We are almost there, the patch is almost done and definitely improves things.

Agreed, so may I propose that we merge this part, see where we are, and then get to the last issue, if any. At this point the discussion is long and the path is not trivial, so I think moving forward step by step will certainly help us converge. How does that sound to you, Manu?

@briot
Contributor

briot commented Sep 15, 2021

Agreed, the discussion is indeed too long; GitHub is not quite up to handling such long patches, it seems. I'll open a separate issue afterwards.
Give me an hour or so, I am trying to come up with a reproducer for the initial issue at least.

@anisimkov
Contributor

anisimkov commented Sep 15, 2021

I don't think we can merge this pull request. I see a lot of
+Execution of ./turl2 terminated by unhandled exception
+raised PROGRAM_ERROR : finalize/adjust raised exception
+Call stack traceback locations:
+0x4137de 0x868add 0x413803 0x41480c 0x7f5b1b2e50b1 0x4107dc 0xfffffffffffffffe
in regression tests. And again, what are we trying to fix? Where is the proof of a defect?

@briot
Contributor

briot commented Sep 15, 2021

Here is a reproducer that shows issues in thread sanitizer. To be honest, I am not sure whether those are related to this issue (#209) or to the one reported by Maxim in #138, though both need to be fixed anyway.

Dmitry, just take the time to run the demos/websockets under any sanitizer/valgrind tool you want, and you'll see the issue reported by Maxim in #138. This might need to be fixed before the issues in #209 are apparent (I have run locally with workarounds for #138 for a few years now, so indeed those are no longer an issue for me)

As the long discussion here shows, there are a lot of concurrency issues in the current websocket code that we can see just by reading the code. Pascal's patch fixes a lot of them (except the one in Get_Socket I reported in May -- I kind of lost the context from then, but we'll get to that after Pascal merges the initial set of cleanups)

@briot
Contributor

briot commented Sep 15, 2021

(I have deleted my example, I think it contained errors -- still working on a reproducer)
The demos/websockets shows at least one error in thread sanitizer, as a starting point

@TurboGit
Collaborator Author

I don't think we can merge this pull request. I see a lot of
+Execution of ./turl2 terminated by unhandled exception
+raised PROGRAM_ERROR : finalize/adjust raised exception

So we have to do some clean-up first indeed. But still I'd like to move forward on this. Many improvements have been made in this PR.

@briot
Contributor

briot commented Sep 15, 2021

An actual reproducer now. It shows some sanitizer errors in close_socket, which I think are related to this thread. It also shows PROTOCOL_ERROR, which I think are likely errors in my test, though I must admit it isn't obvious to me right now.

--  main.adb
with Ada.Unchecked_Deallocation;
with AWS.Config.Set;
with AWS.Net.WebSocket.Registry.Control;
with AWS.Server;
with AWS.Services.Dispatchers.URI;
with AWS.Status;
with Client;

procedure Main is
   procedure Unchecked_Free is new Ada.Unchecked_Deallocation
      (Client.Client_Type, Client.Client_Access);

   Arr     : Client.Client_Array (1 .. 160);
   Conf    : AWS.Config.Object := AWS.Config.Get_Current;
   Handler : AWS.Services.Dispatchers.URI.Handler;
begin
   AWS.Config.Set.Reuse_Address (Conf, True);

   AWS.Net.WebSocket.Registry.Control.Start;
   AWS.Net.WebSocket.Registry.Register
      ("/streams", Client.Create_Socket'Access);
   AWS.Server.Start
      (Web_Server => Client.Server,
       Config     => Conf,
       Dispatcher => Handler);

   for A in Arr'Range loop
      Arr (A) := new Client.Client_Type (Id => A);
   end loop;

   delay 20.0;

   Client.Finish := True;

   for A in Arr'Range loop
      Arr (A).Terminated;

      while not Arr (A)'Terminated loop
         delay 0.0001;
      end loop;

      Unchecked_Free (Arr (A));
   end loop;

   AWS.Net.WebSocket.Registry.Control.Shutdown;
   AWS.Server.Shutdown (Client.Server);
end Main;

--  client.ads
with AWS.Net.WebSocket;
with AWS.Server;
with AWS.Status;
package Client is
   task type Client_Type (Id : Positive) is
      entry Terminated;
   end Client_Type;

   type Client_Access is access Client_Type;
   type Client_Array is array (Natural range <>) of Client_Access;

   Finish : Boolean := False with Atomic;
   --  Request termination of tasks

   Server : AWS.Server.HTTP;

   --  Server-side sockets
   type My_Socket is new AWS.Net.WebSocket.Object with null record;
   overriding procedure On_Message (Self : in out My_Socket; Msg : String);
   overriding procedure On_Close (Self : in out My_Socket; Msg : String);
   overriding procedure On_Error (Self : in out My_Socket; Msg : String);

   function Create_Socket
     (Socket : AWS.Net.Socket_Access;
      Request : AWS.Status.Data) return AWS.Net.WebSocket.Object'Class;

end Client;

--  client.adb
with Ada.Unchecked_Deallocation;
with Ada.Exceptions;
with Ada.Numerics.Float_Random;
with Ada.Text_IO;
with AWS.Net.WebSocket.Registry;
with AWS.Server.Status;
package body Client is
   Debug : constant Boolean := False;

   type Client_Socket is new AWS.Net.WebSocket.Object with null record;
   overriding procedure On_Message (Self : in out Client_Socket; Msg : String);

   overriding procedure On_Message
      (Self : in out Client_Socket; Msg : String) is
   begin
      Ada.Text_IO.Put_Line ("Client received" & Msg);
   end On_Message;

   procedure Unchecked_Free is new Ada.Unchecked_Deallocation
      (AWS.Net.WebSocket.Object'Class, AWS.Net.WebSocket.Object_Class);

   task body Client_Type is
      use AWS.Net.WebSocket;
      Sock : AWS.Net.WebSocket.Object_Class;
      Gen  : Ada.Numerics.Float_Random.Generator;
   begin
      Ada.Numerics.Float_Random.Reset (Gen);
      while not Finish loop
         Sock := new Client_Socket;  --  client-side socket
         AWS.Net.WebSocket.Connect
            (Sock.all,
             AWS.Server.Status.Local_URL (Server) & "/streams");

         Sock.Send ("some message" & Id'Image);

         delay 0.1 * Duration (Ada.Numerics.Float_Random.Random (Gen));

         Sock.Close ("no longer needed" & Id'Image);
         Unchecked_Free (Sock);

      end loop;

      accept Terminated;

   exception
      when E : others =>
         Ada.Text_IO.Put_Line (Ada.Exceptions.Exception_Information (E));
         accept Terminated;
   end Client_Type;

   overriding procedure On_Message (Self : in out My_Socket; Msg : String) is
   begin
      if Debug then
         Ada.Text_IO.Put_Line ("On_Message: " & Msg);
      end if;

      --  Let all recipients know about the new message.
      --  How efficient is that ? :-)
      AWS.Net.WebSocket.Registry.Send
         (AWS.Net.WebSocket.Registry.Recipient'
            (AWS.Net.WebSocket.Registry.Create ("/streams")),
          "someone joined the party and sent a message " & Msg);
   end On_Message;

   overriding procedure On_Error (Self : in out My_Socket; Msg : String) is
   begin
      if Debug then
         Ada.Text_IO.Put_Line ("On_Error: " & Msg);
      end if;
   end On_Error;

   overriding procedure On_Close (Self : in out My_Socket; Msg : String) is
   begin
      if Debug then
         Ada.Text_IO.Put_Line ("On_Close: " & Msg);
      end if;
   end On_Close;

   function Create_Socket
     (Socket : AWS.Net.Socket_Access;
      Request : AWS.Status.Data) return AWS.Net.WebSocket.Object'Class
   is
      --  strange code here to create our own socket, see issue 138
   begin
      if Debug then
         Ada.Text_IO.Put_Line ("Create_Socket");
      end if;
      return My_Socket'   --  server-side socket
         (AWS.Net.WebSocket.Object (AWS.Net.WebSocket.Create (Socket, Request))
          with null record);
   end Create_Socket;

end Client;

The project file is

with "aws.gpr";
project Default is
   for Main use ("main.adb");

   package Compiler is
      for Switches ("Ada") use
         ("-O2", "-fsanitize=thread", "-fno-omit-frame-pointer");
   end Compiler;

   package Linker is
      for Switches ("Ada") use ("-fsanitize=thread"); --  , "-fuse-ld=gold");
   end Linker;
end Default;

And here is how I compiled a fresh clone of AWS:

make setup LAL=false XMLADA=false THREAD_SANITIZER=true
make
make install prefix=`pwd`/install
cd reproducer
GPRPATH=`pwd`/../install/lib/gpr gprbuild -Pdefault.gpr

@anisimkov
Contributor

Manu, your example gives strange output at the end even without thread sanitizer

raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 35HTTP/1.1 101 Switching Protocols

raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 35HTTP/1.1 101 Switching Protocols

raised AWS.CLIENT.PROTOCOL_ERROR : Invalid accept from server

raised AWS.CLIENT.PROTOCOL_ERROR : �;someone joined the party and sent a message some message 37HTTP/1.1 101 Switching Protocols

double free or corruption (out)
Aborted (core dumped)

@briot
Contributor

briot commented Sep 15, 2021

Maybe you have an idea of what's wrong. I did not find anything obvious in my code, so this might be memory freed too early in AWS, for instance, which is kind of what we are looking for here.

@anisimkov
Contributor

Ok. I'll try to understand.
