chore: fix flaky test when leader peer terminates #981

milmazz · 2023-10-31T14:07:21Z

This commit should fix a flaky test that happens when the leadership changes because the previous leader terminates for a given reason.

The issue happens because the signal emitted in the unit test is a GenServer.stop/1, but the peer process is under the test supervisor, so, in certain circumstances, the supervisor is fast enough to spawn a new process. That process ends as the new leader.

A way to reproduce the flaky test locally could be like this:

for i in (seq 1 10)
  mix test --seed 856757

  if test $status -eq 2
    break
  end
end

I'm using the fish shell here; the previous script might not work on your favorite shell.

The previous command should produce an output like:

Excluding tags: [:skip]

...................................................................................................................................................................................................................................................................................................................
Finished in 4.2 seconds (1.4s async, 2.7s sync)
24 doctests, 7 properties, 278 tests, 0 failures, 2 excluded

Randomized with seed 856757
Excluding tags: [:skip]

................................................................................................................................................................................................................................................................................................

  1) test leadership changes when a peer terminates (Oban.Peers.GlobalTest)
     test/oban/peers/global_test.exs:24
     Expected truthy, got nil
     code: assert Enum.find([peer_1, peer_2] -- [leader], &Global.leader?/1)
     arguments:

         # 1
         [#PID<0.3478.0>]

         # 2
         &Oban.Peers.Global.leader?/1

     stacktrace:
       test/oban/peers/global_test.exs:44: anonymous fn/3 in Oban.Peers.GlobalTest."test leadership changes when a peer terminates"/1
       (oban 2.16.3) test/support/case.ex:73: Oban.Case.with_backoff/4
       test/oban/peers/global_test.exs:43: (test)

..................
Finished in 5.3 seconds (1.4s async, 3.8s sync)
24 doctests, 7 properties, 278 tests, 1 failure, 2 excluded

Randomized with seed 856757

Here, you can see that the PID of the new leader is not in the target list (Enum.find), so, as I mentioned before, one theory is that the test supervisor spawned a new process, and that process ended being the leader.

This commit should fix a flaky test that happens when the leadership changes because the previous leader terminates for a given reason. The issue happens because the signal emitted in the unit test is a `GenServer.stop/1`, but the peer process is under the test supervisor, so, in certain circumstances, the supervisor is fast enough to spawn a new process. That process ends as the new leader. A way to reproduce the flaky test locally could be like this: ```console for i in (seq 1 10) mix test --seed 856757 if test $status -eq 2 break end end ``` I'm using the fish shell here; the previous script might not work on your favorite shell. The previous command should produce an output like: ```console Excluding tags: [:skip] ................................................................................................................................................................................................................................................................................................................... Finished in 4.2 seconds (1.4s async, 2.7s sync) 24 doctests, 7 properties, 278 tests, 0 failures, 2 excluded Randomized with seed 856757 Excluding tags: [:skip] ................................................................................................................................................................................................................................................................................................ 1) test leadership changes when a peer terminates (Oban.Peers.GlobalTest) test/oban/peers/global_test.exs:24 Expected truthy, got nil code: assert Enum.find([peer_1, peer_2] -- [leader], &Global.leader?/1) arguments: # 1 [#PID<0.3478.0>] # 2 &Oban.Peers.Global.leader?/1 stacktrace: test/oban/peers/global_test.exs:44: anonymous fn/3 in Oban.Peers.GlobalTest."test leadership changes when a peer terminates"/1 (oban 2.16.3) test/support/case.ex:73: Oban.Case.with_backoff/4 test/oban/peers/global_test.exs:43: (test) .................. Finished in 5.3 seconds (1.4s async, 3.8s sync) 24 doctests, 7 properties, 278 tests, 1 failure, 2 excluded Randomized with seed 856757 ``` Here, you can see that the PID of the new leader is not in the target list (`Enum.find`), so, as I mentioned before, one theory is that the test supervisor spawned a new process, and that process ended being the leader.

sorentwo

This is fantastic! Thanks for tracking this down 👏

test/oban/peers/global_test.exs

This commit should fix a flaky test that happens when the leadership changes because the previous leader terminates for a given reason. The issue happens because the signal emitted in the unit test is a `GenServer.stop/1`, but the peer process is under the test supervisor, so, in certain circumstances, the supervisor is fast enough to spawn a new process. That process ends as the new leader. A way to reproduce the flaky test locally could be like this: ```console for i in (seq 1 10) mix test --seed 856757 if test $status -eq 2 break end end ``` I'm using the fish shell here; the previous script might not work on your favorite shell. The previous command should produce an output like: ```console Excluding tags: [:skip] ................................................................................................................................................................................................................................................................................................................... Finished in 4.2 seconds (1.4s async, 2.7s sync) 24 doctests, 7 properties, 278 tests, 0 failures, 2 excluded Randomized with seed 856757 Excluding tags: [:skip] ................................................................................................................................................................................................................................................................................................ 1) test leadership changes when a peer terminates (Oban.Peers.GlobalTest) test/oban/peers/global_test.exs:24 Expected truthy, got nil code: assert Enum.find([peer_1, peer_2] -- [leader], &Global.leader?/1) arguments: # 1 [#PID<0.3478.0>] # 2 &Oban.Peers.Global.leader?/1 stacktrace: test/oban/peers/global_test.exs:44: anonymous fn/3 in Oban.Peers.GlobalTest."test leadership changes when a peer terminates"/1 (oban 2.16.3) test/support/case.ex:73: Oban.Case.with_backoff/4 test/oban/peers/global_test.exs:43: (test) .................. Finished in 5.3 seconds (1.4s async, 3.8s sync) 24 doctests, 7 properties, 278 tests, 1 failure, 2 excluded Randomized with seed 856757 ``` Here, you can see that the PID of the new leader is not in the target list (`Enum.find`), so, as I mentioned before, one theory is that the test supervisor spawned a new process, and that process ended being the leader.

milmazz force-pushed the chore/fix-flaky-test branch from 10feb3f to 9bdf8e2 Compare October 31, 2023 14:08

sorentwo approved these changes Oct 31, 2023

View reviewed changes

test/oban/peers/global_test.exs Outdated Show resolved Hide resolved

Update test/oban/peers/global_test.exs

4008321

sorentwo merged commit 38a93d2 into oban-bg:main Oct 31, 2023
2 checks passed

sorentwo added area:oss Related to Oban OSS kind:chore Internal improvements labels Oct 31, 2023

milmazz deleted the chore/fix-flaky-test branch October 31, 2023 15:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: fix flaky test when leader peer terminates #981

chore: fix flaky test when leader peer terminates #981

milmazz commented Oct 31, 2023

sorentwo left a comment

chore: fix flaky test when leader peer terminates #981

chore: fix flaky test when leader peer terminates #981

Conversation

milmazz commented Oct 31, 2023

sorentwo left a comment

Choose a reason for hiding this comment