-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More robust sync from git #72
Comments
And doing a clone also avoid issue if someone do a dirty rebase or a push -f, as it can happen from time to time. |
Ok so I was wrong on the container --sync, it also do clone, and indeed, /tmp/ is wiped on restart, so that part is good. However, the git clone/pull is done without verifying the return code and python do not trigger a exception, so a failure here would not stop the sync. So my point on robustness still apply I think. |
So the issue here is failing to check for an exception correctly? Will look into it. Also, are you saying that you think that maybe there was a container misfire, where /tmp/ was not writeable, and sync deleted the elections for that reason? |
Well, I suspect something made elekto remove the election, and the only path I can imagine is related to the sync. However, I elaborated a false theory as I didn't see elekto clone the git repo at start, and not just with the webhook, so just ignore the 1st comment (I should maybe edit it). I can't find why only 1 election got removed. If my theory on sync silently failing was right, all elections would have been lost, not just one. So either the theory is false, or my theory is right and there is a unknown reason that could explain why the 3 others elections (the empty ones) were still here. The sync function use 2 try/except, so a failure to delete a table with session.delete wouldn't stop the sync, and the code would continue. Maybe it is worth testing what happen if elekto is restart when meta is not here, and if delete behave as we expect (cause sql alchemy wouldn't delete the object, just mark it as deleted, and I wonder if that could have unexpected side effects ) |
As for the misfire, a more likely explanation would be that the container restarted when github was down, or DNS issue, or something like this. |
So I just verified (not on prod), but failure to checkout git repository (her, triggered with a mistyped URL) result in all elections to be removed. So that do not explain why only 2022-TOC disappeared. |
Ok so seems I am confused. The election (from table election) itself was not removed. But the ballots were. |
So my hypothesis is likely right, I just badly explained. In my DB dump from the upgrade day, I have the 3 elections in the elections table, but no ballots. What I think happened is the following: Every 2 weeks, our cluster is upgraded, which move pods around, by restarting them. I suspect that during one of those upgrade, the elekto pod ran a However, on our setup, the checkout is on /tmp, which is wiped by the reboot. So the code will continue, see there is no meta files, and erase all elections, which in turn remove all ballots. Then 2 weeks later, a new upgrade is triggered. The pod got restarted again, and this time, the git clone work, so elekto add back the elections, but the ballots are gone. And this is exactly what I see in my dump. The elections are here in the table, the ballots are not, and the ballot_id sequence is not 0. So the DB wasn't removed, just the ballots, and the elections are here as they were added later on next sync (and where all created at the same time). Ergo, I think the sync operation should account for git errors, and stop the sync without preventing elekto from starting. |
I don't think only 2022-TOC disappeared. I think all three elections disappeared; it's that only 2022-TOC was not backed up. |
Yu, that's what I meant when I said I will submit a new PR, as #75 is not going to fit. |
Addressing part of this with #76 |
The other parts of this issue have been fixed. Closing this and tracking the remaining enhancement via #74 |
TLDR:
The current system of using a git checkout with either clone or and syncing is IMHO not robust enough, and should be replaced by a sync function that create a temporary directory, do a git clone, and then use that clone for the sync. This would prevent the issue that I dubbed "the mysterious case of the knative 2022-TOC disparition".
So earlier this week, @jberkus asked us to investigate why the elekto instance we setup on Openshift had lost 1 election after the upgrade to 0.6. Upon further debugging yesterday, we found this was due to the ballot table being empty, and after ruling out data corruptions and the like, we stopped investigating. But we found this happened before the update, which puzzled us a lot.
I restarted the investigation this evening and found I think the cause.
So ballots are not removed from the db, unless a election is removed. Upon looking the database backups we had, I noticed that indeed, the election of 2022 wasn't here, and so my hypothesis at that time was that the election removal triggered the ballots cleanup. However, it was unclear how the election could be removed in the first place. Upon checking the code, I found that the sync function do remove elections if they are not in the yaml file for the election, and that sync is triggered by the webhook.
But the yaml file should never be absent, because the webhook populate them the git repo if it doesn't exist, correct ? Yes, but... the sync function can also be run from the CLI using --sync, and this is done when starting the container
I wasn't aware that the meta directory had to persist (because that's not how I would have done). So I placed it on /tmp, which is likely wiped on restart, or cleaned somehow, for example, when we have a upgrade as it happen on a regular basis.
So the 1st obvious fix is to tell "make sure META_PATH is persistent". But I think a better fix would be to not need META_PATH in the first place, and have the sync() function do a git clone (maybe a shallow one) each time it is called, in a temporary directory and operate from here.
Besides avoiding the problem I faced, this also would be more robust in case of container crash, for example, in the middle of a git clone that leave some lock file and requires a human to fix. It would also simplify the code a bit, by removing 1 option.
The text was updated successfully, but these errors were encountered: