Add a where clause to support soft delete #326
Conversation
@fabriziomello curious to know what your thoughts are on this
There is a
How can this be prevented?
Thanks, removed the workspace file
The where clause supplied by the user is intended to delete data, so users need to verify that the data being deleted is what they want to delete. A dry run of pg_repack shows precisely which data is going to remain. It is recommended to review the dry-run output carefully before running the real command.
We have been running this repack at Instacart weekly on a few tables for the last 4 months to clean up zombie tuples. It runs on a cron, without any manual intervention.
@ankitml Can you add unit tests to ensure this is working as intended?
Why
Many databases in the wild are bloated not only by dead tuples but also by zombie tuples, i.e. data that stays in a table only because it is hard to clean up and prune old data. pg_repack is everyone's favourite tool for cleaning up dead tuples. With a new flag, it can also easily clean up data that is no longer needed:
--where-clause="deleted_at IS NOT NULL"
The where clause is generic and can reference foreign tables as well. To keep data that was updated in the last 90 days in mytable:
--table="<mytable>" --where-clause="updated_at > NOW() - Interval '90 days'"
This PR adds data cleanup and soft-delete support to pg_repack.
pg_repack --dbname="ankitmittal" --table="test_repack" --echo --elevel=DEBUG --where-clause="deleted_at IS NOT NULL"
pg_repack --dbname="ankitmittal" --table="test_repack" --echo --elevel=DEBUG --where-clause="updated_at < NOW() - Interval '90 days'"
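Before running either command for real, it may help to count how many rows the predicate matches and how many it does not. The following is a minimal sketch only: it assumes an in-memory SQLite database as a stand-in for PostgreSQL, and `preview_where_clause` and the sample rows are hypothetical, not part of pg_repack.

```python
import sqlite3


def preview_where_clause(conn, table, where_clause):
    """Return (matching, not_matching) row counts for a predicate.

    Hypothetical helper, not part of pg_repack. The table name and the
    predicate are interpolated as-is, so only pass trusted input.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    matching = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_clause}"
    ).fetchone()[0]
    # Rows for which the predicate is false or NULL count as not matching.
    return matching, total - matching


# Sample data: two soft-deleted rows and two live rows (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_repack (id INTEGER PRIMARY KEY, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO test_repack (id, deleted_at) VALUES (?, ?)",
    [(1, None), (2, "2024-01-01"), (3, None), (4, "2024-02-01")],
)

matching, not_matching = preview_where_clause(
    conn, "test_repack", "deleted_at IS NOT NULL"
)
print(f"matching={matching} not_matching={not_matching}")
```

Against a real database the same two counts could be taken with plain SELECT COUNT(*) queries; comparing them with the dry-run output is one way to check that the predicate selects the intended rows.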
This has been discussed before (#279) with a different approach.
Why not
It could cause data loss if used incorrectly.
If used properly, this cleans up logical bloat, i.e. data that is thrown into the database but never cleaned up. An alternative here is to perform a repack-like online table swap manually.
What it doesn't do
For the sake of simplicity, data that arrives while the repack is running is left as it is.
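For comparison, the manual table swap mentioned under "Why not" could look roughly like the sketch below. It is an illustration only: it uses in-memory SQLite rather than PostgreSQL, `swap_keeping` is a hypothetical helper, and the predicate here names the rows to keep. Unlike pg_repack, it does nothing about writes that arrive during the copy, and a table created from a SELECT does not carry over indexes or constraints.

```python
import sqlite3


def swap_keeping(conn, table, keep_where):
    """Rebuild `table` with only the rows matching keep_where, then swap names.

    Hypothetical helper for illustration. Identifiers and the predicate are
    interpolated as-is, so only pass trusted input.
    """
    new, old = f"{table}_new", f"{table}_old"
    # 1. Copy the rows to keep into a fresh table.
    conn.execute(f"CREATE TABLE {new} AS SELECT * FROM {table} WHERE {keep_where}")
    # 2. Swap the names so the fresh table takes over.
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
    # 3. Drop the original, bloated table.
    conn.execute(f"DROP TABLE {old}")
    conn.commit()


# Sample data: one soft-deleted row and two live rows (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, None), (2, "2024-01-01"), (3, None)],
)
conn.commit()

swap_keeping(conn, "mytable", "deleted_at IS NULL")
remaining = [row[0] for row in conn.execute("SELECT id FROM mytable ORDER BY id")]
print(remaining)
```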