memory leak because the database is corrupted #872

Closed
fengwei0328 opened this issue Dec 12, 2024 · 3 comments

@fengwei0328

The issue occurs when Docker accesses the metadata.db database. The corruption manifests as the creation of an invalid pageNode in the following code:

bbolt/cursor.go

Lines 172 to 185 in 92c7414

		var ref = &c.stack[len(c.stack)-1]
		if ref.isLeaf() {
			break
		}
		// Keep adding pages pointing to the first element to the stack.
		var pgId common.Pgid
		if ref.node != nil {
			pgId = ref.node.inodes[ref.index].Pgid()
		} else {
			pgId = ref.page.BranchPageElement(uint16(ref.index)).Pgid()
		}
		p, n := c.bucket.pageNode(pgId)
		c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})

Here the page's flags are 16 (0x10) instead of the leafPage value of 0x02, so ref.isLeaf() never returns true and the loop keeps creating invalid pageNodes and appending them to c.stack.
I provided a detailed explanation in Moby:
moby/moby#49074

I tried adding a check before the new page is appended to c.stack. I first check p.flags. If the value is 16, which is invalid here, I do not append the page and simply break out of the loop. This causes an immediate panic, which surfaces the error up the call stack instead of leaking memory.

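		p, n := c.bucket.pageNode(pgid)
		// pageNode returns a nil page when it returns a materialized node,
		// so guard against nil before reading the flags.
		if p != nil && p.flags == 16 {
			break
		}
		c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})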
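
For illustration, here is a minimal, self-contained sketch of the same idea: validate the page type during the descent and return an error instead of continuing. It is a toy model, not bbolt code. `toyPage`, `descendFirst`, and `errInvalidPage` are hypothetical names, and the page table is an in-memory map. The flag constants mirror the values in bbolt's page format, where 0x10 (decimal 16) is the flag for freelist pages.

```go
package main

import (
	"errors"
	"fmt"
)

// Page flag values as used in bbolt's page format.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

// toyPage is a hypothetical stand-in for a page: its flags plus child page ids.
type toyPage struct {
	flags    uint16
	children []uint64
}

var errInvalidPage = errors.New("unexpected page type during descent")

// descendFirst follows the first child from root until it reaches a leaf.
// It returns the page ids visited, and an error if it meets a page that is
// neither a leaf nor a branch with at least one child.
func descendFirst(pages map[uint64]toyPage, root uint64) ([]uint64, error) {
	var stack []uint64
	id := root
	for {
		p, ok := pages[id]
		if !ok {
			return stack, fmt.Errorf("page %d not found: %w", id, errInvalidPage)
		}
		stack = append(stack, id)
		switch {
		case p.flags&leafPageFlag != 0:
			return stack, nil
		case p.flags&branchPageFlag != 0 && len(p.children) > 0:
			id = p.children[0]
		default:
			return stack, fmt.Errorf("page %d has flags %#x: %w", id, p.flags, errInvalidPage)
		}
	}
}

func main() {
	pages := map[uint64]toyPage{
		1: {flags: branchPageFlag, children: []uint64{2}},
		2: {flags: freelistPageFlag}, // corrupted: a branch element points at a non-leaf, non-branch page
	}
	_, err := descendFirst(pages, 1)
	fmt.Println(err)
}
```

Returning an error rather than breaking out of the loop makes the failure explicit, but it would require changing the cursor's call sites, so treat this only as a model of the check and not as a proposed patch.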
@ahrtr
Member

ahrtr commented Dec 13, 2024

Thanks for reporting the issue. The db file is somehow corrupted.

The steps to fix the corrupted db file:

$ ./bbolt surgery clear-page-elements --pageId 9 --from-index 3 --to-index 4 --output ./new.db ~/tmp/etcd/bbolt/metadata.db
Please consider executing `./bbolt surgery freelist abandon ...`
All elements in [3, 4) in page 9 were cleared

$ ./bbolt surgery freelist abandon --output ./new2.db ./new.db  
The freelist was abandoned in both meta pages.
It may cause some delay on next startup because bbolt needs to scan the whole db to reconstruct the free list.

$ ./bbolt check ./new2.db 
OK

The final new2.db should work. Please feel free to rename it to metadata.db.

The root cause isn't clear yet. It might be due to mistakenly reusing a non-free page. We will try to refactor the freelist management and improve the test coverage in the next major release. Refer to #789.
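
If these steps need to be repeated on several machines, they can be scripted. The following is a rough sketch that drives the same three commands from Go. It uses only the subcommands and flags shown above. The `repairSteps` and `run` helpers, the `-run` switch, and the `./bbolt` binary location are assumptions made for this example, and the page id and index range (9, 3, 4) are specific to this particular file.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// repairSteps builds the argument lists for the three bbolt invocations
// shown above. Each step writes to a new file and leaves its input untouched.
func repairSteps(src, cleared, repaired string, pageID, from, to int) [][]string {
	return [][]string{
		{"surgery", "clear-page-elements",
			"--pageId", strconv.Itoa(pageID),
			"--from-index", strconv.Itoa(from),
			"--to-index", strconv.Itoa(to),
			"--output", cleared, src},
		{"surgery", "freelist", "abandon", "--output", repaired, cleared},
		{"check", repaired},
	}
}

// run executes each step with the given binary and stops at the first failure.
func run(bin string, steps [][]string) error {
	for _, args := range steps {
		cmd := exec.Command(bin, args...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("%s %s: %w", bin, strings.Join(args, " "), err)
		}
	}
	return nil
}

func main() {
	steps := repairSteps("metadata.db", "new.db", "new2.db", 9, 3, 4)
	// Dry run by default: print the commands instead of executing them.
	// The "-run" switch is a hypothetical convention of this sketch.
	if len(os.Args) < 2 || os.Args[1] != "-run" {
		for _, args := range steps {
			fmt.Println("./bbolt", strings.Join(args, " "))
		}
		return
	}
	if err := run("./bbolt", steps); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because every step writes to a separate output file, the original metadata.db is left as it was, and it is worth keeping a backup of it until the repaired file has been verified.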

@fengwei0328
Author

fengwei0328 commented Dec 14, 2024

Your response helped me successfully fix metadata.db. I will keep monitoring for the root cause.
If I run into another corrupted database, how do I determine the pageId and index range?

@ahrtr
Member

ahrtr commented Dec 14, 2024

If it's another corrupted database, how do I determine the pageId and index range?

We have several surgery commands, and different cases may need different commands. Unfortunately, figuring out how to fix a corrupted db file requires some expertise in bbolt.

surgeryCmd.AddCommand(newSurgeryRevertMetaPageCommand())
surgeryCmd.AddCommand(newSurgeryCopyPageCommand())
surgeryCmd.AddCommand(newSurgeryClearPageCommand())
surgeryCmd.AddCommand(newSurgeryClearPageElementsCommand())
surgeryCmd.AddCommand(newSurgeryFreelistCommand())
surgeryCmd.AddCommand(newSurgeryMetaCommand())

Also, the surgery commands are not guaranteed to fix every corrupted db file, just as a doctor cannot guarantee a cure for every patient.
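
As one possible starting point for inspecting a suspect page, the sketch below reads a raw page header straight from the file and reports its type. It does not determine the index range by itself. It only helps spot a page whose type is not what its parent expects, such as the flags value of 16 reported above. It assumes the 16-byte bbolt page header layout (id, flags, count, overflow), a little-endian host, and a 4096-byte page size. The real page size is recorded in the meta page and may differ. `parsePageHeader`, `typeName`, and `readPageHeader` are hypothetical helper names, not part of bbolt.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"strconv"
)

const pageHeaderSize = 16

// pageHeader mirrors the fixed-size header at the start of each page.
type pageHeader struct {
	ID       uint64
	Flags    uint16
	Count    uint16
	Overflow uint32
}

// parsePageHeader decodes a header from the first 16 bytes of b,
// assuming little-endian byte order.
func parsePageHeader(b []byte) (pageHeader, error) {
	if len(b) < pageHeaderSize {
		return pageHeader{}, fmt.Errorf("need %d bytes, got %d", pageHeaderSize, len(b))
	}
	return pageHeader{
		ID:       binary.LittleEndian.Uint64(b[0:8]),
		Flags:    binary.LittleEndian.Uint16(b[8:10]),
		Count:    binary.LittleEndian.Uint16(b[10:12]),
		Overflow: binary.LittleEndian.Uint32(b[12:16]),
	}, nil
}

// typeName maps page flags to a readable page type.
func typeName(flags uint16) string {
	switch {
	case flags&0x01 != 0:
		return "branch"
	case flags&0x02 != 0:
		return "leaf"
	case flags&0x04 != 0:
		return "meta"
	case flags&0x10 != 0:
		return "freelist"
	}
	return "unknown"
}

// readPageHeader reads the header of page pageID from the file at path.
func readPageHeader(path string, pageID uint64, pageSize int64) (pageHeader, error) {
	f, err := os.Open(path)
	if err != nil {
		return pageHeader{}, err
	}
	defer f.Close()
	buf := make([]byte, pageHeaderSize)
	if _, err := f.ReadAt(buf, int64(pageID)*pageSize); err != nil {
		return pageHeader{}, err
	}
	return parsePageHeader(buf)
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: pagehdr <db-file> <page-id>")
		return
	}
	id, err := strconv.ParseUint(os.Args[2], 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	h, err := readPageHeader(os.Args[1], id, 4096) // 4096 is an assumed page size
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("page %d: type=%s flags=%#x count=%d overflow=%d\n",
		h.ID, typeName(h.Flags), h.Flags, h.Count, h.Overflow)
}
```

Run it against a copy of the database, not the live file. Free pages can still carry stale headers from earlier use, so an odd-looking header is only a hint, not proof of corruption.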
