HA : Hosts persist in the Suspect state in HA cluster with ShareMountPoint #10166

Open
Luskan777 opened this issue Jan 7, 2025 · 6 comments
@Luskan777

ISSUE TYPE
  • Bug Report
COMPONENT NAME
HA, KVM
CLOUDSTACK VERSION
4.20
CONFIGURATION

Zone type: Advanced Network
Primary Storage: SharedMountPoint

OS / ENVIRONMENT

Hosts OS: Ubuntu 22.04 (HPE ProLiant BL460c Gen10)
Management Server OS: Ubuntu 22.04
Out-of-band management driver: IPMI

SUMMARY

Hello, I configured out-of-band management on my hosts, but their HA state always alternates between Suspect and Degraded. I have already checked the IPMI communication and it works, and the servers are powered on and operational.

image

STEPS TO REPRODUCE
  1. Configure the KVM hosts
  2. Configure the HA provider as KVMHAProvider
  3. Configure out-of-band management with the IPMI driver
  4. Enable HA and check the HA state
EXPECTED RESULTS
HA hosts with AVAILABLE state
ACTUAL RESULTS

Management server logs:

@MSLOG@:2025-01-07 00:29:25,698 DEBUG [o.a.c.h.HAManagerImpl] (pool-4-thread-21:[]) HA state post-transition:: new state=[Suspect], old state=[Checking], for resource id=[3], status=[true], ha config state=[Suspect].
@MSLOG@:2025-01-07 00:29:25,707 DEBUG [o.a.c.h.HAManagerImpl] (pool-4-thread-21:[]) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:3
@MSLOG@:2025-01-07 00:29:41,622 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-2:[ctx-28440d8d]) HA state post-transition:: new state=[Checking], old state=[Suspect], for resource id=[2], status=[true], ha config state=[Checking].
@MSLOG@:2025-01-07 00:29:41,629 DEBUG [o.a.c.h.HAManagerImpl] (BackgroundTaskPollManager-2:[ctx-28440d8d]) Transitioned host HA state from:Suspect to:Checking due to event:PerformActivityCheck for the host id:2

2025-01-07 15:44:06,928 DEBUG [o.a.c.u.p.ProcessRunner] (pool-2-thread-11:[]) Process standard output for command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.16.20.21 -p 623 -U cloudstack -P ***** chassis power status]: [Chassis Power is on
].
2025-01-07 15:44:06,928 DEBUG [o.a.c.u.p.ProcessRunner] (pool-2-thread-11:[]) Process standard error output command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.16.20.21 -p 623 -U cloudstack -P ***** chassis power status]: [Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error response 0xc1 from Get PICMG Properities
Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20
Invalid completion code received: Invalid command
Discovered IPMB address 0x0
].
2025-01-07 15:44:06,929 DEBUG [o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-2-thread-11:[]) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.16.20.21 -p 623 -U cloudstack -P PASSWORD  chassis power status] was successful and got the result [Chassis Power is on].

KVM hosts logs:

2025-01-07 15:49:52,534 DEBUG [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) Checking heart beat with KVMHAChecker for host IP [IP_SERVER] in pools []
2025-01-07 15:49:52,534 WARN  [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) All checks with KVMHAChecker for host IP [IP_SERVER] in pools [] considered it as dead. It may cause a shutdown of the host.
@DaanHoogland
Contributor

@Luskan777, can you run the ipmitool command from the management server (MS) log by hand and examine the output?
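
One way to script that manual check is sketched below in Python. It rebuilds the ipmitool invocation shown in the management server log and reads the power state from standard output. The helper names `build_power_status_cmd`, `parse_power_status`, and `run_power_status` are made up for this illustration and are not part of CloudStack, and the "Chassis Power is off" string is assumed by analogy with the "on" line in the log.

```python
import subprocess


def build_power_status_cmd(host, user, password, port=623):
    """Build the same ipmitool invocation that appears in the MS log."""
    return [
        "/usr/bin/ipmitool", "-I", "lanplus", "-R", "1", "-v",
        "-H", host, "-p", str(port), "-U", user, "-P", password,
        "chassis", "power", "status",
    ]


def parse_power_status(stdout):
    """Return True for 'on', False for 'off', None if no status line is found."""
    for line in stdout.splitlines():
        text = line.strip().lower()
        if text == "chassis power is on":
            return True
        if text == "chassis power is off":  # assumed wording, see lead-in
            return False
    return None


def run_power_status(host, user, password):
    """Run ipmitool and return (exit code, parsed power state, stderr text)."""
    proc = subprocess.run(
        build_power_status_cmd(host, user, password),
        capture_output=True, text=True, timeout=30,
    )
    return proc.returncode, parse_power_status(proc.stdout), proc.stderr


# Example call (needs ipmitool installed and a reachable BMC):
# code, powered_on, stderr = run_power_status("10.16.20.21", "cloudstack", "<password>")
```

With `-v`, ipmitool writes probe messages such as the PICMG and VSO lines to standard error. In the log above the driver still reported the command as successful, so the standard output line is the one to compare.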

The setting kvm.ha.activity.check.max.attempts controls how many times the HA provider will retry the activity check before deciding to fence or reinstate the host. The default is 10. The line

@MSLOG@:2025-01-07 00:29:25,707 DEBUG [o.a.c.h.HAManagerImpl] (pool-4-thread-21:[]) Transitioned host HA state from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host id:3

should therefore appear 10 times.

You could also try a force reconnect of the host.
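
A quick way to check that count is to tally the transition lines per host and event. The following is a minimal Python sketch that assumes the log line format shown in the excerpts above; `count_events` is a hypothetical helper name, and the two sample lines are copied from the issue.

```python
import re
from collections import Counter

# Matches the "Transitioned host HA state" lines written by HAManagerImpl.
TRANSITION = re.compile(
    r"Transitioned host HA state from:(?P<old>\w+) to:(?P<new>\w+) "
    r"due to event:(?P<event>\w+) for the host id:(?P<host>\d+)"
)


def count_events(lines):
    """Count HA transition events, keyed by (host id, event name)."""
    counts = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            counts[(int(match["host"]), match["event"])] += 1
    return counts


sample = [
    "@MSLOG@:2025-01-07 00:29:25,707 DEBUG [o.a.c.h.HAManagerImpl] (pool-4-thread-21:[]) "
    "Transitioned host HA state from:Checking to:Suspect due to "
    "event:TooFewActivityCheckSamples for the host id:3",
    "@MSLOG@:2025-01-07 00:29:41,629 DEBUG [o.a.c.h.HAManagerImpl] "
    "(BackgroundTaskPollManager-2:[ctx-28440d8d]) Transitioned host HA state "
    "from:Suspect to:Checking due to event:PerformActivityCheck for the host id:2",
]

counts = count_events(sample)
print(counts[(3, "TooFewActivityCheckSamples")])  # 1 for this two-line sample
```

To run it against a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.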

@slavkap
Contributor

slavkap commented Jan 8, 2025

Hi @Luskan777, as far as I know, Host HA is only available for NFS, StorPool and Linstor storage pools. That's why no pools are listed in your log:

2025-01-07 15:49:52,534 DEBUG [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) Checking heart beat with KVMHAChecker for host IP [IP_SERVER] in pools []
2025-01-07 15:49:52,534 WARN  [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) All checks with KVMHAChecker for host IP [IP_SERVER] in pools [] considered it as dead. It may cause a shutdown of the host.
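
The effect described here can be pictured with a toy model in Python. This is not CloudStack code: the set of supported types simply restates the comment above, and `pools_to_check` and `heartbeat_verdict` are hypothetical names. It only shows why an empty pool list ends in the "considered it as dead" warning: with nothing to check, no check can report the host as alive.

```python
# Pool types for which Host HA is available, as stated in the comment above.
HA_POOL_TYPES = {"NFS", "StorPool", "Linstor"}


def pools_to_check(pools):
    """Keep only the pools whose type supports the heartbeat check."""
    return [pool for pool in pools if pool["type"] in HA_POOL_TYPES]


def heartbeat_verdict(pools, heartbeat_ok):
    """Return (checked pools, alive); alive needs at least one passing check."""
    checked = pools_to_check(pools)
    alive = any(heartbeat_ok(pool) for pool in checked)  # any([]) is False
    return checked, alive


# A cluster whose only primary storage is SharedMountPoint (for example GFS2).
# The pool name is illustrative.
cluster = [{"name": "gfs2-primary", "type": "SharedMountPoint"}]
checked, alive = heartbeat_verdict(cluster, heartbeat_ok=lambda pool: True)
print(checked, alive)  # [] False
```

Even though the heartbeat callback always succeeds, the host is judged dead because the SharedMountPoint pool is filtered out before any check runs.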

@Luskan777
Author

> Hi @Luskan777, as far as I know, Host HA is only available for NFS, StorPool and Linstor storage pools. That's why no pools are listed in your log:
>
> 2025-01-07 15:49:52,534 DEBUG [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) Checking heart beat with KVMHAChecker for host IP [IP_SERVER] in pools []
> 2025-01-07 15:49:52,534 WARN  [kvm.resource.KVMHAChecker] (pool-1067-thread-1:[]) (logid:) All checks with KVMHAChecker for host IP [IP_SERVER] in pools [] considered it as dead. It may cause a shutdown of the host.

Hi @slavkap ,

Thanks for your reply. I'm using SharedMountPoint with GFS2, so that probably explains the problem.

I would like to develop support for pools that use SharedMountPoint. Looking at the code, it seems possible, but I'm new to CloudStack development. @slavkap and @DaanHoogland, could you tell me where to start? Pointers to relevant files or documentation would be a great help :)

@DaanHoogland
Contributor

> I would like to develop support for pools that use SharedMountPoint. Looking at the code, it seems possible, but I'm new to CloudStack development. @slavkap and @DaanHoogland, could you tell me where to start? Pointers to relevant files or documentation would be a great help :)

@Luskan777, there are several starting points;

git blame is also a great way to find people who can help ;)

As a code entry point, I would start by looking at HAManagerImpl.

Hope this helps.

@slavkap
Contributor

slavkap commented Jan 9, 2025

@Luskan777, as @DaanHoogland mentioned, the cwiki has two guides that could help:
  • Host HA
  • High Availability Developer guide

Probably only a change here is needed

to support SharedMountPoint, provided the KVM heartbeat script works for SharedMountPoint storage.

@Luskan777
Author

Hi @slavkap @DaanHoogland,

Thanks for your replies, they help a lot. I'm going to start developing HA support for SharedMountPoint.

I found another issue describing the same problem, #9750. I believe this work will resolve that one too.

@Luskan777 Luskan777 changed the title HA : Hosts persist in the Suspect state in HA cluster with KVMHAProvider HA : Hosts persist in the Suspect state in HA cluster with ShareMountPoint Jan 21, 2025