
[BUG] There seems to be a problem with the memory predicate calculation #21301

Open
66545shiwo opened this issue Sep 26, 2024 · 14 comments
@66545shiwo

Issue description / What happened:
In my test I started 7 VMs at the same time on the same host; one of them failed to schedule with the following error:

no enough resource: test001, requested: 61440, total: 507321, free: 34133

The host also has an existing 40 GB VM, plus the 6 VMs that succeeded (6 × 60 = 360 GB), so the host uses about 400 GB in total.
Yet the error reports total 495 GB (507321 MiB) and free 33 GB (34133 MiB).
After I deleted the failed VM and recreated it on the same host, it started fine.
It looks like the memory predicate calculation goes wrong when several VMs are started at the same time (7 create-VM API calls issued in quick succession).

I added some debug logging to see how freeMemSize is computed, in pkg/scheduler/cache/candidate/hosts.go:

func (h *HostDesc) GetFreeMemSize(useRsvd bool) int64 {
	// Debug prints added for this issue.
	fmt.Printf("FreeMemSize is %d\n", h.FreeMemSize)
	fmt.Printf("GuestReservedMemSizeFree is %d\n", h.GuestReservedMemSizeFree())
	fmt.Printf("GetPendingUsage().Memory is %d\n", h.GetPendingUsage().Memory)
	return reservedResourceAddCal(h.FreeMemSize, h.GuestReservedMemSizeFree(), useRsvd) - int64(h.GetPendingUsage().Memory)
}

Log from starting the 7 VMs:

[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
FreeMemSize is %s 456021
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 0
[info 2024-09-26 02:59:27 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 23.813787ms
[info 2024-09-26 02:59:27 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 3.380576ms
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:27 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 84.126592ms
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:27 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
FreeMemSize is %s 394581
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 61440
[info 2024-09-26 02:59:27 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 100.925152ms
FreeMemSize is %s 394581
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 122880
[info 2024-09-26 02:59:27 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 25.887185ms
[info 2024-09-26 02:59:27 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319567768" cost: 1.887189ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 2.075336ms
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 92.61507ms
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319567768 usage, count: 0
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319567768", Usage:(*models.SPendingUsage)(0xc0010e1f40), countLock:(*sync.Mutex)(0xc001fd45f8), count:0, cancelCh:(chan string)(0xc0012f0f00)}
FreeMemSize is %s 341333
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 122880
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 112.005501ms
FreeMemSize is %s 341333
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 20.433912ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319567821" cost: 1.700734ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.770653ms
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 84.108955ms
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319567821 usage, count: 0
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319567821", Usage:(*models.SPendingUsage)(0xc001017780), countLock:(*sync.Mutex)(0xc00176de70), count:0, cancelCh:(chan string)(0xc0012f0960)}
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319567910" cost: 1.808238ms
FreeMemSize is %s 279893
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 100.336524ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.646004ms
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:239)] original schedtag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 models.applyResourceSchedPolicy(schedpolicies.go:263)] updated sched tag map[string]*compute.SchedtagConfig{}
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319568031" cost: 2.000487ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.849613ms
[info 2024-09-26 02:59:28 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 96.215717ms
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319567910 usage, count: 0
[info 2024-09-26 02:59:28 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319567910", Usage:(*models.SPendingUsage)(0xc001330a00), countLock:(*sync.Mutex)(0xc000fb6dc0), count:0, cancelCh:(chan string)(0xc0012f1260)}
FreeMemSize is %s 218453
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320
[warning 2024-09-26 02:59:28 predicates.(*PredicateHelper).GetResult(predicates.go:89)]Filter Result: candidate: "c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", filter: "host_memory", is_ok: false, reason: "no enough resource: test001, requested: 61440, total: 507321, free: 34133", error: <nil>
[error 2024-09-26 02:59:28 manager.(*Task).onError(task_queue.go:400)] Remove Session on error: 1727319568537
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler" cost: 106.819947ms
[warning 2024-09-26 02:59:28 middleware.Logger.func1(log.go:82)]GIN: | 400 | 127.713168ms | 10.111.42.159 | POST    /scheduler
Error #01: genericScheduler.Schedule: No resource are avaliable that match all of the following predicates: filter by host_memory(-1), session_id="1727319568537"
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319568133" cost: 1.881527ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.695563ms
[info 2024-09-26 02:59:28 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7?session=1727319568313" cost: 1.986213ms
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.949777ms
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 2.02622ms
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 2.119358ms
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.888347ms
[info 2024-09-26 02:59:29 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:29 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 89.961968ms
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319568133 usage, count: 0
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319568133", Usage:(*models.SPendingUsage)(0xc001102340), countLock:(*sync.Mutex)(0xc000fe9298), count:0, cancelCh:(chan string)(0xc000139740)}
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319568031 usage, count: 0
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319568031", Usage:(*models.SPendingUsage)(0xc001331540), countLock:(*sync.Mutex)(0xc002941f68), count:0, cancelCh:(chan string)(0xc0007af560)}
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:436)] Cancel session 1727319568313 usage, count: 0
[info 2024-09-26 02:59:29 models.(*SessionPendingUsage).StartTimer.func1(pending_usage.go:445)] delete session usage &models.SessionPendingUsage{HostId:"c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", SessionId:"1727319568313", Usage:(*models.SPendingUsage)(0xc0013bc440), countLock:(*sync.Mutex)(0xc00200b8d0), count:0, cancelCh:(chan string)(0xc001323bc0)}
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.837837ms
[info 2024-09-26 02:59:29 handler.timer.func1(handler.go:55)] Handler "/scheduler/clean-cache/c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7" cost: 1.673463ms
[info 2024-09-26 02:59:32 cache.(*schedulerCache).loadObjects(cache.go:155)] Start load CandidateCache - Hosts, period: 30s, ttl: 0s
[info 2024-09-26 02:59:32 cache.(*schedulerCache).loadObjects.func1(cache.go:159)] End load CandidateCache - Hosts, elapsed 88.784308ms
[info 2024-09-26 02:59:39 schedtag.(*dataManager).syncOnce(sync.go:118)] Schedtag data start sync
[info 2024-09-26 02:59:39 schedtag.(*dataManager).syncOnce(sync.go:135)] Schedtag finish sync, elapsed 26.708036ms

Here is the debug output from this method.
1. Creating the first VM
FreeMemSize is %s 456021
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 0
2. Creating the second VM
FreeMemSize is %s 394581
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 61440

When the second VM is being scheduled, only two situations would be reasonable:

  • If the first VM has already started and its memory is accounted for, FreeMemSize should drop by 60 GB to roughly 394581 and GetPendingUsage().Memory should be 0.
  • If the first VM has not started yet, FreeMemSize should stay at roughly 456021 and GetPendingUsage().Memory should be 61440.

What the log actually shows is that FreeMemSize has already dropped by 60 GB while GetPendingUsage().Memory is 61440 at the same time, so the first VM's memory ends up being subtracted twice (worked through in the sketch below).
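A quick check of the logged numbers makes the double counting concrete. The sketch below is just arithmetic on the values printed above (not scheduler code), ignoring the reserved-memory term:

package main

import "fmt"

func main() {
	// Values taken from the debug output above, all in MiB.
	freeBeforeAny := int64(456021)  // FreeMemSize before the 1st VM is scheduled
	freeAtSecond := int64(394581)   // FreeMemSize when the 2nd VM is scheduled
	pendingAtSecond := int64(61440) // GetPendingUsage().Memory at the same moment
	reqPerVM := int64(61440)        // each VM asks for 60 GiB

	// FreeMemSize has already dropped by exactly one VM's worth of memory ...
	fmt.Println(freeBeforeAny-freeAtSecond == reqPerVM) // true: 456021 - 394581 = 61440

	// ... yet the same VM is still counted in pending usage, so GetFreeMemSize
	// subtracts it a second time.
	fmt.Println(freeAtSecond - pendingAtSecond) // 333141 instead of roughly 394581
}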

Environment:
v3.10.14

  • OS (e.g. cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)
  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
66545shiwo added the bug label on Sep 26, 2024
@66545shiwo
Author

FreeMemSize is %s 456021
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 0

success: 0   pending: 0


FreeMemSize is %s 394581
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 61440

success: 1   pending: 1

FreeMemSize is %s 394581
GuestReservedMemSizeFree is %s 8192
GetPendingUsage().Memory is %s 122880

success: 1   pending: 2

FreeMemSize is %s 341333
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 122880

success: 2   pending: 2

FreeMemSize is %s 341333
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320

success: 2   pending: 3

FreeMemSize is %s 279893
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320

success: 3   pending: 3

FreeMemSize is %s 218453
GuestReservedMemSizeFree is %s 0
GetPendingUsage().Memory is %s 184320

success: 4   pending: 3

[warning 2024-09-26 02:59:28 predicates.(*PredicateHelper).GetResult(predicates.go:89)]Filter Result: candidate: "c5ffc6c1-f59b-44d4-8c95-53ae20ab38c7", filter: "host_memory", is_ok: false, reason: "no enough resource: 006机, requested: 61440, total: 507321, free: 34133", error: <nil>

You can see that in the last check FreeMemSize is 218453 and GetPendingUsage is 184320, with success: 4, pending: 3. At this point 218453 is enough to cover the 184320 of the 3 pending VM creations, yet that pending amount is subtracted once more before the comparison:

func (h *HostDesc) GetFreeMemSize(useRsvd bool) int64 {
	// Debug prints added for this issue.
	fmt.Printf("FreeMemSize is %d\n", h.FreeMemSize)
	fmt.Printf("GuestReservedMemSizeFree is %d\n", h.GuestReservedMemSizeFree())
	fmt.Printf("GetPendingUsage().Memory is %d\n", h.GetPendingUsage().Memory)
	return reservedResourceAddCal(h.FreeMemSize, h.GuestReservedMemSizeFree(), useRsvd) - int64(h.GetPendingUsage().Memory)
}
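For reference, plugging that last set of logged values into the GetFreeMemSize formula (the reserved term is 0 here) reproduces exactly the free: 34133 from the error message; this is a sketch of the arithmetic, not scheduler code:

package main

import "fmt"

func main() {
	// Last debug output before the failure, all in MiB.
	freeMemSize := int64(218453) // h.FreeMemSize
	pendingMem := int64(184320)  // h.GetPendingUsage().Memory (3 pending VMs)
	requested := int64(61440)    // the 7th VM

	free := freeMemSize - pendingMem
	fmt.Println(free)              // 34133, the "free" value in the error message
	fmt.Println(free >= requested) // false, so the host_memory predicate rejects the request
}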

@zexi
Member

zexi commented Sep 26, 2024

@66545shiwo There is probably an intermediate pending state: the VM has already been allocated but has not yet been reported back as running, so its memory keeps being counted. Try creating the VMs in 2 batches instead of this many at once; that should work fine.

@66545shiwo
Author

@zexi If that's the case, then even creating just 2 VMs concurrently on the same host has some probability of hitting this problem.
I also found another possible cause: the ordering of addSessionUsage and CancelPendingUsage can make the check fail, i.e. usage is added several times while the check runs before the corresponding cancel, in:

reservedResourceAddCal(h.FreeMemSize, h.GuestReservedMemSizeFree(), useRsvd) - int64(h.GetPendingUsage().Memory)

@66545shiwo
Author

I tested creating 2 large-memory VMs and the same problem occurs. @zexi

image

@yulongz

yulongz commented Sep 27, 2024

The GetFreeMemSize method in pkg/scheduler/cache/candidate/hosts.go:

func (h *HostDesc) GetFreeMemSize(useRsvd bool) int64 {
	return reservedResourceAddCal(h.FreeMemSize, h.GuestReservedMemSizeFree(), useRsvd) - int64(h.GetPendingUsage().Memory)
}

The schedule method in pkg/scheduler/manager/manager.go:

func (sm *SchedulerManager) schedule(info *api.SchedInfo) (*core.ScheduleResult, error) {
	// force sync clean expire cache before do schedule
	sm.ExpireManager.Trigger()

	log.V(10).Infof("SchedulerManager do schedule, input: %#v", info)
	task, err := sm.TaskManager.AddTask(sm, info)
	// After the line above runs, the task (VM creation) enters the scheduling state;
	// once IsGuestCreating evaluates the VM's state, CreatingGuestCount is incremented.
	if err != nil {
		return nil, err
	}

	// Until the next line runs, CreatingGuestCount and GetPendingUsage().Memory are inconsistent.
	// When several sessions create VMs concurrently on one host, GetFreeMemSize in
	// pkg/scheduler/cache/candidate/hosts.go is very likely to subtract part of the memory twice:
	// that memory already belongs to a creating guest, but GetPendingUsage().Memory has not yet
	// been reduced by it.
	sm.HistoryManager.NewHistoryItem(task) // this line calls CancelPendingUsage
	results, err := task.Wait()
	if err != nil {
		return nil, err
	}
	log.V(10).Infof("SchedulerManager finish schedule, selected candidates: %#v", results)
	return results, nil
}
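To make the window described in the comments above easier to see, here is a minimal, self-contained sketch of the interleaving; the variable names are invented for illustration and this is not the scheduler's actual code:

package main

import "fmt"

func main() {
	hostFree := int64(456021) // analogous to HostDesc.FreeMemSize, in MiB
	pendingMem := int64(0)    // analogous to GetPendingUsage().Memory
	req := int64(61440)       // one 60 GiB VM

	// t1: the scheduler places guest #1 and records its pending usage.
	pendingMem += req
	// t2: the candidate cache reloads; guest #1 is now also reflected in hostFree.
	hostFree -= req
	// t3: another request computes free memory *before* CancelPendingUsage runs:
	fmt.Println(hostFree - pendingMem) // 333141 -- guest #1 is subtracted twice
	// t4: CancelPendingUsage finally removes guest #1 from the pending usage.
	pendingMem -= req
	fmt.Println(hostFree - pendingMem) // 394581 -- consistent again
}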


github-actions bot commented Nov 4, 2024

If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.


@github-actions github-actions bot closed this as completed Nov 4, 2024
@66545shiwo
Author

The problem is not resolved.

@yulongz

yulongz commented Nov 11, 2024

> The problem is not resolved.

Add retries and wait until the pending usage state becomes consistent.

@zexi
Member

zexi commented Nov 27, 2024

@yulongz The fix PR has been merged. You can wait for our next release and test it then, or pull the latest code from the corresponding branch, build and package it yourself, and test it ahead of time. Thanks for the feedback.


github-actions bot commented Jan 4, 2025

If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.


@yulongz

yulongz commented Jan 7, 2025

> @yulongz The fix PR has been merged. You can wait for our next release and test it then, or pull the latest code from the corresponding branch, build and package it yourself, and test it ahead of time. Thanks for the feedback.

We merged #21675 onto 3.10.15. The original problem no longer reproduces, but it introduces a new one.

We created VMs exceeding the host's available memory, and the MemoryPredicate.Execute() check no longer takes effect; it fails to block them.

We added a fair amount of logging to show what happens. Although #21675 does make the pending cache data more accurate, the scheduler cache is not reloaded before every MemoryPredicate.Execute() call for a new VM, so when MemoryPredicate.Execute() runs, guest IDs may still be missing from the scheduler cache. See the attached log and screenshots for details.

Test details: we created 7 VMs, A through G, on a host whose free memory is enough for only 6 of them. We compared the state seen by MemoryPredicate.Execute() at each creation with the schedulerCache's state at that moment. The schedulerCache lags behind the creations: for example, when the 6th VM's MemoryPredicate.Execute() runs, the schedulerCache only reflects VMs A, B, C and D. As a direct result, even for the 7th VM the freeMemSize seen inside MemoryPredicate.Execute() is still much larger than reqMemSize, and the MemoryPredicate check no longer blocks anything (see the sketch after the screenshots).

scheduler-log.txt

image
image
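A minimal sketch of the effect described above, using made-up numbers (the real values are in the attached log): when the schedulerCache snapshot is missing recently created guests, the free memory it reports is too high and the predicate passes a VM that the host cannot actually hold.

package main

import "fmt"

func main() {
	// Hypothetical figures in MiB: a host with room for exactly 6 such VMs.
	hostFree := int64(380000)
	req := int64(61440) // 6*61440 = 368640 fits, 7*61440 = 430080 does not

	cachedGuests := 4 // guests A-D: all the schedulerCache knows about when VM G is checked
	actualGuests := 6 // guests A-F: what really exists on the host at that moment

	freeSeenByPredicate := hostFree - int64(cachedGuests)*req // 134240
	actualFree := hostFree - int64(actualGuests)*req          // 11360

	fmt.Println(freeSeenByPredicate >= req) // true  -> MemoryPredicate lets VM G through
	fmt.Println(actualFree >= req)          // false -> the host does not actually have the memory
}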

@zexi
Member

zexi commented Jan 9, 2025

So in the current situation only 6 VMs should be schedulable onto the host, but in practice 7 can be scheduled onto it?

@yulongz

yulongz commented Jan 9, 2025

> So in the current situation only 6 VMs should be schedulable onto the host, but in practice 7 can be scheduled onto it?

I explicitly created 7 VMs on this host, whose memory can only run 6 of them. In theory the memory check for the 7th should fail with a no-memory-resource error, but in practice the memory check passes and no insufficient-memory error is raised.

@yulongz

yulongz commented Jan 11, 2025

For now we are not going to merge #21675; the new problem it introduces is even harder to deal with.

We are sticking with the delay-and-retry approach, waiting until the pending usage becomes consistent. In a test creating more than two hundred VMs concurrently (25 hosts, 8 concurrent creations each), a small number of VMs (never more than the number of hosts) slow down the overall scheduling, but the error is essentially avoided. It has been running in production for a month with basically no failures.

Later we scaled out to 50 hosts with 200 to 400 concurrent VMs. Scheduling now takes longer: for 200 VMs, the time from the API request to the VM being reachable grew from about 2 minutes to 4 to 5 minutes. The pressure on the cluster also increased, hosts occasionally went offline, and contention for passthrough devices on the hosts got worse; we even saw errors that a passthrough device could not be found (fmt.Errorf("Can't found model %s on host %s", devConfig.Model, host.Id)), which never happened before.

Our temporary workaround for MemoryPredicate.Execute is shown below; IsolatedDevicePredicate.Execute is handled in a similar way.


func (p *MemoryPredicate) Execute(ctx context.Context, u *core.Unit, c core.Candidater) (bool, []core.PredicateFailureReason, error) {
	h := predicates.NewPredicateHelper(p, u, c)
	d := u.SchedData()

	reqMemSize := int64(d.Memory)
	maxRetries := 5
	retryInterval := time.Second * 1
	for retry := 0; retry <= maxRetries; retry++ {
		useRsvd := h.UseReserved()
		getter := c.Getter()
		totalMemSize := getter.TotalMemorySize(useRsvd)
		freeMemSize := getter.FreeMemorySize(useRsvd)
		h.SetCapacity(freeMemSize / reqMemSize)
		if freeMemSize >= reqMemSize {
			break
		} else {
			pendingUsageMemory := getter.GetPendingUsage().Memory
			fmt.Printf("reqMemSize is %d, freeMemSize is %d, totalMemSize is %d, pendingUsageMemory is %d, retry is %d \n",
				reqMemSize, freeMemSize, totalMemSize, pendingUsageMemory, retry)
			if pendingUsageMemory == 0 || retry == maxRetries {
				h.AppendInsufficientResourceError(reqMemSize, totalMemSize, freeMemSize)
				break
			}
			time.Sleep(retryInterval)
		}
	}
	return h.GetResult()
}
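As a note on the trade-off: with maxRetries = 5 and retryInterval = 1 s, this bounded polling adds at most about 5 extra seconds per candidate whenever pending usage has not yet converged, and it works around, rather than removes, the inconsistency between FreeMemSize and GetPendingUsage().Memory.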
