Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC: BIOS version & settings management #138

Closed
wants to merge 13 commits into from
Closed

Conversation

aobort
Copy link
Contributor

@aobort aobort commented Oct 3, 2024

Proposed Changes

This PR contains PoC implementation for management of server's BIOS version and settings:

  • API types
  • Controller
  • Service managing tasks (scan, apply settings, update version)

This approach fully separate concrete job implementation from the reconciliation flow.

To discuss

BootOrder

Do we need to move boot order field from server to serverBIOS CR? Semantically, yes. As it's one of BIOS settings.

BIOS settings in .status

From my perspective, it reasonable to reflect in .status.bios.settings only those settings which are set in .spec.bios.settings. This will make comparison of .spec and .status much easier.

Storing bios settings

Do we need custom type for bios setting:

type SettingsEntry struct {
    // +required
    Name string `json:"name"`
    
    // +required
    Value string `json:"value"`
        
    // +optional
    Unsupported bool `json:"unsupported,omitempty"`
}

to avoid attempting to apply settings which are not supported in the specified BIOS version? I.e.:

apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerBIOS
metadata:
  name: foo
spec:
  scanPeriodMinutes: 30
  serverRef:
    name: bar
  bios:
    # changing both version and 'unsupported' flag in the same time
    version: 1.0.0  --> 2.0.0
    settings:
    - name: legacyboot
      value: enabled
      unsupported: false  --> true

upgrading from version 1.0.0 to 2.0.0 will lead to the bios setting responsible for legacy boot is deprecated/unsupported, thus it is explicitly marked as unsupported and will not be considered during settings applying.

@aobort aobort force-pushed the poc-firmware-controllers branch from a6733e4 to b87e059 Compare October 7, 2024 07:18
@aobort aobort requested a review from stefanhipfel October 9, 2024 09:20
@aobort aobort requested a review from stefanhipfel October 11, 2024 07:25
@aobort aobort force-pushed the poc-firmware-controllers branch 4 times, most recently from 0aea100 to 958928c Compare November 4, 2024 10:52
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 7, 2024
@aobort aobort force-pushed the poc-firmware-controllers branch 2 times, most recently from b682a57 to 3468011 Compare November 11, 2024 06:26
func (s *ServerHTTP) registerRoutes() {
s.mux.HandleFunc("/scan", s.scanHandler)
s.mux.HandleFunc("/settings-apply", s.settingsApplyHandler)
s.mux.HandleFunc("/version-update", s.versionUpdateHandler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: would be great to have an endpoint to get all current active tasks from the sync.Map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhipfel thanks for idea.
Totally agree. Also we might need to implement queues for requests, rate limiting, retrying on client side and tons of other stuff. Apart from that, I'd also like to generate http API from spec without hardcoding endpoints.

But for now, I think the main goal is to come to agreement - will we proceed further with this design or not.

// if referred server is not in Available state - stop reconciliation
if server.Status.State != metalv1alpha1.ServerStateAvailable {
return ctrl.Result{RequeueAfter: r.RequeueInterval}, nil
}
Copy link
Contributor

@stefanhipfel stefanhipfel Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to stop the server reconciliation in the meantime?
e.g.: do not allow serverClaim

Copy link
Contributor Author

@aobort aobort Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not need to stop server's reconciliation - we do not care what is the state of the server. But we need to stop serverBIOS reconciliation, bc both BIOS version and settings update will lead to server reboot. So for now we decided that we'll work only with servers which are in "Available" state.

If the server is in available state, then we'll need to set it to "maintenance". But the servers' maintenance topic is still open. Thus I decide to not to mention it at all for now.

#76

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but during a serverBios update, someone could still claim the server. Any available server can be claimed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but during a serverBios update, someone could still claim the server. Any available server can be claimed

@stefanhipfel that's a good point. However, I do not really like the idea that the serverBIOS controller would change server's state to exclude it from reconciliation. For number of reasons the server's "owner" might want to postpone the update. Especially in terms of version upgrade.
I could add the "Maintenance" state for server and update the controller, so it will check whether server's state is "Maintenance" instead of "Available".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aobort yes i think so as well, that the serverBIOS should not change a server's state.

Alternative idea: the server reconciler checks on the serverBIOS state.

@stefanhipfel
Copy link
Contributor

overall I think the POC looks ok

// if referred server is not in Available state - stop reconciliation
if server.Status.State != metalv1alpha1.ServerStateAvailable {
return ctrl.Result{RequeueAfter: r.RequeueInterval}, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but during a serverBios update, someone could still claim the server. Any available server can be claimed

type ServerBIOSStatus struct {
// BIOS contains a bios version and settings.
// +optional
BIOS BIOSSettings `json:"bios,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could add the current state of the bios task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhipfel Taking into account:

  • possible separating of metal-operator and bios/fw operator
  • possible separation of the bios/fw operator and task runner

I'd suggest to proceed as is for now. Then we could add some endpoint to retrieve active tasks for concrete server or something like this. But it will require to store the references for running tasks etc. So I'd avoid adding it right now due to necessity to design this better.

@stefanhipfel
Copy link
Contributor

@aobort overall I think the POC looks good.

Next step would be to gather some improvements and test it with real examples!

@stefanhipfel stefanhipfel requested a review from gehoern November 20, 2024 14:41
@stefanhipfel
Copy link
Contributor

stefanhipfel commented Dec 4, 2024

@defo89 @aobort @afritzler what do u guys think about taking the bios controller out of the metal operator, same as the boot/ipam operator? We would be more flexible in the future and changes could be implemented independently of the metal operator

@aobort
Copy link
Contributor Author

aobort commented Dec 5, 2024

@defo89 @aobort @afritzler what do u guys think about taking the bios controller out of the metal operator, same as the boot/ipam operator? We would be more flexible in the future and changes could be implemented independently of the metal operator

I proposed it in the beginning of this "epic". Thus, I agree.

@aobort aobort force-pushed the poc-firmware-controllers branch 2 times, most recently from d06686f to b02ff6e Compare December 5, 2024 08:36
@aobort aobort marked this pull request as ready for review December 5, 2024 09:14
@aobort aobort force-pushed the poc-firmware-controllers branch from b02ff6e to ed1a9fc Compare December 6, 2024 10:44
aobort added 11 commits December 6, 2024 12:47
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
Signed-off-by: Artem Bortnikov <[email protected]>
@aobort aobort force-pushed the poc-firmware-controllers branch from ed1a9fc to 971fffa Compare December 6, 2024 10:52
@aobort
Copy link
Contributor Author

aobort commented Dec 6, 2024

@stefanhipfel I've updated the controller: now it checks whether server is in "Maintenance" state. After #76 solved, we'll update server bios controller respectively. Adding the task state into the serverBIOS object's status is prematurely, from my perspective. Let's first run the stuff on real hardware and look how the "instant" operations (like scan and settings apply) work. Also added initial test for BIOS versions retrieval.
I'd say it's ready to be merged.

cc: @afritzler @defo89

@aobort aobort closed this Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change documentation Improvements or additions to documentation size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants