Adding enhancement #98 for SPIRE integration #100

Open
mpeters wants to merge 1 commit into master from spire_integration

mpeters (Member) commented Jun 23, 2023

Implements #98

mpeters force-pushed the spire_integration branch from a51cd7c to aa6b020 on June 23, 2023 21:19

mheese left a comment

Apart from the inline comments, I also have some more general questions:

  • is there a design for the work necessary on the SPIRE side?
  • similarly, is there a similar proposal for a keylime integration/plugin on the SPIRE side? If yes, can we cross-reference / link them please?


In order to accommodate this flow, this enhancement will consist of the following:

1. A new node-local, non-TLS API on the keylime agent responding on the `/info` path. It will return information about the keylime agent that will be used not only to identify the agent but also to perform a signature verification. A 3rd party can use the credential created by the agent in the TPM to sign a nonce, which can then be verified by the verifier. The new API will return the following information:

Particularly in light of #60, it would be nice to see a separation between node-local APIs and APIs that are called by the verifier.

Also, IMHO it would be nice to consider the following things for node-local APIs:

  • the server that serves verifier APIs and the server that serves node-local APIs should be separate
  • the server should be listening on a unix socket
  • very much optional, but in the spirit of being a cloud native project: use grpc or better ttrpc for these APIs like other cloud native services do

Member Author

I'm curious why we think they should be separate servers? A single binary which starts multiple processes? Multiple binaries? The latter would be a much bigger lift for packagers, etc.

very much optional, but in the spirit of being a cloud native project: use grpc or better ttrpc for these APIs like other cloud native services do

This is definitely worth doing. I don't know if the initial APIs will have them, but I'll make them versioned so we can add them later.


I'm curious why we think they should be separate servers? A single binary which starts multiple processes? Multiple binaries? The latter would be a much bigger lift for packagers, etc.

Sorry, that might not have been clear: I mean logical separation in the code (definitely not multiple binaries or processes), so that the server listening on the unix socket serves all the node-local APIs, and the one listening on TCP for the verifier serves all the existing APIs.
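
To make the suggestion concrete, here is a minimal sketch of a node-local listener on a unix socket, using only Python's standard library. It is an illustration rather than the agent's actual implementation: the class names, the `agent_uuid` field in the `/info` payload, and the socket permissions are all assumptions. The existing TCP server for the verifier would stay a separate server object in the same process and is not shown.

```python
import http.server
import json
import os
import socket
import socketserver


class NodeLocalHandler(http.server.BaseHTTPRequestHandler):
    """Handles node-local requests only; verifier-facing routes live elsewhere."""

    def do_GET(self):
        if self.path != "/info":
            self.send_error(404)
            return
        # Hypothetical payload: the real field set is defined by the enhancement.
        body = json.dumps({"agent_uuid": self.server.agent_uuid}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Unix-socket peers have no (host, port) pair, so skip the default logger.
        pass


class NodeLocalServer(socketserver.ThreadingMixIn, socketserver.UnixStreamServer):
    """Listens on a unix socket, separate from the TCP server for the verifier."""

    daemon_threads = True

    def __init__(self, socket_path, agent_uuid):
        self.agent_uuid = agent_uuid
        super().__init__(socket_path, NodeLocalHandler)
        # Assumption: restrict the socket to its owner; other policies are possible.
        os.chmod(socket_path, 0o600)


def query_info(socket_path):
    """Client side: what a co-located process (e.g. a SPIRE agent) might do."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        client.sendall(b"GET /info HTTP/1.0\r\n\r\n")
        raw = b""
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            raw += chunk
    _headers, _sep, body = raw.partition(b"\r\n\r\n")
    return json.loads(body)
```

Because the listener is bound to a filesystem path instead of a network address, access can be restricted with ordinary file permissions, which is the kind of additional control a host-local API allows.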


I have been reading this carefully, and I'd like to dig a little into Mike's assumptions. During the initial review of the proposal I was not at peace with the spire agent having to talk to both the agent and the TPM device.

I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security. My apologies for the extremely long writeup, and if I have a reasoning error in here please do point it out. I will then eat crow for having wasted your time :)

(A) we are positing a situation in which the keylime verifier is attesting the target node.
(A.1) i.e., the verifier has already established the correspondence between a node's UUID and its AK
(A.2) all attestation info, including the UUID and the AK, are available through the tenant API
(B) the SPIRE server's goal is to establish the target node's trustworthiness.
(B.1) first, the SPIRE server talks to the keylime verifier to download information about the target node.
(B.2) next the SPIRE server has to prove that the SPIRE agent on the node is co-located with the keylime agent.
(B.3) The simplest way to do this is to establish that both agents (keylime, spire) are talking to the same TPM.
(C) the SPIRE agent has access to all node information from the SPIRE server, as downloaded from the verifier.
(C.1) all the SPIRE agent has to do is to mount a challenge against the TPM device:
(C.2) have the TPM decrypt a challenge encrypted by the pubAK.
(C.3) at this point the SPIRE agent has [dis?]proven the fact that it's talking to the same TPM as the keylime agent. Interaction with the keylime agent was not necessary.
(D) once the SPIRE agent reports back to the server, the SPIRE certificates can be downloaded etc.
(D.1) a simple swizzle on certifying SPIRE agents would be for the SPIRE server to encrypt its certs with each TPM's AK, and have the agent use the actual SPIRE cert as the challenge.

Contributor

@galmasi @mpeters. Are we assuming the SPIRE agent is, for all intents and purposes, the equivalent of a keylime_tenant? This would require shipping TLS certificates to all nodes (including client-private.pem), which strikes me as a problem, both in terms of security and maintenance (TLS certificate renewal).

Maybe I am missing something, but it looks like we either do exactly what I describe here or we implement a new HTTP api for the verifier (which would have its own security side-effects). Am I missing something?


@galmasi I was reading your reply, and I have some questions/comments. I'm also not entirely happy with the SPIRE agent needing to talk to both the agent and the TPM, but for opposite reasons than you.

I have been reading this carefully, and I'd like to dig a little into Mike's assumptions. During the initial review of the proposal I was not at peace with the spire agent having to talk to both the agent and the TPM device.

From Mike's diagram the part that I don't particularly like is that (2) is going from the SPIRE agent to the TPM. It should be the keylime agent which is always querying the TPM.

I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security.

My point of view is the exact opposite on this: we should drop the requirement that the SPIRE agent is talking to the TPM.

Here is my reasoning behind this: SPIRE already has a TPM integration as of today. To promote keylime and make it more valuable to other use cases even apart from SPIRE (and I am actually working on one right now), the barrier for attestation needs to be lowered, and keylime needs to provide more value on top of this. In this case, the keylime agent is the one which interacts with the hardware trust module, and it happens to use the TPM at this point in time. It's the node-level abstraction of how to do these types of actions on the host. Furthermore, keylime does more than what SPIRE is currently doing with its TPM integration, and this is where the particular value add (IMHO) lies.

So I think for your (C.1) I would do the challenge through a node-level API on the keylime agent. It's extremely important though that this API is a host-local API (obviously). Admittedly, your approach can theoretically be considered more secure, as both components independently talk to the same TPM, which is the source of truth after all. However, for all practical purposes a host-local API basically does the same thing (and one can control and restrict further access to this socket with additional methods as well). In a nutshell that would provide the following components:

  • the keylime agent helping with identity verification for other components on a host (like SPIRE), and making the reference with its UUID to other keylime components
  • the keylime verifier providing the information that this host is not only authenticated but passes additional integrity checks (MB policies and IMA)
  • the keylime tenant being able to help with identity verification on the server side because it has the AIK which is needed for verifying the challenges

My apologies for the extremely long writeup, and if I have a reasoning error in here please do point it out. I will then eat crow for having wasted your times :)

It's a good discussion :)

Contributor

@mpeters So, accessing the verifier APIs will require at least `cacert.crt`, `ca-public.pem`, `client-cert.crt`, `client-private.pem`, `client-public.pem`, `server-cert.crt` and `server-public.pem`. While I do agree this does not represent a security problem per se, the need to redistribute the TLS certificates over a (potentially) large number of nodes might represent an (operational) problem. I do wonder whether SPIRE agents trust the SPIRE server as a boundary condition and, if so, whether the communication with the keylime verifier could be delegated to the server.

Member Author

I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security.

My point of view is the exact opposite on this: we should need to drop the requirement that the SPIRE agent is talking to the TPM.

This was my first design @mheese but after starting it and talking it over with others I noticed it was flawed. The purpose of SPIRE attestation via keylime is twofold:

  1. Provide proof of which node the SPIRE agent is running on.
  2. Provide proof that the node from #1 has passed Keylime attestation

If the SPIRE agent just talks to the Keylime agent then it can't really prove #1. A compromised keylime agent on node A could accept requests and forward them to some other process on node B, which could get its answers either from a Keylime agent on node B or from the TPM on node B. And the SPIRE agent on node A wouldn't know the difference as long as node A was registered with Keylime.

So by talking to the TPM directly we can independently prove the identity of the node and then prove #2 by talking to the Keylime agent and verifier.

Member Author

I do wonder if SPIRE agents do trust the SPIRE server as a boundary condition, and in case you answer yes, delegating the communication with the keylime verifier to the server

@maugustosilva Yes, I guess it's not clear from my proposal that the SPIRE server is the one talking to the keylime verifier, so it's only the SPIRE server that needs to be able to communicate with it over TLS.


as I mentioned to @galmasi already, that's why it is important that this API is node-local and cannot be reached through any other means. Sure, directly going to the TPM fully eliminates all these concerns, but it also makes it so much more impractical (and keeping it node-local is the practical approach of guaranteeing the same things). So as long as this API is node-local, you can prove (1).

That's what would keep this approach generic and being easily adoptable by other products for which talking directly to the TPM is a barrier which is just too high to achieve and which is why they would like to integrate with keylime to begin with.

That all said, it seems like you and others feel strongly about this approach. Yet again, while I agree that this is the theoretically safer approach, I disagree that it makes a practical difference.

for any 3rd party that wants to do deep verification of a node's status
in Keylime.

## Alternatives

It would also be great to compare a keylime integration against the "tpm_devid" plugin here, and what the advantages for a keylime integration would be.

Member Author

Good idea, I'll add it.


2. A new API on the verifier that can take a signed payload from a TPM and, given an agent's UUID, verify that it came from a TPM associated with that agent. This will be used to independently verify that the Keylime agent resides on a node with that TPM.

3. An expansion of the existing `/agents` GET API on the verifier to return enough information for use as selectors in SPIRE.

what information is still necessary/needed for it to be enough for SPIRE?

Member Author

Right now I was thinking of adding the name(s) of the Keylime policies passed by the node. Right now the only one with a name is the file integrity policy (IMA), but we can look at adding names to the measured boot policy and others in the future.

What other keylime data would you like to see as a selector?


no idea, that's kind of why I was asking :)

mpeters (Member Author) commented Jun 28, 2023

  • is there a design for the work necessary on the SPIRE side?
  • similarly, is there a similar proposal for a keylime integration/plugin on the SPIRE side? If yes, can we cross-reference / link them please?

I included the diagram for the full flow here in this enhancement to kind of serve as the design for the SPIRE side as well. Obviously it's not as detailed about the changes needed (needs a full new agent and server plugin) as I've never made a SPIRE plugin before.

But no, there isn't a corresponding enhancement proposal on the SPIRE side. I presented the ideas to the SPIRE folks on a video call and they gave feedback and encouragement to move forward. So that part will be a bit more exploratory. But I do expect full documentation of how it works end-to-end to be part of that plugin.

1. SPIRE Agent queries the node-local `/info` API on the keylime agent to get information like the Keylime UUID
2. SPIRE Agent creates a nonce and has the TPM sign it with the AK (created by keylime)
3. SPIRE Agent sends the information to the SPIRE Server
4. SPIRE Server queries the Keylime Verifier about the agent. Does it exist? Is it passing attestation? If so, can you verify the signature on this nonce? If all are true, then SPIRE attestation passes and an identity is issued.
Contributor

Is SPIRE attestation as 'periodic' as Keylime attestation?

Member Author

No, especially for node attestation, I believe it only happens once at SPIRE agent startup.

Contributor

Whatever SPIRE states about the node should only have a short-term validity period, since Keylime attestation may detect that the node has gone out-of-policy shortly after ...
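
One way to picture this is to cap the lifetime of whatever SPIRE issues at the interval after which the node's Keylime status is checked again. The sketch below is only an illustration of that idea; the function names and the notion of a `recheck_interval` are assumptions, not existing SPIRE or Keylime settings.

```python
from datetime import datetime, timedelta


def identity_expiry(
    issued_at: datetime, requested_ttl: timedelta, recheck_interval: timedelta
) -> datetime:
    """Never let an identity outlive the next Keylime status check."""
    return issued_at + min(requested_ttl, recheck_interval)


def identity_usable(now: datetime, expiry: datetime, still_passing: bool) -> bool:
    """An identity counts only while unexpired and the node is still in policy."""
    return still_passing and now < expiry
```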

THS-on mentioned this pull request Aug 25, 2023

mpeters (Member Author) commented Sep 21, 2023

image

THS-on (Member) commented Sep 25, 2024

@mpeters what is the state of this?

mpeters (Member Author) commented Sep 25, 2024

I think this is basically done. The changes I needed to Keylime are merged (keylime/rust-keylime#758 and keylime/keylime#1532) and the plugin (https://github.com/keylime/spire-keylime-plugin) works although it needs some love to be production ready (more robust, handle failures and race conditions better).
