
Kubernetes #9

Open

bonyfusolia opened this issue Apr 15, 2021 · 21 comments

Comments
@bonyfusolia

Hi,

Can we run this agent inside a Kubernetes cluster?

@Lucretius
Owner

I have not actually tried doing this, so I am not sure. Feel free to give it a shot!

@devops-42

Hi @luanphantiki

thanks for your PR.

I could build the container image from the Dockerfile and have created a deployment configuration for a k8s cluster. Vault is configured with a matching AppRole, and the snapshot.json is mounted into the Pod at the expected path.
When starting the Pod the log output shows:

Not running on leader node, skipping.

Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?

Thanks for your help!
Cheers.

@luanphantiki
Contributor

@devops-42

Not running on leader node, skipping.

-> This message shows that it came from a follower Pod. How many Vault pods do you have? Let's focus on the leader Pod's logs.

Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?

-> It's not required to run it as a side-car; you can use a separate Kubernetes deployment with the correct value: "addr":"http://vault-leader.svc:8200"

Btw, showing your snapshot.json file would help me understand what you have.

@devops-42

@luanphantiki

Thanks for the clarification. I changed the address to the internal svc address of the leading pod (I have a cluster of 3 pods deployed). Now the snapshotter attempts a snapshot; the logs say:

2021/10/26 08:03:41 Reading configuration...
2021/10/26 08:04:41 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)

My snapshot.json is:

{
   "addr":"http://leader-adress:8200,
   "retain":72,
   "frequency":"3600s",
   "role_id": "***",
   "secret_id":"***",
   "aws_storage":{
      "access_key_id":"***",
      "secret_access_key":"***",
      "s3_region":"us-east-1",
      "s3_bucket":"**bucket**",
      "s3_endpoint":"**s3_endpoint**",
      "s3_force_path_style":true
   }
}

What could be wrong here?

@luanphantiki
Contributor

@devops-42 can you try to update addr from :

 "addr":"http://leader-adress:8200,

to:

 "addr":"http://leader-adress:8200",

If the issue stays the same, try to validate connectivity by exec-ing a shell into the backup pod and running:

curl -Ik http://vault-leader:8200

And show me the output.

@devops-42

My bad, when cleaning up the config file I accidentally deleted the "

Concerning the curl call: I got a 307 status code:

HTTP/1.1 307 Temporary Redirect
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Location: /ui/
Date: Tue, 26 Oct 2021 09:07:30 GMT

@luanphantiki
Contributor

@devops-42 Then finally make sure that you're using Raft as the storage backend, correct? Can you show your Vault config?

@devops-42

@luanphantiki

I do use Raft as storage; here's a redacted output of the vault status command:

Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            5
Threshold               3
Version                 1.8.1
Storage Type            raft
Cluster Name            vault-cluster-******
Cluster ID              ********-****-****-****-************
HA Enabled              true
HA Cluster              https://***********:8201
HA Mode                 active
Active Since            YYYY-MM-DDTHH:MM:SS.123456789Z
Raft Committed Index    *******
Raft Applied Index      *******

@luanphantiki
Contributor

luanphantiki commented Oct 26, 2021

@devops-42 Alright, let's rerun the backup pod. Does it work? You should tail Vault's logs to see if there is any clue.

@devops-42

@luanphantiki

It seems that the pod can connect to the Vault leader pod; the leader's log output is as follows:

2021-10-26T09:19:43.036Z [INFO]  storage.raft: starting snapshot up to: index=*******
2021-10-26T09:19:43.045Z [INFO]  storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:19:43.081Z [INFO]  storage.raft: snapshot complete up to: index=*******

But when checking the local filesystem of the Pod (which has a PVC attached), no snapshot file has been created.

Any chance to enable more debugging in the backup pod?

@luanphantiki
Contributor

@devops-42: Can you check your S3? Any new output from the backup pod?

@devops-42

@luanphantiki

The backup pod error message stays the same. I could successfully connect from the backup pod to the S3 endpoint (we use MinIO) via nc:

ip.add.ress.minio (ip.add.ress.minio:9000) open

So I assume that my network setup is correct.

@luanphantiki
Contributor

luanphantiki commented Oct 26, 2021

@devops-42: I haven't tried MinIO with this project and am not sure whether the current lib (https://github.com/aws/aws-sdk-go/tree/main/service/s3/s3manager) supports it. @Lucretius can you please confirm?

Anyway, as a workaround I suggest replacing the MinIO S3 config in snapshot.json with the local directive so the backup pod can write data to the local filesystem:

{
...
"local_storage": {
  "path": "/path/to/pvc/"
}
...
}

@devops-42

@luanphantiki

at first, thanks for your patience :)

I started a debug pod to play around with configuration and the binary. Tried to perform a backup using this (redacted) configuration:

{
   "addr":"http://vault-leader:8200",
   "retain":72,
   "frequency":"3600s",
   "role_id": "******",
   "secret_id":"******",
   "local_storage":{
    "path": "/tmp"
   }
}

The config file is located at /tmp/snapshot.json. I have started the snapshotter:

~ $ /vault_raft_snapshot_agent /tmp/snapshot.json 
2021/10/26 09:49:23 Reading configuration...
2021/10/26 09:50:23 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)

The according log output of the vault leader is:

2021-10-26T09:49:23.857Z [INFO]  storage.raft: starting snapshot up to: index=*******
2021-10-26T09:49:23.862Z [INFO]  storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:49:23.870Z [INFO]  storage.raft: snapshot complete up to: index=*******

Seems to be an issue with the communication with the vault leader.

@luanphantiki
Contributor

@devops-42 : Agreed, it failed at snapshot step.

@devops-42

@luanphantiki
Ok. Is there by any chance a possibility to get more information on why this step fails? It seems to be a timeout: after 60 secs the snapshot attempt aborts with that error.

@luanphantiki
Contributor

@devops-42 unfortunately, this error is returned by the vault-api SDK; there are no further details to see.

I have also reproduced your configuration on my side and there is no issue:

/ # vi backup.json
/ # /vault_raft_snapshot_agent backup.json
2021/10/26 11:14:07 Reading configuration...
2021/10/26 11:14:07 Successfully created local snapshot to /tmp/raft_snapshot-1635246847792284547.snap

/ # cat backup.json
{
   "addr":"http://vault-leader.svc:8200",
   "retain":72,
   "frequency":"3600s",
   "role_id": "*******",
   "secret_id":"***********",
   "local_storage":{
    "path": "/tmp"
   }
}
/ # du -sh /tmp/raft_snapshot-1635246847792284547.snap
24.0K   /tmp/raft_snapshot-1635246847792284547.snap

@luanphantiki
Contributor

@luanphantiki Ok. Is there by any chance a possibility to get more information on why this step fails? It seems to be a timeout: after 60 secs the snapshot attempt aborts with that error.

It still looks like a plain connectivity issue, though.

@devops-42

@luanphantiki

The problem could be related to the size of vault.db. My vault.db file is currently over 2 GB. I checked via curl whether there's a timeout issue when creating the snapshot:

curl --header "X-Vault-Token: ..." --request GET http://vault-leader:8200/v1/sys/storage/raft/snapshot > /tmp/raft.snap
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1468M    0 1468M    0     0  8610k      0 --:--:--  0:02:54 --:--:-- 10.1M
$ ls -lh /tmp
total 1.5G
-rw-rw-rw-. 1 1000730000 root 1.5G Oct 26 11:25 raft.snap

@Lucretius
Does the vault_raft_snapshot_agent have any built-in timeout?

@luanphantiki
Contributor

@devops-42 that seems to be a valid issue, but we should move this conversation to #20

@devops-42

@luanphantiki You're absolutely right. Thx for your help!
