
Chisel client keeps getting connection refused when connecting to exitnode #152

Open
mojtabash78 opened this issue Dec 31, 2024 · 18 comments · Fixed by #156
Labels
bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed), question (Further information is requested), rust (Pull requests that update Rust code)

Comments

@mojtabash78

Hi,
I'm having trouble using chisel-operator in my Kubernetes cluster. I used
kubectl apply -k https://github.com/FyraLabs/chisel-operator
to install it in the cluster.

My exit node is a VPS of mine which hosts the chisel server and works fine. I even tried the chisel client from my local machine to verify that the chisel server is good to go, and it is.

I applied a simple hello-world pod in the cluster to test the operator's functionality. I then applied the LoadBalancer svc, and the operator created a pod for the chisel client, but I keep getting this log from that pod:

2024/12/31 12:36:59 client: Connecting to ws://{myvpsip}:9090
2024/12/31 12:36:59 client: Connection error: dial tcp {myvpsip}:9090: connect: connection refused
2024/12/31 12:36:59 client: Retrying in 100ms...
2024/12/31 12:36:59 client: Connection error: dial tcp {myvpsip}:9090: connect: connection refused (Attempt: 1/unlimited)
2024/12/31 12:36:59 client: Retrying in 200ms...
2024/12/31 12:36:59 client: Connection error: dial tcp {myvpsip}:9090: connect: connection refused (Attempt: 2/unlimited)
2024/12/31 12:36:59 client: Retrying in 400ms...
2024/12/31 12:37:00 client: Connection error: dial tcp {myvpsip}:9090: connect: connection refused (Attempt: 3/unlimited)
2024/12/31 12:37:00 client: Retrying in 800ms...
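As an aside, the retry delays in that log double each attempt (100ms, 200ms, 400ms, 800ms), which is standard exponential backoff. A minimal Python sketch of that pattern, purely illustrative since chisel itself is written in Go:

```python
def backoff_schedule(base_ms: int = 100, attempts: int = 4, cap_ms: int = 60_000) -> list[int]:
    """Delays for an exponential-backoff retry loop: double each attempt, up to a cap."""
    return [min(base_ms * 2 ** i, cap_ms) for i in range(attempts)]

print(backoff_schedule())  # matches the log above: [100, 200, 400, 800]
```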

I tried running the chisel client command without the operator, inside a random pod in the cluster, and it worked. But with the operator, I keep getting this error.

this is my chisel server config:

[Unit]
Description=Chisel Tunnel
Wants=network-online.target
After=network-online.target
StartLimitIntervalSec=0

[Install]
WantedBy=multi-user.target

[Service]
Restart=always
RestartSec=1
User=root

ExecStart=/usr/local/bin/chisel server --port=9090 --reverse --auth 'admin:admin' -v

and these are exitnode and svc manifest:

apiVersion: chisel-operator.io/v1
kind: ExitNode
metadata:
  name: my-exit-node
  namespace: chisel-operator-system
spec:
  auth: exit-node-secret
  default_route: false
  host: {myvpsip}
  port: 9090

---
apiVersion: v1
kind: Service
metadata:
  name: hello-world
  namespace: chisel-operator-system
  annotations:
    chisel-operator.io/exit-node-name: "my-exit-node"
spec:
  selector:
    app: hello-world
  ports:
    - port: 80
      targetPort: 80
  type: LoadBalancer
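For context on what the operator does with a Service like this: it generates a chisel client pod whose remote-forward arguments take the form R:&lt;port&gt;:&lt;service&gt;.&lt;namespace&gt;:&lt;port&gt;/tcp, as can be seen in the pod descriptions later in this thread. A rough Python sketch of that mapping (a hypothetical helper, not the operator's actual Rust code):

```python
def remote_args(service: str, namespace: str, ports: list[int]) -> list[str]:
    """Build chisel client remote-forward args, one per Service port."""
    return [f"R:{p}:{service}.{namespace}:{p}/tcp" for p in ports]

print(remote_args("hello-world", "chisel-operator-system", [80]))
# ['R:80:hello-world.chisel-operator-system:80/tcp']
```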
  

Can someone help me?


@korewaChino
Member

korewaChino commented Dec 31, 2024

Could you show what deployment/pod the operator produced?

@korewaChino korewaChino added question Further information is requested rust Pull requests that update Rust code labels Jan 4, 2025
@Towerful

Towerful commented Jan 7, 2025

Hello,
I am also experiencing this with DigitalOcean and an ExitNodeProvisioner.

Manifests are essentially what is described in the README. The relevant secret is in the chisel namespace.
DigitalOcean is provisioning the VM, and chisel is creating an exit-node secret with a user:password key which looks correct. So I don't think it's a manifest issue.

ExitNodeProvisioner.yaml:

apiVersion: chisel-operator.io/v1
kind: ExitNodeProvisioner
metadata:
  name: digitalocean
  namespace: chisel
spec:
  DigitalOcean:
    auth: digitalocean
    region: lon1
    size: s-1vcpu-1gb

ExitNode.yaml:

apiVersion: chisel-operator.io/v1
kind: ExitNode
metadata:
  name: digitalocean
  namespace: chisel
  annotations:
    chisel-operator.io/exit-node-provisioner: "digitalocean"
spec:
  host: ""
  port: 9090

Trying to connect with the DigitalOcean Droplet -> Access -> Droplet Console (a web console):
Logging in as root asks for the current root password in order to update it. Entering the password found in the secret immediately closes the console window. Trying Ctrl+C to skip changing the password also immediately closes the console window.
Logging in as chisel gives an immediate Error: all configured authentication methods failed.

I can't find any audit history in DigitalOcean, other than a "created" event.

Edit:
ssh chisel@ip from a terminal prompts for a password, but fails to authenticate as in the error log above.
My log for posterity:

2025/01/07 14:29:06 client: Connecting to ws://{ip}:9090
2025/01/07 14:29:06 client: Handshaking...
2025/01/07 14:29:06 client: Authentication failed
2025/01/07 14:29:06 client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain
2025/01/07 14:29:06 client: Connection error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain
2025/01/07 14:29:06 client: Retrying in 100ms...
2025/01/07 14:29:06 client: Handshaking...
2025/01/07 14:29:06 client: Authentication failed
2025/01/07 14:29:06 client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain
2025/01/07 14:29:06 client: Connection error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain (Attempt: 1/unlimited)
2025/01/07 14:29:06 client: Retrying in 200ms...
2025/01/07 14:29:06 client: Handshaking...
2025/01/07 14:29:06 client: Authentication failed
2025/01/07 14:29:06 client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain

@Towerful

Towerful commented Jan 7, 2025

Just saw you asked for spec of the pod produced:

Name:             chisel-envoy-default-gw-3d45476e-6cb68db7bb-skghx
Namespace:        chisel
Priority:         0
Service Account:  default
Node:             {snip}
Start Time:       Tue, 07 Jan 2025 14:29:04 +0000
Labels:           pod-template-hash=6cb68db7bb
                  tunnel=envoy-default-gw-3d45476e
Annotations:      <none>
Status:           Running
IP:               10.244.0.131
IPs:
  IP:           10.244.0.131
Controlled By:  ReplicaSet/chisel-envoy-default-gw-3d45476e-6cb68db7bb
Containers:
  chisel:
    Container ID:  containerd://c83b951d931ac9ae70ced32aa1b8ba199e9d09b1420c83ddb096308db1b7f9d2
    Image:         jpillora/chisel
    Image ID:      docker.io/jpillora/chisel@sha256:6e9b2bd8773c5f6571148570957dfdf1db24702dec6d9f15b4027a1910bdb209
    Port:          <none>
    Host Port:     <none>
    Args:
      client
      -v
      {snip}:9090
      R:80:envoy-default-gw-3d45476e.envoy:80/tcp
      R:443:envoy-default-gw-3d45476e.envoy:443/tcp
    State:          Running
      Started:      Tue, 07 Jan 2025 14:29:06 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lghvr (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  kube-api-access-lghvr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  23m   default-scheduler  Successfully assigned chisel/chisel-envoy-default-gw-3d45476e-6cb68db7bb-skghx to {snip}
  Normal  Pulling    23m   kubelet            Pulling image "jpillora/chisel"
  Normal  Pulled     23m   kubelet            Successfully pulled image "jpillora/chisel" in 819ms (819ms including waiting). Image size: 7830406 bytes.
  Normal  Created    23m   kubelet            Created container chisel
  Normal  Started    23m   kubelet            Started container chisel

edit:
sorry, described the wrong pod. updated to the produced pod

@Towerful

Towerful commented Jan 7, 2025

Interestingly, DigitalOcean sent me an email with login credentials for the new droplet.
Unfortunately, I have already destroyed the droplet and moved on for the day.
I might try again tomorrow and test whether the creds D.O. sent are valid. Then I can have a poke around the VM and see if anything strikes me as odd or broken.

@korewaChino
Member

korewaChino commented Jan 8, 2025

Just saw you asked for spec of the pod produced: [quoting the full pod description and edit from the previous comment]

I'd like the pod spec in YAML, i.e. what CLI args and envars it produced; the Deployment is also fine.

edit: the CLI args are there, but what about the environment variables?

@Towerful

Towerful commented Jan 8, 2025

As I now realize is apparent:
The exit node is being provisioned, and chisel is being installed and runs:

#: systemctl status chisel
chisel.service - Chisel Tunnel
     Loaded: loaded (/etc/systemd/system/chisel.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-01-08 11:48:32 UTC; 18min ago
   Main PID: 1862 (chisel)
      Tasks: 5 (limit: 1113)
     Memory: 1.4M (peak: 1.6M)
        CPU: 20ms
     CGroup: /system.slice/chisel.service
             └─1862 /usr/local/bin/chisel server --port=9090 --reverse --auth {snip}
#: netstat -tunl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.54:53           0.0.0.0:*               LISTEN     
tcp6       0      0 :::9090                 :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
udp        0      0 127.0.0.54:53           0.0.0.0:*                          
udp        0      0 127.0.0.53:53           0.0.0.0:*   

It seems the cloud-init generator is creating a file: /etc/sysconfig/chisel with AUTH=chisel:{snip}.
And I am guessing that the exit node k8s pod is trying to SSH with chisel@ip.
#: cat /etc/passwd | grep chisel does not return any users.

I tried SSHing as chisel@ip with the password found in the secret, and I got an auth failure.
So I created the chisel user with the correct password, and I could SSH in.
However, the k8s pod was still failing (even after restarting it).
I figured I had copied the password incorrectly, so I deleted the exit node & droplet and redeployed. The same thing happened, and the k8s pod still could not connect.

@korewaChino
Member

Actually this might be related to #141

@korewaChino korewaChino added bug Something isn't working help wanted Extra attention is needed good first issue Good for newcomers labels Jan 8, 2025
@korewaChino
Member

korewaChino commented Jan 8, 2025

It seems the cloud-init generator is creating a file: /etc/sysconfig/chisel with AUTH=chisel:{snip}.
And I am guessing that the exit node k8s pod is trying to SSH with chisel@ip.
#: cat /etc/passwd | grep chisel does not return any users.

I tried SSHing via the chisel@ip with the password found in the secret, and I got auth failure.

That is normal behavior. We deploy a bare cloud-init configuration which, in DO's case, only lets you log in with the root credentials provided in their email.

We rely on the AUTH environment variable (provided in /etc/sysconfig/chisel) to provide creds to the server.
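For illustration, a unit wired that way would look roughly like the following (a sketch under that description, not necessarily the operator's exact generated unit):

```ini
[Service]
# AUTH=user:password is written to this file by cloud-init;
# systemd loads it before starting the server
EnvironmentFile=/etc/sysconfig/chisel
ExecStart=/usr/local/bin/chisel server --port=9090 --reverse --auth ${AUTH}
```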

However the k8s pod was still failing (even after restarting it).
I figured I had copied the password incorrectly, so I deleted the exit node & droplet, and redeployed it. Followed the same thing, and still the k8s pod could not connect.

Does the pod try to import the credential secret? Also, could you try the #142 branch as the image?

@Towerful

Towerful commented Jan 8, 2025

Hmm,
I have no idea what's happened, but it has suddenly started working.
The image IDs on both the operator and worker pods match what I was working with yesterday.
It doesn't seem like jpillora/chisel has pushed any updates.
I might have had some dangling config or secrets. Maybe a race condition between secret generation and pod creation?
Very strange!

I'm going to wipe my cluster and try a fresh installation. I'll get back to you in an hour or so...

@Towerful

Towerful commented Jan 8, 2025

Yeah, I can now confirm that a fresh install on a fresh cluster correctly provisions an ExitNode on DigitalOcean and connects to it.
I wonder if I had a dangling authentication secret that was causing issues.
I'm not too sure what I've actually changed since yesterday, unfortunately.

I started with "operator-provisioned auto-allocated", then moved to "operator-provisioned manually-allocated".
In the past 3 hours, I have added the auth: digitalocean-auth spec to the ExitNode without creating the underlying secret. The operator seems to create the secret.

When I was looking at the log spam, I did see something along the lines of "warning: no auth password, this is a security risk". I think that was when I added the parameter.
Is it possible that without auth: digitalocean-auth being set, the operator is not generating the secret (or is generating it, but not assigning it to the worker deployment), so the worker pod can't authenticate?
Whereas with auth: digitalocean-auth, the operator updates the secret (in a create-or-update sort of way) and can then assign it to the worker?

@korewaChino korewaChino linked a pull request Jan 8, 2025 that will close this issue
@korewaChino korewaChino reopened this Jan 8, 2025
@Towerful

Towerful commented Jan 9, 2025

So, this has cropped up again.
And I have no idea why!

That is normal behavior. We deploy a bare cloud-init configuration which, in DO's case, only lets you log in with the root credentials provided in their email.

We rely on the AUTH environment variable (provided in /etc/sysconfig/chisel) to provide creds to the server.

Steps I've taken to try to remedy it:

I tried power cycling the droplet from the DO dashboard, and still had SSH auth failures.
I tried killing the chisel pod, and it still wouldn't connect.

I SSH'd into the droplet, did a printenv | grep AUTH and printenv | grep chisel, and didn't see anything.

Correct me if I'm wrong, but I don't believe Ubuntu uses the /etc/sysconfig folder.
Env variables in Ubuntu can be set using /etc/environment, which I did; I rebooted and confirmed the existence of the AUTH env var.
The pod could still not connect.

I tried grep 'chisel' /var/log/auth.log to see if there were any attempts, and I didn't see any.
The pod could still not connect.

I then ran useradd chisel and set the password.
Still no entries in /var/log/auth.log for connections, and the pod cannot connect.

What is strange is that the WebSocket seems to connect fine.
I don't see any useful logs from journalctl -u chisel on the droplet (just things like service started, listening, reverse tunnel enabled, etc.).


I am building the #156 0.5 staging PR.
I'm going to try wiping and redeploying everything; it might be that messing around getting the image built and deployed broke something on my cluster.
I'll clear out image caches as well, in case something is stuck.
I'll update in an hour.


Before wiping, I tried getting chisel-operator to auto-provision everything.
I tried chisel-operator.io/exit-node-provisioner: "digitalocean" and chisel-operator.io/exit-node-provisioner: "chisel/digitalocean".
My gateway definition is in the default namespace, but Envoy provisions the gateway services in the envoy namespace.
Chisel created an ExitNode in the envoy namespace with the following description:

Name:         service-envoy-default-gw-3d45476e
Namespace:    envoy
Labels:       <none>
Annotations:  chisel-operator.io/exit-node-provisioner: envoy/digitalocean
API Version:  chisel-operator.io/v1
Kind:         ExitNode
Metadata:
  Creation Timestamp:  2025-01-09T16:24:48Z
  Generation:          1
  Owner References:
    API Version:     v1
    Controller:      true
    Kind:            Service
    Name:            envoy-default-gw-3d45476e
    UID:             7f444240-f708-4e17-a6fe-e7de7dd9d9f6
  Resource Version:  296742
  UID:               ba1365e4-6c27-44da-ad3e-d9fdca37394b
Spec:
  Auth:           service-envoy-default-gw-3d45476e-auth
  chisel_image:   <nil>
  default_route:  true
  external_host:  <nil>
  Fingerprint:    <nil>
  Host:           
  Port:           9090
Events:           <none>

And the operator sticks on "Waiting for exit node to be provisioned".
I moved the ExitNodeProvisioner to the envoy namespace; however, the ExitNode that was generated still had chisel_image: <nil> and no pod was provisioned.


UPDATE: Still cannot connect after a cluster wipe & reinstall

@korewaChino
Member

I still couldn't reproduce this in 0.5, so I don't even know what went wrong :/

I never really tried Envoy, and we don't really support this use case much, since you're supposed to just expose a service directly to the cloud without MetalLB or Envoy, using something like NGINX or Traefik.

chisel_image being null should not affect the pod deployment at all; by default it will use upstream chisel if none is provided.

@Towerful

Should I try a different cloud provider?
Is there one you recommend?

@korewaChino
Member

The issue isn't the cloud provider but how Envoy is redirecting things; I don't know what it's doing to cause this.

You should try another proxy instead of Envoy for now, something like Traefik or NGINX.

@korewaChino
Member

I SSHd into the droplet, and did a printenv | grep AUTH and printenv | grep chisel and didnt see anything.

Correct me if I'm wrong, but I don't believe Ubuntu uses the /etc/sysconfig folder.

The systemd service explicitly loads from the /etc/sysconfig/chisel file.

This is not a DigitalOcean issue but some kind of issue in your networking setup that doesn't let you connect to port 9090 of the VPS node.

I cannot reproduce this in any way on a fresh install without Envoy and MetalLB inside a k3d container.

I SSHd into the droplet, and did a printenv | grep AUTH and printenv | grep chisel and didnt see anything.

It's never meant to be in the global environment, see man systemd.service

I then useradd chisel and set the password.
Still no entries in /var/log/auth.log for connections and pod cannot connect.

Chisel does not use the system's PAM authentication. It serves a WebSocket on port 9090, with an SSH transport layer tunneled inside that connection. You do not have to create a dedicated user for chisel.

Please export your generated deployment/pod in YAML format and the service. I cannot help you if I do not know what it's actually outputting.

kubectl get deployment <deployment_name> -n <namespace> -o yaml
kubectl get service <envoy_generated_service>  -n <namespace> -o yaml
kubectl get exitnode <exit_node_name> -n <namespace> -o yaml

Also, Chisel Operator's ExitNodes are namespaced unless specified otherwise, so you would need to put the DO provisioner inside the envoy namespace.

@Towerful

Hello,
Thanks for your time so far! Having dug through the source code and read your explanation, I understand more of what is going on. Certainly the docs for upstream Chisel are pretty thin on how it actually works.

Anyway, I have wiped my cluster.

  • I was using the staging PR for the 0.5.0 #156 version; I am now back to default Helm values.
  • I was deploying the ExitNode, ExitNodeProvisioner and appropriate secrets at the same time as the gateway spec. I am now deploying the ExitNode, ExitNodeProvisioner and Secret before deploying the gateway (this didn't seem to matter at the time, as the droplet would still be provisioned).
  • I was using a 32-character password with symbols. I've changed to a 20-character alphanumeric string. I'm also wrapping the string in quotes (I know I should be doing this anyway; YAML has some quirks).
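On the YAML-quoting point: unquoted scalars containing certain symbols are a real hazard. A value starting with * is parsed as a YAML alias (a parse error if the anchor is undefined), and a # preceded by a space starts a comment, silently truncating the value. Illustrative made-up values:

```yaml
stringData:
  # Unquoted and containing " #": everything after the space-hash is a comment,
  # so this is stored as just "admin:p"
  auth-bad: admin:p #ssw0rd
  # Quoted: the full string survives intact
  auth-good: "admin:p #ssw0rd"
```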

After wiping and redeploying everything with these 3 changes, the tunnel popped right up and connected (and the log spam has cleared).
I'll test it through some power cycles etc., in case it was a fluke. Deleting the provisioned pod results in the pod being redeployed and instantly connecting. Power cycling the droplet drops the connection, and then it is quickly re-established. So I don't think it's a fluke.

I am going to try changing the deployment order so the ExitNode and ExitNodeProvisioner get deployed at the same time as the gateway, see if issues resurface.
I am going to try the #156 version, and see if the issues resurface.
I am going to try a longer alphanumeric password, see if the issues resurface.
I am going to try a password with symbols, see if the issues resurface.

Anything else you would like me to test?
Do you still want deployment/svc/exitnode definitions?

@korewaChino
Member

korewaChino commented Jan 12, 2025

Do you still want deployment/svc/exitnode definitions?

Yes. Please give me the definitions.

Also, 0.5.0 has been released so use that version instead.
