Antrea L7NetworkPolicies do not handle Service traffic correctly #6854

Open
antoninbas opened this issue Dec 11, 2024 · 6 comments
Labels
area/network-policy Issues or PRs related to network policies. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@antoninbas
Contributor

Describe the bug
Antrea L3/L4 policy rules handle Service traffic correctly: they are applied to traffic "post-DNAT", when the destination IP address has been rewritten to the endpoint IP.
I have observed that Service traffic is not handled correctly for policies with L7 rules: all the traffic is dropped by Suricata, independently of the rule contents.

To Reproduce
Install Antrea with the necessary configuration:

helm install -n kube-system antrea antrea/antrea --set featureGates.L7NetworkPolicy=true --set disableTXChecksumOffload=true
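
If the Antrea chart repository has not been added to Helm yet, something along these lines should work first (chart repository URL as documented for the Antrea Helm chart; adjust if needed):

helm repo add antrea https://charts.antrea.io
helm repo update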

Use the following policy:

apiVersion: crd.antrea.io/v1beta1
kind: NetworkPolicy
metadata:
  name: egress-allow-http
spec:
  priority: 5
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          app: http-client
  egress:
    - name: allow-http
      action: Allow      # All other traffic to these Pods will be automatically dropped, and subsequent rules will not be considered.
      to:
        - podSelector:
            matchLabels:
              app: http-server
      l7Protocols:
        - http: {}
    - name: drop-other   # Drop all other egress traffic
      action: Drop

For the http-server application, you can use a Deployment running an nginx Pod, exposed by a Service.

http-server Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: http-server
  template:
    metadata:
      labels:
        app: http-server
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
---
apiVersion: v1
kind: Service
metadata:
  name: http-server
spec:
  selector:
    app: http-server
  ports:
    - port: 80
      targetPort: 80

For the http-client application, you can use a Deployment running an antrea/toolbox Pod.

http-client Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: http-client
  template:
    metadata:
      labels:
        app: http-client
    spec:
      containers:
      - name: toolbox
        image: antrea/toolbox:latest
        imagePullPolicy: IfNotPresent

After creating everything, try to curl the http-server Service from the http-client Pod. It should hang.
However, if you curl the http-server Pod IP address directly, it will work as expected.
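
For example, assuming both Deployments run in the default namespace (as in the manifests above), the two checks can be run as follows; the names and timeout value are illustrative:

# Access through the Service: the request hangs once the L7 policy is applied
kubectl exec deploy/http-client -- curl --connect-timeout 5 http://http-server

# Access the server Pod IP directly: the request succeeds
SERVER_POD_IP=$(kubectl get pod -l app=http-server -o jsonpath='{.items[0].status.podIP}')
kubectl exec deploy/http-client -- curl --connect-timeout 5 "http://$SERVER_POD_IP"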

Expected
The policy should work correctly when the http-server application is accessed through the Service.

Actual behavior
The policy only works correctly when the http-server application is accessed directly using the Pod IP.

Versions:
Antrea v2.2.0, and top-of-tree

Additional context
This is the traffic captured on antrea-l7-tap0 (ingress interface for Suricata engine), when accessing the http-server Service.
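
For reference, a capture like this can be obtained by running tcpdump on the antrea-l7-tap0 interface from the antrea-agent Pod on the Node hosting the Pods; the Pod name below is a placeholder, and this assumes tcpdump is available in the antrea-agent container (otherwise run it directly on the Node):

kubectl exec -n kube-system <antrea-agent-pod> -c antrea-agent -- tcpdump -i antrea-l7-tap0 -n -v tcp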

19:10:44.846442 IP (tos 0x0, ttl 63, id 19449, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.2.16.54062 > 10.10.2.15.80: Flags [S], cksum 0xdb98 (correct), seq 3270148714, win 64860, options [mss 1410,sackOK,TS val 1651039181 ecr 0,nop,wscale 7], length 0
19:10:44.846633 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.96.226.29.80 > 10.10.2.16.54062: Flags [S.], cksum 0xfc4f (correct), seq 3998964513, ack 3270148715, win 64308, options [mss 1410,sackOK,TS val 201309037 ecr 1651035102,nop,wscale 7], length 0
19:10:45.870228 IP (tos 0x0, ttl 63, id 19450, offset 0, flags [DF], proto TCP (6), length 60)
    10.10.2.16.54062 > 10.10.2.15.80: Flags [S], cksum 0xd799 (correct), seq 3270148714, win 64860, options [mss 1410,sackOK,TS val 1651040204 ecr 0,nop,wscale 7], length 0
19:10:45.870501 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.96.226.29.80 > 10.10.2.16.54062: Flags [S.], cksum 0xf84f (correct), seq 3998964513, ack 3270148715, win 64308, options [mss 1410,sackOK,TS val 201310061 ecr 1651035102,nop,wscale 7], length 0
19:10:47.886667 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)

10.10.2.16 is the IP address of the http-client Pod.
10.96.226.29 is the ClusterIP address of the http-server Service.
10.10.2.15 is the IP address of the http-server Pod.

We can see that the client -> server traffic is forwarded to Suricata "post-DNAT" (the destination IP is the http-server Pod IP). However, the server -> client (reply) traffic appears to be forwarded to Suricata after the source IP has been rewritten back to the original destination IP (i.e., the ClusterIP). Suricata has no way to identify this reply traffic as part of the same connection. The reply traffic (the SYN-ACK in this case) is dropped (I assume by Suricata) and does not show up on antrea-l7-tap1 (egress interface for Suricata engine).
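
To confirm the DNAT state for the connection, the conntrack entry in zone 65520 can be dumped from OVS, for example as below; the agent Pod name is a placeholder and the client IP comes from the capture above:

kubectl exec -n kube-system <antrea-agent-pod> -c antrea-ovs -- ovs-appctl dpctl/dump-conntrack zone=65520 | grep 10.10.2.16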

The Antrea datapath should be fixed so that the reply traffic is sent to Suricata prior to rewriting the source IP ("un-DNAT").

@antoninbas added the kind/bug, priority/important-soon, and area/network-policy labels on Dec 11, 2024
@antoninbas
Contributor Author

cc @tnqn @hongliangl @luolanzone for visibility
Do you think this could be fixed in the v2.3 timeframe?

@hongliangl
Contributor

Will take a look and evaluate.

@hongliangl
Contributor

hongliangl commented Dec 12, 2024

Currently, we use a CT mark L7NPRedirectCTMark in zone 65520 to identify request and reply packets. Packets with the CT mark will be redirected to Suricata via antrea-l7-tap0.

To redirect reply packets to Suricata, all packets go through table ConntrackZone to restore the L7NPRedirectCTMark from ct zone 65520. However, Service DNAT is also performed in zone 65520, so the reply packets are "un-DNATed" by the same ct action. As a result, the reply packets are sent to Suricata after being "un-DNATed".
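
For context, the flows that perform this redirect on the agent's OVS bridge (br-int by default) can be inspected with something like the following; the agent Pod name is a placeholder:

kubectl exec -n kube-system <antrea-agent-pod> -c antrea-ovs -- ovs-ofctl dump-flows br-int | grep 'ct_mark=0x80/0x80'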

@antoninbas
Contributor Author

Do we need to use an additional ct zone for such packets (that need to be sent to Suricata), so we can identify reply packets earlier, or is there another solution?

@luolanzone added this to the Antrea v2.3 release milestone on Dec 13, 2024
@hongliangl
Contributor

hongliangl commented Jan 3, 2025

I tried to find a simple and elegant solution to the issue but couldn't. After discussing it with @wenying, we identified a feasible fix. However, in my opinion, this solution is quite complex and may be difficult to explain.

For the packets of a Service connection enforced by an L7 NetworkPolicy:

| Packet Type | Phase | Inport | Source IP | Destination IP | Note |
| --- | --- | --- | --- | --- | --- |
| First request packet | 1 | pod4 | 10.10.0.4 | 10.96.0.1 | Initial request packet |
| | 2 | pod4 | 10.10.0.4 | 10.10.0.5 | DNAT performed in EndpointDNAT |
| | 3 | antrea-l7-tap1 | 10.10.0.4 | 10.10.0.5 | Returned from Suricata and forwarded to the Pod |
| Reply packets | 1 | pod5 | 10.10.0.5 | 10.10.0.4 | Sent to Suricata via antrea-l7-tap0 without unDNAT |
| | 2 | antrea-l7-tap1 | 10.10.0.5 | 10.10.0.4 | Returned from Suricata and forwarded to the Pod |
| | 3 | antrea-l7-tap1 | 10.96.0.1 | 10.10.0.4 | unDNAT applied and sent back to the Pod |
| Subsequent request packets | 1 | pod4 | 10.10.0.4 | 10.96.0.1 | Subsequent request packets |
| | 2 | pod4 | 10.10.0.4 | 10.10.0.5 | DNAT with connection tracking, sent to Suricata via antrea-l7-tap0 |
| | 3 | antrea-l7-tap1 | 10.10.0.4 | 10.10.0.5 | Forwarded to the Pod |

In the flows listed below:

  • The flows in bold match the packets currently being discussed, when multiple flows are listed.
  • The flows marked with * are newly introduced flows.

First request packet

Phase 1

This is the enhanced flow matching the first request packet in phase 1. There are two changes:

  • Add the match condition tcp.
  • Add a learn action to generate a learned flow in a new table 100 that matches the reply packets in phase 1.

* table=AntreaPolicyEgressRule, priority=65000,conj_id=3, tcp, actions=
learn(table=100,idle_timeout=5, priority=200,delete_learned, cookie=0x203000000000a, eth_type=0x800, nw_proto=6,
NXM_OF_IP_DST[]=NXM_OF_IP_SRC[],
NXM_OF_IP_SRC[]=NXM_OF_IP_DST[],
NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],
NXM_OF_TCP_DST[]=NXM_OF_TCP_SRC[],
load:0x1->NXM_NX_REG0[23..24],
load:0x1->NXM_NX_REG8[0..11],
load:0x1->NXM_NX_REG0[21..22]),
load:0x3->NXM_NX_REG5[],
ct(commit,table=EgressMetric,zone=65520,exec(load:0x3->NXM_NX_CT_LABEL[32..63],load:0x1->NXM_NX_CT_MARK[7],load:0x1->NXM_NX_CT_LABEL[64..75]))

Phase 2

The first request packet in phase 2 is still sent to the Suricata port by the following flow.

* table=Output, priority=400,reg0=0x6/0xf actions=output:NXM_NX_REG1[]

* table=Output, priority=400,reg0=0x800000/0x1800000 actions=push_vlan:0x8100,move:NXM_NX_REG8[0..11]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=58487,userdata=01.01)

Phase 3

These flows match the first request packet in phase 3, returned from Suricata, and forward it to the destination Pod with IP 10.10.0.5.

* table=Classifier, priority=300,in_port="antrea-l7-tap1", vlan_tci=0x1000/0x1000 actions=strip_vlan,load:0x6->NXM_NX_REG0[0..3],load:0x2->NXM_NX_REG0[23..24],goto_table:ConntrackZone

* table=ConntrackZone, priority=400,ip,reg0=0x6/0xf actions=set_field:0x200000/0x600000->reg0,ct(table=L3Forwarding,zone=65520,nat)
* table=ConntrackZone, priority=300,reg0=0/0x1800000 actions=resubmit(,100),resubmit(,ConntrackZone)
* table=ConntrackZone, priority=300,ip,reg0=0x800000/0x1800000 actions=goto_table:Output
* table=ConntrackZone, priority=300,ip,reg0=0x1000000/0x1800000 actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=200,ip actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=0 actions=goto_table:ConntrackState

* table=Output, priority=400,reg0=0x6/0xf actions=output:NXM_NX_REG1[]
* table=Output, priority=400,reg0=0x800000/0x1800000 actions=push_vlan:0x8100,move:NXM_NX_REG8[0..11]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=58487,userdata=01.01)

Reply packets

Phase 1

These flows are used to distinguish the reply packets in phase 1 from all traffic. Some new register marks are introduced:

  • reg0=0x0/0x1800000, default
  • reg0=0x800000/0x1800000, reply packets of an L7 NetworkPolicy connection
  • reg0=0x1000000/0x1800000, other packets.

At first, all packets are resubmitted to table 100 to load the register marks. The reply packets in phase 1 will then be marked with reg0=0x800000/0x1800000. As a result, these packets are forwarded to table Output directly and redirected to Suricata without a ct action, avoiding unDNAT.

* table=ConntrackZone, priority=400,ip,reg0=0x6/0xf actions=set_field:0x200000/0x600000->reg0,ct(table=L3Forwarding,zone=65520,nat)
* table=ConntrackZone, priority=300,reg0=0/0x1800000 actions=resubmit(,100),resubmit(,ConntrackZone)
* table=ConntrackZone, priority=300,ip,reg0=0x800000/0x1800000 actions=goto_table:Output
* table=ConntrackZone, priority=300,ip,reg0=0x1000000/0x1800000 actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=200,ip actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=0 actions=goto_table:ConntrackState

* table=Output, priority=400,reg0=0x6/0xf actions=output:NXM_NX_REG1[]
* table=Output, priority=400,reg0=0x800000/0x1800000 actions=push_vlan:0x8100,move:NXM_NX_REG8[0..11]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=58487,userdata=01.01)

* table=100, priority=200,tcp,nw_src=10.10.0.5,nw_dst=10.10.0.4,tp_src=80,tp_dst=52994 actions=set_field:0x800000/0x1800000->reg0,set_field:0x1/0xfff->reg8,set_field:0x200000/0x600000->reg0
* table=100, priority=0 actions=set_field:0x1000000/0x1800000->reg0

Phase 2

Similar to the first request packet in phase 3, these flows are also used to match the reply packets in phase 2. With the ct action, the packets are unDNATed, transitioning into the reply packets in phase 3.

* table=Classifier, priority=300,in_port="antrea-l7-tap1", vlan_tci=0x1000/0x1000 actions=strip_vlan,load:0x6->NXM_NX_REG0[0..3],load:0x2->NXM_NX_REG0[23..24],goto_table:ConntrackZone

* table=ConntrackZone, priority=400,ip,reg0=0x6/0xf actions=set_field:0x200000/0x600000->reg0,ct(table=L3Forwarding,zone=65520,nat)
* table=ConntrackZone, priority=300,reg0=0/0x1800000 actions=resubmit(,100),resubmit(,ConntrackZone)
* table=ConntrackZone, priority=300,ip,reg0=0x800000/0x1800000 actions=goto_table:Output
* table=ConntrackZone, priority=300,ip,reg0=0x1000000/0x1800000 actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=200,ip actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=0 actions=goto_table:ConntrackState

Phase 3

The reply packets in phase 3 are forwarded to the Pod with IP 10.10.0.4.

* table=Output, priority=400,reg0=0x6/0xf actions=output:NXM_NX_REG1[]
* table=Output, priority=400,reg0=0x800000/0x1800000 actions=push_vlan:0x8100,move:NXM_NX_REG8[0..11]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=58487,userdata=01.01)

Subsequent request packets

Phase 1

These flows are used to match the subsequent request packets in phase 1, to make sure the packets are DNATed correctly, transitioning into the subsequent request packets in phase 2, as well as restoring the ct state.

* table=ConntrackZone, priority=300,reg0=0x0/0x1800000 actions=resubmit(,100),resubmit(,ConntrackZone)
* table=ConntrackZone, priority=300,reg0=0x800000/0x1800000,ip, actions=goto_table:Output
* **table=ConntrackZone, priority=300,reg0=0x1000000/0x1800000,ip, actions=ct(table=ConntrackState,zone=65520,nat)**
table=ConntrackZone, priority=200,ip actions=ct(table=ConntrackState,zone=65520,nat)
table=ConntrackZone, priority=0 actions=goto_table:ConntrackState

* table=100, priority=200,tcp,nw_src=10.10.0.5,nw_dst=10.10.0.4,tp_src=80,tp_dst=52994 actions=set_field:0x800000/0x1800000->reg0,set_field:0x1/0xfff->reg8,set_field:0x200000/0x600000->reg0
* table=100, priority=0 actions=set_field:0x1000000/0x1800000->reg0

Phase 2

The subsequent request packets in phase 2 will be redirected to Suricata by the flow below, since ct_mark=0x80/0x80 and the ct_label have been restored.

* table=Output, priority=400,reg0=0x6/0xf actions=output:NXM_NX_REG1[]
* table=Output, priority=400,reg0=0x800000/0x1800000 actions=push_vlan:0x8100,move:NXM_NX_REG8[0..11]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:1
table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=58487,userdata=01.01)

Phase 3

The processing of the subsequent request packets in phase 3 is the same as for the first request packet in phase 3.

@antoninbas @tnqn @luolanzone

@hongliangl
Contributor

Do we need to use an additional ct zone for such packets (that need to be sent to Suricata), so we can identify reply packets earlier, or is there another solution?

That should be a simple way to fix the issue. I don't have a concrete idea about how to implement this yet, but it may cause side effects for other connections that are not enforced by L7 NetworkPolicies, degrading performance due to the newly introduced ct zone.
