Network churn Load test - Add network policy enforcement latency measurement #431

Open

agrawaliti wants to merge 27 commits into main

Conversation

@agrawaliti commented Dec 12, 2024

Integrate network policy enforcement latency measurement.

Developed pipelines to compare network policy-related metrics between Azure powered by Cilium and Azure CNI Overlay using Network Policy Manager.
Configuration such as the number of nodes, pods, namespaces, and policies per namespace can be updated in the pipeline.
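For example, a run can be configured with values along these lines (the names here are indicative placeholders, not the exact pipeline parameter names):

node_count: 1000
pods_per_node: 20
no_of_namespaces: 10
network_policies_per_namespace: 100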

Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41

Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad

@agrawaliti marked this pull request as ready for review December 30, 2024 15:19
@agrawaliti changed the title from "Network churn" to "Network churn Load test - Add network policy enforcement latency measurement" Dec 30, 2024
@agrawaliti (Author):

@microsoft-github-policy-service agree company="Microsoft"

@@ -33,6 +33,9 @@ parameters:
- name: run_id
  type: string
  default: ''
- name: run_id_2

Collaborator:
what is run_id_2 for?

@agrawaliti (Author) commented Jan 9, 2025:

I am using two pre-created clusters, one for azure_cilium and one for azure_cni_overlay, and I pass them via run_id and run_id_2. Creating two new 1000-node clusters for every run takes a very long time, so I pass the two cluster tags and run the tests against the existing clusters.

On second thought, I think I can do this with Terraform and schedule it to run periodically.
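For reference, the two tags are plain string parameters on the pipeline, e.g. (run_id_2 shown declared the same way as run_id; the diff above is truncated after its name):

- name: run_id          # tag of the pre-created azure_cilium cluster
  type: string
  default: ''
- name: run_id_2        # tag of the pre-created azure_cni_overlay cluster
  type: string
  default: ''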

@@ -48,6 +51,9 @@ parameters:
- name: ssh_key_enabled
  type: boolean
  default: true
- name: use_secondary_cluster

Collaborator:

what is the secondary cluster for?


variables:
  SCENARIO_TYPE: perf-eval
  SCENARIO_NAME: cilium-network-churn

Collaborator:

network-policy-churn

parameters:
role: net
region: ${{ parameters.regions[0] }}
- template: /steps/engine/clusterloader2/cilium/scale-cluster.yml

Collaborator:

Can we do this in Terraform when setting up the cluster?

@@ -9,27 +9,40 @@ parameters:
- name: run_id
  type: string
  default: ''
- name: run_id_2

Collaborator:

same here

- name: retry_attempt_count
  type: number
  default: 3
- name: credential_type
  type: string
- name: ssh_key_enabled
  type: boolean
- name: use_secondary_cluster

Collaborator:

same here

@jshr-w (Contributor) left a comment:

Let's try to (1) minimize changes that touch code other pipelines use, and (2) after minimizing those changes, run the other automated pipelines off this branch to ensure they aren't broken.

@@ -29,17 +30,17 @@ name: load-config

# Service test
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 4000}}
{{$SMALL_GROUP_SIZE := DefaultParam .SMALL_GROUP_SIZE 20}}
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}

Contributor:

Can we name this CL2_SMALL_GROUP_SIZE to keep the variable naming coordinated?
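i.e. something like:

{{$SMALL_GROUP_SIZE := DefaultParam .CL2_SMALL_GROUP_SIZE 20}}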

{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
{{$smallDeploymentPods := DivideInt $totalPods $namespaces}}

Contributor:

This is going to break all the other tests, right? Could you please restore this, and probably create a parameter for bigDeployments and set that to 0 instead.
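A possible shape for that, reusing the existing parameter (the override mechanism mentioned below is illustrative, not part of this PR):

{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}

and the network-policy pipeline would then pass bigDeploymentsPerNamespace: 0 through its CL2 overrides (or an equivalent parameter) instead of editing the shared template.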

{{$smallDeploymentsPerNamespace := DivideInt $smallDeploymentPods $SMALL_GROUP_SIZE}}

namespace:
  number: {{$namespaces}}
  prefix: slo
  deleteStaleNamespaces: true
  deleteAutomanagedNamespaces: true
  enableExistingNamespaces: false
  enableExistingNamespaces: true

Contributor:

This may break testing. If namespaces weren't deleted by the previous run, we should be aware (many existing pipelines are dependent on this). Let's restore to the original value.

@@ -41,10 +49,11 @@ steps:
- basename: big-deployment
  objectTemplatePath: deployment_template.yaml
  templateFillMap:
    Replicas: {{$bigDeploymentSize}}
    Replicas: {{$bigDeploymentSize}}kube

Contributor:

Is this a typo?

    SvcName: big-service
    Group: {{.Group}}
    deploymentLabel: {{.deploymentLabel}}
{{end}}
- namespaceRange:

Contributor:

We don't want this code to execute if we are not running a network policy test, right? Shouldn't we 'else'-gate this?


Author:

I don't want to run the big deployment for the network test, so I have wrapped the big deployment in {{if not $NETWORK_TEST}}.


Contributor:

I mean, it works for your scenario, but it will break the others -- if NETWORK_TEST is not set, what is stopping the pipeline from running both phases?
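A rough sketch of the gating in question (the network-policy object name below is a placeholder, not something from this PR):

{{if not $NETWORK_TEST}}
- basename: big-deployment
  objectTemplatePath: deployment_template.yaml
  templateFillMap:
    Replicas: {{$bigDeploymentSize}}
{{else}}
- basename: network-policy-deployment   # placeholder for the network-test-only objects
  objectTemplatePath: deployment_template.yaml
{{end}}

so that exactly one of the two branches renders for any given run.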

throughput = 100
nodes_per_namespace = min(node_count, DEFAULT_NODES_PER_NAMESPACE)

Contributor:

I think the changes made to pods_per_node are going to break many of the existing pipelines. After this change, how can the service test use 20 pods per node? We need to be careful adding parameters given the number of pipelines that are dependent on them... IMO the safest way will be to have an IF branch here, and possibly a parameter for pods per node ONLY for the network_test.


Author:

In my opinion, having pods_per_node as a hard-coded constant that may need to change based on the use case is not a good approach. I have added the parameter to the pipeline configuration, so if we need a custom value we can set it in the pipeline; if it is unset, it keeps working as before with the default of 40.


Contributor:

The default isn't the same for all pipelines though, so something would break.

For consistency, let's instead take the same approach as #456 and use the max_pods param to configure pods_per_node for this pipeline.
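Roughly the idea (illustrative only; the exact plumbing follows #456, and the name and values below are placeholders):

- name: max_pods
  type: number
  default: 0    # 0 / unset would mean "keep the existing pods_per_node behaviour"

with only the network-policy pipeline setting it (e.g. max_pods: 20), so every other pipeline keeps its current defaults.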

@@ -142,7 +167,7 @@ def collect_clusterloader2(
"group": None,
"measurement": None,
"result": None,
# "test_details": details,
# # "test_details": details,

Contributor:

nit: typo

parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn")
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test")
parser_configure.add_argument("no_of_namespaces", type=int, default=1, help="Number of namespaces to create")

Contributor:

Adding positional arguments without letting the default actually apply (you need to set nargs, e.g. nargs='?', for a positional argument's default to take effect) will probably break all the other pipelines, in my understanding... this comment applies to all the arguments added.

az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --node-taints "slo=true:NoSchedule" --labels slo=true
sleep 300
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --labels slo=true test-np=net-policy-client
# sleep 300

Contributor:

This is going to affect all the other pipelines... please let's be careful!


Author:

Hello, I pushed some test commits yesterday and am cleaning them up. Thanks for pointing this out.
