Network churn Load test - Add network policy enforcement latency measurement #431

Open

agrawaliti wants to merge 27 commits into main

Conversation

@agrawaliti commented Dec 12, 2024

Integrate network policy enforcement latency measurement.

Developed pipelines to compare network policy-related metrics between Azure powered by Cilium and Azure CNI Overlay using Network Policy Manager.
Configuration such as the number of nodes, pods, namespaces, and policies per namespace can be updated in the pipeline.
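For example, a run can be configured with values along these lines (the names here are indicative placeholders, not the exact pipeline parameter names):

node_count: 1000
pods_per_node: 20
no_of_namespaces: 10
network_policies_per_namespace: 100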

Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41

Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad

@agrawaliti marked this pull request as ready for review December 30, 2024 15:19
@agrawaliti changed the title from "Network churn" to "Network churn Load test - Add network policy enforcement latency measurement" Dec 30, 2024
@agrawaliti (Author):

@microsoft-github-policy-service agree company="Microsoft"

@@ -33,6 +33,9 @@ parameters:
- name: run_id
  type: string
  default: ''
- name: run_id_2

Collaborator:
what is run_id_2 for?

@agrawaliti (Author) commented Jan 9, 2025:

I am using two pre-created clusters, one for azure_cilium and one for azure_cni_overlay, and I pass them via run_id and run_id_2. Creating two new 1000-node clusters for every run takes a very long time, so I pass the two cluster tags and run the tests against the existing clusters.

On second thought, I think I can do this with Terraform and schedule it to run periodically.
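For reference, the two tags are plain string parameters on the pipeline, e.g. (run_id_2 shown declared the same way as run_id; the diff above is truncated after its name):

- name: run_id          # tag of the pre-created azure_cilium cluster
  type: string
  default: ''
- name: run_id_2        # tag of the pre-created azure_cni_overlay cluster
  type: string
  default: ''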

@@ -48,6 +51,9 @@ parameters:
- name: ssh_key_enabled
  type: boolean
  default: true
- name: use_secondary_cluster

Collaborator:

what is the secondary cluster for?


variables:
  SCENARIO_TYPE: perf-eval
  SCENARIO_NAME: cilium-network-churn

Collaborator:

network-policy-churn

parameters:
role: net
region: ${{ parameters.regions[0] }}
- template: /steps/engine/clusterloader2/cilium/scale-cluster.yml

Collaborator:

Can we do this in Terraform when setting up the cluster?

@@ -9,27 +9,40 @@ parameters:
- name: run_id
  type: string
  default: ''
- name: run_id_2

Collaborator:

same here

- name: retry_attempt_count
  type: number
  default: 3
- name: credential_type
  type: string
- name: ssh_key_enabled
  type: boolean
- name: use_secondary_cluster

Collaborator:

same here

@jshr-w (Contributor) left a comment:

Let's try to (1) minimize changes that touch code other pipelines use, and (2) after minimizing those changes, run the other automated pipelines off this branch to ensure they aren't broken.

@@ -29,17 +30,17 @@ name: load-config

# Service test
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 4000}}
{{$SMALL_GROUP_SIZE := DefaultParam .SMALL_GROUP_SIZE 20}}
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}

Contributor:

Can we name this CL2_SMALL_GROUP_SIZE to keep the variable naming coordinated?
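i.e. something like:

{{$SMALL_GROUP_SIZE := DefaultParam .CL2_SMALL_GROUP_SIZE 20}}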

{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
{{$smallDeploymentPods := DivideInt $totalPods $namespaces}}

Contributor:

This is going to break all the other tests, right? Could you please restore this, and probably create a parameter for bigDeployments and set that to 0 instead.
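A possible shape for that, reusing the existing parameter (the override mechanism mentioned below is illustrative, not part of this PR):

{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}

and the network-policy pipeline would then pass bigDeploymentsPerNamespace: 0 through its CL2 overrides (or an equivalent parameter) instead of editing the shared template.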

{{$smallDeploymentsPerNamespace := DivideInt $smallDeploymentPods $SMALL_GROUP_SIZE}}

namespace:
  number: {{$namespaces}}
  prefix: slo
  deleteStaleNamespaces: true
  deleteAutomanagedNamespaces: true
  enableExistingNamespaces: false
  enableExistingNamespaces: true

Contributor:

This may break testing. If namespaces weren't deleted by the previous run, we should be aware (many existing pipelines are dependent on this). Let's restore to the original value.

@@ -41,10 +49,11 @@ steps:
- basename: big-deployment
  objectTemplatePath: deployment_template.yaml
  templateFillMap:
    Replicas: {{$bigDeploymentSize}}
    Replicas: {{$bigDeploymentSize}}kube

Contributor:

Is this a typo?

    SvcName: big-service
    Group: {{.Group}}
    deploymentLabel: {{.deploymentLabel}}
{{end}}
- namespaceRange:

Contributor:

We don't want this code to execute if we are not running a network policy test, right? Shouldn't we 'else'-gate this?


Author:

I don't want to run the big deployment for the network test, so I have wrapped the big deployment in {{if not $NETWORK_TEST}}.


Contributor:

I mean, it works for your scenario, but it will break the others -- if NETWORK_TEST is not set, what is stopping the pipeline from running both phases?
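A rough sketch of the gating in question (the network-policy object name below is a placeholder, not something from this PR):

{{if not $NETWORK_TEST}}
- basename: big-deployment
  objectTemplatePath: deployment_template.yaml
  templateFillMap:
    Replicas: {{$bigDeploymentSize}}
{{else}}
- basename: network-policy-deployment   # placeholder for the network-test-only objects
  objectTemplatePath: deployment_template.yaml
{{end}}

so that exactly one of the two branches renders for any given run.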

throughput = 100
nodes_per_namespace = min(node_count, DEFAULT_NODES_PER_NAMESPACE)

Contributor:

I think the changes made to pods_per_node are going to break many of the existing pipelines. After this change, how can the service test use 20 pods per node? We need to be careful adding parameters given the number of pipelines that are dependent on them... IMO the safest way will be to have an IF branch here, and possibly a parameter for pods per node ONLY for the network_test.


Author:

In my opinion, having pods_per_node as a hard-coded constant that may need to change based on the use case is not a good approach. I have added the parameter to the pipeline configuration, so if we need a custom value we can set it in the pipeline; if it is unset, it keeps working as before with the default of 40.


Contributor:

The default isn't the same for all pipelines though, so something would break.

For consistency, let's instead take the same approach as #456 and use the max_pods param to configure pods_per_node for this pipeline.
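Roughly the idea (illustrative only; the exact plumbing follows #456, and the name and values below are placeholders):

- name: max_pods
  type: number
  default: 0    # 0 / unset would mean "keep the existing pods_per_node behaviour"

with only the network-policy pipeline setting it (e.g. max_pods: 20), so every other pipeline keeps its current defaults.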

@@ -142,7 +167,7 @@ def collect_clusterloader2(
"group": None,
"measurement": None,
"result": None,
# "test_details": details,
# # "test_details": details,

Contributor:

nit: typo

parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn")
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test")
parser_configure.add_argument("no_of_namespaces", type=int, default=1, help="Number of namespaces to create")

Contributor:

Adding positional arguments without letting the default actually apply (you need to set nargs, e.g. nargs='?', for a positional argument's default to take effect) will probably break all the other pipelines, in my understanding... this comment applies to all the arguments added.

az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --node-taints "slo=true:NoSchedule" --labels slo=true
sleep 300
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --labels slo=true test-np=net-policy-client
# sleep 300

Contributor:

This is going to affect all the other pipelines... please let's be careful!


Author:

Hello, I pushed some test commits yesterday and am cleaning them up. Thanks for pointing this out.
