Network churn Load test - Add network policy enforcement latency measurement #431
base: main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
@@ -33,6 +33,9 @@ parameters:
  - name: run_id
    type: string
    default: ''
+ - name: run_id_2
what is run_id_2 for?
I am using two different pre-created clusters for azure_cilium and azure_cni_overlay, and I am passing them in via run_id and run_id_2. Creating two new clusters with 1000 nodes each for every run takes a very long time, so I pass the two cluster tags and run the tests on them instead.
On second thought, I think I can do this with Terraform and schedule it to run periodically.
@@ -48,6 +51,9 @@ parameters:
  - name: ssh_key_enabled
    type: boolean
    default: true
+ - name: use_secondary_cluster
what is the secondary cluster for?
variables:
  SCENARIO_TYPE: perf-eval
  SCENARIO_NAME: cilium-network-churn
network-policy-churn
    parameters:
      role: net
      region: ${{ parameters.regions[0] }}
  - template: /steps/engine/clusterloader2/cilium/scale-cluster.yml
can we do this in terraform when setting up the cluster?
@@ -9,27 +9,40 @@ parameters:
  - name: run_id
    type: string
    default: ''
+ - name: run_id_2
same here
  - name: retry_attempt_count
    type: number
    default: 3
  - name: credential_type
    type: string
  - name: ssh_key_enabled
    type: boolean
+ - name: use_secondary_cluster
same here
Let's try to (1) minimize changes that touch code that other pipelines use, and (2) after minimizing those changes, run the other automated pipelines off this branch to ensure they aren't broken.
@@ -29,17 +30,17 @@ name: load-config

# Service test
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 4000}}
-{{$SMALL_GROUP_SIZE := DefaultParam .SMALL_GROUP_SIZE 20}}
+{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}
Can we name this CL2_SMALL_GROUP_SIZE to keep the variable naming coordinated?
{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
-{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
+{{$smallDeploymentPods := DivideInt $totalPods $namespaces}}
This is going to break all the other tests, right? Could you please restore this, and probably create a parameter for bigDeployments and set that to 0 instead.
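A minimal sketch of the arithmetic behind that suggestion (the parameter wiring and the example numbers are assumptions, not code from this repo): restoring the original formula and driving it with a bigDeploymentsPerNamespace parameter that the network-policy test sets to 0 keeps the pod math unchanged for every other pipeline.

```python
# Hedged sketch: mirrors the template math above, not the actual CL2 Go template.
BIG_GROUP_SIZE = 4000  # same default as in the template

def small_deployment_pods(pods_per_namespace: int, big_deployments_per_namespace: int) -> int:
    # Mirrors: SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)
    return pods_per_namespace - big_deployments_per_namespace * BIG_GROUP_SIZE

print(small_deployment_pods(5000, 1))  # existing pipelines (one big deployment): 1000
print(small_deployment_pods(5000, 0))  # network-policy test with the parameter set to 0: 5000
```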
{{$smallDeploymentsPerNamespace := DivideInt $smallDeploymentPods $SMALL_GROUP_SIZE}}

namespace:
  number: {{$namespaces}}
  prefix: slo
  deleteStaleNamespaces: true
  deleteAutomanagedNamespaces: true
-  enableExistingNamespaces: false
+  enableExistingNamespaces: true
This may break testing. If namespaces weren't deleted by the previous run, we should be aware (many existing pipelines are dependent on this). Let's restore to the original value.
@@ -41,10 +49,11 @@ steps:
- basename: big-deployment
  objectTemplatePath: deployment_template.yaml
  templateFillMap:
-   Replicas: {{$bigDeploymentSize}}
+   Replicas: {{$bigDeploymentSize}}kube
Is this a typo?
    SvcName: big-service
    Group: {{.Group}}
    deploymentLabel: {{.deploymentLabel}}
{{end}}
- namespaceRange:
We don't want this code to execute if we are not running a network policy test, right? Shouldn't we 'else'-gate this?
I don't want to run bigDeployment for the network test, so I have added {{if not $NETWORK_TEST}} around the big deployment.
I mean, it works for your scenario but it will break the others -- if not NETWORK_TEST, what is stopping the pipeline from running both phases?
    throughput = 100
    nodes_per_namespace = min(node_count, DEFAULT_NODES_PER_NAMESPACE)
I think the changes made to pods_per_node are going to break many of the existing pipelines. After this change, how can the service test use 20 pods per node? We need to be careful adding parameters given the number of pipelines that are dependent on them... IMO the safest way will be to have an IF branch here, and possibly a parameter for pods per node ONLY for the network_test.
In my opinion, having pods_per_node as a hard constant that can change frequently based on use case is not a good approach. I have added it as a parameter in the pipeline configuration, so if we need a custom value we can set it in the pipeline; if unset, it keeps working as before with the default of 40.
The default isn't the same for all pipelines though, so something would break.
For consistency, let's instead take the same approach as #456 and use the max_pods param to configure pods_per_node for this pipeline.
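A minimal sketch of that approach (names such as DEFAULT_PODS_PER_NODE, max_pods, and the network-test flag are assumptions here, not the repo's actual code): keep the hard-coded default for every existing pipeline and only let this pipeline override it via max_pods.

```python
# Hedged sketch, not the actual config script: only the network-policy pipeline
# overrides pods_per_node, following the max_pods parameter approach from #456.
DEFAULT_PODS_PER_NODE = 40  # assumed existing default

def resolve_pods_per_node(network_test: bool, max_pods: int = 0) -> int:
    if network_test and max_pods > 0:
        # the network-policy churn pipeline passes max_pods explicitly
        return max_pods
    # every other pipeline keeps the existing hard-coded behaviour
    return DEFAULT_PODS_PER_NODE

print(resolve_pods_per_node(network_test=False))              # 40, unchanged for existing pipelines
print(resolve_pods_per_node(network_test=True, max_pods=20))  # 20 for the network-policy test
```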
@@ -142,7 +167,7 @@ def collect_clusterloader2(
    "group": None,
    "measurement": None,
    "result": None,
-    # "test_details": details,
+    # # "test_details": details,
nit: typo
parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn") | ||
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test") | ||
parser_configure.add_argument("no_of_namespaces", type=int, default=1, help="Number of namespaces to create") |
Adding positional arguments without setting nargs means the default is never applied (the argument stays required), which will probably break all the other pipelines, as I understand it... this comment applies to all of the arguments added.
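A minimal sketch of the nargs point (argument names copied from the diff above; the standalone parser setup and example invocation are assumptions): making the new positional optional with nargs="?" plus a default lets older pipelines that never pass it keep working.

```python
import argparse

# Hedged sketch, not the repo's actual parser wiring.
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="command")
parser_configure = subparsers.add_parser("configure")
parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn")
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test")
# nargs="?" makes the positional optional, so existing callers that omit it fall back to the default
parser_configure.add_argument("no_of_namespaces", type=int, nargs="?", default=1,
                              help="Number of namespaces to create")

args = parser.parse_args(["configure", "3", "30m"])  # old-style invocation without the new argument
print(args.no_of_namespaces)  # -> 1
```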
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --node-taints "slo=true:NoSchedule" --labels slo=true
sleep 300
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --labels slo=true test-np=net-policy-client
# sleep 300
This is going to affect all the other pipelines... please let's be careful!
Hello, I pushed some test commits yesterday and I am cleaning them up. Thanks for pointing this out.
…st deployment template for network policy enforcement and refine cluster scaling parameters.
…late path for cluster scaling.
…d fix string formatting
…ing in execute.yml
…ility in configuration
…E for improved clarity
Integrate network policy enforcement latency measurement.
Developed pipelines to compare network policy-related metrics between Azure powered by Cilium and Azure powered by CNI Overlay using Network Policy Manager.
All of the configuration, such as nodes, pods, namespaces, and the number of policies per namespace, can be updated in the pipeline.
Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41
Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad