feat: Add self-managed debug tool #31282
base: main
src/self-managed-debug/src/main.rs
Outdated
    Ok(file_name)
}

fn format_duration(duration: Duration) -> String {
Wondering if there's a better function I can reuse here
You should be able to get a `TimeDelta`, which has a `Display` implementation.
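For context, a hand-rolled helper of the kind being questioned here might look like the sketch below. The body of the PR's `format_duration` is not shown in this thread, so both the kubectl-style output format and the use of `std::time::Duration` (instead of chrono's type, so the snippet compiles on its own) are assumptions.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the PR's helper: renders a duration using its
/// two largest non-zero units, roughly like the AGE column in kubectl.
fn format_duration(duration: Duration) -> String {
    let secs = duration.as_secs();
    let days = secs / 86_400;
    let hours = (secs % 86_400) / 3_600;
    let minutes = (secs % 3_600) / 60;
    let seconds = secs % 60;

    if days > 0 {
        format!("{days}d{hours}h")
    } else if hours > 0 {
        format!("{hours}h{minutes}m")
    } else if minutes > 0 {
        format!("{minutes}m{seconds}s")
    } else {
        format!("{seconds}s")
    }
}
```

Relying on an existing `Display` implementation, as suggested above, would remove the need to maintain this kind of unit arithmetic by hand.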
63632d2 to a6465db
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
- Up next will be getting events, describe output, and rest of TODOs.
a6465db to 2460e24
Left some comments inline. Thanks for taking this on!
A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.
We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)
src/self-managed-debug/Cargo.toml
Outdated
[package]
name = "mz-self-managed-debug"
description = "Debug tool for self-managed Materialize."
version = "0.130.1"
You probably want the current version here and elsewhere (right now it's `0.133.0-dev.0`, but you'll want to update it prior to merging).
Also, adjust lines 41 and 61 in `bin/bump-version` so that the version gets updated automatically. The script takes care of calling `bin/bazel gen` to sync the bazel files.
Actually, I know Marta mentioned we'd most likely want to decouple the tool from Materialize's LTS versions (https://materializeinc.slack.com/archives/C07PN7KSB0T/p1737567670003019?thread_ts=1737566344.547089&cid=C07PN7KSB0T), so we might want to make this just version 0.1.0. I think it's very doable to have the tool work for any version of Materialize.
src/self-managed-debug/src/main.rs
Outdated
use std::fmt::Debug;

use std::process;

use std::sync::LazyLock;

use chrono::{Duration, Utc};
use clap::Parser;
use k8s_openapi::api::apps::v1::{Deployment, ReplicaSet, StatefulSet};
use k8s_openapi::api::networking::v1::{NetworkPolicy, NetworkPolicyPeer, NetworkPolicyPort};
use k8s_openapi::apimachinery::pkg::util::intstr::IntOrString;
use kube::config::KubeConfigOptions;
use kube::{Client, Config};
use tabled::{Style, Table, Tabled};

use mz_build_info::{build_info, BuildInfo};

use mz_ore::cli::{self, CliConfig};

use k8s_openapi::api::core::v1::{ContainerStatus, Pod, Service};

use kube::api::{Api, ListParams, LogParams, ObjectMeta};
use mz_ore::error::ErrorExt;

use std::fs::File;
use std::io::Write;
Nit: Structure the `use` declarations as:
use std::...;
use other-crate::...;
use crate::...;
I.e., three sections: std, other crates, crate, separated by a newline.
src/self-managed-debug/src/main.rs
Outdated
/**
 * Creates a k8s client given a context. If no context is provided, the default context is used.
 */
-/**
- * Creates a k8s client given a context. If no context is provided, the default context is used.
- */
+/// Creates a k8s client given a context. If no context is provided, the default context is used.
src/self-managed-debug/src/main.rs
Outdated
/**
 * Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
 * Returns a list of file names on success.
 */
-/**
- * Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
- * Returns a list of file names on success.
- */
+/// Write k8s pod logs to a file per pod as `mz-pod-logs.<namespace>.<pod-name>.log`.
+/// Returns a list of file names on success.
src/self-managed-debug/src/main.rs
Outdated
{
    let pod_name = pod.metadata.name.clone().unwrap_or_default();
    eprintln!("Failed to process pod {}: {}", pod_name, e);
}
This might not do what you want: the `?` operator returns from the enclosing function, not the block.
Ah, but if we're referring to the usage of `?` in lines 132-168, wouldn't it return from the block, given it's an async block?
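The distinction raised in this exchange can be checked with a small std-only sketch. Inside a plain `{ ... }` block, `?` exits the enclosing function; inside an `async { ... }` block, `?` exits only that block, and its `Result` becomes the block's output. Everything named here (`process_pod`, `process_all`, `run_ready`, `NoopWaker`) is hypothetical and only stands in for the PR's code, which is not shown in full.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to poll futures that never suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future once and returns its output (hypothetical mini-executor).
fn run_ready<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(out) => out,
        Poll::Pending => panic!("this sketch only handles futures that are immediately ready"),
    }
}

/// Hypothetical stand-in for fetching and writing one pod's logs.
fn process_pod(name: &str) -> Result<String, String> {
    if name.is_empty() {
        Err("pod has no name".to_string())
    } else {
        Ok(format!("{name}.log"))
    }
}

/// Processes every pod; one failure is recorded and the loop keeps going.
fn process_all(pods: &[&str]) -> (Vec<String>, Vec<String>) {
    let mut files = Vec::new();
    let mut errors = Vec::new();
    for pod in pods {
        let result = run_ready(async {
            // `?` exits the async block only, not `process_all`.
            let file = process_pod(pod)?;
            Ok::<String, String>(file)
        });
        match result {
            Ok(file) => files.push(file),
            Err(e) => errors.push(e),
        }
    }
    (files, errors)
}
```

If the block were a plain block instead, the same `?` would not compile here, because `process_all` does not return a `Result`; in a function that does, it would abort the whole loop on the first failing pod.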
+1 to everything Moritz said.
src/self-managed-debug/src/main.rs
Outdated
{
    Ok(logs) => logs,
    Err(_) => {
        // If we get a bad request error, try without the previous flag.
I think we probably want to just always try to grab both the current and previous logs.
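One way to read this suggestion is sketched below: request both log streams unconditionally and keep each outcome separately, rather than falling back from one to the other. The `fetch` closure is a hypothetical stand-in for the real Kubernetes log call (its `bool` argument plays the role of a "previous" flag), and `PodLogs` and `collect_pod_logs` are illustrative names, not the PR's.

```rust
/// Outcome of both log requests for one pod; each side is kept separately
/// so a missing previous container does not hide the current logs.
struct PodLogs {
    current: Result<String, String>,
    previous: Result<String, String>,
}

/// Always attempts both requests. `fetch(previous)` is a hypothetical
/// stand-in for the call that retrieves logs from the cluster.
fn collect_pod_logs<F>(mut fetch: F) -> PodLogs
where
    F: FnMut(bool) -> Result<String, String>,
{
    PodLogs {
        current: fetch(false),
        previous: fetch(true),
    }
}
```

A caller could then write whichever sides succeeded to separate files and report the failures, instead of inferring from an error which request to retry.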
src/self-managed-debug/src/main.rs
Outdated
for line in logs.lines() {
    writeln!(file, "{}", line)?;
}
Maybe just write the whole thing in one go? You aren't doing anything with the split lines.
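A minimal sketch of the single-write alternative, assuming `logs` is a `String` (or `&str`) and the destination implements `std::io::Write`; the function name `write_logs` is illustrative only.

```rust
use std::io::{self, Write};

/// Writes the full log text to `out` with one `write_all` call instead of
/// looping over lines.
fn write_logs<W: Write>(mut out: W, logs: &str) -> io::Result<()> {
    out.write_all(logs.as_bytes())
}
```

One behavioral difference to keep in mind: the line-by-line loop normalizes line endings and always ends the file with a newline, while `write_all` preserves the bytes exactly as received.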
src/self-managed-debug/src/main.rs
Outdated
let client = create_k8s_client(args.kubernetes_context.clone()).await?;
// TODO: Make namespaces mandatory
// TODO: Print a warning if namespace doesn't exist
for namespace in args.kubernetes_namespaces {
We probably want to store things in an organized directory structure. Maybe something like:
materialize-debug-{iso_datetime}/{namespace}/{resource_type}/{resource_name}.yaml
or, in the case of logs:
materialize-debug-{iso_datetime}/{namespace}/logs/{pod_name}.{current_or_previous}.log
That makes sense, even if we're zipping at the end!
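The layout proposed in this thread can be sketched with `std::path` alone. The function and type names below (`debug_root`, `resource_path`, `log_path`, `LogKind`) are hypothetical, and the timestamp is passed in as a preformatted string so the sketch does not depend on a date library.

```rust
use std::path::{Path, PathBuf};

/// Which container instance a log file came from.
#[derive(Clone, Copy)]
enum LogKind {
    Current,
    Previous,
}

impl LogKind {
    fn as_str(self) -> &'static str {
        match self {
            LogKind::Current => "current",
            LogKind::Previous => "previous",
        }
    }
}

/// Root directory: materialize-debug-{iso_datetime}
fn debug_root(iso_datetime: &str) -> PathBuf {
    PathBuf::from(format!("materialize-debug-{iso_datetime}"))
}

/// {root}/{namespace}/{resource_type}/{resource_name}.yaml
fn resource_path(root: &Path, namespace: &str, resource_type: &str, resource_name: &str) -> PathBuf {
    root.join(namespace)
        .join(resource_type)
        .join(format!("{resource_name}.yaml"))
}

/// {root}/{namespace}/logs/{pod_name}.{current_or_previous}.log
fn log_path(root: &Path, namespace: &str, pod_name: &str, kind: LogKind) -> PathBuf {
    root.join(namespace)
        .join("logs")
        .join(format!("{pod_name}.{}.log", kind.as_str()))
}
```

Keeping path construction in one place like this means the directory can later be zipped as a unit without any caller needing to know the naming scheme.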
After talking to the cloud team, decided to remove |
Decided high level, human readable logs aren't useful for the cloud team.
- Rename Kubernetes-related CLI arguments to shortform
- Make namespace argument required
- Add error handling for non-existent namespaces
- Add timestamp-based directory structure for log files
- Separate previous and current pod logs into distinct files
- Introduce `Context` struct to track debug tool shared state
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
I hope to merge all changes in a stack but decided to create a PR for early feedback.
Motivation
https://github.com/MaterializeInc/database-issues/issues/8908
Tips for reviewer
cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer
Checklist
`$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.