feat: Add self-managed debug tool #31282

Open
wants to merge 5 commits into
base: main

SangJunBak
Contributor

@SangJunBak SangJunBak commented Feb 4, 2025

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:

  • Collect pod logs
  • Collect info similar to `kubectl get all -o wide`
  • Up next will be getting events, describe output, rest of TODOs, and tests

I hope to merge all changes in a stack but decided to create a PR for early feedback.

Motivation

  • This PR adds a known-desirable feature.

https://github.com/MaterializeInc/database-issues/issues/8908

Tips for reviewer

  • For now, I'd like to keep everything in one file and split it up later
  • Curious if the amount of code for something like this is normal!
  • To quickly test, run `cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer`

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from a team February 4, 2025 04:50
@SangJunBak SangJunBak requested a review from ParkMyCar as a code owner February 4, 2025 04:50
Ok(file_name)
}

fn format_duration(duration: Duration) -> String {
Contributor Author

Wondering if there's a better function I can reuse here

Member

You should be able to get a TimeDelta, which has a Display implementation.

@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from 63632d2 to a6465db Compare February 4, 2025 04:54
@SangJunBak SangJunBak removed the request for review from ParkMyCar February 4, 2025 05:11
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
- Up next will be getting events, describe output, and rest of TODOs.
@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from a6465db to 2460e24 Compare February 4, 2025 05:44
@SangJunBak SangJunBak requested review from a team, aljoscha and morsapaes as code owners February 4, 2025 05:44
@SangJunBak SangJunBak changed the base branch from lts-v0.130 to main February 4, 2025 05:45
@SangJunBak SangJunBak removed request for a team, aljoscha and morsapaes February 4, 2025 05:45
Member

@antiguru antiguru left a comment

Left some comments inline. Thanks for taking this on!

A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.

We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)

[package]
name = "mz-self-managed-debug"
description = "Debug tool for self-managed Materialize."
version = "0.130.1"
Member

You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0, but you'll want to update it prior to merging.)

Also, adjust lines 41 and 61 in bin/bump-version so that the version gets updated automatically. The script takes care of calling bin/bazel gen to sync the Bazel files.

Contributor Author

Actually, I know Marta mentioned we'd most likely want to decouple the tool from LTS versions/Materialize (https://materializeinc.slack.com/archives/C07PN7KSB0T/p1737567670003019?thread_ts=1737566344.547089&cid=C07PN7KSB0T), so we might want to make this just version 0.1.0. I think it's very doable to have the tool work for any version of Materialize.

Comment on lines 12 to 37
use std::fmt::Debug;

use std::process;

use std::sync::LazyLock;

use chrono::{Duration, Utc};
use clap::Parser;
use k8s_openapi::api::apps::v1::{Deployment, ReplicaSet, StatefulSet};
use k8s_openapi::api::networking::v1::{NetworkPolicy, NetworkPolicyPeer, NetworkPolicyPort};
use k8s_openapi::apimachinery::pkg::util::intstr::IntOrString;
use kube::config::KubeConfigOptions;
use kube::{Client, Config};
use tabled::{Style, Table, Tabled};

use mz_build_info::{build_info, BuildInfo};

use mz_ore::cli::{self, CliConfig};

use k8s_openapi::api::core::v1::{ContainerStatus, Pod, Service};

use kube::api::{Api, ListParams, LogParams, ObjectMeta};
use mz_ore::error::ErrorExt;

use std::fs::File;
use std::io::Write;
Member

Nit: Structure the `use` statements as:

use std::...;

use other-crate::...;

use crate::...;

I.e., three sections: std, other crates, crate, separated by a newline.

Comment on lines 101 to 103
/**
* Creates a k8s client given a context. If no context is provided, the default context is used.
*/
Member

Suggested change
/**
* Creates a k8s client given a context. If no context is provided, the default context is used.
*/
/// Creates a k8s client given a context. If no context is provided, the default context is used.

Comment on lines 117 to 120
/**
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
* Returns a list of file names on success.
*/
Member

Suggested change
/**
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
* Returns a list of file names on success.
*/
/// Write k8s pod logs to a file per pod as `mz-pod-logs.<namespace>.<pod-name>.log`.
/// Returns a list of file names on success.

Comment on lines 171 to 174
{
let pod_name = pod.metadata.name.clone().unwrap_or_default();
eprintln!("Failed to process pod {}: {}", pod_name, e);
}
Member

This might not do what you want: the ? operator returns from the enclosing function, not the block.

Contributor Author

Ah but if we're referring to the usage of ? in lines 132-168, wouldn't it return from the block given it's an async block?
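
To illustrate the distinction being discussed, here is a minimal, std-only sketch. The function names and the tiny `block_on` helper are hypothetical stand-ins and are not part of the PR; the real tool would use its async runtime instead. It shows `?` leaving the whole function when used in a plain loop body, but leaving only the block when used inside an `async` block.

```rust
use std::future::Future;
use std::num::ParseIntError;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to drive futures that never suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for an async runtime. It busy-polls, so it is only
/// suitable for futures that complete without waiting on anything.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// `?` in a plain loop body returns from the whole function on the first error.
fn sum_all_or_nothing(inputs: &[&str]) -> Result<u32, ParseIntError> {
    let mut total = 0;
    for s in inputs {
        total += s.parse::<u32>()?;
    }
    Ok(total)
}

/// `?` inside an `async` block only exits that block, so the loop continues
/// and the failure can be reported per item.
fn sum_skipping_failures(inputs: &[&str]) -> (u32, usize) {
    let (mut total, mut failures) = (0, 0);
    for s in inputs {
        let result = block_on(async {
            let n = s.parse::<u32>()?;
            Ok::<u32, ParseIntError>(n)
        });
        match result {
            Ok(n) => total += n,
            Err(e) => {
                failures += 1;
                eprintln!("Failed to process {s}: {e}");
            }
        }
    }
    (total, failures)
}
```

With the inputs `["1", "x", "2"]`, the first function returns an error as soon as it hits `"x"`, while the second keeps going and reports one failure. The `Ok::<u32, ParseIntError>` annotation is needed because an `async` block has no signature from which the error type could be inferred.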

Contributor

@alex-hunt-materialize alex-hunt-materialize left a comment

+1 to everything Moritz said.

{
Ok(logs) => logs,
Err(_) => {
// If we get a bad request error, try without the previous flag.
Contributor

I think we probably want to just always try to grab both the current and previous logs.
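
One way this could look, as a rough std-only sketch: request both streams unconditionally and treat a failed request (for example, no previous container) as something to report rather than a reason to fall back. `LogKind`, `collect_logs`, and the `fetch` closure are hypothetical names; `fetch` stands in for the actual Kubernetes log request, which is not shown here.

```rust
/// Which container instance a log stream belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogKind {
    Current,
    Previous,
}

/// Requests both log streams through `fetch` (a hypothetical stand-in for the
/// real log request) and keeps whichever ones succeed.
fn collect_logs<F>(fetch: F) -> Vec<(LogKind, String)>
where
    F: Fn(LogKind) -> Result<String, String>,
{
    let mut collected = Vec::new();
    for kind in [LogKind::Current, LogKind::Previous] {
        match fetch(kind) {
            Ok(logs) => collected.push((kind, logs)),
            // A missing stream is reported but does not abort the other one.
            Err(e) => eprintln!("Skipping {kind:?} logs: {e}"),
        }
    }
    collected
}
```

Keeping the kind next to each payload also makes it straightforward to write the two streams to separate files later.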


for line in logs.lines() {
writeln!(file, "{}", line)?;
}
Contributor

Maybe just write the whole thing in one go? You aren't doing anything with the split lines.
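
A minimal sketch of what that might look like, assuming the logs are already held in memory as a string; `write_logs` is a hypothetical helper name, not a function from the PR.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Writes the whole log payload with a single `write_all` call instead of
/// splitting it into lines first.
fn write_logs(path: &Path, logs: &str) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(logs.as_bytes())
}
```

Besides being shorter, this keeps the bytes exactly as received. The line-by-line version rewrites `\r\n` endings as `\n` and always ends the file with a newline, because `str::lines` strips the line terminators and `writeln!` adds its own.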

let client = create_k8s_client(args.kubernetes_context.clone()).await?;
// TODO: Make namespaces mandatory
// TODO: Print a warning if namespace doesn't exist
for namespace in args.kubernetes_namespaces {
Contributor

We probably want to store things in an organized directory structure. Maybe something like:
materialize-debug-{iso_datetime}/{namespace}/{resource_type}/{resource_name}.yaml

or, in the case of logs:
materialize-debug-{iso_datetime}/{namespace}/logs/{pod_name}.{current_or_previous}.log
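
As a sketch of how those paths could be built with the standard library alone (the function names `resource_path` and `log_path` are hypothetical, and the timestamp is passed in as a preformatted string):

```rust
use std::path::PathBuf;

/// Builds `materialize-debug-{iso_datetime}/{namespace}/{resource_type}/{resource_name}.yaml`.
fn resource_path(
    iso_datetime: &str,
    namespace: &str,
    resource_type: &str,
    resource_name: &str,
) -> PathBuf {
    PathBuf::from(format!("materialize-debug-{iso_datetime}"))
        .join(namespace)
        .join(resource_type)
        .join(format!("{resource_name}.yaml"))
}

/// Builds `materialize-debug-{iso_datetime}/{namespace}/logs/{pod_name}.{current_or_previous}.log`.
fn log_path(iso_datetime: &str, namespace: &str, pod_name: &str, previous: bool) -> PathBuf {
    let which = if previous { "previous" } else { "current" };
    PathBuf::from(format!("materialize-debug-{iso_datetime}"))
        .join(namespace)
        .join("logs")
        .join(format!("{pod_name}.{which}.log"))
}
```

The parent directories would still need to be created before writing, for example with `std::fs::create_dir_all`. If the archive may be unpacked on Windows, a timestamp without colons (such as `20250204T054400Z`) is safer, since colons are not allowed in Windows file names.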

Contributor Author

That makes sense, even if we're zipping at the end!

@SangJunBak
Contributor Author

After talking to the cloud team, I decided to remove dump_k8s_get_all; we can add it back later if we need it.

Decided high level, human readable logs aren't useful for the cloud team.
- Rename Kubernetes-related CLI arguments to shortform
- Make namespace argument required
- Add error handling for non-existent namespaces
- Add timestamp-based directory structure for log files
- Separate previous and current pod logs into distinct files
- Introduce `Context` struct to track debug tool shared state