[chore] System Semantic Conventions Non-Normative Guidance #1618

Open
wants to merge 8 commits into
base: main

braydonk
Contributor

Changes

This PR adds non-normative guidance from the System Semantic Conventions Working Group. This is added in a new groups folder in non-normative, and a system subfolder in groups. The docs written here were already discussed in a Google doc where we were originally collaborating on this, a link to which can be shared directly if needed.

Merge requirement checklist

@braydonk braydonk requested review from a team as code owners November 26, 2024 15:01
@braydonk braydonk requested a review from a team November 26, 2024 15:01
@braydonk braydonk changed the title System Semantic Conventions Non-Normative Guidance [chore] System Semantic Conventions Non-Normative Guidance Nov 26, 2024
@braydonk braydonk added Skip Changelog Label to skip the changelog check area:system labels Nov 26, 2024
@mx-psi mx-psi self-requested a review November 26, 2024 15:56
Contributor

@lmolkova lmolkova left a comment

I really like this doc!

I don't think we have similar precedents of "why we designed it in this way" documented (the closest analogy is OTEP), but I wish we had more of these.
We might find a better place for it within the repo over time if we end up with more docs like this.


## **Host**

A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU…).
Contributor

unexpected errors due to resource exhaustion

not sure if we have anything defined today and if there is anything general we can provide, but it'd be nice to have some OS network/hw/etc errors and have them on the dashboards/alerts

Member

We have the system.network.errors metric, I don't think we have anything else (I don't know if there is a way to retrieve this, libraries like psutil don't provide this for other stuff like memory or disk AFAIK). Still, I think the existing metrics cover the case of troubleshooting resource exhaustion/malfunction

@braydonk braydonk force-pushed the system_semconv_non_normative branch from e980f13 to e051e87 Compare November 27, 2024 14:30
@braydonk
Contributor Author

Did a first pass of easy comments to address, will make some time soon to go through the comments that require more thought!

Member

@ChrsMark ChrsMark left a comment

LGTM with a question/suggestion.

* General disk and network metrics
* Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:
Member

While the whole description of the rationale here is exactly how it should be, I think we are missing a set of rules/guidelines/sanity checks that would help somebody in the future decide which directory a metric or attribute falls into. This might not be easy to define because of the nature of the problem, but maybe it would be worth adding a section at the bottom suggesting how this kind of situation should be handled in the future.

Contributor Author

I do have a case study below for process.linux.cgroup; perhaps I can adapt this to more general rules?

Contributor Author

Done in 487af83

* Machine name
* ID (relevant to its context, could be a cloud provider ID or just base machine ID)
* OS information (platform, version, architecture, etc)
* Number of CPU cores
Member

Maybe this can be "CPU information" instead? We have a bunch of those here

Member

@mx-psi mx-psi left a comment

Approving, I left a few non-blocking comments above :)

@mx-psi
Member

mx-psi commented Nov 29, 2024

I marked #1403 and #1578 to be closed by this PR, please let me know if this is not right

Contributor

@jsuereth jsuereth left a comment

I love writing this down.

The categorization of "Two Class Design Strategy" I think we should move to general non-normative guidance for all semantic conventions to follow.

@mx-psi
Member

mx-psi commented Dec 19, 2024

What is missing for this to be merged?

@braydonk
Contributor Author

I'm finishing up edits for the remaining open comments, will be pushing this morning.

@braydonk braydonk force-pushed the system_semconv_non_normative branch from e051e87 to 01f43e9 Compare December 19, 2024 18:25
@braydonk
Contributor Author

I've pushed up two new commits:

487af83: Addresses review comments. I will re-request review from those who still had open comments.

01f43e9: To address the issue with the markdown files having really long lines, I have set up Prettier to apply to these markdown files and wrap them at 80 characters. Did this in a separate commit so it wasn't too difficult to see exactly how I addressed open comments.

Comment on lines +250 to +257
For example, there may be `process.linux`, `process.windows`, or `process.posix`
names for metrics and attributes. We will not have root `linux.*`, `windows.*`,
or `posix.*` namespaces. This is because of the principle we’re trying to uphold
from the [Namespaces section](#namespaces); we still want the instrumentation
source to be represented by the root namespace of the attribute/metric. If we
had OS root namespaces, different sources like `system`, `process`, etc. could
get very tangled within each OS namespace, defeating the intended design
philosophy.
Contributor

I'm curious what would be specific problems if we gave up on the prefix and use OS name as a root?

I'm trying to document naming patterns we have in #1708

and I'm actually struggling to understand what benefit the domain prefix brings.

Contributor

@lmolkova lmolkova Dec 21, 2024

E.g. what should I do if I want to describe a property of OS that's indifferent to instrumentation point/source? which namespace would I use?

Contributor Author

I need to refine this to express it more concisely, but in the interest of moving the conversation forward I'm going to braindump everything I have here. I'd like to work together to ensure what I'm saying makes sense and can potentially be refined to be easier to follow.


My concerns right now are theoretical; if they're unfounded for some reason I can revisit it. My reasoning was first laid out in this comment on the cgroup PR: #1364 (comment)

Within the semconv that I am familiar with, the root namespace is what organizes instrumentation into the categories it's meant for. http means that these are signals related to HTTP, db means these are signals about databases, and so on. In System Semconv, we took this a step further because a single system category would have been too broad and contained too many disparate concepts, i.e. we would have had a bunch of system.process, system.memory etc., which would have erased the significance of a system namespace in the first place [1]. So our instrumentation is separated into multiple root namespaces where each root namespace represents the source of the instrumentation, i.e. the process namespace is for instrumentation that is about the operating system's concept of a process, and so on.

With that context in mind, the issues I have with the OS name as a root namespace are:

  • It presents a similar problem of a category being too broad
    • In fairness, it is at least more useful than system containing everything would have been because it would still present some kind of information (linux root namespace would mean this is linux only). However, I think it's still too broad a category, and we'd end up with lots of instrumentation unrelated to each other being within the namespace, i.e. we'd have linux.process, linux.memory, linux.network etc.
  • It separates related instrumentation from each other
    • The benefit of the instrumentation source being the root namespace is that a user who wants to know all the possible instrumentation related to that source only needs to look in one namespace. If platform exclusive metrics were placed into platform namespaces, then to find all existing memory related instrumentation, the user would need to realize they need to look in two namespaces, the memory namespace and the namespace for their platform (and the namespace for their platform would contain lots of other stuff not related to memory).
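As a rough illustration of the two concerns above, here is a minimal Python sketch. The name lists are hypothetical and invented only for this comparison (apart from `process.linux.cgroup`, none of them are taken from the conventions); the helper names `roots_to_search` and `unrelated_in_roots` are also just made up for the sketch. It counts how many root namespaces a user has to look through to find everything about one topic, and how much unrelated instrumentation sits in those roots.

```python
# Hypothetical names, invented for illustration only.
# Layout A: the instrumentation source is the root namespace.
SOURCE_ROOTED = [
    "memory.usage",
    "memory.linux.example_a",
    "process.cpu.time",
    "process.linux.cgroup",
    "network.errors",
]

# Layout B: the same instrumentation, with platform-exclusive
# names moved under an OS root namespace instead.
OS_ROOTED = [
    "memory.usage",
    "linux.memory.example_a",
    "process.cpu.time",
    "linux.process.cgroup",
    "network.errors",
]


def roots_to_search(names, topic):
    """Root namespaces that contain at least one name mentioning `topic`."""
    return {name.split(".")[0] for name in names if topic in name.split(".")}


def unrelated_in_roots(names, topic):
    """Names that live in those roots but are not about `topic`."""
    roots = roots_to_search(names, topic)
    return [
        name
        for name in names
        if name.split(".")[0] in roots and topic not in name.split(".")
    ]


if __name__ == "__main__":
    for label, names in (("source-rooted", SOURCE_ROOTED), ("os-rooted", OS_ROOTED)):
        print(label, sorted(roots_to_search(names, "memory")),
              unrelated_in_roots(names, "memory"))
```

With the source-rooted layout, everything about memory is found under one root; with the OS-rooted layout, the user has to search both `memory` and `linux`, and the `linux` root also contains names that have nothing to do with memory.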

Here's the way I think of it as generic as I can manage:

The end of a semconv name is like an object within a category. The end of the name is basically like saying this is what the name actually represents. In the Collector, where many semconv transitions have not yet happened, much instrumentation doesn't have these namespaces because the receiver they are found within is already a form of organization; if I need to know instrumentation about something, I check the receiver related to that thing and look what's there. Within semconv, the decision was made for everything to be namespaced. This makes sense in a general environment, where you aren't inherently structured and need names to contain organizational context so that you can find the instrumentation you're looking for in a sea of other telemetry.

Given that, I see the goal of the namespaces being logical organization. This means the namespaces should be in order of categorical importance. The "importance" is considered recursively for each sub-namespace.

I'm going to demonstrate this with a name picked at random-ish[2]: go.memory.gc.goal

I'm considering the "identity" of the name to be goal, and each element before that to be a namespace.

You could look at the organization of the name in two directions, and I think it needs to make sense from both directions to be an effective name.

Starting from the identity backwards:
What is goal referring to? It's a garbage collection goal. Garbage collection is a memory management concept in go. Thus the name makes sense in that direction, as goal is contained within gc which is within memory which is within go.

Starting from the root namespace:
I want to know about garbage collection of my Go program. The category that makes the most sense would be go since that's the runtime I want to know about. I want to know about the memory of my Go program, in particular the gc goal. The namespaces are ordered in a way that makes sense for me to discover that information.

To demonstrate the negative example, I could reorder this name to be: memory.go.gc.goal

Starting from the identity backwards:
What is goal referring to? It's a garbage collection goal. This is the garbage collection of a Go program. This is a general memory concept. This kind of works, you can still understand what the goal identity means, but it is a bit broken in the other direction.

Starting from the root namespace:
I want to know about garbage collection of my Go program. If I start with the memory category, there are lots of other unrelated memory metrics within it, within which I need to find the go sub-namespace first before being able to find garbage collection goal. In this case, because the less important memory namespace is used as the root, the category ends up being very broad and finding my Go metric means wading through a lot of things that are not related.
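The two reading directions can be sketched in a few lines of Python. This is only an illustration of the mental model described here, assuming a name is a dotted string whose last element is the "identity"; the function names `split_name`, `read_backwards`, and `read_forwards` are hypothetical helpers, not part of any semconv tooling.

```python
def split_name(name):
    """Split a dotted name into (namespaces, identity)."""
    *namespaces, identity = name.split(".")
    return namespaces, identity


def read_backwards(name):
    """Read from the identity back to the root namespace."""
    namespaces, identity = split_name(name)
    return " within ".join([identity, *reversed(namespaces)])


def read_forwards(name):
    """Read from the root namespace down to the identity."""
    namespaces, identity = split_name(name)
    return " > ".join([*namespaces, identity])


if __name__ == "__main__":
    for name in ("go.memory.gc.goal", "memory.go.gc.goal"):
        print(read_backwards(name))
        print(read_forwards(name))
```

For `go.memory.gc.goal` the backwards reading is "goal within gc within memory within go", which matches the containment described above. For the reordered `memory.go.gc.goal`, the forwards reading starts from the broad `memory` category, which is the direction that breaks down.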

In different contexts, determining what is the "most important" category to use as the root namespace is somewhat subjective. Within System Semconv we came up with a pretty reliable rule, which is that the root namespace represents the source of the instrumentation. In the go.memory.gc.goal case, go (the runtime name) is the source of the instrumentation, so that rule kind of works here too. But I'm not sure how well we can guarantee that rule will generically apply.


what should I do if I want to describe a property of OS that's indifferent to instrumentation point/source? which namespace would I use?

I think in that case I actually consider the instrumentation source to be either the operating system or the general system. So if I had a Windows-exclusive metric that is about the Windows Operating System itself, and there's no cross platform name I could use, I'd probably still use the os namespace, i.e. os.windows.<identity>.


Footnotes:

[1] We might be among the first single working groups to need to do this, but it's not the first time the problem has been encountered in semconv. runtime opted to do the same thing, since runtime.jvm, runtime.nodejs, runtime.go etc. essentially erase the usefulness of runtime as a root category.

[2] I did intentionally pick a runtime metric that looked like it might clash with other namespaces, since that's something that's come up for our group as well.

Contributor

@lmolkova lmolkova Dec 23, 2024

a few points:

  1. The thing being reported is more important than instrumentation source. I hope that current instrumentations are temporary and we'll see more and more native ones. When you think about native instrumentations, things change. E.g.

    • windows itself emits tons of metrics/events. Should they all be reported under os? Some of them describe system performance, some describe user behavior. Do we group all of them under os? It seems redundant. Or just some of them? How do we decide?
    • any database has management/control plane operations that have nothing to do with DB features (auth, permissions, connection management, scaling, etc). Do we report them as db.{mydb}.* attributes/metrics? E.g. db.cassandra.paxos.prepare.duration - the protocol is orthogonal to the DB features of cassandra, but the metric is still about cassandra. Now, do we report db.cassandra.compaction.something (because it's about the database) and cassandra.paxos.prepare.duration because it's about the protocol?

    TL;DR: any specific system/client lib has features that belong to more than one root namespace. How instrumentation is done may change (from specific collector component to native one), but metrics we define should survive it.

  2. Having common root namespace for the "General Class" makes perfect sense to me: everything common about OS goes under system, everything common about databases goes under db.

  3. I'm challenging the "Specialist Class" naming: I'm reporting different metrics related to jvm, everyone who cares knows that jvm is a runtime, runtime in front of it is redundant. If I care about cassandra-specific metrics, I'm no longer in DB domain - cassandra is the root namespace and everything about cassandra goes there

I.e. How strong do we feel about

For example, a metric for a process's cgroup would be `process.linux.cgroup`,
given that cgroups are a specific Linux kernel feature.

Vs linux.cgroup ?

Contributor

[UPDATE]:

The process.linux.cgroup seems to be tricky since process is meaningful there. cgroup is a property of the (linux) process. I believe it's less tricky in most of other semconv cases. Let me see.

Contributor Author

@braydonk braydonk Dec 23, 2024

I didn't cover well enough the point you brought up, which is that there may be a case where the platform/domain name becomes so specific and disparate in itself that it warrants being considered a unique instrumentation source/root namespace/category.

Taking the db.cassandra example, I imagine it would follow a process like this hypothetical scenario:

  1. db.* may have a large number of attributes that can be used cross platform, and our guidance largely expects that they are used as much as possible. For stuff specific to Cassandra some db.cassandra.* stuff starts to get sprinkled into the namespace when necessary.
  2. Over time, more Cassandra specialists join in and start adding more and more db.cassandra attributes. It begins to pollute the db category, such that instrumenting Cassandra in particular has a far greater variety of things very specific to it. The fact that so many db.cassandra.* things are being added implies that these things are already Cassandra-specific, since they should have been using generic attributes when possible.
  3. A decision is made that since Cassandra-specific instrumentation has become so rich and disparate from everything else in the db.* namespace, it makes the most sense to split cassandra out into its own root namespace to represent the richness of the instrumentation available for it and how much it diverges.
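Step 3 of this hypothetical scenario amounts to a prefix rewrite. Below is a minimal Python sketch of what that would look like mechanically. `promote_to_root` is a made-up helper, `db.generic.placeholder` is a placeholder name, and the two `db.cassandra.*` names are the hypothetical examples from earlier in this thread, not real conventions.

```python
def promote_to_root(names, prefix):
    """Rewrite names under `prefix` so its last element becomes the root.

    For example, with prefix "db.cassandra", the name
    "db.cassandra.x.y" becomes "cassandra.x.y". Names outside the
    prefix are returned unchanged.
    """
    new_root = prefix.split(".")[-1]
    promoted = []
    for name in names:
        if name == prefix or name.startswith(prefix + "."):
            # Keep everything after the prefix, swap in the new root.
            promoted.append(new_root + name[len(prefix):])
        else:
            promoted.append(name)
    return promoted


if __name__ == "__main__":
    names = [
        "db.generic.placeholder",               # placeholder generic name
        "db.cassandra.paxos.prepare.duration",  # hypothetical, from this thread
        "db.cassandra.compaction.something",    # hypothetical, from this thread
    ]
    for old, new in zip(names, promote_to_root(names, "db.cassandra")):
        print(f"{old} -> {new}")
```

The rewrite itself is trivial; the hard part is that every renamed signal is a breaking change for anyone already consuming it, which is why the timing relative to stability matters.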

Applying this thought process to the questions in your above comment:

windows itself emits tons of metrics/events. Should they all be reported under os ?

I think the steps from above could still occur in the same order. We'd try to use as many generic attributes as possible, introducing os.windows.* where it is specifically necessary. If the os.windows category becomes so deep and disparate from everything else, then it might make sense to move it into its own namespace.

How strong do we feel about process.linux.cgroup vs linux.cgroup

It is essentially for the reason you stated in your update; this is a bit of a different case because whether Linux exclusive or not, the process will always be the instrumentation source. And where something like os and db may be more like categories than an explicit object being instrumented, process will always be the most important root namespace and linux.cgroup on its own wouldn't carry the same semantic meaning.

That being said, there is a case to be made for cgroup here, as cgroup is itself a potentially instrumentable source. It's (hopefully) only a matter of time before semantic conventions come along to instrument cgroups in more detail. In that case, I wonder if that would start in linux.cgroup or if it would just be cgroup off the bat. That I'm not sure about.


The problem I foresee with my own ideas here is that they assume things can be moved around easily. I think the problem with my thought process is that it sort of requires foresight to make sure that there isn't the potential for us to want to extract a platform namespace before reaching the point of stability for another namespace. So all of this probably needs some more thought and discussion. Maybe when I come back from the holidays I'll have the answer (probably not but I can dream 😃).

Contributor

Given that some of the things we're discussing (db.cosmosdb) are reasonably close to stability, I think we should try to envision as many things as we can - we might not have another chance in the next few years 🤞

I don't think we need a rigid naming policy - i.e. process.linux.cgroup can be whatever makes the most sense for it.
I'll bring up the general naming to the Semconv SIG after the break.

Happy holidays!

@lmolkova
Contributor

PTAL at the related #1707 - it's my attempt to document overall semconv guidance (only attribute definition so far). There are some intersections.

github-actions bot commented Jan 9, 2025

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 9, 2025
Labels
area:system Skip Changelog Label to skip the changelog check Stale
Projects
Status: Needs More Approval
9 participants