Metric groups #36

emschwartz · 2023-05-02T12:10:29Z

emschwartz
May 2, 2023

@akesling suggested that we might want to generalize how we're handling SLOs now to support metric groups.

His point was that you might want to group a set of function-level metrics together first so that you can monitor them. Then, you might later want to decide on an SLO for them and attach the SLO to the group (or vice versa).

The main way that we would use these types of groups right now would probably be to have a Grafana dashboard that shows you a row for each group, similar to what we have for SLOs but without the targets attached.

What do folks think? Does this sound useful? Should we do this now or hold off until later?

IvanMerrill · 2023-05-02T13:26:20Z

IvanMerrill
May 2, 2023

I think this could be super useful! In my head I'm thinking of logical groupings for things like:

Microservice. This would allow you to easily see all the instrumented functions belonging to a particular microservice. I know that there are module name / function name labels already but these can be reused.
Overall functionality, e.g. 'make a payment'. Quickly see everything to do with delivering a particular piece of functionality end-to-end.
Supporting team / team responsible for writing the code. This would be particularly useful in incidents to identify who to contact for specific functions.

To me having such an ability really drives to the point of making the metrics useful, so yeah, great idea!

11 replies

IvanMerrill Jun 14, 2023

For PagerDuty specifically, you create a service, and alerts are passed to it using a key tied to that specific service. You can then associate on-call schedules with these services. In this case, what would be required is a way for Alertmanager to route an alert based on which service the alert is for. This can easily be based on the SLO name or a label attached to the SLO in the recording rule, and would need to be defined at the SLO level I guess, as I'm not sure how we could get any function-level label through to Alertmanager.

emschwartz Jun 14, 2023
Author

@IvanMerrill's point raises the question of whether you would route alerts based on:

1. the name of the service
2. the name of the SLO or
3. a group of functions that's a subset of the SLO

If it's 1 or 2, then we wouldn't need any additional feature to support this routing. I think you would just configure PagerDuty or Alertmanager to route the alerts based on those details, which would be available as labels on the firing alert.

3 is the only case that might call for a separate group feature. However, all the metrics are aggregated to produce the time series that triggers an alert. I don't think it's possible to have labels from subsets of the metrics bubble up to the level of the alert. So I think this use case wouldn't actually work.

On top of that, I'm not sure it really makes sense to have two sets of functions within the same service maintained by different teams where you'd want all of them to be part of the same SLO. Wouldn't you just have separate microservices or at least separate SLOs?

akesling Jun 14, 2023

On top of that, I'm not sure it really makes sense to have two sets of functions within the same service maintained by different teams where you'd want all of them to be part of the same SLO. Wouldn't you just have separate microservices or at least separate SLOs?

SLO alerts map to production service maintainers, but production service maintainers are not always code maintainers. See the Google SRE model. That can mean any of:

Alerts are handled exclusively by the code maintainers, so those editing SLO objects are those receiving the alert.
Alerts are handled exclusively by some production maintenance team (read: SRE), so those editing SLO objects may not actually be the ones receiving the alerts.
Alerts are handled by multiple separate teams, being any mix of code and non-code maintainers. This is the weird one, because you could imagine the same function feeding into multiple different SLOs, which in turn may alert multiple separate teams.

So... keep in mind that not only may the "same service" be maintained by multiple teams in different ways (notably code vs. production in a "simple" case), the same code may actually be relevant to different services as well.

My personal taste has always been to split service descriptions so all SLOs are 1:1 with production maintainer responsibilities (i.e. each SLO has exactly one definitive owner). In the case of Autometrics, that would mean having separate SLO definitions for each independent alertable service.

emschwartz Jun 15, 2023
Author

Makes sense. So, we can assume alert routing can be done at the level of the SLO.

I think this is the one part that may be actionable for autometrics:

the same code may actually be relevant to different services as well.

If it's library code, it seems somewhat unlikely that you'd have an SLO defined in it, no?

If it's a binary that multiple teams might run, we might want to make it so the SLO details can be loaded from environment variables at runtime. The same goes for the service name

akesling Jun 15, 2023

If it's a binary that multiple teams might run, we might want to make it so the SLO details can be loaded from environment variables at runtime. The same goes for the service name

Just as long as the user can still set configuration dynamically themselves without worrying about environment variables ;). The same underlying dynamic API can enable use of flags, config files, etc.
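To make the layering concrete, here is a minimal sketch of one dynamic lookup that an explicit setting, an environment variable, or a default can all feed into. The function name, the environment variable name, and the fallback value are all hypothetical and not part of any autometrics library.

```python
import os

# Both names below are hypothetical, chosen only for this illustration.
ENV_VAR = "EXAMPLE_SERVICE_NAME"
DEFAULT_SERVICE_NAME = "unknown_service"


def resolve_service_name(explicit=None, environ=None):
    """Pick the service name: explicit value first, then env var, then default.

    `explicit` is whatever the user set in code (possibly sourced from a
    flag or a config file). `environ` is injectable so the lookup can be
    exercised without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    if explicit:
        return explicit
    return environ.get(ENV_VAR) or DEFAULT_SERVICE_NAME
```

With this shape, a value set in code always wins, and users who never touch environment variables simply get the default.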

mies · 2023-05-02T13:26:23Z

mies
May 2, 2023
Maintainer

I also mentioned this idea; for instance at Fiberplane we have several functions responsible for our real-time component. Grouping these together (and later on attaching an SLO to them) enables you to reason about their performance and availability as a unit of work.

5 replies

emschwartz Jun 13, 2023
Author

This is an interesting one. I had initially thought that this was the main use case for this feature. However, I'm now wondering if we can achieve similar ends but in a more automatic way that requires less thought from developers.

In this particular case, the question is whether the module contains enough information to achieve this grouping. If it doesn't, could you refactor your code so that it did?

For Fiberplane's API, the module does tell us enough to group the HTTP handlers and the websocket handlers.

The reason I'm thinking along these lines is that I'm worried about adding a feature that a) people might not use and b) one that requires people to think. The thing I love about the build_info feature is that you basically don't have to do anything but you get more powerful queries for free. The analog here would be if we can extract info from a label that's already there and created automatically to build more powerful queries for you.

hatchan Jun 14, 2023
Maintainer

I think it is a good example of how a module might not always be the right grouping.

Take the hypothetical example where we want a view of all the handlers, so that they're viewable in a graph, but also want some SLOs attached to them. In our case we have the issue that not all handlers are in the same module, so we need to use a prefix of the module and then get all the functions that are in there. But that leads to the second issue that not all the functions in these modules are a handler, so you'd get functions that you are not interested in.

This example is about handlers, but you could come up with something similar for our data layer or some other middleware.

I'm not sure what the issue is with not all users using this. If this is an opt-in feature, users can upgrade into it when their code base requires it. I would rather have that than require the user to refactor their code for a specific feature.

IvanMerrill Jun 14, 2023

I think the level of grouping we're missing is the app or service grouping. If you use the kube-prometheus stuff then you get the node exporter attaching labels from each container to the metrics, which is great. You could also do this via the job config or the OTel collector in a push configuration. However, these are defined within the infrastructure, which is separate from the application, and the application is where autometrics works. For all the support stuff, and for many use cases, having another single label that works at a higher level than module and can be applied uniformly across the code base makes sense to me. This solves the support use case and the microservice use case. I do agree that there might be a good reason to have this as an environment variable that the code reads.

emschwartz Jun 14, 2023
Author

In our case we have the issue that not all handlers are in the same module, so we need to use a prefix of the module and then get all the functions that are in there. But that leads to the second issue that not all the functions in these modules are a handler, so you'd get functions that you are not interested in.

That's a fair point.

I think the thing I'm still not quite seeing is what value we'd get out of this grouping. Would we have any reason to ever look at these groups separately? What would we do with that information?

I'm not sure what the issue is with not all users using this.

The issue isn't with it not being used by all users but by any, or by enough that it seems worth implementing. For any new feature we add, we need to replicate it across all 6 (and hopefully counting) implementations and we need to fully explain it in the docs. I just want to make sure that whatever we add actually adds value and we can point to how people (ideally including ourselves) would use the feature and get value out of it before implementing.

IvanMerrill Jun 14, 2023

I'm not sure what the issue is with not all users using this. If this is an opt-in feature, users can upgrade into it when their code base requires it. I would rather have that than require the user to refactor their code for a specific feature.

Yeah, this is how I see it. I don't see anyone looking to use groups 'just because it's there'. To me this is a feature that would be used when people already have an idea of how they would group the functions and want a way to visually represent this group, but don't want to turn them into an SLO.

I also see grouping not as a way to create a composite metric in the same way as an SLO but as a way to view similar metrics together in one place, i.e. an SLO is 'here is the aggregated response time across all of these functions' whereas the group would be 'here are the response times of all these functions'. I'm not saying there would never be a use case for a composite metric within a group, but in my mind it's more likely that you'd want to see all similar metrics for each function in the group alongside each other.

emschwartz · 2023-05-02T13:29:11Z

emschwartz
May 2, 2023
Author

A little detail on how we could implement this:

I think we'd want to support arbitrary nesting and overlapping of groups. The way to do this with Prometheus labels is pretty fun -- and, amazingly but maybe not surprisingly, built on another idea I got from Brian Brazil's blog: Negative lookahead assertions in PromQL selectors.

When you're attaching multiple group labels to a metric, you would join all of them into a single label value with some separator like a space. PromQL regexes intentionally don't support lookaheads so we need Brian Brazil's trick to be able to query for metrics that belong to multiple groups. We would use multiple regex label selectors in the query {group=~"group_1", group=~"group_2"}.

Another detail that I think is kind of neat about this is that this adds labels without adding cardinality. The metrics for a given function would always be produced with the group labels of all of the groups it's part of. So in Prometheus, there would only be a single time series with all of the different labels attached. The only time it would need to start a new time series is if you changed the group membership for a function, but then the old time series would be removed from memory after a little while.
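As a rough illustration of the encoding described above, the sketch below joins group names into one label value and builds one regex matcher per required group. One detail worth keeping in mind is that Prometheus anchors regex matchers at both ends, so a bare group=~"group_1" would only match a value that is exactly group_1; the pattern has to allow for other entries before and after it. The helper names and the restriction on group names are assumptions made for this sketch, and Python's re.fullmatch is only a local stand-in for the anchored match.

```python
import re

SEPARATOR = " "  # assumed separator between group names
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")  # assumed restriction, keeps regexes simple


def encode_groups(groups):
    """Join group names into one label value, sorted so the value is stable."""
    for name in groups:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid group name: {name!r}")
    return SEPARATOR.join(sorted(set(groups)))


def group_pattern(group):
    """Regex matching a whole label value that contains `group` as one entry."""
    return f"(.*{SEPARATOR})?{group}({SEPARATOR}.*)?"


def selector(groups):
    """Build a PromQL selector requiring membership in every listed group."""
    matchers = ", ".join(f'group=~"{group_pattern(g)}"' for g in groups)
    return "{" + matchers + "}"


def belongs_to_all(label_value, groups):
    """Check a label value locally; fullmatch mimics the anchored matching."""
    return all(re.fullmatch(group_pattern(g), label_value) for g in groups)
```

Because the separator is required on either side of the entry, a function in group_10 is not mistaken for a member of group_1.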

10 replies

emschwartz Jun 14, 2023
Author

I'd find that confusing because we're not necessarily allowing (or trying to allow) you to add arbitrary labels to metrics. I think the idea is to let you add some additional labels but that have enough meaning to us that we can build queries and visualizations on top of them that would actually be helpful to you.

hatchan Jun 14, 2023
Maintainer

I'd find that confusing because we're not necessarily allowing (or trying to allow) you to add arbitrary labels to metrics. I think the idea is to let you add some additional labels but that have enough meaning to us that we can build queries and visualizations on top of them that would actually be helpful to you.

I'm not saying we should expose it as just "labels"; we could still expose this as "groups" and just use "labels" as the storage. FYI: not sure if this is actually a good idea, just wanted to float it out there :)

emschwartz Jun 14, 2023
Author

Wait, sorry, now I'm confused. Are you talking about referring to it in the code as "labels" or storing groups as top-level labels?

hatchan Jun 14, 2023
Maintainer

In the time series, have a label called labels. Then in the libraries and tooling around it call it groups (and other things in the future can also use this). Then when you add a group, encode it in a specific way and store it in the labels label. As I said, probably not a really good idea.

emschwartz Jun 14, 2023
Author

Ah, gotcha. Yeah, I agree that's probably not a good idea :)

P2P-Nathan · 2023-05-26T22:35:56Z

P2P-Nathan
May 26, 2023
Collaborator

All three options that @IvanMerrill pointed out stand out as great candidates. A fourth that I haven't made up my mind about would be a grouping for functionality that depends on some flaky or external resource. Almost like a troubleshooting hint: "If this is broken, always look here first". Some of those dependencies are so much clearer when you're coding than when it's in production.

The great thing with groups being flexible is that users may start to find interesting reasons to group calls that we haven't thought of.

8 replies

IvanMerrill Jun 13, 2023

Auto-instrumentation in APM agents has the ability to understand where an HTTP call is being made (for example, but this works for other protocols too), and then provide details such as those mentioned (URL, maybe the SQL query, latency, errors etc). They're generally doing this by instrumenting the library making the call i.e not the application code but the actual library used to make the HTTP call.

I think a simple way to get started would be just to have an extra label that states what external source is being called and leave it at that e.g. external-service=paymentDB. Knowing that this function calls an external service, which external service is being called and a proxy for latency and errors (i.e the latency and error rate for the function making the call) is a great starting point. Being able to group these then to show all functions making calls to external services is then a great further step.

hatchan Jun 14, 2023
Maintainer

I think a simple way to get started would be just to have an extra label that states what external source is being called and leave it at that e.g. external-service=paymentDB. Knowing that this function calls an external service, which external service is being called and a proxy for latency and errors (i.e the latency and error rate for the function making the call) is a great starting point. Being able to group these then to show all functions making calls to external services is then a great further step.

I think in that case we should just use groups, instead of introducing another concept named external-service.

I'm imagining you'd want to know a) that a function depends on an external service b) some top-level identifier for the service like the domain name and c) maybe the more specific thing like the full URL. This could apply to APIs outside of your company, or maybe even just other microservices that might be owned by another team.

If we are going to add the full URL then we do risk exploding the cardinality, since the URL might contain a dynamic part, such as an ID.

IvanMerrill Jun 14, 2023

If we are going to add the full URL then we do risk exploding the cardinality, since the URL might contain a dynamic part, such as an ID.

Agreed - I don't think that this is a good idea. This is where logging / tracing begin to shine, but at a higher cost.

I think in that case we should just use groups, instead of introducing another concept named external-service.

Yeah that's fair.

emschwartz Jun 14, 2023
Author

I think in that case we should just use groups, instead of introducing another concept named external-service.

I'm not so sure. The advantage of introducing a specific concept is that we can build queries, visualizations, or debugging aids that are based on us knowing what that means. If it's just a generic group, we're limited to showing you "here's the stats for the whole group".

An example of what I have in mind in this case would be some kind of feature where you specify that some set of functions call the GitHub API. Now, if you've labeled that in such a way that we (the developers of Autometrics and related software) understand that those are dependent on the GitHub API, we could pull some information into the UI to show you, for example, if GitHub is having issues right now. We could tighten the debugging loop by identifying for you that the reason your functions are erroring is actually because this external service is down.

I'm not sure that this is something we should build ever or now, but it's the type of benefit that would come from a more narrowly tailored feature rather than a generic one.

P2P-Nathan Jun 15, 2023
Collaborator

If we are going to add the full URL then we do risk exploding the cardinality, since the URL might contain a dynamic part, such as an ID.

If we do end up wanting paths, or something in that direction, oTel has the attribute http.route with the parameter name instead of the value, to prevent the cardinality sprawl

/users/:userID?
{controller}/{action}/{id?}
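To show why a route template keeps cardinality bounded, here is a small heuristic sketch that collapses ID-like path segments into a placeholder. In practice the web framework's router usually already knows the matched template, which is what http.route is meant to carry, so the function name and the guessing patterns below are purely illustrative assumptions.

```python
import re

# Heuristic patterns for path segments that look like dynamic identifiers.
_NUMERIC = re.compile(r"\d+")
_UUID = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def route_template(path):
    """Replace ID-like path segments with ':id' and drop the query string."""
    path = path.split("?", 1)[0]
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.fullmatch(segment) or _UUID.fullmatch(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)
```

Every request for a different user then maps to the same label value, so the number of time series depends on the number of routes rather than the number of users.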

emschwartz · 2023-06-02T16:07:15Z

emschwartz
Jun 2, 2023
Author

What should the relationship between groups and SLO Objectives be?

It would make sense if you could create some groups, attach those to functions (similar to how you do that with SLOs now), and then later attach an objective to a group.

One question is whether we care about retroactively including groups in SLOs once you've added them.

If we don't, it's pretty easy to imagine how attaching an SLO to a group would (from then on) attach the additional objective-related labels.
If we do care about including the groups retroactively, I'm not entirely sure how we'd make that work. Maybe we'd need something like a separate info metric that just says which groups are included in the SLO?...

3 replies

P2P-Nathan Jun 4, 2023
Collaborator

I can imagine creating a group and applying the SLO at the group level being highly valuable. The first scenario I imagined was something along the lines of instrumenting a key functionality in an application. Using a group like BILLING_CRITICAL you could easily tag all of the functions in the chain and later decide if they needed to be 99% or 99.9% successful.

One item that did become apparent when thinking through that scenario is that SLOs on duration become more difficult or convoluted at a group level.

One question is whether we care about retroactively including groups in SLOs once you've added them.

I don't think SLOs would need to be retroactively applied, as long as it would still be possible to see the performance trends across the whole time span.

IvanMerrill Jun 6, 2023

I'm in agreement with @P2P-Nathan - definitely want to be able to add groups to an SLO, but at that point you're changing the measurement so I think it makes the most sense to draw a line in the sand and say 'we've now changed the objective and this applies going forward'.

Applying it retrospectively, so changing how you're measuring your past performance based on adding in the group now, doesn't seem right to me. You can always see how adding the group would have impacted performance by extending the new query over a longer period of time, but the change should only impact future measurements.

I think the main reason I am thinking this is because I am unsure what someone would do if they did backdate the updated SLO and found that they breached the SLO at some point historically. Any update to the SLO should be made with an understanding of the impact (i.e not adding it without knowing that it's a reasonable objective) and this would be part of the decision as to whether or not to add the group. The actual measurement against the SLO however should only be done from the point of applying the change onwards.

emschwartz Jun 6, 2023
Author

Excellent! Then this feature should be pretty easy to add 😁

emschwartz · 2023-06-14T14:48:33Z

emschwartz
Jun 14, 2023
Author

@IvanMerrill said:

I think the level of grouping we're missing is the app or service grouping.

I find the idea of a service label compelling.

Right now, you can use Prometheus relabeling rules to add such a label. However, we can only use such a label in the queries we generate if we standardize the name of the label.

One argument that seems pretty strong for having a service label is that right now, if you have multiple services using autometricized shared code and dumping metrics into the same Prometheus instance, the queries we generate for you will actually be incorrect, because they will merge the function metrics from multiple services together. The queries we create should include sum by (service, module, function) ... in order to properly differentiate metrics from different services.
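The merging problem can be reproduced with a toy aggregation. The sketch below mimics what sum by (...) does to a set of labelled samples; the sample values, service names, and module name are made up for illustration.

```python
from collections import defaultdict


def sum_by(samples, labels):
    """Mimic PromQL's `sum by (...)`: add up values sharing the given labels."""
    totals = defaultdict(float)
    for sample_labels, value in samples:
        key = tuple(sample_labels.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


# Hypothetical request rates for the same shared function in two services.
samples = [
    ({"service": "api", "module": "shared::db", "function": "query"}, 10.0),
    ({"service": "worker", "module": "shared::db", "function": "query"}, 90.0),
]

# Without `service`, both series collapse into a single combined number.
merged = sum_by(samples, ["module", "function"])

# With `service`, each service keeps its own series.
separate = sum_by(samples, ["service", "module", "function"])
```

Here merged holds one combined series for shared::db query, while separate keeps the api and worker numbers apart.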

Now, if we were going to add a service label, I'd actually do it in a different way than I had previously imagined for a group. I think you'd want to set the service label in the autometrics initialization function that you'd run once for the whole binary, rather than adding it to every invocation of the autometrics macro/decorator/wrapper.

6 replies

emschwartz Jun 14, 2023
Author

Mm good question. I think we could do it either way, but I think there's a slightly stronger case for adding it to every metric.

If we add it to the build_info, we'd just need to make sure that all the queries are always doing the magic to merge that info in. We're kind of doing that anyway because the version is such useful information, but this is something you might forget about or not do if you're writing queries by hand.

If we add it to every metric, it would always be there -- and wouldn't add additional cardinality.

mies Jun 14, 2023
Maintainer

I like the idea of a service label as well and I can see a lot of value in being able to view the metrics and SLOs on a per-service basis.

Regardless of implementation, isn't a service a logical grouping of functions though? 🤔

The service label could then feed into @akesling's point on owners/maintainers and to whom an alert should be routed (could indeed be a mix).

emschwartz Jun 15, 2023
Author

I created an issue to track this on the roadmap #27

Regardless of implementation, isn't a service a logical grouping of functions though? 🤔

Yes, you're right that it is, but the implementation makes a big difference for the DX and how it can be used.

An important distinction is that we're saying the service is a group that includes all autometricized functions in a given binary. With this way of implementing it, you cannot define a service as arbitrary smaller groups of functions within a single binary.

P2P-Nathan Jun 15, 2023
Collaborator

I find the idea of a service label compelling.

This is pretty big on my list as well. I've been pushing to get more oTel semantic conventions for resources into our product too. If we could match the oTel Service spec that would be pretty cool.

emschwartz Jun 21, 2023
Author

Once I started implementing this, I started wondering whether we should just add the service.name label to the build_info metric, as opposed to all of them. It feels a bit tedious, at least in Rust, to add it all over the place. However, once I started typing up this comment, I was sufficiently convinced that we should add the label everywhere so I'm putting this here for posterity.

The arguments I see for adding it to every metric are:

It is technically part of the unique identifier for some metrics, along with the module and function name
If a user starts hand-writing queries, they may not add the part to join the core metrics with the build_info and their queries might be wrong (if they have multiple instances of the code running as different services). It doesn't seem great to have the default be kind of incorrect
It's not actually part of the build information because it may be set at runtime (although maybe you could make the case that the build info is kind of part of the service?)

The arguments for adding it only to the build_info metric:

Only need to add it to one place, so it's easier to implement
It's already useful to join the build_info metric with the function-related metrics to attach the version and commit labels
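For a sense of what the join involves, the sketch below imitates in plain Python what a PromQL expression such as `function_metric * on (instance, job) group_left(version, commit) build_info` does. The metric name in that expression, the join labels, and the sample data are assumptions made for illustration. If the service label lived only on build_info, every query that needs it would have to include a join of this kind.

```python
def join_build_info(function_series, build_info,
                    on=("instance", "job"), copy=("version", "commit")):
    """Copy selected labels from build_info onto series sharing the `on` labels.

    Since the info metric has the value 1, multiplying by it leaves the
    original sample value unchanged, so only the labels are merged here.
    """
    info_by_key = {}
    for labels in build_info:
        info_by_key[tuple(labels.get(name) for name in on)] = labels

    joined = []
    for labels, value in function_series:
        info = info_by_key.get(tuple(labels.get(name) for name in on))
        if info is None:
            continue  # as in PromQL, series without a match drop out
        merged = dict(labels)
        for name in copy:
            if name in info:
                merged[name] = info[name]
        joined.append((merged, value))
    return joined
```

The silent drop of unmatched series is part of why a forgotten or mistyped join in a hand-written query can be hard to notice.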

emschwartz · 2023-06-15T08:36:28Z

emschwartz
Jun 15, 2023
Author

This is a very fruitful discussion so far, though it seems like many of the potential use cases we've discussed would be best left out of scope for a groups feature specifically.

I'm wondering if we should go back to the original use case @akesling suggested for groups:

define groups of functions that might eventually be rolled into an SLO, but without defining an SLO to start
allow people to visualize the performance of groups taken together like we do currently with SLOs

If we think of groups as pre-SLOs and leave other use cases out of scope, this has some implications for how we might implement it. One big one I can think of is that we may not need to support functions being members of multiple groups. Each function can belong to one group, and then you can add a whole group to an SLO when you create them. This would be a bit simpler to implement, as we wouldn't need the tricks I described here.

One reason for this limitation is that we don't currently support having one function being part of multiple SLOs. It would pose a problem if you could add a function to multiple groups, and then add the groups to different SLOs, which would mean we'd need to decide which SLO the function is part of. We could theoretically make it possible to add functions to multiple SLOs but this would definitely complicate things (particularly because the SLO is defined using not one but multiple labels for the name, percentile, and latency).
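A minimal sketch of the one-group-per-function rule might look like the following. The class, its method names, and the use of objective_name as the label key are assumptions for illustration, not an existing autometrics API.

```python
class GroupRegistry:
    """Track which single group each function belongs to, and each group's SLO."""

    def __init__(self):
        self._group_of = {}  # function name -> group name
        self._slo_of = {}    # group name -> SLO name

    def add_function(self, function, group):
        existing = self._group_of.get(function)
        if existing is not None and existing != group:
            raise ValueError(f"{function} already belongs to group {existing}")
        self._group_of[function] = group

    def attach_slo(self, group, slo):
        existing = self._slo_of.get(group)
        if existing is not None and existing != slo:
            raise ValueError(f"group {group} is already part of SLO {existing}")
        self._slo_of[group] = slo

    def labels_for(self, function):
        """Labels to emit for a function; the SLO label appears once attached."""
        labels = {}
        group = self._group_of.get(function)
        if group is not None:
            labels["group"] = group
            slo = self._slo_of.get(group)
            if slo is not None:
                labels["objective_name"] = slo
        return labels
```

Since a function maps to at most one group and a group to at most one SLO, the question of which SLO a function belongs to always has a single answer.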

What do folks think about that way of looking at it?

I wonder if there's some more specific name than "groups" that would make the scope of such a feature clearer.

1 reply

P2P-Nathan Jun 15, 2023
Collaborator

Each function can belong to one group, and then you can add a whole group to an SLO when you create them. This would be a bit simpler to implement, as we wouldn't need the tricks I described here.

I like this approach, it adds the functionality while keeping complexity on hold until we have a better understanding of how many people might want/need/use multiple SLOs on a single item. My personal preference is to keep things on the supportable side and drag my feet a little until there is clear user desire for a feature.
