Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing and log correlation without otel #43718

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

lmolkova
Copy link
Member

@lmolkova lmolkova commented Jan 7, 2025

The next PR in the series of client core tracing+logging improvements.

In this episode, we:

  • Implement fallback tracing which:
    • generates trace context - it will be used to correlate logs between each other
    • records spans as logs (if users enable corresponding logger)
  • Introduce InstrumentationContext - strongly typed container for trace context that's used to correlate logs with or without otel
  • Merge HttpInstrumentationPolicy with HttpLoggingPolicy. This is not strictly necessary, but allows to
    • merge similar configuration like URL sanitization and allowed headers (not yet done in this PR)
    • optimize code (e.g. do sanitization once)
    • keep things consistent (e.g. span attributes and logging keys)
    • it will be even more useful when we add metrics - we'd reuse the same duration and other things that we log today
  • Improves logging in HttpRetryPolicy and HttpRedirectPolicy
    • makes it similar to HTTP requests
    • consistent attributes following OTel semantic conventions (where possible)
    • enables correlation
  • Moves logging into instrumentation package

In next episodes:

  • Relationship between HttpLogOptions and InstrumetnationOptions?
  • HTTP spans: time-to-last-byte
  • Metrics

@lmolkova lmolkova changed the title Log based tracing without otel Tracing and log correlation without otel Jan 7, 2025
@lmolkova
Copy link
Member Author

lmolkova commented Jan 7, 2025

Fallback tracing benchmarks

FallbackTracingBenchmarks.fallbackTracerDisabled         avgt   15     0.499 ±  0.005  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledNoLogs    avgt   15    47.193 ±  2.478  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledWithLogs  avgt   15  1349.738 ± 46.289  ns/op

So with fallback tracing (no logs) enabled by default, we're going to create ~2 spans for client SDK code in common case (1 public API-surface and 1 HTTP), adding ~90ns to each client call along with some allocations which seems reasonable.

This time is essentially spent on generating random Ids and we'd probably get a comparable impact from x-ms-client-id thing.

@azure-sdk
Copy link
Collaborator

azure-sdk commented Jan 7, 2025

API change check

API changes are not detected in this pull request.

@lmolkova
Copy link
Member Author

lmolkova commented Jan 7, 2025

Champion scenario:

  var httpPipeline = new HttpPipelineBuilder()
          .policies(new HttpRetryPolicy(), new HttpRedirectPolicy(), new HttpInstrumentationPolicy(null, null))
          .build();

  // this would effectively be done by codegen in client libs (when they create a span)
  // but users can also provide context explicitly and we'll use it.
  // we also support creating this context from OTel spans
  RequestOptions requestOptions = new RequestOptions().setInstrumentationContext(createRandomContext());

  HttpRequest request = new HttpRequest(HttpMethod.GET, "http://www.microsoft.com")
          .setRequestOptions(requestOptions);

  httpPipeline.send(request);

logs with simulated error (formatted, verbose)

2025-01-06 19:04:44.865 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:44.866 [main] [DEBUG] core.tracing - 
{
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "span.name": "GET",
  "span.duration": 5.2217,
  "span.id": "f913333fbc212845",
  "server.address": "www.microsoft.com",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "error.type": "java.net.UnknownHostException",
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "72d954337db64062",
  "url.full": "http://www.microsoft.com",
  "event.name": "span.ended"
}

2025-01-06 19:04:44.866 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 7.4057,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "exception.stacktrace": "java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat [email protected]/io.clientcore.core.http.pipeline.HttpPipelineNextPolicy...",
  "event.name": "http.response"
}

2025-01-06 19:04:44.868 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpRetryPolicy - 
{
  "retry.was_last_attempt": false,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "retry.delay": 824,
  "exception.type": "java.io.UncheckedIOException",
  "exception.message": "java.net.UnknownHostException",
  "retry.max_attempt_count": 3,
  "span.id": "72d954337db64062",
  "http.request.resend_count": 0,
  "exception.stacktrace": "java.io.UncheckedIOException: java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat ....",
  "event.name": "http.retry"
}

2025-01-06 19:04:45.694 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:45.803 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - {
  "http.request.time_to_response": 108.7054,
  "http.response.body.size": 201253,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 108.7196,
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

2025-01-06 19:04:45.803 [main] [DEBUG] core.tracing - {
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "server.address": "www.microsoft.com",
  "span.name": "GET",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "span.duration": 108.9267,
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "f913333fbc212845",
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "span.ended"
}

or with simulated error and redirect (formatted, info)

2025-01-06 19:21:08.735 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 6.5101,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "58fea390ed94a20b",
  "http.request.resend_count": 0,
  "url.full": "http://microsoft.com",
  "event.name": "http.response"
}

2025-01-06 19:21:09.648 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 121.9558,
  "http.response.body.size": 0,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 121.9866,
  "http.request.body.size": 0,
  "span.id": "0ef927d10d392086",
  "http.request.resend_count": 1,
  "url.full": "http://microsoft.com",
  "http.response.status_code": 301,
  "event.name": "http.response"
}

2025-01-06 19:21:09.885 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 236.4778,
  "http.response.body.size": 201253,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 236.4934,
  "http.request.body.size": 0,
  "span.id": "3016d2e453fc9874",
  "http.request.resend_count": 1,
  "url.full": "https://www.microsoft.com/",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

So new RequestOptions().setInstrumentationContext(createRandomContext()); brings correlation to all HTTP logs (including retries and redirects) in this pipeline.
It works with and without otel.

We can easily extend it to support implicit context propagation (which already works with otel).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to HttpInstrumentationLoggingTests

@@ -314,27 +311,6 @@ public void tracingIsDisabledOnInstance() throws IOException {
assertEquals(0, exporter.getFinishedSpanItems().size());
}

@Test
public void tracingIsDisabledOnRequest() throws IOException {
Copy link
Member Author

@lmolkova lmolkova Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed disabling individual requests - we didn't advertise this feature in azure-core and I don't recall anyone asking for it.
We also don't support it in other languages.

private static final int TRACE_ID_HEX_LENGTH = 32;
private static final int SPAN_ID_HEX_LENGTH = 16;
private static final char[] ENCODING = buildEncodingArray();
private static final String ALPHABET = "0123456789abcdef";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used in the utility method buildEncodingArray could this be turned into a local variable there?

@@ -184,6 +184,7 @@ public static void runtimeError(ClientLogger logger, Throwable t) {
* @return true if OTel is initialized successfully, false otherwise
*/
public static boolean isInitialized() {
return INSTANCE.initialized;
return INSTANCE == null || INSTANCE.initialized;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does isInitialized return true if the instance is null?

boolean isAllZero = true;
for (int i = 3; i < 35; i++) {
char c = traceparent.charAt(i);
if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this character range a specification somewhere that could be included here? I know many times both a-f and A-F are acceptible.

Comment on lines +60 to +64
for (int i = 0; i < 2; i++) {
if (traceparent.charAt(i) != '0') {
return false;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just unroll this loop

if (traceparent.charAt(0) != '0' || traceparent.charAt(1) != '0') {
    return false;
}

with that this can then be combined with the next if

if (traceparent.charAt(0) != '0' || traceparent.charAt(1) != '0' || traceparent.charAt(2) != '-') {
    return false;
}

Comment on lines +85 to +87
if (traceparent.charAt(35) != '-') {
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on making the checks non-sequential for better performance?

This and the charAt(52) checks could be done eagerly to prevent needing to track more information earlier.

Also, somewhere in this class mind adding a comment that outlines the definition of a valid traceparent?

Comment on lines +62 to +68
private FallbackSpanContext() {
this.traceId = INVALID_TRACE_ID;
this.spanId = INVALID_SPAN_ID;
this.traceFlags = "00";
this.isValid = false;
this.span = Span.noop();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this constructor if its only usage is to create the static INVALID? Could that just call the constructor below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants