Tracing and log correlation without otel #43718

lmolkova · 2025-01-07T01:52:29Z

The next PR in the series of client core tracing+logging improvements.

In this episode, we:

Implement fallback tracing which:
- generates trace context - it will be used to correlate logs between each other
- records spans as logs (if users enable corresponding logger)
Introduce InstrumentationContext - strongly typed container for trace context that's used to correlate logs with or without otel
Merge HttpInstrumentationPolicy with HttpLoggingPolicy. This is not strictly necessary, but allows to
- merge similar configuration like URL sanitization and allowed headers (not yet done in this PR)
- optimize code (e.g. do sanitization once)
- keep things consistent (e.g. span attributes and logging keys)
- it will be even more useful when we add metrics - we'd reuse the same duration and other things that we log today
Improves logging in HttpRetryPolicy and HttpRedirectPolicy
- makes it similar to HTTP requests
- consistent attributes following OTel semantic conventions (where possible)
- enables correlation
Moves logging into instrumentation package

In next episodes:

Relationship between HttpLogOptions and InstrumetnationOptions?
HTTP spans: time-to-last-byte
Metrics

lmolkova · 2025-01-07T02:41:50Z

Fallback tracing benchmarks

FallbackTracingBenchmarks.fallbackTracerDisabled         avgt   15     0.499 ±  0.005  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledNoLogs    avgt   15    47.193 ±  2.478  ns/op
FallbackTracingBenchmarks.fallbackTracerEnabledWithLogs  avgt   15  1349.738 ± 46.289  ns/op

So with fallback tracing (no logs) enabled by default, we're going to create ~2 spans for client SDK code in common case (1 public API-surface and 1 HTTP), adding ~90ns to each client call along with some allocations which seems reasonable.

This time is essentially spent on generating random Ids and we'd probably get a comparable impact from x-ms-client-id thing.

azure-sdk · 2025-01-07T02:43:44Z

API change check

API changes are not detected in this pull request.

lmolkova · 2025-01-07T03:18:43Z

Champion scenario:

  var httpPipeline = new HttpPipelineBuilder()
          .policies(new HttpRetryPolicy(), new HttpRedirectPolicy(), new HttpInstrumentationPolicy(null, null))
          .build();

  // this would effectively be done by codegen in client libs (when they create a span)
  // but users can also provide context explicitly and we'll use it.
  // we also support creating this context from OTel spans
  RequestOptions requestOptions = new RequestOptions().setInstrumentationContext(createRandomContext());

  HttpRequest request = new HttpRequest(HttpMethod.GET, "http://www.microsoft.com")
          .setRequestOptions(requestOptions);

  httpPipeline.send(request);

logs with simulated error (formatted, verbose)

2025-01-06 19:04:44.865 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:44.866 [main] [DEBUG] core.tracing - 
{
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "span.name": "GET",
  "span.duration": 5.2217,
  "span.id": "f913333fbc212845",
  "server.address": "www.microsoft.com",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "error.type": "java.net.UnknownHostException",
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "72d954337db64062",
  "url.full": "http://www.microsoft.com",
  "event.name": "span.ended"
}

2025-01-06 19:04:44.866 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 7.4057,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "f913333fbc212845",
  "http.request.resend_count": 0,
  "url.full": "http://www.microsoft.com",
  "exception.stacktrace": "java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat [email protected]/io.clientcore.core.http.pipeline.HttpPipelineNextPolicy...",
  "event.name": "http.response"
}

2025-01-06 19:04:44.868 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpRetryPolicy - 
{
  "retry.was_last_attempt": false,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "retry.delay": 824,
  "exception.type": "java.io.UncheckedIOException",
  "exception.message": "java.net.UnknownHostException",
  "retry.max_attempt_count": 3,
  "span.id": "72d954337db64062",
  "http.request.resend_count": 0,
  "exception.stacktrace": "java.io.UncheckedIOException: java.net.UnknownHostException\r\n\tat com.test/com.test.App.lambda$main$0(App.java:32)\r\n\tat ....",
  "event.name": "http.retry"
}

2025-01-06 19:04:45.694 [main] [DEBUG] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "event.name": "http.request"
}

2025-01-06 19:04:45.803 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - {
  "http.request.time_to_response": 108.7054,
  "http.response.body.size": 201253,
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "http.request.method": "GET",
  "http.request.duration": 108.7196,
  "http.request.body.size": 0,
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

2025-01-06 19:04:45.803 [main] [DEBUG] core.tracing - {
  "library.name": "core",
  "library.version": "1.0.0-beta.2",
  "server.address": "www.microsoft.com",
  "span.name": "GET",
  "trace.id": "a4d931fb2eb04346b9589424370887ad",
  "span.duration": 108.9267,
  "http.request.method": "GET",
  "span.kind": "CLIENT",
  "server.port": 80,
  "span.parent.id": "f913333fbc212845",
  "span.id": "7f622788d28f645f",
  "http.request.resend_count": 1,
  "url.full": "http://www.microsoft.com",
  "http.response.status_code": 200,
  "event.name": "span.ended"
}

or with simulated error and redirect (formatted, info)

2025-01-06 19:21:08.735 [main] [WARN] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 6.5101,
  "exception.type": "java.net.UnknownHostException",
  "exception.message": null,
  "http.request.body.size": 0,
  "span.id": "58fea390ed94a20b",
  "http.request.resend_count": 0,
  "url.full": "http://microsoft.com",
  "event.name": "http.response"
}

2025-01-06 19:21:09.648 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 121.9558,
  "http.response.body.size": 0,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 121.9866,
  "http.request.body.size": 0,
  "span.id": "0ef927d10d392086",
  "http.request.resend_count": 1,
  "url.full": "http://microsoft.com",
  "http.response.status_code": 301,
  "event.name": "http.response"
}

2025-01-06 19:21:09.885 [main] [INFO] io.clientcore.core.http.pipeline.HttpInstrumentationPolicy - 
{
  "http.request.time_to_response": 236.4778,
  "http.response.body.size": 201253,
  "trace.id": "dfc10b5460554c87ba9dc6c5a1fdda1c",
  "http.request.method": "GET",
  "http.request.duration": 236.4934,
  "http.request.body.size": 0,
  "span.id": "3016d2e453fc9874",
  "http.request.resend_count": 1,
  "url.full": "https://www.microsoft.com/",
  "http.response.status_code": 200,
  "event.name": "http.response"
}

So new RequestOptions().setInstrumentationContext(createRandomContext()); brings correlation to all HTTP logs (including retries and redirects) in this pipeline.
It works with and without otel.

We can easily extend it to support implicit context propagation (which already works with otel).

lmolkova · 2025-01-07T04:00:58Z

sdk/clientcore/core/src/test/java/io/clientcore/core/util/HttpLoggingPolicyTests.java

moved to HttpInstrumentationLoggingTests

lmolkova · 2025-01-07T04:01:53Z

...ncy-tests/src/test/java/io/clientcore/core/http/pipeline/HttpInstrumentationPolicyTests.java

@@ -314,27 +311,6 @@ public void tracingIsDisabledOnInstance() throws IOException {
        assertEquals(0, exporter.getFinishedSpanItems().size());
    }

-    @Test
-    public void tracingIsDisabledOnRequest() throws IOException {


removed disabling individual requests - we didn't advertise this feature in azure-core and I don't recall anyone asking for it.
We also don't support it in other languages.

alzimmermsft · 2025-01-07T18:24:42Z

.../src/main/java/io/clientcore/core/implementation/instrumentation/fallback/RandomIdUtils.java

+    private static final int TRACE_ID_HEX_LENGTH = 32;
+    private static final int SPAN_ID_HEX_LENGTH = 16;
+    private static final char[] ENCODING = buildEncodingArray();
+    private static final String ALPHABET = "0123456789abcdef";


Since this is only used in the utility method buildEncodingArray could this be turned into a local variable there?

alzimmermsft · 2025-01-07T19:12:38Z

...re/src/main/java/io/clientcore/core/implementation/instrumentation/otel/OTelInitializer.java

@@ -184,6 +184,7 @@ public static void runtimeError(ClientLogger logger, Throwable t) {
     * @return true if OTel is initialized successfully, false otherwise
     */
    public static boolean isInitialized() {
-        return INSTANCE.initialized;
+        return INSTANCE == null || INSTANCE.initialized;


Why does isInitialized return true if the instance is null?

alzimmermsft · 2025-01-07T19:15:24Z

...va/io/clientcore/core/implementation/instrumentation/fallback/FallbackContextPropagator.java

+        boolean isAllZero = true;
+        for (int i = 3; i < 35; i++) {
+            char c = traceparent.charAt(i);
+            if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) {


Is this character range a specification somewhere that could be included here? I know many times both a-f and A-F are acceptible.

alzimmermsft · 2025-01-07T19:16:26Z

...va/io/clientcore/core/implementation/instrumentation/fallback/FallbackContextPropagator.java

+        for (int i = 0; i < 2; i++) {
+            if (traceparent.charAt(i) != '0') {
+                return false;
+            }
+        }


Let's just unroll this loop

if (traceparent.charAt(0) != '0' || traceparent.charAt(1) != '0') { return false; }

with that this can then be combined with the next if

if (traceparent.charAt(0) != '0' || traceparent.charAt(1) != '0' || traceparent.charAt(2) != '-') { return false; }

alzimmermsft · 2025-01-07T19:19:07Z

...va/io/clientcore/core/implementation/instrumentation/fallback/FallbackContextPropagator.java

+        if (traceparent.charAt(35) != '-') {
+            return false;
+        }


Thoughts on making the checks non-sequential for better performance?

This and the charAt(52) checks could be done eagerly to prevent needing to track more information earlier.

Also, somewhere in this class mind adding a comment that outlines the definition of a valid traceparent?

alzimmermsft · 2025-01-07T19:22:36Z

...ain/java/io/clientcore/core/implementation/instrumentation/fallback/FallbackSpanContext.java

+    private FallbackSpanContext() {
+        this.traceId = INVALID_TRACE_ID;
+        this.spanId = INVALID_SPAN_ID;
+        this.traceFlags = "00";
+        this.isValid = false;
+        this.span = Span.noop();
+    }


Do we need this constructor if its only usage is to create the static INVALID? Could that just call the constructor below.

lmolkova added 7 commits January 6, 2025 16:45

log correlation and log-based tracing

dccdca2

a lot of changes

d8297fa

draft complete

4cd72bb

almost ready

ff8cdca

cleanup attribute and event names

5f4a7c0

Retry policy logging clean up

98cff1f

redirect policy logging cleanup

8178d76

github-actions bot added the clientcore label Jan 7, 2025

lmolkova changed the title ~~Log based tracing without otel~~ Tracing and log correlation without otel Jan 7, 2025

clean up

3960ec6

clean up

4c6f7a4

lmolkova commented Jan 7, 2025

View reviewed changes

update snippets

bcc6c9c

alzimmermsft reviewed Jan 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracing and log correlation without otel #43718

Tracing and log correlation without otel #43718

lmolkova commented Jan 7, 2025 •

edited

Loading

lmolkova commented Jan 7, 2025 •

edited

Loading

azure-sdk commented Jan 7, 2025 •

edited

Loading

lmolkova commented Jan 7, 2025 •

edited

Loading

lmolkova Jan 7, 2025

lmolkova Jan 7, 2025 •

edited

Loading

alzimmermsft Jan 7, 2025

alzimmermsft Jan 7, 2025

alzimmermsft Jan 7, 2025

alzimmermsft Jan 7, 2025

alzimmermsft Jan 7, 2025

alzimmermsft Jan 7, 2025

Tracing and log correlation without otel #43718

Are you sure you want to change the base?

Tracing and log correlation without otel #43718

Conversation

lmolkova commented Jan 7, 2025 • edited Loading

lmolkova commented Jan 7, 2025 • edited Loading

azure-sdk commented Jan 7, 2025 • edited Loading

lmolkova commented Jan 7, 2025 • edited Loading

lmolkova Jan 7, 2025

Choose a reason for hiding this comment

lmolkova Jan 7, 2025 • edited Loading

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

alzimmermsft Jan 7, 2025

Choose a reason for hiding this comment

lmolkova commented Jan 7, 2025 •

edited

Loading

lmolkova commented Jan 7, 2025 •

edited

Loading

azure-sdk commented Jan 7, 2025 •

edited

Loading

lmolkova commented Jan 7, 2025 •

edited

Loading

lmolkova Jan 7, 2025 •

edited

Loading