feat: ops[tracing] [draftey draft] #1527

dimaqq · 2025-01-15T08:49:10Z

[tbd]

Files

Functionality

dimaqq · 2025-01-15T08:51:04Z

Notes:

adds opentelemetry-api (64kB wheel) to dependencies
adds a bunch of opentelemetry-this-and-that (total size tbd) to ops[tracing] dependency group

dimaqq · 2025-01-17T08:51:56Z

No idea why RTD build fails...

HACKING.md

dimaqq · 2025-01-27T00:59:33Z

squashed and rebased: #1539 got merged; #1538 a no-go.

benhoyt

Commented with some questions. Let's discuss further by voice on Monday.

benhoyt · 2025-01-31T02:09:27Z

ops/_private/yaml.py

 # Use C speedups if available
 _safe_loader = getattr(yaml, 'CSafeLoader', yaml.SafeLoader)
 _safe_dumper = getattr(yaml, 'CSafeDumper', yaml.SafeDumper)


+@tracer.start_as_current_span('ops.yaml.safe_load')  # type: ignore


Why do we want to trace internal calls like this?

I'm not opposed to dropping this.

The reason I've instrumented the YAML operations is that we've had a performance issue before (missing C speedups).

Suppose someone runs into that again and we only provide coarse-grained tracing. They would see:

ops.RelationData.update() 1500ms (control handed over from charm to ops)

hook tool relation-set 50ms

What can they go on in this case?

Let's decide on this at a stand-up or during review.

benhoyt · 2025-01-31T02:14:46Z

ops/_tracing/export.py

+
+def setup_tracing(charm_class_name: str) -> None:
+    global _exporter
+    # FIXME would it be better to pass Juju context explicitly?


Yes, please. :-)

Let's check with @tonyandrewmeyer

If I were to pass juju context to this function, I should probably also pass it to _Manager too...

And if so, maybe Scenario's Ops would also have to be changed?

I'm not sure about the trade-offs off the top of my head.

benhoyt · 2025-01-31T02:15:37Z

ops/_tracing/buffer.py

+                stored: int | None = conn.execute(
+                    """
+                    SELECT sum((length(data)+4095)/4096*4096)
+                    FROM tracing


Isn't this going to mean a full table scan? Is that a problem?

Yes, but it's surprisingly fast.

I'm not sure I understand exactly why, but somehow, sum(length(blob_column)) is faster than sum(int_column).

One point to note is that the BatchSpanProcessor we use buffers a bit in memory. By the time export is called and we buffer the data, the chunk size is on the order of 2KB to 4KB for trivial charms and ~40KB for complex, instrumented charms. This means that blob storage takes up most of the database, and the full scan only touches the smaller part, the rows.

I've just tested that the worst case for this query is <50ms in my VM, at 40MB db buffer size.

Do you reckon that's acceptable?

P.S.
SQLite is an MVCC database, which makes an exact sum or even an exact row count a costly operation.
I'm not sure if cold start can be effectively solved, even with stat1...
However, if really needed, we could cache the db size in memory, maybe? Run this query once on startup, then maintain the in-memory estimate.
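
The caching idea above could look roughly like the following. This is a minimal sketch, not the PR's code: the `SizeEstimate` class and its method names are hypothetical, and the `tracing` table layout (`id`, `data`) is assumed from the quoted queries.

```python
import sqlite3

PAGE = 4096  # round each blob up to whole 4 KiB units, as in the quoted query


def rounded(n: int) -> int:
    """Size of an n-byte blob rounded up to a multiple of PAGE."""
    return (n + PAGE - 1) // PAGE * PAGE


class SizeEstimate:
    """Hypothetical helper: scan once on startup, then track changes in memory."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        row = conn.execute(
            "SELECT sum((length(data)+4095)/4096*4096) FROM tracing"
        ).fetchone()
        self.stored = row[0] or 0  # sum() is NULL on an empty table

    def add(self, data: bytes) -> int:
        cur = self.conn.execute("INSERT INTO tracing (data) VALUES (?)", (data,))
        self.stored += rounded(len(data))
        return cur.lastrowid

    def remove(self, row_id: int) -> None:
        row = self.conn.execute(
            "SELECT length(data) FROM tracing WHERE id = ?", (row_id,)
        ).fetchone()
        if row is None:
            return
        self.conn.execute("DELETE FROM tracing WHERE id = ?", (row_id,))
        self.stored -= rounded(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)")
estimate = SizeEstimate(conn)
first = estimate.add(b"x" * 100)
estimate.add(b"y" * 5000)
estimate.remove(first)
```

An in-memory estimate like this would drift if another process wrote to the same database file, so it only fits if there is a single writer.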

benhoyt · 2025-01-31T02:17:41Z

ops/_tracing/buffer.py

+                    conn.execute(
+                        f"""
+                        DELETE FROM tracing
+                        WHERE id IN ({','.join(('?',) * len(collected_ids))})


I think it'd avoid allocating/copying twice to say '?,' * (len(collected_ids)-1) + '?'
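
For illustration, both ways of building the placeholder list produce the same string. This is a minimal sketch, assuming a `tracing` table with an `id` column as in the quoted query; the `placeholders` helper is hypothetical. One caveat: the multiplication form needs a guard for an empty ID list, because `'?,' * -1 + '?'` evaluates to `'?'`.

```python
import sqlite3


def placeholders(n: int) -> str:
    """Return n comma-separated '?' markers (hypothetical helper)."""
    if n < 1:
        raise ValueError("need at least one id")
    return "?," * (n - 1) + "?"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany(
    "INSERT INTO tracing (id, data) VALUES (?, ?)",
    [(i, b"span") for i in range(1, 6)],
)

collected_ids = [1, 3, 5]
conn.execute(
    f"DELETE FROM tracing WHERE id IN ({placeholders(len(collected_ids))})",
    collected_ids,
)
remaining = [row[0] for row in conn.execute("SELECT id FROM tracing ORDER BY id")]
```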

benhoyt · 2025-01-31T02:21:02Z

ops/_tracing/buffer.py

+# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific language
+# governing permissions and limitations under the License.
+"""Buffer for tracing data."""


I just want to confirm: there's no existing buffer handling in the opentelemetry libraries, correct?

benhoyt · 2025-01-31T02:25:51Z

ops/model.py

@@ -122,6 +125,7 @@ class Model:
    as ``self.model`` from any class that derives from :class:`Object`.
    """

+    @tracer.start_as_current_span('ops.Model')  # type: ignore


I thought our approach was going to be "trace external things like hook tools and Pebble calls", not most methods. In this file we seem to be taking more of an "every model method" approach. Is that required to get more meaningful traces? What would it look like with just hook tools + Pebble traced?

benhoyt · 2025-01-31T02:27:01Z

ops/pebble.py

@@ -2068,23 +2074,27 @@ def _request_raw(

        return response

+    @tracer.start_as_current_span('ops.pebble.Client.get_system_info')  # type: ignore


Similar here: could we not trace the lower-level Pebble request to do it only in one place, or is that too hard to make meaningful?

dimaqq · 2025-01-31T03:06:26Z

I need to handle more export errors:

ERROR:opentelemetry.sdk.trace.export:Exception while exporting Span batch.
Traceback (most recent call last):
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
        (self._dns_host, self.port),
    ...<2 lines>...
        socket_options=self.socket_options,
    )
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
    ~~~~~~~~~~~~^^^^
ConnectionRefusedError: [Errno 111] Connection refused
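
One possible shape for that handling, as a rough sketch only: catch connection-level errors around the send and report failure, so the data stays buffered. The `try_export` function and the `send` callable are hypothetical stand-ins, not the OpenTelemetry exporter API. The sketch relies only on `ConnectionRefusedError` being a subclass of `OSError`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def try_export(send: Callable[[bytes], None], payload: bytes) -> bool:
    """Call send(payload); return False instead of raising on network errors."""
    try:
        send(payload)
    except OSError as e:  # includes ConnectionRefusedError and socket timeouts
        logger.warning("Tracing export failed, keeping data buffered: %s", e)
        return False
    return True


def refused(payload: bytes) -> None:
    # Simulates the failure shown in the traceback above.
    raise ConnectionRefusedError(111, "Connection refused")


sent: list[bytes] = []
ok_result = try_export(sent.append, b"spans")
failed_result = try_export(refused, b"spans")
```

Which exception types actually reach the exporter depends on the HTTP library in use, so the `except` clause here is an assumption.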

dimaqq force-pushed the feat-otel branch 3 times, most recently from a483e07 to 3c060af Compare January 27, 2025 00:58

dimaqq changed the title ~~feat: otel [very draftey draft]~~ feat: ops[tracing] [very draftey draft] Jan 28, 2025

dimaqq changed the title ~~feat: ops[tracing] [very draftey draft]~~ feat: ops[tracing] [draftey draft] Jan 28, 2025

dimaqq force-pushed the feat-otel branch from 7899bbc to f4ccbb1 Compare January 28, 2025 07:06

feat: ops[tracing]

d7cfe5d

dimaqq force-pushed the feat-otel branch from 2a83b26 to d7cfe5d Compare January 30, 2025 06:34

chore: events as OTEL events; wip custom events

10308d8

benhoyt reviewed Jan 31, 2025

dimaqq added 4 commits January 31, 2025 15:53

wip

f7b57d2

discussion: if we're goiong to instrument yaml, let's log byte size

1ac5fb9

instrument mappings

397a3eb

add ops.Resources

cb82876
