feat: ops[tracing] [draftey draft] #1527

Draft: wants to merge 6 commits into base: main


dimaqq
Contributor

@dimaqq dimaqq commented Jan 15, 2025

[tbd]

Files

  • ops/jujuversion.py -- lightweight, no need to instrument
  • ops/_tracing/buffer.py -- new
  • ops/_tracing/__init__.py -- new, shim only
  • ops/_tracing/hacks.py -- may have to run before OTEL import
  • ops/_tracing/export.py -- the guts, exempt
  • ops/version.py -- nothing to do
  • ops/_main.py -- done
  • ops/log.py -- recursion prevention added
  • ops/charm.py -- started
  • ops/pebble.py -- TBD started, mostly relying on urllib instrumentation though
  • ops/framework.py -- TBD started
  • ops/__init__.py -- only shims, no need to instrument
  • ops/_private/harness.py -- test only
  • ops/_private/__init__.py -- empty
  • ops/_private/yaml.py -- done
  • ops/_private/timeconv.py -- nothing to do
  • ops/model.py -- tbd
  • ops/storage.py -- ??
  • ops/lib/__init__.py -- deprecated, will not touch
  • ops/jujucontext.py -- nothing to do
  • ops/main.py -- nothing to do
  • ops/testing.py -- out of scope

Functionality

@dimaqq
Contributor Author

dimaqq commented Jan 15, 2025

Notes:

  • adds opentelemetry-api (64kB wheel) to dependencies
  • adds a bunch of opentelemetry-this-and-that (total size tbd) to ops[tracing] dependency group

@dimaqq
Contributor Author

dimaqq commented Jan 17, 2025

No idea why RTD build fails...

HACKING.md: review thread outdated and resolved
@dimaqq force-pushed the feat-otel branch 3 times, most recently from a483e07 to 3c060af, on January 27, 2025 00:58
@dimaqq
Contributor Author

dimaqq commented Jan 27, 2025

squashed and rebased: #1539 got merged; #1538 a no-go.

@dimaqq changed the title from "feat: otel [very draftey draft]" to "feat: ops[tracing] [very draftey draft]" on Jan 28, 2025
@dimaqq changed the title from "feat: ops[tracing] [very draftey draft]" to "feat: ops[tracing] [draftey draft]" on Jan 28, 2025
Collaborator

@benhoyt benhoyt left a comment

Commented with some questions. Let's discuss further by voice on Monday.

# Use C speedups if available
_safe_loader = getattr(yaml, 'CSafeLoader', yaml.SafeLoader)
_safe_dumper = getattr(yaml, 'CSafeDumper', yaml.SafeDumper)


@tracer.start_as_current_span('ops.yaml.safe_load') # type: ignore
Collaborator

Why do we want to trace internal calls like this?

Contributor Author

I'm not opposed to dropping this.

The reason I've instrumented the YAML operations is that we've had a performance issue there before (missing C speedups).

Suppose someone runs into that again and we only provide coarse-grained tracing. They would see:

  • ops.RelationData.update() 1500ms (control handed over from charm to ops)
  • hook tool relation-set 50ms

What can they go on in this case?

Let's decide on this at a stand-up or during review.
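To illustrate the trade-off described above, here is a minimal sketch. It uses a stand-in `span` context manager built on the standard library, not the OpenTelemetry API, and the names `span`, `SPANS`, `slow_yaml_dump`, `relation_set` and `relation_data_update` are hypothetical. The sleeps only simulate work; the point is that an inner span lets a reader attribute the time spent inside the outer one.

```python
import time
from contextlib import contextmanager

# Finished spans as (name, duration in seconds); a stand-in for a real exporter.
SPANS: list[tuple[str, float]] = []


@contextmanager
def span(name: str):
    """Record how long the enclosed block took (stand-in for a tracing span)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))


def slow_yaml_dump() -> None:
    # Fine-grained span: without it, this time is only visible in the parent.
    with span('ops.yaml.safe_dump'):
        time.sleep(0.03)  # simulated slow serialisation


def relation_set() -> None:
    with span('hook tool relation-set'):
        time.sleep(0.001)  # simulated hook tool call


def relation_data_update() -> None:
    # Coarse-grained span: control handed over from the charm to the library.
    with span('ops.RelationData.update'):
        slow_yaml_dump()
        relation_set()


relation_data_update()
durations = dict(SPANS)
for name, seconds in SPANS:
    print(f'{name}: {seconds * 1000:.1f}ms')
```

With only the outer and hook tool spans, the gap between the two durations would be unexplained; the inner span is what names it.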


def setup_tracing(charm_class_name: str) -> None:
    global _exporter
    # FIXME would it be better to pass Juju context explicitly?
Collaborator

Yes, please. :-)

Contributor Author

Let's check with @tonyandrewmeyer

If I were to pass juju context to this function, I should probably also pass it to _Manager too...

And if so, maybe Scenario's Ops would also have to be changed?

I'm not sure about the trade-offs off the top of my head.

stored: int | None = conn.execute(
"""
SELECT sum((length(data)+4095)/4096*4096)
FROM tracing
Collaborator

Isn't this going to mean a full table scan? Is that a problem?
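As a quick way to check the question above, the following sketch runs the same expression against an in-memory database and asks SQLite for its query plan. The table definition is an assumption (only the `tracing` table and its `id` and `data` columns appear in the diff), and the row sizes are made up to show the rounding to 4096-byte blocks.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed schema: the PR only shows that `tracing` has `id` and `data` columns.
conn.execute('CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)')
conn.executemany(
    'INSERT INTO tracing (data) VALUES (?)',
    [(b'x' * 1,), (b'x' * 4096,), (b'x' * 4097,)],
)

QUERY = 'SELECT sum((length(data)+4095)/4096*4096) FROM tracing'

# Integer division rounds each blob up to a whole number of 4096-byte blocks:
# 1 -> 4096, 4096 -> 4096, 4097 -> 8192.
(stored,) = conn.execute(QUERY).fetchone()

# Each plan row is (id, parent, notused, detail); the detail text names the strategy.
plan_details = [row[3] for row in conn.execute('EXPLAIN QUERY PLAN ' + QUERY)]

print(stored)
print(plan_details)
```

The plan detail should mention a scan of `tracing`, since no index covers the expression; the exact wording differs between SQLite versions.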

Contributor Author

@dimaqq dimaqq Jan 31, 2025

Yes, but it's surprisingly fast.

I'm not sure I understand exactly why, but somehow, sum(length(blob_column)) is faster than sum(int_column).

One point to note is that we use BatchSpanProcessor, which buffers a bit in memory. By the time export is called and we buffer the data, the chunk size is on the order of 2KB to 4KB for trivial charms and ~40KB for complex, instrumented charms. This means that blob storage takes up most of the database, and the full scan touches only a smaller part, the rows.

I've just tested that the worst case for this query is <50ms in my VM, at 40MB db buffer size.

Do you reckon that's acceptable?

P.S.
SQLite is an MVCC database, which makes an exact sum or even an exact row count a costly operation.
I'm not sure if cold start can be effectively solved, even with stat1...
However, if really needed, we could cache the db size in memory, maybe? Run this query once on startup, then maintain the in-memory estimate.
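The caching idea in the P.S. could look roughly like the sketch below. It is not code from this PR: the class `SizeEstimate`, its methods, and the table schema are hypothetical, and it assumes a single process owns the database so that the in-memory number cannot drift.

```python
import sqlite3

BLOCK = 4096


def rounded(n: int) -> int:
    """Round a byte count up to a whole number of 4096-byte blocks."""
    return (n + BLOCK - 1) // BLOCK * BLOCK


class SizeEstimate:
    """Hypothetical in-memory estimate of the buffered size."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        # The full scan runs once, at startup.
        row = conn.execute(
            'SELECT sum((length(data)+4095)/4096*4096) FROM tracing'
        ).fetchone()
        self.stored = row[0] or 0  # sum() yields NULL for an empty table

    def add(self, data: bytes) -> int:
        cursor = self.conn.execute('INSERT INTO tracing (data) VALUES (?)', (data,))
        self.stored += rounded(len(data))
        return cursor.lastrowid

    def remove(self, row_id: int) -> None:
        row = self.conn.execute(
            'SELECT length(data) FROM tracing WHERE id = ?', (row_id,)
        ).fetchone()
        if row is None:
            return  # nothing to remove, estimate unchanged
        self.conn.execute('DELETE FROM tracing WHERE id = ?', (row_id,))
        self.stored -= rounded(row[0])


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)')
conn.execute('INSERT INTO tracing (data) VALUES (?)', (b'x' * 5000,))  # pre-existing row

estimate = SizeEstimate(conn)
after_start = estimate.stored
new_id = estimate.add(b'x' * 100)
after_add = estimate.stored
estimate.remove(new_id)
after_remove = estimate.stored
```

The cost is one extra point lookup per delete; whether that beats a scan of under 50ms per export is a judgment call for the review.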

conn.execute(
f"""
DELETE FROM tracing
WHERE id IN ({','.join(('?',) * len(collected_ids))})
Collaborator

I think it'd avoid allocating/copying twice to say '?,' * (len(collected_ids)-1) + '?'
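For comparison, here is a small sketch of both ways of building the placeholder list, run against an in-memory table. The schema and the sample ids are assumptions for illustration only. The two expressions give the same string for a non-empty list but differ when `collected_ids` is empty, which the caller would need to guard against either way.

```python
import sqlite3

collected_ids = [2, 3, 5]

# Variant from the diff: build a tuple of '?' and join it.
joined = ','.join(('?',) * len(collected_ids))
# Variant suggested in the review: string repetition plus a final '?'.
repeated = '?,' * (len(collected_ids) - 1) + '?'
same = joined == repeated

# Edge case: with no ids the variants disagree ('' versus '?').
empty_joined = ','.join(('?',) * 0)
empty_repeated = '?,' * (0 - 1) + '?'

conn = sqlite3.connect(':memory:')
# Assumed schema, for illustration only.
conn.execute('CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)')
conn.executemany(
    'INSERT INTO tracing (id, data) VALUES (?, ?)',
    [(i, b'x') for i in range(1, 7)],
)
# Only placeholders are interpolated into the SQL text; the ids are bound as parameters.
conn.execute(f'DELETE FROM tracing WHERE id IN ({repeated})', collected_ids)
remaining = [row[0] for row in conn.execute('SELECT id FROM tracing ORDER BY id')]
print(remaining)
```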

# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.
"""Buffer for tracing data."""
Collaborator

I just want to confirm: there's no existing buffer handling in the opentelemetry libraries, correct?

@@ -122,6 +125,7 @@ class Model:
as ``self.model`` from any class that derives from :class:`Object`.
"""

@tracer.start_as_current_span('ops.Model') # type: ignore
Collaborator

I thought our approach was going to be "trace external things like hook tools and Pebble calls", not most methods. In this file we seem to be taking more of an "every model method" approach. Is that required to get more meaningful traces? What would it look like with just hook tools + Pebble traced?

@@ -2068,23 +2074,27 @@ def _request_raw(

return response

@tracer.start_as_current_span('ops.pebble.Client.get_system_info') # type: ignore
Collaborator

Similar here: could we not trace the lower-level Pebble request to do it only in one place, or is that too hard to make meaningful?

@dimaqq
Contributor Author

dimaqq commented Jan 31, 2025

I need to handle more export errors:

ERROR:opentelemetry.sdk.trace.export:Exception while exporting Span batch.
Traceback (most recent call last):
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
        (self._dns_host, self.port),
    ...<2 lines>...
        socket_options=self.socket_options,
    )
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
    ~~~~~~~~~~~~^^^^
ConnectionRefusedError: [Errno 111] Connection refused
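One possible shape for that handling is sketched below, with no claim that it matches the PR's exporter. The function `export_or_keep` and the `send` callables are hypothetical. The sketch assumes the failure surfaces as an `OSError` (of which `ConnectionRefusedError` is a subclass); which exception actually reaches the exporter depends on the HTTP stack in use.

```python
from typing import Callable, List


def export_or_keep(chunks: List[bytes], send: Callable[[bytes], None]) -> List[bytes]:
    """Try to send buffered chunks in order; return the ones that must stay buffered."""
    for index, chunk in enumerate(chunks):
        try:
            send(chunk)
        except OSError:
            # Receiver unreachable: stop here and keep this chunk and all later
            # ones, so a later dispatch can retry them in the same order.
            return chunks[index:]
    return []


def refusing_send(chunk: bytes) -> None:
    """Simulates the failure from the traceback above."""
    raise ConnectionRefusedError(111, 'Connection refused')


sent: List[bytes] = []


def flaky_send(chunk: bytes) -> None:
    """Accepts one chunk, then starts refusing connections."""
    if sent:
        raise ConnectionRefusedError(111, 'Connection refused')
    sent.append(chunk)


kept_all = export_or_keep([b'a', b'b'], refusing_send)
kept_tail = export_or_keep([b'a', b'b', b'c'], flaky_send)
kept_none = export_or_keep([b'a'], lambda chunk: None)
```

Stopping at the first failure avoids hammering a receiver that is down, at the price of delaying chunks that might have gone through.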
