feat: ops[tracing] [draftey draft] #1527
base: main
Conversation
Notes:
No idea why RTD build fails...
Commented with some questions. Let's discuss voice further on Monday.
# Use C speedups if available
_safe_loader = getattr(yaml, 'CSafeLoader', yaml.SafeLoader)
_safe_dumper = getattr(yaml, 'CSafeDumper', yaml.SafeDumper)

@tracer.start_as_current_span('ops.yaml.safe_load')  # type: ignore
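The `getattr` fallback in the hunk above can be exercised without PyYAML installed; here is a minimal sketch using stand-in module objects (`SimpleNamespace` and `pick_loader` are illustrative, not part of ops):

```python
from types import SimpleNamespace

# Stand-in for the yaml module: a build without C speedups lacks CSafeLoader.
yaml_pure = SimpleNamespace(SafeLoader=object)
yaml_fast = SimpleNamespace(SafeLoader=object, CSafeLoader=type('CSafeLoader', (), {}))

def pick_loader(yaml_module):
    # Prefer the C-accelerated loader when the binding exists,
    # otherwise fall back to the pure-Python one.
    return getattr(yaml_module, 'CSafeLoader', yaml_module.SafeLoader)

assert pick_loader(yaml_pure) is yaml_pure.SafeLoader
assert pick_loader(yaml_fast) is yaml_fast.CSafeLoader
```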
Why do we want to trace internal calls like this?
I'm not opposed to dropping this.
The reason I've instrumented yaml ops is that we'd had a performance issue before (missing c speedups).
Suppose someone runs into that again and we only provide coarse-grained tracing, they would see:
ops.RelationData.update()   1500ms  (control handed over from charm to ops)
    hook tool relation-set    50ms
What can they go on in this case?
Let's decide on this at a stand-up or during review.
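To make the trade-off concrete, here is a toy stand-in for nested spans (`span` and the recorded tuples are illustrative, not the real OpenTelemetry API) showing what fine-grained tracing adds over a single coarse span:

```python
import time
from contextlib import contextmanager

spans = []  # (name, depth, duration_ms); inner spans are appended before outer ones
_depth = 0

@contextmanager
def span(name):
    # Toy stand-in for tracer.start_as_current_span: records nesting and timing.
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((name, _depth, (time.perf_counter() - start) * 1000))

# Coarse tracing alone would show only the outer span; the inner spans
# are what would pinpoint e.g. a slow pure-Python yaml dump.
with span('ops.RelationData.update'):
    with span('ops.yaml.safe_dump'):
        time.sleep(0.01)  # imagine a slow dump (missing C speedups) here
    with span('hook tool: relation-set'):
        time.sleep(0.001)
```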
def setup_tracing(charm_class_name: str) -> None:
    global _exporter
    # FIXME would it be better to pass Juju context explicitly?
Yes, please. :-)
Let's check with @tonyandrewmeyer
If I were to pass juju context to this function, I should probably also pass it to _Manager too...
And if so, maybe Scenario's Ops would also have to be changed?
I'm not sure about the trade-offs off the top of my head.
stored: int | None = conn.execute(
    """
    SELECT sum((length(data)+4095)/4096*4096)
    FROM tracing
Isn't this going to mean a full table scan? Is that a problem?
Yes, but it's surprisingly fast.
I'm not sure I understand exactly why, but somehow, sum(length(blob_column)) is faster than sum(int_column).
One point to note is that BatchSpanProcessor buffers a bit in memory. By the time export is called and we buffer the data, the chunk size is on the order of 2KB~4KB for trivial charms and ~40KB for complex, instrumented charms. This means that blob storage takes up most of the database, and the full scan only touches the smaller part, the rows.
I've just tested that the worst case for this query is <50ms in my VM, at 40MB db buffer size.
Do you reckon that's acceptable?
P.S.
SQLite is an MVCC database, which makes an exact sum or even an exact row count a costly operation.
I'm not sure if cold start can be effectively solved, even with stat1...
However, if really needed, we could cache the db size in memory, maybe? Run this query once on startup, then maintain the in-memory estimate.
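A quick sketch of that cached-estimate idea, using the same page-rounded sum as the query above (the schema and helper names are made up for illustration, not taken from the PR):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tracing (id INTEGER PRIMARY KEY, data BLOB)')
for size in (100, 5000):
    conn.execute('INSERT INTO tracing (data) VALUES (?)', (b'x' * size,))

def scan_stored(conn):
    # Full-scan estimate: each blob rounded up to whole 4 KiB pages.
    row = conn.execute(
        'SELECT sum((length(data)+4095)/4096*4096) FROM tracing'
    ).fetchone()
    return row[0] or 0

# Run the full scan once at startup, then maintain the estimate in memory.
stored = scan_stored(conn)  # 4096 + 8192

def insert(conn, data):
    global stored
    conn.execute('INSERT INTO tracing (data) VALUES (?)', (data,))
    stored += (len(data) + 4095) // 4096 * 4096

insert(conn, b'y' * 4097)  # adds two pages to the estimate
```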
conn.execute(
    f"""
    DELETE FROM tracing
    WHERE id IN ({','.join(('?',) * len(collected_ids))})
I think it'd avoid allocating/copying twice to say `'?,' * (len(collected_ids) - 1) + '?'`.
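For reference, the two spellings produce identical placeholder fragments for the `IN (...)` clause:

```python
def placeholders_join(n):
    # As in the PR: build a tuple of '?' and join with commas.
    return ','.join(('?',) * n)

def placeholders_concat(n):
    # Suggested alternative: one string repetition plus one concatenation.
    return '?,' * (n - 1) + '?'

assert placeholders_join(3) == placeholders_concat(3) == '?,?,?'
```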
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.
"""Buffer for tracing data."""
I just want to confirm: there's no existing buffer handling in the opentelemetry libraries, correct?
@@ -122,6 +125,7 @@ class Model:
    as ``self.model`` from any class that derives from :class:`Object`.
    """

@tracer.start_as_current_span('ops.Model')  # type: ignore
I thought our approach was going to be "trace external things like hook tools and Pebble calls", not most methods. In this file we seem to be taking more of an "every model method" approach. Is that required to get more meaningful traces? What would it look like with just hook tools + Pebble traced?
@@ -2068,23 +2074,27 @@ def _request_raw(

    return response

@tracer.start_as_current_span('ops.pebble.Client.get_system_info')  # type: ignore
Similar here: could we not trace the lower-level Pebble request to do it only in one place, or is that too hard to make meaningful?
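A toy illustration of the "trace only the low-level request" option, deriving the span name from the request arguments (`Client`, `traced`, and `calls` here are stand-ins for illustration, not the real ops.pebble.Client or tracer):

```python
# Record span names instead of real spans, to keep the sketch self-contained.
calls = []

def traced(name):
    # Decorator: 'name' is either a fixed string or a callable that
    # derives the span name from the wrapped function's arguments.
    def wrap(fn):
        def inner(*args, **kwargs):
            calls.append(name if isinstance(name, str) else name(*args, **kwargs))
            return fn(*args, **kwargs)
        return inner
    return wrap

class Client:
    @traced(lambda self, method, path: f'pebble: {method} {path}')
    def _request(self, method, path):
        # One choke point: every API method goes through here.
        return {'method': method, 'path': path}

    def get_system_info(self):
        # No per-method span needed: _request records one for us.
        return self._request('GET', '/v1/system-info')

Client().get_system_info()
```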
I need to handle more export errors:
[tbd]
Files

- ops/jujuversion.py -- lightweight, no need to instrument
- ops/_tracing/buffer.py -- new
- ops/_tracing/__init__.py -- new, shim only
- ops/_tracing/hacks.py -- may have to run before OTEL import
- ops/_tracing/export.py -- the guts, exempt
- ops/version.py -- nothing to do
- ops/_main.py -- done
- ops/log.py -- recursion prevention added
- ops/charm.py -- started
- ops/pebble.py -- TBD started, mostly relying on urllib instrumentation though
- ops/framework.py -- TBD started
- ops/__init__.py -- only shims, no need to instrument
- ops/_private/harness.py -- test only
- ops/_private/__init__.py -- empty
- ops/_private/yaml.py -- done
- ops/_private/timeconv.py -- nothing to do
- ops/model.py -- tbd
- ops/storage.py -- ??
- ops/lib/__init__.py -- deprecated, will not touch
- ops/jujucontext.py -- nothing to do
- ops/main.py -- nothing to do
- ops/testing.py -- out of scope

Functionality
(comparison table: ops with [tracing] vs. ops without [tracing])