-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better Tracing #2032
Comments
note #1738 |
Beware of making the code complicated in order to achieve this! A great way to reduce the number of traces is Sampling: https://opentelemetry.io/docs/concepts/sampling/ You can reduce traffic 10/100 fold while not losing detail. |
Part of #2032. I removed 2 spans that seem redundant: - `shape_write.log_collector.handle_txn`: this span wraps the `handle_transaction` function. However, there is already a span `pg_txn.replication_client.transaction_received` that in fact calls into `handle_transaction`. - `shape_write.log_collector.handle_relation`: this span wraps the handling of relation messages. Similarly to transactions, there is a `pg_txn.replication_client.relation_received` span that ends up calling into the `handle_relation` function. Here are the relevant code snippets: `Electric.Postgres.ReplicationClient`: ```ex {m, f, args} = state.transaction_received OpenTelemetry.with_span( "pg_txn.replication_client.transaction_received", [num_changes: length(txn.changes), num_relations: MapSet.size(txn.affected_relations)], fn -> apply(m, f, [txn | args]) end ) ``` The call to `apply` is a chain of calls that eventually ends up calling `handle_transaction` (and does nothing more): `Electric.StackSupervisor`: ```ex transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [shape_log_collector]} ``` `Electric.Replication.ShapeLogCollector.ex`: ```ex def store_transaction(%Transaction{} = txn, server) do ot_span_ctx = OpenTelemetry.get_current_context() GenStage.call(server, {:new_txn, txn, ot_span_ctx}, :infinity) end def handle_call({:new_txn, %Transaction{xid: xid, lsn: lsn} = txn, ot_span_ctx}, from, state) do OpenTelemetry.set_current_context(ot_span_ctx) Logger.info("Received transaction #{xid} from Postgres at #{lsn}") Logger.debug(fn -> "Txn received in ShapeLogCollector: #{inspect(txn)}" end) OpenTelemetry.with_span("shape_write.log_collector.handle_txn", [], fn -> handle_transaction(txn, from, state) end) end ``` ### Question I removed the spans from `Electric.Replication.ShapeLogCollector`, another option would be to remove the spans from `Electric.Postgres.ReplicationClient`, any preference here?
Part of #2032. We noticed that most spans are descendants of the `pg_txn.replication_client.process_x_log_data` root span. Therefore, we decided to only sample a portion of those spans. This PR introduces a custom sampler that works as follows: - Samples a configurable ratio of `pg_txn.replication_client.process_x_log_data` root spans - Samples all other root spans - Child spans are sampled if their parent is sampled ### Problem We would like to sample all errors. To do this we need to make the sampling decision at the end of the span when we have all attributes and events because errors are recorded using an "exception" event, this is known as tail sampling. However, the Erlang opentelemetry library only seems to support head sampling: > Sampling is performed at span creation time by the Sampler configured on the Tracer cf. https://docs.honeycomb.io/manage-data-volume/sample/techniques/ if you're not familiar with head vs tail sampling. EDIT: solving this problem may require using HoneyComb's "Refinery" mechanism. So Electric would sample all traces and we would setup Refinery with custom tail sampling logic. However, this requires extra infrastructure to set up.
@balegas more work needed here? |
@icehaunter let's add the Electric metrics to honeycomb. |
We have set up basic Tracing for Electric but we want to continue improving what we become more successful investigating issues.
The text was updated successfully, but these errors were encountered: