Replies: 2 comments 2 replies
-
Firebase handles it similarly to Scenario 2, letting you register up to 500 custom events. I wouldn't mind having to register my custom events first 👌
-
From a user perspective, I'm a fan of Scenario 1. If there is a dropdown where you add custom events, whatever you add next can automatically be mapped to the next custom event in the background for that user's account. Depending on how the user builds the lookup table, Scenario 2 may not be that bad either. I think @GianniCarlo's reference to Firebase's approach of capping the number of custom events is reasonable, though 500 may be high for this stage.
-
Hi, I have something to bring before you as a community of users. Namely, I’ve come to the conclusion that I need to drop a feature in the name of performance. Currently, you can include a payload dictionary in your signals, and you can filter by the key-value pairs in that payload dictionary. In your hypothetical pasta timer app, you just add
{"pastaType": "Farfalle"}
to your signal and you can later filter and break down your signals by pastaType.
The payload dictionary also contains a list of default keys such as operating system version, build number, and so on. Behind the scenes, this works using the PostgreSQL jsonb field type: a database column I can drop JSON data into and then filter and query as if it contained regular columns, at a performance cost.
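For context, sending such a signal from the Swift client looks roughly like this. Treat it as a sketch: the exact configuration and call signature depend on your client version, and the app ID is a placeholder.

```swift
import TelemetryClient

// Configure the client once, early in the app lifecycle.
// "YOUR-APP-ID" is a placeholder, not a real app ID.
let configuration = TelemetryManagerConfiguration(appID: "YOUR-APP-ID")
TelemetryManager.initialize(with: configuration)

// Send a signal with a custom payload entry. The "pastaType" key is what
// you can later filter and break down by.
TelemetryManager.send("timerStarted", with: ["pastaType": "Farfalle"])
```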
Now, over the last few months, I’ve been working towards no longer using PostgreSQL for calculating Insights. I'm migrating to Apache Druid instead: a time-series database that is way faster and specifically engineered to work with large amounts of analytics data. The reason for this will be obvious to you if you use Telemetry with an app that has more than a couple hundred users: at some point, PostgreSQL performance drops off a cliff. I can mitigate the problem by giving the database server more and more resources, but for the longer term we need a different solution. After looking at various data lakes, NoSQL databases, and time-series databases, I believe Druid is the right way forward.
The downside is that Druid does not have a JSON-type field; it needs discrete columns, for a variety of reasons. So something has to change in the way signals are ingested. I can see three options, and I’d very much like to hear your opinions on them:
Scenario 1: Automatic Lookup Table
In the database: Create columns for all the default keys, as well as custom1, custom2, ... customN. Also create a lookup table that, for each app, maps a custom string to each of the column names.
At ingestion time: For each key in the payload dictionary, check if it is in the lookup table for that app. If not, map it to the next free column. If there are no free columns left, ignore the key and warn the user in the UI somehow. Users can use the Lexicon UI to manage the lookup table.
The key's value gets transcribed to the mapped column.
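To make the flow concrete, here is a minimal Swift sketch of what the automatic mapping could do at ingestion time. Everything in it is hypothetical (the type names, the column limit, the warning mechanism); it only illustrates the idea.

```swift
/// Hypothetical per-app lookup table for Scenario 1.
struct LookupTable {
    static let columnLimit = 20 // illustrative number of customN columns per app
    private(set) var mapping: [String: String] = [:] // payload key -> column name

    /// Returns the column for `key`, assigning the next free customN column
    /// to keys that haven't been seen before. Returns nil if no columns are left.
    mutating func column(for key: String) -> String? {
        if let existing = mapping[key] { return existing }
        guard mapping.count < Self.columnLimit else { return nil } // would warn the user in the UI
        let column = "custom\(mapping.count + 1)"
        mapping[key] = column
        return column
    }
}

/// Transcribes a payload into discrete columns, ignoring keys that no longer fit.
func ingest(payload: [String: String], into table: inout LookupTable) -> [String: String] {
    var row: [String: String] = [:]
    for (key, value) in payload {
        if let column = table.column(for: key) {
            row[column] = value
        }
    }
    return row
}
```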
Scenario 2: Manual Lookup Table
In the database: Same as Scenario 1. Create columns for all the default keys, as well as custom1, custom2, ... customN. Also create a lookup table that, for each app, maps a custom string to each of the column names.
However, this time the lookup table is not auto-generated. Instead, each time a user wants to use a custom key, they'll have to define a mapping in the lookup table first, via some control in the Lexicon UI.
At ingestion time, if a mapping has been defined, the value gets transcribed into the mapped column. Otherwise, the value is dropped.
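For comparison, the ingestion step in Scenario 2 reduces to a pure lookup against the user-defined mapping. Again, a hypothetical sketch:

```swift
/// Scenario 2 sketch: the mapping comes from the Lexicon UI and is never
/// extended at ingestion time; unmapped keys are simply dropped.
func ingest(payload: [String: String], mapping: [String: String]) -> [String: String] {
    var row: [String: String] = [:]
    for (key, value) in payload {
        if let column = mapping[key] { // e.g. ["pastaType": "custom1"]
            row[column] = value
        }
    }
    return row
}
```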
Scenario 3: No Lookup Table
In the database: Create columns for all the default keys, as well as custom1, custom2, ... customN. Do not create a lookup table.
This scenario drops the lookup table. Signals are no longer allowed to have a custom payload. Instead, when a Signal is generated, users have hard-coded properties named custom1, custom2, ... customN that they can fill.
At ingestion time, these are written directly into the respective database columns.
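From the client's perspective, Scenario 3 might look something like the sketch below. The property names are the whole point: there is no semantic key left, only numbered slots.

```swift
// Scenario 3 sketch: no payload dictionary, just a fixed set of hard-coded
// slots that map 1:1 to database columns. Purely illustrative.
struct Signal {
    var type: String
    var custom1: String? = nil
    var custom2: String? = nil
    // ... up to customN
}

// The developer has to remember that, in this app, custom1 means "pastaType".
let signal = Signal(type: "timerStarted", custom1: "Farfalle")
```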
Merits and Drawbacks
From a purely user-focused standpoint, Scenario 1 seems ideal. You, the users, need to change almost nothing. However, the lookup comes at a performance cost, and there’s the very real risk of losing data once all custom columns are filled.
From a performance standpoint, and from a standpoint of obfuscating as little as possible from users (who are, after all, developers and want to understand a system), Scenario 3 seems most attractive.
The downside, though, is that users will have to do the lookup in their heads, or on a piece of paper. A lot of “semantic-ness” is lost. One way I’ve thought about mitigating this is to not use customN as the column names, but instead have columns for all possible three-letter acronyms, though this is probably a bit hacky.
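One way to keep that lookup out of people's heads, at least, would be an app-side convention like the one sketched here. This is purely illustrative and not a planned API:

```swift
// Hypothetical: pin the per-app meaning of each slot in code, so the semantic
// name survives in the source even though the column is just "custom1".
enum PastaTimerColumns {
    static let pastaType = "custom1"
}

let payload = [PastaTimerColumns.pastaType: "Farfalle"]
```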
So I’m asking you, the community of Telemetry users: which direction should we go? Are there other options you can think of? Are there ways to make the programming API easy to use with hard-coded column names?
A custom payload is clearly important to many users of Telemetry. Let’s see if we can make that feature useful, easy AND fast. Thanks for reading :)