Enforce a standard format for decoded sensor data #395
Comments
@pablojimpas thanks for your suggestion. I think this can be very useful indeed. In my view, there would be a common JSON schema for decoded payload; the normalized payload. Then, there would be another function that maps the output from the uplink decoder to this new schema. I would like to keep that as a separate, optional step, after the uplink decoder. The decoded uplink is preserved in upstream messages in The Things Stack, and there will be a new field with normalized fields. The reason for this is that this standard JSON schema will never fully cover all device specific fields. We already track sensors so what's really to be done is drafting a JSON schema with properties for each of these sensor types, that clearly define their data type, unit, description and validity (max/min, scale, patterns, enum values etc). When we have that, we can define a new function signature maybe The Things Stack would make this available on What do you think? |
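To make the two-step idea above concrete, here is a minimal Go sketch (the real codecs would not necessarily be written in Go). All names in it (`Payload`, `Normalizer`, `Upstream`, `buildUpstream`, and the device fields `temp_c` and `battery_mv`) are assumptions made for illustration, not an existing The Things Stack API. It only shows the shape of the proposal: the decoded payload is always preserved, and a normalized payload is added when an optional normalizer exists and succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// Payload is a generic JSON-like object; the name is illustrative only.
type Payload map[string]interface{}

// Normalizer is a hypothetical signature for the optional second step.
type Normalizer func(decoded Payload) (Payload, error)

// Upstream is a hypothetical upstream message carrying both payloads.
type Upstream struct {
	Decoded    Payload // always preserved as produced by the uplink decoder
	Normalized Payload // nil when there is no normalizer or it failed
}

// buildUpstream keeps the decoded payload and adds normalized fields when possible.
func buildUpstream(decoded Payload, normalize Normalizer) Upstream {
	msg := Upstream{Decoded: decoded}
	if normalize == nil {
		return msg
	}
	if n, err := normalize(decoded); err == nil {
		msg.Normalized = n
	}
	return msg
}

// exampleNormalizer maps a made-up device field "temp_c" to "temperature".
func exampleNormalizer(decoded Payload) (Payload, error) {
	v, ok := decoded["temp_c"].(float64)
	if !ok {
		return nil, errors.New("temp_c missing or not a number")
	}
	return Payload{"temperature": v}, nil
}

func main() {
	msg := buildUpstream(Payload{"temp_c": 21.2, "battery_mv": 3600.0}, exampleNormalizer)
	fmt.Println(msg.Decoded, msg.Normalized)
}
```

Device-specific fields such as the made-up `battery_mv` stay available in the decoded payload even though the normalized schema does not cover them.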
@johanstokking I like your approach! Having this as a separate schema will allow us to draft the solution more quickly and then, since it's an optional step, the transition period will be graceful. Each manufacturer will be able to implement the normalizeUplink(data) function at its own pace. Now it's just a matter of adding the new I may be able to contribute to this if I'm pointed in the right direction, since I want to see this happen promptly, but I don't promise anything, looking forward to your comments. |
We can very well use some help here. Are you proficient with drafting JSON schema? What we need is to extend the schema with:
If you're not proficient with JSON schema you can also just start with a table defining the sensor types in terms of unit and validity info. We can get that drafted in a JSON schema. What we may also consider is allowing multiple measurements or an indication of the actual sensor. There may be more than one sensor of the same type, like buttons. CayenneLPP supports numbered channels for this. |
Unfortunately, I've minimal experience with JSON schemas.
I'll take care of it. I'll start such a table with all the current sensor types in the first comment of this topic so that it is visible, I will start with just the basics, but I encourage input from all who are interested. |
@johanstokking how do you think generic sensor types should be handled? For example, a sensor type of I also see an issue with the current list of sensor types, there's for example I think right now it's a bit messy mixing types of sensor with quantities. The type of sensor is a more consumer-facing concept, but the actual quantity being measured is what developers/integrators are interested in. So, maybe the normalized output should only care about types of measurements and not about types of sensors. What do you think? |
Yes I agree. We should probably define the quantities and the units. Which sensor produced a measurement can be defined in some sort of channel or sensor index. |
So, something like this: "normalized_payload": {
"temperature": [
{
"value": 17.5,
"unit": "celsius",
"sensor": "surface temperature"
}
],
"pressure": [
{
"value": 1022,
"unit": "hectopascal",
"sensor": "vapor pressure"
},
{
"value": 1013.25,
"unit": "hectopascal",
"sensor": "barometer"
}
]
} Well, maybe the |
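For illustration, this is roughly how an integrator might consume the per-quantity arrays proposed above. It is only a sketch: the Go type and function names (`Measurement`, `NormalizedPayload`, `parseNormalized`, `bySensor`) are made up here, and the sample parses just the inner object of `normalized_payload`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Measurement mirrors one array entry of the proposed format.
type Measurement struct {
	Value  float64 `json:"value"`
	Unit   string  `json:"unit"`
	Sensor string  `json:"sensor"`
}

// NormalizedPayload maps a quantity name ("temperature", "pressure") to its measurements.
type NormalizedPayload map[string][]Measurement

// parseNormalized decodes the proposed JSON object into the map above.
func parseNormalized(raw string) (NormalizedPayload, error) {
	var p NormalizedPayload
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return nil, err
	}
	return p, nil
}

// bySensor returns the first measurement of a quantity produced by the named sensor.
func bySensor(p NormalizedPayload, quantity, sensor string) (Measurement, bool) {
	for _, m := range p[quantity] {
		if m.Sensor == sensor {
			return m, true
		}
	}
	return Measurement{}, false
}

func main() {
	raw := `{"pressure":[
		{"value":1022,"unit":"hectopascal","sensor":"vapor pressure"},
		{"value":1013.25,"unit":"hectopascal","sensor":"barometer"}]}`
	p, err := parseNormalized(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	if m, ok := bySensor(p, "pressure", "barometer"); ok {
		fmt.Println(m.Value, m.Unit)
	}
}
```

Note that the consumer always has to search an array, even when a device reports a single value per quantity.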
@johanstokking I've started working on a JSON schema for the format that I described in my previous comment. I'm only making required the I'll post a PR with my early progress soon, but first let me know if you've something against this proposed format, please. |
Thanks, see review comments in the PR. Indeed we don't need units. Maybe we need to support both simple readings and per-sensor readings? Like {
"temperature": 21.2,
"humidity": 37.5,
"sensors": {
"test": {
"temperature": 23.5
},
"other": {
"humidity": 41.1,
"windSpeed": 11.9
}
}
} This way, we can do the following:
The way this would work is that there's a pattern properties within |
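A possible way to read the mixed format above (simple top-level readings plus a `sensors` map) is sketched below. The type names and the fallback behaviour in `temperatureOf` are assumptions of this sketch, not something decided in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Quantities holds the readings; pointers distinguish "absent" from zero.
type Quantities struct {
	Temperature *float64 `json:"temperature,omitempty"`
	Humidity    *float64 `json:"humidity,omitempty"`
	WindSpeed   *float64 `json:"windSpeed,omitempty"`
}

// Payload has simple readings at the top level and optional per-sensor readings.
type Payload struct {
	Quantities
	Sensors map[string]Quantities `json:"sensors,omitempty"`
}

// decodePayload parses the mixed format.
func decodePayload(raw string) (Payload, error) {
	var p Payload
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

// temperatureOf returns the temperature of a named sensor. As an assumption of
// this sketch, it falls back to the top-level reading when the sensor is
// unknown or reports no temperature.
func temperatureOf(p Payload, sensor string) (float64, bool) {
	if s, ok := p.Sensors[sensor]; ok && s.Temperature != nil {
		return *s.Temperature, true
	}
	if p.Temperature != nil {
		return *p.Temperature, true
	}
	return 0, false
}

func main() {
	raw := `{"temperature":21.2,"humidity":37.5,
		"sensors":{"test":{"temperature":23.5},"other":{"humidity":41.1,"windSpeed":11.9}}}`
	p, err := decodePayload(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(temperatureOf(p, "test"))
}
```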
I agree that would make most common payloads smaller and simpler while still retaining the flexibility to model complex scenarios. However, I see a flaw with your proposed format. Suppose that an integrator wants to make use of this normalized payload and gets: {
"temperature": 21.2,
"humidity": 37.5
} How will the end-user app know if I like the simpler approach, but I think we have to keep some context in the data to be useful. Some quantities are used in many situations, and knowing which type of sensor made the measurement gives the context to figure out those situations. |
Right, so then we have two options: {
"soilTemperature": -3.1,
"airTemperature": 9.4
} Or: {
"temperature": [
{
"value": -3.1,
"source": "soil"
},
{
"value": 9.4,
"source": "air"
}
]
} Here, anyone can access Therefore I would prefer the first scenario: it's very explicit. There can be different temperatures: soil, water, air, for thermostats also the target temperature etc. Things like soil moisture levels on different depths (i.e. % at 25/50/75/100 cm deep) wouldn't be normalized this way; we would still need arrays for that. So we can also do both: {
"soilMoisture": [
{
"value": 29.4,
"source": "-100cm"
},
{
"value": 36.5,
"source": "-75cm"
},
...
]
} This way, most applications will just use What do you think? |
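A rough sketch of what "doing both" could look like for a consumer: explicit scalar fields for the common cases and arrays only where a quantity needs a source such as depth. The Go names are hypothetical, and the field set is limited to the examples in this comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SourcedValue is one array entry for quantities that need a source.
type SourcedValue struct {
	Value  float64 `json:"value"`
	Source string  `json:"source"`
}

// Normalized mixes explicit scalar fields with an array-valued field.
type Normalized struct {
	SoilTemperature *float64       `json:"soilTemperature,omitempty"`
	AirTemperature  *float64       `json:"airTemperature,omitempty"`
	SoilMoisture    []SourcedValue `json:"soilMoisture,omitempty"`
}

// decodeNormalized parses the mixed format.
func decodeNormalized(raw string) (Normalized, error) {
	var n Normalized
	err := json.Unmarshal([]byte(raw), &n)
	return n, err
}

// moistureAt looks up the soil moisture reported for a given source label.
func moistureAt(n Normalized, source string) (float64, bool) {
	for _, m := range n.SoilMoisture {
		if m.Source == source {
			return m.Value, true
		}
	}
	return 0, false
}

func main() {
	raw := `{"airTemperature":9.4,"soilTemperature":-3.1,
		"soilMoisture":[{"value":29.4,"source":"-100cm"},{"value":36.5,"source":"-75cm"}]}`
	n, err := decodeNormalized(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	if n.AirTemperature != nil {
		fmt.Println("air:", *n.AirTemperature) // direct access, no array search
	}
	fmt.Println(moistureAt(n, "-75cm"))
}
```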
Overall, I agree with your analysis @johanstokking, it's a good compromise to cover every possible use case; however, this solution will require more domain knowledge for every measurement to get the naming right! I can start modeling the easy ones in the JSON schema if you want, but I would like to get more input from other parties (e.g. device manufacturers) to make sure everyone's interests are met. On the other hand, what changes will be necessary on |
@johanstokking any chances of seeing this in |
Yes you are correct; the implementation is fairly trivial in The Things Stack and we can make 3.20.0. You can indeed open an issue in https://github.com/TheThingsNetwork/lorawan-stack/issues referencing this one. I can also do it but good to file in your own words and you'll be subscribed automatically etc. |
Perfect! I've just created the new issue there TheThingsNetwork/lorawan-stack#5429 and mentioned you, so you also get in the discussion. |
Yes it does indeed. Knowing the difference between air and soil temperature is necessary domain knowledge I think. Naive applications may otherwise mix up different quantities. I think we need to keep those quantities (like air vs soil temperature) separate from the unit (both Celsius). The question is though whether we need this array with multiple values. I think there are a few use cases for it:
So
Yes true. This will gradually grow over time, just like we kept adding sensors to the Device Repository. This is very much an iterative process. |
Absolutely, that's the main benefit of creating this normalized format, to give context/meaning to the data used by end user applications. We must protect that feature in the implementation.
Two concrete examples will help us understand this more easily, one trivial and one that simulates a fairly complete scenario. This will ensure that we don't miss any important detail from the simplest case to the very complex device. The normalized format should be flexible enough to cover both cases and the most straightforward solution (avoid arrays and properties bloat if possible), while still retaining context for the data. The first one will be a device that just sends one reading at a time of ambient temperature. The ideal and most straightforward format in that case to me will be just this: {
"ambientTemperature": 20.2
} As an app integrator, you maintain all the context (quantity=temperature, source=ambient, units=implied by the quantity, documented somewhere in the specification of this format), and you don't have to deal with anything else. But now suppose that we have a single microcontroller getting the following measurements from different sensors:
Then, it groups 2 readings spaced in time and packs them into a single LoRaWAN packet. {
"readings": [
{
"time": ...,
"ambientTemperature": 20.3,
"ambientHumidity": 33.0,
"atmosphericPressure": 1012.4,
"solarRadiation": 294.4,
"windSpeed": 2.8,
"windDirection": 181.0,
"leafHumidity": 23.5,
"soilTemperature": [
{
"value": 13.5,
"source": "50cm"
},
{
"value": 15.5,
"source": "10cm"
}
],
"soilMoisture": [
{
"value": 60.5,
"source": "50cm"
},
{
"value": 55.0,
"source": "10cm"
}
],
"soilEC": {
"value": 2740.0,
"source": "10cm"
},
"soilPH": {
"value": 5.8,
"source": "10cm"
},
"soilNitrogen": {
"value": 200.4,
"source": "10cm"
},
"soilPhosphorus": {
"value": 158.8,
"source": "10cm"
},
"soilPotassium": {
"value": 303.1,
"source": "10cm"
}
},
{
"time": ...something else...,
/*...second reading...*/
}
]
} This is less than ideal because, from the perspective of an integrations developer, if you need to support both scenarios (or anything in between) you have to check for a huge number of possibilities, but it's needed to preserve all the context. The format should be uniform regardless of the scenario, but the complexity has to be modeled somewhere (the proposed format could certainly be improved, though). Ideally we come up with a solution that doesn't pollute the simple cases too much. The first example, made uniform with the latter format, will look like this: {
"readings": [
{
"ambientTemperature": 20.2
}
]
} Which does not look that horrible from an integrator perspective: We still have to consider the source, which can be dynamic for some quantities (e.g. soil temperature). We can consider this to be some kind of “source modifier” because it is valuable to have a well-defined source. And there could also be multiple readings of the same quantity. So, the above example in reality will be unified to: {
"readings": [
{
"temperature": [
{
"value": 20.2,
"source": "ambient"
},
]
}
]
} This is starting to get really ugly, but I think being uniform may be a necessary evil. The complex example will partially look like this: {
"readings": [
{
"time": ...,
...
"temperature": [
{
"value": 20.3,
"source": "ambient"
},
{
"value": 15.5,
"source": "soil",
"modifier": "10cm"
},
]
...
},
...
]
} So, we have: From the integrator perspective, it's valuable knowing that the data will be uniform regardless of the scenario you're dealing with. Not having this extra verbosity comes at the expense of having to check for a gargantuan number of possibilities if you want to cover every scenario without being a “naive application”. The “API” for working with this data is not that bad apart from the two arrays: package main
import (
"encoding/json"
"fmt"
"time"
)
type Reading struct {
Time time.Time `json:"time,omitempty"`
Temperature []Measurement `json:"temperature,omitempty"`
}
type Measurement struct {
Value float32 `json:"value"`
Source string `json:"source"`
Modifier string `json:"modifier,omitempty"`
}
func main() {
rawData := `[
{
"time": "2022-05-06T19:07:10Z",
"temperature": [
{
"value": 20.3,
"source": "ambient"
},
{
"value": 15.5,
"source": "soil",
"modifier": "10cm"
}
]
},
{
"time": "2022-05-06T19:27:10Z",
"temperature": [
{
"value": 13.5,
"source": "soil",
"modifier": "10cm"
}
]
},
{
"time": "2022-05-06T19:57:10Z",
"temperature": [
{
"value": 9.0,
"source": "soil",
"modifier": "50cm"
}
]
},
{
"time": "2022-05-06T20:59:10Z",
"humidity": [
{
"value": 83.8,
"source": "soil",
"modifier": "10cm"
}
]
}
]`
var readings []Reading
err := json.Unmarshal([]byte(rawData), &readings)
if err != nil {
// stop here: readings is not usable if the JSON could not be parsed
fmt.Println(err)
return
}
// e.g. print only the measurements of soil temperature at 10cm
for _, r := range readings {
for _, t := range r.Temperature {
if t.Source == "soil" && t.Modifier == "10cm" {
fmt.Printf("%v: Soil temperature at 10cm was %v\n", r.Time, t.Value)
}
}
}
} My remaining concern with this is the Please excuse such a long example to make my points, but hopefully, I've brought some concerns to the table, so we can design a better solution. |
Thanks for the examples. I also think that taking realistic example measurements into account is very helpful. I would also prefer avoiding traversing an array of similar units to find the source of interest. I like your initial example where you differentiate measurements; some measurements need more specification (like anything below soil surface), while others don't (like ambient temperature). My suggestions would be:
Example: {
"readings": [
{
"logicalTime": 1,
"ambientTemperature": 20.2, // Celsius
"soil": {
"depth": 15, // centimeters down
"moisture": 15.5, // percentage?
"temperature": 9.4 // Celsius
}
},
{
"logicalTime": 1,
"soil": {
"depth": 25, // centimeters down
"moisture": 10.9, // percentage?
"temperature": 3.1 // Celsius
}
}
]
} Accessibility is not that bad:
What do you think? |
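As a sketch of how accessible the grouped format above could be for a consumer, the Go snippet below decodes a `readings` array with an optional `soil` group and picks out soil temperatures at one depth. The type and function names are invented for this example, and the sample uses plain JSON (no comments, numeric values) because `encoding/json` rejects comments.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Soil groups the measurements that share one depth (centimeters down).
type Soil struct {
	Depth       float64  `json:"depth"`
	Moisture    *float64 `json:"moisture,omitempty"`
	Temperature *float64 `json:"temperature,omitempty"`
}

// Reading is one entry of the readings array; absent groups decode to nil.
type Reading struct {
	LogicalTime        int      `json:"logicalTime"`
	AmbientTemperature *float64 `json:"ambientTemperature,omitempty"`
	Soil               *Soil    `json:"soil,omitempty"`
}

// Payload is the top-level object with the readings array.
type Payload struct {
	Readings []Reading `json:"readings"`
}

// decodePayload parses the grouped format.
func decodePayload(raw string) (Payload, error) {
	var p Payload
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

// soilTemperaturesAt collects the soil temperatures reported at a given depth.
func soilTemperaturesAt(p Payload, depthCM float64) []float64 {
	var out []float64
	for _, r := range p.Readings {
		if r.Soil != nil && r.Soil.Depth == depthCM && r.Soil.Temperature != nil {
			out = append(out, *r.Soil.Temperature)
		}
	}
	return out
}

func main() {
	raw := `{"readings":[
		{"logicalTime":1,"ambientTemperature":20.2,"soil":{"depth":15,"moisture":15.5,"temperature":9.4}},
		{"logicalTime":1,"soil":{"depth":25,"moisture":10.9,"temperature":3.1}}]}`
	p, err := decodePayload(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(soilTemperaturesAt(p, 25))
}
```

Compared with the earlier array-per-quantity format, only one loop is needed, and the depth is a number rather than a string label.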
Overall, I agree with your modifications. I like what you did to avoid the second array, grouping related measurements and spreading across different readings if necessary. Here are some comments:
What if it is? Is that situation what you are referring to later in point 4? I think that's the easiest solution, spreading the measurements in the already needed array.
I don't quite get that, what other upstream entity could know about the relative to absolute time conversion apart from the decoder/manufacturer? |
Yes indeed. So we won't forbid In case of soil, there must be a specifier like depth. But in case of ambient temperature or wind, I don't think that one device would measure two different readings with distinct sensors. But ok, it can, and we allow for it if we stick to this format.
What I meant is that some devices can be remotely configurable (via downlink) with a measurement interval. Like every hour or every 2 hours. The payload may not contain the timestamps to save space. So the codec sees two groups of readings but doesn't know how far apart the readings are. For this, though, what we'd also like to do is add state to the codec context. It's a bit off topic here, but the idea is that every codec has access to the input state and can return the updated state. Think of it as a digital twin. The state may contain the current sensor configuration (as sent via downlink message and acked by the device, or as received many messages back when the device sent a status message). So maybe in the future the codec will be able to convert logical time into absolute time. Until we can provide that context here, we can't rely on absolute timestamps. |
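To illustrate the idea of converting logical time once state is available, here is a small Go sketch. Everything in it is hypothetical: the `CodecState` type, its `IntervalSeconds` field, and in particular the assumed meaning of `logicalTime` (a reading with logical time n was taken n intervals before the uplink was received). The thread does not define those semantics.

```go
package main

import (
	"fmt"
	"time"
)

// CodecState is a hypothetical slice of the per-device state ("digital twin").
type CodecState struct {
	IntervalSeconds int // measurement interval last acknowledged by the device
}

// absoluteTime assumes logicalTime counts intervals back from the receive time,
// so logicalTime 0 is the newest reading. This is an assumption of the sketch.
func absoluteTime(receivedAt time.Time, st CodecState, logicalTime int) time.Time {
	offset := time.Duration(logicalTime*st.IntervalSeconds) * time.Second
	return receivedAt.Add(-offset)
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	received := mustParse("2022-05-06T20:00:00Z")
	st := CodecState{IntervalSeconds: 3600} // e.g. configured via downlink
	for lt := 0; lt < 3; lt++ {
		fmt.Println(lt, absoluteTime(received, st, lt).Format(time.RFC3339))
	}
}
```

Without the interval in the state, the same function cannot be written, which is why the examples in this thread stick to logical time for now.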
Exactly, would be a less common case for sure, but you never know, one could set up a device to compare the precision of different sensors measuring the same quantity for example, so it's nice to be flexible here.
Aaah I see…so an integration could potentially configure the interval sending a downlink and then that integration will have the context necessary to make the relative time conversion. Right?
That's a bit advanced and out of the scope of this issue sure, might be convenient to track that in a different one. Otherwise, I think we are ready to start defining the JSON schema with the simplest quantities to start iterating on this. |
Yes. That is what we mean with upstream; north of Application Server. Uplink messages magically flow against gravity.
Yes. We'll triage TheThingsNetwork/lorawan-stack#5429 tomorrow morning CEST. The schema definition is pretty much decoupled from support in The Things Stack. I agree that we should start with the simplest quantities and iterate. We produce releases every 2 or 3 weeks so new fields are usable pretty quickly. |
@pablojimpas I'm picking this up now. I'm revisiting the example I shared above, and now I realize there's a discrepancy between If Something like this: {
"readings": [
{
"logicalTime": 1,
"ambientTemperature": 20.2, // Celsius
"soilDepth": 15, // centimeters down
"soilMoisture": 15.5, // percentage?
"soilTemperature": 9.4 // Celsius
},
{
"logicalTime": 1,
"soilDepth": 25, // centimeters down
"soilMoisture": 10.9, // percentage?
"soilTemperature": 3.1 // Celsius
}
]
} Do you have any progressive insight on this? In any case, we'll start making this work end-to-end with |
From an integration developer perspective, what's the difference between handling a non-existent I guess that yes, it will be simpler to avoid nesting as much as possible, and as long as we can express everything with the right naming convention, I'm fine with that.
Not really, but the more that I think about the array, the less that I like it…but if I recall correctly, we determined earlier that it was almost inevitable. |
True. One can argue that developers have to account for undefined values anyway, so nested objects would not make a difference. Then, should we put
I've got an idea to overcome this. I really want to encourage device makers to combine multiple readings in one LoRaWAN frame. This is just a really good practice. But I do get the issue on the other side: you'll end up with an array. What if we introduce a new message type that is sent by the Application Server, for each normalized payload? This would be a first class citizen in the message types: just like we have activations, uplink messages, downlink events, etc. So if there's one normalized payload measurement in the message, AS publishes one message. If there's an array with two items, AS publishes two messages. For the application developer, two individual uplink frames with one measurement will look exactly the same as one uplink frame with two measurements. The "full" uplink message will still carry the array of normalized payloads; there will be an extra, simpler message that is only published if there's normalized payload. |
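The fan-out described above can be sketched as follows. This is not The Things Stack code: `Uplink`, `NormalizedMessage` and `fanOut` are hypothetical names, and measurements are simplified to flat numeric fields.

```go
package main

import "fmt"

// Measurement is simplified to flat numeric fields for this sketch.
type Measurement map[string]float64

// Uplink is a hypothetical "full" uplink carrying an array of normalized measurements.
type Uplink struct {
	DeviceID   string
	Normalized []Measurement
}

// NormalizedMessage is the hypothetical new message type: exactly one measurement.
type NormalizedMessage struct {
	DeviceID    string
	Measurement Measurement
}

// fanOut publishes one message per normalized measurement in the uplink.
func fanOut(u Uplink) []NormalizedMessage {
	msgs := make([]NormalizedMessage, 0, len(u.Normalized))
	for _, m := range u.Normalized {
		msgs = append(msgs, NormalizedMessage{DeviceID: u.DeviceID, Measurement: m})
	}
	return msgs
}

func main() {
	u := Uplink{DeviceID: "dev-1", Normalized: []Measurement{
		{"ambientTemperature": 20.2},
		{"ambientTemperature": 20.3},
	}}
	for _, msg := range fanOut(u) {
		fmt.Println(msg.DeviceID, msg.Measurement)
	}
}
```

An application consuming `NormalizedMessage` values never sees an array, whether the device packed one or several readings into a frame.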
I think
This is actually a pretty clever solution to avoid arrays and achieve uniform messages, I think this can work beautifully. The only downside that I can foresee is that if an integration uses webhooks to redirect uplinks, for example, it will now have to be aware of both uplink events and this new type of event. |
Initially I thought of sending the uplink message multiple times indeed, but that can cause problems upstream. The plan is to add a new message type here, and we already encourage integration developers to specify a path for that message to keep things separate: I need to look into this a bit more to see if we can actually proceed with this, but it looks like we can. These are the current flattened output fields of all codecs that provide examples:
A few noticeable things here:
|
If we want to support min/max/avg/median/percentiles for temperature... What do we do? {
"air": {
"temperature": {
"current": 20.5,
"min": 19.2,
"max": 20.6
}
}
} Is this still developer friendly enough? Or {
"air": {
"temperature": 20.2,
"minTemperature": 19.2,
"maxTemperature": 20.6
}
} I like the former one personally. |
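For what it's worth, consuming the first (nested) variant could look roughly like this in Go. The type names are invented for the sketch, and only `current`, `min` and `max` are modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stats holds the statistical variants of one quantity; all are optional.
type Stats struct {
	Current *float64 `json:"current,omitempty"`
	Min     *float64 `json:"min,omitempty"`
	Max     *float64 `json:"max,omitempty"`
}

// Air groups the air-related quantities.
type Air struct {
	Temperature *Stats `json:"temperature,omitempty"`
}

// Measurement is the top-level normalized object.
type Measurement struct {
	Air *Air `json:"air,omitempty"`
}

// decodeMeasurement parses the nested variant.
func decodeMeasurement(raw string) (Measurement, error) {
	var m Measurement
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

// airTemperatureRange returns min and max when both are present.
func airTemperatureRange(m Measurement) (lo, hi float64, ok bool) {
	if m.Air == nil || m.Air.Temperature == nil {
		return 0, 0, false
	}
	t := m.Air.Temperature
	if t.Min == nil || t.Max == nil {
		return 0, 0, false
	}
	return *t.Min, *t.Max, true
}

func main() {
	m, err := decodeMeasurement(`{"air":{"temperature":{"current":20.5,"min":19.2,"max":20.6}}}`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(airTemperatureRange(m))
}
```

The nil checks are the cost of nesting; the flat variant would need the same checks per field, just without the intermediate levels.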
That's what I thought, just another thing to keep in mind.
The list looks like a terrifying mess and illustrates perfectly why this issue is important!
This will indeed be necessary, there might be a lot of use cases that produce uplinks that do not adhere strictly to “a quantity with some units in a defined context”, for example: open/closed status, events from computer vision recognition on the edge, periodic beacons with some device state… For now, though, I think we should focus on the easiest one to standardize, the physical quantities. Once we have everything in place, we can go for the more difficult ones to agree on.
From a developer perspective, I think that once you have to go one level deep to get the value with the validation required, there's no difference going one or more levels. I like the first one too, it's easier to glance over it. The second one would make more sense to me if we were trying to avoid nesting at all costs for developer ergonomics. I mean, having just one level with To get a first JSON schema, I think we should focus on mapping some fields from that vast list into a nice table similar to the one in the first comment of this issue. |
Yep, I think the first one is nicer too.
Yes, I think we should go in the direction of defining structures for things that can be controlled (valves, lights, doors) and things that cannot necessarily be measured in physical quantities but scores. But indeed, let's figure this out later. |
With #508 merged, the next step is to define a next batch of fields. The original comment is a great start. On one hand it's desirable to keep iterations big and avoid pushing lots of incremental support onto the device makers because they won't keep up with that. We also have to keep TTS and our documentation up-to-date so every schema addition comes with 3 public pull requests plus some TTI internal merges. On the other hand, we need to keep the pace, so big schema changes may take a long time to fully agree with and commit to. I would suggest going with the low hanging fruit, which is basically what's in the aforementioned comment here, and then incrementally add more stuff as device makers and application developers start embracing it with open arms and tears of joy in their eyes. |
@pablojimpas are you willing to spend some time on schema additions? |
Sure! Awesome work with your 3 PRs so far, I'm sure I can use those as a basis for implementing more measurements. If you don't mind, I'll start with those measurements that are more useful for agricultural use cases. Let's see if we can cover a good number of variables before TTC2022 so that this new format can be promoted there to gain adoption easily. I will start with a PR to include more air and soil quantities to |
Hi all. Device manufacturer here (KELLER Pressure). I am not sure if my input is welcome, but here are my two cents:
Here's a fabricated but possible example of a set of measurements from a LoRaWAN device equipped with some special pressure sensors:
"Counter input": There might be a 'rain catcher' or another device that counts impulses. The number is the count of impulses in a predefined time range. It is unit-less. It is not necessary to send "Pd" if the two input values ("P1","PBaro") are also sent. However, customers often want both. What would this look like with the normalizer?
|
Thanks for your input @cBashTN
What do you mean by differences? A delta w.r.t. a previous value? We can support that but only if we make things stateful. We have plans for that as well. I think the goal is to produce absolute values in normalized payload, even if the end device sends changes. Regarding Regarding unitless counters; this still means something in the domain, right? I mean, if it's raindrops, we can have In case you are working with auxiliary input, i.e. any external device that provides current and your device is sending the voltage level but doesn't really know what it is, then we also have to work with state. The idea is that we get some sort of installation state, per device, that is made available to the normalizer so it knows what the decoded payload means exactly. |
Right. We could devise some sort of flag indicating whether it is an absolute value or a delta, but that would add unnecessary complexity to integrators who want to benefit from the normalized payload. Since we already have to make the decoder/normalizer stateful to address the “installation state” issue, this rare case could also be implemented that way. In the case of deltas between two values present in the same payload, I don't think it makes sense to handle this in either the decoder or the normalizer. The normalized payload will contain the two relevant values (e.g,
From my experience with rain gauges that work with pulses, each pulse corresponds to some millilitres of water. The conversion has to be provided by the manufacturer and present in the decoder/normalizer to come up with something that makes sense in the domain. |
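The pulse conversion mentioned above is simple arithmetic; a sketch in Go follows. The `GaugeCalibration` type and the factor of 2.5 ml per pulse are made up for illustration, since the real factor has to come from the manufacturer.

```go
package main

import "fmt"

// GaugeCalibration holds the manufacturer-provided conversion factor.
type GaugeCalibration struct {
	MillilitresPerPulse float64
}

// rainVolumeML converts a raw, unit-less pulse count into millilitres of water.
func rainVolumeML(pulses uint32, c GaugeCalibration) float64 {
	return float64(pulses) * c.MillilitresPerPulse
}

func main() {
	c := GaugeCalibration{MillilitresPerPulse: 2.5} // illustrative value only
	fmt.Printf("%.1f ml\n", rainVolumeML(4, c))
}
```

With the factor kept inside the decoder/normalizer, the normalized payload can carry a quantity that means something in the domain instead of a bare counter.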
I have some questions about the current schema. Apologies if these are addressed above.
|
Good question. We support JSON Schema and it only supports
Hmm I don't understand these questions. Can you elaborate, maybe with an example? |
Summary
The main explanation of this issue has already been described in a previous one: #237
However, this one is about making a detailed data model specification for the sensors in devices.
Why do we need this?
There's value in going a step further than just giving best practices: actually enforcing a strict format has benefits.
For example, suppose that someone wants to create an app that scans the QR code of a LoRaWAN device, automatically registers it in The Things Stack and starts displaying data from the device's various sensors. Currently, it's impossible to build such an app in a manufacturer-agnostic way, completely decoupled from the device specifics.
Having a known data model/format will allow building integrations/apps using data coming from The Things Stack without caring about which device is providing that data or which units it uses, just caring about the capabilities/sensors that it has.
What is already there? What do you see now?
Previous related issue: #237
What is missing? What do you want to see?
A strict data model specification, its implementation in every current decoder, and its enforcement in new devices.
How do you propose to implement this?
Can you do this yourself and submit a Pull Request?
I can help to provide requirements for the spec and maybe migrate current decoders to comply with the format.
Work in progress normalized format