Data entry into influxdb

Hello experts
I'm using node-red to fetch sensor data from an MQTT broker. I modify the data and send it again to the broker, to the global context and some of it to influxdb.
I use this data model on the broker:
msg.topic: data/'place'/'value-type'
msg.payload: 'value'

'place' : cellar, woodshed, office, printer1, ...
'value-type' : Temp, Hum, Light, Door, Yellow, ...

The sensor data comes from various sensors/data-sources and is then reworked with node-red according the above data model.

To send data to influxdb, I take the above message and rearrange it according to a beginners guide from the net:
msg.measurement : 'place'
msg.payload: {'value-type' : 'value'}

This means that I create an entry in influxdb for each data point. I heard this is not optimal. Are there any hints on how I can improve this, if it makes sense at all?

  • group messages for the same 'place' / msg.measurement
  • use another data model for influxdb with tags (I have absolutely no clue what tags are, btw)
  • other hints?

Kind regards,
Urs.

Generally data which are collected in one operation (and so have the same timestamp) should be written to a single measurement. If data are collected in multiple operations (and so have different timestamps) then they should go to separate measurements.

So, for example, if you have a temperature and humidity sensor that gives you both in one operation then put them both in one measurement, otherwise put them in different measurements.

1 Like

I see what you mean. I will have to think about it. I would have to recombine the messages after I have split and normalised the data. The msg.parts fields are still there.

It might be better, and almost certainly simpler, to send it straight to influx at the point of measurement input rather than after splitting it up.

On the other hand it may be of no consequence which way you do it. How many values are there and how often are you writing them? If you did merge them where appropriate how many measurements would there then be?

Many thanks for taking my hand and leading me forward.

In the current setup there are about 40-60 messages per minute going into the influxdb. I placed a speed node in front of the DB to get a feel about it:

All messages are a single value, value-type, place triplet. If a data source sends several values in one batch, I split them up, reform them into the standardised value-types and places.

How many different process variables (measurements) is that?

What hardware are you running on. If it is a pi then where is the database (SD card USB SSD, etc)?

Do you mean value, field name and measurement name?

Edit: or possibly value, tag value and measurement name, or even value, tag value, and another tag value?

I watched the webinar from influxdb about Schema Design for IoT
Should have done that before starting to send data into my influxdb.

Main learning was about what to put into tags.
So far I did not use those, but used measurement instead.
Anyway, now I have some material to rethink my data schema.

Here is a table and an explanation on what measurement data flows through my MQTT broker. This data is collected from various sensors and data sources, then normalised and then sent to the broker.

Data description / Remarks Place Value-Type Unit Data type Example
power usage of the household in W power now W Int 250
total power usage of the household in kWh up to now power today kWh Int 1687
current solar power production solar now W Int 132
total solar power production up to now today solar today KwH Int 5634
currently exported power into the grid grid exportNow W Int 0
currently imported power from the grid grid importNow W Int 118
temperature in the home office office Temp Int 14
humidity in the home office office Hum % Int 61
relay status of the sonoffth10 device office Relay1 on/off String off
temperature in the storage room storage Temp Int 14
humidity in the storage room storage Hum % Int 61
relay status of the sonoffth10 device storage Relay1 on/off String off
temperature in the cellar cellar Temp Int 15
humidity in the cellar cellar Hum % Int 98
relay status of the sonoffth10 device cellar Relay1 on/off String off
temperature in the living room 2nd floor living Temp Int 21
humidity in the kiving room 2nd floor living Hum % Int 62
relay status of the sonoffth10 device living Relay1 on/off String off
temperature in the parents sleeping room parents Temp Int 18
humidity in the partens sleeping room parents Hum % Int 64
relay status of the sonoffth10 device parents Relay1 on/off String off
temperature in the woodshep ( equals outside temp) woodshed Temp Int 12
humidity in the woodshep ( equals outside humidity) woodshed Hum % Int 85
relay status of the sonoffth10 device woodshed Relay1 on/off String off
open if the wardrobe doors are open wardrobe Doors open/closed String closed
on if the wardrobe lights are on wardrobe lights on/off String off
1 if the wardrobe lights are on wardrobe dlights 1/0 Int 0
Distance from the top of the tank to the water level tank Dist cm Int 124
Volume of water in the tank tank Volume m3 float 27.4
on if pump is running tank pump on/off String off
1 if pump is running tank dpump 1/0 Int 0
power consumption of freezer2 freezer2 Power A float 0.3
on if freezer2 is connected to power freezer2 Relay1 on/off String on
disk partition used db1 root Byte Int 45789
disk partition usage in % db1 root-usage % float 3.5
disk partition used db1 boot Byte Int 65784
disk partition usage in % db1 boot-usage % float 5.1
disk partition used db1 run Byte Int 2345675
disk partition usage in % db1 run-usage % float 15.2
load average (as seen in top) db1 load1 1s load av. float 0.7
load average (as seen in top) db1 load5 5s load av. float 0.3
load average (as seen in top) db1 load15 15s load av. float 0.1
used memory for all processes db1 memoryTotal Byte Int 3967456
free memory db1 memoryFree Byte Int 1324546
memory usage in % db1 memoryUsage % Int 66

msg.topic = data/'Place'/'Value-Type'
msg.payload = 'Example'

Now the question is, how do I change this data format into a influxdb compatible structure.
In the next post I send a first idea.

Not everything which is sent through the MQTT broker needs to go into the influxdb. I only send data where I want to know the history, or the development over time.
All current values go into global context, so for controlling actors, I fetch data from there.
I never need to fetch data from influxdb into node-red. Only grafana uses data from influxdb.

There has been a discussion about my way of doing it which creates holes in the database. I could aggregate messages with the same measurement and the same tags.

Here is a proposed idea on how I could use measurement/tag/field of influxdb to structure the data. I'm very interested in your comments to learn, how I can improve that.

proposed data format for influxdb msg for node-red-contrib-influxdb
measurement tag field msg.measurement msg.payload
Energy source = power now Energy [ { source = power } , { now = 250 } ]
Energy source = solar now Energy [ { source = solar } , { now = 132 } ]
Energy source = grid exportNow Energy [ { source = grid } , { exportNow = 0 } ]
Energy source = grid importNow Energy [ { source = grid } , { importNow = 118 } ]
Rooms source = office Temp Rooms [ { source = office } , { Temp = 14 } ]
Rooms source = office Hum Rooms [ { source = office } , { Hum = 61 } ]
Rooms source = storage Temp Rooms [ { source = storage } , { Temp = 14 } ]
Rooms source = storage Hum Rooms [ { source = storage } , { Hum = 61 } ]
Rooms source = cellar Temp Rooms [ { source = cellar } , { Temp = 15 } ]
Rooms source = cellar Hum Rooms [ { source = cellar } , { Hum = 98 } ]
Rooms source = living Temp Rooms [ { source = living } , { Temp = 21 } ]
Rooms source = living Hum Rooms [ { source = living } , { Hum = 62 } ]
Rooms source = parents Temp Rooms [ { source = parents } , { Temp = 18 } ]
Rooms source = parents Hum Rooms [ { source = parents } , { Hum = 64 } ]
Rooms source = woodshed Temp Rooms [ { source = woodshed } , { Temp = 12 } ]
Rooms source = woodshed Hum Rooms [ { source = woodshed } , { Hum = 85 } ]
Water source = tank Dist Water [ { source = tank } , { Dist = 124 } ]
Water source = tank Volume Water [ { source = tank } , { Volume = 27.4 } ]
Water source = tank dpump Water [ { source = tank } , { dpump = 0 } ]
Energy source = freezer2 Power Energy [ { source = freezer2 } , { Power = 0.3 } ]
Systems source = db1 root Systems [ { source = db1 } , { root = 45789 } ]
Systems source = db1 root-usage Systems [ { source = db1 } , { root-usage = 3.5 } ]
Systems source = db1 boot Systems [ { source = db1 } , { boot = 65784 } ]
Systems source = db1 boot-usage Systems [ { source = db1 } , { boot-usage = 5.1 } ]
Systems source = db1 run Systems [ { source = db1 } , { run = 2345675 } ]
Systems source = db1 run-usage Systems [ { source = db1 } , { run-usage = 15.2 } ]
Systems source = db1 load1 Systems [ { source = db1 } , { load1 = 0.7 } ]
Systems source = db1 load5 Systems [ { source = db1 } , { load5 = 0.3 } ]
Systems source = db1 load15 Systems [ { source = db1 } , { load15 = 0.1 } ]
Systems source = db1 memoryTotal Systems [ { source = db1 } , { memoryTotal = 3967456 } ]
Systems source = db1 memoryFree Systems [ { source = db1 } , { memoryFree = 1324546 } ]
Systems source = db1 memoryUsage Systems [ { source = db1 } , { memoryUsage = 66 } ]

(and yes, it is { "source" = "db1"} etc, but this forum has eaten up all my "-characters.

Would it not be much simpler to send the data to influx at the point it is first available, before you split it into multiple topics.

For data containing such characters use the same method as posting logs and flows, using triple backticks. Then the forum will not mess with it.

I do have to bring the incoming data into shape, add some context, before I drop it into influxdb. But I know how to do this. So yes, Colin, I promise, I will do it, really, honestly.

But there is another topic I need to figure out first:

  1. I understood that influx-measurement is really means what it says: one measurement set of data from one sensor.
  2. I understood too that fields are used to describe values of a measurement of a sensor.
  3. What I'm not sure at all is on how to use the influx tags.

In my previous message I showed one way I figured I could do it. I'm really interested in the opinions of those who have done this.

Not necessarily one sensor, but a set of values that are available at the same instant.

A field represents a value of some sort, the field name tells you what it is. Temperature, humidity etc are typical field names.

Tags are different from fields in that they are not process values, but are strings (usually) that identify something about the set of values in that measurement. A typical use is to identify location. So if you were measuring temperature and humidity in a set of rooms then the fields might be temperature and humidity, a tag name might be 'room' and the tag values would be kitchen, bedroom etc. So records in the database would be something like

time       temperature     humidity      room
-----      ------------   ----------   --------
....        20.5           65            "kitchen"
....        25.3           68            "bedroom"

In the query to access the database one can then say things like
select * from the_measurement where room="kitchen"
to get all the kitchen values.
In grafana one can even have automatically repeating panels where you just have to define the panel for one room and then it will automatically show the panels for all the other rooms.
Tags are indexed in the database which means one can select on tag values very efficiently.

Note that all the records in the measurement have the same fields, if the fields are different then they should be in different measurements. That means your table above has some problems. For example you seem to have three different fields in the Energy measurement, now, exportNow and importNow. However, if those are all energy measurements then perhaps you just need a field called value and a tag called source. Alternatively if all those values are read at the same time then put it in three fields in one measurement, with field names power, solar and grid, for example, with no need for tags at all.

3 Likes

Many thanks. This is exactly what I was looking for.
I will use this advise to redraw the data schema. This is the part where my head hurts the most.
And after that there will be time for drawing wires with node-red. This is the fun part.

Hi @Colin, sorry to enter this topic at such a late stage. I am asking you directly because you appear to have a better handle on infuxdb schema than most. I have spent a couple of days reviewing lots of posts re. schema and find myself a little confused over table (measurement) organisation and wondered if you or any one else can clarify the subject of schema design a little more. I have been using Influxdb + telegraf + grafana for a while but would like to get it right this time round.

I am trying to redesign from scratch a suitable influxdb schema for DIY home IoT implementation.

Given a list of all types of measurement (temp, humidity, battery, light level, light colour, light status, occupancy, barometric pressure, water level, device_id, location, hostname, CPU_load, memory etc. etc.). What would be the ideal schema for a performant and flexible layout ?

The sort of things I am not sure of are:

  1. How many tables (measurements) ?

  2. Which items should be tags ?

  3. Which should be fields ?

Also it would be interesting/instructive to see the data structure used to create the default system monitoring dashboard that is automatically generated by grafana.

P.S. I tried to stay awake through all of the influxdb video [Training] InfluxDB Performance Tuning and Schema Design - YouTube but it lacked enough real IoT schema examples for me to make any more progress. One thing they say is design your queries first which is OK if you know in advance all the queries you may need. Unfortunately I cannot envisage all of these up front.

I think all or those are covered in the post you replied to, and in previous posts in the thread.

Please read my posts again carefully, then if necessary come back and ask for clarification.
Note that I am not an expert in influx, this is just my take on it.

Hi @Colin thank you for your response. I found more replies from yourself and just assumed that you were an expert on this.

I have read many posts including this one but still have difficulty understanding when to create multiple measurements for example.

Also some where a point was raised against putting the device_id as a tag because the time series is more about continuous data analysis and should not be interrupted by a change in device_id when the most important thing is the temperature at that location rather than which device originated it.

Where are the lines drawn between which items should have their own measurement and which items should share the same measurement ?

Clearly one would not be measuring barometric pressure in every room but care should be taken not to have a table full of null values.

Can you offer any further help or insight other that go back and read it all again ?

Best regards, oz

If you mean the id of the device measuring the temperature, then it depends what you want to measure. If you want to measure the temperature of a device wherever you put it (a portable device maybe) then use a tag defining the device. If you want to measure the temperature of a room then use the room name as the tag, then if the device breaks and you replace it you will still be measuring the bedroom (or wherever) temperature

As said in earlier posts I generally decide this based on whether the data are available at the same instant. So if you have a sensor measuring room humidity and another measuring room temperature then put them in separate Measurements, but if you have one sensor measuring measuring both then put them in one measurement. However, if you only wanted to store humidity every 10 mins but temperature every 1 minute then they would obviously go in separate measurements.

Exactly. That should be in a separate measurement.

I recommend starting by working out all the fields and tags, then define the Measurements.

Hi @Colin, thank you very much for your response.

Can I clarify that when you use the term 'measurement' you are referring to a separate influxdb table (as referenced by the FROM field in a grafana query ?

And if so does this mean that a typical home IoT system would likely have lots of measurements like 'humidity', 'power', 'barometric pressure', 'host_info', 'lighting', 'heating' etc. as a direct preference to having ANY null values ?

Finally (I hope), which is the preferred choice:

  1. Create a separate 'measurement' for humidity and temperature (because the sampling periods are different) or

  2. Align the sampling periods and use the same measurement or

  3. Cache all data in global/flow and write to influx in one go (assuming more rapidly sampled data is just for the ui and that these intermediate values can be lost).

Is all of this a kind of db schema normalisation as found in conventional 3rd normal form relational databases ?

I mean Influx Measurements. Influx doesn't have tables, but Measurements are similar to tables in, for example, mysql.

For 1, 2 or 3 it is up to you. Option 2 is just a subset of Option 3. Caching will make db access a little more efficient, at the expense of complexity (and hence bug likelihood) in node-red. Personally I would generally go for the simplest, which is option 1.

Remember that you cannot cache across tags, so if you have a tag for room, then you would have to cache the data separately for each room. Caching can get complex and I don't like complexity where it is not required.

If you are running on any sensibly powered machine then home automation is a trivial task for influx. It is designed for handling vast amounts of data at great speed. There is something in the docs somewhere that talks about the power of machine you need for various tasks, and a home automation system is way off the bottom of the lowest task that they talk about. The result is that it doesn't really matter how you organise the data, it won't make any difference to performance.
Don't try and run it with an SD card as the storage though.