You can disconnect them all, then add back one function writing to InfluxDB at a time. This is a useful way to debug which InfluxDB write caused the error, or whether it is a cumulative error from dozens of separate writes.
As a last attempt, I reread the thread to check whether I had followed all the hints.
There was one remaining: set the influxd log level to 'debug'.
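For anyone following along: on a packaged 1.8 install the log level lives in /etc/influxdb/influxdb.conf. A minimal sketch of both ways to set it, assuming the stock systemd service name (the environment variable is just the documented override for the config file setting):

```bash
# Option 1: edit the config file and set
#   [logging]
#     level = "debug"
sudo nano /etc/influxdb/influxdb.conf

# Option 2: override via a systemd drop-in, adding
#   [Service]
#   Environment=INFLUXDB_LOGGING_LEVEL=debug
sudo systemctl edit influxdb

# Either way, restart to pick up the change
sudo systemctl restart influxdb
```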
I just did this and watched what happened. There is a problem with a .tsm file. The following entries repeat continuously in the log:
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.867776Z lvl=info msg="TSM compaction (start)" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 op_event=start
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.868965Z lvl=info msg="Beginning compaction" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_files_n=5
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.869817Z lvl=info msg="Compacting file" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_index=0 tsm1_file=/var/lib/influxdb/data/home/autogen/150/000000001-000000001.tsm
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.870638Z lvl=info msg="Compacting file" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_index=1 tsm1_file=/var/lib/influxdb/data/home/autogen/150/000000002-000000001.tsm
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.871440Z lvl=info msg="Compacting file" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_index=2 tsm1_file=/var/lib/influxdb/data/home/autogen/150/000000003-000000001.tsm
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.872261Z lvl=info msg="Compacting file" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_index=3 tsm1_file=/var/lib/influxdb/data/home/autogen/150/000000004-000000001.tsm
Oct 12 20:57:03 db1 influxd[5482]: ts=2021-10-12T18:57:03.873045Z lvl=info msg="Compacting file" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 tsm1_index=4 tsm1_file=/var/lib/influxdb/data/home/autogen/150/000000005-000000001.tsm
Oct 12 20:57:09 db1 influxd[5482]: ts=2021-10-12T18:57:09.380056Z lvl=info msg="Error replacing new TSM files" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 error="cannot allocate memory"
Oct 12 20:57:10 db1 influxd[5482]: ts=2021-10-12T18:57:10.380492Z lvl=info msg="TSM compaction (end)" log_id=0X9TZQ_G000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0X9TgcUl000 op_name=tsm1_compact_group db_shard_id=150 op_event=end op_elapsed=6512.731ms
I stopped InfluxDB.
Then I ran 'sudo influx_inspect verify -dir /var/lib/influxdb', but the five .tsm files seemed healthy.
Then I headed over to /var/lib/influxdb/data/home/autogen/150 and simply deleted all the files there, and the directory too. I don't care if I lose a few data points. Desperate times, desperate measures.
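For the record, the whole desperate-measures step boiled down to this (a sketch; shard id 150 and the home/autogen path come straight from the log above, and the service name assumes the stock package):

```bash
sudo systemctl stop influxdb
# shard 150 of the 'home' database, 'autogen' retention policy
sudo rm -r /var/lib/influxdb/data/home/autogen/150
sudo systemctl start influxdb
```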
I restarted InfluxDB; top and iotop look perfect.
Then I added all my data sources from within Node-RED in one go, just to see how it goes.
Result so far:
- the influxd process rarely shows up in 'top'
- 'sudo iotop' shows almost no write activity
- data shows up in Grafana as it should
- no more errors in the log about problems compacting a .tsm file
Therefore I set the log level back from debug to warn.
Let's declare this exercise a success and head for the pillow.
Many thanks for all the assistance I got. I learned a lot. Not being alone with a problem like this means a lot to me. I hope I can pay it forward at another opportunity.
Kind regards,
Urs.
It seems that this (and similar issues) is not uncommon when running on a 32-bit system, such as on a Pi. See, for example, Compaction crash loops and data loss on Raspberry Pi 3 B+ under minimal load · Issue #11339 · influxdata/influxdb · GitHub
It may well fail again when the db gets back to a significant size.
Unfortunately there doesn't seem to be a fix, and likely never will be, as there is no further development on 1.8 and apparently 2.0 doesn't (and won't) support 32-bit systems.
There is now, I think, a 64-bit OS version available for the Pi 4, though I haven't tried it; or I suppose one could install Ubuntu or Debian.
I now use an old laptop as my main server (running Ubuntu Server). Pretty much any old 64-bit laptop will handle the sort of loads you are talking about.
However, whatever you do, you should re-arrange the db schema, as the way you have it at the moment is a long way from ideal. Maybe best to carry on that discussion on your other thread?
The database being full of holes (due to sending multiple different field values to the measurement one at a time) may well be making the issue more likely to happen.
To clarify what I mean: the db1 measurement will have a row in it for each timestamp at which you write a value. So one row will have a timestamp and a root value, but nothing for root-usage and all the others. Then the next row may have a timestamp and a root-usage value, but none of the others. This is a Bad Thing for an InfluxDB database.
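To make that concrete with the raw write API (values made up; the field names and the home database come from this thread):

```bash
# Two separate writes -> two sparse rows in the db1 measurement
curl -XPOST 'http://localhost:8086/write?db=home' \
  --data-binary 'db1 root=12.3'
curl -XPOST 'http://localhost:8086/write?db=home' \
  --data-binary 'db1 root-usage=45.1'

# One combined write -> a single, fully populated row
curl -XPOST 'http://localhost:8086/write?db=home' \
  --data-binary 'db1 root=12.3,root-usage=45.1'
```

Whatever collects the values in Node-RED should batch all the fields for a given timestamp into one write, like the last example.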
Good news: with your hints I was able to fix this.
Bad news: with my beginner's DB schema I have a DB full of holes, so the problem might happen again.
Conclusion:
- short term: monitor the system, enable debug logging if this occurs again, and delete the offending .tsm files
- now: listen to the people on the other thread to arrange the data in a better way
- long term: update to InfluxDB 2.x on a 64-bit OS, once Raspbian for 64-bit is stable
And a nice screenshot of the load average after the fix, to end this thread:
Glad you got it sorted. Before you move on to InfluxDB 2.0, try to gain as much experience with 1.8 as you can first. The query language in version 2.0 has a very steep learning curve.
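To give a flavour of the difference (illustrative only, using the home database and db1 measurement from earlier in the thread):

```bash
# InfluxQL on 1.8
influx -database 'home' -execute \
  'SELECT mean("root") FROM "db1" WHERE time > now() - 1h'

# The rough Flux equivalent on 2.0
influx query 'from(bucket: "home")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "db1" and r._field == "root")
  |> mean()'
```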
You can't run 2.0 on a Pi, not with standard Raspbian anyway.
@Colin. Of course, I had forgotten that. I installed it on my converted Chromebook with Ubuntu Server!
You can install Ubuntu Server 64-bit on an RPi 3 or RPi 4 and then run 2.0.
Have you done that? Did the standard Ubuntu/Pi install script work?
Edit: I mean the Node-RED install script.
Can you access the GPIOs from Ubuntu Server?
[edit] More to the point, can you use the GPIOs from Node-RED?
@Colin Not yet. I will check it out when time allows and report back
@ghayne Sorry, we don't use GPIO from a Node-RED function. We currently use C and the pigpio library for the GPIO interface. I think the Node-RED GPIO interface is also based on the pigpio library wrapped with Node.js, so I would expect it to work.
Maybe use lgpio, as Ubuntu recommends?
I have just installed Ubuntu 21.04 64-bit on a Pi 3B. I then installed Node-RED with the standard script with no problems (no GPIO nodes); it's running fine.
That is good to know. It would be interesting for someone to try Raspbian 64 bit, or whatever they call it now. I haven't got a spare Pi to try it on at the moment.
I had just reclaimed the 3B after moving its databases to my server.
It looks like lgpio was written by the same author as pigpio.
pigpio was written specifically for the Pi, while lgpio was written for general Linux SBCs.
Here are the links to them:
http://abyz.me.uk/rpi/pigpio/index.html
http://abyz.me.uk/lg/index.html
Well, you are fast. I want to see the speed of InfluxDB on a 3B with the 64-bit Ubuntu OS.
I just installed InfluxDB 2.0. Don't expect an answer soon!
I am steering clear of 2.0 for a while. Watching the influx forum there seem to be ongoing issues for the moment.
I'm just giving the whole thing a try; I don't intend to actually use it!
It's autumn - time for dabbling again