In one of my earlier blog posts I reported that the HABpanel occasionally disconnects from the server. It turns out it’s not HABpanel but the Pi itself which is causing the trouble. Part of the reason it took me so long to investigate is that the display is in the kitchen, and someone had to look at it and spot the small red error message. To work around that, I added the device to the network monitoring and had an alarm triggered whenever the device is not reachable. Sure enough, that happens occasionally.
Because I moved /var/log to a small RAM disk to avoid wearing out the SD card, all logs are lost once the device is rebooted. I had to bring keyboard and mouse to the kitchen in order to save the logfiles once the device was no longer reachable over the network.
This is the error you will find in /var/log/syslog when the network driver decides to stop working properly:
Jan 28 23:51:04 kitchen-display kernel: [71414.560944] brcmfmac: _brcmf_set_multicast_list: Setting mcast_list failed, -110
Jan 28 23:51:07 kitchen-display kernel: [71417.120970] brcmfmac: _brcmf_set_multicast_list: Setting allmulti failed, -110
Jan 28 23:51:09 kitchen-display kernel: [71419.680888] brcmfmac: brcmf_run_escan: error (-110)
Jan 28 23:51:09 kitchen-display kernel: [71419.680906] brcmfmac: brcmf_cfg80211_scan: scan error (-110)
Jan 28 23:51:12 kitchen-display kernel: [71422.240830] brcmfmac: _brcmf_set_multicast_list: Setting BRCMF_C_SET_PROMISC failed, -110
Jan 28 23:52:08 kitchen-display kernel: [71477.919786] brcmfmac: brcmf_run_escan: error (-110)
Jan 28 23:52:08 kitchen-display kernel: [71477.919802] brcmfmac: brcmf_cfg80211_scan: scan error (-110)
Based on all the discussions in the GitHub issues, there is no workaround which works all the time. Sometimes unloading and loading the kernel module helps, but occasionally even that is not enough. A reboot is required. Ok, I don’t like it, at all. But even more I don’t like a non-working display which is supposed to work unattended.
In order to see if the Pi needs to be rebooted, I wrote a small shell script:
This script first checks the uptime (if someone has a better way to do that without piping three commands together, please let me know), and only continues if the Pi has already been up for at least 20 minutes. This avoids reboot cycles where the script triggers a reboot right after the Pi came up and has not yet started the network, or a reboot cycle in case the network is temporarily unavailable (router reboot, etc.).
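On the question of a pipeline-free uptime check: on Linux, /proc/uptime already holds the uptime in seconds, so bash can read it directly. The following is a minimal sketch under that assumption, not the script from this post; the function names and the MIN_UPTIME variable are made up for illustration.

```shell
#!/bin/bash
# Sketch only, not the original script: uptime check without a pipeline.

MIN_UPTIME=1200   # 20 minutes in seconds (threshold taken from the text above)

# Print whole seconds since boot. /proc/uptime holds two numbers, for example
# "71414.56 280113.22"; the first one is the uptime in seconds.
# The optional file argument only exists to make the function testable.
uptime_seconds() {
    local up _idle
    read -r up _idle < "${1:-/proc/uptime}"
    echo "${up%%.*}"   # strip the fractional part
}

# Succeed if the machine has been up for at least MIN_UPTIME seconds.
up_long_enough() {
    [ "$(uptime_seconds "$@")" -ge "$MIN_UPTIME" ]
}
```

In a script this could be used as `up_long_enough || exit 0` before any network check is attempted.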
After the uptime check has passed, the script pings the gateway and reboots if the ping command comes back with an error. It might be a good idea to check for a specific return code: I know that RC=2 when the network driver fails and the interface is down. I will see how this works once the display is back in the kitchen and left unattended. The network monitoring will report any downtime.
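Checking for a specific return code could look roughly like the sketch below. It is a stricter variant than the script described above, which reboots on any ping error. It assumes the iputils version of ping, which exits with 0 when replies arrive, 1 when none arrive, and 2 on other errors. The function names are hypothetical, and the default address is a documentation placeholder, not the real gateway.

```shell
#!/bin/bash
# Sketch only: decide on a reboot based on ping's exit status.

# Succeeds (returns 0) only when the exit code points at a local failure.
needs_reboot() {
    case "$1" in
        2) return 0 ;;   # ping itself failed, e.g. interface down
        *) return 1 ;;   # 0 = gateway answered, 1 = no reply (maybe a router reboot)
    esac
}

# Hypothetical wiring: ping the gateway three times, wait up to 5 s per reply.
# 192.0.2.1 is a placeholder address and must be replaced.
check_gateway() {
    local gateway=${1:-192.0.2.1}
    ping -c 3 -W 5 "$gateway" > /dev/null 2>&1
    needs_reboot "$?"
}
```

A caller would then run something like `check_gateway 192.168.1.1 && /sbin/reboot`, with the address adjusted to the local network. Whether ignoring exit code 1 is the right call depends on how the driver failure shows up in practice.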
Next is the systemd timer:
[Timer]
OnCalendar=*-*-* *:0/5:00
OnBootSec=60
Persistent=false
Unit=reboot-on-network-failure.service

[Install]
WantedBy=timers.target
This runs the service every 5 minutes. The service itself just runs the above script:
[Unit]
Description=Reboot on network failure

[Service]
Type=oneshot
ExecStart=/bin/bash /root/reboot-on-network-failure.sh
TimeoutStopSec=30
KillMode=none
RemainAfterExit=no
User=root
Group=root
Yes, with systemd you actually need two files where cron can do this in a single line. Don’t know why people keep pretending that systemd is so much better.
And finally, all of this needs to be installed on the Pi. The following Tasks in my Ansible Playbook take care of this: