Reboot the Raspberry Pi on network failures (brcmfmac: brcmf_cfg80211_scan: scan error -110)

Posted by ads' corner on Wednesday, 2020-01-29
Posted in [Ansible][Linux][Raspberry-Pi][Software][Stupid]

In one of my earlier blog posts I reported that HABpanel occasionally disconnects from the server. Turns out it’s not HABpanel: the Pi itself is causing the trouble. Part of the reason it took me so long to investigate is that the display is in the kitchen, and someone had to look at it and spot the small red error message. To work around that, I added the device to the network monitoring and had an alarm triggered whenever the device is not reachable. Sure enough, that happens occasionally.

Because I moved /var/log to a small RAM disk to avoid wearing out the SD card, all logs are lost once the device is rebooted. I had to bring a keyboard and mouse to the kitchen in order to save the logfiles once the device was no longer reachable over the network.

Turns out it’s a well-known problem with the network chip on the Raspberry Pi. One of the GitHub issues has been open since 2018, the other one since 2019. So no hope for a real quick fix :-(


This is the error you will find in /var/log/syslog when the network driver decides to stop working properly:

Jan 28 23:51:04 kitchen-display kernel: [71414.560944] brcmfmac: _brcmf_set_multicast_list: Setting mcast_list failed, -110
Jan 28 23:51:07 kitchen-display kernel: [71417.120970] brcmfmac: _brcmf_set_multicast_list: Setting allmulti failed, -110
Jan 28 23:51:09 kitchen-display kernel: [71419.680888] brcmfmac: brcmf_run_escan: error (-110)
Jan 28 23:51:09 kitchen-display kernel: [71419.680906] brcmfmac: brcmf_cfg80211_scan: scan error (-110)
Jan 28 23:51:12 kitchen-display kernel: [71422.240830] brcmfmac: _brcmf_set_multicast_list: Setting BRCMF_C_SET_PROMISC failed, -110
Jan 28 23:52:08 kitchen-display kernel: [71477.919786] brcmfmac: brcmf_run_escan: error (-110)
Jan 28 23:52:08 kitchen-display kernel: [71477.919802] brcmfmac: brcmf_cfg80211_scan: scan error (-110)

Based on all the discussions in the GitHub issues, there is no workaround which works all the time. Sometimes unloading and loading the kernel module helps, but occasionally even that is not enough. A reboot is required. Ok, I don’t like it, at all. But even more I don’t like a non-working display which is supposed to work unattended.
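For reference, the module reload mentioned in those discussions amounts to something like the sketch below. The function name `reload_wifi_module` is made up for this illustration, it has to run as root, and as said above it does not always bring the interface back, which is why the rest of this post goes straight for the reboot.

```shell
#!/bin/bash

# Unload and reload the Broadcom WiFi driver.
# Hypothetical helper, not part of the setup described in this post.
reload_wifi_module() {
    # remove the module (and any dependencies which are no longer in use)
    modprobe -r brcmfmac
    # load it again; the driver re-initializes the chip
    modprobe brcmfmac
}
```

Calling `reload_wifi_module` before the ping check would be one way to try the cheaper fix first.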

In order to see if the Pi needs to be rebooted, I wrote a small shell script:

#!/bin/bash

set -e

uptime=`/bin/cat /proc/uptime | cut -f1 -d' ' | awk '{print int($0)}'`

if [ "$uptime" -lt 1200 ];
then
    # uptime is not high enough
    exit 0
fi

set +e

# try to ping the gateway
ping -c 5 192.168.0.1 > /dev/null 2>&1
rc=$?

if [ "$rc" -gt 0 ];
then
    # can't reach the gateway, reboot the Pi
    /sbin/reboot
fi

exit 0

This script first checks the uptime (if someone has a better way to do that without piping three commands together, please let me know), and only continues if the Pi has been up for at least 20 minutes. This avoids reboot cycles where the script triggers a reboot right after the Pi came up and has not yet started the network, or where the network is temporarily not available (router reboot etc.).
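On the question of a shorter uptime check: one option, sketched here under the assumption that the script keeps running under bash, is to let the shell's `read` builtin pull the first field out of /proc/uptime and cut off the fraction with parameter expansion. The function name `uptime_seconds` and its optional file argument are invented for this example.

```shell
#!/bin/bash

# Print the uptime in whole seconds without spawning cat, cut or awk.
# The optional argument (default: /proc/uptime) only exists to make testing easier.
uptime_seconds() {
    local file="${1:-/proc/uptime}"
    local raw _idle
    # /proc/uptime holds two numbers: uptime and idle time, both in seconds
    read -r raw _idle < "$file"
    # drop everything from the decimal point on, e.g. 71414.56 becomes 71414
    printf '%s\n' "${raw%%.*}"
}
```

In the script, `uptime=$(uptime_seconds)` would then take the place of the three-command pipeline.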

After the uptime check has passed, the script pings the gateway, and reboots if the ping command comes back with an error. It might be a good idea to check for a specific return code: I know that ping returns 2 when the network driver fails and the interface is down. I will see how this works once the display is back in the kitchen and left unattended. The network monitoring will report any downtimes.
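Reacting only to a specific return code could look roughly like the sketch below. It assumes the iputils `ping`, which exits with 1 when no replies arrive and with 2 on other errors. The function name `should_reboot` is invented for this example, and whether return code 1 should trigger a reboot as well is a judgment call.

```shell
#!/bin/bash

# Decide whether a ping exit code should trigger a reboot.
# Assumption: iputils ping (0 = replies received, 1 = no reply, 2 = other error,
# the code seen when the driver fails and the interface is down).
should_reboot() {
    local rc="$1"
    case "$rc" in
        2) return 0 ;;   # interface down or driver failure: reboot
        *) return 1 ;;   # success or plain packet loss: leave the Pi alone
    esac
}
```

The check in the script would then become `if should_reboot "$rc"; then /sbin/reboot; fi`.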

Next is the systemd timer:

[Timer]
OnCalendar=*-*-* *:0/5:00
OnBootSec=60
Persistent=false
Unit=reboot-on-network-failure.service

[Install]
WantedBy=timers.target

This runs the service every 5 minutes. The service itself just runs the above script:

[Unit]
Description=Reboot on network failure

[Service]
Type=oneshot
ExecStart=/bin/bash /root/reboot-on-network-failure.sh
TimeoutStopSec=30
KillMode=none
RemainAfterExit=no
User=root
Group=root

Yes, with systemd you actually need two files where cron can do this in a single line. Don’t know why people keep pretending that systemd is so much better.

And finally, all of this needs to be installed on the Pi. The following tasks in my Ansible playbook take care of this:

- name: Copy network failure script
  copy:
    src: "{{ playbook_dir }}/files/{{ item }}"
    dest: "/root/{{ item }}"
    owner: root
    group: root
    mode: 0700
  loop:
    - reboot-on-network-failure.sh

- name: Copy network failure systemd units and timers
  copy:
    src: "{{ playbook_dir }}/files/{{ item }}"
    dest: "/etc/systemd/system/{{ item }}"
    owner: root
    group: root
    mode: 0644
  loop:
    - reboot-on-network-failure.service
    - reboot-on-network-failure.timer
  register: reboot_on_network_failure_systemd

- name: Register network failure service
  systemd:
    daemon_reload: yes
    name: reboot-on-network-failure.timer
    enabled: yes
    state: restarted
  when: reboot_on_network_failure_systemd.changed
