Avoid "wear out" of SD cards in an openHAB system

Posted by ads' corner on Sunday, 2019-01-13
Posted in [Ansible][Hardware][Linux][Openhab][Raspberry-Pi]

You might know the problem: the brand-new SD card in your system is super fast, but after using it for a good while, the card is dead. Unlike spinning disks, which usually fail gradually and show I/O errors for individual blocks, flash cards suffer from write wear. Each block survives only a limited number of writes, so blocks which are written more often will “wear out” and become unusable. Wear leveling, where the card supports it, spreads the writes across blocks, but that only delays the problem. More writes increase the risk. And a typical openHAB system writes all the time: every time an external status changes, it’s written to the event log. By default the syslog is written to disk as well, and then there is a myriad of systemd services writing status information into files.

For my first openHAB test installation I did not pay much attention to this problem. I did, however, make sure that I installed everything using Ansible, and not in manual steps. So when the first card died, I was able to spin up the system again in a matter of hours (Ansible took over after the base system was initialized). Now it was time to do something about the wear.

After spending some time on research, I decided on 3 steps:

  • Remove swapfiles: the Raspberry Pi has 1 GB of RAM, and does not swap
  • Update the systemd journal configuration: make storage volatile, and reduce logfile size
  • Move /var/tmp and /var/log to a RAM disk: these two directories are hit most by I/O writes

Remove swapfiles

This part is fairly easy: openHABian has a package preinstalled which I just had to remove.

- name: Remove packages
  apt:
    name: "{{ item }}"
    state: absent
    purge: yes
  with_items:
    - dphys-swapfile

At first I missed the purge: yes line, but when I later inspected the logfiles, I found out that merely uninstalling the package without purging it leaves the systemd service entry around, which then just throws more errors into the logfile.
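To double-check that no swap is active after the package is gone, a task like the following could be added. This is a minimal sketch and not part of the original playbook: it assumes the swapon tool from util-linux is available on the Pi, and the task names and the swap_status variable are made up for illustration.

```yaml
# Sketch only: list active swap areas without changing anything
- name: Check for active swap
  command: swapon --show --noheadings
  register: swap_status
  changed_when: false

# Empty output means that no swap area is in use
- name: Fail if swap is still active
  assert:
    that:
      - swap_status.stdout | trim == ''
    msg: "Swap is still active: {{ swap_status.stdout }}"
```

If the assert fails, the output of the first task shows which file or partition is still used for swapping.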

Update systemd journal configuration

By default, the journal is written to disk. That can be changed to “volatile”, then it is kept in memory. I also reduced the size of the log - I’m mostly only interested in the last few entries anyway.

- name: check if systemd is used
  stat:
    path: /etc/systemd/journald.conf
  register: journald_conf_exists

- name: Update /etc/systemd/journald.conf
  lineinfile:
    dest: /etc/systemd/journald.conf
    regexp: "{{ item.regexp }}"
    line: "{{ item.line }}"
    state: "{{ item.state }}"
    create: yes
  with_items:
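    - { regexp: '^#? *Storage', line: 'Storage=volatile', state: present }
    - { regexp: '^#? *SystemMaxUse', line: 'SystemMaxUse=50M', state: present }
    - { regexp: '^#? *SystemMaxFileSize', line: 'SystemMaxFileSize=25M', state: present }
    # with Storage=volatile the journal lives in /run/log/journal,
    # where the Runtime* limits apply instead of the System* ones
    - { regexp: '^#? *RuntimeMaxUse', line: 'RuntimeMaxUse=50M', state: present }
    - { regexp: '^#? *RuntimeMaxFileSize', line: 'RuntimeMaxFileSize=25M', state: present }
  when:
    - journald_conf_exists.stat.exists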
  notify:
    - restart systemd-journald

And the handler:

  handlers:
    - name: restart systemd-journald
      service:
        name: systemd-journald
        state: restarted
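
To confirm that the journal settings took effect, the size of the journal can be queried once the handler has run. A minimal sketch, assuming journalctl is in the PATH of the managed host; the task names and the journal_usage variable are hypothetical and not part of the original playbook.

```yaml
# Handlers normally run at the end of the play; run them now,
# so that the check below sees the restarted systemd-journald
- name: Run pending handlers
  meta: flush_handlers

# Read-only query, never reports a change
- name: Query the journal size
  command: journalctl --disk-usage
  register: journal_usage
  changed_when: false

- name: Show the journal size
  debug:
    var: journal_usage.stdout
```

The debug task only prints what journalctl reports; whether the number is acceptable is still a judgment call.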

Move logs to a RAM disk

This was the complicated part, for several reasons. First of all, if everything lives on a RAM disk, the logs are gone once the system is rebooted. So I had to decide whether I need persistent logs or not. Looking back at how I used the logs in the past, I found that again I’m mostly only interested in the most recent logs. And if something goes wrong with the system: I can just set up another one. I already proved that this works.

The second problem to take into account is the size of the RAM disk, or disks in my case. systemd is storing “stuff” in /var/tmp, so I wanted to move that onto a RAM disk. After some observation, I found that the directory never really gets big, so 10 MB should be sufficient.

The biggest problem is /var/log, which can grow quite a bit. In my case it stays around 20-30 MB, until logrotate jumps in and cleans up. This can probably be tuned more, but I’m not really interested in too much fine tuning there. I decided on a 50 MB RAM disk for /var/log.
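
The sizes above come from watching the directories for a while. One way to collect such numbers with Ansible is sketched below; it assumes that du is available on the host, and the task names and the dir_sizes variable are hypothetical and not part of the original playbook.

```yaml
# Sketch only: report the current size of both directories in MB
- name: Measure the current directory sizes
  command: du -sm {{ item }}
  register: dir_sizes
  changed_when: false
  with_items:
    - /var/tmp
    - /var/log

# Each result holds one line of du output: size in MB, then the path
- name: Show the directory sizes
  debug:
    msg: "{{ item.stdout }}"
  with_items: "{{ dir_sizes.results }}"
```

Running this a few times over several days, ideally shortly before logrotate runs, gives a rough idea of the peak size to plan for.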

Let’s create the /etc/fstab entries:

- name: Update /etc/fstab
  lineinfile:
    dest: /etc/fstab
    line: "{{ item.line }}"
    state: "{{ item.state }}"
    create: yes
  with_items:
    - { line: 'tmpfs     /var/tmp        tmpfs   size=10M,nodev,nosuid,noatime,mode=1777     0  0', state: present }
    - { line: 'tmpfs     /var/log        tmpfs   size=50M,nodev,nosuid,noatime,mode=0755     0  0', state: present }
  notify:
    - cleanout old logs
    - restart system

When something is changed here, I just reboot to activate the changes. The handlers:

    - name: cleanout old logs
      shell: rm -f /var/log/*.gz /var/log/apt/*.gz /var/log/openhab2/events.log /var/log/openhab2/openhab.log /var/log/samba/*.gz /var/log/unattended-upgrades/*.gz
      args:
        warn: false

    - name: restart system
      shell: ( /bin/sleep 5 ; shutdown -r now "Ansible triggered" ) &
      async: 30
      poll: 0
      ignore_errors: true
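
One caveat of managing /etc/fstab with lineinfile and no regexp: if a size is changed later, a second line for the same mount point is appended instead of the old one being replaced. A possible alternative, sketched here under the assumption that the mount module is available (in newer Ansible releases it lives in the ansible.posix collection), manages the entries by mount point instead:

```yaml
# Sketch only: state "present" only writes the /etc/fstab entry,
# it does not mount anything, so the reboot handler is still needed
- name: Configure tmpfs mounts in /etc/fstab
  mount:
    path: "{{ item.path }}"
    src: tmpfs
    fstype: tmpfs
    opts: "{{ item.opts }}"
    state: present
  with_items:
    - { path: /var/tmp, opts: 'size=10M,nodev,nosuid,noatime,mode=1777' }
    - { path: /var/log, opts: 'size=50M,nodev,nosuid,noatime,mode=0755' }
  notify:
    - cleanout old logs
    - restart system
```

The options are the same as in the lineinfile version; only the way the entry is matched differs.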

When the system came back online, everything seemed to be working. At first sight. After checking the entire log since the reboot (the entries before that are no longer available), I found that a number of services reported problems, mainly because files and directories in /var/log were now missing.

To fix that problem, I wrote a quick systemd service, which is fired before affected services come up (the Before line in the unit file), but fired after the RAM disk is mounted (the After line in the unit file).

[Unit]
Description=Create logfile directory (on RAM fs)
After=var-log.mount
Before=openhab2.service samba.service nmbd.service smbd.service

[Service]
Type=oneshot
ExecStart=/bin/mkdir -p /var/log/openhab2
ExecStart=/bin/chown openhab:openhabian /var/log/openhab2
ExecStart=/bin/chmod 0775 /var/log/openhab2
ExecStart=/usr/bin/setfacl -m u::rwx,g::rwx,o::r-x /var/log/openhab2
ExecStart=/usr/bin/setfacl -d -m u::rwx,g::rwx,o::r-x /var/log/openhab2
ExecStart=/bin/mkdir -p /var/log/samba
ExecStart=/bin/chown root:adm /var/log/samba
ExecStart=/bin/chmod 0750 /var/log/samba
ExecStart=/bin/mkdir -p /var/log/samba/cores /var/log/samba/cores/smbd /var/log/samba/cores/nmbd
ExecStart=/bin/chown root:root /var/log/samba/cores /var/log/samba/cores/smbd /var/log/samba/cores/nmbd
ExecStart=/bin/chmod 0700 /var/log/samba/cores /var/log/samba/cores/smbd /var/log/samba/cores/nmbd
ExecStart=/bin/mkdir -p /var/log/lightdm
ExecStart=/bin/chown root:root /var/log/lightdm
ExecStart=/bin/chmod 0711 /var/log/lightdm
ExecStart=/bin/mkdir -p /var/log/sysstat
ExecStart=/bin/chown root:root /var/log/sysstat
ExecStart=/bin/chmod 0755 /var/log/sysstat
ExecStart=/bin/mkdir -p /var/log/apt
ExecStart=/bin/chown root:root /var/log/apt
ExecStart=/bin/chmod 0755 /var/log/apt
ExecStart=/bin/mkdir -p /var/log/unattended-upgrades
ExecStart=/bin/chown root:adm /var/log/unattended-upgrades
ExecStart=/bin/chmod 0750 /var/log/unattended-upgrades
ExecStart=/usr/bin/touch /var/log/lastlog
ExecStart=/bin/chown root:utmp /var/log/lastlog
ExecStart=/bin/chmod 0664 /var/log/lastlog
RemainAfterExit=true

[Install]
WantedBy=default.target

Last but not least, I need to upload and enable this service:

- name: Install logdir service
  copy:
    src: files/openhab-logdir.service
    dest: /etc/systemd/system/openhab-logdir.service
    owner: root
    group: root
    mode: 0664
  register: logdir_service

- name: Enable logdir service
  systemd:
    name: openhab-logdir.service
    # reload systemd, so that a new or changed unit file is picked up
    daemon_reload: yes
    enabled: yes
    state: started

- name: Restart logdir service
  service:
    name: openhab-logdir.service
    state: restarted
  when: logdir_service.changed

The name logdir is based on the first few lines, from when I was just trying to fix the openHAB log directory. More problems were found after that.
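
To catch a missing directory early, the playbook could verify the result after the service has run. A minimal sketch; the task names and the logdir_stat variable are hypothetical, and the list only repeats directories created by the unit file.

```yaml
# Sketch only: collect information about each expected directory
- name: Check the log directories
  stat:
    path: "{{ item }}"
  register: logdir_stat
  with_items:
    - /var/log/openhab2
    - /var/log/samba
    - /var/log/lightdm
    - /var/log/sysstat
    - /var/log/apt
    - /var/log/unattended-upgrades

# isdir is only set for existing paths, hence the default
- name: Fail if a log directory is missing
  assert:
    that:
      - item.stat.exists and (item.stat.isdir | default(false))
    msg: "{{ item.item }} is missing or not a directory"
  with_items: "{{ logdir_stat.results }}"
```

This only checks that the directories exist; owner and permissions would need additional conditions.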

Conclusion

All in all, this has been running fine for a few days already. I’m keeping an eye on the system, and at some point I need to integrate it into my monitoring as well.

