Install Telegraf using Ansible

Posted by ads' corner on Wednesday, 2020-06-24
Posted in [Ansible][Influxdb][Linux]

I have an InfluxDB instance up and running in my network, and decided to monitor all devices (well, all where that is possible - the QNAP seems to be a problem). That’s quite easy to do by installing Telegraf as a server agent and adding some configuration. Everything is deployed using Ansible, so I can re-use the same Playbook for many devices.

Let’s assume I have a laptop which I want to monitor. The hostname is sunlight (it’s not, I’m just using this as an example).

InfluxDB

First I need to create a database and credentials in InfluxDB.

At this point I can decide if I:

  • have everything in a single database, or use a separate database for monitoring the devices - I decided to use a separate database home
  • make it easy to have one account for all devices, or use separate accounts per device - I decided to roll out separate accounts
  • just go with ALL permissions, or is it enough to have WRITE permission for the device account - if the database is already created, the account only needs WRITE

The following files are in the InfluxDB Playbook directory:

-r-------- 1 mon mon 13 Jun 17 17:46 credentials/influxdb-admin-password.txt
-r-------- 1 mon mon  6 Jun 17 17:45 credentials/influxdb-admin-username.txt

-r-------- 1 mon mon  5 Jun 23 00:44 credentials/influxdb-sunlight-dbname.txt
-r-------- 1 mon mon 13 Jun 23 00:43 credentials/influxdb-sunlight-password.txt
-r-------- 1 mon mon  8 Jun 23 00:44 credentials/influxdb-sunlight-username.txt

-r-------- 1 mon mon  5 Jun 23 00:44 credentials/influxdb-home-dbname.txt
-r-------- 1 mon mon 13 Jun 23 02:02 credentials/influxdb-home-password.txt
-r-------- 1 mon mon  8 Jun 23 00:44 credentials/influxdb-home-username.txt

credentials/influxdb-sunlight-dbname.txt and credentials/influxdb-home-dbname.txt hold the same database name: home, but I use different files in case I have to separate this in the future.

In InfluxDB I ensure that authentication is enabled:

- name: Update influxdb.conf
  ini_file:
    path: "/etc/influxdb/influxdb.conf"
    section: "{{ item.section }}"
    option: "{{ item.option }}"
    value: "{{ item.value }}"
    state: "{{ item.state }}"
  loop:
    - { section: "http", option: "auth-enabled", value: "true", state: present }
  notify:
    - restart influxdb
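
The task above notifies a handler named restart influxdb, which is not shown in this post. A minimal sketch of what such a handler could look like, assuming the service is simply called influxdb and the handler lives in the handlers section of the same Playbook:

```yaml
# Sketch of the handler referenced by "notify: restart influxdb".
# Assumption: the service on the InfluxDB host is named "influxdb".
handlers:
  - name: restart influxdb
    service:
      name: influxdb
      state: restarted
```

Handlers only run when a notifying task reports a change, so InfluxDB is restarted only when auth-enabled was actually modified.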

Create the home database:

- name: Create home database in InfluxDB
  influxdb_database:
    database_name: "{{ lookup('file', playbook_dir + '/credentials/influxdb-home-dbname.txt') }}"
    username: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-username.txt') }}"
    password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-password.txt') }}"

Create the sunlight device user:

- name: Create sunlight user in InfluxDB
  influxdb_user:
    user_name: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-username.txt') }}"
    user_password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-password.txt') }}"
    admin: no
    grants:
      - database: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-dbname.txt') }}"
        privilege: "WRITE"
    login_username: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-username.txt') }}"
    login_password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-password.txt') }}"
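
Since every device gets its own account, the same task can be looped over a list of device names. The following is only a sketch: it assumes that every device follows the credentials/influxdb-<device>-*.txt naming scheme shown above, and the list of devices is something you have to maintain yourself.

```yaml
# Sketch: one task for several devices, relying on the
# credentials/influxdb-<device>-*.txt naming scheme (an assumption
# for any device other than sunlight).
- name: Create device users in InfluxDB
  influxdb_user:
    user_name: "{{ lookup('file', playbook_dir + '/credentials/influxdb-' + item + '-username.txt') }}"
    user_password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-' + item + '-password.txt') }}"
    admin: no
    grants:
      - database: "{{ lookup('file', playbook_dir + '/credentials/influxdb-' + item + '-dbname.txt') }}"
        privilege: "WRITE"
    login_username: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-username.txt') }}"
    login_password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-admin-password.txt') }}"
  loop:
    - sunlight
    # further device names go here; each needs its own three credentials files
```

The file lookup fails the task if one of the credentials files is missing, which is a useful early warning when a new device is added to the list.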

That’s it for InfluxDB, and the server side. Now for the laptop.

Telegraf: Sunlight

In Ansible I outsource the configuration into a variable file, so I can re-use the same Playbook. Another way to do it is to have a Role, and define the variables all in your hosts.cfg.

The telegraf-config.yml:

---

hostname: "sunlight"
input_filters: "cpu:mem:disk:diskio:swap:net:system:postfix:processes:smart:temp:wireless"
output_filters: "influxdb"

# the space at the beginning is required, otherwise Ansible
# parses the string as a list
influxdb_url: ' ["http://192.168.0.21:8086"]'
influxdb_database: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-dbname.txt') }}"
influxdb_username: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-username.txt') }}"
influxdb_password: "{{ lookup('file', playbook_dir + '/credentials/influxdb-sunlight-password.txt') }}"

This defines the hostname for this device, the list of input and output filters I want to use, and the InfluxDB credentials. For $influxdb_url there needs to be a space at the beginning of the string; without it, Ansible parses the string as a list instead of passing it on verbatim.

Load the configuration in the Playbook:

- name: Load Telegraf configuration
  include_vars:
    file: telegraf-config.yml
    name: telegraf_config
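
To re-use the Playbook for several devices, the variable file can be selected per host. This is a sketch under the assumption that the files are named after the inventory hostname; that naming scheme is hypothetical and not part of the setup described here.

```yaml
# Sketch: pick the variable file by inventory name.
# Assumption: files are named telegraf-config-<inventory_hostname>.yml
# (a hypothetical naming scheme).
- name: Load Telegraf configuration for this host
  include_vars:
    file: "telegraf-config-{{ inventory_hostname }}.yml"
    name: telegraf_config
```

Because the variables are still loaded under the name telegraf_config, the tasks that follow do not need to change.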

There is no Telegraf package in the Raspbian I’m using, so I need to add the vendor repository:

- name: Telegraf repository key
  apt_key:
    url: https://repos.influxdata.com/influxdb.key
    keyring: /etc/apt/trusted.gpg.d/telegraf.gpg
    state: present

- name: Telegraf repository
  apt_repository:
    repo: "deb https://repos.influxdata.com/ubuntu {{ ansible_distribution_release }} stable"
    state: present
    filename: telegraf
  register: repo_telegraf

- name: Update cache
  apt:
    update_cache: yes
  when: repo_telegraf.changed

Install the Telegraf package:

- name: Telegraf packages
  apt:
    name:
      - telegraf
    state: present
  register: install_telegraf

Telegraf is now installed, but it comes with a lengthy default configuration file which is not really usable. I’m using a small trick here:

  • when the Telegraf package was just installed (which is registered in $install_telegraf), I delete (move away) the default configuration file
  • then I check if the configuration file exists
  • if the file does not exist, I create one from scratch

The first step moves the default file away:

- name: Move away default configfile
  command: mv /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.default
  when: install_telegraf.changed

The only time the Playbook moves the file away is when the telegraf package is freshly installed. Make sure to use state=present, not state=latest - otherwise this will also be triggered when the package is updated.
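
As a small safety net, the command module's removes argument can be added. This is a sketch of a variation, not what the Playbook above does: with removes set, Ansible skips the command when the named file does not exist, so the task cannot fail on a missing default file.

```yaml
# Sketch: same task as above, but skipped when the default
# configfile is already gone.
- name: Move away default configfile
  command: mv /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.default
  args:
    removes: /etc/telegraf/telegraf.conf
  when: install_telegraf.changed
```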

By making sure the /etc/telegraf/telegraf.conf file is re-created from scratch, one can wipe this file at any time, and have the Playbook take care of the situation. This comes in handy when additional input filters are added. Re-creating the configfile:

- name: Figure out if configfile exists
  stat:
    path: "/etc/telegraf/telegraf.conf"
  register: etc_telegraf_telegraf_conf

# create a new telegraf.conf
- block:
  # configfile does not exist
  # can only happen when the configfile was moved away right after
  # the package was installed
  # generate a configfile using "telegraf config"

  - name: Generate telegraf.conf
    shell: "telegraf --input-filter {{ telegraf_config.input_filters }} --output-filter {{ telegraf_config.output_filters }} config > /etc/telegraf/telegraf.conf"

  # will fail when the config is not alright
  - name: Test generated config
    command: telegraf --config /etc/telegraf/telegraf.conf --test
    register: test_config

  when: not etc_telegraf_telegraf_conf.stat.exists

The block above will also test the configfile, and fail early if the file is faulty. In theory the configuration should first be written to a temp file, then tested, and only replace the configfile if it is valid. But things work smoothly in my setup and I don’t really need this extra step.
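
For completeness, the temp file approach could look roughly like the following sketch. The path /etc/telegraf/telegraf.conf.new is a made-up name for the candidate file; everything else re-uses the variables from the block above.

```yaml
# Sketch: generate into a candidate file, test it, then move it in place.
# Assumption: /etc/telegraf/telegraf.conf.new is a hypothetical temp path.
- block:
  - name: Generate candidate telegraf.conf
    shell: "telegraf --input-filter {{ telegraf_config.input_filters }} --output-filter {{ telegraf_config.output_filters }} config > /etc/telegraf/telegraf.conf.new"

  # fails the block when the candidate config is not alright,
  # so the real configfile is never created from a faulty candidate
  - name: Test candidate config
    command: telegraf --config /etc/telegraf/telegraf.conf.new --test
    changed_when: false

  - name: Move candidate config into place
    command: mv /etc/telegraf/telegraf.conf.new /etc/telegraf/telegraf.conf

  when: not etc_telegraf_telegraf_conf.stat.exists
```

If the test fails, the candidate file stays behind for inspection and is overwritten on the next run.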

Configuration

Next I’m reading the configuration file into a variable. That’s necessary because some of the input and output filters need to verify if specific content is already/still in the file. Especially the exec input filter needs this step, but that’s for another blog post.

- name: Retrieve /etc/telegraf/telegraf.conf
  slurp:
    src: "/etc/telegraf/telegraf.conf"
  register: etc_telegraf_telegraf_conf_retrieve
  changed_when: false

- name: Extract /etc/telegraf/telegraf.conf
  set_fact:
    etc_telegraf_telegraf_conf: "{{ etc_telegraf_telegraf_conf_retrieve.content | b64decode }}"

After this step, $etc_telegraf_telegraf_conf holds the current content of the file, replacing the stat result registered earlier under the same name. Keep in mind that any of the following steps which change telegraf.conf will not update the content of the variable.
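
A later task can then test the variable for specific content. The following sketch only prints a message; the section name is what telegraf config generates for the influxdb output filter, and the task itself is an illustration, not part of my Playbook.

```yaml
# Sketch: act on the slurped file content with a plain substring test.
- name: Check for the InfluxDB output section in the slurped config
  debug:
    msg: "telegraf.conf already contains an [[outputs.influxdb]] section"
  when: "'[[outputs.influxdb]]' in etc_telegraf_telegraf_conf"
```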

As a first step I’m changing a couple of settings to make sure that Telegraf has a larger buffer available to store data temporarily. This is a laptop, and occasionally it’s not in the home network, so I want to keep as much data around as possible. This only works if the laptop is not rebooted, but that is a rare occasion anyway. I’m also explicitly defining the hostname which is used to tag the data in InfluxDB. I don’t want to rely on Telegraf to figure out what hostname it is, and maybe get fooled by auto-assigned hostnames from a DHCP service.

# set a couple default settings
# especially increase buffers to ensure more data is stored if
# the device is not able to reach the InfluxDB service
- name: Update telegraf.conf default settings
  ini_file:
    path: "/etc/telegraf/telegraf.conf"
    section: "{{ item.section }}"
    option: "{{ item.option }}"
    value: "{{ item.value }}"
    state: "{{ item.state }}"
  loop:
    - { section: "agent", option: "hostname", value: '"{{ telegraf_config.hostname }}"', state: present }
    - { section: "agent", option: "metric_buffer_limit", value: '1000000', state: present }
    - { section: "agent", option: "metric_batch_size", value: '10000', state: present }
  notify:
    - restart telegraf

Next is the InfluxDB database configuration:

# setup the InfluxDB output filter
- name: Update telegraf.conf InfluxDB settings
  ini_file:
    path: "/etc/telegraf/telegraf.conf"
    section: "{{ item.section }}"
    option: "{{ item.option }}"
    value: "{{ item.value }}"
    state: "{{ item.state }}"
  loop:
    - { section: "[outputs.influxdb]", option: "urls", value: "{{ telegraf_config.influxdb_url }}", state: present }
    - { section: "[outputs.influxdb]", option: "database", value: '"{{ telegraf_config.influxdb_database }}"', state: present }
    - { section: "[outputs.influxdb]", option: "username", value: '"{{ telegraf_config.influxdb_username }}"', state: present }
    - { section: "[outputs.influxdb]", option: "password", value: '"{{ telegraf_config.influxdb_password }}"', state: present }
    - { section: "[outputs.influxdb]", option: "skip_database_creation", value: 'true', state: present }
  when: telegraf_config.output_filters.find('influxdb') != -1
  notify:
    - restart telegraf

This part is only run if influxdb is in the list of output filters. It also turns off automatic database creation: that feature is not needed because the database already exists, and it could not work anyway because the user only has WRITE permission (which can’t be used to create a database).

Since I have a local Postfix installed on the laptop, I also monitor the queue:

- block:
  - name: Update Postfix mailqueue ACLs for Telegraf
    acl:
      path: "/var/spool/postfix/{{ item }}"
      etype: group
      entity: telegraf
      permissions: rX
      recursive: yes
      state: present
    loop:
      - .
      - active
      - hold
      - incoming
      - deferred
      - maildrop

  - name: Update Postfix mailqueue default ACLs for Telegraf
    acl:
      path: "/var/spool/postfix/{{ item }}"
      etype: group
      entity: telegraf
      permissions: rX
      recursive: yes
      default: yes
      state: present
    loop:
      - .
      - active
      - hold
      - incoming
      - deferred
      - maildrop

  when: telegraf_config.input_filters.find('postfix') != -1

The Telegraf agent runs as an unprivileged user - this user must be able to see the Postfix queue. That’s possible in some way with group rights, but easier to do with ACLs. Ansible has the acl module which is used to set these permissions - but only if postfix is included in the input filter list.

Hard disks are monitored using the S.M.A.R.T. tools:

- block:
  - name: Update telegraf.conf S.M.A.R.T. settings
    ini_file:
      path: "/etc/telegraf/telegraf.conf"
      section: "{{ item.section }}"
      option: "{{ item.option }}"
      value: "{{ item.value }}"
      state: "{{ item.state }}"
    loop:
      - { section: "[inputs.smart]", option: "interval", value: '"300s"', state: present }
    notify:
      - restart telegraf

  when: telegraf_config.input_filters.find('smart') != -1

The only change here is that the check runs every 5 minutes, not every 10 seconds. The status of a hard disk doesn’t change that often …

And finally the Telegraf agent needs to be started:

- name: Enable and start Telegraf service
  service:
    name: telegraf
    state: started
    enabled: yes
  register: start_telegraf

Now the laptop is feeding data into the InfluxDB, and the data can be shown in Grafana.

