Monitoring Syslog: InfluxDB-Telegraf-Grafana via Ansible role

This is a continuation of the last blog entry. This time we are going to gather syslog messages from the monitoring containers, and everything is going to be deployed by Ansible!

As usual, all this is based on Anton Karneliuk’s blog post. All credits to him.

So initially we built a monitoring stack with InfluxDB, Telegraf and Grafana manually to gather and visualise SNMP info from the Arista cEOS switches.

This time, we are going to send syslog from the monitoring stack containers to a new Telegraf instance.

Ideally, we would like to send syslog from the cEOS devices themselves but, as Anton mentions, the RFC 3164 syslog format that most network kit implements is not supported (yet) by Telegraf, which supports RFC 5424.

You can read more about this in these links:

https://github.com/influxdata/telegraf/issues/4593

https://github.com/influxdata/go-syslog/pull/27

https://github.com/influxdata/telegraf/issues/7023

https://github.com/influxdata/telegraf/issues/4687

https://medium.com/@leodido/from-logs-to-metrics-f38854e3441a

https://itnext.io/metrics-from-kubernetes-logs-82cb1dcb3551

So the new Ansible role for building the influx-telegraf-grafana instances is “monitoring_stack”:

├── ansible.cfg
├── ansible-hosts
├── group_vars
│   ├── ceoslab.yaml
│   └── monitoring.yaml
└── playbooks
    ├── monitoring.yaml
    └── roles
        ├── monitoring_stack
        │   ├── tasks
        │   │   ├── container_grafana.yml
        │   │   ├── container_influxdb.yml
        │   │   ├── container_telegraf_snmp.yml
        │   │   ├── container_telegraf_syslog.yml
        │   │   └── main.yml
        │   └── templates
        │       ├── telegraf_snmp_template.j2
        │       └── telegraf_syslog_template.j2

We will have four monitoring containers:

  • influxdb: our time-series database, with two databases: snmp and syslog
  • grafana: GUI to visualise the influxdb contents; we will have panels for the snmp and syslog queries. It needs to connect to influxdb
  • telegraf-snmp: collector of SNMP info from the cEOS containers. The device list is introduced manually in the template. It writes to influxdb
  • telegraf-syslog: collector of syslog messages from the monitoring containers. It writes to influxdb

As the containers are running locally, we define them in the inventory like this:

$ cat ansible-hosts
....
[monitoring]
localhost

We also define some variables in group_vars for the monitoring containers; they will be used in the jinja2 templates and tasks:

$ cat group_vars/monitoring.yaml
# Defaults for Docker containers
docker_mon_net:
  name: monitoring
  subnet: 172.18.0.0/16
  gateway: 172.18.0.1

path_to_containers: /PICK_YOUR_PATH/monitoring-example

var_influxdb:
  username: xxx
  password: xxx123
  snmp_community: xxx123
  db_name:
    snmp: snmp
    syslog: syslog

var_grafana:
  username: admin
  password: xxx123

var_telegraf:
…

So we execute the playbook like this:

ansible master$ ansible-playbook playbooks/monitoring.yaml -vvv --ask-become-pass

The very first time, if you pay attention to the ansible logging, everything should succeed. If for any reason you have to make changes or troubleshoot, and then execute the full playbook again, some tasks will fail, but not the playbook (this is done with ignore_errors: yes inside those tasks). For example, the docker network creation will fail as the network is already there. The same happens if you try to create the user and DBs in an already running influx instance.

That playbook just calls the role “monitoring_stack”. The main playbook in that role will create the docker network all containers will be attached to, create all the containers, and do something hacky with iptables.
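
The network-creation task boils down to something like this (a sketch on my side, assuming it shells out to the docker CLI via the command module, which would explain why a re-run fails and needs ignore_errors):

- name: CREATE DOCKER NETWORK FOR MONITORING
  command: >
    docker network create
    --subnet {{ docker_mon_net.subnet }}
    --gateway {{ docker_mon_net.gateway }}
    {{ docker_mon_net.name }}
  become: yes
  ignore_errors: yes  # fails harmlessly on re-runs because the network already exists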

As the cEOS lab is built (using docker-topo) independently of this playbook, there are already some iptables rules in place and somehow, when executing the role, the rules change and block the new network from any outbound connectivity.

Before the iptables change in the playbook:

# iptables -t filter -S DOCKER-ISOLATION-STAGE-1
Warning: iptables-legacy tables present, use iptables-legacy to see them
-N DOCKER-ISOLATION-STAGE-1
-A DOCKER-ISOLATION-STAGE-1 -i br-4bd17cfa19a8 ! -o br-4bd17cfa19a8 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i br-94c1e813ad6f ! -o br-94c1e813ad6f -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i br-13ab2b6a0d1d ! -o br-13ab2b6a0d1d -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i br-00db5844bbb0 ! -o br-00db5844bbb0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i br-121978ca0282 ! -o br-121978ca0282 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
#
# iptables -t filter -S DOCKER-ISOLATION-STAGE-2
Warning: iptables-legacy tables present, use iptables-legacy to see them
-N DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-2 -o br-4bd17cfa19a8 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-94c1e813ad6f -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-13ab2b6a0d1d -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-00db5844bbb0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-121978ca0282 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN

I want to avoid DOCKER-ISOLATION-STAGE-2, so I want the “-A DOCKER-ISOLATION-STAGE-1 -j ACCEPT” rule at the top of that chain.
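
The hacky iptables step amounts to something like this (a sketch of the idea, not necessarily the exact task in the role): delete the ACCEPT rule and re-insert it at position 1 of the chain:

- name: MOVE ACCEPT TO THE TOP OF DOCKER-ISOLATION-STAGE-1
  shell: |
    iptables -D DOCKER-ISOLATION-STAGE-1 -j ACCEPT
    iptables -I DOCKER-ISOLATION-STAGE-1 1 -j ACCEPT
  become: yes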

This is not the first (nor the last) time this issue bites me. I need to review the docker-topo file carefully and really get my head around the networking expectations of docker.

Another thing about docker networking that bites me very often: in my head, each monitoring container has an IP. For example, influxdb is 172.18.0.2 and telegraf-syslog is 172.18.0.4. We have configured influxdb to send syslog to the telegraf-syslog container, so I would expect the influxdb container to use its .2 address and everything to stay local (no forwarding, no firewall, etc). But no, it uses the host IP, 172.18.0.1. Presumably that is because the syslog logging driver runs in the Docker daemon on the host, not inside the container, so the messages are sourced from the bridge gateway address.

Apart from that, there are several things that I had to review while adapting the role to my environment regarding docker and ansible.

docker documentation:

how to create network: https://docs.docker.com/engine/reference/commandline/network_create/

how to configure container logs: https://docs.docker.com/engine/reference/commandline/container_logs/

how to configure the logging driver in a container: https://docs.docker.com/config/containers/logging/configure/

how to configure syslog in a container: https://docs.docker.com/config/containers/logging/syslog/

how to run commands from a running container: https://docs.docker.com/engine/reference/commandline/exec/

ansible documentation:

become – run commands with sudo in a playbook: https://docs.ansible.com/ansible/latest/user_guide/become.html (--ask-become-pass, -K)

docker container module: https://docs.ansible.com/ansible/latest/modules/docker_container_module.html

grafana data source module: https://docs.ansible.com/ansible/latest/modules/grafana_datasource_module.html

This is important because, via ansible, I had to work out the meaning of become, how to add the syslog config to the containers, and how to add the grafana datasources via a module.
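
For example, the grafana datasource task looks roughly like this (a sketch: the datasource name is made up, and the influxdb IP and the ports are the ones from my lab):

- name: CONFIGURE GRAFANA DATASOURCE // SYSLOG DB
  grafana_datasource:
      name: influxdb-syslog
      grafana_url: "http://localhost:3000"
      grafana_user: "{{ var_grafana.username }}"
      grafana_password: "{{ var_grafana.password }}"
      ds_type: influxdb
      ds_url: "http://172.18.0.2:8086"
      database: "{{ var_influxdb.db_name.syslog }}"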

All my ansible code is here.

Another thing I had to hardcode is the IP of the telegraf-syslog container in the log options of each container task:

syslog-address: "udp://172.18.0.4:6514"

$ cat container_influxdb.yml 
---
...
- name: 4- CONTAINER INFLUXDB // LAUNCHING CONTAINER
  docker_container:
      name: influxdb
      image: influxdb
      state: started
      command: "-config /etc/influxdb/influxdb.conf"
      networks:
          - name: "{{ docker_mon_net.name }}"
      purge_networks: yes
      ports:
          - "8086:8086"
      volumes:
          - "{{ path_to_containers }}/influxdb/influxdb.conf:/etc/influxdb/influxdb.conf:ro"
          - "{{ path_to_containers }}/influxdb/data:/var/lib/influxdb"
      log_driver: syslog
      log_options:
        syslog-address: "udp://172.18.0.4:6514"
        tag: influxdb
        syslog-format: rfc5424
  become: yes
  tags:
      - tag_influx
...
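
The telegraf_syslog container is launched in a very similar way; roughly like this (a sketch: the volume path and config file name are assumptions, while the port mapping matches the docker ps output below):

- name: CONTAINER TELEGRAF_SYSLOG // LAUNCHING CONTAINER
  docker_container:
      name: telegraf_syslog
      image: telegraf
      state: started
      networks:
          - name: "{{ docker_mon_net.name }}"
      purge_networks: yes
      ports:
          - "6514:6514/udp"
      volumes:
          - "{{ path_to_containers }}/telegraf_syslog/telegraf_syslog.conf:/etc/telegraf/telegraf.conf:ro"
  become: yes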

Once you have all containers running:

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
dd519ff01d6e        telegraf            "/entrypoint.sh -con…"   4 hours ago         Up 4 hours          8092/udp, 0.0.0.0:161->161/udp, 8125/udp, 8094/tcp     telegraf_snmp
869f158046a6        grafana/grafana     "/run.sh"                4 hours ago         Up 4 hours          0.0.0.0:3000->3000/tcp                                 grafana
dc68f261746b        influxdb            "/entrypoint.sh -con…"   4 hours ago         Up 4 hours          0.0.0.0:8086->8086/tcp                                 influxdb
3662c3c69b21        telegraf            "/entrypoint.sh -con…"   6 hours ago         Up 6 hours          8092/udp, 0.0.0.0:6514->6514/udp, 8125/udp, 8094/tcp   telegraf_syslog
ada1f884f1b7        ceos-lab:4.23.3M    "/sbin/init systemd.…"   28 hours ago        Up 4 hours          0.0.0.0:2002->22/tcp, 0.0.0.0:9002->443/tcp            3node_r03
22d9c4ae9043        ceos-lab:4.23.3M    "/sbin/init systemd.…"   28 hours ago        Up 4 hours          0.0.0.0:2001->22/tcp, 0.0.0.0:9001->443/tcp            3node_r02
fe7046b1f425        ceos-lab:4.23.3M    "/sbin/init systemd.…"   28 hours ago        Up 4 hours          0.0.0.0:2000->22/tcp, 0.0.0.0:9000->443/tcp            3node_r01

You should verify that syslog messages are stored in influxdb:

$ curl -G 'https://localhost:8086/query?db=syslog&pretty=true&u=xxx&p=xxx123' --data-urlencode "q=SELECT * FROM syslog limit 2" --insecure
{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "syslog",
                    "columns": [
                        "time",
                        "appname",
                        "facility",
                        "facility_code",
                        "host",
                        "hostname",
                        "message",
                        "msgid",
                        "procid",
                        "severity",
                        "severity_code",
                        "timestamp",
                        "version"
                    ],
                    "values": [
                        [
                            "2020-07-21T12:08:16.216632823Z",
                            "influxdb",
                            "daemon",
                            3,
                            "3662c3c69b21",
                            "athens",
                            "ts=2020-07-21T12:08:16.169711Z lvl=info msg=\"InfluxDB starting\" log_id=0O8KE_AG000 version=1.8.1 branch=1.8 commit=af0237819ab9c5997c1c0144862dc762b9d8fc25",
                            "influxdb",
                            "11254",
                            "err",
                            3,
                            1595333296000000000,
                            1
                        ],

We can now create the new queries in grafana for syslog. The datasources are already created by ansible, so we don't have to worry about that.

For creating a query showing the rate of syslog messages we receive, this is what I did:

grafana – syslog rate query

Most of the entries come from “influxdb”.

For creating a query with the content of each syslog message:

grafana – syslog content

Here I struggled a bit. I can’t really change much in the table view.

And this is the dashboard with the syslog queries and snmp from the last blog entry:

grafana – dashboard – syslog and snmp

So at the end, I have an ansible role working!

I need to learn more about how to back up stuff from grafana. I have been playing with this:

https://github.com/ysde/grafana-backup-tool

Next thing I want to try is telemetry.

Ansible Troubleshooting 2

Today I was trying to write a playbook to push config to Arista devices.

Initially I wanted to use the napalm module to push the config (as I have done with nornir), but it seems the napalm-ansible module requires napalm 3 and netmiko 3, and that breaks my nornir 2.4 (which requires napalm<3). So I uninstalled napalm-ansible and restored the other packages. Good thing I checked the versions beforehand.

$ python -m pip list | grep -E 'nornir|napalm|netmiko|ansible'
ansible 2.9.10
napalm 2.5.0
netmiko 2.4.2
nornir 2.4.0

So I had to check the eos_config module instead. I think the napalm-ansible module is more powerful, as it uses the diffs and config sessions provided by Arista. As far as I can see, there is no option to tell eos_config to just do a dry run.

In the end I managed to put everything together, but eos_config was failing:

TASK [11- push config]
task path: xxx/testdir2/ceos-testing/ansible/playbooks/gen-config.yaml:60
fatal: [r1]: FAILED! => {
"changed": false,
"msg": "path specified in src not found"
}

The funny thing is that all the other tasks that needed templates were using the same base path and were fine:

- name: 10- merge all configs in one file
  assemble:
    src: "CFGS/{{ inventory_hostname }}/" 
    dest: "CFGS/{{ inventory_hostname }}-full.txt"

- name: 11- push config
  debugger: on_failed
  eos_config:
    #src: "{{playbook_dir}}/../CFGS/{{ inventory_hostname }}-full.txt"
    src: "../CFGS/{{ inventory_hostname }}-full.txt"
    backup: yes

So I had to find out where that task was looking for the file. It seems the “assemble”, “template” and “file” tasks resolve relative paths from where I invoke ansible-playbook (xxx/testdir2/ceos-testing/ansible), but “eos_config” resolves them from where the playbook lives (xxx/testdir2/ceos-testing/ansible/playbooks), based on my running command “…/ansible master$ ansible-playbook playbooks/gen-config.yaml“.

So I went searching for help and found documentation about the playbook path and the Ansible search paths. Now I needed to verify that, and I found the Ansible debugger and some examples that were really useful!

So I used “debugger: on_failed” for my task 11 and could see the path:

TASK [11- push config]
task path: /home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/gen-config.yaml:60
fatal: [r1]: FAILED! => {
"changed": false,
"msg": "path specified in src not found"
}
[r1] TASK: 11- push config (debug)> p task.args
{'backup': True,
'src': '/home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/CFGS/r1-full.txt'}
[r1] TASK: 11- push config (debug)> quit
User interrupted execution

So it is clear it was looking at the playbook dir.
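
One unambiguous way to fix it (matching the commented-out line in task 11 above) is to anchor the path on playbook_dir, so it no longer depends on where you invoke ansible-playbook from:

- name: 11- push config
  eos_config:
    src: "{{ playbook_dir }}/../CFGS/{{ inventory_hostname }}-full.txt"
    backup: yes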

So after fixing the path, I realised that I didn't want to run everything each time, so I used tags so that only the last part was executed:

/ansible master$ cat playbooks/gen-config.yaml
...
- name: 12- display result
  debug:
    msg: "Backup file is {{ load_config.shortname }} and result is: {{ load_config }}"
  tags: push_config
...

/ansible master$ ansible-playbook playbooks/gen-config.yaml --limit="r1" -vvv --tags "push_config"

One more thing: Ansible's output when you have dictionaries is not great. I checked this link and it helps for failures and with -vvvv, but for green (ok) outputs it is still not great:

TASK [12- - display result] *
task path: /home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/gen-config.yaml:61
ok: [r1] =>
msg: 'Backup file is /home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/backup/r1_config and result is: {''changed'': True, ''commands'': [''interface Ethernet1'', ''no shutdown'', ''interface Ethernet2'', ''no shutdown'', ''router bgp 100'', ''neighbor AS100-CORE password mpls-sr''], ''updates'': [''interface Ethernet1'', ''no shutdown'', ''interface Ethernet2'', ''no shutdown'', ''router bgp 100'', ''neighbor AS100-CORE password mpls-sr''], ''session'': ''ansible_1594920727'', ''backup_path'': ''/home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/backup/r1_config.2020-07-16@18:32:07'', ''date'': ''2020-07-16'', ''time'': ''18:32:07'', ''shortname'': ''/home/tomas/storage/technology/arista/testdir2/ceos-testing/ansible/playbooks/backup/r1_config'', ''filename'': ''r1_config.2020-07-16@18:32:07'', ''failed'': False}'
META: ran handlers
META: ran handlers
PLAY RECAP
r1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
(testdir2) go:1.12.5|py:3.7.3|tomas@athens:~/storage/technology/arista/testdir2/ceos-testing/ansible master$
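
One thing that seems to help with this (an assumption based on the callback plugin documentation, I haven't tried it in this lab yet) is switching the stdout callback to YAML in ansible.cfg:

[defaults]
stdout_callback = yaml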

Ansible – Troubleshooting

A couple of years ago I wrote an ansible playbook that used napalm for configuring some switches.

I wanted to test ansible again as I am quite rusty, and there is always demand for it.

I started with just something basic with my ceos lab.

All my code is here:

https://github.com/thomarite/ceos-testing/tree/master/ansible

Initially I was following the official documentation for Arista EOS Ansible modules:

https://ansible-arista-howto.readthedocs.io/en/latest/COLLECTING_STATUS.html

https://github.com/titom73/ansible-arista-module-howto

Installing ansible was fine with pip in my venv.

But I hit a wall with just the first example using “eos_facts”. Initially I wasn't adding any debugging flags, so it was even worse. Fortunately I remembered “-vvv”. I was seeing this:

The full traceback is:
Traceback (most recent call last):
File "/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py", line 102, in
_ansiballz_main()
File "/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py", line 94, in _ansiballz_main
invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
File "/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py", line 40, in invoke_module
runpy.run_module(mod_name='ansible.modules.network.eos.eos_facts', init_globals=None, run_name='main', alter_sys=True)
File "/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py", line 205, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/modules/network/eos/eos_facts.py", line 206, in
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/modules/network/eos/eos_facts.py", line 197, in main
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/facts/facts.py", line 23, in init
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/network.py", line 213, in get_resource_connection
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/network.py", line 229, in get_capabilities
File "/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/connection.py", line 121, in init
AssertionError: socket_path must be a value
fatal: [r3]: FAILED! => {
"changed": false,
"module_stderr": "Traceback (most recent call last):\n File \"/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py\", line 102, in \n _ansiballz_main()\n File \"/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py\", line 94, in _ansiballz_main\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n File \"/home/tomas/.ansible/tmp/ansible-tmp-1594296522.1539829-295453-189146847007138/AnsiballZ_eos_facts.py\", line 40, in invoke_module\n runpy.run_module(mod_name='ansible.modules.network.eos.eos_facts', init_globals=None, run_name='main', alter_sys=True)\n File \"/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py\", line 205, in run_module\n return _run_module_code(code, init_globals, run_name, mod_spec)\n File \"/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py\", line 96, in _run_module_code\n mod_name, mod_spec, pkg_name, script_name)\n File \"/home/tomas/.pyenv/versions/3.7.3/lib/python3.7/runpy.py\", line 85, in _run_code\n exec(code, run_globals)\n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/modules/network/eos/eos_facts.py\", line 206, in \n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/modules/network/eos/eos_facts.py\", line 197, in main\n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/facts/facts.py\", line 23, in init\n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/network.py\", line 213, in get_resource_connection\n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/network.py\", line 229, in get_capabilities\n File \"/tmp/ansible_eos_facts_payload_r5gz8rov/ansible_eos_facts_payload.zip/ansible/module_utils/connection.py\", line 121, in init\nAssertionError: socket_path must be a value\n",
"module_stdout": "",
"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1
}

So, “socket_path” wasn't getting a value. I checked all the python files mentioned in the stack trace but couldn't find anything.

It was clear that I wasn't providing enough info to ansible to establish the socket for connecting to the devices (ip:port).

And the example from the documentation didn't work either:

https://docs.ansible.com/ansible/latest/network/user_guide/platform_eos.html#using-eapi-in-ansible

I knew my old ansible script was working before I left my job. But I also knew I was now using the latest version of ansible, so very likely things had changed since then.

$ ansible --version
ansible 2.9.10

So I had to read about the “eos_facts” and “eos_config” modules, searching here:

https://docs.ansible.com/ansible/latest/modules/list_of_network_modules.html

https://docs.ansible.com/ansible/latest/modules/eos_facts_module.html#eos-facts-module

https://docs.ansible.com/ansible/latest/modules/eos_config_module.html

After some time, I managed to fix the playbook and my environment, and I could run the playbook using the SSH connector (though I was ignoring a warning about “provider” not being needed…)

(testdir2)/ansible master$ cat ansible-hosts
[all:vars]
ansible_python_interpreter=/home/tomas/storage/technology/arista/testdir2/bin/python
ansible_user='tomas'
ansible_password='tomas123'
[ceoslab]
r1 ansible_host=127.0.0.1 ansible_port=2000
r2 ansible_host=127.0.0.1 ansible_port=2001
r3 ansible_host=127.0.0.1 ansible_port=2002

(testdir2)/ansible master$ cat group_vars/ceoslab.yaml
ansible_network_os: eos
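
And the playbook itself is minimal; something along these lines (a sketch, the real one is in the repo):

- name: Run commands on ceos lab
  hosts: ceoslab
  connection: network_cli
  gather_facts: no
  tasks:
    - name: Collect all facts from device
      eos_facts:
        gather_subset: all

    - name: Display result
      debug:
        msg: "Model is {{ ansible_net_model }} and it is running {{ ansible_net_version }}"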

The output:

/ansible master$ ansible-playbook playbooks/collect-facts-cli.yaml
PLAY [Run commands on ceos lab]
TASK [Collect all facts from device] ***
[WARNING]: provider is unnecessary when using network_cli and will be ignored
[WARNING]: default value for gather_subset will be changed to min from !config v2.11 onwards
ok: [r1]
ok: [r3]
ok: [r2]
TASK [Display result] ****
ok: [r2] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
ok: [r1] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
ok: [r3] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
PLAY RECAP *
r1 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
r2 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
r3 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Ok, so getting the playbook working over the API shouldn't be that difficult, right? It was.

The full traceback is:
File "/tmp/ansible_eos_facts_payload_vz7c7ipu/ansible_eos_facts_payload.zip/ansible/module_utils/network/common/network.py", line 229, in get_capabilities
capabilities = Connection(module._socket_path).get_capabilities()
File "/tmp/ansible_eos_facts_payload_vz7c7ipu/ansible_eos_facts_payload.zip/ansible/module_utils/connection.py", line 185, in rpc
raise ConnectionError(to_text(msg, errors='surrogate_then_replace'), code=code)
fatal: [r1]: FAILED! => {
"changed": false,
"invocation": {
"module_args": {
"auth_pass": null,
"authorize": null,
"gather_network_resources": null,
"gather_subset": [
"all"
],
"host": null,
"password": null,
"port": null,
"provider": null,
"ssh_keyfile": null,
"timeout": null,
"transport": null,
"use_ssl": null,
"username": null,
"validate_certs": null
}
},
"msg": "Could not connect to http://127.0.0.1:80/command-api: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1056)"
}

I was surprised that it was using port 80. I was pretty sure I was providing the correct port (900x), so somehow my data wasn't being processed.

I clearly wasn't paying attention to the documentation:

https://docs.ansible.com/ansible/latest/modules/eos_facts_module.html#eos-facts-module

It says clearly that “provider” has been deprecated since 2.5! And I am using 2.9.

On top of that, my knowledge of ansible is very poor and I didn't understand the concept of “connection”: SSH uses “network_cli” and the API uses “httpapi”.

I was very close to giving up on the API connection when I searched for “ansible network_cli” and found the documentation for that plugin. Then I searched for “httpapi” and it was gold!

https://docs.ansible.com/ansible/latest/plugins/connection/network_cli.html

https://docs.ansible.com/ansible/latest/plugins/connection/httpapi.html

I realised that I needed to pass specific vars to get the HTTPS connection working. So in the end, I managed to get the config right for both SSH and HTTPS:

/ansible master$ cat ansible-hosts
[all:vars]
ansible_python_interpreter=/home/tomas/storage/technology/arista/testdir2/bin/python
ansible_user='tomas'
ansible_password='tomas123'
[ceoslab]
r1 ansible_host=127.0.0.1 ansible_port=2000 ansible_httpapi_port=9000
r2 ansible_host=127.0.0.1 ansible_port=2001 ansible_httpapi_port=9001
r3 ansible_host=127.0.0.1 ansible_port=2002 ansible_httpapi_port=9002

/ansible master$ cat group_vars/ceoslab.yaml
ansible_network_os: eos
#start - eapi config
ansible_httpapi_use_ssl: 'yes'
ansible_httpapi_validate_certs: 'no'
ansible_httpapi_password: "{{ ansible_password }}"
#end - eapi config

The output:

ansible master$ ansible-playbook playbooks/collect-facts-eapi.yaml
PLAY [Run commands on remote ceos lab] *
TASK [Collect all facts from device] ***
[WARNING]: default value for gather_subset will be changed to min from !config v2.11 onwards
ok: [r3]
ok: [r1]
ok: [r2]
TASK [Display result] ****
ok: [r2] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
ok: [r1] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
ok: [r3] => {
"msg": "Model is cEOSLab and it is running 4.23.3M"
}
PLAY RECAP *
r1 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
r2 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
r3 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

At the end of the day, the scripts are identical apart from the “connection” var:

/ansible/playbooks master$ diff collect-facts-cli.yaml collect-facts-eapi.yaml
4c4
< connection: network_cli
---
> connection: httpapi

I think you can also pass that var to the playbook via the CLI (something like -e ansible_connection=httpapi), so I will try that later.

My recommendation is always to use “-vvv”.

BTW, a good ansible summary I found:

https://gist.github.com/andreicristianpetcu/b892338de279af9dac067891579cad7d

In summary, I found ansible more difficult to troubleshoot than nornir. Nornir is pure python, so I can run ipdb wherever I want.

But anyway, I learned things. I will try to write some more complex playbooks next.

Netbox – API Troubleshooting

Yesterday I managed to get netbox and my lab connected. So today I followed up with the original article, and found a new issue that took me several hours.

Initially I was seeing an error that I couldn't understand:

netbox.exceptions.CreateException: This field is required

From:

(venv) /netbox-example/nornir-napalm-netbox-demo master$ python scripts/create_interfaces.py
nb_url = http://0.0.0.0:8080
Creating Netbox Interface for device r1, interface Loopback1
Traceback (most recent call last):
File "scripts/create_interfaces.py", line 42, in
task=create_netbox_interface,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 146, in run
result = self._run_serial(task, run_on, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 72, in _run_serial
result[host.name] = task.copy().start(host, self)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/task.py", line 85, in start
r = self.task(self, **self.params)
File "scripts/create_interfaces.py", line 34, in create_netbox_interface
device_id=device_id,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/dcim.py", line 431, in create_interface
return self.netbox_con.post('/dcim/interfaces/', required_fields, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/connection.py", line 124, in post
raise exceptions.CreateException(resp_data)
netbox.exceptions.CreateException: This field is required.

So I started to follow the trace, adding “print” and using “ipdb” to see what was going on:

....
/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/connection.py(71)__request()
70 finally:
---> 71 self.close()
72
ipdb> dir(response)
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
ipdb> response.url
'http://0.0.0.0:8080/api/dcim/interfaces/'
ipdb> response.text
'{"type":["This field is required."]}'
ipdb> response.status_code
400
ipdb> response.content
b'{"type":["This field is required."]}'
ipdb> response.reason
'Bad Request'
ipdb> response.request
<PreparedRequest [POST]>
ipdb> prepared_request
<PreparedRequest [POST]>
ipdb> prepared_request.url
'http://0.0.0.0:8080/api/dcim/interfaces/'
ipdb> dir(prepared_request)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_body_position', '_cookies', '_encode_files', '_encode_params', '_get_idna_encoded_host', 'body', 'copy', 'deregister_hook', 'headers', 'hooks', 'method', 'path_url', 'prepare', 'prepare_auth', 'prepare_body', 'prepare_content_length', 'prepare_cookies', 'prepare_headers', 'prepare_hooks', 'prepare_method', 'prepare_url', 'register_hook', 'url']
ipdb> prepared_request.path_url
'/api/dcim/interfaces/'
ipdb> response.__content
*** AttributeError: 'Response' object has no attribute '__content'
ipdb> response._content
b'{"type":["This field is required."]}'
ipdb> response.content
b'{"type":["This field is required."]}'
ipdb> response.headers
{'Server': 'nginx', 'Date': 'Wed, 08 Jul 2020 12:36:35 GMT', 'Content-Type': 'application/json', 'Content-Length': '36', 'Connection': 'keep-alive', 'Vary': 'Accept, Cookie, Origin', 'Allow': 'GET, POST, HEAD, OPTIONS, TRACE', 'API-Version': '2.8', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN'}
ipdb> response.reason
'Bad Request'
ipdb> response.request
<PreparedRequest [POST]>
ipdb> response.test
*** AttributeError: 'Response' object has no attribute 'test'
ipdb> response.text
'{"type":["This field is required."]}'
ipdb> response.url
'http://0.0.0.0:8080/api/dcim/interfaces/'
ipdb> quit
Create Netbox Interfaces
r1 ** changed : False
vvvv Create Netbox Interfaces ** changed : False vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ERROR
---- napalm_get ** changed : False --------------------------------------------- INFO
(venv) go:1.12.5|py:3.7.3|tomas@athens:~/storage/technology/netbox-example/nornir-napalm-netbox-demo master$ python scripts/create_interfaces.py
nb_url = http://0.0.0.0:8080
url3=http://0.0.0.0:8080/api/dcim/interfaces?limit=0
Creating Netbox Interface for device r1, interface Loopback1
url3=http://0.0.0.0:8080/api/dcim/devices/?name=r1&limit=0
device_id = 1
url3=http://0.0.0.0:8080/api/dcim/interfaces/
resp_ok=False resp_status=400
body_data= {'name': 'Loopback1', 'form_factor': 1200, 'device': 1}
params= /dcim/interfaces/
resp_data= {'type': ['This field is required.']}
Traceback (most recent call last):
File "scripts/create_interfaces.py", line 43, in
task=create_netbox_interface,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 146, in run
result = self._run_serial(task, run_on, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 72, in _run_serial
result[host.name] = task.copy().start(host, self)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/task.py", line 85, in start
r = self.task(self, **self.params)
File "scripts/create_interfaces.py", line 35, in create_netbox_interface
device_id=device_id,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/dcim.py", line 431, in create_interface
return self.netbox_con.post('/dcim/interfaces/', required_fields, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/connection.py", line 130, in post
raise exceptions.CreateException(resp_data)
netbox.exceptions.CreateException: This field is required.

So at the end I realised that I was missing the parameter “type”!!!

I was checking the netbox documentation in github but I couldn't see clearly what kind of config I had to provide…

I checked the “type” value for the only interfaces I already had in netbox: “http://0.0.0.0:8080/api/dcim/interfaces/”

So I tried to pass exactly that but it was still failing…

(venv) go:1.12.5|py:3.7.3|tomas@athens:~/storage/technology/netbox-example/nornir-napalm-netbox-demo master$ python scripts/create_interfaces.py
nb_url = http://0.0.0.0:8080
url3=http://0.0.0.0:8080/api/dcim/interfaces?limit=0
Creating Netbox Interface for device r1, interface Loopback1
url3=http://0.0.0.0:8080/api/dcim/devices/?name=r1&limit=0
device_id = 1
url3=http://0.0.0.0:8080/api/dcim/interfaces/
resp_ok=False resp_status=400
body_data= {'name': 'Loopback1', 'form_factor': 1200, 'device': 1, 'type': {'value': '1000base-t', 'label': '1000BASE-T (1GE)', 'id': 1000}}
params= /dcim/interfaces/
resp_data= {'type': ['Value must be passed directly (e.g. "foo": 123); do not use a dictionary or list.']}
Traceback (most recent call last):
File "scripts/create_interfaces.py", line 50, in
task=create_netbox_interface,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 146, in run
result = self._run_serial(task, run_on, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/init.py", line 72, in _run_serial
result[host.name] = task.copy().start(host, self)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/nornir/core/task.py", line 85, in start
r = self.task(self, **self.params)
File "scripts/create_interfaces.py", line 42, in create_netbox_interface
**interface_type,
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/dcim.py", line 431, in create_interface
return self.netbox_con.post('/dcim/interfaces/', required_fields, **kwargs)
File "/home/tomas/storage/technology/netbox-example/venv/lib/python3.7/site-packages/netbox/connection.py", line 130, in post
raise exceptions.CreateException(resp_data)
netbox.exceptions.CreateException: Value must be passed directly (e.g. "foo": 123); do not use a dictionary or list.
(venv) go:1.12.5|py:3.7.3|tomas@athens:~/storage/technology/netbox-example/nornir-napalm-netbox-demo master$

Somehow the API had to be documented… and by chance, looking at the bottom of the netbox page, there was an “API” link….

So, now I needed to look up the correct API call. Based on the script and logs, it was a “POST” for “/dcim/interfaces/”. Here we go!

So finally, I had the info. I confirmed which fields were mandatory and the values they needed!

interface_type = {}
interface_type["type"] = "1000base-t"  # "type" must be a plain string, not the dict the API returns
for interface_name in interfaces.keys():
    if not is_interface_present(nb_interfaces, f"{task.host}", interface_name):
        print(
            f"* Creating Netbox Interface for device {task.host}, interface {interface_name}"
        )
        device_id = get_device_id(f"{task.host}", netbox)
        print("device_id = %s" % device_id)
        netbox.dcim.create_interface(
            name=f"{interface_name}",
            form_factor=1200,  # default
            device_id=device_id,
            **interface_type,
        )

So the script ran fine for all my devices:

netbox-example/nornir-napalm-netbox-demo master$ python scripts/create_interfaces.py
nb_url = http://0.0.0.0:8080
url3=http://0.0.0.0:8080/api/dcim/interfaces?limit=0
Create Netbox Interfaces
r1 ** changed : False
vvvv Create Netbox Interfaces ** changed : False vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv INFO
---- napalm_get ** changed : False --------------------------------------------- INFO
^^^^ END Create Netbox Interfaces ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
r2 ** changed : False
vvvv Create Netbox Interfaces ** changed : False vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv INFO
---- napalm_get ** changed : False --------------------------------------------- INFO
^^^^ END Create Netbox Interfaces ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
r3 ** changed : False
vvvv Create Netbox Interfaces ** changed : False vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv INFO
---- napalm_get ** changed : False --------------------------------------------- INFO
^^^^ END Create Netbox Interfaces ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

And it is updated in the GUI:

Nornir

Nornir is a python framework mainly oriented to network automation. Instead of using another tool like Ansible (which you need to learn), you can do the same in pure python all the way. Ansible doesn't scale well and can be very slow; with nornir you have threading from day zero, so if you have to run tasks on 100 devices, you will feel and see the difference.

I learnt about nornir via Kirk Byers' course. Unfortunately I didn't have the chance/time to use it in my former day job, so now I have had time to review things and do a small project.

From https://github.com/thomarite/ceos-testing in the nornir section you can find the whole environment. I tested on the 3-node topology.

It is nothing special. The script builds the config for BGP or ISIS using jinja2 and yaml files. I have the feeling that my jinja2 is a bit difficult to follow. Then, using napalm, it connects to the devices to push or check the config.

Just one issue: it seems that, due to cEOS relying on docker and my filesystem, if you decide to push the config (dry_run=False == commit=True), the task will fail while trying to write the startup config, but the config is actually applied:

(testdir2) /testdir2/ceos-testing/nornir master$ python buid-config.py -b isis -c
hostname: r1
task: deploy_config for isis
failed: True
logs: Traceback (most recent call last):
...
File ".../testdir2/lib/python3.7/site-packages/pyeapi/eapilib.py", line 469, in send
raise CommandError(code, msg, command_error=err, output=out)
pyeapi.eapilib.CommandError: Error [1000]: CLI command 5 of 5 'write memory' failed: could not run command [Error copying system:/running-config to flash:/startup-config (Operation not permitted)]
changed: False
diff:

hostname: r2
task: deploy_config for isis
failed: False
logs: None
changed: False
diff:

hostname: r3
task: deploy_config for isis
failed: False
logs: None
changed: False
diff:

This shouldn't happen on vEOS or the real hardware (if you have the correct aaa config, of course).

CI: Basics with Travis

For some time I have wanted to learn a bit about CI/CD. Today I have given Travis a go.

All this is based on Kirk Byers python course and his git repo.

So I just created an empty repo and started working on it:

$ git clone https://github.com/thomarite/test-ci.git

$ cd test-ci
$ pyenv local 3.7.3
$ python -m venv virt_env
$ source virt_env/bin/activate

$ python -m pip install pylama
$ python -m pip install black
$ python -m pip install pytest
$ python -m pip install tox

$ mkdir tests

$ vim tests/test_sample.py
def increment(x):
    return x + 1


def test_answer():
    assert increment(4) == 5

$ vim requirements.txt
pytest==5.4.3
pylama==7.7.1
black==19.10b0

$ vim .travis.yml
language: python
python:
  - "3.7"
# command to install dependencies
install:
  - pip install -r requirements.txt
# command to run tests
script:
  - pylama .
  - black --check .
  - py.test -s -v tests/

Then you create an account with travis-ci.org (which is “free”) and link it up to your repo. As soon as you commit, you will see the tests run and whether they are successful.

Now that I have a basic setup, I hope I carry on using it for any new python stuff I try.