Configure NXOS with Napalm

Using Ansible and Napalm to configure N9Kv

27 September 2020   10 min read

Napalm offers an easy way to configure and gather information from network devices using a unified API. Whichever vendor it is used against, the input task and the returned output are the same. The only parts that are not vendor neutral are the actual commands run and the configuration being applied. This post documents my experiences of trying to replace the whole configuration on NXOS using Napalm with Ansible.

Napalm operates using the nxapi and scp-server features, so these must be enabled for Napalm to work. I have been using N9Kv 9.2.4 on EVE-NG and find the API to be slow at times and buggy, in that it can intermittently break when using config_replace. Even though the service was still running and NXOS said the port was open, you couldn’t telnet to port 443. Removing the command nxapi use-vrf management improved stability. As the VRF is optional, removing it doesn’t affect pushing config via the management interface.

feature nxapi                      Enable the NXOS API
feature scp-server                 Allows for the copying of files used for diff and config_replace

show nxapi                         See the certificate, port number and timeout settings
show nxapi-server logs             Show logs of past API connections

The main options when using Napalm within Ansible are pretty much the same as when it is used natively with Python. The rollback feature hasn’t been ported into Ansible, however it does still create rollback_config.txt, meaning a rollback can be done by applying this configuration file.

- name: NET >> Apply the configuration
  napalm_install_config:                              # Used to deploy the final configuration
    provider: '{{ creds }}'                           # The provider, a dictionary of auth creds set in group_var
    dev_os: '{{ os }}'                                # The network device type (os) set in host_var
    timeout: 60                                       # Default timeout to wait for a response is 60 seconds
    config_file: '{{ host_tmpdir }}/assembled.conf'   # File containing the configuration that is to be applied to this device
    commit_changes: true or false                     # True will apply the changes, false will discard the changes after doing a diff
    replace_config: true or false                     # (Optional) True to replace the entire config or False to just merge with it (default is False)
    get_diffs: true or false                          # (Optional) True compares diffs of current and new config, use -v or register to print (default is True)*
    diff_file: '{{ host_tmpdir }}/diff'               # (Optional) Writes the results of diff to a file called diff (also need get_diffs to be enabled)
  register: changes
- debug: var=changes.msg.splitlines()                 # Prints the differences to CLI

commit_changes: False means it won’t commit the changes. Even if this is set to True, it is overridden by Ansible check-mode.
get_diffs is enabled by default, use -v or register and debug to view the diffs. The output is one long string, so use splitlines with debug to make it more human readable.

  register: changes
- debug: var=changes.msg.splitlines()
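As a rough illustration of what splitlines does to the registered output, here is a small Python sketch. The sample_diff string is made up for the example and is not real device output.

```python
# Hypothetical diff text, standing in for the string registered in changes.msg
sample_diff = "interface Ethernet1/5\n  no shutdown\nvlan 10\n  name COMPUTE"

# splitlines() breaks the single string into a list with one entry per line,
# which is what lets debug print it as separate items rather than one long line
diff_lines = sample_diff.splitlines()

for line in diff_lines:
    print(line)
```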

A better option for complex or long changes is to save the diff to a file with diff_file: (get_diffs still needs to be enabled)

        diff_file: "{{ ans.dir_path }}/{{ inventory_hostname }}/diff.txt"

Replace

Nexus configuration files are checkpoint files, which use the rollback feature to create archives (checkpoints) and roll back between checkpoint file versions (without needing a reboot). Napalm makes use of this feature to perform replace_config.
nxapi and scp-server must be included in any config files pushed to NXOS as these features are used to connect and copy the configuration files over to the device. Napalm makes API calls to copy the files over, get the diff and apply the config. By default Napalm expects a response to each API call within 60 seconds, so when applying large config files it is likely that this timeout will need to be increased.

    - name: "CFG >> Applying changes using replace config"
      napalm_install_config:
        provider: "{{ ans.creds_all }}"
        dev_os: "{{ ansible_network_os }}"
        timeout: 240
        config_file: "{{ ans.dir_path }}/{{ inventory_hostname }}/config/config.cfg"
        commit_changes: True
        replace_config: True
        diff_file: "{{ ans.dir_path }}/diff/{{ inventory_hostname }}.txt"
        get_diffs: True
      register: changes
    - debug: var=changes.msg.splitlines()
      tags: [diff]
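Because a replace will remove anything not in the file, it can be worth checking that the two features Napalm relies on are in the candidate before pushing it. This is only a sketch of that idea: missing_features is a hypothetical helper of my own, not part of Napalm or Ansible, and it assumes each feature is on its own line exactly as written.

```python
# Features Napalm needs to stay enabled after a config replace
REQUIRED_FEATURES = ("feature nxapi", "feature scp-server")


def missing_features(config_text):
    """Return the required feature lines that are absent from the config."""
    present = {line.strip() for line in config_text.splitlines()}
    return [feature for feature in REQUIRED_FEATURES if feature not in present]


# Example: a candidate that forgot scp-server
print(missing_features("feature nxapi\nfeature bgp\n"))
```

An empty list means both features are present in the candidate.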

The command order in the desired state config file does not have to match the current configuration, NXOS is smart enough to compare commands. The only gotcha is when a line of config relies on another part of the configuration, such as a VLAN needing to be created before it is applied to an interface.

These two lines MUST be at the start of the configuration file or the deployment will fail.

!Command: Checkpoint cmd vdc 1*       NXOS won’t recognize candidate_config.txt as a checkpoint file without it
version 9.2(4) Bios:version*          Without it NXOS does 'no hostname', which causes a failure due to 'Syntax error while parsing 'vdc DC1-N9K-LEAF01 id 1''

If !Command: Checkpoint cmd vdc 1 is missing the deployment will fail with a 500 response code. The error on the device will be:

ERROR: Rollback patch computation failed due to the following reason(s)
The checkpoint file was not created using checkpoint CLI
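A quick local check for the two header lines can save a failed run. The sketch below is a hypothetical helper (has_checkpoint_header is my own name) and it assumes the two lines are the first two non-blank lines of the file, which is how my templates are laid out.

```python
CHECKPOINT_MARKER = "!Command: Checkpoint cmd vdc 1"


def has_checkpoint_header(config_text):
    """Check that the first two non-blank lines look like a checkpoint header."""
    lines = [line.strip() for line in config_text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    # Assumption: the marker line comes first, the version line directly after it
    return lines[0].startswith(CHECKPOINT_MARKER) and lines[1].startswith("version ")


good = "!Command: Checkpoint cmd vdc 1\nversion 9.2(4) Bios:version\nhostname DC1-N9K-LEAF01\n"
print(has_checkpoint_header(good))
```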

When Napalm is run to replace the configuration (replace_config: true) the following commands are applied on the device:

  • scp -t bootflash:/candidate_config.txt
  • delete bootflash:/sot_file
  • checkpoint file bootflash:/sot_file
  • checkpoint file bootflash:/rollback_config.txt
  • rollback running-config file bootflash:/candidate_config.txt
  • copy running-config startup-config

When using the replace method you need to get used to failure, it is going to happen a lot in the early stages and when adding new features. A lot of the issues arise from the hidden default configuration, so it is helpful to use show run all and/or show file sot_file to work out what the full configuration should look like. The two most common failure scenarios you will come across are:

  • Something stopped the configuration being deployed on NXOS and the change was reverted
fatal: [DC1-N9K-SPINE01]: FAILED! => {"changed": false, "msg": "cannot install config: Invalid status code returned on NX-API POST\ncommands: ['terminal dont-ask', 'rollback running-config file candidate_config.txt', 'no terminal dont-ask']\nstatus_code: 500"}

  • Lost access to the device (commands issued broke your access) or it took longer than 60 seconds to apply (likely config was still applied)
fatal: [DC1-N9K-SPINE02]: FAILED! => {"changed": false, "msg": "cannot install config: HTTPSConnectionPool(host='10.10.108.12', port=443): Read timed out. (read timeout=60)"}
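The two messages can be told apart by simple string matching on the msg field. This is a minimal sketch, not something Napalm provides: classify_failure is a hypothetical helper and the returned wording is my own.

```python
import re


def classify_failure(msg):
    """Map a napalm_install_config failure message to a likely cause."""
    if "status_code: 500" in msg:
        return "rejected: NXOS could not apply the candidate and reverted"
    timeout = re.search(r"read timeout=(\d+)", msg)
    if timeout:
        return f"timeout: no reply within {timeout.group(1)}s, config may still have applied"
    return "unknown: check the device logs"


timed_out = ("cannot install config: HTTPSConnectionPool(host='10.10.108.12', "
             "port=443): Read timed out. (read timeout=60)")
print(classify_failure(timed_out))
```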

Use show rollback status to see how long the API call actually took to apply and adjust the Napalm timeout accordingly.

DC1-N9K-LEAF01# show rollback status
Last operation : Rollback to file
Details:
  Rollback type: atomic candidate_config.txt
  Start Time: Sun Sep 13 09:35:27 2021
  End Time: Sun Sep 13 09:38:07 2021
  Operation Status: Success
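To turn that output into a number for the timeout, the start and end times can be subtracted. A small sketch, assuming the timestamp format is always as shown above (rollback_duration_seconds is a hypothetical helper):

```python
from datetime import datetime

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sun Sep 13 09:35:27 2021"


def rollback_duration_seconds(status_output):
    """Return the seconds between 'Start Time' and 'End Time' in the output."""
    times = {}
    for line in status_output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in ("Start Time", "End Time"):
            times[key] = datetime.strptime(value.strip(), TIME_FORMAT)
    return int((times["End Time"] - times["Start Time"]).total_seconds())


sample = """Last operation : Rollback to file
Details:
  Rollback type: atomic candidate_config.txt
  Start Time: Sun Sep 13 09:35:27 2021
  End Time: Sun Sep 13 09:38:07 2021
  Operation Status: Success"""

print(rollback_duration_seconds(sample))
```

For the output above this gives 160 seconds, so the default 60 second timeout would not have been enough.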

For failures the best bet is to log into the NXOS device and see if you can work it out from the logs and files. The sot_file, candidate_config.txt and rollback_config.txt files are created whenever Napalm runs.

show file sot_file                                                  Device configuration, equivalent of show run all
show file candidate_config.txt                                      Configuration transferred by Napalm that was to be applied
show file rollback_config.txt                                       Rollback config created before applying the change (is the same as sot_file)

show diff rollback-patch file sot_file file candidate_config.txt    Check the difference between the device config and the config file
rollback running-config file rollback_config.txt                    To roll back the configuration
rollback running-config file candidate_config.txt verbose           Manually do the replace_config, verbose shows the cmds entered live

Some useful commands to see what happened when an attempt at deploying configuration was made:

show accounting log                See all the commands run on NXOS
show rollback status               Details on whether last install was a success or fail and the time it took
show rollback log exec             Line-by-line the commands applied and possibly the command that made it fail
show rollback log verify           Result of verification actual config is what was declared in applied config

Troubleshooting deployment failures

The first step is to use a combination of show rollback log verify and show rollback log exec to see if the reason for failure is obvious.

show rollback log verify will show what configuration was missing before the change was rolled back. The output from this command is not always clear, especially when applying lots of configuration. The below output shows that the command boot nxos bootflash:/nxos.9.2.4.bin was applied (in running config) but the expected command (in the applied config file) was boot nxos bootflash:/nxos.9.2.4.bin sup-1, so the running config was missing sup-1. This was a mistake on my part: later NXOS versions require sup-1 and I had forgotten to take it out of my templates.

DC1-N9K-LEAF01# show rollback log verify
Operation            : Rollback to Checkpoint File
Checkpoint file name : /candidate_config.txt
Scheme               : bootflash
Rollback done By     : admin
Rollback mode        : atomic
Verbose              : disabled
Start Time           : Fri, 21:49:28 18 Sep 2020
Start Time UTC       : Fri, 21:49:28 18 Sep 2020
End Time             : Fri, 21:57:19 18 Sep 2020
End Time UTC         : Fri, 21:57:19 18 Sep 2020
Status               : Failed

Verification patch contains the following commands:
---------------------------------------------------
!!
Configuration To Be Removed Present in Running-config
=====================================================
!
boot nxos bootflash:/nxos.9.2.4.bin
Configuration To Be Added Missing in Running-config
===================================================
!
boot nxos bootflash:/nxos.9.2.4.bin sup-1
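When the verification patch is long, it can help to split it into the two lists. The sketch below is a hypothetical parser (parse_verify_patch is my own name) that relies only on the two section headings shown in the output above.

```python
def parse_verify_patch(log_text):
    """Split a verification patch into commands to remove and commands to add."""
    sections = {"remove": [], "add": []}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("Configuration To Be Removed"):
            current = "remove"
        elif line.startswith("Configuration To Be Added"):
            current = "add"
        elif current and line and not set(line) <= {"!", "=", "-"}:
            # Skip separator lines made up only of !, = or - characters
            sections[current].append(line)
    return sections
```

Comparing the remove and add lists side by side makes a near-duplicate pair like the boot nxos lines above much easier to spot.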

In this situation show rollback log exec wouldn’t give any indication of what the problem was, as the command doesn’t cause the CLI to raise an error.

switch# show rollback log exec
Operation            : Rollback to Checkpoint File
Checkpoint file name : /candidate_config.txt
Scheme               : bootflash
Rollback done By     : admin
Rollback mode        : atomic
Verbose              : disabled
Start Time           : Sat, 15:12:46 19 Sep 2020
Start Time UTC       : Sat, 15:12:46 19 Sep 2020
End Time             : Sat, 15:15:28 19 Sep 2020
End Time UTC         : Sat, 15:15:28 19 Sep 2020
Rollback Status      : Failed
Restoring Previous Config : Success

Executing Patch:
----------------
`config t `
`interface Ethernet1/128`
`shutdown`
`exit`

.....
`boot nxos bootflash:/nxos.9.2.4.bin sup-1`
Performing image verification and compatibility check, please wait....
`interface Ethernet1/5`
`no shutdown`
`interface Ethernet1/6`
`no shutdown`
`exit`

If the image didn’t exist then this would be shown in show rollback log exec, as that does cause the CLI to raise an error.

`crypto key param rsa label DC1-N9K-LEAF01.stesworld.com modulus 2048`
`boot nxos bootflash:/nxos.9.2.3.bin sup-1`
Image provided does not exist.
Failed to set the boot variable: image not found (0x40450008)

Retrying Rollback Patch:
----------------
`config t `
`interface Ethernet1/6`
`no switchport trunk allowed vlan`
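In the exec log the commands are wrapped in back-ticks and anything the CLI prints back is not, so the commands that produced output can be picked out. A rough sketch under that assumption (commands_with_output is a hypothetical helper, and not every line of output is an error, so the result still needs reading):

```python
def commands_with_output(exec_log):
    """Pair each back-ticked command with any plain-text lines that follow it."""
    pairs = []
    for raw in exec_log.splitlines():
        line = raw.strip()
        if len(line) > 1 and line.startswith("`") and line.endswith("`"):
            pairs.append((line.strip("` "), []))
        elif pairs and line and not set(line) <= {"-", "."} and not line.endswith(":"):
            # Plain text after a command is CLI output, section headings end with ':'
            pairs[-1][1].append(line)
    return [(cmd, out) for cmd, out in pairs if out]
```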

If you can’t find the source of the problem from either of these commands, I find the best thing to do is download the base config file locally and manually compare it against what you are trying to deploy. The problem is normally some hidden commands that you have forgotten about.

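from napalm import get_network_driver

# Connect to the device with the nxos driver: hostname, username, password
driver = get_network_driver('nxos')
device = driver('10.10.108.21', 'admin', 'ansible')
device.open()

# Save the checkpoint (base config) locally, _get_checkpoint_file() is a private driver method
with open("base_config.txt", mode='w') as x:
    x.write(device._get_checkpoint_file())

device.close()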
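Once the base config is saved locally, the manual comparison can be done with Python's standard difflib module instead of by eye. This is a minimal sketch: config_diff is a hypothetical helper and the config.cfg label is just an example name for the file you are trying to deploy.

```python
import difflib


def config_diff(base_text, intended_text):
    """Return unified-diff lines between the device config and the intended config."""
    base = [line.rstrip() for line in base_text.splitlines()]
    intended = [line.rstrip() for line in intended_text.splitlines()]
    return list(difflib.unified_diff(base, intended, fromfile="base_config.txt",
                                     tofile="config.cfg", lineterm=""))


base = "interface Ethernet1/5\n  !#no shutdown\n"
intended = "interface Ethernet1/5\n  shutdown\n"
for line in config_diff(base, intended):
    print(line)
```

Lines starting with - are only on the device and lines starting with + are only in the file being deployed, which is usually where the forgotten hidden commands show up.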

Some common issues I have come across so far:

  • Trunk ports have to use switchport trunk allowed vlan 1-4094 instead of switchport trunk allowed vlan all
  • !#switchport trunk allowed vlan 1-4094 is required even if the interface is switchport mode access
  • Make sure all used interfaces have ‘!#no shutdown’. It is a hidden command so you won’t see it in show run (NXOS hides commands by prefixing them with !#)
  • Referencing source-interfaces in the configuration before they have been created. Put source-interfaces nearer the end of the config file
  • For port-channels only the port-channel (not the physical interface) has the switchport mode command. If it is a trunk, switchport trunk allowed vlan x is also only on the port-channel
  • The physical interface for port-channels has to be in the format channel-group 27 force mode active, with the keyword force
  • There is no need to have the vlan 1, 10, 20, 30 command for all the VLANs created, they are all entered vertically with the name under the VLAN
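A couple of the issues in the list above are easy to catch before deploying with a simple text check. The sketch below covers only two of them (vlan all and a channel-group without force). lint_candidate is a hypothetical helper and the checks are plain string matches, so treat it as a starting point.

```python
import re


def lint_candidate(config_text):
    """Flag 'vlan all' and channel-groups without 'force' in a candidate config."""
    warnings = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if line.endswith("switchport trunk allowed vlan all"):
            warnings.append((number, "use 'vlan 1-4094' instead of 'vlan all'"))
        if re.match(r"channel-group \d+ mode ", line):
            warnings.append((number, "channel-group is missing the 'force' keyword"))
    return warnings


candidate = ("interface Ethernet1/1\n"
             "  switchport trunk allowed vlan all\n"
             "  channel-group 27 mode active\n")
for line_number, warning in lint_candidate(candidate):
    print(line_number, warning)
```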

Dealing with interfaces

If the interface is an access port (including port-channels) it always needs this command or the deployment will fail. The only exceptions to this are interfaces that are Layer3 ports.

!#switchport trunk allowed vlan 1-4094

For example, a dual-homed access port would look like this. Notice how switchport mode access is only on the port-channel:

interface Ethernet1/13
  description ACCESS >DC1-SRV-APP01 eth1
  spanning-tree port type network
  !#switchport
  switchport access vlan 10
  channel-group 13 force mode active
  no shutdown

interface Port-channel13
  description ACCESS >DC1-SRV-APP01 eth1
  spanning-tree port type network
  !#switchport
  !#switchport trunk allowed vlan 1-4094
  switchport access vlan 10
  switchport mode access
  no shutdown

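Dual-homed trunk ports can’t have switchport mode or allowed vlans under the ethernet interface or the deployment will fail with the error below. The configuration that follows the error shows the working layout, with both commands only on the port-channel: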

Retrying Rollback Patch:
----------------
`config t `
`interface Ethernet1/14`
`switchport trunk allowed vlan 110, 120`
Syntax error while parsing 'switchport trunk allowed vlan 110, 120'

interface Ethernet1/14
  description UPLINK > DC1-VIOS-SW1
  spanning-tree port type network
  !#switchport
  channel-group 17 force mode active
  no shutdown

interface Port-channel17
  description UPLINK > DC1-VIOS-SW1
  spanning-tree port type network
  !#switchport
  switchport trunk allowed vlan 110,120
  switchport mode trunk
  vpc 17
  no shutdown
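The rule that port-channel members must not carry switchport mode or allowed vlans can also be checked before a run. A rough sketch, assuming interface blocks start at column 0 and their commands are indented as in the examples above (member_interface_errors is a hypothetical helper):

```python
def member_interface_errors(config_text):
    """Find switchport mode / allowed vlan lines on port-channel member interfaces."""
    blocks, current = {}, None
    for raw in config_text.splitlines():
        if raw.startswith("interface "):
            current = raw.split(None, 1)[1].strip()
            blocks[current] = []
        elif current and raw.startswith(" ") and raw.strip():
            blocks[current].append(raw.strip())
        elif raw.strip():
            current = None  # an unindented line ends the interface block
    errors = []
    for name, lines in blocks.items():
        # Only interfaces with a channel-group line are port-channel members
        if not any(line.startswith("channel-group ") for line in lines):
            continue
        for line in lines:
            if line.startswith(("switchport mode", "switchport trunk allowed vlan")):
                errors.append((name, line))
    return errors
```

Hidden commands prefixed with !# are deliberately ignored, as those are expected on the member interfaces.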

Merge

Napalm implements merges by simply applying the configuration line by line, it doesn’t use the checkpoint rollback functionality. This means that the changes made are not atomic, and to delete something you have to specifically define the configuration to do so.

    - name: "CFG >> Merging changes with current config"
      napalm_install_config:
        provider: "{{ ans.creds_all }}"
        dev_os: "{{ ansible_network_os }}"
        timeout: 60
        config_file: "{{ ans.dir_path }}/{{ inventory_hostname }}/config/config.cfg"
        commit_changes: True
        diff_file: "{{ ans.dir_path }}/diff/{{ inventory_hostname }}.txt"
        get_diffs: True
      register: changes
    - debug: var=changes.msg.splitlines()
      tags: [diff]

The commands that it runs are as follows:

  • checkpoint file bootflash:/rollback_config.txt
  • Applies merge config line by line.
  • copy running-config startup-config

Diffs for merges are simply the lines in the merge candidate config. It is not going to show you any differences unless you are specifically deleting (using no) something from the config.
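Since a merge only ever adds lines, anything that needs removing has to go into the merge file in its no form. As a very small sketch of building those lines (negate is a hypothetical helper, and it assumes the plain no prefix is valid for the command, which is not true of every NXOS command):

```python
def negate(command):
    """Return the 'no' form of a command, or strip 'no' if it is already negated."""
    indent = command[: len(command) - len(command.lstrip())]
    body = command.strip()
    if body.startswith("no "):
        return indent + body[3:]
    return indent + "no " + body


print(negate("  switchport access vlan 10"))
```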