Automate Leaf and Spine Deployment - Part6

post validation

23 March 2021   14 min read

The 6th post in the ‘Automate Leaf and Spine Deployment’ series goes through validating the fabric once deployment is complete. A desired state validation file is built from the contents of the variable files and compared against the devices' actual state to determine whether the fabric, and all the services that run on top of it, comply.

The desired state is what we expect the fabric to be in once it has been deployed. A declaration of how the fabric should be built is made in the variable files, so it makes sense that these are used to build the desired state validation file. The napalm_validate desired state file is a list of dictionaries; each key is the napalm_getter being checked and the value is the expected result.

- get_bgp_neighbors:
    global:
      router_id: 192.168.101.16
      peers:
        _mode: strict
        192.168.101.11:
          is_enabled: true
          is_up: true

The napalm_getters gather the actual state and produce a compliance report in JSON format. The majority of elements built when deploying the fabric don't have getters, so to validate these I built a filter plugin that uses the same napalm_validate framework but with the actual state supplied from a static input file (created with napalm_cli). The desired state dictionary is similar, but the keys are commands rather than getters.

cmds:
  - show ip ospf neighbors detail:
      192.168.101.11:
        state: FULL
      192.168.101.12:
        state: FULL
      192.168.101.22:
        state: FULL
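The compare step at the heart of both engines can be illustrated with a highly simplified pure-Python stand-in. This is only a sketch of the idea, assuming flat desired/actual dictionaries; the real napalm_validate compare also handles arbitrary nesting, `_mode: strict` and comparison operators such as `'>=1'`.

```python
# Simplified stand-in for the compare logic (illustrative only, not napalm's code)
def simple_compare(desired, actual):
    # Record every desired key whose actual value differs or is missing
    diff = {key: {"expected": val, "got": actual.get(key)}
            for key, val in desired.items() if actual.get(key) != val}
    return {"complies": not diff, "diff": diff}

desired = {"192.168.101.11": {"state": "FULL"}}
actual = {"192.168.101.11": {"state": "FULL"}, "192.168.101.12": {"state": "INIT"}}
print(simple_compare(desired, actual))  # complies: True (extra peers ignored without strict mode)
```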

Both validation engines live within the same validate role but have separate template and task files. The templates generate the desired state, which is a combination of the getters or commands and the expected output. As the same getter or command can be used to validate multiple roles (base, fabric, tenant, etc.) with different expected outcomes, Jinja template inheritance (block and extends) is needed. These validation engines create a combined compliance report that is then used to decide whether the post-validation checks comply.



napalm_validate

As Napalm is vendor agnostic, the Jinja template file used to create the validation file is the same for all vendors. The following elements are validated by napalm_validate, with the roles being validated in brackets.

  • hostname (fbc): Automatically created device names are correct
  • lldp_neighbors (fbc): Devices physical fabric and MLAG connections are correct
  • bgp_neighbors (fbc, tnt): Overlay neighbors are all up (strict). fbc doesn’t check for sent/rcv prefixes, this is done by tnt

There was originally ICMP validation of all the loopbacks, but it took too long to complete. At the end of the day, if a loopback was not up you would find out through one of the other tests, so I decided not to use it (the config is left in but hashed out).

Rendering the template is pretty standard stuff.

- name: "SYS >> Creating bse_fbc napalm_validate validation file"
  template:
    src: napalm/bse_fbc_val_tmpl.j2
    dest: "{{ ans.dir_path }}/{{ inventory_hostname }}/validate/napalm_desired_state.yml"
  changed_when: False
  tags: bse_fbc

The thing to highlight in the Jinja templates is the use of inheritance with block and extends. The template where the inherited configuration is to be added (bse_fbc_val_tmpl.j2) has a block statement. It is only populated with data if that block is defined in another Jinja template; if it is not, it is ignored.

{% block get_bgp_neighbors %}
{% endblock %}

The template the data is inherited from (svc_tnt_val_tmpl.j2) has an extends statement, which defines the template that will inherit from this one, and a block statement to identify the data that will be extended to (inherited by) the other template. The block statement must have the same name in both templates and can only be used once.

{% extends "napalm/bse_fbc_val_tmpl.j2" %}

{% block get_bgp_neighbors %}
          address_family:
            l2vpn:
              received_prefixes: '>=1'
{% endblock %}

The extra BGP information in this block statement needs to be added for each peer. To get around the problem of a block statement only being usable once, it is enclosed in a macro, which allows it to be inserted multiple times (for the spine and non-spine switches). This BGP inheritance is needed because once tenants have been added to the fabric the number of BGP sent/received prefixes will no longer be 0.

{%- macro macro_get_bgp_neighbors() -%}
{% block get_bgp_neighbors %}{% endblock %}
{%- endmacro -%}

- get_bgp_neighbors:
    global:
      router_id: {{ intf_lp[0].ip |ipaddr('address')  }}
      peers:
        _mode: strict
{% if bse.device_name.spine in inventory_hostname %}
{% for x in groups[bse.device_name.leaf.split('-')[-1].lower()] + groups[bse.device_name.border.split('-')[-1].lower()] %}
        {{ hostvars[x].intf_lp[0].ip |ipaddr('address') }}:
          is_enabled: true
          is_up: true
{{ macro_get_bgp_neighbors() }}
{% endfor %}
{% else %}
{% for x in groups[bse.device_name.spine.split('-')[-1].lower()] %}
        {{ hostvars[x].intf_lp[0].ip |ipaddr('address') }}:
          is_enabled: true
          is_up: true
{{ macro_get_bgp_neighbors() }}
{% endfor %}
{% endif %}

Neither template needs calling directly; calling the template with the extends statement automatically renders the template it inherits from. For example, running post_validate with the full tag only needs to call the service_route template (svc_rte_val_tmpl.j2); all of the others extended below it (bse_fbc_val_tmpl.j2 and svc_tnt_val_tmpl.j2) are rendered automatically.
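This extends/block behaviour can be demonstrated outside Ansible with a few lines of Python and Jinja2. The inline templates below are made up for illustration; rendering only the child template (the one with the extends statement) also renders everything it inherits from the parent.

```python
from jinja2 import Environment, DictLoader

# Made-up two-template hierarchy mirroring the parent/child relationship
templates = {
    "bse_fbc.j2": "- get_bgp_neighbors:\n{% block get_bgp_neighbors %}{% endblock %}",
    "svc_tnt.j2": (
        "{% extends 'bse_fbc.j2' %}"
        "{% block get_bgp_neighbors %}    received_prefixes: '>=1'{% endblock %}"
    ),
}
env = Environment(loader=DictLoader(templates))
# Rendering the child template pulls in the parent's content automatically
rendered = env.get_template("svc_tnt.j2").render()
print(rendered)
```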

- name: "NAP_VAL >> Creating bse_fbc, svc_tnt and svc_rte validation file"
  template:
    src: napalm/svc_rte_val_tmpl.j2
    dest: "{{ ans.dir_path }}/{{ inventory_hostname }}/validate/napalm_desired_state.yml"
  changed_when: False
  tags: [bse_fbc_tnt, bse_fbc_tnt_intf, full]

The remaining tasks in nap_val require ignore_errors to allow the playbook to continue and create the compliance_report even if validation fails; without this, Ansible would stop the playbook at the failure and never create the report.

- name: "Create napalm compliance report"
  block:
  - name: "NET >> Validating LLDP and BGP (napalm_validate)"
    napalm_validate:
      provider: "{{ ans.creds_all }}"
      dev_os: "{{ ansible_network_os }}"
      validation_file: "{{ ans.dir_path | fix_home_path() }}/{{ inventory_hostname }}/validate/napalm_desired_state.yml"
    register: nap_val
    ignore_errors: yes

  - name: "PASS >> Saving compliance report to {{ ans.dir_path }}/reports/"
    copy: content="{{ nap_val.compliance_report }}" dest={{ ans.dir_path }}/reports/{{ inventory_hostname }}_compliance_report.json
    changed_when: False
    ignore_errors: yes

custom_validate

custom_validate requires a per-OS-type template file and a per-OS-type method within the custom_validate.py filter_plugin. The following elements are validated by custom_validate, with the roles being validated in brackets.

  • show ip ospf neighbors detail (fbc): Underlay neighbors are all up (strict)
  • show port-channel summary (fbc, intf): Port-channel state and members (strict) are up
  • show vpc (fbc, tnt, intf): MLAG peer-link, keep-alive state, vpc status and active VLANs
  • show interfaces trunk (fbc, tnt, intf): Allowed vlans and STP forwarding vlans
  • show ip int brief include-secondary vrf all (fbc, tnt, intf): Layer3 interfaces in fabric and tenants
  • show nve peers (tnt): All VTEP tunnels are up
  • show nve vni (tnt): All VNIs are up, have correct VNI number and VLAN mapping
  • show interface status (intf): State and port type
  • show ip ospf interface brief vrf all (rte): Tenant OSPF interfaces are in correct process, area and are up
  • show bgp vrf all ipv4 unicast (rte): Prefixes advertised by network and summary are in the BGP table
  • show ip route vrf all (rte): Static routes are in the routing table with correct gateway and AD

custom_validate validates the network state on a per-command rather than per-getter basis. The desired_state.yml file has a top-level cmds dictionary that holds a list of {command: desired_state} dictionaries. As it uses napalm_validate to do the validation, the desired_state structure is very much the same.

The rendering of the templates is pretty simple for the base and fabric roles, as the inventory plugin has already created the data models. The service roles are more complicated, as the role's filter_plugin first needs to be run to generate the data models before the templates can be rendered. See the fabric services post for more details on these role filter_plugins.

- name: "BLK >> bse_fbc and svc_tnt custom_validate validation files"
  block:
  - set_fact:
      flt_svc_tnt: "{{ svc_tnt.tnt |create_svc_tnt_dm(svc_tnt.adv, fbc.adv.mlag.peer_vlan, svc_rte.adv.redist.rm_name
                      | default(svc_tnt.adv.redist.rm_name)) }}"
  - name: "CUS_VAL >> Creating {{ ansible_network_os }} bse_fbc and svc_tnt validation file"
    template:
      src: "{{ ansible_network_os }}/svc_tnt_val_tmpl.j2"
      dest: "{{ ans.dir_path }}/{{ inventory_hostname }}/validate/{{ ansible_network_os }}_desired_state.yml"
    changed_when: False
  tags: bse_fbc_tnt

custom_validate uses Jinja inheritance a lot more extensively than napalm_validate. There can only be one extends statement per template, and any block of the same name overrides the lower-level template's block values. To merge rather than override, super() is called at the top of the child's block to render the contents of the parent's block, giving back the results of the parent template as well as the current one. This is used in the svc_intf_val_tmpl.j2 template for show vpc, show interface trunk and show ip interface brief, as it adds to the desired state for the same commands from the bse_fbc_val_tmpl.j2 and svc_tnt_val_tmpl.j2 templates.

{## show interfaces_trunk ##}
{% block show_int_trunk %}
{{ super() }}
{% for intf in flt_svc_intf %}{% if 'trunk' in intf.type %}
      {{ intf.intf_num  }}:
        allowed_vlans: {{ intf.ip_vlan }}
{% if intf.po_num is not defined %}
        stpfwd_vlans: {{ intf.ip_vlan }}
{% else %}
        stpfwd_vlans: none
{% endif %}{% endif %}{% endfor %}
{% endblock %}
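The super() merge behaviour is easy to demonstrate in isolation. The hypothetical parent/child pair below shows how calling super() at the top of the child's block keeps the parent's desired state for the same command rather than overriding it.

```python
from jinja2 import Environment, DictLoader

# Made-up templates: both define a block of the same name for the same command
templates = {
    "parent.j2": "{% block show_vpc %}peer_status: peer-ok\n{% endblock %}",
    "child.j2": (
        "{% extends 'parent.j2' %}"
        # super() renders the parent's block content before adding the child's
        "{% block show_vpc %}{{ super() }}vpc_count: 2{% endblock %}"
    ),
}
out = Environment(loader=DictLoader(templates)).get_template("child.j2").render()
print(out)  # parent's line first, then the child's addition
```

Removing the `{{ super() }}` call would make the child's block replace the parent's entirely, which is the default Jinja behaviour.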

The device's actual_state is collected using napalm_cli (in JSON format) and stored in a variable (output). Ansible loops through the cmds list calling each dictionary key ('| list | first' converts the dict_keys type to a string), adds '| json' to the end of the command and uses napalm_cli to run it. Loop control ensures that only the command, not the output, is printed to screen.

  - include_vars: "{{ ans.dir_path }}/{{ inventory_hostname }}/validate/{{ ansible_network_os }}_desired_state.yml"
  - name: "NET >> Gathering actual state from the devices"
    napalm_cli:
      provider: "{{ ans.creds_all }}"
      dev_os: "{{ ansible_network_os }}"
      args:
        commands:
          - "{{ item.keys() | list | first }} | json"
    register: output
    loop: "{{ cmds }}"
    loop_control:
      label: "{{ item.keys() | list | first }}"

The collected actual_state (output variable), desired_state (nxos_desired_state.yml), parent directory location, hostname and device OS are fed into the custom_validate filter plugin, which runs the validation (using the napalm_validate engine) and creates a compliance report.

- name: "CUS_VAL >> Validating and saving compliance report to {{ ans.dir_path }}/reports/"
  set_fact:
    validate_result: "{{ cmds | custom_validate(output.results, ans.dir_path, inventory_hostname, ansible_network_os) }}"

custom_validate filter plugin

The custom_validate plugin imports the napalm validate and ValidationException modules to handle validation and error checking.

from napalm.base import validate
from napalm.base.exceptions import ValidationException

It is made up of three methods:

  • nxos_dm: Formats the command output received from the device into the same format as the desired_state.yml file ready for validation. The idea is that you would have different methods for the different vendors with the method chosen based on OS type
  • compliance_report: Runs napalm_validate validate.compare method using a static actual_state variable rather than getters. It produces a compliance report with a per-command and overall complies state (True or False)
  • custom_validate: The plugin ‘engine’ that takes the input arguments from Ansible, gathers the formatted actual_state from nxos_dm before running the compliance_report method with these arguments

Compliance reporting

The napalm_validate (nap_val.yml) and custom_validate (cus_val.yml) results are combined into one compliance report saved in /device_configs/reports. Each getter or command has its own complies dictionary (True or False), which feeds into the compliance report's overall complies value. It is based on this value that a task in the main playbook raises an exception if compliance fails.

    - name: "Loads validation report and checks whether it complies (passed)"
      block:
      - include_vars: "{{ ans.dir_path }}/reports/{{ inventory_hostname }}_compliance_report.json"
      - name: "POST_VAL >> Compliance check failed"
        fail:
          msg: "Non-compliant state encountered. Refer to the full report in {{ ans.dir_path }}/reports/{{ inventory_hostname }}_compliance_report.json"
        when: not complies
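The roll-up from per-check results to the report-level complies flag can be sketched in a few lines. This is illustrative only (not the playbook's exact code), using a made-up two-entry report:

```python
# Made-up combined report: one passing getter, one failing command
report = {
    "get_bgp_neighbors": {"complies": True},
    "show vpc": {"complies": False},
}
# The overall flag is True only if every per-check complies value is True
report["complies"] = all(check["complies"] for check in report.values())
print(report["complies"])  # → False
```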

Running Post Validation

Running post-validation is hierarchical, as the addition of elements in later roles affects the validation outputs of earlier roles. For example, if extra VLANs or port-channels are added in the services roles, this will affect the bse_fbc post-validate output of show vpc and show port-channel summary. For this reason, post-validation must be run for the current role and all roles before it. This is done automatically by Jinja template inheritance, as calling a template with the extends statement also renders the templates it inherits from.

Tags can be used to selectively decide which post-validation tests are performed. There is no differentiation between napalm_validate and custom_validate; both are run as part of the validation tasks.

  • bse_fbc: Runs validation of the base fabric, does not validate the services on it
  • bse_fbc_tnt: Runs validation of the base fabric and the tenants on the fabric
  • bse_fbc_tnt_intf: Runs validation of the base fabric, tenants and additional non-fabric (service) interfaces
  • full: Runs validation of the base fabric, tenants, additional non-fabric (service) interfaces and routing

Run fabric validation: runs validation against the desired state derived from all the variable files

ansible-playbook PB_post_validate.yml -i inv_from_vars_cfg.yml --tag full

Viewing the compliance report: piping the validation report through json.tool makes it more human-readable

cat ~/device_configs/reports/DC1-N9K-SPINE01_compliance_report.json | python -m json.tool

Creating new custom validations

The custom_val_builder directory is designed to make it easy to create and test new validations before adding them to the playbook's custom_validate role. Tags allow it to be run in different ways as you walk through the stages of building the desired and actual state files and eventually creating a compliance report.

  • Discovery (disc): Run the commands from desired_state.yml against a device and print the raw output
  • Template (tmpl): Render the contents of val_tmpl.j2 to desired_state.yml and print it to screen
  • Data-model (dm): Run cmds like discovery, but the output is fed through device_dm and the new DM printed to screen
  • Report from file (rpt_file): Build the compliance report using desired_state.yml and a static DM in the file file_output.json
  • Report from cmd (report): Build the compliance report using desired_state.yml and a dynamic DM created from the device output

The desired state (desired_state.yml) is rendered from the val_tmpl.j2 template. It is a list of dictionaries under the main cmds dictionary, with the key being the command and the value the desired state. The amount of dictionary nesting is determined by the sub-layers of the command output that you are validating.

cmds:
  - show ip ospf interface brief vrf all:
{% for each_proc in svc_rte.ospf %}{% if each_proc.interface is defined %}
{# Checks if local host is in the switches defined under interface, if not then checks under process #}
{% for each_intf in each_proc.interface %}{% if inventory_hostname in each_intf.switch |default(each_proc.switch) %}
{% for each_name in each_intf.name %}
      {{ each_name | replace(fbc.adv.bse_intf.intf_fmt, fbc.adv.bse_intf.intf_short) }}:
        proc: '{{ each_proc.process }}'
        area: {{ each_intf.area }}
        status: up
        nbr_count: '>=1'
{% endfor %}{% endif %}
{% endfor %}{% endif %}
{% endfor %}

Running ansible-playbook PB_val_builder.yml -i hosts --tag tmpl (using a static inventory file) produces the desired state:

cmds:
  - show ip ospf interface brief:
    xxxx

The actual state (from the device or file_output.json) is a dictionary in the same format as the desired state value, created from the returned device output (in JSON format) using a filter plugin. For example, the returned output for ansible-playbook PB_val_builder.yml -i hosts --tag disc would be:

ok: [DC1-N9K-BORDER02] => {
    "validate_result": {
        "show ip ospf interface brief vrf all | json": {
            "TABLE_ctx": {
                "ROW_ctx": [
                    {
                        "TABLE_intf": {
                            "ROW_intf": [
                                {
                                    "admin_status": "up",
                                    "area": "0.0.0.0",
                                    "cost": "100",
                                    "ifname": "Vlan2",
                                    "index": "3",
                                    "nbr_total": "1",
                                    "state_str": "P2P"
                                },

The device_dm method matches based on the command and manipulates the actual state into the correct data model structure. For some odd reason Cisco decided that NXOS will return a dictionary rather than a list if the output contains only one element. It might not catch you out to start with, but it will come back to bite at some point, so I use a method to always convert that dictionary into a list.

        def shit_nxos(main_dict, parent_dict, child_dict):
            if isinstance(main_dict[parent_dict][child_dict], dict):
                main_dict[parent_dict][child_dict] = [main_dict[parent_dict][child_dict]]
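The helper can be seen in action on made-up single-process output (the function is the author's, verbatim; the sample data is invented for illustration):

```python
def shit_nxos(main_dict, parent_dict, child_dict):
    # Author's helper: wrap a lone dict in a list so loops always get a list
    if isinstance(main_dict[parent_dict][child_dict], dict):
        main_dict[parent_dict][child_dict] = [main_dict[parent_dict][child_dict]]

# NXOS returns a dict (not a one-element list) when only one OSPF process exists
json_output = {"TABLE_ctx": {"ROW_ctx": {"ptag": "DC_UNDERLAY"}}}
shit_nxos(json_output, "TABLE_ctx", "ROW_ctx")
print(json_output["TABLE_ctx"]["ROW_ctx"])   # → [{'ptag': 'DC_UNDERLAY'}]
```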

Cisco NXOS JSON output generally follows the same format for all commands, but with different ‘TABLE_xxx’ and ‘ROW_xxx’ names dependent on the feature and nesting level. Below is the Python code in the filter_plugin for manipulating the output into the correct data model format.

            # OSPF_INT_BRIEF: Is a 1 deep nested dict {interfaces: {attribute: value}}
            elif "show ip ospf interface brief" in cmd:
                # Apply NXOS 'dict to list' fix in case there is only one OSPF process
                shit_nxos(json_output, 'TABLE_ctx', 'ROW_ctx')
                for each_proc in json_output['TABLE_ctx']['ROW_ctx']:
                    shit_nxos(each_proc, 'TABLE_intf', 'ROW_intf')

                    # Loops through interfaces in OSPF process and creates temp dict of each interface in format {intf_name: {attribute: value}}
                    for each_intf in each_proc['TABLE_intf']['ROW_intf']:
                        tmp_dict[each_intf['ifname']]['proc'] = each_proc['ptag']
                        tmp_dict[each_intf['ifname']]['area'] = each_intf['area']
                        tmp_dict[each_intf['ifname']]['status'] = each_intf['admin_status']
                        tmp_dict[each_intf['ifname']]['nbr_count'] = each_intf['nbr_total']
                # Adds the output gathered to the actual state dictionary
                actual_state["show ip ospf interface brief vrf all"] = dict(tmp_dict)
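Pulled out of the plugin, the same manipulation can be run as a self-contained sketch. The sample JSON is made up, and tmp_dict is initialised here as a collections.defaultdict(dict), which is presumably what the full plugin does given the nested assignments and the closing dict(tmp_dict) conversion:

```python
from collections import defaultdict

# Made-up, already list-normalised NXOS output for one process and one interface
json_output = {"TABLE_ctx": {"ROW_ctx": [
    {"ptag": "DC_UNDERLAY", "TABLE_intf": {"ROW_intf": [
        {"ifname": "Vlan2", "area": "0.0.0.0", "admin_status": "up", "nbr_total": "1"}]}}]}}

tmp_dict = defaultdict(dict)   # allows direct assignment of nested keys
for each_proc in json_output["TABLE_ctx"]["ROW_ctx"]:
    for each_intf in each_proc["TABLE_intf"]["ROW_intf"]:
        tmp_dict[each_intf["ifname"]]["proc"] = each_proc["ptag"]
        tmp_dict[each_intf["ifname"]]["area"] = each_intf["area"]
        tmp_dict[each_intf["ifname"]]["status"] = each_intf["admin_status"]
        tmp_dict[each_intf["ifname"]]["nbr_count"] = each_intf["nbr_total"]

actual_state = {"show ip ospf interface brief vrf all": dict(tmp_dict)}
print(actual_state)
```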

Running ansible-playbook PB_val_builder.yml -i hosts --tag dm does the data manipulation returning the actual state.

ok: [DC1-N9K-BORDER02] => {
    "validate_result": {
        "show ip ospf interface brief vrf all": {
            "Vlan2": {
                "area": "0.0.0.0",
                "nbr_count": "1",
                "proc": "DC_UNDERLAY",
                "status": "up"
            },
            "Eth1/1": {
                "area": "0.0.0.0",
                "nbr_count": "1",
                "proc": "DC_UNDERLAY",
                "status": "up"
            }
        }
    }
}

These two files are fed into the napalm_validate compare engine to produce a compliance report. Nothing special has been done here; napalm_validate is perfect for this, so I just use its method and feed in my own input.

from napalm.base import validate
from napalm.base.exceptions import ValidationException


    def compliance_report(self, desired_state, actual_state):
        report = {}
        # Feeds files into napalm_validate and produces a report
        for each_dict in desired_state:
            for cmd, desired_results in each_dict.items():
                try:
                    report[cmd] = validate.compare(desired_results, actual_state[cmd])
                # If validation couldn't be run on a command adds skipped key to the cmd dictionary
                except NotImplementedError:
                    report[cmd] = {"skipped": True, "reason": "NotImplemented"}

The report is run using ansible-playbook PB_val_builder.yml -i hosts --tag report, with the overall report result printed to screen along with a link to the report. You don't always have to run the report against a live device to get the actual state; it can be fed in from a file using the rpt_file tag.

Report Complies: True
View the report using cat files/compliance_report.json | python -m json.tool

More information on building custom validations can be found in the README of custom_val_builder.