VSS Recovery mode

vss recovery mode scenario

22 March 2018

Dual-active detection (DAD) is designed to prevent a split-brain scenario in which both VSS supervisors become active after a VSL link failure. It uses a secondary link, separate from the VSL, to communicate each device's state.
When the VSL fails, the standby switch becomes active. The previously active switch learns of this over the DAD links and goes into recovery mode to prevent a split-brain situation.

21:28:01.153 GMT: %VSLP-2-VSL_DOWN:   All VSL links went down while switch is in ACTIVE role
21:28:01.185 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi1/2/12 lost dual-active detection capability
21:28:01.187 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi1/3/12 lost dual-active detection capability
21:28:01.201 GMT: %SW_DA-1-DETECTION:  detected dual-active condition
21:28:01.201 GMT: %SW_DA-1-RECOVERY: Dual-active condition detected: Starting recovery-mode, all non-VSL interfaces have been shut down
21:28:01.624 GMT: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been lost
21:28:01.651 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi2/2/12 lost dual-active detection capability
21:28:01.651 GMT: %FASTHELLO-2-FH_DOWN:  Fast-Hello interface Gi2/3/12 lost dual-active detection capability
21:28:01.713 GMT: %C4K_REDUNDANCY-3-SIMPLEX_MODE: The peer Supervisor has been lost
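
If you collect these logs centrally, the sequence above is easy to check for in a script. The following is a minimal Python sketch, assuming each line has the "time [zone]: %FACILITY-SEVERITY-MNEMONIC: text" shape shown in the log above; the function names are my own and not part of any Cisco tooling.

```python
import re

# Assumed shape: "21:28:01.201 GMT: %SW_DA-1-DETECTION:  detected dual-active condition"
# The time zone is optional because some of the logs in this post omit it.
SYSLOG_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})(?: \w+)?: "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):\s*"
    r"(?P<text>.*)$"
)

def parse_line(line):
    """Return the syslog fields as a dict, or None if the line does not match."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None

def entered_recovery_mode(lines):
    """True if the log contains the %SW_DA-1-RECOVERY message."""
    events = (parse_line(line) for line in lines)
    return any(
        e is not None and e["facility"] == "SW_DA" and e["mnemonic"] == "RECOVERY"
        for e in events
    )

sample = [
    "21:28:01.153 GMT: %VSLP-2-VSL_DOWN:   All VSL links went down while switch is in ACTIVE role",
    "21:28:01.201 GMT: %SW_DA-1-DETECTION:  detected dual-active condition",
    "21:28:01.201 GMT: %SW_DA-1-RECOVERY: Dual-active condition detected: "
    "Starting recovery-mode, all non-VSL interfaces have been shut down",
]

for event in map(parse_line, sample):
    print(event["time"], event["facility"], event["mnemonic"])
print("recovery mode:", entered_recovery_mode(sample))
```

Matching on facility and mnemonic rather than on the free text keeps the check stable if the wording of a message changes between releases.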

When the switch goes into recovery mode, all ports except the VSL ports are shut down, although specific links can be excluded from the shutdown. The console prompt changes to (recovery-mode)#.

While in recovery mode, avoid configuration changes (don’t even type config t). Doing so marks the config as modified, which means manual intervention is required to bring the VSS back (saving the config on the standby and rebooting it).

Upon seeing the VSL ports come back up, the switch in recovery mode reloads itself and comes back as the standby chassis with all its ports up.

22:51:10.702 GMT: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 2.
22:51:10.800 GMT: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 1.
22:51:12.927 GMT: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/1, changed state to up
22:51:12.928 GMT: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to up
22:51:26.699 GMT: %VSLP-5-VSL_UP:  Ready for control traffic
22:51:29.699 GMT: %SW_DA-1-VSL_RECOVERED: VSL has recovered during dual-active situation: Reloading switch 1
22:51:29.720 GMT: %VSLP-5-RRP_MSG: Role change from Active to Standby and hence need to reload
22:51:29.720 GMT: %VSLP-5-RRP_MSG: Reloading the system...%Unable to initiate reload in peer.
22:51:30.589 GMT: %RF-5-RF_RELOAD: Shelf reload. Reason: dual-active
22:51:31.563 GMT: %SYS-5-RELOAD: Reload requested by VS. Reload Reason: dual-active.
22:51:31.607 GMT: %SYS-3-LOGGER_FLUSHED: System was paused for 00:00:01 to ensure console debugging output.

[Sat Feb 10 22:51:32 2018] Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]

A successful recovery should show the following message near the end of the bootup process.

Initializing as Virtual Switch STANDBY processor

*       STANDBY SUPERVISOR        *
*     REDUNDANCY mode is SSO      *
*        Continue bootup          *

If the VSS has unsaved configuration when it goes into recovery mode, that switch will NOT automatically reload once the VSL links are restored.

21:30:49.901 GMT: %VSLP-5-VSL_UP:  Ready for control traffic
21:30:53.909 GMT: %VSLP-5-RRP_MSG: Role change from Active to Standby and hence need to reload
21:30:53.909 GMT: %VSLP-5-RRP_UNSAVED_CONFIG: Ignoring system reload since there are unsaved configurations.
Please save the relevant configurations
21:30:53.909 GMT: %VSLP-5-RRP_MSG: Use 'redundancy reload shelf' to bring this switch to its preferred STANDBY role

In this situation, you must save the running config and reload manually. Only configuration changes applied to VSL ports on the switch can be saved; all other config changes are discarded when the node reboots as the VSS standby.
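
The two outcomes (automatic reload versus a reload blocked by unsaved config) can be told apart from the log alone. Here is a rough Python sketch of that decision, assuming the message mnemonics seen in the logs in this post; the function name and the returned strings are purely illustrative.

```python
def recovery_action(log_lines):
    """Classify what a recovery-mode switch did after the VSL came back.

    The checks are ordered so that the unsaved-config case wins, because that
    is the one that needs an operator to step in.
    """
    text = "\n".join(log_lines)
    if "%VSLP-5-RRP_UNSAVED_CONFIG" in text:
        return "manual: save the config, then run 'redundancy reload shelf'"
    if "%SW_DA-1-VSL_RECOVERED" in text:
        return "automatic: switch is reloading itself as standby"
    if "%VSLP-5-VSL_UP" in text:
        return "VSL is up, no reload decision logged yet"
    return "VSL still down"

clean_recovery = [
    "22:51:26.699 GMT: %VSLP-5-VSL_UP:  Ready for control traffic",
    "22:51:29.699 GMT: %SW_DA-1-VSL_RECOVERED: VSL has recovered during "
    "dual-active situation: Reloading switch 1",
]
blocked_recovery = [
    "21:30:49.901 GMT: %VSLP-5-VSL_UP:  Ready for control traffic",
    "21:30:53.909 GMT: %VSLP-5-RRP_UNSAVED_CONFIG: Ignoring system reload "
    "since there are unsaved configurations.",
]

print(recovery_action(clean_recovery))
print(recovery_action(blocked_recovery))
```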

SW-4506E-VSS01(recovery-mode)#redundancy reload shelf
System configuration has been modified. Save? [yes/no]: yes
Reload the entire shelf [confirm]
Preparing to reload this shelf

WB-4506E-VSS01(recovery-mode)#%Unable to initiate reload in peer.
Feb 10 2018 21:41:18.755 GMT: %RF-5-RF_RELOAD: Shelf reload. Reason: Reload Shelf CLI
Feb 10 2018 21:41:19.742 GMT: %SYS-5-RELOAD: Reload requested by console. Reload Reason: Reload Shelf CLI.
[Sat Feb 10 21:41:20 2018] Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]

After the recovery (once the VSL link is restored and the switch reboots), the new active switch's configuration is used to overwrite the configuration on the peer switch (the old active switch) when it becomes the hot-standby switch. Changes made on the active switch therefore need not match the old active switch's configuration, because that configuration will be overwritten.

Mismatching Configurations

If there is any difference between the configuration on the active and standby devices after it reboots, the standby will keep rebooting and never come back up as part of the VSS.
It goes through the whole bootup process, but soon after the 'Initializing as Virtual Switch STANDBY processor' message it fails the redundancy mode checks and reboots.

Initializing as Virtual Switch STANDBY processor

22:02:41.980: %C4K_IOSVSLENCR-3-VSLPMKKEYSTOREERROR: Failed to open PMK keystore file.
22:02:49.857: %C4K_IOSMODPORTMAN-4-POWERSUPPLYBAD: Power supply 2 has failed or been turned off
22:03:35.394: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 2.
22:03:35.406: %C4K_IOSINTF-5-LMPHWSESSIONSTATE: Lmp HW session UP on slot 1 port 1.
22:03:51.393: %VSLP-5-VSL_UP:  Ready for control traffic
22:03:54.408: %VSLP-5-RRP_ROLE_RESOLVED: Role resolved as STANDBY by VSLP
22:04:29.793: %C4K_REDUNDANCY-2-IOS_VERSION_CHECK_FAIL: STANDBY:IOS version mismatch. Active supervisor version is 15.2(1)E2 (cat4500e-UNIVERSALK9-M). Standby supervisor version is 15.2(1)E2 (cat4500e-UNIVERSALK9-M). Redundancy feature may not work as expected.
22:04:29.793: %C4K_REDUNDANCY-2-NON_SYMMETRICAL_REDUNDANT_SYSTEM: STANDBY:STANDBY supervisor will operate in fallback redundancy mode rpr.
22:04:33.087: %C4K_REDUNDANCY-3-COMMUNICATION: STANDBY:Communication with the peer Supervisor has been established
22:04:34.356: %C4K_REDUNDANCY-2-VS_REBOOT_ON_RPR_FALLBACK: STANDBY:Supervisor in virtual-switch configuration cannot operate in redundancy mode RPR, will be reset
22:04:35.184: %RF-5-RF_RELOAD: STANDBY:Self Reload. Reason: Virtual-switch fallback to RPR
22:04:35.686: %SYS-5-RELOAD: STANDBY:Reload requested by Platform redundancy manager. Reload Reason: Virtual-switch fallback to RPR.
22:04:35 2018 Message from sysmgr: Reason Code:[3] Reset Reason:Reset/Reload requested by [console]. [Reload command]
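
Because the standby repeats this cycle on every boot, the reload reason is the quickest thing to look for in a captured console log. The snippet below is a small Python sketch that pulls the reason out of %SYS-5-RELOAD lines, assuming the line format shown in the logs in this post; the regular expression and function names are my own.

```python
import re

# Assumed line shape, taken from the %SYS-5-RELOAD messages in this post.
# The "STANDBY:" prefix only appears when the standby logs the message.
RELOAD_RE = re.compile(
    r"%SYS-5-RELOAD: (?:STANDBY:)?Reload requested by (?P<by>.+?)\. "
    r"Reload Reason: (?P<reason>.+?)\.?$"
)

def reload_reasons(log_lines):
    """Return (requested_by, reason) for every %SYS-5-RELOAD line found."""
    found = []
    for line in log_lines:
        match = RELOAD_RE.search(line.strip())
        if match:
            found.append((match.group("by"), match.group("reason")))
    return found

def is_rpr_fallback(log_lines):
    """True if any reload was caused by the standby falling back to RPR."""
    return any(
        reason == "Virtual-switch fallback to RPR"
        for _, reason in reload_reasons(log_lines)
    )

boot_log = [
    "22:04:29.793: %C4K_REDUNDANCY-2-NON_SYMMETRICAL_REDUNDANT_SYSTEM: "
    "STANDBY:STANDBY supervisor will operate in fallback redundancy mode rpr.",
    "22:04:35.686: %SYS-5-RELOAD: STANDBY:Reload requested by Platform "
    "redundancy manager. Reload Reason: Virtual-switch fallback to RPR.",
]

print(reload_reasons(boot_log))
print("RPR fallback:", is_rpr_fallback(boot_log))
```

A reason of "dual-active" or "Reload Shelf CLI" is the expected recovery path; "Virtual-switch fallback to RPR" is the one that points to the mismatch described here.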

To recover from this situation, isolate the rebooting switch from the network by unplugging all of its links, including the DAD links (this stops an active-active situation causing network disruption). Removing the DAD links allows it to come back online as the active VSS device, since it no longer sees the other VSS member. Once it is back online, compare the code on each switch, correct the problem, save the configuration on both, plug the VSL and DAD links back in, and reboot the standby. Once it is back online and healthy, plug all the other links back in.

I came across this problem when we configured new DAD links and one of the switches had cdp enable on one of them. Therefore, when configuring DAD links, it is best to default the interfaces and make sure their configs are identical before applying the dual-active fast-hello command (the port needs to be a switchport on the 4500).

(config)# default interface gi 1/2/12
(config)# interface gi 1/2/12
(config-if)# dual-active fast-hello
WARNING: Interface GigabitEthernet1/2/12 placed in restricted config mode. All extraneous configs removed!
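
To catch a stray command like that cdp enable before it causes a reboot loop, the DAD interface stanzas from both chassis can be compared offline. This is a minimal Python sketch; the two stanzas are made-up examples rather than output from the switches above, and the helper names are hypothetical.

```python
def stanza_body(config):
    """Return the set of commands under an interface, ignoring the header line."""
    lines = (line.strip() for line in config.splitlines())
    return {
        line for line in lines
        if line and line != "!" and not line.startswith("interface ")
    }

def compare_dad_links(config_a, config_b):
    """Return (only_in_a, only_in_b) as sorted lists of commands."""
    a, b = stanza_body(config_a), stanza_body(config_b)
    return sorted(a - b), sorted(b - a)

# Illustrative stanzas only: one DAD link per chassis, pasted from a saved config.
switch1_dad = """
interface GigabitEthernet1/2/12
 switchport
!
"""
switch2_dad = """
interface GigabitEthernet2/2/12
 switchport
 cdp enable
!
"""

only_1, only_2 = compare_dad_links(switch1_dad, switch2_dad)
print("only on switch 1:", only_1)
print("only on switch 2:", only_2)
```

The interface header is skipped on purpose, since the port numbers differ between the two chassis even when the configuration underneath is identical.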