Thursday, August 1, 2013

Replacing a failed VSS supervisor

Several years ago now, Cisco introduced the Catalyst 6500 "Virtual Switching System" (VSS), and it has become a very popular deployment model. VSS allows two separate Cisco 6500 chassis to be paired together and managed as one.

I'm a fan of this technology, as it lets you remove loops from the topology of a redundant layer-2 network. However, there may come a time when a supervisor module fails, and hopefully the impact is minimal! I've seen times when a SUP720 has gone belly-up with minimal impact to the users, and other times when the failure was less than graceful. Luckily, there have been more of the former than the latter. While on the subject of failures in VSS -- I'll take a second to encourage you to make sure that some method of dual-active detection is in place!

Cisco has published a guide to replacing a failed supervisor in a VSS system, and I would definitely recommend reading it before you attempt a replacement. It's entitled "Replace Supervisor Module in Cisco Catalyst 6500 Virtual Switching System 1440" (if the link eventually breaks, hopefully your favorite search engine will find it based on the title). After doing this enough times, I came up with a modified procedure that I feel covers all of the bases and can make this stressful process go a bit more smoothly.

The reason I have written up a different version is that there are certain aspects of the official guide that I feel could be more conservative from a risk standpoint. For instance, it recommends connecting the VSLs before the replacement supervisor is ready to boot in VSS mode. While this may be harmless, I'd rather avoid the situation altogether than risk a stale configuration impacting the active VSS member.

Without further ado, these are the steps I have followed to replace failed supervisors in 6500 VSS. I do welcome any comments or questions.

Prepare Your Laptop

  1. Copy IOS binary and running-configuration from active VSS; save to laptop
  2. Check the active VSS chassis' switch number:

    LAB-VSS-6500-0-15#switch read switch_num local
    Read SWITCH_NUMBER from Active rommon is 1
    LAB-VSS-6500-0-15#

  3. Set up static IP (10.1.1.1/30) on laptop
  4. Ensure you have some sort of file transfer server available on the laptop (SCP, FTP, TFTP, HTTP, etc.)
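
For reference, steps 1 and 2 might look something like the following on the active VSS. This is only a sketch: it assumes the laptop is running a TFTP server that the active VSS can reach, and the address 192.0.2.10 and the file names are placeholders rather than values from a real system.

```
! Find the image the active supervisor booted from
LAB-VSS-6500-0-15#show bootvar

! Copy that image and the running configuration to the laptop
! (192.0.2.10, <image-name>.bin and vss-running.cfg are placeholders)
LAB-VSS-6500-0-15#copy sup-bootdisk:<image-name>.bin tftp://192.0.2.10/<image-name>.bin
LAB-VSS-6500-0-15#copy running-config tftp://192.0.2.10/vss-running.cfg

! Note the switch number of the surviving chassis
LAB-VSS-6500-0-15#switch read switch_num local
```

If you prefer SCP or FTP, substitute the matching URL scheme in the copy commands.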

Prepare the New Supervisor

  1. Procure spare/new Sup720 supervisor (VS-S720-10G or similar model)
  2. Remove any transceivers or cables from the failed supervisor, then remove the failed supervisor
  3. Pull any other linecards in the chassis out a few inches -- voila -- it's a "spare chassis" now!
  4. Insert spare/new supervisor and connect laptop to console port
  5. Connect laptop to a copper interface on the supervisor (e.g. Gi1/3) and then configure the port as a routed port with 10.1.1.2/30 as the IP.
  6. Set the VSS switch number to the opposite of the number from the active chassis (either 1 or 2, remember to make it the opposite!) - switch set switch_num 2
  7. Validate the setting via switch read switch_num local
  8. Check which version of code the new supervisor has on it, looking at the same filesystem where the active supervisor keeps its code, e.g. dir sup-bootdisk:
  9. If necessary, copy IOS binary from laptop to filesystem, then validate with "verify" command
  10. Ensure configuration register is set to 0x2102 with: show ver | inc register (if not, set it in config mode and then save config)
  11. Copy over the active supervisor's running-configuration (already saved to your laptop) to the new sup's startup-configuration
  12. Confirm show bootvar is correct - pay attention to confreg and boot image
  13. Power down the chassis!
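
Steps 5 through 12 could be sketched as the console session below. Treat it as an illustration under a few assumptions, not a transcript: the prompt, the file names, and the choice of TFTP are placeholders; the interface number is taken from step 5 and will depend on the slot the supervisor sits in; and switch number 2 assumes the active chassis is switch 1.

```
! Step 5: routed port toward the laptop
Router#configure terminal
Router(config)#interface GigabitEthernet1/3
Router(config-if)#no switchport
Router(config-if)#ip address 10.1.1.2 255.255.255.252
Router(config-if)#no shutdown
Router(config-if)#end
Router#ping 10.1.1.1

! Steps 6-7: set and confirm the switch number (opposite of the active chassis)
Router#switch set switch_num 2
Router#switch read switch_num local

! Steps 8-9: check the image, copy it over if needed, then verify it
Router#dir sup-bootdisk:
Router#copy tftp://10.1.1.1/<image-name>.bin sup-bootdisk:
Router#verify sup-bootdisk:<image-name>.bin

! Step 10: check the configuration register; change it only if it is not 0x2102
Router#show version | include register
Router#configure terminal
Router(config)#config-register 0x2102
Router(config)#end
Router#copy running-config startup-config

! Step 11: load the saved VSS configuration into startup-config
Router#copy tftp://10.1.1.1/vss-running.cfg startup-config

! Step 12: confirm the boot image and configuration register
Router#show bootvar
```

The configuration is copied straight into startup-config rather than running-config, so nothing from the VSS configuration takes effect until the supervisor is rebooted with the chassis fully reassembled.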

Bring up the Chassis

  1. Slide all linecards back in; insert transceivers and cables as appropriate to new supervisor. Ensure the VSLs are connected!
  2. Power up the chassis
  3. On active chassis, issue these commands until satisfied everything has come back as expected
    • show switch virtual redundancy // watch for 2nd chassis to come up and enter SSO
    • show switch virtual link // validate the VSLs come up alright
    • show switch virtual dual-active pagp // check to ensure dual-active detection is enabled. If using something other than enhanced pagp for this, substitute command as appropriate
    • show logging // ensure VSS is coming up and nothing else is going wrong
    • show etherchannel summary // make sure those multichassis etherchannels fill back up
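
If dual-active detection relies on something other than enhanced PAgP, the dual-active check above could be swapped for one of the variants below. Which of these exist depends on the software release, so take this as a sketch and confirm against the CLI help on your own system.

```
show switch virtual dual-active summary
show switch virtual dual-active fast-hello
show switch virtual dual-active bfd
```
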

This isn't a short procedure by any means, but it is relatively straightforward. Again, I have linked the official Cisco doc for replacing a failed supervisor -- please read it! My steps here are, of course, "use at your own risk." That being said, this procedure has worked well for me on several occasions.
