02-03-2012 12:14 AM
Hello,
According to CCO, when a switchover is initiated in a dual-Sup N7K environment, whether manually, through ISSU, or because a process has restarted several times after an error, the switchover should be non-disruptive (assuming NSF, BFD, etc. are configured). I assume some controlled operations take place in those cases. But what happens when a Sup fails completely, or when I suddenly remove a Sup from the chassis by hand? Should that be non-disruptive too, or can we expect some downtime and network instability? During a Sup removal I noticed that the whole network became unstable for several minutes: ESX hosts lost clustering, servers became unavailable, etc. I have no detailed logs from that time (pings, show route, etc.), but from visual observation it was not as smooth as the customer expected (he asked why have two Sups then, not counting ISSU).
Best regards,
Krzysztof
02-04-2012 01:15 PM
Hello Krzysztof
Ideally, when the active sup crashes or is removed, the standby sup takes over all operations immediately, without interruption. A simple view is that there is a keepalive between the two sups, and when the standby sup stops receiving it, it becomes active. (For example, when a sup or any linecard is removed, the Nexus detects the ejector state and takes the appropriate action.)
But please keep in mind that the result can differ depending on the N7K configuration and the surrounding environment (neighbor configuration: timers, BGP graceful restart, etc.).
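To make the "simple view" of the keepalive concrete, here is a rough Python sketch. It is only a toy model, not NX-OS code: the class name StandbyMonitor, the timeout value and the method names are all made up for illustration, and the real timers and signalling inside the Nexus are not described in this thread.

```python
import time


class StandbyMonitor:
    """Toy model of a standby sup watching keepalives from the active sup.

    All names and values here are hypothetical; this is not an NX-OS API.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s    # assumed keepalive timeout, in seconds
        self.clock = clock            # injectable clock, handy for testing
        self.last_seen = clock()      # time the last keepalive was received
        self.role = "standby"

    def on_keepalive(self):
        # The active sup is alive: remember when we last heard from it.
        self.last_seen = self.clock()

    def on_peer_removed(self):
        # Physical removal (ejector state) is detected directly,
        # so there is no need to wait for the keepalive timeout.
        self.role = "active"

    def poll(self):
        # Take over if no keepalive has arrived within the timeout.
        silent_for = self.clock() - self.last_seen
        if self.role == "standby" and silent_for > self.timeout_s:
            self.role = "active"
        return self.role
```

The point of the sketch is that the takeover itself is a local decision on the standby sup. How quickly the rest of the network settles afterwards still depends on the neighbors and their timers.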
Here is a brief overview of how a chassis with two supervisors achieves high availability.
Both supervisors have the following:
System manager - a special process that watches all the processes running on the sup and, through the redundancy driver, exchanges HA signals with the other sup and keeps the two in sync.
MTS (Message and Transaction Service) - handles communication between the running applications. It is synchronized between the sups over a dedicated out-of-band channel.
Persistent Storage Service (PSS) - each process saves checkpoints (running/runtime data) into PSS, which helps restore the process seamlessly when it crashes. Published across supervisors and linecards.
The standby sup is always in hot standby mode. Most of the processes on the standby sup follow the state of their "active" peers.
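The checkpoint idea behind PSS can be illustrated with a small Python sketch. Again, this is only a toy model under my own assumptions: PersistentStore and RouteProcess are hypothetical names, and a plain dictionary stands in for the real storage, which works very differently inside NX-OS.

```python
import copy


class PersistentStore:
    """Stand-in for PSS: keeps the latest checkpoint of each process."""

    def __init__(self):
        self._data = {}

    def save(self, process_name, state):
        # Store a copy so later changes in the process do not leak in.
        self._data[process_name] = copy.deepcopy(state)

    def load(self, process_name):
        # Return the last checkpoint, or an empty state if none exists.
        return copy.deepcopy(self._data.get(process_name, {}))


class RouteProcess:
    """Hypothetical process that checkpoints every change it makes."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.state = store.load(name)   # restore runtime data on (re)start

    def learn(self, prefix, next_hop):
        self.state[prefix] = next_hop
        self.store.save(self.name, self.state)   # checkpoint after the change


store = PersistentStore()
first = RouteProcess("rib", store)
first.learn("10.0.0.0/8", "192.0.2.1")
del first                               # simulate a crash of the process

restarted = RouteProcess("rib", store)  # the new instance picks up the checkpoint
print(restarted.state)
```

Because the state lives outside the process, a restarted instance (or its peer on the standby sup) can continue from the last checkpoint instead of starting from scratch.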
So you can see that a lot of work goes on behind the scenes to provide high availability.
But each particular case should be investigated individually.
Hope that helps,
Alex
02-04-2012 02:00 PM
Hello Oleksandr,
Many thanks for the clarification. Indeed, HA is highly developed in Nexus, and I believe there should be minimal downtime during a switchover. However, the configuration of the Nexus we tested is quite complex: there are UDLD, BFD, vPC, VDC, OTV and many more, so I think all of these combined, along with the surrounding environment, can make convergence take a little longer.
I just wanted some confirmation :-) Thanks again.
Best regards,
Krzysztof
02-09-2012 03:14 AM
Krzysztof,
Here you can read more about redundancy on Nexus:
Regards,
Alex