Solved: Supervisor removal on Nexus 7000 and network stability

Krzysztof Zaleski · 02-03-2012

Hello,

According to CCO, when we initiate a switchover in a dual-Sup N7K environment, whether manually, through ISSU, or because a process has restarted several times and triggered it, the switchover should be non-disruptive (assuming NSF, BFD, etc. are configured). I assume some controlled operations take place in those cases. But what happens when a Sup fails completely, or when I suddenly pull it from the chassis? Should that also be non-disruptive, or can we expect some downtime and network instability? During a Sup removal I noticed that the whole network became unstable for several minutes: ESX hosts lost clustering, servers became unavailable, etc. I have no detailed logs from that time (pings, show route, etc.), but according to visual observations it was not as smooth as the customer expected it to be (he asked "why 2 Sups then?", not counting ISSU).

Best regards,

Krzysztof

Oleksandr Nesterov · 02-04-2012

Hello Krzysztof

Ideally, when the active sup crashes or is removed, the standby sup takes over all operations immediately, without interruption. A simple view is that there is a keepalive between the two sups, and when the standby sup stops receiving it, it becomes active. (For example, when a sup or any linecard is removed, the Nexus detects the ejector state and takes the appropriate action.)

But please keep in mind that, depending on the N7K configuration and the working environment (neighbor configuration: timers, BGP graceful restart, etc.), the result can be different.
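To show why the neighbor timers matter, here is a rough sketch in Python. It is only arithmetic, not a model of NX-OS: it assumes a neighbor declares us down once it has heard nothing for interval x multiplier, and it asks which neighbors would give up if our hellos paused for a given gap. The timer values and the names `NEIGHBOR_TIMERS`, `detection_time` and `expired_neighbors` are made up for illustration; check what is really configured on each neighbor.

```python
# Illustrative values only (assumptions, not read from any device).
# name: (hello/keepalive interval in seconds, multiplier)
NEIGHBOR_TIMERS = {
    "ospf": (10.0, 4),   # dead interval = 4 x hello
    "bgp": (60.0, 3),    # hold time = 3 x keepalive
    "bfd": (0.05, 3),    # 50 ms interval, multiplier 3
}


def detection_time(interval_s, multiplier):
    """Seconds of silence after which a neighbor declares the peer down."""
    return interval_s * multiplier


def expired_neighbors(gap_s, timers=NEIGHBOR_TIMERS):
    """Names of neighbors whose timer would expire during a silent gap."""
    return [
        name
        for name, (interval_s, multiplier) in timers.items()
        if detection_time(interval_s, multiplier) <= gap_s
    ]


# Hypothetical gaps: a 1 second pause and a 45 second pause.
short_gap = expired_neighbors(1.0)
long_gap = expired_neighbors(45.0)
print("1 s gap :", short_gap)
print("45 s gap:", long_gap)
```

The point is only that the more aggressive the timers around the chassis, the less silence the neighbors tolerate, so the same switchover can look clean in one network and disruptive in another.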

Here is some brief info on how a chassis with 2 supervisors achieves high availability.

Both supervisors have the following:

System manager - a special process that watches all the processes running on the sup and, through the redundancy driver, exchanges HA signals with the other sup and syncs up with it.

MTS (Message and Transaction Service) - maintains communication between the running applications. It is synchronized between the sups via a special out-of-band channel.

Persistent Storage Service (PSS) - each process saves checkpoints (running/runtime data) into PSS, which helps to seamlessly restore the process if it crashes. It is published across supervisors and linecards.

The standby sup is always in hot standby mode. Most of the processes on the standby sup follow the state of their "active" peers.
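As a very simplified picture of the keepalive and checkpoint idea, here is a toy sketch in Python. It is not how NX-OS implements it: the `Supervisor` class, its methods, the state keys and the 1 second timeout are all hypothetical, and the checkpoint and keepalive are merged into one call just to keep it short.

```python
class Supervisor:
    """Toy supervisor that mirrors state and watches for a lost keepalive."""

    def __init__(self, name, role):
        self.name = name
        self.role = role           # "active" or "standby"
        self.checkpoint = {}       # stand-in for checkpointed runtime data
        self.last_keepalive = 0.0  # time of the last sync from the peer

    def receive_sync(self, state, now):
        # Simplification: one call carries both the checkpoint and the keepalive.
        self.checkpoint = dict(state)
        self.last_keepalive = now

    def poll(self, now, timeout):
        # A standby that has heard nothing for longer than `timeout` takes over.
        if self.role == "standby" and now - self.last_keepalive > timeout:
            self.role = "active"
        return self.role


standby = Supervisor("sup-2", "standby")
standby.receive_sync({"routes": 1200, "vlans": 40}, now=0.0)
standby.receive_sync({"routes": 1250, "vlans": 40}, now=1.0)

# The active sup is pulled right after t=1.0, so no more syncs arrive.
role_before = standby.poll(now=1.5, timeout=1.0)  # silence of 0.5 s
role_after = standby.poll(now=2.5, timeout=1.0)   # silence of 1.5 s
print(role_before, role_after, standby.checkpoint)
```

The standby takes over with whatever it last received, which is why the switchover itself can be fast. Anything that changed after the last sync, and anything the neighbors decide on their own timers, is outside this picture.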

So you see that there is a lot of work going on behind the scenes to provide high availability.

But each particular case should be investigated.

Hope that helps,

Alex

Krzysztof Zaleski · 02-04-2012

Hello Oleksandr,

Many thanks for the clarification. Indeed, HA is highly developed in Nexus, and I believe there should be minimal downtime during a switchover. However, the configuration of the Nexus we tested is quite complex: there are UDLD, BFD, vPC, VDC, OTV and many more, so I think all of these combined, along with the surrounding environment, can make the convergence a little longer.

I just wanted some confirmation :-) Thanks again.

Best regards,

Krzysztof

Oleksandr Nesterov · 02-09-2012

Krzysztof,

Here you can read more about redundancy on Nexus:

http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9402/ps9512/White_Paper_Continuous_Operations_High_Availability.html

Regards,

Alex