
Supervisor removal on Nexus 7000 and network stability

Hello,

According to CCO, when we initiate a switchover in a dual-Sup N7K environment, either manually, using ISSU, or when it is triggered by an error such as a process restarting several times, the switchover should be non-disruptive (assuming NSF, BFD, etc. are configured). I assume some controlled operations take place. But what happens when a Sup fails completely, or I suddenly remove the Sup manually from the chassis? Should it be non-disruptive, or can we expect some downtime and network instability? I noticed during Sup removal that the whole network became unstable for several minutes. ESX hosts lost clustering, servers became unavailable, etc. I have no detailed logs from that time (pings, show route, etc.), but according to visual observations it was not as smooth as the customer expected it to be (he asked why have 2 Sups then, not counting ISSU).

Best regards,

Krzysztof

Accepted Solutions

Oleksandr Nesterov
Cisco Employee

Hello Krzysztof

Ideally, when an active sup crashes or is removed, the standby sup takes over all operations immediately without interruption. A simple view is that there is a keepalive between the two sups, and when the standby sup stops receiving it, it becomes active. (For example, when a sup or any linecard is removed, the Nexus detects the ejector state and takes appropriate action.)
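
To make the keepalive idea concrete, here is a minimal Python sketch of a standby unit that promotes itself when keepalives stop or when it gets an explicit removal signal. This is only a toy model, not how NX-OS implements it: the class name, the method names and the timeout value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StandbySupervisor:
    """Toy model of a standby unit watching its active peer (hypothetical)."""

    keepalive_timeout: float      # assumed: seconds of silence before takeover
    last_keepalive: float = 0.0   # time the last keepalive was seen
    role: str = "standby"

    def on_keepalive(self, now: float) -> None:
        # The active peer is alive; remember when we last heard from it.
        self.last_keepalive = now

    def on_peer_removed(self) -> None:
        # Explicit removal signal (for example an ejector state change):
        # no need to wait for the keepalive timer to expire.
        self.role = "active"

    def tick(self, now: float) -> str:
        # Called periodically; take over if the peer has been silent too long.
        if self.role == "standby" and now - self.last_keepalive > self.keepalive_timeout:
            self.role = "active"
        return self.role
```

The point of the two paths is that a clean removal can be detected right away, while a silent failure is only noticed once the timer expires, so the two cases do not necessarily behave the same.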

But please consider that the result can differ depending on the N7K configuration and the working environment (neighbor configuration: timers, BGP graceful restart, etc.).

Here is brief info on how a chassis with 2 supervisors achieves high availability.

Both supervisors have the following:

System manager - a special process that watches all the processes running on the sup and, through the redundancy driver, exchanges HA signals with the other sup and syncs up.

MTS (Message and Transaction Service) - maintains communication between the running applications. Synchronized between sups via a special out-of-band channel.

Persistent Storage Service (PSS) - each process saves checkpoints (running/runtime data) into PSS, which helps to seamlessly restore the process when it crashes. Published across supervisors and linecards.

The standby sup is always in hot standby mode. Most of the processes on the standby sup follow the state of their "active" peers.
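
The checkpoint idea behind PSS can be sketched in a few lines of Python. This is a conceptual illustration only: `CheckpointStore` and `RouteProcess` are hypothetical names, and the real PSS is far more involved. It just shows a process saving its runtime state on every change, so that a restarted instance (or a peer on the standby sup) can pick up where the old one left off.

```python
import copy


class CheckpointStore:
    """Hypothetical stand-in for a persistent store keyed by process name."""

    def __init__(self):
        self._data = {}

    def save(self, name, state):
        # Store a copy so later changes in the process do not alter the checkpoint.
        self._data[name] = copy.deepcopy(state)

    def load(self, name):
        # Return a copy of the last checkpoint, or an empty state if none exists.
        return copy.deepcopy(self._data.get(name, {}))


class RouteProcess:
    """Toy process that checkpoints its table after every change."""

    def __init__(self, store, name="routing"):
        self.store = store
        self.name = name
        # On (re)start, restore whatever was checkpointed last.
        self.routes = store.load(name).get("routes", {})

    def add_route(self, prefix, next_hop):
        self.routes[prefix] = next_hop
        self.store.save(self.name, {"routes": self.routes})
```

For example, after `p = RouteProcess(store)` and `p.add_route("10.0.0.0/8", "192.0.2.1")`, a fresh `RouteProcess(store)` starts with the same route already present.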

So you see that there is a lot of work behind the scenes to provide high availability.

But each particular case should be investigated.
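
For such an investigation it helps to have timestamped reachability data from the test. As a rough sketch (the function name and the sample format are my own assumptions, not output of any Cisco tool), this Python helper turns a list of probe results, such as continuous pings to a server behind the N7K, into outage windows:

```python
def outage_windows(samples):
    """Find outage windows in probe results.

    samples: list of (timestamp_seconds, success) tuples, sorted by time.
    Returns a list of (start, end) tuples, where start is the time of the
    first failed probe and end is the time of the first successful probe
    after it. end is None if the outage was still ongoing at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
    if start is not None:
        windows.append((start, None))       # never recovered within the data
    return windows
```

Running one probe series per device or VLAN of interest makes it easier to see which part of the network actually dropped during the switchover and for how long.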

Hope that helps,

Alex

Hello Oleksandr,

Many thanks for the clarification. Indeed, HA is highly developed in Nexus, and I believe there should be minimal downtime during switchover. However, the configuration of the Nexus we tested is quite complex: there are UDLD, BFD, vPC, VDC, OTV and many more, so I think all of these combined, along with the surrounding environment, can make convergence a little longer.

I just wanted some confirmation :-) Thanks again.

Best regards,

Krzysztof