
NAC restore operation when nodes have issues

Da ICS16
Level 1

Dear Community,

We are planning to create an NAC restoration operational guideline for when NAC nodes have an issue that cannot be resolved in time.

Here are the four use cases we are considering for the guideline.

1. ISE server crash - how do we prepare, step by step, to resolve this case and ensure no impact to operations?

2. Network device configuration - how to bypass NAC at the switch level, and how to back up / restore a switch from the ISE dashboard?

3. SSL / self-signed certificate compromise - how to prevent and fix it?

4. All ISE servers completely down - how to bring them up immediately?

Remarks:

- We use ISE 3.1

- We use Cisco Catalyst C9200L switches

- There are three deployment nodes (PAN, secondary node, and pxGrid node)

Your support and advice would be highly appreciated, especially if you could share good practices with documentation backed by your real-world experience.

Best regards,

2 Replies

Ruben Cocheno
Spotlight

@Da ICS16 

1. ISE server crash - how to prepare step by step to resolve this case and ensure no impact on operations? Call TAC.

2. Network device configuration - how to bypass at the switch level and how to back up / restore the switch from the ISE dashboard? You have to disable NAC if you can't get it operational in time and resolution takes longer than expected.

3. SSL / self-signed certificate compromise - how to prevent and fix it? Track your certs and keep them "secret", managed only by designated staff.

4. All ISE servers are completely down - how to bring them up immediately? You need to make sure you have enough resiliency across your infrastructure to avoid this scenario; there is no bulletproof design.
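On the "disable NAC" point in item 2, one low-impact way to bypass enforcement at the switch level is open mode. This is only a sketch assuming legacy-style IOS-XE authentication commands; the interface name is just an example:

```
interface GigabitEthernet1/0/10
 ! Open mode: forward traffic even if 802.1X/MAB fails
 ! or no RADIUS server (ISE PSN) is reachable
 authentication open
```

Remove `authentication open` again once ISE is healthy to return to enforcement.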

Tag me to follow up.
Please mark it as Helpful and/or Solution Accepted if that is the case. Thanks for making Engineering easy again.
Connect with me for more on Linkedin https://www.linkedin.com/in/rubencocheno/

You didn't mention where you will place the PSN personas and how many you will have. PSNs are essential and crucial in your network: all the authentication, authorization, web authentication, profiling, etc. are handled by the PSNs. If the PSN resides on a separate node, then even if the primary PAN, the secondary PAN, or the pxGrid node should crash, you won't really be affected unless you are using services that rely on the primary PAN.

Here are my thoughts on your questions:

1. ISE server crash - how to prepare step by step to resolve this case and ensure no impact to operations?
It depends, but if the crashed node is a PSN and you have another PSN to serve the RADIUS/TACACS traffic, the impact will be minimal. In that case the network devices (NADs) have both PSNs' IP addresses configured, and if the PSN that crashed happens to be the first one in the list, the NADs will wait a few seconds (depending on the config) before they deem that node unreachable and switch to the second one in the list. However, if you only have a single PSN and it goes down, you will obviously have a full impact. Even then, you can rely on dynamically assigning a specific VLAN to the endpoints until ISE is back up and running. It is a sort of backup VLAN; think of it as a temporary measure that allows end users and endpoints to connect to the internet, or even to the corporate VLAN, depending on the needs and criticality.
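The PSN redundancy described above is configured on the NAD side. A minimal IOS-XE sketch with two PSNs and dead-server detection might look like this (IP addresses, shared secret, and timer values are placeholders to adapt to your environment):

```
! Define both PSNs so the switch can fail over between them
radius server ISE-PSN-1
 address ipv4 10.10.10.11 auth-port 1812 acct-port 1813
 key MySharedSecret
radius server ISE-PSN-2
 address ipv4 10.10.10.12 auth-port 1812 acct-port 1813
 key MySharedSecret
!
aaa group server radius ISE-GROUP
 server name ISE-PSN-1
 server name ISE-PSN-2
!
! Declare a PSN dead after 3 unanswered tries within 5 seconds,
! and skip the dead server for 10 minutes before retrying
radius-server dead-criteria time 5 tries 3
radius-server deadtime 10
```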

ISE doesn't support snapshots in virtual deployments. Your main tool to speed up restoring your ISE deployment is the configuration backup on ISE. The operational backups are not as essential as the configuration backups, since they don't contain any config data. With the configuration backup you can easily restore onto a newly deployed node (or after a disk swap in a hardware deployment). You can even restore the IP addresses and all the network settings from the same configuration backup.
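For reference, a configuration backup and restore from the ISE admin CLI looks roughly like this (repository name, URL, credentials, and encryption key are placeholders; verify the exact syntax against the ISE 3.1 CLI reference):

```
! In configuration mode: define a repository to hold the backups
repository BACKUP-REPO
 url ftp://10.10.10.50/ise-backups
 user backupadmin password plain MyFtpPassword

! In exec mode: run a configuration backup to that repository
backup NAC-Config repository BACKUP-REPO ise-config encryption-key plain MyEncryptionKey

! On a freshly installed node: restore from the generated backup file
restore NAC-Config-<timestamp>.tar.gpg repository BACKUP-REPO encryption-key plain MyEncryptionKey
```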

With virtual deployments you can do something that is not very scalable, but again it depends on your deployment: shut down the ISE VM monthly, quarterly, or on whatever schedule you decide, make a VM copy, and store it somewhere. That copy can be used to restore a crashed VM very quickly. As you can see, this is neither a dynamic nor a scalable solution, but I have to say that for some of my small-sized customers it worked very well.

2. Network device configuration - how to bypass on switch level and how to backup / restore switch on ISE dashboard?
The configuration backup I mentioned above covers restoring the network devices and all the other bits and pieces responsible for bringing your deployment back to life. Bypassing the switches? As mentioned above, if you configure critical VLAN assignment, the switches will assign whichever VLAN you configured while ISE is down, and will then trigger the authentication process again once ISE is back online. I don't think critical VLAN assignment works with wireless, though.
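As a sketch of the critical VLAN assignment mentioned above, the legacy-style commands on a Catalyst access port look like this (VLAN 100 and the interface name are placeholders; on a C9200L running the newer IBNS 2.0 style, the same behavior is built with a service template and policy-map instead):

```
interface GigabitEthernet1/0/10
 switchport mode access
 authentication port-control auto
 ! If all RADIUS servers (PSNs) are dead, authorize into VLAN 100
 authentication event server dead action authorize vlan 100
 ! When a server comes back alive, re-run authentication
 authentication event server alive action reinitialize
 dot1x pae authenticator
```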

3. SSL / self-signed certificate compromise - how to prevent and fix it?
It depends on where you manage your certs. For instance, if you are referring to ISE Internal CA certificates issued to users who enrolled through the BYOD flow, then as part of that deployment the users can manage their registered devices from the My Devices portal on ISE. So if a laptop is stolen or breached, the user can log into the My Devices portal and mark that device as lost or stolen. Lost means no access to the network (the device is added to a blacklisted identity group), and the device can regain access without any re-onboarding if the user moves it from the lost state back to registered.

Stolen, on the other hand, means the device will be blocked from accessing the network (similar to a lost device), but this time the device certificate will be automatically revoked by the ISE Internal CA. When the stolen device is moved out of the stolen state in the My Devices portal, it will be marked as not registered, so to allow it to connect to the network again it has to go through the whole onboarding process again. But, as you might be thinking by now, none of this works by default; some pieces might, but to have a fully functional BYOD flow relying on the ISE Internal CA you have to configure all the required bits and pieces.

On the other side, if the certificates are managed by an external entity, there should still be a process in place requiring that when a device is lost, compromised, or stolen, it must be reported to the security team, which then asks the infra team to disable all the AD identities related to that device, including revoking the device and user certificates.

Some companies rely on certificate management tools that allow users to log into a portal and revoke their certificates from there, which is pretty much the same concept as the My Devices portal in ISE.

4. All ISE servers completely down - how to bring it up immediately?
I would say there is no immediate restore with any technology :). Any major failure needs time, effort, and resources for a full restore. However, it depends on the environment; for instance, the example above of shutting down the ISE VMs and copying them manually would be the fastest way to restore ISE in your deployment, and I've seen this work many times with some of my customers. Technically speaking, you just need to make sure the old nodes are totally down and then power on the stored/backup VM, and that's it. Obviously, with this approach you might not have the very latest policies; if you made changes after that VM was copied, those changes won't be restored.

If instead you have a hardware deployment, you would need to re-install ISE and then restore the configs from the configuration backup, and if you don't have that backup, you will have to rebuild everything from scratch, which is going to be a nightmare.

Lastly, please note that if you still have a single ISE node in your deployment that didn't fail, you can move the failed personas temporarily to that node until you rebuild the other nodes in parallel. This will impact performance in your network, but will at least give you a working (single-point-of-failure) solution:

- If the primary PAN fails, you can easily promote the secondary PAN.

- If the pxGrid node fails, you can easily move the pxGrid persona to another node; obviously this requires updating any integrations or IP addresses affected by the move.

- If a PSN fails and you don't already have another one, you can easily move the PSN persona to another node; this also requires updating the IP addresses on the NADs.

- If an MnT node fails, same as before, you can easily move its persona to another node.

- If the secondary PAN fails, nothing will be affected, so you can rebuild it in parallel.