The following document contains examples of how to deploy and configure the CSR1000V within Amazon Web Services (AWS) in a DMVPN configuration, as well as details on how to set up VPC Gateway Redundancy.
Great document. Thank you for taking the time to put it all together!
Thanks for the nice documentation. I was trying to follow the VPN gateway redundancy part of it, which uses BFD failure messages across the tunnel to achieve HA between the CSRs. I have two questions about the scenario explained in the document:
1- I understand the purpose of Tunnel 33, which tracks reachability to the AWS VPC and runs the EEM script based on the BFD failure messages, but do you know why the document has Tunnel 98 there? Was that to check the availability of the ISP in the respective zone? CSR A will be deployed in zone a and CSR B will be in zone b, and I have no idea whether AWS has the same ISP in both zones. If that's not the case, then maybe we need Tunnel 98 to track the ISP connectivity, so that if Tunnel 98 goes down on CSR A, the syslog message for Tunnel 98 going down triggers the EEM to fail over to CSR-B?
2- The document only explains how to fail over from CSR A to CSR B via the EEM script through the EC2 helper VM, but does not talk about how to fail back once CSR-A has recovered. Can you explain that here?
Let me take your questions one at a time:
1a) The difference between Tunnel 33 and Tunnel 98 is the following:
- Tunnel 33 is for reachability between the two CSRs. If you picture CSR-A and CSR-B connected to a common segment on GigabitEthernet1, Tunnel 33 is a second "virtual" connection between the two devices. The primary reason this is necessary is that AWS does not support broadcasts and multicasts. The tunnel masks this traffic from AWS and allows it to flow across that common segment, because all that AWS sees is unicast GRE traffic between the two tunnel endpoints. This allows CSR-A to recognize when CSR-B goes down and vice versa. This tunnel is strictly a control plane connection, and is the glue that makes the redundancy work.
- Tunnel 98 (and Tunnel 96 on CSR-B) represent the connections back to the corporate network. These tunnel interfaces are used to forward traffic across the internet VPN. It is a little misleading to see a private (GigabitEthernet1) address as the tunnel source, but remember that this address is NAT'd to whatever Elastic IP is assigned to it within the AWS infrastructure. These tunnels are the data plane connection. They are where any production traffic would be forwarded. This is also how your corporate network would learn about routes to anything within the AWS cloud.
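To make the Tunnel 33 point concrete, here is a small Python sketch of the idea rather than anything taken from the document. It models a multicast hello being wrapped in GRE so that the outer packet, which is all AWS sees, has a unicast destination. All addresses, the `Packet` class and the helper functions are hypothetical, and 224.0.0.10 (the EIGRP hello group) is used only as an example of multicast control traffic.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Toy packet: only the fields needed for this illustration."""
    src: str
    dst: str
    inner: Optional["Packet"] = None  # set when this packet carries another one


def is_unicast(address: str) -> bool:
    """True unless the address is multicast or the limited broadcast address.

    Simplification: subnet-directed broadcasts are not detected here.
    """
    ip = ipaddress.ip_address(address)
    return not ip.is_multicast and ip != ipaddress.ip_address("255.255.255.255")


def gre_encapsulate(inner: Packet, tunnel_src: str, tunnel_dst: str) -> Packet:
    """Wrap a packet in an outer header addressed between the tunnel endpoints."""
    return Packet(src=tunnel_src, dst=tunnel_dst, inner=inner)


# Hypothetical addresses: 192.168.33.1 on the tunnel, 10.0.1.10/10.0.1.20 on GigabitEthernet1.
hello = Packet(src="192.168.33.1", dst="224.0.0.10")
outer = gre_encapsulate(hello, tunnel_src="10.0.1.10", tunnel_dst="10.0.1.20")

print(is_unicast(hello.dst))  # the hello on its own is multicast
print(is_unicast(outer.dst))  # the GRE packet carrying it is unicast
```

The inner packet is unchanged; only the outer header is visible to the VPC, which is why the control plane traffic can cross the common segment.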
1b) Reachability to the ISP within the zone.
- This setup has been tested within a single availability zone. The same idea could be used to expand such a setup beyond the specific details of this document, but the details of the document show both CSR-A and CSR-B in a single availability zone.
- There are two components to the reachability: one is the ability to access the ISP, and the second is the ability to access the other CSR. We are trying to ensure that if either of these two components fails on one of the CSRs, the other CSR takes over. For example, if one of the two CSRs was reloaded, or became unreachable for some reason, the other would take over. Likewise, if one of the two ISPs failed, we would want the other to become the primary router. You could add other failure scenarios to make this solution more effective for your specific environment.
2) Failing back once CSR-A has recovered.
- Within a first-hop redundancy protocol like HSRP there is a concept of preemption that allows CSR-A to be the active device as long as it is operational. If it goes down, CSR-B would become active, and CSR-A would take over again once it came back online. In this setup there is no preemption present. That is not to say that you could not modify the script to trigger proactively, but we have not tested that.
- In this scenario, if CSR-A is the active router and it fails, CSR-B will take over. When CSR-A comes back online, it effectively remains a standby for CSR-B until CSR-B fails. Once CSR-A detects a failure of CSR-B it will take over, but only after a failure of CSR-B.
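The non-preemptive behaviour described above can be sketched as a tiny state model in Python. This is only an illustration of the role changes, not the EEM script from the document; the `Router` and `NonPreemptivePair` classes and their method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    healthy: bool = True


class NonPreemptivePair:
    """Two routers, one active and one standby, with no preemption."""

    def __init__(self, active: Router, standby: Router) -> None:
        self.active = active
        self.standby = standby

    def _find(self, name: str) -> Router:
        for router in (self.active, self.standby):
            if router.name == name:
                return router
        raise KeyError(name)

    def _swap(self) -> None:
        self.active, self.standby = self.standby, self.active

    def fail(self, name: str) -> None:
        router = self._find(name)
        router.healthy = False
        # Fail over only when the active router fails and the standby is usable.
        if router is self.active and self.standby.healthy:
            self._swap()

    def recover(self, name: str) -> None:
        router = self._find(name)
        router.healthy = True
        # No preemption: a recovered router stays standby unless the
        # current active router is itself down.
        if router is self.standby and not self.active.healthy:
            self._swap()


def demo() -> list:
    """Return the active router's name after each step of the scenario."""
    pair = NonPreemptivePair(Router("CSR-A"), Router("CSR-B"))
    history = [pair.active.name]
    pair.fail("CSR-A")      # CSR-B takes over
    history.append(pair.active.name)
    pair.recover("CSR-A")   # CSR-A is back but stays standby
    history.append(pair.active.name)
    pair.fail("CSR-B")      # only now does CSR-A become active again
    history.append(pair.active.name)
    return history


print(demo())
```

Adding preemption would amount to swapping roles inside `recover` whenever the preferred router comes back, which corresponds to the untested script modification mentioned above.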
Hopefully that answers your questions. If you have others, let me know.