End To End Keepalives For Ethernet-Based WAN Service

Unanswered Question
May 23rd, 2010

Hi All,


We are currently migrating to an Ethernet-based MPLS WAN solution with a dial-on-demand backup (using Dialer interfaces) and are having issues with monitoring our routers.  The issue is that when there is a fault upstream in the network (i.e. between the CE and the PE, but not between the CE and the first-hop Ethernet device), the CE (Cisco 3845) does not detect it, since the Layer 2 Ethernet segment to that device is still up, and hence both the interface and the line protocol stay up.  Our WAN routing protocol (BGP) will detect the fault, and consequently our backup link will trigger and come up.  However, any traps/syslogs to our NMS will be lost (since it takes time for the DDR to come up and routing to converge), and our NMS will still be able to poll all the interfaces on the CE once the ISDN comes up.  From that perspective, it will still see the router as being up (after a slight blip during the DDR and convergence), and consequently our NMS will still think the primary link is active.


Is there a 'neat' way to overcome this limitation without relying on any infrastructure other than the CE (since we probably won't be able to make any configuration changes to the PE or the intermediate network)?  Basically, I am looking for something like the EEK (end-to-end keepalive) you used to be able to use with point-to-point Frame Relay.  My current thoughts were:


1) IP SLA objects/probes - would be annoying since I'd have to customise it for each router to ping their respective PE and I want something generic I can put across all the sites via CiscoWorks.  Also, I would still be able to ping the PE across the DDR link

2) Block the NMS probes over the backup link - I'd lose all monitoring/stats collection over the backup link when a failure occurred on the primary link

3) OAM - limitation where we can't make any changes to any infrastructure between the CE and the PE


Is there anything else that Cisco might have in its IOS arsenal that might overcome this issue?


Thanks,

goulin

Giuseppe Larosa Mon, 05/24/2010 - 11:39

Hello Goulin,

You need cooperation from the service provider.


You could combine BGP with BFD (Bidirectional Forwarding Detection), but this needs to be enabled on both the CE side and the PE side.


see

http://www.cisco.com/en/US/docs/ios/iproute_bfd/configuration/guide/irb_bfd_ps6441_TSD_Products_Configuration_Guide_Chapter.html#wp1096480


this should also give you fast detection of neighbor failure instead of relying on hold timer expiration.


The NMS may not be able to detect that the primary path has failed, but human operators can correlate the BGP session going down and the DDR call with the root cause.
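To illustrate that correlation, here is a minimal Python sketch, not a tested NMS integration. It assumes the operator has already collected timestamped log messages (for example from the router's local logging buffer, since syslogs sent during convergence may be lost) and that they contain the usual `%BGP-5-ADJCHANGE ... Down` and `%ISDN-6-CONNECT` messages. The function name `primary_path_failed` and the two-minute window are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Message patterns are assumptions about what the CE logs; adjust to your IOS output.
BGP_DOWN = re.compile(r"%BGP-5-ADJCHANGE: neighbor (\S+) Down")
ISDN_UP = re.compile(r"%ISDN-6-CONNECT: Interface (\S+) is now connected")


def primary_path_failed(events, window=timedelta(minutes=2)):
    """Return True if an ISDN call connected shortly after a BGP neighbor went down.

    events: iterable of (datetime, message) tuples, in any order.
    window: hypothetical maximum delay between BGP down and the DDR call.
    """
    events = list(events)  # allow generators, since we scan twice
    bgp_times = [t for t, msg in events if BGP_DOWN.search(msg)]
    isdn_times = [t for t, msg in events if ISDN_UP.search(msg)]
    # The DDR call is expected at or after the BGP failure, within the window.
    return any(
        timedelta(0) <= isdn - bgp <= window
        for bgp in bgp_times
        for isdn in isdn_times
    )


if __name__ == "__main__":
    t0 = datetime(2010, 5, 24, 10, 0, 0)
    sample = [
        (t0, "%BGP-5-ADJCHANGE: neighbor 10.1.1.1 Down BGP Notification sent"),
        (t0 + timedelta(seconds=30),
         "%ISDN-6-CONNECT: Interface BRI0/0:1 is now connected to 5551234"),
    ]
    print(primary_path_failed(sample))
```

The sample addresses, interface name and dialled number are made up. A hit only says that both events happened close together; an operator would still confirm the root cause.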


I would not attempt to use EEM to shut down the primary link when the PE's IP address on the interface is not reachable.


Hope to help

Giuseppe

goulin Mon, 05/24/2010 - 19:05

Hi Giuseppe,


Thanks for the link.  As you said, I'd need this to be configured on the PE also so it probably won't work in my case.


Anyhow, just with regard to BFD - I can't seem to find anywhere where it suggests that the interface state is changed when a BFD neighbor is not detected.  If I combine BFD with BGP and lose the BGP neighbor, will it modify the interface state (i.e. will the line protocol go down)?  If not, then it probably won't help in my situation, other than increasing the speed of convergence.


Thanks,

goulin

Giuseppe Larosa Tue, 05/25/2010 - 02:16

Hello Goulin,


>> other than increasing the speed of convergence.


Depending on the application, this may be important.


For sure the interface will not be declared down if BFD doesn't receive answers, but the BGP session will be. From the fact that the BGP session is down it can easily be deduced that the primary path is down at OSI Layer 3 and that you are using the DDR.

After the BGP session is down you could run a traceroute to show that the path is via the ISDN DDR.


You can write down a procedure for NOC operators; probably you could even write a script that raises an alarm when the traceroute output changes (I don't know if it is easy or not, but it should be possible).
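A rough sketch of such a script in Python, under stated assumptions: the traceroute text has already been captured somehow (for example from the NMS host towards the CE), hop lines start with the hop number, and addresses are IPv4. The names `parse_hops` and `path_changed` are made up for this example, and the sample addresses are placeholders.

```python
import re

# First IPv4 address on a hop line; hop numbers and RTTs have fewer dots, so they don't match.
_HOP_IP = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def parse_hops(traceroute_output):
    """Return the responding hop addresses in order ('*' for a silent hop)."""
    hops = []
    for line in traceroute_output.splitlines():
        line = line.strip()
        # Hop lines begin with the hop number; skip headers and blank lines.
        if not line or not line[0].isdigit():
            continue
        match = _HOP_IP.search(line)
        hops.append(match.group(1) if match else "*")
    return hops


def path_changed(baseline_output, current_output):
    """True if the hop sequence differs from the known-good baseline."""
    return parse_hops(baseline_output) != parse_hops(current_output)


if __name__ == "__main__":
    # Illustrative outputs only: baseline via the MPLS PE, current via the ISDN backup.
    baseline = (
        "traceroute to 192.0.2.10 (192.0.2.10), 30 hops max\n"
        " 1  10.0.0.1  1.1 ms\n"
        " 2  10.1.1.1  5.0 ms\n"
        " 3  192.0.2.10  9.0 ms\n"
    )
    current = (
        "traceroute to 192.0.2.10 (192.0.2.10), 30 hops max\n"
        " 1  10.0.0.1  1.2 ms\n"
        " 2  172.16.0.1  40.3 ms\n"
        " 3  192.0.2.10  85.7 ms\n"
    )
    if path_changed(baseline, current):
        print("ALARM: path to CE differs from baseline")
```

Comparing whole hop lists is deliberately strict: any re-route inside the provider network would also trigger it, so in practice one might compare only the hop that distinguishes the primary path from the backup path.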


As I wrote, you should not write an EEM script that shuts the primary port when BFD fails, since the port must stay up to allow the primary path to restore.


Hope to help

Giuseppe
