EEM script to Reload router 1 Time if Dial Fails

Nov 2nd, 2009

We have a script that tracks objects and, if access to those objects fails, reloads the router one time (see below). We now want to apply the same approach to dial backup for another customer: if the router fails to connect after round-robining through 4 toll-free 800 numbers, reload the router.

event manager applet vpn_tunnel_rebooter

event none

action 1.0 cli command "enable"

action 2.0 cli command "config t"

action 3.0 cli command "no event manager applet vpn_tunnel_unreachable"

action 4.0 cli command "end"

action 5.0 cli command "write mem"

action 6.0 reload

!

event manager applet vpn_tunnel_up

event track 456 state up

action 001 cli command "enable"

action 002 cli command "config t"

action 003 cli command "event manager applet vpn_tunnel_unreachable"

action 004 cli command "event track 456 state down"

action 005 cli command "action 1.0 policy vpn_tunnel_rebooter"

action 006 cli command "end"

Any ideas on how we might accomplish this? I have very little EEM experience :)

1) Current object tracking tracks 3 objects. If access to all three is down, we go into a 180 sec delay down timer. If they remain down, we switch over to dial backup.

2) We round robin between 4 dial numbers, and if all 4 fail to connect, we want to reload the router "ONE" time.

3) When it comes back up and the objects are still unavailable, we attempt to dial the 4 numbers again, and if that also fails to connect, "DO NOT" reload this time.

Gerard Roy Tue, 11/03/2009 - 08:51

Hi there,

We have Version 12.4(15)T7

c181x-adventerprisek9-mz.124-15.T7.bin

Thanks for your support!

Joe Clarke Tue, 11/03/2009 - 10:00

Okay, so Tcl will be required here. Please provide a flow chart of the process that you want to occur. I've got a good idea of what needs to happen when the problem is first encountered, but I'm not clear exactly what you want to do after the reboot.

Gerard Roy Tue, 11/03/2009 - 10:43

Some background on why this is required:

We have found a known issue with the Cisco 1811 built-in modem: if the router attempts to dial when the phone line has been unplugged or interrupted (SR: 612816679 - a fax or phone is attached to the same line), the router will never connect via dial until the router is recycled. We have cleared the line and shut/no shut the dialer and async interfaces, and nothing works. The ONLY way is to be sure the phone line is plugged into the v.92 port and reload. From there on it will dial faithfully until the next disconnect or interruption.

Flow - Not sure of all the logic here so please correct as you see fit.

1) Router is up and is tracking three objects - if access to any one object or 2 objects fails - do nothing. If access to all three objects fails - ip sla will kick in delay down timer of 180 seconds. If access to objects recovers before timing out, restore tracking to up and cancel delay down.

2) If delay down timer expires with no object recovery - attempt to dial as normal.

3) Router will round robin between 4 known 800 numbers and if it fails to connect on all 4 tries, reload the router.

4) When router comes back up and tracking still shows down, have it continue to cycle thru the 800 numbers and attempt to dial as it would have done normally but do not reboot again.

5) After a reload, once dialup has started to work, the router should disconnect from the dialed 800 number. Reset the reload-once script back to a ready state.

Joe Clarke Tue, 11/03/2009 - 10:52

Okay, more questions.

1. I assume you already have the IP SLA collector and tracking statements working correctly? A show run would be helpful.

2. No problem here. The policy will trigger off the tracked object which ANDs together the other tracked objects. If that ANDed tracked object goes down, the EEM policy will fire. What commands are required to switch to dial backup?

3. How does one know if all four numbers failed?

4. Once 3 is answered, this is not a problem.

5. No problem.

Gerard Roy Tue, 11/03/2009 - 11:37

1. I assume you already have the IP SLA collector and tracking statements working correctly? A show run would be helpful.

Yes, IP Sla works and show run attached :)

2. No problem here. The policy will trigger off the tracked object which ANDs together the other tracked objects. If that ANDed tracked object goes down, the EEM policy will fire. What commands are required to switch to dial backup?

Not sure what you are asking here. Interesting traffic is ACL 101, and dialing is triggered when an IPsec tunnel needs to be built out the dialer interface and the default route of 0.0.0.0 is now known via this interface (broadband is down).

3. How does one know if all four numbers failed?

See attached file that shows the normal failures. I believe the TTY1: Modem: (unknown)->HANGUP would be an indication that the modem has to be reset.

4. Once 3 is answered, this is not a problem.

5. No problem.

You do not know how much we appreciate this. Thanks Again. We Owe You!

Joe Clarke Tue, 11/03/2009 - 11:54

So, nothing really needs to be done to switch to dial backup, right? That is, tracked object 456 will go down, then the DHCP-added route will be dropped, thus making the Dialer route more desirable. The EEM policy should also watch tracked object 456 then.

Will you be leaving debugging enabled forever?

Gerard Roy Tue, 11/03/2009 - 12:10

Yes exactly, routes would be removed and 456 down is correct.

Good point - we would not want to leave debugging on. Can we leave all debugging off and repeatedly run show dialer to get what you would need to key off of? There will be 4 dial strings, each with a Successes column and a Failures column. If the Failures column increments by 1 for the dial string of the attempted call, keep watching through the remaining three strings until all 4 strings have incremented by one failure, and then reboot.

Joe Clarke Tue, 11/03/2009 - 18:31

Okay, I'm almost done. I just need to write one more policy to disarm the dial backup watch if the dial backup comes up. What syslog messages do you get when dial backup comes up as it should?

Gerard Roy Wed, 11/04/2009 - 09:37

I have attached a file that shows a successfully connected dial session. I believe we can use "Current call connected" string for denoting a successful connection. Will that work?

Joe Clarke Wed, 11/04/2009 - 12:07

Okay, these policies should do what you want. They are all untested as I do not have a dial backup setup. First, you need to register the following applets:

event manager applet remove-dial-backup-watch

event syslog pattern "SYS-5-RESTART"

action 1.0 cli command "enable"

action 2.0 cli command "config t"

action 3.0 cli command "no event manager policy tm_check_dial_backup.tcl"

action 4.0 cli command "end"

event manager applet watch-track-down

event track 456 state down

action 1.0 syslog msg "Track 456 is down, waiting to see if dial backup comes up"

action 2.0 cli command "enable"

action 3.0 cli command "config t"

action 4.0 cli command "event manager policy tm_check_dial_backup.tcl"

action 5.0 cli command "end"

event manager applet watch-track-up

event track 456 state up

action 1.0 syslog msg "Track 456 is up, removing dial backup watcher"

action 2.0 cli command "enable"

action 3.0 cli command "config t"

action 4.0 cli command "no event manager policy tm_check_dial_backup.tcl"

action 5.0 cli command "end"

Then, you need to set an environment variable for the Tcl policy:

dial_backup_numbers : Comma-separated list of numbers to check

For example:

event manager environment dial_backup_numbers 18667143757,18003179379,18886601039,18004434603

Then INSTALL but do NOT register the attached Tcl policy. That is, copy the attached Tcl policy to your EEM user policy directory, but do not register it.

Then everything should just work.
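To make the interplay of the three applets above easier to follow, here is a minimal sketch in Python. It is not the attached Tcl policy, which is not reproduced in this thread. It models only whether tm_check_dial_backup.tcl ends up registered after a sequence of events; the event names restart, track_down and track_up are labels invented for this illustration.

```python
# Illustrative model only: which applet events leave the Tcl policy registered.
# The event names are hypothetical labels for the three applets in this post.
REGISTERS = {"track_down"}             # watch-track-down adds the policy
UNREGISTERS = {"restart", "track_up"}  # remove-dial-backup-watch and watch-track-up remove it


def policy_registered(events, registered=False):
    """Replay events in order and return whether the policy ends up registered."""
    for event in events:
        if event in REGISTERS:
            registered = True
        elif event in UNREGISTERS:
            registered = False
        else:
            raise ValueError(f"unknown event: {event}")
    return registered


if __name__ == "__main__":
    # Track 456 goes down, the policy reloads the router, SYS-5-RESTART is logged.
    print(policy_registered(["track_down", "restart"]))
    # A fresh down transition after the restart arms the watcher again.
    print(policy_registered(["track_down", "restart", "track_down"]))
```

Under this model, a second reload can only happen if track 456 produces a new down transition after the restart.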

Gerard Roy Thu, 11/05/2009 - 11:45

Wow! - Awesome!

I will install and test right now and let you know.

Thanks!

Gerard Roy Thu, 11/05/2009 - 12:53

I ran a bunch of debugs:

debug event manager, debug dial, debug track, debug ppp nego and got some errors. The router dials as it should in my environment and I am looking to get it to fail to dial while the cable is plugged in so I can see if it will reboot after circulating thru the 4 number strings. See attached for details.

Gerard Roy Thu, 11/05/2009 - 16:39

I found the error in the script. There was an extra "t" near the "proc get_curr_states { output numbers } {" line and am now testing again. I will update you once I find the results :)

Joe Clarke Thu, 11/05/2009 - 22:23

Yes, fixing that typo will certainly fix the errors you're seeing in the logs.

Gerard Roy Mon, 11/09/2009 - 14:42

Hi There (sorry I never got your first name :), We have a new error message: "Failed to find 18667143757 in output of "show dial"". I don't fully understand why we get this since I can see the string so I was hoping you would have another look. Also, can you add one more requirement? Can you add a "Clear Line 1" at the start of the script just to be sure the line is not hung? (see very bottom of file)

It has never rebooted yet, and I am assuming that is because it has not been able to get past the "Failed to find 18667143757" line.

I have attached a debug file for your reference.

Thanks again for your awesome scripting.

Jerry

Joe Clarke Mon, 11/09/2009 - 14:55

This version should fix the problems you're seeing, and adds the clear line support.

Joe Clarke Mon, 11/09/2009 - 17:44

Actually, "clear line" is interactive, so you'll need this version instead.

Gerard Roy Tue, 11/10/2009 - 10:21

Hi Joe,

What is the logic behind when this should reload? I disconnected the phone line and then disabled the uplink to the internet on Fa0 and the object tracking went into delay down and attempted to fail over to dial as it should. It cycled thru all 4 numbers and never reloaded the 1811. I let it continue a couple more complete cycles but it never reloaded. I have attached the logs.

Thanks,

Jerry

Joe Clarke Tue, 11/10/2009 - 10:36

According to this, the reload should have happened. Here is a version with some debugging added so I can see what's going on. Install this version, and also enable "debug event manager tcl commands" and get the output of "show event manager policy registered" when the main link is down. Repeat the test, and post the new log.

Gerard Roy Tue, 11/10/2009 - 12:42

Excellent - it works!!!! Just awesome! You are the MAN! You do not know how much this is appreciated - again - THANK YOU.

I have attached 2 files for your review. One is the successful reload and the second is the success-after-reload because I wanted to make sure it did not reload a second time.

Gerard Roy Tue, 11/10/2009 - 13:49

Joe,

BTW - what should I remove in the script to disable the debugging you added?

Thanks,

Jerry

Gerard Roy Tue, 11/10/2009 - 16:41

Works great - Thanks and please post for all the world to use. This is a Huge issue with 1811's at the very least.

Highest Regards,

Jerry Roy

Gerard Roy Wed, 01/06/2010 - 16:41

Hi Joe,

Looks like the script continues to reload the router. I am trying to nail down the exact scenario in which it does this. I currently have the broadband disconnected and the modem line unplugged so I could test the one-time reboot. I saw it reboot one time and assumed all was well, but after a short period I heard it reboot continuously after each cycle through the 4 numbers. What can I look at? I notice that after the reboot the reachability of the tracked objects starts running back down through the 180 second delay down timer (broadband and phone line are disconnected), then it goes through the 4 dial backup numbers again and finally reboots again. It continues this over and over. (See attached)

Message was edited by: Gerard Roy - added current config from router

Joe Clarke Wed, 01/06/2010 - 17:24

Actually, this new version, while it has a bug fix, will not help what you are seeing.  What you are seeing is not a problem with the EEM code.  Your tracked objects are flapping, and EEM is doing what it's told.  When the router reboots, tracked object 456 comes up (01:07:38 UTC).  That installs the Tcl policy which watches for the tracked object to go down.  Then, at 01:10:38 UTC, the tracked object 456 goes down.  This causes the Tcl script to watch the dial state, then trigger another reload.  The process repeats at 01:13:38 UTC.  I think this is being caused by your delay down.

Gerard Roy Wed, 01/06/2010 - 17:32

Nope, it continues to reload. What can I send you that might help debug?

Gerard Roy Wed, 01/06/2010 - 17:54

I see it now. For some reason it says the objects are up even when nothing is plugged into the damn port. Just lame.

Let me try a newer version of code to see if this is still an issue.

Gerard Roy Wed, 01/06/2010 - 18:26

Joe,

What version of code did you develop on? I have upgraded it and now the problem seems to be gone. Can you modify the Tcl script to clear the line before it cycles through the 4 dial backup numbers? It seems to me it would make more sense to clear at this time.

Joe Clarke Wed, 01/06/2010 - 19:12

I developed with the assumption that EEM 2.1 was being used, but as I said, the problem is most likely related to your "delay down".  A tracked object can only have one of two states (either up or down).  When the router reloads, the object is most likely up pending the delay down timer of 180 seconds (which explains why the EEM policies were firing on three minute boundaries).

I didn't see any object tracking changes in later 12.4T code, but I could have missed a bug fix.  If the default object state is now up (or you're not getting the Down->Up transition change now), then that's good.

The current Tcl script will clear the line before reloading the device.  It will clear the line, then see if the backup is still down.  If so, then the reload will happen.  If the backup comes up successfully, the policy will remove itself.

Joe Clarke Wed, 01/06/2010 - 19:22

I found the bug.  It's CSCsr27735.  The fix for this made it so there is now a "default state" you can specify for tracked objects watching IP SLA operations.  The default state is DOWN (which is what you want in this case).  This change was made in 12.4(22)T.
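A rough Python sketch of the behavior described in the last few posts, assuming a simplified tracked object that is polled once per second and reports down only after the probe has failed for the whole delay down period. The function name and the polling model are assumptions made for illustration; this is not how IOS implements tracking.

```python
def track_events(default_state, reachable_at, delay_down=180, horizon=600):
    """Return (time, new_state) transitions of a simplified tracked object.

    default_state: "up" or "down", the state reported right after boot.
    reachable_at:  function mapping a time in seconds to True/False.
    delay_down:    seconds the probe must fail before "down" is reported.
    """
    state = default_state
    down_since = None  # time at which the current run of failures began
    events = []
    for t in range(horizon + 1):
        if reachable_at(t):
            down_since = None
            if state == "down":
                state = "up"
                events.append((t, "up"))
        else:
            if down_since is None:
                down_since = t
            if state == "up" and t - down_since >= delay_down:
                state = "down"
                events.append((t, "down"))
    return events


if __name__ == "__main__":
    def never_reachable(t):
        return False  # both cables unplugged for the whole run

    print(track_events("up", never_reachable))
    print(track_events("down", never_reachable))
```

In this model, an object that defaults to up emits a down transition once the delay expires, while an object that defaults to down emits no transition at all, so nothing re-arms the dial backup watcher.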

Gerard Roy Wed, 01/06/2010 - 19:43

If both the modem cable and the broadband cable have been unplugged and the router is rebooted via the script, do you think the object tracking should show up right after the reboot? Well, it does. The delay down has no business showing up after a reboot UNTIL the actual object can be reached. The older version c181x-adventerprisek9-mz.124-15.T7.bin shows it up after the reboot, and I was not able to see for how long, until it finally figured out it was really down, so the script rebooted it again. I have installed c181x-advipservicesk9-mz.124-22.T3.bin and it looks to be resolved. The delay down is required to be sure the link is not flapping. Thanks again.

Joe Clarke Wed, 01/06/2010 - 20:19

Yes, this is what I have summarized above (see all my recent posts on this thread).  As I said, the bug which changed the behavior to having IP SLA tracked objects be DOWN by default was merged into 12.4(22)T.  You should be good to go on that version of code.  If you need this to work on older code, a hack would be required in the policy which watches for the SYS-5-RESTART syslog.

Gerard Roy Thu, 01/07/2010 - 08:53

Hi Joe,

We have 850 locations I am going to have to update code on if I can't get this to work with existing code. How tough would it be to do the hack you mentioned?

Thanks,

Jerry

Joe Clarke Thu, 01/07/2010 - 09:44

I believe something like this would work:

event manager environment quote "

event manager applet remove-dial-backup-watch

event syslog pattern "SYS-5-RESTART"

action 1.0 cli command "enable"

action 2.0 cli command "config t"

action 3.0 cli command "no event manager policy tm_check_dial_backup.tcl"

action 4.0 cli command "event manager applet remove-watch-timed"

action 5.0 cli command "event timer countdown time 185"

action 5.1 cli command "action 1.0 cli command $quote enable$quote"

action 5.2 cli command "action 2.0 cli command $quote config t$quote"

action 5.3 cli command "action 3.0 cli command $quote no event manager policy tm_check_dial_backup.tcl$quote"

action 5.4 cli command "action 4.0 cli command $quote no event manager applet remove-watch-timed$quote"

action 5.5 cli command "action 5.0 cli command $quote end$quote"

action 6.0 cli command "end"

Gerard Roy Thu, 01/07/2010 - 11:44

Hi Joe,

Still rebooting - see attached. I noticed when I rolled back it also errored on the ip sla statements. I had to re-add them back in as rtr statements. Thank You.

Joe Clarke Thu, 01/07/2010 - 12:06

Increase the 185 second timer to 210.  That should provide enough time.

Gerard Roy Thu, 01/07/2010 - 12:39

I think that did it.

One more go-around with reboots and code rollbacks and I am sure it is resolved. Just awesome work. You should get a raise! You have been fantastic.

BTW - you up for another challenge? Take a look at this bug 613172159. I need these 1811s to actually fail back over to the primary circuit. It is a dual DHCP scenario (2 broadband providers both serving IPs via DHCP). The problem we are seeing is that it fails to fall back to the primary. Looks like the "fix" doesn't work correctly with object tracking.

Thanks again!

I will give an update shortly.

Gerard Roy Fri, 01/08/2010 - 18:32

Hi Joe,

Have a strange issue with the scripting. It only seems to work on sites that obtain their IP address via dhcp. I deployed on 3 statically assigned sites and none of them will do the reboot. Looks like they are stuck in some sort of loop. The dialer never seems to increment in the output when I do a debug event manager all. I have attached a file. Thanks for looking

Gerard Roy Fri, 01/08/2010 - 18:39

Yup - for some reason it will not get past the 1st 4 dial attempts. It never increments beyond the 1st 4.

Joe Clarke Fri, 01/08/2010 - 21:32

The script is working as designed.  The number of failures is not incrementing in the "show dial" output.  The first time the script runs it sees 1 failure per number.  It stores that value, then continues to run "show dial" until each number has at least one more failure.  Only then will it reload the router.  You need to figure out why the dial backup is no longer trying to dial after one failure.
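The reload condition described here can be sketched in a few lines of Python. This is an illustration of the logic only, not the Tcl policy itself: the function names are made up, and the failure counts are passed in as plain dictionaries rather than parsed from real "show dial" output.

```python
def parse_numbers(env_value):
    """Split a comma-separated dial_backup_numbers value into a list."""
    return [n.strip() for n in env_value.split(",") if n.strip()]


def all_numbers_failed_again(baseline, current, numbers):
    """True only when every number shows more failures than in the baseline.

    baseline / current map a dial string to its failure count.
    A number missing from either dict is treated as having 0 failures.
    """
    return all(current.get(n, 0) > baseline.get(n, 0) for n in numbers)


if __name__ == "__main__":
    numbers = parse_numbers("18667143757,18003179379,18886601039,18004434603")
    baseline = {n: 1 for n in numbers}  # first run: 1 failure per number

    stuck = dict(baseline)              # counters never move again
    print(all_numbers_failed_again(baseline, stuck, numbers))

    retried = {n: 2 for n in numbers}   # every number failed once more
    print(all_numbers_failed_again(baseline, retried, numbers))
```

If even one counter stops incrementing, the condition never becomes true and no reload is triggered, which matches the behavior seen on the statically addressed sites.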

Gerard Roy Mon, 01/11/2010 - 14:40

Joe,

A command we asked Cisco to fix for us that was applied to the routers last year is the culprit on why the static sites would not work with the EEM script (actually all sites since I did not have it applied to my lab router when testing). The command “dialer pre-classify” was created to prevent random IP traffic from causing sites to dial a connection for no apparent reason but now it limits the number of times a single valid IP address will cause the dialup to trigger traffic per dialup number. The script is looking to confirm the router attempted to dial all 4 dial backup numbers and then reboot. With dialer pre-classify it never reaches a point where it sees all 4 numbers attempted.

I have removed the dialer pre-classify command on the sites in question. Now I need to get Cisco to fix the command again.

Thank You Sir.

Jerry

Gerard Roy Wed, 01/13/2010 - 12:04

Joe,

Can you make a modification for me? Add an event to have the router ping 200.200.200.1. We have a route to this on all our routers out the dial interface to test dial backup (used to confirm that the modem, line and PPP credentials all work). I noticed that when I pinged this address after your script had gone through the 4 dial numbers, it finally did the reboot. Looks like the script just needed the router to attempt to dial one more time out the 1st number, because it seems to have missed recording the first dial event when it started tracking. (Does this make any sense?)
