After hours monitoring

Tod Larson · 02-25-2012

We have a gremlin in the system where users often can't log into the Windows domain around 4 AM. The problem clears up around 5 AM. I'd like to set up some sort of monitor on a couple of routers to validate connectivity to our NetApp and our DNS servers overnight. How can I configure an IP SLA or something similar to test connectivity every 10 seconds or so, then issue a syslog message if the test fails? Then during the day we can analyze the results.

Hopefully the results of this monitoring would give us a clue as to what is going on.

sean_evershed · 02-25-2012

Hi, you could write an EEM applet that emails you the results of the tests it performs.

See the link below for an example:

https://learningnetwork.cisco.com/blogs/network-sheriff/2009/06/19/writing-your-first-eem-applet
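If running the test from a host instead of the router is acceptable, the same idea (probe every 10 seconds, send a syslog message on failure) can be sketched in Python. This is only a rough sketch and not an IP SLA or EEM configuration. The target addresses, ports, and the syslog collector address are placeholders I made up, and it uses a TCP connect rather than an ICMP echo so it can run without raw-socket privileges.

```python
import logging
import logging.handlers
import socket
import time

# Hypothetical targets: replace with your own DNS servers and NetApp filer.
TARGETS = [
    ("192.0.2.10", 53),   # DNS server over TCP 53 (placeholder address)
    ("192.0.2.20", 445),  # NetApp CIFS service (placeholder address)
]
INTERVAL_SECONDS = 10
TIMEOUT_SECONDS = 2


def tcp_probe(host, port, timeout=TIMEOUT_SECONDS):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def make_logger(syslog_address=("localhost", 514)):
    """Build a logger that sends records to a syslog collector over UDP."""
    logger = logging.getLogger("overnight-probe")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.handlers.SysLogHandler(address=syslog_address))
    return logger


def run_once(targets, logger, probe=tcp_probe):
    """Probe every target once; log a warning for each failure and return them."""
    failed = [(host, port) for host, port in targets if not probe(host, port)]
    for host, port in failed:
        logger.warning("probe failed: %s:%d unreachable", host, port)
    return failed


def main():
    """Probe all targets every INTERVAL_SECONDS until interrupted."""
    logger = make_logger()
    while True:
        run_once(TARGETS, logger)
        time.sleep(INTERVAL_SECONDS)
```

Calling `main()` starts the loop. Running it from a machine on the same segment as the affected users means a failure reflects the path they actually use; the timestamps in the syslog records can then be lined up against the 4 AM to 5 AM window.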

hobbe · 02-27-2012

Hi

First of all, I would enable logging on all systems and check what that tells me.

If your users have problems logging on to the Windows machines, what do their logs tell you? And what do the Windows servers' logs tell you?

Then I would check link saturation, in case a backup or something similar is running at that time and saturating a link.

That's what I would start with.

Good luck

HTH

sleepyshark · 02-27-2012

Tod -

For this particular instance, I'd really look outside of infrastructure. While the error may indeed be caused by a lack of infrastructure, it appears that you have machines/applications/appliances running some tasks overnight. I'd look for a comprehensive network monitoring program (SolarWinds is my fave, but it's super expensive). You need to monitor not only your core/distribution layer, but also your to/from traffic, and see what is causing this and why. Use SNMP and NetFlow, and you'll gain invaluable insight into what's going on overnight.

From personal experience, I have run into the same problems where SAN-SAN replication runs at the same time as snapshots/backups and it [literally] brings the entire network to a crawl. Simply identifying what is happening across the entire network can help you make subtle scheduling changes that could eliminate these issues.
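As a rough illustration of the scheduling check described above, here is a minimal Python sketch that flags overnight jobs whose time windows intersect. The job names, start times, and durations are entirely hypothetical; you would fill them in from your own replication, snapshot, and backup schedules. It assumes all jobs start on the same calendar day.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical overnight jobs: (name, start "HH:MM", duration in minutes).
JOBS = [
    ("san-replication", "03:30", 90),
    ("snapshots", "04:00", 30),
    ("tape-backup", "01:00", 120),
]


def window(start_hhmm, minutes, day=datetime(2012, 1, 1)):
    """Return (start, end) datetimes for a job starting at HH:MM on `day`."""
    start_time = datetime.strptime(start_hhmm, "%H:%M").time()
    start = datetime.combine(day.date(), start_time)
    return start, start + timedelta(minutes=minutes)


def overlaps(jobs):
    """Return pairs of job names whose time windows intersect."""
    result = []
    for (name_a, start_a, dur_a), (name_b, start_b, dur_b) in combinations(jobs, 2):
        a_start, a_end = window(start_a, dur_a)
        b_start, b_end = window(start_b, dur_b)
        # Two intervals intersect when each one starts before the other ends.
        if a_start < b_end and b_start < a_end:
            result.append((name_a, name_b))
    return result
```

With the placeholder schedule above, `overlaps(JOBS)` reports the replication and snapshot jobs as colliding, which is the kind of pair worth staggering.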

Thanks,

Sean Brown

http://www.sleepyshark.com

Tod Larson · 03-04-2012

Thanks for the input. I now have What's Up Gold set to show an alert every time any of our 8 switches has an interface that averages >50% utilization for 10 minutes. Now we wait and see. In the first 24 hours the only alarm was a WAN interface that hit 70% at 3 PM... definitely not my gremlin.
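For anyone wanting to see what a rule like this amounts to, here is a small Python sketch of a sustained-utilization check computed from interface octet counters. It is not how What's Up Gold implements it internally; the function and class names are my own, and it assumes one sample per minute from a 32-bit counter such as `ifInOctets`, with at most one counter wrap between samples.

```python
from collections import deque

COUNTER32_MAX = 2**32  # 32-bit SNMP counters wrap at this value


def utilization_percent(octets_prev, octets_now, seconds, link_bps,
                        counter_max=COUNTER32_MAX):
    """Percent utilization between two octet counter samples.

    The modulo handles a single counter wrap between the two samples.
    """
    delta_octets = (octets_now - octets_prev) % counter_max
    return delta_octets * 8 * 100.0 / (seconds * link_bps)


class SustainedThreshold:
    """Signal an alert when the mean of the last `window` samples exceeds `limit`."""

    def __init__(self, window, limit):
        self.samples = deque(maxlen=window)
        self.limit = limit

    def add(self, percent):
        """Record one sample; return True if the full window averages above limit."""
        self.samples.append(percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and sum(self.samples) / len(self.samples) > self.limit
```

With one-minute polling, `SustainedThreshold(window=10, limit=50.0)` corresponds to "averages >50% for 10 minutes". Averaging over the window means a short burst will not trip the alert, but it also means a brief saturation event can be missed, so a shorter window may be worth trying for the 4 AM period.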

Mr_Helpful · 03-04-2012

The first place to check would really be the logs of your domain controllers...

They must be doing something around that time, or at least they must be noticing that they have some issues themselves. As mentioned by others, it might be something within the storage environment that (depending on your environment) may not even be dependent on your network.

Sounds like one of those "the 'network' is not working" issues that Windows admins love to toss our way...

;-)