Besides the occasional software bugs we run into from time to time, the biggest cause of downtime we encounter is failed line cards, especially the ws-x6248-rj45 and ws-x6348-rj45 (not always a full hardware failure; sometimes a soft reset or reinsertion will recover the card).
There are plenty of alerts we can receive once a card goes into a failed state (reactive), but I was wondering if anyone has developed a means of proactively checking a line card's health. The aim is to get advance warning and perform a scheduled swap-out, thus mitigating the impact.
I was thinking along the lines of checking the SCP counters and monitoring the number of retries. An increasing number of SCP retries is indicative of a pending problem. This check would need to be scripted around the output of sh scp module <#>.
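To make the idea concrete, here is a minimal sketch of the trend check in Python. It assumes the retry counter for each module has already been collected at regular intervals (for example, by a script that logs in and runs sh scp module <#>); the parsing of that output is deliberately left out because the format varies by release. The function names, the sample data, and the threshold are all hypothetical and only illustrate the logic.

```python
from typing import Dict, List


def retry_increase(samples: List[int]) -> int:
    """Total growth of a retry counter across successive samples.

    A drop between two samples is treated as a counter reset
    (e.g. the module was reloaded), so that interval is skipped.
    """
    growth = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            growth += cur - prev
    return growth


def modules_to_watch(history: Dict[int, List[int]], threshold: int) -> List[int]:
    """Return slot numbers whose retry growth meets or exceeds the threshold."""
    return sorted(
        slot for slot, samples in history.items()
        if retry_increase(samples) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical retry counts per slot, oldest sample first.
    history = {
        3: [10, 10, 10, 10],   # stable
        4: [2, 9, 25, 60],     # climbing steadily
        5: [40, 41, 0, 1],     # counter reset mid-way
    }
    # Threshold is an arbitrary illustration, not a recommended value.
    print(modules_to_watch(history, threshold=20))
```

A sensible threshold would have to come from baselining healthy cards in your own environment, since a small number of retries can be normal.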
Others may be measuring the asicreg counters, but a lot of the time these are engineering commands that only Cisco TAC can interpret.
I was wondering if anyone has BKMs we could apply to monitor a line card's health.