Cisco Support Community

CSS/CSM: Day 828 problem


Introduction:

This document briefly summarizes the 828-day problem often observed on CSS and CSM.

Core Issue:

 

There is a need to provide detailed information on how the 828 days problem occurs, and ways to avoid it.

Resolution:

 

CSS and CSM are based on 32-bit VxWorks.

Further, the VxWorks clock ticks at 60 Hz (the counter increases by one every 1/60 of a second). You can read the value with the tickGet() API, which is documented at the following URL. For example, if you call tickGet() five seconds after booting, it returns 0x12c (60 Hz * 5 sec = 300). CSS and CSM refer to this value for various purposes.

http://www.vxdev.com/docs/vx55man/vxworks/ref/tickLib.html
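
As a quick check of the arithmetic above, the following minimal sketch converts seconds to ticks at the 60 Hz rate stated in this document. It is plain Python, not VxWorks code, and the names TICK_RATE_HZ and seconds_to_ticks are hypothetical helpers chosen only for illustration.

```python
TICK_RATE_HZ = 60  # tick rate stated in this document

def seconds_to_ticks(seconds: int) -> int:
    """Number of clock ticks that elapse in the given number of seconds."""
    return TICK_RATE_HZ * seconds

five_sec_ticks = seconds_to_ticks(5)
print(five_sec_ticks, hex(five_sec_ticks))  # 300 0x12c
```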

 

As tickGet() is a 32-bit timer, its maximum value is 2^32 - 1 = 4294967295 (0xffffffff). When the counter passes this value after 2^32 = 4294967296 ticks, it wraps and is reset to 0.

In other words, the tickGet() value will be reset to 0 after about 828 days and 12 hours have elapsed, according to the following formula.

 

2^32 ticks / (60 ticks/sec * 60 sec * 60 min * 24 hr) = 828.5 days

 

Therefore, various problems will occur after 828 days have elapsed.
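
The wrap period can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming exactly 60 ticks per second as stated above; the constant and function names are hypothetical.

```python
TICK_RATE_HZ = 60          # tick rate stated in this document
COUNTER_MODULUS = 2 ** 32  # a 32-bit counter wraps after this many ticks

def wrap_period():
    """Time until the counter wraps, as (days, hours, minutes, seconds)."""
    total_seconds = COUNTER_MODULUS // TICK_RATE_HZ
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds

print(wrap_period())  # (828, 12, 6, 28)
print(round(COUNTER_MODULUS / (TICK_RATE_HZ * 86400), 1))  # 828.5
```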

 

 

Let us explain this problem in a bit more detail, using the keepalive feature as the example. The CSS sends keepalives at regular intervals.

 

By default, the CSS sends ICMP packets to a service every five seconds as an availability check.

Within the CSS, the next transmission time is calculated from the current time and the keepalive interval (five seconds, 0x12c ticks; next_keepalive = tickGet() + 0x12c).

 

For example, if a keepalive is sent 3600 seconds after booting, the next ICMP packets will be sent 3605 seconds after booting.

Once the value returned by tickGet() is larger than the tick count for 3605 seconds (tickGet() > next_keepalive), the keepalive packets are sent.

 

If the tickGet() value is 0xfffffff0, the next_keepalive value is set to 0x10000011c (0xfffffff0 + 0x12c). However, the maximum value of tickGet() is 2^32 - 1 = 0xffffffff. When the counter passes this maximum, it is reset to 0, while the next_keepalive value remains 0x10000011c.

In this case, the condition tickGet() > next_keepalive will never become true, and thus the CSS stops sending keepalive packets.
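
The failure described above can be simulated outside the device. The sketch below is plain Python that emulates a 32-bit counter with a mask; it is not CSS/CSM code, and every name in it (tick_after, naive_due, wrap_safe_due) is hypothetical. The wrap-safe comparison shown for contrast is a general technique for wrapping counters, not a confirmed description of how any Cisco fix was implemented.

```python
MASK32 = 0xFFFFFFFF
KEEPALIVE_TICKS = 0x12C  # five seconds at 60 Hz

def tick_after(start: int, elapsed: int) -> int:
    """Value a 32-bit tick counter shows `elapsed` ticks after showing `start`."""
    return (start + elapsed) & MASK32

def naive_due(now: int, deadline: int) -> bool:
    """The comparison described in the text: tickGet() > next_keepalive."""
    return now > deadline

def wrap_safe_due(now: int, deadline: int) -> bool:
    """Due when the 32-bit difference (now - deadline), read as signed, is >= 0."""
    diff = (now - deadline) & MASK32
    return diff < 0x80000000

last_sent = 0xFFFFFFF0
deadline_wide = last_sent + KEEPALIVE_TICKS        # 0x10000011c, wider than 32 bits
deadline_32 = deadline_wide & MASK32               # 0x11c after truncation
now = tick_after(last_sent, KEEPALIVE_TICKS + 1)   # counter has wrapped to 0x11d

print(hex(deadline_wide), hex(deadline_32), hex(now))
print(naive_due(now, deadline_wide))    # False: no 32-bit value exceeds 0x10000011c
print(wrap_safe_due(now, deadline_32))  # True: one tick past the deadline
```

The wrap-safe form only works while the two values are less than 2^31 ticks (about 414 days at 60 Hz) apart, which is far longer than any keepalive interval.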

 

Changing the base OS from 32 bit to 64 bit would also require significant changes in the CSS/CSM software that runs on the OS. Therefore, we have decided not to upgrade the base OS.

Instead, many individual bugs that could take effect after 828 days have been corrected.

 

 

For both CSS and CSM, we fixed many bugs. The root problem, however, remains. Therefore, we suggest that you reboot the CSS/CSM before 828 days have elapsed.

 

Note: CSS End of SW Maintenance Releases Date: September 20, 2012

http://www.cisco.com/en/US/prod/collateral/contnetw/ps5719/ps792/end_of_life_c51-657403.html

 

Note: CSM End of SW Maintenance Releases Date: August 26, 2011

http://www.cisco.com/en/US/prod/collateral/modules/ps2706/ps780/end_of_life_c51-577764.html

 

 

Also, analysis of some of the reported failures determined that they could not be corrected.

To avoid these problems, it is recommended that the CSS/CSM be rebooted every two years, which keeps the uptime below 828 days.
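
To plan a reboot, it can help to estimate how far a device is from the next counter wrap. The following sketch is plain Python under the same 60 Hz assumption; days_until_next_wrap is a hypothetical helper, and the uptime would have to be read from the device by other means.

```python
TICK_RATE_HZ = 60
COUNTER_MODULUS = 2 ** 32
SECONDS_PER_DAY = 86400

def days_until_next_wrap(uptime_days: float) -> float:
    """Days remaining before the 32-bit tick counter next wraps to 0."""
    uptime_ticks = int(uptime_days * SECONDS_PER_DAY * TICK_RATE_HZ)
    ticks_left = COUNTER_MODULUS - (uptime_ticks % COUNTER_MODULUS)
    return ticks_left / (TICK_RATE_HZ * SECONDS_PER_DAY)

print(round(days_until_next_wrap(730), 1))  # 98.5 days left after two years of uptime
print(round(days_until_next_wrap(900), 1))  # 757.0 days left before the second wrap
```

Because the modulo is applied to the uptime, the same helper also covers devices that have already passed the first wrap without a reboot.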

 

When the CSS has an uptime of 828 days, it cannot send packets to the management port for 18 minutes. This issue affects the management port only. The circuit and VIP addresses work fine. We recommend that you reboot the CSS before its uptime reaches 828 days.

http://www.cisco.com/en/US/docs/app_ntwk_services/data_center_app_services/css11500series/v8.20/release/note/RN820_X.html#wp223623

 

When 828 days have elapsed since the CSM was booted, the HTTP probe will fail and will stay in the down state for about 18 minutes. Reboot the CSM before 828 days have elapsed. (CSCso08858)

http://www.cisco.com/en/US/docs/interfaces_modules/services_modules/csm/4.2.x/release/notes/ol_6897.html#wp274406

 

 

Related Information

Original Document: Cisco Support Community Japan DOC-31076

Author: Yuji Shimazaki

https://supportforums.cisco.com/docs/DOC-31076

Posted on March 19, 2013

CSS/CSM: 828 日問題 (CSS/CSM: 828-Day Problem)

Comments
Community Member

There is a typo in the formula.

2^32 / (60Hz*60sec*60min*24hr) = 828.5days is the correct one.

I guess that, because the 32-bit clock tick is reset every 828.5 days, the issue will happen every 828.5 days unless the device is restarted.

You are right, I made a typo.

Thank you for pointing it out. I fixed it.

Community Member

Hi,

The root cause is that tickGet() is reset to 0 whenever it reaches 2^32 ( = 828.5 days ).

Also, I don't think this issue happens just one time, because the problem will occur at 828.5 * n days after the CSS was booted.

Cisco Employee

Yes, this symptom is not a one-time issue, but occurs at 828.5 * n days.
I handled an SR in which the uptime was 1657 days.

 

Community Member

Thanks yushimaz for your feedback.

Community Member

I have one more question.

In the sentence below, I don't understand why the next_keepalive value will be 0x1000012b.

If 0x12C is added to 0xFFFFFFF0, the next value would be 0x10000011C.

Could you please explain? Also, I guess the next value will be set to 0x11C after overflow.

---------

If the tickGet() value is 0xfffffff0, the next_keepalive value is set to 0x1000012b, ....

 

 

Cisco Employee

Thank you for your comments.
Yes, you are right!

I calculated the value with '0xFFFFFFFF + 0x12C'.
So, I fixed it.

 
