Apologies for the length; I'm just trying to get an idea of how other clients handle routine downtime such as DB patching, OS patching, Tidal patching, etc., as well as disaster recovery planning.
We are currently on Windows masters in 5.3 but are in the middle of an upgrade to 6.1, moving to Linux masters. We use the Fault Monitor (FM), which means that unless there is a definite requirement to bring the entire environment down, we always have a Tidal master running in PRD, even during Windows patching, which occurs once every 4 weeks on Sunday from 3 to 5 am. Most of our major systems that use Tidal are also down during that timeframe.
One of the masters has a patching window of 3 - 4 am and the other has 4 - 5 am. I believe the FM is patched on a different week. Because we have a Tidal 5.3 bug that is sensitive to network performance, patching both masters on the same day ensures that once the patching window is complete we always end up with a live master in the same data center as the DB cluster (which sort of mitigates our exposure to the bug).
With 6.1 and the bug supposedly fixed, the new patching schedule has the two masters being patched on different weeks.
These are my questions:
- Does anyone see any pros or cons with patching each master on a different week?
- Even with the low number of transactions, should we be setting the queue = 0 prior to the patching and failover? We haven't done this on Windows and no one's complained.
- In 6.1, if the live master happens to be the backup master, what happens when the FM goes down for patching? Can the FM only go down when the primary master is the live master? We have our fault tolerance set to auto (either master can take over as live).
- On Windows, we have the master services set to auto-start. Is there a reason not to do the same on Linux after the server bounces (given that in 6.1 there is an order to bringing things up: FM, master, then CM)? I have seen times when the master came down and the CM was up. Is that not acceptable? Should the CM always be bounced after the master bounces?
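To make the start-order question concrete, here is a rough Python sketch of what I mean by "auto-start but in order": start each component, wait until it reports up, then move on to the next. The component names, the `start` callable and the `is_up` check are all placeholders I made up for illustration (they are not Tidal commands), so treat this as an outline of the idea rather than a working startup script.

```python
import time
from typing import Callable, List, Sequence


def start_in_order(
    services: Sequence[str],
    start: Callable[[str], None],
    is_up: Callable[[str], bool],
    timeout: float = 300.0,
    poll: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start each service in sequence, waiting for it to be up before the next.

    `start` and `is_up` are supplied by the caller (hypothetical wrappers
    around whatever actually starts and checks each component).
    """
    started: List[str] = []
    for name in services:
        start(name)
        waited = 0.0
        # Poll until the service reports up, or give up after `timeout` seconds.
        while not is_up(name):
            if waited >= timeout:
                raise TimeoutError(
                    f"{name} did not come up within {timeout}s; "
                    f"started so far: {started}"
                )
            sleep(poll)
            waited += poll
        started.append(name)
    return started
```

Usage would be something like `start_in_order(["fm", "master", "cm"], start=my_start, is_up=my_check)`, where `my_start` and `my_check` are site-specific. If the master never comes up, the CM is never started, which is the behavior I am asking about.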
On a more advanced topic with DR -
Are any other clients on a clustered DB (like Oracle RAC)? Are you able to keep Tidal running somehow during DB cluster patching, or is the hard rule always to bring Tidal down? I am aware, at least on 5.3, that Tidal losing its connection to the DB is NOT pretty: lots of stuck jobs and other data integrity issues. The question arises because our DB team is looking to take advantage of some Oracle features to see if we can patch the DB without having to bring down Tidal entirely. Could failing over and forcing the new master to a specific DB cluster node that will remain available be an option?
With CM on load balancers, do people use a 'TES is down for maintenance, please log in again at 6:00 am' type HTTP redirect so that users get something more informative when Tidal is down? What do you use for the load balancer to detect that Tidal is down? Access to a file on the web server? An actual sign-on attempt? Are you able to force a redirect at any time, even when the CMs are up, just so that no one other than admins can log in (which is a nice feature, especially right after a system issue)? How do folks handle system availability after a major system issue? Do you limit access to TES until some system sanity tests and test jobs have been run? If so, how do you do it?
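To illustrate the kind of check I have in mind, here is a minimal Python sketch of a health endpoint's decision logic, combining the "file on the web server" idea with an optional real probe. The flag file and the `probe` callable are assumptions of mine for illustration, not anything Tidal ships; the load balancer would be configured to send users to the maintenance page whenever this returns something other than 200.

```python
from pathlib import Path
from typing import Callable


def health_status(flag: Path, probe: Callable[[], bool]) -> int:
    """Return the HTTP status a load balancer health check should see.

    flag  -- hypothetical maintenance flag file; if it exists, report down
             even when the CM itself is healthy (forced redirect).
    probe -- hypothetical callable that returns True when a real check
             (for example a test sign-on) succeeds.
    """
    if flag.exists():
        return 503  # admins have forced maintenance mode
    try:
        return 200 if probe() else 503
    except Exception:
        # Treat any probe failure as "down" rather than crashing the check.
        return 503
```

With this arrangement, creating the flag file would take the CMs out of rotation for normal users on demand, and removing it after sanity tests pass would bring them back.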
Lastly, does anyone have a backup fault monitor installed in another data center for DR purposes? Have you been able to make it work? I am assuming that bringing up a standby FM means having to reconfigure all the masters to point to it. Do you have scripts for doing this, or is it more manual? I can't remember if this is all in config files or whether SQL updates will need to be made behind the scenes. We currently do do this in 5.3 but noticed in our 6.1 architecture that this component is a single point of failure if we lose a data center.