Recently I was approached by one of the major customers to assist in creation of database fall-back procedure for a Windows based Oracle 10g RAC system, that is using OCFS to store database data.
The objective was to upgrade clusterware and database from Oracle 10.2.0.2 to 10.2.0.4 and on top of it apply patch 38. Obviously, both ORACLE_HOMES and database itself were to be upgraded.
In a large company world, I would suggest to create separate RAC cluster, setup standby database there and during the brief outage copy control files and redo logs, start-up database and be done with it. This way the fall-back is very simple (start using old system) and most of the work (installing and patching software) can be done without any production impact.
The problem was that customer engaged us only after their fall-back test totally failed and all outage process was based on actually upgrading production system (software and database).
The major issue was that Oracle provided fall-back failed, as downgrade did not worked as expected, as well as nothing was said about proper downgrade of ORACLE_HOMES.
At the end, we arrived to below fall-back procedure:
1) Shutdown all database related Windows services
2) Shutdown clusterware related Windows services
3) Kill ons.exe processes - they sometimes left hanging
4) Take system state backup (IMPORTANT)
5) Backup ocfs and oraclefence drivers
6) Backup database and clusterware Oracle homes
7) Backup Oracle Inventory
8) Backup control files and redo logs
9) Backup OCR using ocrconfig -export <exp_ocr_file.bkp>
10) Backup Voting disk using 'ocopy <voting_file> <exp_voting_file.bkp>'
11) Backup config files (init/tnsnames/listener/sqlnet) (just in case)
During fall-back exercise, we were able to:
1) Restore system state backup in 3 minutes
2) Restore drivers in 3 min (manual copy)
3) Oracle Homes and Inventory Restore was done using rename (1 minute)
4) OCR was restored using "ocrconfig -import" (3 minutes), no need to restore voting disk
5) Database restore from RMAN took 40-50 minutes
6) Redo restore (50GB) took 15 minutes in parallel with database restore
7) Control file restore took 1 minute
8) Database recovery and startup took 5 minutes
Fall-back was 100% successful