A real-life SCR failover

Quite unexpectedly this week, I got to genuinely use SCR “in anger” when I killed a client’s production Exchange 2007 server by attempting to install SP2 on it (for that whole sorry story see http://social.technet.microsoft.com/Forums/en/exchangesoftwareupdate/thread/713d2b17-f19d-4eaf-8146-c51f59942d08?prof=required). I’ll keep my swearing about SP2 off the page here and focus on the hero of the week – which was SCR!

I’ve had some problems setting up SCR on earlier rollup packs (RU5 and earlier). On one server I could only do manual reseeds, and I had some problems with IPv6, Outlook Anywhere (OA) and SCR. But that was then. This week, using SP1 RU9 and SP2, SCR has manifestly done what it’s supposed to.

The setup was as follows:

  • Two identically spec’d servers with Mailbox, Hub and CAS roles
  • Eight storage groups of between 500MB and 25GB in size.

Configuring SCR

I configured SCR following the TechNet docs. In brief, I:

  1. Created Data and Log folders on the target server that matched the source server.
  2. Used the Enable-StorageGroupCopy cmdlet to get things started:
    • Enable-StorageGroupCopy -identity StorageGroup -ReplayLagTime 0 -StandbyMachine TargetServer
  3. Ran the Update-StorageGroupCopy cmdlet on the target server to seed the replication:
    • Update-StorageGroupCopy -Identity SourceServer\StorageGroup -StandbyMachine TargetServer
  4. Created standby storage groups and mail databases on the target server, following the advice in the TechNet articles. These have different Data and Log folders from the copy locations, but they are waiting and ready to have their paths changed at the moment of urgency. It really does make the failover procedure much quicker!
  5. Monitored the status of SCR with the Get-StorageGroupCopyStatus cmdlet:
    • Get-StorageGroupCopyStatus -StandbyMachine TargetServer
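
With eight storage groups, step 2 is a natural candidate for a loop. The following is only a sketch: SourceServer and TargetServer are placeholder names, and it assumes the matching Data and Log folders from step 1 already exist on the target. Seeding (step 3) is left out because it has to be run on the target server.

```powershell
# Placeholder server names; substitute your own.
$source = "SourceServer"
$target = "TargetServer"

# Enable SCR on each storage group hosted on the source server.
# Assumes the matching Data and Log folders already exist on the target.
Get-StorageGroup -Server $source | ForEach-Object {
    Enable-StorageGroupCopy -Identity $_.Identity -StandbyMachine $target -ReplayLagTime 0
}
```

If some storage groups should not be replicated, filter the output of Get-StorageGroup before piping it into the loop.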

Failing Over

I failed over the databases using the process I outlined in this post. This is where SCR really came into its own. The failover process took about 10 minutes per database (and you can do several in parallel). The longest part was actually the final step, which reassigns users to their new MDB.
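
For readers without the earlier post to hand, here is a rough sketch of the database portability activation described in the TechNet docs. It is not a transcript of what was run that week. Every server, storage group, database and path name below is a placeholder, and E00 is an assumed log file prefix (check the prefix of the storage group you are activating).

```powershell
# Everything below is illustrative: names and paths are placeholders,
# and E00 is an assumed log prefix.
$sourceSG  = "SourceServer\SG1"                    # failed storage group
$sourceDB  = "SourceServer\SG1\MDB1"               # its mailbox database
$standbySG = "TargetServer\StandbySG1"             # pre-created standby storage group
$standbyDB = "TargetServer\StandbySG1\StandbyMDB1" # pre-created standby database

# 1. Prepare the SCR copy for mounting (-Force because the source is down).
Restore-StorageGroupCopy -Identity $sourceSG -StandbyMachine TargetServer -Force

# 2. Roll the outstanding logs into the copy; run from the copy's log folder.
Set-Location "D:\Logs\SG1"
eseutil /r E00

# 3. Point the standby storage group and database at the copied files.
Move-StorageGroupPath -Identity $standbySG -SystemFolderPath "D:\Logs\SG1" -LogFolderPath "D:\Logs\SG1" -ConfigurationOnly
Move-DatabasePath -Identity $standbyDB -EdbFilePath "D:\Data\SG1\MDB1.edb" -ConfigurationOnly

# 4. Allow the database files to be used in place, then mount.
Set-MailboxDatabase -Identity $standbyDB -AllowFileRestore:$true
Mount-Database -Identity $standbyDB

# 5. Re-home the users in AD (configuration only; no mailbox data moves).
Get-Mailbox -Database $sourceDB |
    Where-Object { $_.ObjectClass -notmatch '(SystemAttendantMailbox|ExOleDbSystemMailbox)' } |
    Move-Mailbox -ConfigurationOnly -TargetDatabase $standbyDB
```

Step 5 is the one that takes the time, since it touches every mailbox-enabled user in the database.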

The best thing of all was we had NO DATA LOSS! I admit to some confusion over the whole “inbuilt 50 log limit” thing, but now I see that this is only a roll-in (replay) limit. The logs are copied to the target as soon as they are closed, and the eseutil command, which you run as part of the failover process, rolls them in. The only way you can lose data with SCR is if the source server crashes before, or while, the very latest logs are being copied. Data loss, if any, will therefore be very small.
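
The difference between “copied” and “rolled in” shows up in two columns of the copy status. A small sketch, again with placeholder server names:

```powershell
# Placeholder server names; substitute your own.
# CopyQueueLength   = logs not yet copied to the target (what you stand to lose).
# ReplayQueueLength = logs copied but not yet rolled in (safe on the target's disk).
Get-StorageGroupCopyStatus -Server SourceServer -StandbyMachine TargetServer |
    Format-Table Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, LastInspectedLogTime -AutoSize
```

A ReplayQueueLength hovering around 50 is therefore expected with SCR; it is a CopyQueueLength that keeps growing that should worry you.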

Syncing Back

We plan to fail back but we haven’t done it yet. Everything is running on the DRP server and we’re going to let the dust settle a bit before we move back to the (now rebuilt) original server. In my earlier SCR post I outlined a manual database copy back to the source server, which involved downtime. But now that it’s really happening, I’m trying something different.

Basically I have set the original server as my new SCR target. To do this I did not recreate the Storage Groups and Mail Databases on the original server – I just made sure the same Data and Logs folders were available.

When the time comes to do the full failover I will essentially execute the failover procedure in the opposite direction. I will post again with the exact steps when it’s done.

Other things to think of

If you want your DRP server to also take over Hub, CAS and Public Folder roles, then there is more than just SCR to think about.

CAS Role

It is good planning to assign a CNAME to your OWA and ActiveSync URL. Just make sure that all your possible CAS servers include this CNAME in their certificate: http://technet.microsoft.com/en-us/library/aa995942.aspx

Also be aware of something I had forgotten: Outlook can only redirect a user to their new server if the old server is responding. This is a total sh*t if your old server is dead and gone. I read somewhere that it may work to assign the old server name to the new server as a CNAME, but you may not be able to do that if you are still trying to resurrect the old one. We got by with OWA and the hard-pressed Helpdesk having to talk a lot of people through changing their Outlook profile. If you really want to be prepared, then write a script now that can change the server in the Outlook profile (googling shows various options, none of which I’ve tried as yet, though one of my colleagues says MAINTWIZ can help).

Hub Role

Make sure all Send and Receive connectors are replicated somewhere. Use costs on Send connectors to favour your usual production route.

Also, if you have scripts or applications sending email via the Exchange server, make sure they use a CNAME which you can rapidly change in DNS.

Public Folders

Make sure all your Public Folders, including the Free/Busy and OAB system folders, have more than one replica server.

I had some weird experiences with trying to add the DRP server as an extra replica to top-level folders. Then I found this post and after that I gave up. It did mean that, after the failover, I had to manually add the DRP server as a replica to the top-level folders.

I also had other bizarre public folder errors, and fixing them involved:

  • Manually changing the Default public folder database on the Mail Databases on the DRP server (see the Client Settings tab on the properties of the Mail Database in Exchange Management Console),
  • Manually changing the siteFolderServer property on the Administrative Group objects in AD,
  • Manually changing the siteFolderServer and offLineABServer on the Default Offline Address Book object in AD.

In summary…

The SCR part of the failover was the easiest part of the whole week – we had more trouble with incorrect public folder settings, missing Send connectors, and a fussy backup client that didn’t want to install on the DRP server.

The biggest problem with SCR is that there is no straightforward “fail back” procedure. As I’ve said before, SCR is not a cluster, but rather a one-way replication to a standby server. However, I think it is proving itself to be a great technology, and it’s no wonder that Exchange 2010 is building on the SCR model with Database Availability Groups. I’m looking forward to them! (Despite the dodgy acronym, which you have to be Australian to appreciate. You dag.)