Friday, February 21, 2014

Datacenter Application Relocation Alternative with EMC Avamar

This is an astonishing migration solution I tested in Q4 2013. The aim was to move a critical application from Singapore to Europe, with a serious challenge : shift a volume of approximately 800 GB in less than 8 hours. The latency between the 2 datacenters was ~250 ms and, to make it even more challenging, the source datacenter had a 6 Mbps WAN link. Considering these figures, here is the puzzle I had to face : 
  • Latency between datacenters : 250 ms. The maximum throughput we could reach with this latency for a database dump transfer was close to 550 KB/s, with all the risk of the copy failing due to connectivity issues over such a long transfer
  • Volume of data to synchronize : 350 GB of files + 370 GB database
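For the database alone, a quick back-of-the-envelope check (my own sketch, using the figures above) gives :

@echo off
REM Rough transfer-time estimate : 370 GB moved at ~550 KB/s sustained
set /a volume_kb=370*1000*1000
set /a seconds=volume_kb/550
set /a hours=(seconds+1800)/3600
echo Estimated transfer time for the database dump : ~%hours% hours
REM => roughly 187 hours, i.e. close to 8 days, for the database alone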
As a result, with these statistics, the data transfer for the MSSQL database alone would require 187 hours !!!!! This is way beyond the 8 hours that were initially planned. Here are the different thoughts I had and the one I retained :

1. The obvious but worst solution : fly with the data 
This is usually not a good solution, but considering the delay above versus the time to fly back from Singapore, I could go from 187 hours to only a 15-hour flight. The only issue was that the outage window for the migration became much longer, since the overall scenario would be : 
  • Stop the application, dump the database & copy the application data to a NAS : 2 hours
  • Take a taxi to the airport and board plane : 2 hours
  • Fly back : 15 hours
  • Take a taxi to the other datacenter : 2 hours
  • Install the NAS, copy the data, restore the environments & perform testing : 5 hours
Migration Time for Solution #1 : 26 hours


2. Remote synchronization 

This is a pretty straightforward solution, which is appropriate when the source data does not change too much. The idea was quite simple : set up an incremental replication of data between the 2 datacenters 
  • All stateless layers would be reinstalled (Citrix with heavy client, Application Server)
  • Application files (data hosted on filesystems) would be replicated with a Robocopy working in mirror mode
  • MSSQL Database would be updated with a daily incremental replication process

The first replication would be challenging but I was convinced I could bring a full set of data by plane and then start incrementals from that point. Up to this point, the concept was good. 
Should you need to do this, make sure you check one thing, which I only decided to check a few days later (I probably should have started with this) : what volume of data changes at the source? This is what made my whole concept collapse : at the database layer alone, I had 113 GB of TLOGs generated on a daily basis... This means the amount of changes to commit and replicate remotely would never fit through the tiny pipe in the small replication windows I had...
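If you want to measure this before committing to a replication design, the transaction-log backup history already gives a decent picture of the daily churn. Here is a minimal sketch (SQLSRV01 and AppDB are placeholders, and it assumes regular TLOG backups are already running, as Backup Exec was doing in our case) :

@echo off
REM Daily transaction-log volume over the last 7 days, read from the msdb backup history
REM SQLSRV01 and AppDB are placeholders for the real server and database names
sqlcmd -S SQLSRV01 -E -Q "SELECT CONVERT(date, backup_start_date) AS day, CAST(SUM(backup_size)/1073741824.0 AS decimal(10,1)) AS log_gb FROM msdb.dbo.backupset WHERE type = 'L' AND database_name = 'AppDB' AND backup_start_date >= DATEADD(day, -7, GETDATE()) GROUP BY CONVERT(date, backup_start_date) ORDER BY day"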

3. Testing a new concept 

Starting with the easiest part : the application server. I kept the replication process I already had, based on a Robocopy batch, as it skimmed through the 350 GB of files and replicated them incrementally in less than 45 minutes. I could easily run this while restoring the database, as the WAN link would otherwise be unused. Here is the script I made. Should you have many folders to replicate, you can instantiate 1 script per folder for parallel processing (see the launcher sketch after the script) : 
@echo off
REM ###### Source data (UNC path on the source file server)
set source=\\Server1\d$\data
REM ###### Destination data (local path on the target server)
set destination=D:\data
REM ###### Folder to replicate
set folder=invoices
REM ##################################################################
REM ###### Build a log file name from today's date (depends on the locale date format)
for /f "tokens=1-3 delims=/" %%A in ('echo %DATE%') do set date2=%%A%%B%%C
set LOGNAME="D:\Scripts\logs\RBC_sync_%folder%_%date2%.log"
REM ##################################################################
echo _-_-_-_-_-                                                          >>%LOGNAME%
echo *-* Copy of folder %folder% *-*                                     >>%LOGNAME%
REM /MIR mirrors the source tree (implies /E and /PURGE), /NP disables the
REM per-file progress counter, /W:3 /R:3 keep waits and retries short
robocopy.exe %source%\%folder% %destination%\%folder% /NP /MIR /W:3 /R:3 /log+:%LOGNAME%
echo Date :                                                              >>%LOGNAME%
date /t                                                                  >>%LOGNAME%
echo Time :                                                              >>%LOGNAME%
time /t                                                                  >>%LOGNAME%
echo _-_-_-_-_-                                                          >>%LOGNAME%
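To run several folders in parallel, a small launcher can start one instance per folder. This is just a sketch : it assumes the script above has been saved as D:\Scripts\sync_folder.bat and modified to take the folder name as a parameter (set folder=%1), and the folder names below are examples :

@echo off
REM Start one synchronization window per folder, all running in parallel
for %%F in (invoices orders archive) do (
    start "RBC_%%F" cmd /c D:\Scripts\sync_folder.bat %%F
)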
Robocopy is a good tool, as it ends its log with a summary of the synchronization it performed : the volumes replicated and the speed at which it ran.

Now let's look at the complicated part : our 370 GB database generating 113 GB of TLOGs daily. We looked at export / lift & shift / import, and that was a failure. We then looked at replication, and here again it could not work because of the large amount of data to shift every time the replication was triggered. We had to look into something totally different, where volumes could be controlled much more easily, and this made us look at the tools used for backups / restores and Disaster Recovery Plans. 

The first one was EMC RecoverPoint, but this was an expensive solution, and maybe also overkill to move just one application to the other side of the planet. The other one was much simpler, as I could choose between a physical appliance and a virtual edition that was straightforward and did not require racking / cabling / complex configuration. I therefore decided to move to EMC Avamar 7 Virtual Edition. 

The reasons to go to Avamar were multiple : 
  • Avamar can work as a "grid", meaning one appliance can back up the data on one site and then, during another time window, replicate it to a remote appliance with throttling, i.e. the ability to control the impact on the WAN link. 
  • Unlike EMC Data Domain or HP StoreOnce, Avamar performs source-side deduplication. This is mandatory for the following steps. Furthermore, Avamar couples this with compression, dramatically reducing the data volumes to transfer
The architecture design was the following :
We had initially planned to have 2 AVEs (Avamar Virtual Edition) on the source site, in order to have fault tolerance, and a complete Avamar grid on the destination site. In the final implementation, we decided to go with just one AVE on each side.
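Roughly, based on the description above, the final layout looked like this (a simplified sketch, not the original diagram) :

Singapore (source)                                  Switzerland (destination)
+---------------------------+  ~250 ms RTT, 6 Mbps  +---------------------------+
| MSSQL + file servers      |  throttled Avamar     | AVE (destination) :       |
| (Avamar clients)          | --------------------> | replication target, later |
| AVE (source)              |  replication          | used for direct cross-WAN |
+---------------------------+                       | backups of the source     |
                                                    +---------------------------+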

The migration process was the following : 
  • Perform a full online backup of the MSSQL database (while avoiding any overlap with the already existing backup solution, Symantec Backup Exec ; this was done by defining backup windows that did not overlap)
  • Set up the remote Avamar as a replication partner, with throttling. The aim of this operation is to prime the deduplication data on the destination side (replicate the backup from the AVE in Singapore to the AVE in Switzerland). For my huge volume, I wanted to secure the transfer : a new Avamar replication was triggered every 4 hours (halting the previous one regardless of its status) and, with a throttle of 200 KB/s, this first replication took almost 6 days (see the quick estimate after this list).
  • Once this "cache priming" of the destination AVE was complete, I aimed for an inactivity period on the source side and performed a full online source backup directly from the destination AVE. This first backup was performed one week after the synchronization and I was hoping this would define the time to perform a cross WAN deduplication and backup. This first backup was quite scary due to its duration (almost 13 hours) but was performed without needing a shutdown of the application. At this stage, I was starting to really think this idea was an epic fail... but...
  • This is where things got awesome... after the first backup, I ran another one, and the backup time dropped from 13 hours to 68 minutes ! When playing with backup schedules and leaving more or less time between 2 backups, I realized the backup time simply depended on the volume of data that had changed since the last backup. 
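As a side note, the throttle value gives an upper bound on how much data that 6-day priming replication actually pushed over the WAN. This is my own estimate, not a figure taken from the Avamar reports :

@echo off
REM Upper bound on the replicated volume : 200 KB/s sustained for ~6 days
set /a total_kb=200*86400*6
set /a total_gb=total_kb/1000000
echo Replicated volume over 6 days : at most ~%total_gb% GB
REM => about 100 GB on the wire for a 370 GB database, which is where
REM    source deduplication and compression really pay off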


I finally reached a decent cruising speed with a single weekly backup on weekends. This backup took about 10 hours but avoided disturbing the application or degrading its performance during the week. As the real migration got close, I simply increased the backup pace by switching to a daily backup. Finally, 2 hours before the real cutover, I performed one more full backup so I would save on the cutover time.


As a result, the final migration planning ended as follows :

  • Application Shutdown + Wait for batch end : 30 minutes
  • Full backup of the source MSSQL DB from the destination AVE : 74 minutes
  • Database restore and post restore controls : 120 minutes
  • Flat files sync via Robocopy (in parallel of DB restore) : 90 minutes
  • Reopen application & perform functional tests : 4 hours
Total time for technical operations : 314 minutes
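For completeness, the 314-minute figure is the sum of the four technical steps above (the Robocopy synchronization is counted even though it ran in parallel with the database restore, and the functional tests are not included) :

@echo off
REM Sum of the technical steps listed above, in minutes
set /a total=30+74+120+90
echo Total time for technical operations : %total% minutes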