Thursday, January 7, 2016

Datacenter Build considerations

Building a Datacenter is always a challenge. Building a Datacenter remotely, by managing integrators in a faraway country, is an insane challenge :) The aim of this post is to go through the different issues I have encountered and help you prepare your Datacenter design before getting stuck during implementation.

- Distribution : should you pick a Datacenter with narrow racks, cabling may very quickly become messy. Most people push for a top of rack / end of row layout, but on some installations this can get very messy and I usually go for middle of rack distribution. Indeed, this shortens all the cables but also means only half of the cables sit above or below your distribution switch

- In addition to the middle of rack distribution, an important point to assess is the server cabling of your Datacenter. Distribution switches are, as their name indicates, for distribution. As such, they connect all the servers and equipment that have their ports on the cold aisle side. In order to avoid massive cable trays and complex cable management, I really suggest you study the possibility of having an inverted switch (ports, instead of being oriented towards the cold aisle, would be oriented towards the hot aisle). When turning your switch around to do this, you have to be sure its airflow is also inverted, so that the switch intake is on the opposite side of the ports. An example of this is Cisco Fabric Extenders (http://www.cisco.com/c/en/us/products/collateral/switches/nexus-2000-series-fabric-extenders/product_bulletin_c25-680197.html), which have this inverted airflow option.

- Cabling plan and ordering : this is usually a critical task on a Datacenter build. Make sure a single company is in charge of the cabling, in order to have something consistent but also to have a single vendor to go back to if you see any issues. Cabling should be preinstalled before the equipment is racked. This is usually done via a cabling matrix. Make sure you ask well ahead if there is a cabling matrix standard to use or if you can provide your own. Should you provide your own cabling matrix, you will have to know who is in charge of calculating cable lengths (should you be in charge, you will need all the details on the rack heights, the distance from the rack top to the cable trays...). The cabling matrix should at least comprise (a sample entry is shown after the list) :

  • Source equipment name, rack, rack position, port / interface
  • Target equipment name, rack, rack position, port / interface
  • Cable type (fiber, copper, stack...) and colour
  • Cable function (Management OOB network, LAN, SAN...etc...)
  • Cable label
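
A sample entry, with purely hypothetical equipment names, positions and labels, could look like this :
  Source : SRV-DB-01 / rack R07 / U12 / port eth0
  Target : SW-DIST-07 / rack R07 / U24 / port Gi1/0/14
  Cable : copper Cat6, blue
  Function : LAN
  Label : R07U12-E0_R07U24-G14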


- Some switches, such as the Nexus 5k, seem to have a physical depth issue. Indeed, the ears (the little "L" shaped metal plates you fix on the side of the switches to rack them) are positioned too far back by default. When checking with Cisco, they do not provide different models of these ears. The only solution is usually to remove 2 screws to shift the ear forward and give enough space for fibers to reach your switches in the racks. Please note this is only possible if you have rails to hold your switch at the back, as without rails you will face serious issues. On the picture below, left side, you see the switch with the ears in their initial position, and on the right once the 2 screws are removed and the ears repositioned.


- When importing equipment from one continent to another (e.g. from the USA to Brazil), make sure you order the correct power cables (cords shipped in the USA usually pair the IEC 320 C13 connector with a NEMA 5-15P wall plug, which will not match the outlets in your destination country)

Sunday, October 25, 2015

Pros and cons of External Cloud Services

Nowadays, the cloud is everywhere. Every software vendor and every provider proposes cloud offerings via their Datacenters, platforms or software. Now what are the real cloud drivers? We often speak about cloud acceleration but rarely about the issues that arise when you start playing in the cloud...

The good points with the cloud :
  • It is fast and flexible (from a service ramp up / down perspective)
  • It usually implements the very latest technology and software platforms
  • It is usually very well adapted to geographically dispersed solutions, allowing resilience and regional / local implementation of global solutions
  • You don't have to handle capacity reporting and planning of the environment... just pay as you go
  • For some very specific requirements, you can have your environment online only a few hours per day (when you really need it). The rest of the time, you turn it off and limit the cost, which you cannot do as cost-efficiently on assets you own
  • You run on a standardized operating environment
  • There is a huge variety of services (IaaS, PaaS, SaaS) to subscribe to out there. Cloud solutions are usually financially efficient when you buy PaaS or SaaS. IaaS usually leaves you managing an extra layer between your in-house architecture and the cloud infrastructure, which adds complexity and overhead for your teams

Now the drawbacks of these kinds of services : 
  • As soon as you have a certain volume and sustained growth, public cloud is usually more expensive than in-house
  • If you don't run standard "off the shelf" applications, the cloud can be a real issue for you. Even though upgrades are planned and predictable, there is no room for blocking version upgrades. Should your application not support such an intense patching roadmap, you will be really bothered
  • Service transfer out of cloud services can be a real pain point. A very simple example I went through was when using SAP in an "as a service" mode. When attempting to exit that service, and without any clear contractual statement, I was told that the SAP image itself was the intellectual property of the vendor. As a result, a complete replatforming of the SAP environment had to be done, followed by a data synchronization.
  • Data reversibility is an extension of the issue above. Be sure, when contracting a cloud service, that you have the right conditions for a transition out. Vendor "lock in" is one of the most common issues in the cloud. 
  • Most cloud services come with limited to no SLA. Cloud providers prefer to commit on means rather than results. Be aware of this before putting any critical service in the cloud
  • Cloud is usually not recommended for critical data. Indeed, it is always complicated to impose data location on cloud providers. Moreover, cloud providers are usually seen as a big regulatory risk since the staff managing the cloud platforms is geographically dispersed and you usually have no control over this
Cloud services continue to progress and seduce more and more companies. They remain perfect for either temporary workload hosting (IaaS) or complete end to end solution hosting (PaaS or SaaS), where limited integration is required with your core company services. 
Keep in mind that cloud services are extremely standardized but always come with limited SLA commitment and a very limited framework. As such, either you will need to be a major player to get an appropriate contract or you will have to accept the risks on the services you contract.

To conclude: should you have sustained growth, an internal private cloud with capacity on demand will be the perfect solution. For other cases, or for off-the-shelf platforms, hybrid or public cloud can greatly enhance your capabilities, but keep in mind that getting strong commitments on SLAs and legal / auditability aspects will remain a challenge for you. 

Friday, October 9, 2015

RFP General Principles - RFP for hardware / software with an integration project

Many people would like a complete RFP document sample, which would let them almost copy-paste the document, adapt it a little, change the fonts and header and reissue it. Before providing such examples, it is key to understand the best way to break down an RFP in order to make it easy for you and the vendor to analyze it and understand your expectations.

The document structure I suggest is the following :

1. Overview 

- 1.1 Company Presentation : this gives an overview of your company, with global figures (employees, business units, sales or revenue figures, key growth drivers...) 
- 1.2 RFP Context : this is often omitted but more important than you think. It provides the general context of the RFP and lets the suppliers understand whether you are just replacing a deprecated platform or planning something scalable to absorb significant business growth or increase the flexibility of your services


2. Project Expectations
In this section, add one chapter per expectation. Based on whether you expect the RFP to be service / SLA driven or a very technical design based on a BOM (bill of materials), you can put several sections. These sections can be related to availability, resilience, flexibility, capability to scale up or scale out the solution... but also to technical requirements such as imposing your standards regarding the management network (do you impose a non routed OOB management network?), security access (should everything be authenticated against your corporate Active Directory?), specific connectivity & protocols (e.g. on a Nexus core stack, you would have LACP network connections, meaning specific teaming settings on network cards), specific plugins with other platforms, and reporting and capacity management features (capacity and performance trending...).

3. Instructions to bidders

- 3.1 : Key contacts (name, function / role, email, phone) : indicate the SPOC (single point of contact) for the project. Usually, on big projects, you have a project manager, the project sponsor or team leader and someone from procurement
- 3.2 General information for the bidders : in all RFPs, you should give global guidelines to the bidders to avoid any surprises when attempting to close the deal at the end of the RFP cycle. Such general information would be :
  • Proposal acceptance requirements : indicate the language in which you expect the bidder's answer, the currency in which prices must be given, and whether you impose a fixed exchange rate against standard currencies such as the dollar (because your budget is stable since it is based on a hedged currency)
  • Proposal validity period
  • Non disclosure information
  • Non compensation conditions : indicate that no compensation of any kind will be provided for the manpower and costs associated with participating in this RFP

- 3.3 : Timelines : it is compulsory to provide timelines with your RFP, so that bidders know precisely what is expected and when. This is usually summed up in a small table with the key dates (date, milestone, action owner, communication medium)
- 3.4 : Questions for clarifications : be clear about your rules on questions :
  • Best practice would be to have a milestone in section 3.3 Timelines indicating the hard stop for questions & answers
  • In this section, you should indicate to all bidders if you allow yourself to disclose their questions anonymously to other bidders, should the questions be interesting to share

- 3.5 Evaluation criteria : be very clear on how you will assess their response :
  • Is your project TCO (total cost of ownership) driven? 
  • Are you going to check compliance of the bidder's answer with your RFP requirements?
  • Are you looking for a solution with proven stability & flexibility?
  • Will you expect reference calls and reference presentations to prove the bidders' professionalism?
  • Will you expect a very clear vision of the project, resources to engage, efforts expected on your side, planning...?

- 3.6 Proposal Structure : this is where you impose a certain level of structure on the bidders' responses. It is key to get this part neat. Indeed, if all proposals are aligned on a "template" format, parsing the answers and comparing the bidders will be very simple. The usual information I expect in a response to an RFP is the following :
  • Executive summary
  • Bidder's presentation (I usually supply an excel sheet to impose the information I am requesting : turnover, operating income, best in class resellers and integrators, similar project references...)
  • Technical solution : this is where you have to turn verbose mode on, indicating all RFP technical deliverables (high level design, low level design, power consumption, maximum capacity and performance of the solution...)
  • Project Organization : explain all mandatory phases you expect the bidder / integrator to perform during the integration project (design workshops, activities, onboarding of your internal resources, reporting details and frequency)
  • Financial breakdown of the proposal : here again, provide an excel sheet the bidder should fill in and list all the fields you expect (unit public price, unit proposed price, quantity, total price, taxes, non recoverable taxes, project costs, training and handover costs...)
  • Terms and conditions : this is usually an internal document provided by your legal or procurement team 

Saturday, August 1, 2015

Web Application Performance - Tuning from Client to Server, going through Citrix optimization


This first post will give you all the inputs to understand how the HTTP protocol works and how performance can be terrible due to poor application coding, configuration issues on the client or server side, and / or latency on network links. We will go through different steps in order to understand the complete stack and the options available to developers, architects and operational teams in charge of running the service. 

Bear in mind that we are going to look here at optimizing the performance of applications that are laggy because the HTTP protocol is chatty and degrades performance on low speed or high latency links. If your application is performing like crap for other reasons (e.g. poor database queries lasting hours), this topic won't help make it better :-)

0. Understanding HTTP : 

Without going into the details, HTTP is a standard protocol that was defined in order to standardize the web. It is pretty straightforward and independent of the web server and client you are using. It always starts very simply : the client queries the server to download an HTML file. In the URL, you will see all sorts of extensions for the file which is loaded (aspx, php, html, jar...), but these are simply scripts or compiled code that produce HTML content.

Once the client downloads the HTML file, that file contains a list of tags which define the web page structure but also all the media to load. As your browser parses the HTML file, it triggers more requests to the web server, this time to download media (javascript files, css style sheets, images, video resources, sounds...)
So to summarize, this is pretty simple... You always start with one file which defines the page structure and references all the associated media. When your browser receives the HTML file, it parses it and starts loading all the resources defined in it...
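
To make this concrete, the exchange for a page and one of its images boils down to something like the following (headers heavily simplified, host name hypothetical) :

GET /test.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 724

GET /a0.png HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: image/png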


1. A simple test case : 

The easiest way to illustrate and understand the basic behaviour of a web client communicating with its server is to build a very basic web page and load it. In order to emphasize the issues, I have created a 1.5 MB PNG image and copied it 20 times under different names (a0 to a9 and b0 to b9) so that the page loads slowly enough to illustrate the breakdown of the activity. 

Let's start with the code, which is very basic. As you can see below, this HTML file has a list of 20 PNG images
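
The original listing was a screenshot; a minimal sketch of what such a test page could look like (exact markup assumed, file names as described above) :

<html>
  <head><title>test</title></head>
  <body>
    <img src="a0.png"/>
    <img src="a1.png"/>
    <!-- ... a2.png through b8.png, same pattern ... -->
    <img src="b9.png"/>
  </body>
</html>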


And an overview of the page shown in the browser

I used a freeware tool to take very basic measurements from Firefox & Internet Explorer: HttpWatch (Please be aware you will need another tool for similar measurements from Chrome). This tool simply displays graphically the loading of a web page from your browser. When loading the page above, you get the image below : 


Now, what do we see here ?

  • The very first item loaded from the site is test.html (as shown in the first line above). The loading of this page is very quick since the HTML file is very small (724 bytes). Note that nothing else happens while the test.html file is loaded. This is simply due to the fact that the browser needs to load the HTML file and parse it to know what other files it has to load from the web server (the HTML file defines the page structure in addition to all the resources to load, such as javascript files, css style sheets, images...)
  • The following part is the interesting one : you would expect that, as soon as the html file is parsed by your browser, all images would be loaded together. This is not the case. Worse, this is never the case. Indeed, there is a set of restrictions on servers and browsers that limits the maximum number of parallel transfers between a client and a server. I used Firefox version 3 for the test above, which has a limit of 6 connections per server (controlled by the network.http.max-persistent-connections-per-server setting as explained here)
Now, why do these limitations exist? This is more of a "gentleman's agreement" on the Internet. Indeed, if users completely removed this limit, they would very quickly hit the maximum number of connections on the web servers and penalize the service for other users. Be aware that increasing this value above 10 connections per server risks getting your IP blacklisted... so make sure you play with this in private. Finally, RFC 2616, defining the HTTP 1.1 standard, initially stated that only 2 persistent connections per server should be allowed... but this was in the old days of the internet, when pages were lightweight and didn't host so much media...

If we go a little further in the analysis, we can note that the screenshot above shows an HTTP 200 return code for each GET, which corresponds in HTTP to an "OK" (meaning you successfully executed the GET request and downloaded the file from the server). Note that if we reload the page, we get another HTTP code (a 304 "Not Modified", meaning the cached copy is still valid) : 

This time, you can notice the loading time is way faster. This is simply due to the fact your browser has a local cache of the files. Still, you will notice the browser queries the server to check whether each locally cached file still corresponds to the file stored on the web server... The impact is not noticeable here, because the web server is local, but imagine a web server at 200 ms latency from you (e.g. a server in Asia contacted from a client in Europe) : since you can only run 6 queries in parallel, you need 4 groups of queries (6 + 6 + 6 + 2) to revalidate all the files against the cache, which adds 800 milliseconds to the loading time of your page... If this is not clear at this stage, don't worry, we will look at this in detail later in this post.

2. Fixing the performance issue through client configuration : 

There are several places you can work on in order to solve the issue. The ones presented in this section are only configuration changes. Please note this will enhance performance, but it can be combined with the solutions proposed later in this document to really boost end user experience... 

The first obvious setting to change is on the browser side, where you can increase the maximum number of connections : 
  • In Firefox, type about:config in the URL bar and change the value of the network.http.max-persistent-connections-per-server setting

  • In Internet Explorer, the value is changed in the Operating System registry. Click Start => Run => Regedit, look for the key HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings and change the values MaxConnectionsPer1_0Server and MaxConnectionsPerServer (a sample command is shown after this list).


  • On Chrome, this setting seems to be hard coded and tied to the user profile... meaning fine tuning for a single end user (hence a single profile) will be tricky...
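
For Internet Explorer, the registry change above can also be scripted from a command prompt; a sketch, assuming a limit of 10 (see the blacklisting warning earlier before going higher) :

rem Raise the per-server connection limits for the current user (example value: 10)
reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v MaxConnectionsPerServer /t REG_DWORD /d 10 /f
reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v MaxConnectionsPer1_0Server /t REG_DWORD /d 10 /f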

You can find the details of the default maximum connections per server on the Browserscope web site.

3. Fixing the performance issue through server / application configuration : 

If client configuration is not sufficient, you can start working on the server or application side. The server will allow the same tuning as the client, working on the maximum TCP connections it will accept but also the maximum number of connections per client.
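
If your web tier runs on Apache (the example used later in this post), this kind of tuning typically happens through a handful of directives; a sketch with Apache 2.4 directive names, values purely illustrative :

# Total worker / connection capacity of the server
MaxRequestWorkers 256
# Let clients reuse TCP connections for several requests
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5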

The next changes are directly in the application stack, where several changes can be made in order to improve the end user experience. Here are a few easy ways of improving your solution (this part will be detailed later).

Adding expiration headers to your media:
This solution is pretty straightforward : you simply replace every call to a resource by a small piece of code that informs the client of the expiration date of your resource. This can be dealt with in several ways. The first solution is to build a small script page that adds header information about resource expiry before transferring the resource to the client.

a) Hard coding expiration 

A simple example : instead of referring to your images directly in your HTML files, you call a home-made script that serves the image and informs the client of a default expiration date :
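
The original snippet was a screenshot; a minimal sketch of such a script in PHP (script and parameter names are hypothetical, and the 30-day lifetime is just an example) :

<?php
// image.php?f=a0.png : serve an image with an explicit expiration date
$file = basename($_GET['f']);                       // keep only the file name, no path tricks
header('Content-Type: image/png');
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 30 * 24 * 3600) . ' GMT'); // 30 days
header('Cache-Control: public, max-age=2592000');   // same lifetime for HTTP/1.1 caches
readfile(__DIR__ . '/' . $file);
?>

In the HTML page, <img src="a0.png"/> would then become <img src="image.php?f=a0.png"/>.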


b) Global configuration of expiration via .htaccess

Rather than handling every image individually, an easy solution is to set expiration directly in your .htaccess file in order to have general control over file expiration :
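
A sketch of what this could look like with mod_expires directives in a .htaccess file (lifetimes are only examples) :

ExpiresActive On
ExpiresByType image/png "access plus 1 month"
ExpiresDefault "access plus 2 days"
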
c) Global configuration of expiration via the web server
Several web servers will let you configure media expiration directly in their configuration files. The most common web server being Apache, let's look at how to do the change with Apache. A little module named mod_expires will give you control over the expiration headers of the media in your web pages. Here again, the configuration will allow you to set expiration based on media types :
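
A sketch of the equivalent server-level configuration, to be placed in the Apache configuration (httpd.conf or a virtual host); media types and lifetimes are only examples :

<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png "access plus 1 month"
    ExpiresByType text/css "access plus 1 week"
    ExpiresByType application/javascript "access plus 1 week"
    ExpiresDefault "access plus 2 days"
</IfModule>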

Now, this is usually not sufficient... so the last section of this article will show information that will make a huge difference...

4. Leveraging other technologies to change user experience : 

Before looking into WAN acceleration devices, let's just compare 2 solutions :

  • A first solution which remains our initial use case : I connected to a central Sharepoint web site located in Europe, from my work computer, when I was travelling in Brazil. You will see below the large blue sections which represent download of media through a WAN link

  • A second attempt, from the same site (and connection) in Brazil, but where I no longer used my local browser : instead I used a browser hosted on a Citrix platform in Europe, just beside the Sharepoint farm I wanted to connect to. Here, transfer times are reduced to almost nothing and the web experience is completely changed
So what is the main difference here? The first one is the page loading time : with my local browser, loading time was 14 seconds. When going through Citrix, this dropped to 5 seconds. Indeed, from my local browser, due to the chatty HTTP protocol and the latency coupled with the persistent connections limit to the webserver, I spent most of my time on the red and blue parts, which correspond to waiting for the server response (250 ms round trip latency) and waiting for the data to come through the pipe (data transfer). 
In the second case, with Citrix, waiting for the server and transfers are minimal since all data is transferred between 2 servers in the same Datacenter. The only transfer to my local PC in Brazil is Citrix data which, as you will see later, can also be optimized / cached with specific WAN acceleration appliances. 

Sunday, February 22, 2015

MSSQL - Simplify and Automate Refresh Operations

One of the most recurring tasks for an MSSQL DBA is refreshing databases. This task can quickly get boring : you connect to one instance, dump the database, copy it across the network and then restore it on the other side, manually remapping the data and log file logical names to the new physical file structure on the target database.

The script below aims at making your life easier by scripting both steps of a refresh process, leveraging a network file share to host the dump file :

  • Section 1 will export the Database to a defined network share
  • Section 2 will load the Database dump file, analyze the file structure, and remap logical data and log file paths to restore the database in the right place

0. Pre-requisites

The only pre-requisite of this script is to have a shared folder on the network on which the dump will be performed. For security reasons, it might be good to restrict access to that folder and only grant access to service accounts running your MSSQL instances in addition to your DBA accounts.

1. Dumping the Source Database

  • Connect to the source instance with MSSQL Management Studio
  • Execute the following query on the source instance, replacing the database name and share path ('sourceDatabaseName' and '\\mySharedFolderPath\') with your own values.
DECLARE @sourceDB AS varchar(128) = 'sourceDatabaseName';
DECLARE @exportPath AS varchar(128) = '\\mySharedFolderPath\' + @sourceDB + '.bak'; 
BACKUP DATABASE @sourceDB TO DISK = @exportPath;


  • Check the MSSQL query log to make sure there are no errors


Should you have an error message, this could be linked to the fact the MSSQL instance service account is not part of the authorized groups that can connect to the CIFS file share you set up.


2. Restoring the Dump to the Target Database

Connect to the target instance with MSSQL Management Studio and open a new query. Copy / paste the code below, replace the source and target database names with the correct values, and execute the query.

DECLARE @SourceDB AS varchar(128) = 'sourceDatabaseName';
DECLARE @TargetDB AS varchar(128) = 'targetDatabaseName';
DECLARE @DumpPath AS varchar(128) = '\\mySharedFolderPath\' + @SourceDB + '.bak';
DECLARE @TargetDBData AS varchar(512);
DECLARE @TargetDBLog AS varchar(512);
SET @TargetDBData = (select physical_name from sys.master_files where database_id = (select database_id from sys.databases where name = @TargetDB) and type_desc = 'ROWS');
SET @TargetDBLog = (select physical_name from sys.master_files where database_id = (select database_id from sys.databases where name = @TargetDB) and type_desc = 'LOG');
print 'Target Data File Path : ' + @TargetDBData;
print 'Target Log File Path : ' + @TargetDBLog;
CREATE TABLE #tmp (
    LogicalName nvarchar(128) NOT NULL, PhysicalName nvarchar(260) NOT NULL, Type char(1) NOT NULL,
    FileGroupName nvarchar(120) NULL, Size numeric(20, 0) NOT NULL, MaxSize numeric(20, 0) NOT NULL,
    FileID bigint NULL, CreateLSN numeric(25, 0) NULL, DropLSN numeric(25, 0) NULL,
    UniqueID uniqueidentifier NULL, ReadOnlyLSN numeric(25, 0) NULL, ReadWriteLSN numeric(25, 0) NULL,
    BackupSizeInBytes bigint NULL, SourceBlockSize int NULL, FileGroupID int NULL,
    LogGroupGUID uniqueidentifier NULL, DifferentialBaseLSN numeric(25, 0) NULL,
    DifferentialBaseGUID uniqueidentifier NULL, IsReadOnly bit NULL, IsPresent bit NULL,
    TDEThumbprint varbinary(32) NULL
);
INSERT #tmp EXEC ('restore filelistonly from disk = ''' + @DumpPath + '''');
DECLARE @DataFileName AS varchar(128);
DECLARE @LogFileName AS varchar(128);
SET @DataFileName = (select LogicalName from #tmp where Type = 'D');
SET @LogFileName = (select LogicalName from #tmp where Type = 'L');
print 'Data File Name : ' + @DataFileName;
print 'Log File Name : ' + @LogFileName;
DROP TABLE #tmp;
-- Set DB in single user to kill other connections and revert
USE master;
EXEC('ALTER DATABASE ' + @TargetDB + ' SET SINGLE_USER WITH ROLLBACK IMMEDIATE');
EXEC('ALTER DATABASE ' + @TargetDB + ' SET MULTI_USER');
-- Start Restore process
RESTORE DATABASE @TargetDB FROM DISK = @DumpPath
    WITH REPLACE, MOVE @DataFileName TO @TargetDBData, MOVE @LogFileName TO @TargetDBLog

The script is designed to perform the following operations :

  • Analyze source database dump file to identify data & log names
  • Analyze target database structure to define the path to the MDF / LDF files that need to be replaced
  • Restrict the target database to SINGLE_USER mode in order to kill all other connections
  • Perform the refresh operation

Once the refresh has completed, you should get the following log : 

An updated version of this script should soon provide the missing step of the process : export account mappings and remap accounts after the refresh is complete...
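
In the meantime, if your application uses SQL logins, a minimal sketch of that last step (database and user names are hypothetical, and the logins must already exist on the target instance) :

USE targetDatabaseName;
-- List database users whose SID no longer matches a login on this instance
EXEC sp_change_users_login 'Report';
-- Remap an application user to the login of the same name (repeat for each user listed above)
ALTER USER [appUser] WITH LOGIN = [appUser];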

MSSQL List Tables, Row Count and Table Size

Part of the basic toolset for MSSQL databases, you will find below a simple query to list all tables of a DB with their row count and disk usage :
SELECT
    t.NAME AS TableName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 AS TotalSpaceKB,
    SUM(a.used_pages) * 8 AS UsedSpaceKB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%' AND t.is_ms_shipped = 0 AND i.OBJECT_ID > 255
GROUP BY t.Name, p.Rows
ORDER BY t.Name

Friday, February 21, 2014

Datacenter Application Relocation Alternative with EMC Avamar

This is an astonishing migration solution I tested in Q4 2014. The aim was to move a critical application from Singapore to Europe with a great challenge : shift a volume of approximately 800 GB in less than 8 hours. The latency between the 2 Datacenters was ~250 ms and, to make it even more challenging, the source Datacenter had a 6 MB WAN link. Considering these figures, here is the puzzle I had to face : 
  • Latency between Datacenters : 250 ms. The maximum throughput we could reach with this latency for a database dump transfer was close to 550 KB/s, with all the risks of the copy failing due to connectivity issues over such a long transfer duration
  • Volume of data to synchronize : 350 GB of files + a 370 GB database
As a result, with these statistics, the data transfer for the MSSQL database would have required 187 days !!!!! This is way beyond the 8 hours that were initially planned. Here are the different options I considered and the one I retained :

1. The obvious but worst solution : fly with the data 
This is usually not a good solution, but considering the delay above versus the time to fly back from Singapore, I could go from 187 days to just a 15 hour flight. The only issue was that the outage window for the migration was much longer, since the overall scenario would become : 
  • Stop the application, dump the database & copy the application data to a NAS : 2 hours
  • Take a taxi to the airport and board plane : 2 hours
  • Fly back : 15 hours
  • Take a taxi to the other datacenter : 2 hours
  • Install the NAS, copy the data, restore the environments & perform testing : 5 hours
Migration Time for Solution #1 : 26 hours


2. Remote synchronization 

This is a pretty straightforward solution which is appropriate when the remote system does not change too much. The idea was quite simple : set up an incremental replication of data between the 2 Datacenters 
  • All stateless layers would be reinstalled (Citrix with the heavy client, application servers)
  • Application files (data hosted on filesystems) would be replicated with Robocopy working in mirror mode
  • The MSSQL Database would be updated with a daily incremental replication process

The first replication would be challenging but I was convinced I could bring a full set of data by plane and then start incrementals from that point. Up to this point, the concept was good. 
Should you need to do this, make sure you check one thing which I only decided to check a few days later (I probably should have started with this) : what volume of data will change at the source? This is what made my whole concept collapse : just at the database layer, I had a volume of 113 GB of TLOGs on a daily basis... This meant the amount of changes to commit and replicate remotely would never fit through the tiny pipe within the small replication windows I had...

3. Testing a new concept 

Starting with the easiest part : the application server. I would keep the replication process I had, based on a Robocopy batch, as it skimmed through 350 GB of files and replicated them incrementally in less than 45 minutes. I could easily run this while I was restoring my database, as the WAN link would otherwise be unused. Here is the script I made. Should you have many folders to replicate, you can instantiate 1 script per folder for parallel processing : 
@echo off
REM ###### Source Data
set source=\\Server1\d$\data
REM ###### Destination Data
set destination=D:\data\
REM ###### Folder to replicate
set folder=invoices
REM ##### Variable Setup
REM ##################################################################
REM Build a date stamp (no separators) for the log file name
for /f "tokens=1-3 delims=/" %%A in ('echo %DATE%') do set date2=%%A%%B%%C
set LOGNAME="D:\Scripts\logs\RBC_sync_%folder%_%date2%.log"
REM ##################################################################
echo _-_-_-_-_-                                                           >>%LOGNAME%
echo *-* Copy of folder %folder% *-*                                      >>%LOGNAME%
robocopy.exe %source%\%folder% %destination%\%folder% /NP /MIR /E /W:3 /R:3 /log+:%LOGNAME%
pause
Echo Date :                                                               >>%LOGNAME%
Date /t                                                                   >>%LOGNAME%
Echo Time :                                                               >>%LOGNAME%
Time /t                                                                   >>%LOGNAME%
echo _-_-_-_-_-                                                           >>%LOGNAME%
Robocopy is a good tool as it gives a summary of the synchronization you performed. Here is a sample output with the volumes replicated & the speed at which it ran

Now let's look at the complicated part : our 370 GB database generating 113 GB of TLOGs daily. We looked at export / lift & shift / import, and that was a failure. We then looked at replication, and here again it could not work due to the large amount of data to shift every time the replication was triggered. We had to look into something totally different, where volumes could be controlled much more easily, and this made us look at the different tools used for backups / restores and Disaster Recovery Plans. 

The first one was EMC RecoverPoint, but this was an expensive solution, and maybe also overkill to move just one application to the other side of the planet. The other one was much simpler, as I could choose between a physical appliance and a virtual edition which was straightforward and did not require racking / cabling / complex configuration. I therefore decided to move on to EMC Avamar 7 Virtual Edition. 

The reasons to go to Avamar were multiple : 
  • Avamar can work as a "grid", meaning one appliance can back up the data on one site and then, during another time window, perform data replication with throttling, meaning the capability to control the impact on the WAN link. 
  • Unlike EMC Data Domain or HP StoreOnce, Avamar performs source-side deduplication. This is mandatory for the following steps. Furthermore, Avamar couples this with compression, dramatically reducing data volumes
The architecture design was the following :
We had initially planned to have 2 AVEs (Avamar Virtual Edition) on the source site, in order to have fault tolerance, and a complete Avamar grid on the destination site. For the final implementation, we decided to go with just one AVE on each side.

The migration process was the following : 
  • Perform a full online backup of the MSSQL Database (just avoid overlapping with the already existing backup solution, Symantec Backup Exec ; this was done by choosing backup windows that would not overlap)
  • Set up the remote Avamar as a replication partner with throttling. The aim of this operation is to prime the deduplication data on the destination side (replicate the backup from the AVE in Singapore to the AVE in Switzerland). For my huge volume, I wanted to secure the transfer. An Avamar replication was triggered every 4 hours (halting the previous replication no matter what its status was) and, with throttling at 200 KB/sec, this first replication took me almost 6 days.
  • Once this "cache priming" of the destination AVE was complete, I aimed for an inactivity period on the source side and performed a full online backup of the source directly from the destination AVE. This first backup was performed one week after the synchronization, and I was hoping it would define the time needed to perform a cross-WAN deduplicated backup. This first backup was quite scary due to its duration (almost 13 hours) but was performed without needing a shutdown of the application. At this stage, I was starting to really think this idea was an epic fail... but...
  • This is where things got awesome... after the first backup, I ran another one, and the backup time just dropped from 13 hours to 68 minutes ! When playing with backup schedules and leaving more or less time between 2 backups, I realized the backup time simply depended on the volume of data that had changed since the last backup. 


I finally reached a decent cruising speed by having a single weekly backup on weekends. This backup took about 10 hours but avoided disturbing the application or degrading performance during the week. As the real migration got close, I simply increased the backup pace by switching to a daily backup. Finally, 2 hours before the real cutover, I performed another full backup so I would save on the cutover time.


As a result, the final migration planning ended as follows :

  • Application Shutdown + Wait for batch end : 30 minutes
  • Full backup of the source MSSQL DB from the destination AVE : 74 minutes
  • Database restore and post-restore controls : 120 minutes
  • Flat file sync via Robocopy (in parallel with the DB restore) : 90 minutes
  • Reopen application & perform functional tests : 4 hours
Total time for technical operations : 314 minutes