Category Archives: Backup

Review: 5 years with CommVault

Introduction:

Backup and recovery is a rather dry topic, but it's an important one. After all, what's more critical to your company than its data? You can have the best products in the world, but if disaster strikes and you don't have a good solution in place, recovery can be painful or even impossible. Still, many companies shirk investment in this segment. The good solutions (like the one I'm about to discuss) cost a pretty penny, and that's capital that needs to be balanced against technology that makes or saves your company money. Insurance (and that's what backup is) is typically at the back of companies' minds.

Finding the right product in this segment can be a challenge, not only because every vendor tries to convince you that they’ve cracked the nut, but because it seems like all the good solutions are expensive.  Like many, our budget was initially constrained.  We had an old investment in CV (CommVault), but had not reinvested in it over the years, and needed a new solution.  We initially chose a more affordable Veeam + Windows Storage Spaces to handle our backup duties.  It was a terrible mistake, but you know, sometimes you have to fail to learn, and so we did.

After putting up with Veeam for a year, we threw in the towel and went back to CV with open arms. Our timing was also great, as Veeam had put a serious hurt on their business and some of their licensing changed to accommodate that. We ultimately ended up with much better pricing than when we last looked at CV, and on top of that, we actually found their virtualization backup to be more affordable and in many ways more feature rich. CV isn't perfect, as I'll outline below, but they're pretty much as close as you can get to perfection for a product that is the Swiss Army knife of backup.

CommVault Terms:

For those of you not super familiar with CV, you’ll find the following terms useful for understanding what I’m talking about.  There are a lot more components in CV, but these are the fundamental ones.

  • MA (Media Agent): Simply put, it’s a data mover.  It copies data to disk, tape, cloud, etc.
  • Agent: A client that is installed to backup an application or OS.
  • VSA (Virtual Server Agent): A client specially designed for virtualization backup.
  • CC (CommCell): The central server that manages all the jobs, history, reporting, configuration, etc.  This is the brains of the whole operation.

Our Environment:

  • We have five MA’s.
    • Two virtual MA's that back up to a Quantum QXS SAN (DotHill). This was done because we were reusing an old pair of VM hosts and have a few other non-CV backup components running on those hosts.
      • The SAN has something like two pools of 80 disks. Not as fast as we’d like, but more than fast enough.  The QXS (DotHill) was our replacement for Storage Spaces.  Overall, better than Storage Spaces, but a lot of room for improvement.  The details of that are for another review.
    • Two physical MA's with DAS; each MA has 80 disks in a RAID 60, so yeah, it rips from a disk performance perspective: multiple GBps.
    • One physical MA that’s attached to our tape library.
  • We have five VSA's. I'll go more into this below, but we're not using five because I want to.
  • We have one CC, although we’ll be rolling out a second for resiliency and failover soon.
  • We have a number of agents
    • Several MS Exchange
    • Several MS Active Directory
    • Several Linux
    • The rest are file server / OS image agents.
  • In total, we have about a PB of total backup capacity between our SAN and DAS, but not all of that is consumed by CV (most is though).
  • We only use compression right now, no dedupe.
  • We only use active fulls (real fulls), not synthetics.

Pros:

  • Backup:
    • CV can back up practically anything, and it has a number of application-specific agents as well. You can back up your entire enterprise with their solution. I would contend that with CV there are very few cases where you'd need point tools anymore. Desktops, servers, virtualization, various applications and NAS devices can all be backed up by CV. Honestly, it's hard to find a solution as comprehensive as theirs. That being said, I can imagine you're wondering: if they do it all, can they do it well? I would say mostly. I have some deltas to go over a little farther down, but they do a lot, and a lot of it well. It's one of the reasons the solution was (and still is) expensive.
    • I went from having to babysit backups with Veeam to having a solution that I almost never had to think about anymore (other than swapping tapes). There were some initial pains as we learned CV's way of doing virtualization backup, but we quickly got to a stable state.
  • Deployment / Scalability:
    • The CommCell deployment model works well everywhere from single-office locations all the way to globally distributed implementations. They're able to accomplish all of this with a single pane of glass, which is something a number of vendors can't claim.
    • Beyond the size of the deployment, you're also not forced into Windows for most components of CV; a lot of the roles outlined above run on either Linux or Windows.
    • CV is software based, and best of all, it's an application that runs on an OS you're already comfortable with (Linux / Windows). Because of this, the HW that you deploy the solution on is really only limited by minimum specs, budget and your imagination. You can build a powerful and affordable solution on simple DAS, or you can go crazy and run on NVMe / all-flash SANs. It also works in the cloud because, again, it's just SW inside a generic OS. I can't tell you how many backup solutions I looked at that had zero cloud deployment capabilities.
    • There are so many knobs to turn in this solution that it's pretty tough to run into a situation you can't tune for (there are a few though). Most of the out-of-box defaults are fine, but you'll get the best performance when you dig in and optimize. Some find this overwhelming, and I'll chat more about that in the cons, but with CV's great support and their documentation, it's not as bad as it sounds. Ultimately the tunability is an incredible strength of this solution. I've been able to increase backup throughput from a few hundred MBps to a few GBps simply by changing the IO size that CV uses.
  • Support:
    • Overall, they have fantastic support. Like any vendor's support it can vary, and CV is no different. Still, I can count on one hand the number of times support was painful, and even in those cases we ultimately got the issue resolved.
    • For the most part, support knows the application they're backing up pretty well. I had a VMware backup issue that we ran into with Veeam and that continued with CV. While CV wasn't able to directly solve the problem, they provided significantly more data for me to hand off to VMware, which ultimately led to us finding a known issue. CV analyzed the VMware logs as best they could and found the relevant entries they suspected were the issue. Veeam was useless.
    • Getting CV issues fixed is something else that's great about CV. No vendor is perfect; that's what hotfixes and service packs are for. CV has an amazing escalation process. I went from a bug to a hotfix that resolved the issue in under two weeks.
    • My experience with their support's response time has been fantastic; I rarely go more than a few hours without hearing from them. They're also not afraid to simply call you and work on the problem in real time. I don't mind email responses for simple questions, but when you're running into a problem, sometimes you just want someone to call you and hash it out in real time. I also like that most of the time you get the tech's direct number if you need to call them.
  • Feature requests: A little hit or miss, but feature requests tend to get taken seriously with CV, especially if it’s something pretty simple.
  • Value: This one is a mixed bag. Thanks to Veeam eating their lunch, virtualization backup with CV has never been a better value. I could be wrong, but I actually think virtualization backup in CV rings in at a significantly lower price than Veeam. I would say at least 50% of our backups are virtualization-based. It's our default backup method unless there is a compelling reason to use agents. This is ultimately what made CV an affordable backup solution for us. We were able to leverage their virtualization backup for most of our stuff, and utilize agents for the few things that really needed to be backed up at a file or application level. The virtualization backup entitles you to all their premium features, which is why I think it's a huge value add. That being said, I have some stuff to touch on in the cons with regards to the value.
  • Retention Management: Their retention management is a little tricky to get your head around, but it's ultimately the right way to do retention. Their retention is based on a policy, not on the number of recovery points. You configure things like how many days of fulls you want and how many cycles you need. I can take a bazillion one-off backups and not have to worry about my recovery history being prematurely purged.
  • Copy management: They manage copies of data like a champ. Mix it with the above point, and you can have all kinds of copies with different retentions and different locations, and it all works rock solid. You have control over what data gets copied. So your source data might have all your VM's and you only want a second copy of select VM's: no problem for them. Maybe you want dedupe on some, compression on others, some on tape, some on disk, some on cloud: again, no issue at all.
  • Ahead of the curve: CV seems to be the most forward thinking when it comes to backup / recovery destinations and sources. They had our Nimble SAN's certified for backup LONG before our previous vendor did. They support all kinds of cloud destinations, the ability to recover VM's from physical to virtual, virtual to cloud, etc. This goes back to the holistic approach that I brought up. They do a very good job of wrapping everything up and creating a flexible ecosystem to work with. You typically don't need point solutions with them.
  • Storage Management: I love their disk pools and the way they store their backup data. First and foremost, it's tunable: whether you want 512MB files or whatever size, it's an option. They shard the data across disks, etc. Frankly the way they store data is a no brainer. They also move jobs / data pretty easily from one disk to another, which is great. This type of flexibility is not only helpful for things like making it easier to fit your data on disparate storage, but also in ensuring your backups can easily be copied to unreliable destinations. Having to recopy a 512MB file is a lot better than having to recopy an 8TB file. CV can take that 8TB file, if you want, and break it up into chunks of various sizes (the default is 2GB).
  • Policies: Most things are defined using policies. Schedules, retention, copies, etc. Not everything, but most things. This makes it easy to establish standards for how things should act, and it also makes it easier to change things.
  • CLI: They have a ton of capability with their CLI / API. Almost anything can be executed or configured. I actually developed a number of external workflows which call their CLI, and it works well.
  • Tape Management:
    • They handle tapes like a librarian, minus the Dewey Decimal System. Seriously though, I haven't worked with a solution that makes handling tapes as easy as they do.
    • If you happen to use Iron Mountain, they have integration for that too.
    • They're pretty darn efficient with tape usage as well, which is mostly thanks to their "global copy" concept. We still have some white space issues, but it makes sense why they happen.
    • They are very good at controlling tape drive and parallel job management. This allows you to balance how many tape drives are used for what jobs.
  • Documentation: They document everything, and for good reason, there is a lot their product does. This includes things like advanced features and most of the special tuning knobs as well.  It’s not always perfect, but it’s typically very good.
  • Recovery:
    • File-level recovery from tape for VM backups, without having to restore the whole file first: need I say more? That means if I need one file off an 8TB backed-up VMDK, I don't have to restore 8TB first.
    • Most application-level backups offer some form of item-level recovery. It's not always straightforward or quick, but it's usually possible.
    • They’re smart with how they restore data. You can pick where you want the data recovered from (location and copy), and if it does need tapes, it tells you exactly what tapes you need.  No more throwing every single tape in and hoping that’s all you need.

Cons:

  • Backup:
    • Virtualization:
      • Their VMware backup in many ways isn't as tunable as it should be. There are places where they don't have stream limits where they really need them. For example, they lack a stream limit on the vCenter server, the ESXi host, or even the VSA doing the backup. It's honestly a little strange, as CV seems to offer a never-ending number of stream controls for other areas of their product. I bring this up as probably my number one issue with their VMware backup; this is where we had the most initial problems with their solution. I would still say this is a glaring hole in their virtualization backup. I just looked at CV11 SP7 and nothing has changed with regards to this, which is disappointing to say the least. This is one area that I think Veeam handles much better than they do.
      • The performance of NBD (management network only) based backup is, bluntly, terrible. The only way we could get really good performance out of their product was to switch to hot add. Typically speaking, I hate hot add for VMware backup. It takes forever to mount disks, and it makes the setup of VM backup more complicated than it needs to be. Not to mention that if you do have an issue during the backup process (like vCenter dying), the cleanup of the backup is horrible.
      • They don't pre-tune the VSA for hot add: things like disabling disk initialization in Windows and whatnot.
      • Their inline compression throughput was also atrocious at first. We had to switch the algorithm used, which fixed the issue, but it required a non-GUI tweak and me asking if there was anything else they could do. It was actually timely that the new algorithm had been released as experimental in the version we had just upgraded to.
      • Their default VM dispatch to me is less than ideal. Instead of balancing VM's across the VSA's in a least-loaded fashion, they pick the VSA closest to the VM or datastore. I needed to go in and disable all of this.
    • Deployment / Scalability:
      • While I applaud their flexibility, the one area that I think still needs work is their dedupe. To me, they really need to focus on building a DataDomain level of solution that can scale to petabytes of logical data in a single media agent, and right now they can't scale that big. It seems like you need to have a bunch of mid-sized buckets, which is better than nothing, but still not as good as it should be.
      • Deployment for CV newbies is not straightforward. You'll definitely need professional services to get most of the initial setup done, at least until you have time to familiarize yourself with it. You'll also need training so that you actually know how to care for and grow the solution. I think CV could do a better job, perhaps by implementing a more express setup just to get things up, and maybe even a couple of intro / how-to videos to jump start the setup. It's complicated because of its power, but I don't think it needs to be. The knobs and tuning should be there to customize the solution to a person's environment, but there should be an easy button that suits most folks out of the box.
    • Support: In general I love their support, but there are times where I'm pretty confident the folks doing the support don't have at-scale experience with the product. There are times when I've tried explaining a scaling issue we were having, and they couldn't wrap their heads around it. They also tend to get wrapped up in "this is the way it works" and not in "this is the way it SHOULD work", which again I think comes back to experience with the product at scale. This would tend to happen more when I was trying to explain why I set something up in a particular way, a way that didn't match their norm. For example, with VM backups they like to pile everything into subclients. For a number of reasons I'm not going to go into in this blog post, that doesn't work for us, and frankly it shouldn't work for most folks. I was able to punch holes in their design philosophy, but they were stuck on "this is the way it is". The good news is you can typically escalate over short-sighted techs like this and get to someone who can think outside the box.
    • Value: This is a tough one. On one hand, I want good support and a feature-rich product, but on the other hand, the cost of agent-based backup is frankly stupid expensive. When my backup product costs more per TB than my SAN, that's an issue. It's one of the primary reasons we push towards VM-based backups, as it's honestly the only way we could afford their product. Even with huge discounts, the cost per TB is insane with their solution. In some cases, I would almost rather have a per-agent cost than a per-TB cost. I could see how that could get out of control, but I think there are cases where each licensing model works better for each company. If I had thousands of servers, I could see where the per-TB model might make more sense. This is one of the reasons we don't back up SQL directly with CV; it just costs too much per TB. It's cheaper for us to use a (still too expensive) file-based agent to pick up SQL dump files.
    • Storage Management: Once data is stored on its medium, moving it off isn't easy. If you have a mountpoint that needs to be vacated, you need to either aux copy the data to a new storage copy, manually move the data to another mountpoint, or wait till it ages out. They really should have an option in their storage pool to simply right-click the mountpoint and say "vacate". This operation would then move all data/jobs to whatever mountpoints are left in the pool, similar to VMware's Storage DRS. I would actually like to see this ability at an MA level as well.
    • CLI: I'll knock any vendor that doesn't have a PowerShell module, and CV is one of those vendors. Again, glad that you have APIs, but in an enterprise where Windows rules the house, PowerShell should be a standard CLI option.
    • Tape Management: As much as I think they do it better than anyone else, they could still improve the white space issue. I almost think they need a tapering-off setting. Perhaps even a preemptive analysis of the optimal number of tapes and tape drives before the start of each new aux copy, re-analyzed each time more data is detected that needs to be copied to tape. This way it could balance copy performance with tape utilization. Maybe even define a range of streams that can be used.
    • Documentation: As great as their documentation is, it needs someone to really organize it better, taking into account the differences between CV versions. I realize it's probably a monumental task, but it can be really hard to find the right document for the right version of what you're looking for. I've also found times where certain features are documented in older CV version docs but not in newer ones (even though they still exist). I guess you could argue that at least they have so much documentation that it's just hard to find the right one, versus not having any docs at all. When in doubt, I contact support and they can generally point me in the right direction, or they'll just answer the question.
    • Recovery:
      • Item-level recovery that's application based really needs a lot of work. One thing I'll give Veeam is that they seem to have a far more feature-rich and intuitive application item-level recovery solution than CV.
        • Restoring Exchange at an item level is slow and involved (lots of components to install). I honestly still haven't gotten it working.
        • AD item level recovery is incredibly basic and honestly needs a ton of work.
        • Linux requires a separate appliance, which IMO it shouldn't. If Linux admins can write tools to read NTFS, why can't a backup vendor write a Windows tool that can natively mount and read EXT3/4, ZFS, XFS, UFS, etc.?
      • P2P, V2V / P2V leaves a lot to be desired. If you plan to use this method, make sure you have an ISO that already works.  Otherwise you’ll be scrambling to recover bare metal when you need to.

Conclusion:

Despite CommVault's cons, I still think it's the best solution out there. It's not perfect in every category, and that's a typical problem with most do-it-all solutions, but it's pretty damn good at most of them. It's an expensive solution, and it's complicated, but if you can afford it and invest the time in learning it, I think you'll fall in love with it, at least as much as one can with a backup tool.

Problem Solving: Chasing SQL’s Dump

The Problem:

For years as an admin I've had to deal with SQL. At a former employer, our SQL environment / databases were small, and backup licensing was based on agents, not capacity. Fast forward to my current employer: we have a fairly decent-sized SQL environment (60-70 servers), our backups are large, licensing is based on capacity, and we have a full-time DBA crew that manages its own backup schedules and prefers that backups stay managed by them. What that means is dealing with a ton of dumps. Read into that as you want 🙂

When I started at my current employer, the SQL server backup architecture was kind of a mess. To begin with, there were about 40-50 physical SQL servers at the time, so when you're picturing all of this, keep that in mind. Some of these issues don't go hand in hand with physical design limitations, but some do.

  • DAS was used not only for storing the SQL logs, DBs and indexes, but also the backups. Sometimes, if the SQL server was critical enough, we had dedicated disks for backups, but that wasn't typical. This of course is a problem for many reasons.
    • Performance of not only the backups but the SQL service itself was often limited because they shared the same disks. So when a backup kicked off, SQL was reading from the same disks it was attempting to write to. This wasn't as big of an issue for the few systems that had dedicated disks, but even there, they sometimes shared the same RAID card, which meant you were still potentially bottlenecking one for the other.
    • Capacity was spread across physical servers.  Some systems had plenty of space and others barely had enough.  Islands are never easy to manage.
    • If that SQL server went down, so did its most recent backups. Transaction log (TL) backups were also stored here (shudders).
    • Being a dev shop meant doing environment refreshes. This meant creating and maintaining share / NTFS permissions across servers. This by itself isn't inherently difficult if it's thought out ahead of time, but it wasn't (not my design).
    • We were migrating to a virtual environment, and that virtual environment would be potentially vMotioning from one host to another.  DAS was a solution that wouldn’t work long term.
  • The DBA's managed their backup schedules, so we basically had to estimate the best time to pick up their DBs. Sometimes we started too early, and sometimes we could have started sooner.
  • Adding to the above points, if we had a failed backup overnight, or a backup that ran long, it affected SQL's performance during production hours. This put us in a position of choosing between giving up on backing up some data or accepting performance degradation.
  • We didn't know when they did fulls vs. diffs, which means we might be storing their DIFF files on what we considered "full" backup tapes. By itself that's not an issue, except that we did monthly extended fulls, meaning we kept the first full backup of each month for 90 days. If the file we're keeping is a diff file, that doesn't do us any good. However, as you can see below, it wasn't as big of an issue in general.
  • Finally, the problem I contended with besides all of these is that because they were keeping ALL files on disk in the same location, every time we did a full backup, we backed EVERYTHING up. Sometimes that was two weeks' worth of data: TLs, diffs and fulls. This meant we were storing their backup data multiple times over, on both disk and tape.

I'm sure there are more than a few of you out there with similar design issues. I'm going to lay out how I worked around some of the politics and budget limitations. I wouldn't suggest this solution as a first choice; it's really not the right way to tackle it, but it is a way that works well for us, and it might for you. This solution of course isn't limited to SQL. Really anything that uses a backup file scheme could fit right into this solution.

The solution:

I spent days' worth of my personal time while jogging, lifting, etc. just thinking about how to solve all these problems. Some of them were easy and some of them would be technically complex, but doable. I also spent hours with our DBA team collaborating on the rough solution I came up with, and honing it to work for both of us.

Here is basically what I came to the table with wanting to solve:

  • I wanted SQL dumping to a central location, no more local SQL backups.
  • The DBA’s wanted to simplify permissions for all environments to make DB refreshing easier.
  • I wanted to minimize or eliminate storing their backup data twice on disk.
  • I wanted them to have direct access to our agreed upon retention without needing to involve us for most historical restores.  Basically giving them self service recovery.
  • I wanted to eliminate backing up more data than we needed.
  • I wanted to know for sure when they were done backing up, and what type of backup they performed.

Honestly we needed the fix, as the reality was we were moving towards virtualizing our SQL infrastructure, and presenting local disk on SAN would be both expensive and incredibly complex to contend with for 60+ SQL servers.

How we did it:

Like I said, some of it was an easy fix, and some of it more complex, let’s break it down.

The easy stuff:

Backup performance and centralization:

We bought an affordable backup storage solution. At the time of this writing it was, and still is, Microsoft Windows Storage Spaces. After making that mistake, we're now moving on to what we hope is a more reliable and mostly simpler Quantum QXS (DotHill) SAN using all NL-SAS disks. Point being, instead of having SQL dump to local disk, we set up a fairly high-performance file server cluster. This gave us both high availability and, with the HW we implemented, very high performance as well.

New problem we had to solve:

Having something centralized means you also have to think about the possibility of needing to move it at some point. Given that many processes would be written around this new network share, we needed to make sure we could move data around on the backend, update some pointers, and have things carry on without needing massive changes. For that, we relied on DFS-N. We had the SQL systems point at DFS shares instead of pointing at the raw share. This is going to prove valuable as we move data to the new SAN very soon.
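
To make the idea concrete, here's a minimal sketch of the DFS-N indirection, assuming the DFSN PowerShell module is available and using made-up names (corp.local, FS-CLUSTER01, QXS-FILER01, SQLBackup); it's not our exact setup, just the pattern.

# Create a DFS namespace folder that points at the real file server share.
New-DfsnFolder -Path '\\corp.local\SQLBackup\PRD' `
               -TargetPath '\\FS-CLUSTER01\SQLBackup$\PRD' `
               -Description 'Production SQL dump share'

# The DBA's back up to the namespace path, never the raw share:
#   \\corp.local\SQLBackup\PRD\<ServerName>\DropOff
# When the data moves to new storage, repoint the folder target and nothing else changes.
New-DfsnFolderTarget    -Path '\\corp.local\SQLBackup\PRD' -TargetPath '\\QXS-FILER01\SQLBackup$\PRD'
Remove-DfsnFolderTarget -Path '\\corp.local\SQLBackup\PRD' -TargetPath '\\FS-CLUSTER01\SQLBackup$\PRD'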

Reducing multiple disk copies and providing them direct access to historical backups:

The backup storage was sized to store ALL required standard retention, and we (SysAdmins) would continue managing extended retention using our backup solution. For the most part, this means the DBA's now have access to the data they need 99% of the time. This solved the problem of storing the data more than once on disk, as we would no longer store their standard retention in CommVault, but instead rely on the SQL dumps they're already storing on disk (except for extended retention). The backups still get copied to tape and sent off-site, in case you thought that wasn't covered, BTW.

Simplifying backup share permissions:

The DBA’s wanted to simplify permissions, so we worked together and basically came up with a fairly simple folder structure.  We used the basic configuration below.

  • SQL backup root
    • PRD <—- DFS root / direct file share
      • example prd SQL server 1 folder
      • example prd SQL server 2 folder
      • etc.
    • STG <—– DFS root / direct file share
      • example stg SQL server 1 folder
      • etc.
    • etc.
  • Active Directory security group wise, we set it up so that all prod SQL servers are part of a "prod" AD group, all stage servers are part of a "stage" AD group, etc. (sketched after this list).
  • The above AD groups were then assigned at the DFS root (Stg, prd, dev, uat) with the desired permissions.
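
As a minimal sketch of that permission layout, assuming hypothetical local paths and group names (D:\SQLBackupRoot, CORP\SQL-PRD-Servers, etc.): one NTFS grant per environment root, and the per-server folders underneath simply inherit.

$roots = @{
    'D:\SQLBackupRoot\PRD' = 'CORP\SQL-PRD-Servers'
    'D:\SQLBackupRoot\STG' = 'CORP\SQL-STG-Servers'
    'D:\SQLBackupRoot\DEV' = 'CORP\SQL-DEV-Servers'
}

foreach ($path in $roots.Keys) {
    New-Item -ItemType Directory -Path $path -Force | Out-Null
    # (OI)(CI)M = modify rights, inherited by all child folders and files.
    icacls $path /grant "$($roots[$path]):(OI)(CI)M" | Out-Null
}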

With this configuration, it's now as simple as dropping a SQL service account into one group, and it will automatically fall into the correct environment-level permissions. In some cases it's more permissive than it should be (prod has access to any prod server, for example), but it kept things simple, and in our case I'm not sure the extra security of per-server / per-environment permissions would have been a big win.

The harder stuff:

The only two remaining problems we had to solve were knowing what kind of backup the DBA's did, and making sure we were not backing up more data than we needed. These were also the two most difficult problems to solve because there wasn't a native way to do it (other than agent-based backup). We had two completely disjointed systems AND processes that we were trying to make work together. It took many miles of running for me to put all the pieces together, and it took a number of meetings with the DBA's to figure things out. The good news is, both problems were solved by aspects of a single solution. The bad news is, it's a fairly complex process, but so far it's been very reliable. Here's how we did it.

 The DONE file:

Everything in the workflow is based on the presence of a simple file, what we refer to internally as the "done" file. This file is used throughout the workflow for various things, and it's the key to keeping the whole process working correctly. Basically the workflow lives and dies by the DONE file. The DONE file was also the answer to knowing what type of backup the DBA's ran, so we could appropriately sync our backup type with theirs.

The DONE file follows a very rigid naming convention. All of our scripts depend on this, and frankly naming standards are just a recommended practice (that's for another blog post). A quick validation sketch follows the examples below.

Our naming standard is simple:

%FourDigitYear%%2DigitMonth%%2DigitDay%_%24Hour%%Minute%_%JobName(usually the SQL instance)%_%backuptype%.done

And here are a few examples:

  • Default Instance of SQL
    • 20150302_2008_ms-sql-02_inc.done
    • 20150302_2008_ms-sql-02_full.done
  • Stg instance of SQL
    • 20150302_2008_ms-sql-02stg_inc.done
    • 20150302_2008_ms-sql-02stg_full.done
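
Here's a quick sketch of validating a DONE file name against the standard above; the file name is just the example from this post.

$doneFile = '20150302_2008_ms-sql-02_inc.done'
$pattern  = '^(?<date>\d{8})_(?<time>\d{4})_(?<job>.+)_(?<type>full|inc)\.done$'

if ($doneFile -match $pattern) {
    # Throws if the first two fields aren't a real date/time; a cheap extra sanity check.
    [void][datetime]::ParseExact($Matches['date'] + $Matches['time'], 'yyyyMMddHHmm', $null)
    "Job '$($Matches['job'])' reported a $($Matches['type'].ToUpper()) backup."
} else {
    throw "DONE file '$doneFile' does not follow the naming standard."
}
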
The backup folder structure:

Equally as important as the done file is our folder structure. Again, because this is a repeatable process, everything must follow a standard or the whole thing falls apart.

As you know, we have a root folder structure that goes something like this: "\\ShareRoot\Environment\ServerName". Inside the ServerName root I create four folders, and I'll explain their use next.

  • .\Servername\DropOff
  • .\Servername\Queue
  • .\Servername\Pickup
  • .\Servername\Recovery

Dropoff:  This is where the DBA's dump their backups initially. The backups sit here and wait for our process to begin.

Queue:  This is a folder that we use to stage / queue the backups before the next phase. Again, I'll explain in greater detail, but the main point of this is to allow us to keep moving data out of the Dropoff folder to a temporary location in the Queue folder. You'll understand why in a bit.

Pickup:  This is where our tape jobs are configured to look for data.

Recovery:  This is the permanent resting place for the data until it reaches the end of its configured retention period.
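
Setting up the structure per job is as simple as something like the following, assuming a hypothetical share root; the environment and server names just follow the structure described above.

$serverRoot = '\\corp.local\SQLBackup\PRD\ms-sql-02'

foreach ($folder in 'DropOff', 'Queue', 'Pickup', 'Recovery') {
    New-Item -ItemType Directory -Path (Join-Path $serverRoot $folder) -Force | Out-Null
}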

Stage 1: SQL side

Prerequisites:

  1. SQL needs a process that can check the Pickup folder for a done file, delete a done file and create a done file.  Our DBA’s created a stored procedure with parameters to handle this, but you can tackle it however you want, so long as it can be executed in a SQL maintenance plan.
  2. For each "job" in SQL that you want to run, you'll need to configure a "full" maintenance plan to run a full backup, and if you're using SQL diffs, an "inc" maintenance plan as well. In our case, to keep things a little simpler, we limited a "job" to a single SQL instance.

SQL maintenance plan work flow:

Every step in this workflow stops on an error; there is NO continue or ignore. (A PowerShell sketch of the DONE-file steps follows the list.)

  1. First thing the plan does is check for the existence of a previous DONE file.
    1. If a DONE file exists, it's deleted and an email is sent out to the DBA's and sysadmins informing them. This is because it's likely that a previous process failed to run.
    2. If a DONE file does not exist, we continue to the next step.
  2. Run our backup, whether it's a full or inc.
  3. Once complete, we then create a new done file in the root of the PickupFolder directory.  This will either have a “full” or “inc” in the name depending on which maintenance plan ran.
  4. We purge backups in the Recovery folder that are past our retention period.
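
The DBA's implemented the DONE-file handling as a stored procedure called from the maintenance plan; what follows is just the equivalent file-side logic, sketched in PowerShell purely for illustration. The paths, email addresses and backup type are assumptions.

$pickup     = '\\corp.local\SQLBackup\PRD\ms-sql-02\Pickup'   # where the DONE file lives
$jobName    = 'ms-sql-02'
$backupType = 'full'    # or 'inc', depending on which maintenance plan is running

# Step 1: a leftover DONE file means the previous run was never picked up.
$stale = Get-ChildItem -Path $pickup -Filter '*.done'
if ($stale) {
    $stale | Remove-Item
    Send-MailMessage -To 'dba@corp.local', 'sysadmin@corp.local' -From 'sql-backup@corp.local' `
        -Subject "Stale DONE file removed for $jobName" -SmtpServer 'smtp.corp.local'
}

# Step 2: the maintenance plan runs the actual full or inc backup at this point.

# Step 3: drop a fresh DONE file so the SysAdmin side knows the dumps are complete.
$stamp = Get-Date -Format 'yyyyMMdd_HHmm'
New-Item -ItemType File -Path (Join-Path $pickup "${stamp}_${jobName}_${backupType}.done") | Out-Null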

SQL side is complete.  That’s all the DBA’s need to do.  The rest is on us.  From here you can see how they were able to tell us whether or not they ran a full via the done file.  You can also glean a few things about the workflow.

  1. We’re checking to see if the last backup didn’t process
  2. We delete the done file before we start a new backup (you’ll read why in a sec).
  3. We create a new DONE file once the backups are done.
  4. We don't purge any backups until we know we had a successful backup.

Stage 1: SysAdmin side

Our stuff is MUCH harder, so do your best to follow along and let me know if you need me to clarify anything.

  1. We need a stage 1 script, which does the following in sequential order (a trimmed-down sketch follows this list).
    1. It needs to know what job it's looking for. In our case with JAMS, we named our JAMS jobs based on the same pattern as the done file. So when the job starts, the script reads information from the running job and basically fills in all the parameters like the folder location, job name, etc.
    2. The script looks for the presence of ANY done file in the specific folder.
      1. If no done file exists, it goes into a loop, and checks every 5 minutes (this minimizes slack time).
      2. If a done file does exist, we…
        1. If there is more than one, we fail, as we don't know for sure which file is correct. This is a fail-safe.
        2. If there is only one, we move on.
    3. Using the "_" in the done file name, we make sure that it follows all our standards. So for example, we check that the first split is a date, the second is a time, the third matches the job name in JAMS, and the fourth is either an inc or a full. A failure in any one of these will cause the job to fail, and we'll get notified to manually look into it.
    4. Once we verify the done file is good to go, we now have all we need to start the migration process.  So the next thing we do is use the date and time information, to create a sub-folder in the Queue folder.
    5. Now we use robocopy to mirror the folder structure to the .\Queue\Date_Time
    6. Once that’s complete, we move all files EXCEPT the done file to the Date_Time folder.
    7. Once that’s complete, we then move the done file into said folder.
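
Here's a trimmed-down sketch of that stage 1 script. In the real job, the paths and job name come from the running JAMS job; here they are hard-coded assumptions. Per the workflow above, the DBA's drop dumps in DropOff and write the DONE file to the Pickup root.

$serverRoot = '\\corp.local\SQLBackup\PRD\ms-sql-02'
$dropOff    = Join-Path $serverRoot 'DropOff'
$pickup     = Join-Path $serverRoot 'Pickup'
$queue      = Join-Path $serverRoot 'Queue'
$jobName    = 'ms-sql-02'

# Step 2: wait for a DONE file, checking every 5 minutes; more than one is a hard failure.
do {
    $done = @(Get-ChildItem -Path $pickup -Filter '*.done')
    if ($done.Count -eq 0) { Start-Sleep -Seconds 300 }
} while ($done.Count -eq 0)
if ($done.Count -gt 1) { throw 'More than one DONE file found; refusing to guess which is current.' }

# Step 3: enforce the naming standard before touching anything.
$pattern = "^(?<date>\d{8})_(?<time>\d{4})_$([regex]::Escape($jobName))_(full|inc)\.done$"
if ($done[0].Name -match $pattern) {
    # Steps 4/5: build the Queue\Date_Time folder and mirror the DropOff tree (folders only).
    $stageFolder = Join-Path $queue "$($Matches['date'])_$($Matches['time'])"
    robocopy $dropOff $stageFolder /E /XF * | Out-Null
} else {
    throw "DONE file $($done[0].Name) breaks the naming standard."
}

# Steps 6/7: move the dump files first, then the DONE file last, so stage 2 never sees a half-moved set.
Get-ChildItem -Path $dropOff -Recurse -File | ForEach-Object {
    Move-Item -Path $_.FullName -Destination $_.FullName.Replace($dropOff, $stageFolder)
}
Move-Item -Path $done[0].FullName -Destination $stageFolder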

And that completes stage 1.  So now you’re probably wondering, why wouldn’t we just move that data straight to the pickup folder? A few reasons.

  • When the backup to tape starts, we want to make sure no new files are getting pumped into the pickup folder. You could say, "well, just wait until the backup's done before you move data along." I agree, and we sort of do that, but we do it in a way that keeps the pickup folder empty.
    • By moving the files to a queue folder, if our tape process is messed up (not running) we can keep moving data out of the pickup folder into a special holding area, all the while still being able to keep track of the various backup sets (each new job gets a different date_timestamp folder in the queue folder). Our biggest concern is missing a full backup. Remember, if the SQL job sees a done file, it deletes it. We really want to avoid that if possible.
    • We ALSO wanted to avoid a scenario where we were moving data into a queue folder while the second stage job tried to move data out of the queue folder. Again, by having an individual queue folder for each job, we're able to keep track of all the moving pieces and make sure we're not stepping on toes.

Gotcha to watch out for with moving files:

If you didn't pick up on it, I mentioned that I used robocopy to mirror the directory structure, but I did NOT mention using it for moving the files. There's a reason for that: robocopy's move parameter actually does a copy + delete. As you can imagine, with a multi-TB backup that process would take a while. I built a custom "move-files" function in PowerShell that does a similar thing, and in that function I use the Move-Item cmdlet, which is a simple pointer update as long as source and destination are on the same volume. MUCH faster, as you can imagine.
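
Below is a stripped-down version of the idea behind that helper; the function and parameter names are made up for illustration, and the paths are assumptions.

function Move-BackupFiles {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,
        [string] $ExcludeFilter = ''    # e.g. '*.done' to leave the DONE file behind for a later pass
    )

    Get-ChildItem -Path $Source -Recurse -File |
        Where-Object { $ExcludeFilter -eq '' -or $_.Name -notlike $ExcludeFilter } |
        ForEach-Object {
            $target = $_.FullName.Replace($Source, $Destination)
            New-Item -ItemType Directory -Path (Split-Path $target -Parent) -Force | Out-Null
            # Same-volume move = rename (pointer update), so multi-TB files move instantly.
            Move-Item -Path $_.FullName -Destination $target
        }
}

# Move the dumps first, then the DONE file on its own pass.
$queueFolder = '\\corp.local\SQLBackup\PRD\ms-sql-02\Queue\20150302_2008'
$pickup      = '\\corp.local\SQLBackup\PRD\ms-sql-02\Pickup'
Move-BackupFiles -Source $queueFolder -Destination $pickup -ExcludeFilter '*.done'
Move-BackupFiles -Source $queueFolder -Destination $pickup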

Stage 2: SysAdmin Side

We’re using JAMS to manage this, and with that, this stage does NOT run, unless stage 1 is complete.  Keep that in mind if you’re trying to use your own work flow solution.

OK, so at this point our pickup directory may or may not be empty; it doesn't matter. What does matter is that we should have one or more jobs sitting in our .\Queue\xxx folder(s). What you need next is a script that does the following.

  1. When it starts, it looks for any “DONE” file in the queue folder.  Basically doing a recursive search.
    1. If one or more files are found, we do a foreach loop for each done file found and….
      1. Mirror the directory structure using robocopy from queue\date_time to the PickupFolder
        1. Then move the backup files to the Pickup folder
        2. Move the done file to the Pickup Folder
        3. We then confirm the Queue\date_time folder is empty and delete it.
        4. ***NOTE:  Notice how we look for a DONE file first. This allows stage 1 to populate a new Queue sub-folder while we're working on this stage, without inadvertently moving data that's in use by another stage. This is why there's a specific order to when we move the done file in each stage.
    2. If NO done files are found, we assume maybe you’re recovering from a failed step and continue on to….
  2. Now that all files (dumps and done) are in the pickup folder we….
    1. Look at all the done files: if any of them are full, the job will be a full backup; if we find NO fulls, then it's an inc.
    2. Kick off a backup using CommVault's scripts. Again, parameters such as the path, client, subclient, etc. are all pulled from JAMS in our case or already present in CommVault. We use the backup type determined in step 2.1 to decide what we'll execute. Again, this gives the DBA's the power to control whether a full or an inc goes to tape. (The kick-off and polling loop are sketched after this list.)
    3. As the backup job is running, we’re constantly checking the status of the backup, about once a minute using a simple “while” statement.  If the job fails, our JAMS solution will execute the job two more times before letting us know and killing the job.
    4. If the job succeeds, we move on to the next step.
  3. Now we follow the same moving procedure we used above, except this time, we have no queue\date_time folder to contend with.
    1. Move the backup files from Pickup to the Recovery folder.
    2. Move the done files
    3. Check that the Pickup folder is empty
      1. If yes, we delete and recreate it. Reason? Simple: it's the easiest way to deal with a changing folder structure. If a DBA deletes a folder in the DropOff directory, we don't want to continue propagating a stale object.
      2. If not we bomb the script and request manual intervention.
  4. If all that works well, we've just completed our backup process.
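
For step 2, the actual CommVault execution is wrapped in our own helper scripts. Start-CVSubclientBackup and Get-CVJobStatus below are hypothetical wrappers around those command-line calls, shown only to illustrate the full-vs-inc decision and the once-a-minute polling loop; the path, client and subclient names are assumptions.

$pickup = '\\corp.local\SQLBackup\PRD\ms-sql-02\Pickup'

# 2.1: if any DONE file says "full", the tape job is a full; otherwise it's an inc.
$backupType = if (Get-ChildItem -Path $pickup -Filter '*_full.done') { 'FULL' } else { 'INCREMENTAL' }

# 2.2: kick off the CommVault job (hypothetical wrapper around the CV command line).
$jobId = Start-CVSubclientBackup -Client 'ms-sql-02' -Subclient 'SQL-dumps' -Type $backupType

# 2.3: poll roughly once a minute; JAMS retries the whole job twice before paging us.
while (($status = Get-CVJobStatus -JobId $jobId) -eq 'Running') {
    Start-Sleep -Seconds 60
}
if ($status -ne 'Completed') { throw "CommVault job $jobId finished with status '$status'." }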

Issues?

You didn't think I was going to say it was perfect, did you? Hey, I'm just as hard on myself as I am on vendors. So here is what sucks about the solution.

  1. For the longest time, *I* was the only one who knew how to troubleshoot it. After a bit of training, and running into issues, my team is now mostly caught up on how to troubleshoot it. Still, this is the problem with home-brewed solutions, and ones that are entirely scripted don't help.
  2. Related to the above, if I leave my employer, I'm sure the script could be modified to serve other needs, but it's not easy, and I'm sure it would take a bit of reverse engineering. Don't get me wrong, I commented the snot out of the script, but that doesn't make it any easier to understand.
  3. It's tough to extend. I know I said it could be, but really, I don't want to touch it unless I have to (other than parameters).
  4. When we do UAT refreshes, we need to disable production jobs so the DBA's have access to the production backups for as long as they need. It's not the end of the world, but it requires us to be involved at a low level with development refreshes, whereas before there wasn't any involvement on our side.
  5. We've had times where full backups have been missed on the tape side. That doesn't mean they didn't get copied to tape, rather that they were considered an "inc" instead of a "full". This could easily be fixed by having the SQL stored procedure check whether the done file that's about to be deleted is a full backup and, if so, replace it with a new full DONE file, but that's not the way it is now, and that depends on the DBA's. Maybe in your case you can account for that.
  6. We've had cases where the DBA's do a UAT refresh and copy a backup file to the recovery folder manually. When we go to move the data from the pickup folder to the recovery folder, our process bombs because it detects that the same file already exists. Not the end of the world for sure, and easy enough to troubleshoot, but it's not seamless. An additional workaround would be to do an MD5 hash comparison: if the file is the same, just delete it out of the pickup directory and move on (a quick sketch follows this list).
  7. There are a lot of jobs to define and a lot of places to update.
    1. In JAMS we have to create 2 jobs + a workflow that links them per SQL job
    2. In CommVault we have to define the subclient and all its settings.
    3. On the backup share, 4 folders need to be created per job.
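
For what it's worth, here's a sketch of the MD5 workaround mentioned in point 6; the paths are hypothetical.

$pickupFile   = '\\corp.local\SQLBackup\PRD\ms-sql-02\Pickup\db1_full.bak'
$recoveryFile = '\\corp.local\SQLBackup\PRD\ms-sql-02\Recovery\db1_full.bak'

if (Test-Path $recoveryFile) {
    # The DBA's already copied this file during a refresh; if it's identical, drop ours and move on.
    $same = (Get-FileHash $pickupFile -Algorithm MD5).Hash -eq (Get-FileHash $recoveryFile -Algorithm MD5).Hash
    if ($same) { Remove-Item $pickupFile }
    else       { throw "Conflicting copy of $(Split-Path $pickupFile -Leaf) already in Recovery." }
} else {
    Move-Item -Path $pickupFile -Destination $recoveryFile
}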

Closing thoughts:

At first glance I know it's REALLY convoluted looking. A Rube Goldberg machine for sure. However, when you really start digging into it, it's not as bad as it seems. In essence, I'm mostly using the same workflow multiple times and simply changing the source / destination. There are places, for example when I'm doing the actual backup, where there's more than the generic process being used, but it's pretty repetitive otherwise.

In our case, JAMS is a very critical piece of software for making this solution work. While you could do this without it, it would be much harder for sure.

At this point, I have to imagine you're wondering if this is all worth it. Maybe not to companies with deep pockets. And being honest, this was actually one of those processes that I did in-house while frustrated that I had to do it. I mean really, who wants to go through this level of hassle, right? It's funny: I thought THIS would be the process I was troubleshooting all the time, and NOT Veeam. However, this process for the most part has been incredibly stable and resilient. Not bragging, but it's probably because I wrote the workflow. The operational overhead I invested saved a TON of capex. Backing up SQL natively with CommVault has a list price of 10k per TB, before compression. We have 45TB of SQL data AFTER compression. You do the math, and I'm pretty sure you'll see why we took the path we did. Maybe you'll say that CommVault is too expensive, and to some degree that's true, but even if you're paying 1k per TB, and you're being pessimistic and assuming that 45TB = 90TB before compression, I saved 90k plus 20% maintenance each year, and CommVault doesn't cost anywhere close to 1k per TB, so really, I saved a TON of bacon with this process.

Besides the cost factor, it's also enabled us to have a real grip on what's happening with SQL backups. Before, it was this black box that we had no real insight into. You could contend that's a political issue, but then I suspect lots of companies have political issues. We now know, for example, that SQL ran a full backup 6 days ago. We now have our backup workflow perfectly coordinated: we're not starting too early, and we're kicking off within 5 minutes of them being done, so we're not dealing with slack time either. We're making sure that our backup application and backup tapes are being used in the most prudent way. Best of all, our DBA's now have all their dump files available to them, their environment refreshes are reasonably easy, the backup storage is FAST, and we have backups centralized and not stored with the server. All in all, the solution kicks ass, in my not so humble opinion. Would I have loved to do CommVault natively? For sure; no doubt it's ultimately the best solution, but this is a compromise that allowed us to continue using CommVault, save money and accomplish all our goals.

Review: 1.5 years with Veeam

Disclaimer:

These are my opinions, these are not facts.  Take them with a grain of salt, and use your own judgement.  These are also my personal views, not those of my employer.

Also, no one is paying me for this, and no one asked me to write this. It's not unbiased, but it's also not a hired review.

Introduction:

As you may know from reading my backup storage series, we migrated from native CommVault to Veeam + CommVault. I wanted to put together my experience living with Veeam as an almost complete backup solution replacement for the last year and a half. Again, I'm always looking for this type of write-up, but it's always hard to find.

The environment:

We're backing up approximately 80TB in total (after compression / dedupe). That's what CommVault is copying to tape once a week. Veeam probably makes up a good 50TB+ of that data set.

The VM's we're backing up are mostly file servers and generic servers. However, we do have almost 6TB of Exchange data and a smattering of small SQL servers.

Source infrastructure specs:

  • 5 Nimble CS460 SAN's
  • 10g networking (Nexus 5596) with no over subscription
  • 4 Dell r720 hosts with dual socket 12 core procs, and 768GB of RAM.

Backup infrastructure specs:

  • Veeam Infrastructure (started with v6.5 and currently running v8):
    • 1 Veeam server
    • 5 Veeam proxy servers:
      • 3 VM's with 8 vCPU's and 32GB of vRAM
      • 1 Dell r710, dual socket 6 core, 128GB of RAM
      • 1 Dell r720, dual socket 8 core, 384GB of RAM
  • Windows Storage Spaces SAN:
    • Made up of the same two Dell proxies as above.
    • 80 4TB disks in a RAID 10 (verified 4GB per second read and 2GB per second write)
    • All 10g
  • CommVault infrastructure:
    • 1 VM running our CommCell
    • 1 Dell r710 attached to our Quantum tape library

All the infrastructure on this side was also connected to the same Nexus 5596 at 10Gb, again with no oversubscription.

Context is everything, which is why I shared my specs.

What we liked pre-migration and still do:

Let’s start with the pros that Veeam has to offer.  Again, we chose Veeam because it was working well as a point solution (replication), so they came in with a good reputation.

  • For the most part, Veeam is super easy to set up and configure. While I did read the manuals at times, many things were very intuitive to set up. There are areas that are not, but I'll get more into that later.
  • I like that the backup files are all you need to recover from. So long as you have the VBK you can recover your last full, and if you have the VIBs and VRBs, you can get all your restore points too. This in turn makes it very easy to move the Veeam backup data around and import the data on other servers. There's no database to restore and no super complex recovery process if all you need is the data.
  • The proxy architecture for the most part scales out very well.  There’s little configuration needed, and jobs basically just balance across the proxies on their own.
  • VM restores are mostly intuitive (talking about an entire VM). And guest file recovery, if you know when the file was deleted, is also easy to do (just not always quick).
  • I love that they have PowerShell for almost everything. It blows my mind that in 2016 there are vendors still lacking PowerShell in a Windows environment. I guess the idea is that they provide APIs and that's good enough. I disagree, but anyway, kudos to Veeam for having PS (a quick example follows this list).
  • We get backup and replication in the same product. The two are different, and while you can back up and then restore to achieve a replication-type result, it's easier to just say "hey, I want to replicate this here".
  • It's reasonably affordable (although their price is starting to get up there). It's backup, and it's a tough win with management to get decent investment. Having something affordable is a big pro.
  • Veeam is very quick to have product updates after a new ESXi release, and I've found that they're pretty stable afterwards too. It's refreshing to see a vendor strive to support the latest and greatest as soon as possible.
  • For the most part, the product is very solid / reliable. I can't say I've seen major issues that were on Veeam's side. A few quirks here and there, but ultimately reliable. The copy jobs are probably the only gripe I have reliability-wise.
  • I like that they have application level recovery options baked into the product.  Things like restoring individual email items, AD objects, etc.
  • They try to stay on top of VMware issues related to backups, and come up with ways to work around them.  During the whole CBT debacle, I think they did a great job of helping out their customers.
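
To give a flavor of the PowerShell point above, here's the sort of thing that's possible. This assumes the Veeam B&R PowerShell snap-in from the v7/v8 era (cmdlet coverage varies by version), and the job name is made up.

# Load the snap-in (older Veeam versions; newer releases ship a module instead).
Add-PSSnapin VeeamPSSnapin

# Kick off a backup job by name and check the result.
$job     = Get-VBRJob -Name 'Prod-FileServers'
$session = Start-VBRJob -Job $job      # waits for the job to finish by default
$session.Result                        # Success / Warning / Failed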

What we didn’t know about, didn’t think about, and just in general don’t like about Veeam:

  1. It's rare nowadays to find good tech support, and Veeam is no exception. I know I'm not alone in my view; if you google objectively, you'll find tons of people going on about how you need to escalate as soon as humanly possible if you want any chance at getting decent support, and I find this to be true. Their level 1 support is fine for very simple things like "what's the difference between a reverse and forward incremental?", but beyond that, good luck getting any decent troubleshooting out of them. Even when you escalate to level 2, I contend that while they're certainly better than level 1, it doesn't inherently mean you're dealing with the caliber of person you need to solve a tricky problem. The biggest problem I've seen so far with Veeam is that they only know their own product well, and if it comes to troubleshooting anything related to backup in VMware, you're stuck opening a case with VMware and trying to juggle two vendors pointing fingers. Me, I would have expected much deeper expertise out of Veeam when it comes to the hypervisor they're trying to back up. Maybe they do have it and I just haven't been escalated enough, but having a level 2 person tell me they're not super familiar with reading VMware logs is disappointing.
  2. I honestly don’t feel like product suggestions are taken seriously unless they already align with something Veeam is working on.  I’d also add, there’s no good way to really see what features have been suggested already and to get an idea of how popular they are.  You’re basically stuck taking their word for it that a request is or is not a popular one.  One prime example of this is their SAN snapshot integration.  What a mess.
  3. We knew tape was bad before we got into Veeam; up until v9 (which I haven't looked at seriously yet), it didn't get much better in the two revisions that we had. It was for this reason that we were stuck keeping CommVault around. Even with version 9, they're still not good enough to copy non-Veeam data to tape, which to me is a big con of the product.
  4. We have a love / hate relationship with their backup file format. On one hand it's nice and easy to follow: each backup job is a single file each time it executes, so 1 full and 6 incs = 7 files. However, the con of this really starts to show when you have a server that's, say, 8TB. That means the minimum contiguous allocation you can get away with is 8TB. Now just imagine you have 3 jobs like that, plus a bunch of smaller ones. What's a better way, you ask? File sharding, and load balancing those file shards across multiple LUNs automatically (one of my favorite CommVault features). There are other issues with this file format too; even if they had to keep the shards in the same contiguous space, sharding would still have more pros than cons. For example, Windows dedupe works much better on smaller files. Try deduping an 8TB file vs. a 4GB file: one is not doable and the other is. Here's another one: try copying an 8TB file to tape without ever having a hiccup and needing to restart all over again.
  5. They don’t have global dedupe, and even worse, they don’t even dedupe inside a given job execution.  Each job execution is a unique dedupe container.  All that great disk saving you think you’ll get, will only come if you stack tons of VM’s into a single job, which ends up leading to problem 4.  Add to that, the only jobs that really see any kind of serious dedupe are fulls.
  6. VM backup of clustered applications is just bad, and it's made worse in Veeam's case when they don't have SAN snapshot support for your vendor. Excluding any specific application for the moment, to give you an idea of why it's bad, picture this scenario: you have a two-node clustered application and you're using an FSW for quorum. You've followed best practices and only back up the passive node. Windows patches roll around, and now you need to patch the active node, so you fail your cluster over to the passive node that you're backing up. In this scenario, it becomes problematic for two reasons.
    1. If you forget to disable your backups during the maintenance window, there is a high degree of probability that your cluster will go offline. All it takes is a backup kicking off while you're rebooting your other node. The stun of a clustered Windows server causes the cluster service to hiccup a lot, and if the current node is hiccuping while your other node is offline, that's not enough votes to satisfy quorum, and hence your cluster is going down for the count.
    2. Take the above point, but now imagine your secondary node is down for good. Maybe a Windows patch tanked, maybe your SAN bombed out. Doesn't matter. Now you're stuck with a single-node cluster, and one that needs a disruptive backup running (more important than ever now). It's not a good place to be in.
  7. So what about backing up standalone applications?  Well at least with Exchange 2007, we’ve seen the following issues.
    1. The snapshot removal process has caused the disk to go offline in Windows. You could blame this on VMware, and to a degree that's fair. Except that my entire backup solution is based on Veeam, and this is their only option to back up Exchange.
    2. I've seen data get corrupted in Exchange thanks to snapshot-based backups. Again, it's technically a VMware issue, but again, Veeam relies on VMware.
    3. I've seen Veeam end a job with a success status, only to discover my transaction logs are NOT truncating. Looking in the event viewer, you'll see "backup failed", but not according to Veeam. I opened a case about this, and Veeam basically blamed Microsoft. Again, see point 1 about finger pointing with Veeam's tech support.
    4. Backing up Exchange with Veeam is SLOW.  We’ve actually noticed that it gets really good if we reseed a database, but over time, as changes occure, as fragmentation occur, the jobs just get slower and slower.
  8. Veeam IMO lacks a healthy portfolio of supported SAN vendors for snapshot-based backup.  Take a look at Veeam's list, and then go take a look at CommVault's list.  That's pretty much all I have to say about that.
  9. Their SureBackup and Virtual Lab technology sounds great on paper, but if you have an even remotely complex network, trying to set up their NAT appliance is just a PITA.  I'm not knocking it per se, it's a nice feature, but it's not something that's simple or straightforward to set up in an enterprise environment IMO.
  10. Kind of a no duh, but they only back up virtual machines.  You can't call yourself an enterprise backup solution if you can't support a multitude of systems and services.  Need to back up EMC Isilon?  Not going to happen with Veeam.
  11. This is anecdotal (as most of my views are), but I just find their backup solution to be slow.  The product always blames my source for being the bottleneck, yet my source is able to deliver far faster throughput than what they're pulling.  I might get, say, 120MBps when my SAN can deliver 1,024MBps without breaking a sweat.  It's not a load issue either, as we've kicked off jobs during low-activity times.  I don't tend to see a high queue depth or high latency, so I'm not sure what's throttling it.  SIOC isn't kicking in either.  Again, perhaps it's just something about VMware-based backup.  Also worth noting that for the most part the IO is sequential, so it's not usually a random IO issue.  Although FYI, snapshot commits are almost entirely random IO (at least for Exchange).
  12. Opening large backup jobs can be really slow.  I have a few file servers that are 4TB+, and it can take a good minute, maybe two, just to open the backup file.  Now to be fair, we don't index the file system in Veeam, so I don't know if that would help.  Add to that, closing said backup file can also be slow.
  13. I have no relationship with a sales rep.  Maybe for some folks that's a good thing, but for me, when I spend money with a vendor, I think there should be some correspondence.  In the 3 years we've owned Veeam, the only time a sales person got in touch with me was because I was interested in looking at VeeamOne.  Never to check in and see how things were going, or anything like that.  Our environment probably isn't large from a Veeam view, so I'm not expecting lunches or monthly calls, but at least once a year would be reasonable.  I have no idea who our sales rep is, just to give you an idea of how out of touch they are, and I know most of my other vendors' sales folks by first name.
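
To make the quorum math in point 6 concrete, here's a minimal sketch in Python (my own toy, not anything from Microsoft or Veeam; the vote counts are just the two-node-plus-FSW scenario described above) of why a snapshot stun on the last healthy node takes the cluster down.

```python
# Hypothetical illustration of the two-node-cluster + file share witness (FSW)
# quorum math from point 6.  Not vendor code; just vote counting.

def has_quorum(votes_online: int, total_votes: int) -> bool:
    """A cluster stays up only while a majority of votes are reachable."""
    return votes_online > total_votes // 2

TOTAL_VOTES = 3  # two nodes + FSW

# Normal operation: both nodes and the witness respond.
print(has_quorum(3, TOTAL_VOTES))  # True

# Patch window: node B is rebooting, node A and the FSW still respond.
print(has_quorum(2, TOTAL_VOTES))  # True -- this is what you're counting on.

# Patch window + backup: node B is rebooting AND node A is stunned by a
# snapshot operation, so only the FSW answers.
print(has_quorum(1, TOTAL_VOTES))  # False -- the cluster goes down for the count.
```

The same counting explains the single-surviving-node case in point 6.2: one node plus the witness is exactly enough, so any backup-induced hiccup on that node drops you below a majority.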

I could probably keep going on, but I  think that’s enough for now.

Conclusion:

Veeam is a solution I might recommend if you're a small shop that's 100% virtual with modest needs.  If you're a full-scale enterprise, regardless of how virtualized you are, I wouldn't recommend Veeam at this stage, even with the v9 improvements.

I’ve certainly heard the stories of larger shops switching to Veeam.  I suspect a lot of that was due to the expense of CommVault or EMC.  I get it, backup is a hard line item to justify expenditures in.  That’s why you see vendors like Veeam, Unitrends and other cheaper solutions filling the gap so well.  Veeam in our case even looked like it might be a better solution.  It was simple, and affordable.  However, you soon realize that at scale, the solution doesn’t match the power of your tried and true “legacy” solution.  Maybe that solution costs more in capex, but I suspect the opex of running that legacy solution far outweighs the capex you’re saving.  In our case, thats an absolute truth.

I wouldn’t say run away from Veeam, but I would say think really hard before you ditch a tried and true enterprise application like CommVault, EMC Avamar, NetBackup, or Tivoli.    If you know me and follow my blog, you know I do have love for the underdog / cutting edge solutions.  This isn’t me looking down my nose at a newer player, this is me looking down my nose at an inferior solution.

Backup Storage Part 3a: Deduplication Targets

Deduplication and backup kind of go hand in hand, so we couldn’t evaluate backup storage and not check out this segment.  We had two primary goals for a deduplication appliance.

  1. Reduce rack space while enabling us to store more data.  As you saw in part 1, we had a lot of rack space being consumed.  While we weren't hurting for rack space in our primary DC, we were in our DR DC.
  2. We were hoping that something like a deduplication target would finally enable us to get rid of tape and replicate our data to our DR site (instead of sneakernet).

For those of you not particularly versed in deduplicated storage, there are a few things to keep in mind.

  • Backup deduplication and the deduplication you'll find running on high-performance storage arrays are a little different.  Backup deduplication tends to use either variable or much smaller block-size comparisons.  For example, your primary array might be looking for 32k blocks that are the same, whereas a deduplication target might be looking for 4k blocks that are the same.  That's a huge difference in deduplication potential (there's a small chunk-size sketch after this list).  The point is, just because you have deduplication baked into your primary array does not mean it's the same level of deduplication that's used in a deduplication target.
  • Deduplication targets normally include compression as well.  Again, it's not the same level of compression found in your primary storage array; it's typically a more aggressive (CPU-intensive) compression algorithm.
  • Deduplication targets tend to use in-line deduplication.  Not all do, but the majority of the ones I looked at did.  There are pros and cons to this that I'll go into later.
  • In all the appliances I looked at, every one of them had a primary access method of NFS/SMB.  Some of them also offered VTL, but the standard deployment method is for them to act as a file share.
  • Not all deduplication targets offer what's referred to as global deduplication.  Depending on the target, you may only deduplicate at the share level.  This can make a big difference in your deduplication rates.  A true global deduplication solution will deduplicate data across the entire target, which is ideal.
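
To illustrate the block-size point in the first bullet above, here's a rough sketch using synthetic data and naive fixed-size chunking (real dedupe targets use variable-length chunking and far smarter indexing; the `dedupe_ratio` helper is purely my own illustration).  It shows how a 4k comparison block finds duplicates that a 32k block misses.

```python
# Rough illustration of why comparison block size matters for deduplication.
# Synthetic data and naive fixed-size chunking; real dedupe targets typically
# use variable-length chunking and far more sophisticated indexing.
import hashlib
import random

def dedupe_ratio(data: bytes, chunk_size: int) -> float:
    """Logical size divided by unique size for fixed-size chunk dedupe."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks) / len(unique)

random.seed(0)
repeated = random.randbytes(4 * 1024)  # a 4k block that shows up over and over

# Simulate a backup stream: roughly half the 4k blocks are that repeated block,
# the rest are unique, and the order is shuffled so identical data rarely lines
# up on 32k boundaries.
stream = [repeated if random.random() < 0.5 else random.randbytes(4 * 1024)
          for _ in range(2048)]
random.shuffle(stream)
data = b"".join(stream)

print("4k chunks :", round(dedupe_ratio(data, 4 * 1024), 2))   # roughly 2:1
print("32k chunks:", round(dedupe_ratio(data, 32 * 1024), 2))  # barely above 1:1
```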

Now I’d like to elaborate a bit on the pros and cons of in-line vs post process (post process) deduplication.

Pros of In-Line:

  • As the name implies, data is deduplicated instantly as it's being ingested.
  • You don't need to worry about maintaining a buffer or landing zone space like post-process appliances do.
  • Once an appliance has seen the data (meaning it's getting a deduplication hit), writes tend to be REALLY fast since it's just metadata updates.  In turn, replication speed also goes through the roof.
  • You can start replication almost instantly, or in real time, depending on the appliance.  Post-process can't do this, because you need to wait for the data to be deduplicated.

Pros of Post Process:

  • Data written isn’t deduplicated right away, which means if you’re doing say a tape backup right afterwards, or a DB verification, you’re not having to rehydrate the data.  Basically they tend to deal with reads a lot better.
  • Some of them actually cache the data (un-deduplicated) so that restores and other actions are fast (even days later).
  • I know this probably sounds redundant, but random disk IO in general is much better on these devices.  A good use case example would be doing a Veeam VM verification.  So not only reads in general, but random writes.
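
For what it's worth, here's a bare-bones sketch of the difference between the two write paths.  It's a conceptual toy of my own, not how DataDomain, ExaGrid, or anyone else actually implements it: the inline target hashes before anything hits disk, while the post-process target lands everything raw and deduplicates on a later pass.

```python
# Conceptual toy contrasting inline vs. post-process deduplication write paths.
# Not how DataDomain, ExaGrid, or anyone else actually implements it.
import hashlib

class InlineTarget:
    def __init__(self):
        self.store = {}                  # fingerprint -> block (deduped pool)

    def write(self, block: bytes) -> None:
        fp = hashlib.sha256(block).digest()
        if fp not in self.store:         # new data: pay the full write
            self.store[fp] = block
        # duplicate data: just a metadata/reference update (the "fast" case)

class PostProcessTarget:
    def __init__(self):
        self.landing_zone = []           # raw, un-deduplicated (fast reads/restores)
        self.store = {}

    def write(self, block: bytes) -> None:
        self.landing_zone.append(block)  # always a full write; no hashing in the hot path

    def dedupe_pass(self) -> None:
        """Runs later on a schedule; replication of a file waits for this."""
        for block in self.landing_zone:
            self.store.setdefault(hashlib.sha256(block).digest(), block)
        self.landing_zone.clear()

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # one duplicate block
inline, post = InlineTarget(), PostProcessTarget()
for b in blocks:
    inline.write(b)
    post.write(b)

print(len(inline.store))        # 2 unique blocks stored immediately
print(len(post.landing_zone))   # 3 raw blocks sit on disk until...
post.dedupe_pass()
print(len(post.store))          # ...the scheduled pass shrinks them to 2
```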

Again, like most comparisons, you can draw the inverse of each device's pros to figure out its cons.  Anyway, on to the devices we looked at.

There were three names that kept coming up in my research: EMC's DataDomain, ExaGrid, and Dell.  It's not that they're the only players in town; HP, Quantum, Sepaton, and a few others all had appliances.  However, EMC and ExaGrid were well known, and we're a Dell shop, so we stuck with evaluating these three.

Dell DR series appliances (In-line):

After doing a lot of research, discussions, demos, the whole nine yards, it became very clear that Dell wasn't in the same league as the other solutions we looked at.  I'm not saying I wouldn't recommend them, nor am I saying I wouldn't reconsider them, but not yet, and not in the platform's current iteration.  That said, as of this writing, it's clear Dell is investing in this platform, so it's certainly worth keeping an eye on.

Below are the reasons we weren’t sold on their solution at the time of evaluation.

  • At the time, they had a fairly limited set of certified backup solutions.  We planned to dump SQL straight to these devices, and SQL wasn’t on the supported list.
  • They often compared their performance to EMC, except they were typically quoting their source-side deduplicated protocol vs. EMC's raw (unoptimized) throughput, meaning it wasn't an apples-to-apples comparison.  When you're planning on transferring 100TB+ of data on a weekly basis and not everything can use source-side deduplication, this makes a huge difference (see the quick ingest-time sketch after this list).  At the time we were evaluating, Dell was comparing their DR4100 vs. a DD2500.  The reality is, the Dell DR6100 is a better match for the DD2500.  Regardless, we were looking at the DD4200, so we were way above what Dell could provide.
  • They would only stand behind a 10:1 deduplication ratio.  Now this I don't have a problem with.  I'd much rather a vendor be honest than claim I can fit the moon in my pocket.
  • They didn’t do multi to multi replication.  Not the end of the world, but also kind of a bummer.  Once you pick a destination, that’s it.
  • Their deduplication was at the share level, not global.  If we wanted one share for our DBAs and one for us, there'd be no shared deduplication between them.
  • They didn't support snapshots.  Not the end of the world, but it's 2015; snapshots have kind of been a thing for 10+ years now.
  • Their source-side deduplication protocol was only really suited to Dell products.  Given that we weren't planning on going all-in with Dell's backup suite, this was a negative for us.
  • No one, and I mean no one was talking about them on the net.  With EMC or ExaGrid, it wasn’t hard at all to find some comments, even if they were negative.
  • They had a very limited amount of raw capacity (real usable capacity) on offer.  This is a huge negative when you consider that splitting off a new appliance means you just lose half or more of your deduplication potential.
  • There was no real analysis done to determine if they were even a good fit for our data.
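
To put the raw vs. source-side throughput point in perspective, here's a quick back-of-the-napkin calculation.  Every number in it (the 1.0 and 2.5 GB/s ingest rates, the 15% change rate) is an invented placeholder, not a spec for any of the appliances discussed here.

```python
# Back-of-the-napkin: how long does it take to land 100TB of weekly backups?
# Every number here is an illustrative placeholder, not a vendor spec.

def hours_to_ingest(data_tb: float, rate_gb_per_s: float) -> float:
    return (data_tb * 1024) / rate_gb_per_s / 3600

weekly_tb = 100

# Raw (unoptimized) ingest at two hypothetical appliance speeds.
print(round(hours_to_ingest(weekly_tb, 1.0), 1), "hours at 1.0 GB/s raw")   # ~28.4
print(round(hours_to_ingest(weekly_tb, 2.5), 1), "hours at 2.5 GB/s raw")   # ~11.4

# Source-side deduplicated protocol: assume only ~15% of the logical data
# actually crosses the wire (a made-up but plausible change rate).
print(round(hours_to_ingest(weekly_tb * 0.15, 1.0), 1), "hours source-side")  # ~4.3

# Quoting the source-side number against a competitor's raw number isn't
# apples to apples, and anything that can't use the optimized protocol is
# stuck with the raw rate.
```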

ExaGrid (Post-process):

I heard pretty good things about ExaGrid after having a chat with a former EMC storage contact of mine.  If EMC has one competitor in this space, it would be ExaGrid.  Like with Dell, we spent time chatting with them, researching what others said, and really just mulling over the whole solution.  It's kind of hard to place them solely in the deduplication segment, as they're also scale-out storage to a degree, but I think this is the more appropriate spot for them.

Pros:

  • Post-process is a bit of a double-edged sword.  One of the pros I outlined above is that data is not deduplicated right away.  This means we could use the device as both our primary and our archive backup storage.
  • The storage scaled out linearly in both performance and capacity.  I really liked the idea of not having to do a forklift upgrade if we grew out of our unit.
  • They had what I'll refer to as "backup specialists".  These were techs that were well versed in the backup software we'd be using with ExaGrid, in our case SQL and Veeam.  Point being, if we had questions about maximizing our backup app with ExaGrid, they'd have folks that know not just ExaGrid but the application as well.
  • The unit pricing wasn't simply a "let's get 'em in cheap and suck 'em dry later" play.  Predictable (fair) pricing was part of who they were.

Cons:

  • As I mentioned, post-process is a bit of a double-edged sword.  One of the big negatives for us was that their replication engine required waiting until a given file was fully deduplicated before replication could begin.  So not only did we have to wait, say, 8 hours for a 4TB file server backup, but then we had to wait potentially another 8 hours before replication could begin (rough math in the sketch after this list).  Trying to keep any kind of RPO with that kind of variable is tough.
  • While they "scale out" their nodes, they're not true scale-out storage IMO.
    • Rather than pointing a backup target at a single share and letting the storage figure everything out, we'd have to manually balance which backups go to which node.  With the number of backups we were talking about and the number of nodes there could be, this sounded like too much of a hassle to me.
    • The landing zone space (un-deduplicated storage) was not scale-out, and was instead pinned to the local node.
    • There is no node resiliency, meaning if you lose one node, everything is down, or at least everything on that node.  While I'm not in love with giving up two or three nodes for parity, at least having it as an option would be nice.  IIRC (and I could be wrong), this also affected the deduplication part of the storage cluster.
    • Individual nodes didn't have the best throughput IMO.  While it's great that you can aggregate multiple nodes' throughput, if I have a single 4TB backup, I need that to go as fast as possible, and I can't break it across multiple nodes.
  • I didn't like that the landing zone to deduplication zone ratio was manually managed on each node.  This just seemed to me like something that should be automated.

EMC DataDomain (Inline):

All I can say is there’s no wonder they’re the leader in this segment.  Just an absolutely awesome product overall.  As many who know me, I’m not a huge EMC (Expensive Machine Company) fan in general, but there area few areas they do well and this is one of them.

Pros:

  • Snapshots, file retention policies, ACLs: they have all the basic file server features you'd want and expect.
  • Multi-to-multi replication.
  • Very high throughput even for data that isn't source-optimized (DDBoost), and even better throughput when it is.
  • Easy to use (based on the demo), with an intuitive interface.
  • The ability to store huge amounts of data in a single unit.  At times a head swap may be required, but having the ability to simply swap the head is nice.
  • Source-based optimization baked into a lot of non-EMC products, SQL and Veeam in our case.
  • Archive storage as a secondary option for data not accessed frequently.
  • End-to-end data integrity.  These guys were the only ones that actually bragged about it.  When I asked the other vendors about it, they didn't exactly instill faith in their data integrity.
  • They actually analyzed all my backup data and gave me reasonably accurate predictions of what my dedupe rate would be and how much storage I’d need.  All in all, I can’t speak highly enough about their whole sales process.  Obviously everyone wants to win, but EMC’s process was very diplomatic, non-pushy and in general a good experience.

Cons:

  • EMC provided some great initial pricing for their devices, but any upgrades would be cost-prohibitive.  That said, I at least appreciate that they were up front with the upgrade costs so we knew what we were getting into.  If you go down this path yourself, my suggestion is to buy a lot more storage than you need.
  • They treat archive storage and backup storage differently, and it needs to be manually separated.  For the price you pay for a solution like this, I'd like to think they could auto-tier the data.
  • They license à la carte.  It's not like there's even a slew of options; I don't get why they don't make things all-inclusive.  It's easier for the customer and it's easier for them.
  • In general, the device is super expensive.  Unless you plan on storing 6+ months of data on the device, I'd bet you could do better with large cheap disks, or even something like a disk-to-tape tiering solution (SpectraLogic Black Pearl).  Add to that, unless your data deduplicates well, you'll also be paying through the nose for storage (see the effective-cost sketch after this list).
  • Going off the above statement, if you're only keeping a few weeks' worth of data on disk, you can likely build a faster solution dollar for dollar than what's offered by them.
  • No cloud option for replication.  I was specifically told they see AWS as competition, not as a partner. Maybe this will change in the future, but it wasn’t something we would have banked on.
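
To show why dedupe ratio makes or breaks the economics, here's a simple effective-cost sketch.  The appliance price, capacity, and the bulk-disk comparison figure are all invented for illustration; they are not quotes for DataDomain or anything else.

```python
# Effective cost per logical TB of a dedupe appliance depends heavily on the
# dedupe ratio your data actually achieves.  All figures below are invented.

def effective_cost_per_tb(price: float, usable_tb: float, dedupe_ratio: float) -> float:
    """Price divided by the logical data you can realistically land on it."""
    return price / (usable_tb * dedupe_ratio)

appliance_price = 250_000.0   # hypothetical appliance + licensing
appliance_usable_tb = 100.0   # hypothetical usable capacity

for ratio in (2, 5, 10, 20):
    cost = effective_cost_per_tb(appliance_price, appliance_usable_tb, ratio)
    print(f"{ratio:>2}:1 dedupe -> ${cost:,.0f} per logical TB")

# Against plain bulk disk at a hypothetical $100-150 per usable TB, the
# appliance only starts to look cheap once the ratio climbs well past 10:1.
```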

All in all, the deduplication appliances were fun to evaluate.  However, cutting to the chase, we ended up not going with any of these solutions.  As for ROI, these devices are too specialized, and too expensive for what we were looking to accomplish.  I think if you’re looking to get rid of tape (and your employer is on board), EMC DataDomain would be my first stop.  Unfortunately, for our needs, tape was staying in the picture, which meant this storage type was not a good fit.

Next up, scale out storage…

Backup Storage Part 2: What we wanted

In part 1, I went over our legacy backup storage solution and why it needed to change.  In this section, I'm going to outline what we were looking for in our next storage refresh.

Also, just to give you a little more context: while evaluating storage, we were also looking at new backup solutions, including CommVault, Veeam, EMC (Avamar and Networker), NetVault, Microsoft DPM, and a handful of cloud solutions.  Point being, there were a lot of moving parts and a lot of things to consider.  I'll dive more into this in the coming sections, but I wanted to let you know it was more than just a simple storage refresh.

The core of what we were seeking is outlined below.

  1. Capacity: We wanted capacity, LOTS of capacity.  88TB of storage wasn't cutting it.
  2. Scalability: Just as important as capacity, and obviously related, we needed the ability to scale both capacity and performance.  It didn't always have to be easy, but we needed it to be doable.
  3. Performance: We weren't looking for 100K IOPS, but we were looking for multiple GBps of throughput (that's bytes, with a capital B).
  4. Reliable/Resilient: The storage needed to be reasonably reliable; we weren't looking for five nines or anything crazy like that.  However, we didn't want to be down for a day at a time, so something with resiliency built in, or a really great on-site warranty, was highly desired.
  5. Easy to use: There are tons of solutions out there, but not all of them are easy to use.
  6. Affordable: Again, there’s lots of solutions out there, but not all of them are affordable for backup storage.
  7. Enterprise support: Sort of related to easy to use, but not the same, we needed something with a support contract.  When the stuff hits the fan, we needed someone to fallback on.

At a deeper level, we also wanted to evaluate a few storage architectures.  Each one meets aspects of our core requirements.

  1. Deduplication Targets:  We weren't as concerned about deduplication's ability to store lots of redundant data on few disks (storage is cheap), but we were interested in the side effect of really efficient replication (in theory).
  2. Scale-out Storage:  What we were looking for out of this architecture was the ability to scale nearly limitlessly.
  3. Cloud Storage:  We liked the idea of our backups being off-site (and getting rid of tape).  In theory, it too was easy to scale (for us).
  4. Traditional SAN / NAS:  Not much worth explaining here, other than that this would be a reliable fallback if the other three architectures didn't pan out.

With all that out there, it was clear we had a lot of conflicting criteria.  We all know that something can't be fast, reliable, and affordable all at once.  It was apparent that some concessions would need to be made, but we hadn't figured out what those were yet.  However, after evaluating a lot of vendors and solutions and passing quotes along to management, things started to become much clearer towards the end.  That is something I'm going to go over in my upcoming sections.  There is too much information to force it all into one section, so I'll be breaking it out further.

-Eric

Backup Storage Part 1: Why it needed to change

Introduction:

This is going to be a multi-part series where I walk you through the whole process we took in evaluating, implementing, and living with a new backup storage solution.  While it's not a perfect solution, given the parameters we had to work within, I think we ended up with some very decent storage.

Backup Storage Part 1: The “Why” it needed to change

Last year my team and I began a project to overhaul our backup solution, and part of that involved researching some new storage options.  At the time, we ran what most folks on a budget would: a server with some local DAS.  It was a Dell 2900 (IIRC) with six MD1000s and one MD1200.  The solution was originally designed to manage about 10TB of data, and was never really expected to handle what ultimately ended up being much more than that.  The solution was less than ideal for a lot of reasons, which I'm sharing below.

  • It wasn’t just a storage unit, it also ran all the backup server (CommVault) on top of it, and our tape drives were  locally attached as well.  Basically a single box to handle a lot of data and ALL aspects of managing it.
  • The whole solution was a single point of failure and many of its sub-components were singe points of failure.
  • This solution was 7 years old, and had been grown organically, one JBOD at a time.  This had a few pitfalls:
    • JBODs were daisy chained off each other.  Which meant that while you added more capacity, and spindles, the throughput was ultimately limited to 12Gbps for each chain (SAS x4 3G).  We only had two chains for the md1000’s and one chain / JBOD for the md1200.
    • The JBODs were carved up into independent LUN’s, which from CommVaults view was fine (awesome SW), but it left potential IOPS on the table.  So as we added JBODS the IOPS didn’t linearly increase per say.  Sure the “aggregate” IOPS increased, but a single job, is now as limited to the speed of a 15 disk RAID 6.  Instead of the potential of say a 60 disk RAID 60.
  • The disks at the time were fast (for SATA drives, that is), but compared to modern NL-SAS drives they had much lower throughput and density.
  • The PCI bus and the FSB (this server still had an FSB) were overwhelmed.  Remember, this was doing tape AND disk copies.  I know a lot of less seasoned folks don't think it's easy to overwhelm a PCI bus, but that's not actually true (more on that later), even more so when your PCI bus is version 1.x.
  • This solution consumed a TON of rack space; each JBOD was 3U and we had seven of them (the MD1200 was 2U).  And each drive was only 1TB, so best case with a 15-disk RAID 6, we were looking at 13TB usable per JBOD.  By today's standards, this TB per RU is terrible even for tier 1 storage, let alone backup.
  • We were using RAID 6 instead of 10.  I know what some of you are thinking: it's backup, so why would you use RAID 10?  Backups are probably every bit as disk-intensive as your production workloads, and likely more so.  On top of that, we force them to do all their work in what's typically a very constrained window of time.  RAID 6, while great for sequential / random reads, does horribly at writes in comparison to RAID 10 (I'm excluding fancy file systems like CASL from this generalization).  Unless you're running one backup at a time, you're likely throwing tons of parallel writes at this storage, and while each stream may be sequential in nature, the aggregate looks random to the storage (quick write-penalty math in the sketch after this list).  At the end of the day, disk is cheap, and cheap disk (NL-SAS) is even cheaper, so splurge on RAID 10.
    • This was also compounded by the point I made above about one LUN per JBOD.
  • It was locked into what IMO is a less-than-ideal Dell (LSI) RAID card.  Again, I know what some of you are thinking: "HW" RAID is SO much better than SW RAID.  It's a common and pervasive myth.  EMC, NetApp, HP, IBM, etc. are all simple x86 servers with really fancy SW RAID.  SW RAID is fine, so long as the SW that's doing the RAID is good.  In fact, SW RAID is not only fine, in many cases it's FAR better than a HW RAID card.  Now, I'm not saying LSI sucks at RAID, they're fine cards, but SW has come a long way, and I really see it as the preferred solution over HW RAID.  I'm not going to go into the WHYs in this post, but if you're curious, do some research of your own until I have time to write another "vs" article.
  • Using Dell, HP, IBM, etc. for bulk storage is EXPENSIVE compared to what I'll call 2nd tier solutions.  Think of this as somewhere in between Dell and home-brewing your own solution.
    • Add to this that manufacturers only want you running "their" disks in "their" JBODs, which means not only are you stuck paying a lot for a false sense of security, you're also incredibly limited in what options you have for your storage.
  • All the HW was approaching EOL.
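
To put numbers behind the RAID 6 vs. RAID 10 point above, here's a quick write-penalty estimate.  The disk count and per-disk IOPS are assumptions for illustration; the penalty factors (2 for RAID 10, 6 for RAID 6 small writes) are the standard rule-of-thumb values.

```python
# Rule-of-thumb small-write penalties: RAID 10 = 2 I/Os per write,
# RAID 6 = 6 I/Os per write (read data, read P, read Q, then write all three).
# Disk count and per-disk IOPS are assumptions for illustration only.

def random_write_iops(disks: int, iops_per_disk: int, write_penalty: int) -> int:
    return disks * iops_per_disk // write_penalty

DISKS = 60           # e.g. four 15-disk JBODs pooled together
IOPS_PER_DISK = 75   # rough figure for a 7.2k SATA/NL-SAS drive

print("RAID 10:", random_write_iops(DISKS, IOPS_PER_DISK, 2), "random write IOPS")  # ~2250
print("RAID 60:", random_write_iops(DISKS, IOPS_PER_DISK, 6), "random write IOPS")  # ~750

# Sequential streams fare better, but as noted above, a pile of parallel
# "sequential" backup streams looks random to the array, so the penalty bites.
```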

There’s probably a few more reason why this solution was no longer ideal, but you get the point.  The reality is, our backup solution was changing, and it was a perfect time to re-think our storage.  In part 2, we’ll get into what we thought we wanted, what we needed, and what we had budget for.