Review: 5 years with CommVault

Introduction:

Backup and recovery is a rather dry topic, but it’s an important one. After all, what’s more critical to your company than its data? You can have the best products in the world, but if disaster strikes and you don’t have a good solution in place, it can make your recovery painful or even impossible. Still, many companies shirk investment in this segment. The good solutions (like the one I’m about to discuss) cost a pretty penny, and that’s capital that needs to be balanced against technology that makes or saves your company money. As a result, insurance (and that’s what backup is) typically sits at the back of companies’ minds.

Finding the right product in this segment can be a challenge, not only because every vendor tries to convince you that they’ve cracked the nut, but because it seems like all the good solutions are expensive.  Like many, our budget was initially constrained.  We had an old investment in CV (CommVault), but had not reinvested in it over the years, and needed a new solution.  We initially chose a more affordable Veeam + Windows Storage Spaces to handle our backup duties.  It was a terrible mistake, but you know, sometimes you have to fail to learn, and so we did.

After putting up with Veeam for a year, we threw in the towel and went back to CV with open arms. Our timing was great too, as Veeam had put a serious hurt on CV’s business and some of CV’s licensing had changed to accommodate that. We ultimately ended up with much better pricing than when we last looked at CV, and on top of that, we actually found their virtualization backup to be more affordable and in many ways more feature rich. CV isn’t perfect, as I’ll outline below, but they’re pretty much as close as you can get to perfection for a product that is the Swiss Army knife of backup.

CommVault Terms:

For those of you not super familiar with CV, you’ll find the following terms useful for understanding what I’m talking about.  There are a lot more components in CV, but these are the fundamental ones.

  • MA (Media Agent): Simply put, it’s a data mover.  It copies data to disk, tape, cloud, etc.
  • Agent: A client that is installed to back up an application or OS.
  • VSA (Virtual Server Agent): A client specially designed for virtualization backup.
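  • CC (CommCell): The central server that manages all the jobs, history, reporting, configuration, etc. (CommVault’s own name for this server is the CommServe; the CommCell is the environment it manages.) This is the brains of the whole operation.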

Our Environment:

  • We have five MA’s.
    • Two virtual MA’s that back up to a Quantum QXS SAN (DotHill). This was done because we were reusing an old pair of VM hosts and have a few other non-CV backup components running on these hosts.
      • The SAN has something like two pools of 80 disks. Not as fast as we’d like, but more than fast enough.  The QXS (DotHill) was our replacement for Storage Spaces.  Overall, better than Storage Spaces, but a lot of room for improvement.  The details of that are for another review.
    • Two physical MA’s with DAS. Each MA has 80 disks in a RAID 60, and yeah, it rips from a disk performance perspective 🙂 Multiple GBps.
    • One physical MA that’s attached to our tape library.
  • We have five VSA’s, I’ll go more into this, but we’re not using five because I want to.
  • We have one CC, although we’ll be rolling out a second for resiliency and failover soon.
  • We have a number of agents
    • Several MS Exchange
    • Several MS Active Directory
    • Several Linux
    • The rest are file server / OS image agents.
  • In total, we have about a PB of total backup capacity between our SAN and DAS, but not all of that is consumed by CV (most is though).
  • We only use compression right now, no dedupe.
  • We only use active fulls (real fulls) not synthetics

Pros:

  • Backup:
    • CV can back up practically anything, and also has a number of application specific agents. You can back up your entire enterprise with their solution. I would contend that with CV, there are very few cases where you’d need point tools anymore. Desktops, servers, virtualization, various applications and NAS devices are all systems that can be backed up by CV. Honestly, it’s hard to find a solution that is as comprehensive as theirs. That being said, I can imagine you’re wondering: if they do it all, can they do it well? I would say mostly. I have some deltas to go over a little farther down, but they do a lot, and do a lot of it well. It’s one of the reasons the solution was (and still is) expensive.
    • I went from having to babysit backups with Veeam to having a solution that I almost never have to think about (other than swapping tapes). There were some initial pains as we learned CV’s way of doing virtualization backup, but we quickly got to a stable state.
  • Deployment / Scalability:
    • CommCell has a great deployment model that works well in single office locations all the way to globally distributed implementations. They’re able to accomplish all of this with a single pane of glass, which a number of vendors can’t claim to do.
    • Besides the size of the deployment, you’re not forced into using Windows only for most components of CV. A lot of the roles outlined above run on Linux or Windows.
    • CV is software based, and best of all, it’s an application that runs on an OS you’re already comfortable with (Linux / Windows). Because of this, the HW that you deploy the solution on is really only limited by minimum specs, budget and your imagination. You can build a powerful and affordable solution on simple DAS, or you can go crazy and run on NVMe / all flash SANs. It also works in the cloud because again, it’s just SW inside a generic OS. I can’t tell you how many backup solutions I looked at that had zero cloud deployment capabilities.
    • There are so many knobs to turn in this solution that it’s pretty tough to run into a situation you can’t tune for (there are a few though). Most of the out of box defaults are fine, but you’ll get the best performance when you dig in and optimize. Some find this overwhelming and I’ll chat more about that in the cons, but with CV’s great support and reading their documentation, it’s not as bad as it sounds. Ultimately the tunability is an incredible strength of this solution. I’ve been able to increase backup throughput from a few hundred MBps to a few GBps simply by changing the IO size that CV uses.
  • Support:
    • Overall, they have fantastic support. Like any vendor’s support, it can vary, and CV is no different. Still, I can count on one hand the number of times support was painful, and even in those cases we ultimately got the issue resolved.
    • For the most part, support knows the application they’re backing up pretty well. I had a VMware backup issue that we first ran into with Veeam and that continued with CV. CV, while not being able to directly solve the problem, provided significantly more data for me to hand off to VMware, which ultimately led to us finding a known issue. CV analyzed the VMware logs as best they could and found the relevant entries that they suspected were the issue. Veeam was useless.
    • Getting CV issues fixed is something else that’s great about CV. No vendor is perfect; that’s what hotfixes and service packs are for. CV has an amazing escalation process. I went from a bug to a hotfix that resolved the issue in under two weeks.
    • My experience with their support’s response time is fantastic. I rarely go more than a few hours without hearing from them. They’re also not afraid to simply call you and work on the problem in real time. I don’t mind email responses for simple questions, but when you’re running into a problem, sometimes you just want someone to call you and hash it out. I also like that most of the time you get the tech’s direct number if you need to call them.
  • Feature requests: A little hit or miss, but feature requests tend to get taken seriously with CV, especially if it’s something pretty simple.
  • Value: This one is a mixed bag. Thanks to Veeam eating their lunch, virtualization backup with CV has never been a better value. I could be wrong, but I actually think virtualization backup in CV rings in at a significantly lower price than Veeam. I would say at least 50% of our backups are virtualization. It’s our default backup method unless there is a compelling reason to use agents. This is ultimately what made CV an affordable backup solution for us. We were able to leverage their virtualization backup for most of our stuff, and utilize agents for the few things that really needed to be backed up at a file level or application level. The virtualization backup entitles you to all their premium features, which is why I think it’s a huge value add. That being said, I have some stuff to touch on in the cons with regards to the value.
  • Retention Management: Their retention management is a little tricky to get your head around, but it’s ultimately the right way to do retention. Their retention is based on a policy, not on the number of recovery points. You configure things like how many days of fulls you want and how many cycles you need. I can take a bazillion one off backups and not have to worry about my recovery history being prematurely purged.
  • Copy management: They manage copies of data like a champ. Mix it with the above point, and you have all kinds of copies with different retentions and different locations, and it all works rock solid. You have control over what data gets copied. So if your source data has all your VM’s and you only want a second copy of select VM’s, that’s not a problem for them. Maybe you want dedupe on some, compression on others, some on tape, some on disk, some on cloud; again, no issue at all.
  • Ahead of the curve: CV seems to be the most forward thinking when it comes to backup / recovery destinations and sources.  They had our Nimble SAN’s certified for backup LONG before our previous vendor.  They support all kinds of cloud destinations, the ability to recover VM’s from physical to virtual, virtual to cloud, etc.  This goes back to the holistic approach that I brought up.  They do a very good job of wrapping everything up, and creating a flexible ecosystem to work with.  You typically don’t need point solutions with them.
  • Storage Management: I love their disk pools and the way they store their backup data. First and foremost, it’s tunable, so if you want 512MB files or any other file size, it’s an option. They shard the data across disks, etc. Frankly the way they store data is a no brainer. They also move jobs / data pretty easily from one disk to another, which is great. This type of flexibility is not only helpful for things like making it easier to fit your data on disparate storage, but also for ensuring your backups can easily be copied to unreliable destinations. Having to recopy a 512MB file is a lot better than having to recopy an 8TB file. CV can take that 8TB file if you want, and break it up into chunks of various sizes (default is 2GB).
  • Policies: Most things are defined using policies: schedules, retention, copies, etc. Not everything, but most things. This makes it easy to establish standards for how things should act, and it also makes it easier to change things.
  • CLI: They have a ton of capability with their CLI / API.  Almost anything can be executed or configured.  I actually developed a number of external work flows which call their CLI and it works well.
  • Tape Management:
    • They handle tapes like a librarian, minus the Dewey Decimal system. Seriously though, I haven’t worked with a solution that makes handling tapes as easy as they do.
    • If you happen to use Iron Mountain, they have integration for that too.
    • They’re pretty darn efficient with tape usage as well, which is mostly thanks to their “global copy” concept. We still have some white space issues, but it makes sense why.
    • They are very good at controlling tape drive and parallel job management. This allows you to balance how many tape drives are used for what jobs.
  • Documentation: They document everything, and for good reason, there is a lot their product does. This includes things like advanced features and most of the special tuning knobs as well.  It’s not always perfect, but it’s typically very good.
  • Recovery:
    • File level recovery from tape for VM backups, without having to recover the whole file, need I say more? That means if I need one file off an 8TB backup VMDK, I don’t have to restore 8TB first.
    • Most application level backups offer some level of item level recovery. It’s not always straightforward, or quick, but it’s usually possible.
    • They’re smart with how they restore data. You can pick where you want the data recovered from (location and copy), and if it does need tapes, it tells you exactly what tapes you need.  No more throwing every single tape in and hoping that’s all you need.
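To put rough numbers on the Storage Management point above (recopying one small chunk beats recopying an 8TB file), here is a minimal Python sketch. It is not how CV implements anything. The function names and the simple "one failure, redo the chunk that was in flight" model are assumptions for illustration, and decimal units are assumed; only the 8TB, 2GB and 512MB figures come from this review.

```python
import math

# Decimal units are an assumption made for this example.
MB = 1000**2
GB = 1000**3
TB = 1000**4


def chunk_count(total_bytes, chunk_bytes):
    """Number of chunks a backup is split into."""
    return math.ceil(total_bytes / chunk_bytes)


def recopy_cost(total_bytes, chunk_bytes, failed_at_byte):
    """Bytes that must be sent again after one failure at failed_at_byte.

    Hypothetical model: completed chunks are safe at the destination,
    and only the chunk that was in flight has to be redone.
    """
    if not 0 <= failed_at_byte < total_bytes:
        raise ValueError("failure offset must fall inside the backup")
    # Everything before the start of the interrupted chunk is already safe.
    safe_bytes = (failed_at_byte // chunk_bytes) * chunk_bytes
    return failed_at_byte - safe_bytes


if __name__ == "__main__":
    failure = 7 * TB + 1500 * MB
    for label, chunk in (("single 8TB file", 8 * TB), ("2GB chunks", 2 * GB)):
        lost = recopy_cost(8 * TB, chunk, failure)
        print(f"{label}: {chunk_count(8 * TB, chunk)} chunk(s), "
              f"{lost / GB:.1f} GB to resend")
```

Under this model, a copy that dies seven eighths of the way through an unchunked 8TB file starts over from zero, while the chunked copy only repeats the part of one chunk that was in flight.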

Cons:

  • Backup:
    • Virtualization:
      • Their VMware backup in many ways isn’t as tunable as it should be. There are places where they don’t have stream limits where they really need them. For example, they lack a stream limit on the vCenter host, the ESXi host or even the VSA doing the backup. It’s honestly a little strange, as CV seems to offer a never-ending number of stream controls for other areas of their product. I bring this up as probably my number one issue with their VMware backup, and it was the source of most of our initial problems with their solution. I would still say this is a glaring hole in their virtualization backup. I just looked up their CV11 SP7 and nothing has changed with regards to this, which is disappointing to say the least. This is one area that I think Veeam handles much better than them.
      • The performance of NBD (management network only) based backup is bluntly terrible. The only way we could get really good performance out of their product was to switch to hot add. Typically speaking I hate hot add for VMware backup. It takes forever to mount disks, and it makes the setup of VM backup more complicated than it needs to be. Not to mention if you do have an issue during the backup process (like vCenter dying), the cleanup of the backup is horrible.
      • They don’t pre-tune the VSA for hot add: things like disabling disk initialization in Windows and whatnot.
      • Their inline compression throughput was also atrocious at first. We had to switch the algorithm used which fixed the issue, but it required a non-gui tweak to achieve and me asking if there was anything else they could do.  It was actually timely that the new algorithm had been released as experimental in the release we just upgraded to.
      • Their default VM dispatch to me is less than ideal. Instead of balancing VM’s in a least load method across the VSA’s, they pick the VSA closest to the VM or datastore.  I needed to go in and disable all of this.
    • Deployment / Scalability:
      • While I applaud their flexibility, the one area that I think still needs work is their dedupe. To me, they really need to focus on building a DataDomain level of solution that can scale to petabytes of logical data in a single media agent, and right now they can’t scale that big.  It seems like you need to have a bunch of mid sized buckets which is better than nothing, but still not as ideal as it should be.
      • Deployment for CV newbies is not straightforward. You’ll definitely need professional services to get most of the initial setup done, at least until you have time to familiarize yourself with it. You’ll also need training so that you actually know how to care for and grow the solution. I think CV could do a better job with perhaps implementing a more express setup just to get things up, and maybe even have a couple of intro / how to videos to jump start the setup. It’s complicated because of its power, but I don’t think it needs to be. The knobs and tuning should be there to customize the solution to a person’s environment, but there should be an easy button that suits most folks out of the box.
    • Support: In general I love their support, but there are times where I’m pretty confident the folks doing the support don’t have at scale experience with the product. There are times when I’ve tried explaining the scaling issue we were having, and they couldn’t wrap their heads around it. They also tend to get wrapped up in the “this is the way it works” and not in the “this is the way it SHOULD work”, which again I think comes back to experience with the product at scale. This would tend to happen more when I was trying to explain why I set something up in a particular way that didn’t match their norm. For example, with VM backups, they like to pile everything into subclients. For a number of reasons I’m not going to go into in this blog post, that doesn’t work for us, and frankly it shouldn’t work for most folks. I was able to punch holes in their design philosophy, but they were stuck on “this is the way it is”. The good news is you can typically escalate over shortsighted techs like this and get to someone who can think outside the box.
    • Value: This is a tough one. On one hand, I want good support and a feature rich product, but on the other hand, the cost of agent based backup is frankly stupid expensive. When my backup product costs more per TB than my SAN, that’s an issue. It’s one of the primary reasons we push towards VM based backups, as it’s honestly the only way we could afford their product. Even with huge discounts, the cost per TB is insane with their solution. In some cases, I would almost rather have a per agent cost than a per TB cost. I could see how that could get out of control, but I think each licensing model works better for some companies than others. If I had thousands of servers, I could see where the per TB model might make more sense. This is one of the reasons we don’t back up SQL directly with CV; it just costs too much per TB. It’s cheaper for us to use a (still too expensive) file based agent to pick up SQL dump files.
    • Storage Management: Once data is stored on its medium, moving it off isn’t easy. If you have a mountpoint that needs to be vacated, you need to either aux copy data to a new storage copy, manually move the data to another mountpoint, or wait till it ages out. They really should have an option in their storage pool to simply right click the mountpoint and say “vacate”. This operation would then move all data/jobs to whatever mountpoints are left in the pool, similar to VMware’s SDRS. I would like to see this ability at an MA level as well.
    • CLI: I’ll knock any vendor that doesn’t have a PowerShell module, and CV is one of those vendors. Again, I’m glad that they have API’s, but in an enterprise where Windows rules the house, PowerShell should be a standard CLI option.
    • Tape Management: As much as I think they do it better than anyone else, they could still improve the white space issue.  I almost think they need a tapering off setting.  Perhaps maybe even a preemptive analysis of the optimal number of tapes and tape drives before the start of each new aux copy, and re-analyze that each time you detect more data that needs to be copied to tape.  This way it could balance copy performance with tape utilization.  Maybe even define a range of streams that can be used.
    • Documentation: As great as their documentation is, it needs someone to really organize it better.  Taking into account the differences in CV versions.  I realize it’s probably a monumental task, but it can be really hard to find the right document to the right version of what you’re looking for.  I’ve also found times where certain features are documented in older CV version docs, but not in newer ones (but they do exist).  I guess you could argue at least they have so much documentation that it’s just hard to find the right one, vs. not having any doc at all.  When in doubt though, I contact support and they can generally point me in the right direction, or they’ll just answer the question.
    • Recovery:
      • Item level recovery that’s application based really needs a lot of work. One thing I’ll give Veeam is that they seem to have a far more feature rich and intuitive application item level recovery solution than CV.
        • Restoring Exchange at an item level is slow and involved (lots of components to install). I honestly still haven’t gotten it working.
        • AD item level recovery is incredibly basic and honestly needs a ton of work.
        • Linux requires a separate appliance, which IMO it shouldn’t. If Linux admins can write tools to read NTFS, why can’t a backup vendor write a Windows tool that can natively mount and read EXT3/4, ZFS, XFS, UFS, etc.?
      • P2P, V2V / P2V leaves a lot to be desired. If you plan to use this method, make sure you have an ISO that already works.  Otherwise you’ll be scrambling to recover bare metal when you need to.

Conclusion:

Despite CommVault’s cons, I still think it’s the best solution out there. It’s not perfect in every category, and that’s a typical problem with most do it all solutions, but it’s pretty damn good at most. It’s an expensive solution, and it’s complicated, but if you can afford it and invest the time in learning it, I think you’ll fall in love with it, at least as much as one can with a backup tool.

Problem Solving: CommVault tape usage

Introduction:

I hate dealing with tapes, pretty much every aspect of them.  The tracking of them is a PITA, having to physically manage them is a PITA, dealing with tape library issues is a PITA, dealing with tape encryption is a PITA, running out of tapes is a PITA, dealing with legal hold for tapes is a PITA, and I could keep going on with the many ways that tape just sucks.  What makes matters worse is when you have to deal with MORE tapes.

Now that you know tapes are one of my personal seven levels of hell in IT, you’ll know why I put a bit of time into this solution.  Anything I can do to reduce the number of tapes getting exported every day, ultimately leads to some reduction in the PITA scale of tapes.

The issue:

To provide a better understanding of the issue at hand, for years I’ve been seeing way too many tapes being used by CV. We’d kick out tapes that had 5% or 10% consumption, and the number of tapes with that level of consumption varied based on what phase of our backup strategy we were in, and what day of the week it was. It could be anything from as few as 4 partially filled tapes to times where we had 10+ tapes that weren’t filled all the way up. If the consumed data should fit on 16 tapes, and we’re kicking out 26 tapes, that’s a problem IMO. I’m sure many of you out there have contended with this in CV specifically, and I’d bet those of you using other vendors’ products have run into this too. I’m going to first explain why the problem is occurring, and then I’ll go over how I’ve reduced most of the waste.

The Why?

In CV, we have storage policies, and rather than going into a full explanation of what they are for those not familiar with CV, just think of a storage policy as an island of backup data. That island of data doesn’t co-mingle with other islands of data on disk, and tape is no exception. What that means is when you back up data to a storage policy and want to copy it to tape, the data getting copied will automatically reserve the entire tape being used. In turn, each storage policy reserves its own unique tapes so that data does not co-mingle. This means you’re guaranteed at least one unique tape per storage policy.

Now, each storage policy can have a number of streams configured. To keep things simple, let’s just ignore multiplexing for now. When a storage policy has a stream limit of 1, only 1 tape drive will be used; when it has a stream limit of 4, up to 4 tape drives will be used. As you copy data to tape, you normally have more than 1 stream’s worth of data; you probably have at least one for each client in your environment (and likely many more than that). This is a good thing, as having more streams means we can run data copy operations in parallel. In the case of the 4 stream example, we can use 4 tape drives in parallel to copy data for that storage policy. What this also means is that, depending on circumstances, we could end up with 4 tapes not being filled all the way. Streams are optimized for performance, NOT for improving tape utilization.

Now, imagine you have more than one storage policy, let’s say 4 storage policies, each being its own island, and each with a stream limit of 2. That means you could end up with up to 8 tapes not being fully utilized. I’m also ignoring for now that in CV, you can separate incrementals and fulls into different storage policies, which exacerbates the problem further (taking one island and making it two).

In our case, we have 4 storage policies, and we had already gone through a process of merging our Fulls and Incs into a single storage policy to consolidate tapes. We have a total of 6 tape drives, which means if we just configured the storage policies to fight over the tape drives @6 streams each, we could in theory end up with 24 partially filled tapes. We’re smarter than that of course, so that wasn’t our problem. Our problem was finding the right balance between how many streams a storage policy needed to copy all its data in our window, and not making it so high that we ended up wasting tape. Pre-solution, we almost always had 4 – 6 tapes that were wasted, as in 100GB on a 2000GB tape. It was annoying and wasteful.
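The arithmetic behind those numbers is simple enough to sketch in Python. This is only a model of the "island" behavior described above, not anything CV exposes; the function names are made up for the example, and multiplexing is ignored just as it is in the text.

```python
def worst_case_partial_tapes(storage_policies, streams_per_policy):
    """Upper bound on tapes that can be left partially filled.

    Model: every stream writes to its own tape, and storage policies
    never share tapes, so the bound is simply the product.
    """
    if storage_policies < 0 or streams_per_policy < 0:
        raise ValueError("counts cannot be negative")
    return storage_policies * streams_per_policy


def tape_utilization(used_gb, capacity_gb):
    """Fraction of a tape that actually holds data."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return used_gb / capacity_gb


if __name__ == "__main__":
    # 4 islands with a stream limit of 2 each
    print(worst_case_partial_tapes(4, 2))
    # 4 islands fighting over the drives at 6 streams each
    print(worst_case_partial_tapes(4, 6))
    # 100GB written to a 2000GB tape
    print(f"{tape_utilization(100, 2000):.0%}")
```

The two calls reproduce the 8 and 24 tape figures from the text, and the last line shows that a 100GB tape out of 2000GB is only 5% used.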

Solution, problems again, improved solution:

There are two main components to the solution.

  • Scripting storage policy stream modification via task scheduler (MVP JAMS in our case).
  • CommVault introducing Global Tape Policies in v11
    • This allows tapes to be shared, no longer residing on an island as mentioned above. So storage policy 1, 2, 3 and 4 can all share the same tape.  Way more efficient.

In our case, when we saw the global tape policy, it was like a halo of light and angels singing going off in our heads. This was it, our problems were FINALLY solved. After going through the very tedious task of migrating to this solution, we found that we were still using 4 – 6 tapes a day more than we needed. The problem was not that data was not co-mingling; it was. No, the problem was that we set the global tape policy to 6 streams, and every day it was using 6 tape drives for backups. At first we tried to solve the problem by limiting the aux copy streams via a scheduled task in CV (start the job with 1 stream only, as an example), but we had 4 storage policies, so that only reduced the tape usage to 4. The problem again was that the storage policies were scheduled and run in parallel. So while we restricted any one storage policy, ultimately we were still letting more tape drives be used than needed, and in turn more tapes than were needed. We had set 6 streams because we wanted to make sure that our FULL jobs had enough tape drives to complete over the weekend.

At this stage, I came to the conclusion that we needed a way to dynamically control the streams for the global tape policy so that during the week days it was restricted to 1 tape drive (all we needed) and on the weekend we could start out with 6 and slowly ramp back down to 1, and hopefully fill our tapes more fully. With a bit of research and some discussions with CV, I found out that they have a CLI option for controlling storage policy streams (documented at https://documentation.commvault.com/commvault/v10/article?p=features/storage_policies/storage_policy_xml_edit.htm). Using my trusty scheduling tool, I set up a basic system where on Sunday @4PM we set the streams to “1”, on Friday @4PM we raise them to “6”, and on Saturday @7am we drop them to “2”. This basically solved our problem, and I’m happy to say that on week days, tapes are filled as much as is possible (1 – 2 tapes depending on which client ran a full), and on the weekend, 2 – 4 tapes are still being used. I’m still tuning the whole thing for the fulls (it’s a balance of utilization and performance), but it’s better than it’s ever been. It’s also worth noting that we went back and modified our aux copy schedules and told them to use all available streams, since the choke point is now the global tape policy. This allows any storage policy to go as fast as possible (although potentially blocking other ones).
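For anyone wanting to reason about a schedule like this, here is a minimal Python sketch of the decision logic only. It answers "how many streams should the global tape policy have at this moment?" using the three transition times above. The names (SCHEDULE, streams_for) are made up for the example, and this is an illustration rather than the actual scheduler job. Pushing the value into CV through the XML edit documented at the link above is deliberately left out, since the exact command syntax isn’t covered here.

```python
from datetime import datetime

# (weekday, hour, minute, streams); Monday is 0, as in datetime.weekday().
# The three entries mirror the schedule described in the post.
SCHEDULE = [
    (6, 16, 0, 1),  # Sunday 4PM: back down to a single tape drive
    (4, 16, 0, 6),  # Friday 4PM: open up all six drives for the fulls
    (5, 7, 0, 2),   # Saturday 7AM: start ramping down
]


def _minute_of_week(weekday, hour, minute):
    """Minutes elapsed since Monday 00:00."""
    return weekday * 24 * 60 + hour * 60 + minute


def streams_for(when):
    """Stream limit that should be in effect at the datetime `when`."""
    now = _minute_of_week(when.weekday(), when.hour, when.minute)
    transitions = sorted(
        (_minute_of_week(d, h, m), streams) for d, h, m, streams in SCHEDULE
    )
    # Before the first transition of the week, the last transition of the
    # previous week is still in effect (the schedule wraps around).
    current = transitions[-1][1]
    for start, streams in transitions:
        if start <= now:
            current = streams
    return current


if __name__ == "__main__":
    print(streams_for(datetime.now()))
```

A scheduled job could call streams_for once per transition (or every few minutes) and only touch the storage policy when the returned value differs from what is currently set. Adding more ramp down steps is just a matter of adding rows to SCHEDULE.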

It’s a hack no doubt.  IMO, CV should develop this concept in their storage policies.  Basically creating a schedule window to dynamically control the queue depth.  For now, this is working well.

Problem Solving: Chasing SQL’s Dump

The Problem:

For years as an admin I’ve had to deal with SQL. At a former employer, our SQL environment / databases were small, and backup licensing was based on agents, not capacity. Fast forward to my current employer: we have a fairly decent sized SQL environment (60 – 70 servers), our backups are large, licensing is based on capacity, and we have a full time DBA crew that manages its own backup schedules and prefers to manage backups itself. What that means is dealing with a ton of dumps. Read into that as you want 🙂

When I started at my current employer, the SQL server backup architecture was kind of a mess. To begin with, there were about 40 – 50 physical SQL servers back then, so when you’re picturing all of this, keep that in mind. Some of these issues don’t go hand in hand with physical design limitations, but some do.

  • DAS was used not only for storing the SQL log, DB and index, but also for backups. Sometimes, if the SQL server was critical enough, we had dedicated disks for backups, but that wasn’t typical. This of course is a problem for many reasons.
    • Performance of not only backups but the SQL service itself was often limited because they were sharing the same disks. So when a backup kicked off, SQL was reading from the same disks it was attempting to write to. This wasn’t as big of an issue for the few systems that had dedicated disks, but even there, sometimes they were sharing the same RAID card, which meant you’re still potentially bottlenecking one for the other.
    • Capacity was spread across physical servers. Some systems had plenty of space and others barely had enough. Islands are never easy to manage.
    • If that SQL server went down, so did its most recent backups. TL backups were also stored here (shudders).
    • Being a dev shop meant doing environment refreshes. This meant creating and maintaining share / NTFS permissions across servers. This by itself isn’t inherently difficult if it’s thought out ahead of time, but it wasn’t (not my design).
    • We were migrating to a virtual environment, and that virtual environment would be potentially vMotioning from one host to another.  DAS was a solution that wouldn’t work long term.
  • The DBA’s managed their backup schedules, so it required us all to basically estimate the best time to pick up their DB’s. Sometimes we were too early and sometimes we could have started sooner.
  • Adding to the above points, if we had a failed backup overnight, or a backup that ran long, it had an effect on SQL’s performance during production hours. This put us in a position of choosing between giving up on backing some data up, or having performance degradation.
  • We didn’t know when they did fulls vs diffs, which means we might be storing their DIFF files on what we considered “full” backup tapes. By itself that’s not an issue, except for the fact that we did monthly extended fulls, meaning we kept the first full backup of each month for 90 days. If the file we’re keeping is a diff file, that doesn’t do us any good. However, you can see below why it wasn’t as big of an issue in general.
  • Finally, the problem that I contended with besides all of these is that because they were just keeping ALL files on disk in the same location, every time we did a full backup, we backed EVERYTHING up. Sometimes that was 2 weeks’ worth of data: TL’s, Diffs and Fulls. This meant we were storing their backup data multiple times over on both disk and tape.

I’m sure there’s more than a few of you out there with similar design issues.  I’m going to lay out how I worked around some of the politics and budget limitations.  I wouldn’t suggest this solution as a first choice, it’s really not the right way to tackle it, but it is a way that works well for us, and might for you.  This solution of course isn’t limited to SQL.  Really anything that uses a backup file scheme could fit right into this solution.

The solution:

I spent days’ worth of my personal time while jogging, lifting, etc. just thinking about how to solve all these problems.  Some of them were easy and some of them would be technically complex, but doable.  I also spent hours with our DBA team collaborating on the rough solution I came up with, and honing it to work for both of us.

Here is basically what I came to the table with wanting to solve:

  • I wanted SQL dumping to a central location, no more local SQL backups.
  • The DBA’s wanted to simplify permissions for all environments to make DB refreshing easier.
  • I wanted to minimize or eliminate storing their backup data twice on disk.
  • I wanted them to have direct access to our agreed upon retention without needing to involve us for most historical restores.  Basically giving them self service recovery.
  • I wanted to eliminate backing up more data than we needed.
  • I wanted to know for sure when they were done backing up and what type of backup they performed.

Honestly we needed the fix, as the reality was we were moving towards virtualizing our SQL infrastructure, and presenting local disk on SAN would be both expensive and incredibly complex to contend with for 60+ SQL servers.

How we did it:

Like I said, some of it was an easy fix and some of it was more complex.  Let’s break it down.

The easy stuff:

Backup performance and centralization:

We bought an affordable backup storage solution.  At the time of this writing it was and still is Microsoft Windows Storage Spaces.  After making that mistake, we’re now moving on to what we hope is a more reliable and mostly simpler Quantum QXS (DotHill) SAN using all NL-SAS disks.  Point being, instead of having SQL dump to local disk, we set up a fairly high-performance file server cluster.  This gave us both high availability and, with the HW we implemented, very high performance as well.

New problem we had to solve:

Having something centralized means you also have to think about the possibility of needing to move it at some point.  Given that many processes would be written around this new network share, we needed to make sure we could move data around on the backend, update some pointers, and have things carry on without needing to make massive changes.  For that, we relied on DFS-N.  We had the SQL systems point at DFS shares instead of pointing at the raw share.  This is going to prove valuable as we move data very soon to the new SAN.

Reducing multiple disk copies and providing them direct access to historical backups:

The backup storage was sized to store ALL required standard retention, and we (SysAdmins) would continue managing extended retention using our backup solution.  For the most part this now means the DBAs have access to the data they need 99% of the time.  This solved the problem of storing the data more than once on disk, as we would no longer store their standard retention in CommVault, but instead rely on the SQL dumps they already are storing on disk (except extended retention).  They still get copied to tape and sent off site, in case you thought that wasn’t covered BTW.

Simplifying backup share permissions:

The DBA’s wanted to simplify permissions, so we worked together and basically came up with a fairly simple folder structure.  We used the basic configuration below.

  • SQL backup root
    • PRD <—- DFS root / direct file share
      • example prd SQL server 1 folder
      • example prd SQL server 2 folder
      • etc.
    • STG <—– DFS root / direct file share
      • example stg SQL server 1 folder
      • etc.
    • etc.
  • Active Directory security group wise we set it up so that all prod SQL servers are part of a “prod” active directory group, all stage are part of a “stage” active directory group, etc.
  • The above AD groups were then assigned at the DFS root (Stg, prd, dev, uat) with the desired permissions.

With this configuration, it’s now as simple as dropping a SQL service account in one group, and it will automatically fall into the correct environment level permissions.  In some cases it’s more permissive than it should be (prod has access to any prod server for example), but it kept things simple, and in our case, I’m not sure the extra security of per server / per environment really would have been a big win.

The harder stuff:

The only two remaining problems we had to solve were knowing what kind of backup the DBAs did, and making sure we were not backing up more data than we needed.  These were also the two most difficult problems to solve because there wasn’t a native way to do it (other than agent based backup).  We had two completely disjointed systems AND processes that we were trying to make work together.  It took many miles of running for me to put all the pieces together and it took a number of meetings with the DBAs to figure things out.  The good news is, both problems were solved by aspects of a single solution.  The bad news is, it’s a fairly complex process, but so far, it’s been very reliable.  Here’s how we did it.

 The DONE file:

Everything in the work flow is based on the presence of a simple file, what we refer to as the “done” file internally.  This file is used throughout the work flow for various things, and it’s the key in keeping the whole process working correctly.  Basically the workflow lives and dies by the DONE file.  The DONE file was also the answer to our knowing what type of backup the DBAs ran, so we could appropriately sync our backup type with them.

The DONE file follows a very rigid naming convention.  All of our scripts depend on this, and frankly naming standards are just a recommended practice (that’s for another blog post).

Our naming standard is simple:

%FourDigitYear%%2DigitMonth%%2DigitDay%_%24Hour%%Minute%_%JobName(usually the sql instance)%_%backuptype%.done

And here are a few examples:

  • Default Instance of SQL
    • 20150302_2008_ms-sql-02_inc.done
    • 20150302_2008_ms-sql-02_full.done
  • Stg instance of SQL
    • 20150302_2008_ms-sql-02stg_inc.done
    • 20150302_2008_ms-sql-02stg_full.done
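To make the naming standard concrete, here is a minimal sketch of building and parsing a done file name.  It is written in Python purely for illustration (the scripts in this workflow are PowerShell and are not shown here), and the names `DONE_PATTERN`, `DoneFile`, `build_done_name` and `parse_done_name` are hypothetical.  The pattern is inferred from the examples above, so treat it as an assumption.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed layout, inferred from the examples above:
#   YYYYMMDD_HHMM_<job name>_<full|inc>.done
DONE_PATTERN = re.compile(r"^(\d{8})_(\d{4})_(.+)_(full|inc)\.done$")


class DoneFile(NamedTuple):
    stamp: datetime      # date and time encoded in the file name
    job_name: str        # usually the SQL instance
    backup_type: str     # "full" or "inc"


def build_done_name(stamp: datetime, job_name: str, backup_type: str) -> str:
    """Build a done file name that follows the naming standard."""
    if backup_type not in ("full", "inc"):
        raise ValueError(f"unknown backup type: {backup_type!r}")
    return f"{stamp:%Y%m%d_%H%M}_{job_name}_{backup_type}.done"


def parse_done_name(name: str) -> DoneFile:
    """Split a done file name into its parts, rejecting anything off-standard."""
    match = DONE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid done file name: {name!r}")
    date_part, time_part, job_name, backup_type = match.groups()
    # strptime raises ValueError for impossible values such as month 13.
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return DoneFile(stamp, job_name, backup_type)
```

Because every script keys off this one name, a strict parser like this fails loudly on a typo instead of quietly treating a full as an inc.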

The backup folder structure:

Equally as important as the done file is our folder structure.  Again, because this is a repeatable process, everything must follow a standard or the whole thing falls apart.

As you know we have a root folder structure that goes something like this: “\\ShareRoot\Environment\ServerName”.  Inside the servername root I create four folders and I’ll explain their use next.

  • .\Servername\DropOff
  • .\Servername\Queue
  • .\Servername\Pickup
  • .\Servername\Recovery

Dropoff:  This is where the DBAs dump their backups initially.  The backups sit here and wait for our process to begin.

Queue:  This is a folder that we use to stage / queue the backups before the next phase.  Again I’ll explain in greater detail.  But the main point of this is to allow us to keep moving data outside of the Dropoff folder to a temp location in the queue folder.  You’ll understand why in a bit.

Pickup:  This is where our tape jobs are configured to look for data.

Recovery:  This is the permanent resting place for the data until it reaches the end of its configured retention period.
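Since the same four folders have to exist for every job, creating them is worth scripting.  A minimal sketch, assuming Python and a hypothetical helper name `create_job_folders`; the folder names come from the list above, everything else is illustrative.

```python
from pathlib import Path

# The four per-server folders described above, in workflow order.
STAGE_FOLDERS = ("DropOff", "Queue", "Pickup", "Recovery")


def create_job_folders(server_root):
    """Create the four stage folders under one server's root, if missing."""
    root = Path(server_root)
    created = []
    for name in STAGE_FOLDERS:
        folder = root / name
        # exist_ok makes the call safe to re-run against an existing layout.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Usage would look something like `create_job_folders(r"\\ShareRoot\PRD\ms-sql-02")`, where the path is only an example of the share layout.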

Stage 1: SQL side

Prerequisites:

  1. SQL needs a process that can check the Pickup folder for a done file, delete a done file and create a done file.  Our DBA’s created a stored procedure with parameters to handle this, but you can tackle it however you want, so long as it can be executed in a SQL maintenance plan.
  2. For each “job” in sql that you want to run, you’ll need to configure a “full” maintenance plan to run a full backup, and if you’re using SQL diffs, create an “inc” maintenance plan.  In our case, to try and keep things a little simple, we limited a “job” to a single SQL instance.

SQL maintenance plan work flow:

Every step in this workflow stops on an error.  There is NO continue or ignore.

  1. First thing the plan does is check for the existence of a previous DONE file.
    1. If a DONE file exists, it’s deleted and an email is sent out to the DBAs and sysadmins informing them.  This is because it’s likely that a previous process failed to run.
    2. If a DONE file does not exist, we continue to the next step.
  2. Run our backup, whether it’s a full or inc.
  3. Once complete, we then create a new done file in the root of the PickupFolder directory.  This will either have a “full” or “inc” in the name depending on which maintenance plan ran.
  4. We purge backups in the Recovery folder that are past our retention period.
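The four steps above can be sketched roughly as follows.  This is a Python illustration of the logic only, not the DBAs’ stored procedure or maintenance plan: `sql_side_run`, `run_backup` and `notify` are hypothetical names, `signal_dir` stands for whichever folder the done file lives in, and using file modification time for the purge is an assumption.

```python
import time
from pathlib import Path


def sql_side_run(signal_dir, recovery_dir, done_name, run_backup, notify,
                 retention_days, now=None):
    """Run one backup cycle; any exception stops the workflow at that step."""
    signal_dir, recovery_dir = Path(signal_dir), Path(recovery_dir)
    now = time.time() if now is None else now

    # Step 1: a leftover done file suggests the previous run was never picked up.
    stale = sorted(signal_dir.glob("*.done"))
    for leftover in stale:
        leftover.unlink()
    if stale:
        notify("Removed stale done file(s): " + ", ".join(f.name for f in stale))

    # Step 2: run the backup (full or inc, depending on the plan calling us).
    run_backup()

    # Step 3: only now signal that the dumps are complete.
    (signal_dir / done_name).touch()

    # Step 4: purge expired dumps, which happens only after a good backup.
    cutoff = now - retention_days * 86400
    purged = []
    for item in sorted(recovery_dir.rglob("*")):
        if item.is_file() and item.stat().st_mtime < cutoff:
            item.unlink()
            purged.append(item.name)
    return purged
```

The ordering is the point: the done file is written after the backup and before the purge, so a failed backup never signals completion and never deletes old dumps.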

SQL side is complete.  That’s all the DBA’s need to do.  The rest is on us.  From here you can see how they were able to tell us whether or not they ran a full via the done file.  You can also glean a few things about the workflow.

  1. We’re checking to see if the last backup didn’t process
  2. We delete the done file before we start a new backup (you’ll read why in a sec).
  3. We create a new DONE file once the backups are done.
  4. We don’t purge any backups until we know we had a successful backup.

Stage 1: SysAdmin side

Our stuff is MUCH harder, so do your best to follow along and let me know if you need me to clarify anything.

  1. We need a stage 1 script created, and that script will do the following in sequential order.
    1. It will need to know what job it’s looking for.  In our case with JAMS, we named our JAMS jobs based on the same pattern as the done file.  So when the job starts the script reads information from the running job and basically fills in all the parameters like the folder location, job name, etc.
    2. The script looks for the presence of ANY done file in the specific folder.
      1. If no done file exists, it goes into a loop, and checks every 5 minutes (this minimizes slack time).
      2. If a done file does exist we…
        1. If there is more than one, we fail, as we don’t know for sure which file is correct.  This is a fail-safe.
        2. If there is only one, we move on.
    3. Using the “_” in the done file, we make sure that it follows all our standards.  So for example, we check that the first split is a date, the second is a time, the third matches the job name in JAMS and the fourth is either an inc or full.  A failure in any one of these, will cause the job to fail and we’ll get notified to manually look into it.
    4. Once we verify the done file is good to go, we now have all we need to start the migration process.  So the next thing we do is use the date and time information, to create a sub-folder in the Queue folder.
    5. Now we use robocopy to mirror the folder structure to the .\Queue\Date_Time folder.
    6. Once that’s complete, we move all files EXCEPT the done file to the Date_Time folder.
    7. Once that’s complete, we then move the done file into said folder.
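Here is a rough sketch of the stage 1 steps above.  It is Python for illustration only (the real script is PowerShell driven by JAMS and is not reproduced here); the function names are hypothetical, the job name is passed in as a plain parameter instead of being read from JAMS, and plain directory creation stands in for the robocopy mirror.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path


def wait_for_done_file(folder, poll_seconds=300, sleep=time.sleep):
    """Poll every 5 minutes until exactly one done file shows up."""
    while True:
        found = sorted(Path(folder).glob("*.done"))
        if len(found) > 1:
            raise RuntimeError("more than one done file, refusing to guess")
        if found:
            return found[0]
        sleep(poll_seconds)


def validate_done_file(done, expected_job):
    """Split on "_" and check each piece; return (date_time, backup_type)."""
    parts = done.stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected done file name: {done.name}")
    date_part, time_part, job, backup_type = parts
    datetime.strptime(date_part, "%Y%m%d")   # first split must be a date
    datetime.strptime(time_part, "%H%M")     # second split must be a time
    if job != expected_job:
        raise ValueError(f"done file is for {job!r}, expected {expected_job!r}")
    if backup_type not in ("full", "inc"):
        raise ValueError(f"unknown backup type: {backup_type!r}")
    return f"{date_part}_{time_part}", backup_type


def stage_to_queue(source, queue_root, expected_job):
    """Move one finished backup set from the source folder into Queue\\Date_Time."""
    source = Path(source)
    done = wait_for_done_file(source)
    stamp, _ = validate_done_file(done, expected_job)
    target = Path(queue_root) / stamp
    target.mkdir(parents=True)               # fails if this set was already queued
    for folder in sorted(p for p in source.rglob("*") if p.is_dir()):
        (target / folder.relative_to(source)).mkdir(parents=True, exist_ok=True)
    for item in [p for p in source.rglob("*") if p.is_file() and p != done]:
        shutil.move(str(item), str(target / item.relative_to(source)))
    # The done file goes last, so its presence means the set is complete.
    shutil.move(str(done), str(target / done.name))
    return target
```

Note that splitting on “_” means a job name containing an underscore would fail validation, which is one more reason the naming standard has to be rigid.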

And that completes stage 1.  So now you’re probably wondering, why wouldn’t we just move that data straight to the pickup folder? A few reasons.

  • When the backup to tape starts we want to make sure no new files are getting pumped into the pickup folder.  You could say well just wait until the backup’s done before you move data along.  I agree and we sort of do that, but we do it in a way that keeps the pickup folder empty.
    • By moving the files to a queue folder, if our tape process is messed up (not running) we can keep moving data out of the pickup folder into a special holding area, all the while still being able to keep track of the various backup sets (each new job would have a different date_timestamp folder in the queue folder).  Our biggest concern is missing a full backup.  Remember, if the SQL job sees a done file, it deletes it.  We really want to avoid that if possible.
    • We ALSO wanted to avoid a scenario where we were moving data into a queue folder while the second stage job tried to move data out of the queue folder.  Again, by having an individual queue folder for each job, this allows us to keep track of all the moving pieces and make sure that we’re not stepping on toes.

Gotcha to watch out for with moving files:

If you didn’t pick up on it, I mentioned that I used robocopy to mirror the directory structure, but I did NOT mention using it for moving the files.  There’s a reason for that.  Robocopy’s move parameter actually does a copy + delete.  As you can imagine with a multi-TB backup, this process would take a while.  I built a custom “move-files” function in PowerShell that does a similar thing, and in that function I use the “Move-Item” cmdlet, which on the same volume is a simple pointer update.  MUCH faster as you can imagine.
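The same idea, sketched in Python rather than PowerShell: a rename within one volume only updates directory entries, so it takes about the same time for a 1 KB file as for a multi-TB dump.  The helper name `rename_move` is hypothetical, and this is an analogue of the approach, not the custom function mentioned above.

```python
import os
from pathlib import Path


def rename_move(src, dst_dir):
    """Move a file by renaming it, never by copying its contents."""
    src, dst_dir = Path(src), Path(dst_dir)
    dst = dst_dir / src.name
    if dst.exists():
        # Refuse to overwrite; a clash should be looked at by a human.
        raise FileExistsError(f"{dst} already exists")
    # os.rename is expected to raise OSError when dst is on another volume,
    # rather than silently falling back to copy + delete.
    os.rename(src, dst)
    return dst
```

The trade-off is that this only works while source and destination live on the same volume, which is one reason to keep all four stage folders on the same share.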

Stage 2: SysAdmin Side

We’re using JAMS to manage this, and with that, this stage does NOT run, unless stage 1 is complete.  Keep that in mind if you’re trying to use your own work flow solution.

Ok so at this point our pickup directory may or may not be empty, doesn’t matter, what does matter is that we should have one or more jobs sitting in our .\Queue\xxx folder(s).  What you need next is a script that does the following.

  1. When it starts, it looks for any “DONE” file in the queue folder.  Basically doing a recursive search.
    1. If one or more files are found, we do a foreach loop for each done file found and….
      1. Mirror the directory structure using robocopy from queue\date_time to the PickupFolder
        1. Then move the backup files to the Pickup folder
        2. Move the done file to the Pickup Folder
        3. We then confirm the queue\date_time folder is empty and delete it.
        4. ***NOTE:  Notice how we look for a DONE file first.  This allows stage 1 to be populating a new Queue sub-folder while we’re working on this stage without inadvertently moving data that’s in use by another stage.  This is why there’s a specific order to when we move the done file in each stage.
    2. If NO done files are found, we assume maybe you’re recovering from a failed step and continue on to….
  2. Now that all files (dumps and done) are in the pickup folder we….
    1. Look for all done files.  If any of them are full, the job will be a full backup.  If we find NO fulls, then it’s an inc.
    2. Kick off a backup using a CommVault script.  Again parameters such as the path, client, subclient, etc. are all pulled from JAMS in our case or already present in CommVault.  We use the job type determined in step 2\1 to decide what we’ll execute.  Again, this gives the DBAs the power to control whether a full backup or an inc is going to tape.
    3. As the backup job is running, we’re constantly checking the status of the backup, about once a minute using a simple “while” statement.  If the job fails, our JAMS solution will execute the job two more times before letting us know and killing the job.
    4. If the job succeeds, we move on to the next step.
  3. Now we follow the same moving procedure we used above, except this time, we have no queue\date_time folder to contend with.
    1. Move the backup files from Pickup to the Recovery folder.
    2. Move the done files
    3. Check that the Pickup folder is empty
      1. If yes, we delete and recreate it.  Reason?  Simple, it’s the easiest way to deal with a changing folder structure.  If a DBA deletes a folder in the DropOff directory, we don’t want to continue propagating a stale object.
      2. If not, we bomb the script and request manual intervention.
  4. If all that works well, we just completed our backup process.
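Two decisions in stage 2 are worth spelling out: which queue folders are safe to touch, and whether the tape job should be a full or an inc.  A minimal Python sketch under the same assumptions as before (hypothetical function names, done files named per the standard); the CommVault call itself is deliberately left out.

```python
from pathlib import Path


def ready_queue_folders(queue_root):
    """Queue sub-folders that already hold a done file, oldest first.

    Stage 1 moves the done file last, so a folder without one is still
    being filled and must be left alone.
    """
    ready = {done.parent for done in Path(queue_root).rglob("*.done")}
    # Folder names are Date_Time stamps, so a plain sort is chronological.
    return sorted(ready)


def decide_backup_type(pickup):
    """Return "full" if any done file in the pickup folder is a full, else "inc"."""
    types = [done.stem.rsplit("_", 1)[-1] for done in Path(pickup).glob("*.done")]
    return "full" if "full" in types else "inc"
```

Since several queued sets can land in the pickup folder together, a single full among them promotes the whole tape job to a full.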

Issues?

You didn’t think I was going to say it was perfect did you?  Hey, I’m just as hard on myself as I am on vendors.  So here is what sucks with the solution.

  1. For the longest time, *I* was the only one that knew how to troubleshoot it.  After a bit of training, and running into issues though, my team is mostly caught up on how to troubleshoot.  Still, this is the issue with home brewed solutions, and being entirely scripted doesn’t help.
  2. Related to the above, if I leave my employer, I’m sure the script could be modified to serve other needs, but it’s not easy, and I’m sure it would take a bit of reverse engineering.  Don’t get me wrong, I commented the snot out of the script, but that doesn’t make it any easier to understand.
  3. It’s tough to extend.  I know I said it could be, but really, I don’t want to touch it unless I have to (other than parameters).
  4. When we do UAT refreshes, we need to disable production jobs so the DBAs have access to the production backups for as long as they need.  It’s not the end of the world, but it requires us to now be involved at a low level with development refreshes, whereas before there wasn’t any involvement on our side.
  5. We’ve had times where full backups have been missed tape side.  That doesn’t mean they didn’t get copied to tape, rather they were considered an “inc” instead of being considered a “full”.  This could easily be fixed simply by having the SQL stored procedure check if the done file that’s about to be deleted is a full backup and if so, replace it with a new full DONE file, but that’s not the way it is now, and that depends on the DBAs.  Maybe in your case, you can account for that.
  6. We’ve had cases where the DBAs do a UAT refresh and copy a backup file to the recovery folder manually.  When we go to move the data from the pickup folder to the recovery folder, our process bombs because it detects that the same file already exists.  Not the end of the world for sure, easy enough to troubleshoot, but it’s not seamless.  An additional workaround to this could be to do an md5 hash comparison.  If the file is the same, just delete it out of the pickup directory and move on.
  7. There are a lot of jobs to define and a lot of places to update.
    1. In JAMS we have to create 2 jobs + a workflow that links them per SQL job.
    2. In CommVault we have to define the sub-client and all its settings.
    3. On the backup share, 4 folders need to be created per job.
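The md5 workaround suggested in issue 6 could look something like the sketch below.  This is not part of the current process; it is a Python illustration with hypothetical names (`file_digest`, `move_or_dedupe`), and it assumes that an identical hash means the copy in the pickup folder is safe to drop.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """MD5 of a file, read in chunks so large dumps don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_or_dedupe(src, dst):
    """Move src to dst, or drop src if an identical file is already there."""
    src, dst = Path(src), Path(dst)
    if not dst.exists():
        src.rename(dst)
        return "moved"
    # Comparing sizes first avoids hashing two huge files that obviously differ.
    same = (src.stat().st_size == dst.stat().st_size
            and file_digest(src) == file_digest(dst))
    if same:
        src.unlink()
        return "duplicate-removed"
    raise FileExistsError(f"{dst} exists with different content")
```

Hashing a multi-TB dump is not free, so this check would only run on a name clash, not on every move.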

Closing thoughts:

At first glance I know it’s REALLY convoluted looking.  A Rube Goldberg for sure.  However, when you really start digging into it, it’s not as bad as it seems.  In essence, I’m mostly using the same workflow multiple times and simply changing the source / destination.  There are places, for example when I’m doing the actual backup, where there’s more than the generic process being used, but it’s pretty repetitive otherwise.

In our case, JAMS is a very critical piece of software to making this solution work.  While you can do this without the software, it would be much harder for sure.

At this point, I have to imagine that you’re wondering if this is all worth it.  Maybe not to companies with deep pockets.  And being honest, this was actually one of those processes that I did in house and was frustrated that I had to do it.  I mean really, who wants to go through this level of hassle, right?  It’s funny, I thought THIS would be the process I was troubleshooting all the time, and NOT Veeam.  However, this process for the most part has been incredibly stable and resilient.  Not bragging, but it’s probably because I wrote the workflow.  The operational overhead I invested saved a TON of capex.  Backing up SQL natively with CommVault has a list price of 10k per TB, before compression.  We have 45TB of SQL data AFTER compression.  You do the math, and I’m pretty sure you’ll see why we took the path we did.  Maybe you’ll say that CommVault is too expensive, and to some degree that’s true, but even if you’re paying 1k per TB, and you’re being pessimistic and assuming that 45TB = 90TB before compression, I saved 90k + 20% maintenance each year.  And CommVault costs nowhere near as little as 1k per TB, so really, I saved a TON of bacon with the process.
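For anyone who wants to plug in their own numbers, the math above works out roughly like this.  A small Python sketch with a hypothetical function name; the 2:1 compression ratio is the pessimistic guess from the paragraph above, and treating the license as a one-time cost with maintenance charged yearly on top is an assumption about how the pricing is structured.

```python
def native_agent_cost(compressed_tb, compression_ratio, price_per_tb,
                      maintenance_pct=20):
    """Estimate license and yearly maintenance for agent-based SQL backup.

    Licensing is assumed to be priced on data size BEFORE compression.
    """
    raw_tb = compressed_tb * compression_ratio
    license_cost = raw_tb * price_per_tb
    yearly_maintenance = license_cost * maintenance_pct / 100
    return raw_tb, license_cost, yearly_maintenance


# 45TB compressed, assumed 2:1 compression, at a hypothetical 1k per TB:
#   90TB raw, 90,000 license, 18,000 maintenance per year
print(native_agent_cost(45, 2, 1_000))

# Same data at the 10k per TB list price:
#   90TB raw, 900,000 license, 180,000 maintenance per year
print(native_agent_cost(45, 2, 10_000))
```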

Besides the cost factor, it’s also enabled us to have a real grip on what’s happening with SQL backups.  Before it was this black box that we had no real insight into.  You could contend that’s a political issue, but then I suspect lots of companies have political issues.  We now know that SQL ran a full backup 6 days ago.  We now have our backup workflow perfectly coordinated.  We’re not starting too early, and we’re kicking off within 5 minutes of them being done, so we’re not dealing with slack time either.  We’re making sure that our backup application + backup tape is being used in the most prudent way.  Best of all, our DBAs now have all their dump files available to them, their environment refreshes are reasonably easy, the backup storage is FAST, and we have backups centralized and not stored with the server.  All in all, the solution kicks ass in my not so humble opinion.  Would I have loved to do CommVault natively?  For sure, no doubt it’s ultimately the best solution, but this is a compromise that allowed us to continue using CommVault, save money and accomplish all our goals.

Backup Storage Part 5: Realization of a failure

No one likes admitting they’re wrong, and I’m certainly no different.  Being a mature person means being able to admit you’re wrong, even if it means doing it publicly, and that is what I’m about to do.

I’ve been writing this series slowly over the past few months, and during that time, I’ve noticed an increasing number of instances where the NTFS volumes on my Storage Spaces virtual disks would go corrupt.  Basically, I’d see Veeam errors writing to our repository, and when investigating, I would find files not deleting (old VBKs).  When trying to manually delete them, they would either throw some error, or they would act like they were deleted (they’d disappear), but then return only a second later.  The only way to fix this (temporarily) was to do a check disk, which requires taking the disk offline.  When you have a number of backup jobs going at any time, this means something is going to crash, and it was my luck that it was always in the middle of a 4TB+ VM.

Basically what I’m saying is that as of this date, I can no longer recommend NTFS running on Storage Spaces.  At least not on bare metal HW.  My best guess is we were suffering from bit rot, but who knows, since Storage Spaces / NTFS can’t tell me otherwise, or at least I don’t know how to figure it out.

All that said, I suspect I wouldn’t have run into these issues had I been running ReFS.  ReFS has online scrubbing, and it’s looking for things like failed CRC checks (and auto repairs them).  At this point, I’m burnt out on running Storage Spaces, so I’m not going to even attempt to try ReFS.  Enough v1 product evals in prod for me :-).

Fortunately I knew this might not work out, so my back out plan is to take the same disks / JBODs and attach them to a few RAID cards.  Not exactly thrilled about it, but hopefully it will bring a bit more consistency / reliability back to my backup environment.  Long term I’m looking at getting a SAN implemented for this, but that’s for a later time.

It’s a shame as I really had high hopes for Storage Spaces, but like many MS products, I should have known better than to go with their v1 release.  At least it was only backups and not prod…

Update (09/13/2016):

I wanted to add a bit more information.  At this point it’s theory, but just in case this article is or is not dissuading you from doing Storage Spaces, it’s worth noting some additional information.

We had two NTFS volumes, each being 100TB in size.  One for Veeam and one for our SQL backup data.  We never had problems with the SQL backup volume (probably luck), but the Veeam volume certainly had issues.  Anyway, after tearing it all down, I was still bugged about the issue, kind of felt really disappointed about the whole thing.  In some random google, I stumbled across this link going over some of NTFS’s practical maximums.  In theory at least, we went over the tested (recommended) max volume size.  Again, I’m not one to hide things and I fess up when I screw up.  Some of the Storage Spaces issues may have been related to us exceeding the recommended size, and NTFS couldn’t proactively fix things in the background.  I don’t know for sure, and I really don’t have the appetite to try it again.  I know it sounds crazy to have a 100TB volume, but we had 80TB of data stored in there.  In other words, most smaller companies don’t hit that size limit, but we have no problem at all exceeding that.  If you’re wondering why we made such a large volume, it really boiled down to wanting to maximize contiguous space as well as not wasting space.  Storage Spaces doesn’t let you thin provision storage when it’s clustered, so if we for example would have created five 20TB LUNs instead, the contiguous space would have been much smaller and ultimately more difficult to manage with Veeam.  We don’t have that issue anymore with CommVault as it can deal with lots of smaller volumes with ease.

Anyway, while I would love to say MS shouldn’t let you format a volume larger than what they’ve tested (and they shouldn’t without at least a warning), ultimately the blame falls on me for not digging into this a bit more.  Then again, try as I may, I’ve been unable to validate the information posted on the linked blog above.  I don’t doubt the accuracy of the information, often I find fellow bloggers do a better job of explaining how to do something or conveying real world limits than the vendor.

Best of luck to you, if you do go forward with storage spaces, and if you do have questions, let me know, I worked with it in production for over a year, at a decent scale.

Backup Storage Part 4a: Windows Storage Spaces Gotcha’s

The pro and the con of software defined storage is that it’s not a turnkey solution.  Not only do you have the power to customize your solution, you also have no choice but to design your solution.  With Storage Spaces, we figured most of this stuff out before selecting it as our new backup storage solution.  At the time, there was some documentation on Storage Spaces, but it was very much a learning process.  Most “how to’s” were demonstrated inside labs, so I found some aspects of the documentation to be useless, but I was able to glean enough information to at least know what to think about.

So to get started, if you end up here because you want to build a solution like this, I would encourage you to start with this FAQ that MS has put together.  A lot of the questions I had were answered here.

I want to go over a number of gotchas with WSS before you take the plunge:

  1. If you’re used to and demand a solution that notifies you automatically of HW failures, this solution may not be right for you.  Don’t get me wrong, you can tell that things are going bad, but you’ll need to monitor them yourself.  MS has written a health script, and I myself was also able to put together a very simple health script (I’ll post it once I get my GitHub page up).
  2. WSS only polls HW health once every 30 minutes.  You can’t change this.  That means if you rip a power supply out of your enclosure it will take up to 30 minutes before the enclosure goes into an unhealthy state.  I confirmed this with MS’s product manager.
  3. Disk rebuilds are not automatic, nor are they what I would call intuitive.  You shouldn’t just rip a disk out when it’s bad, plop a disk in and walk away.  There is a process that must be followed in order to replace a failed disk.  BTW, this process as of now, is all PowerShell based.
  4. Do NOT cheap out on consumer grade HW.  Stick to the MS HCL found here.  There have been a number of stability problems listed with WSS, and it almost always has to do with not sticking with the HCL and not WSS’s reliability.
  5. This isn’t specific to WSS, but do not plan on mixing SATA and SAS drives on the same SAS backplane.  Either go all SAS or go all SATA, avoid mixing.  For HDD’s specifically, the cost is so negligible between SATA and SAS, I would personally recommend just sticking with SAS, unless you never plan to cluster.
    1. Do NOT use SAS expanders either, again, stop cheaping out, this is your data we’re talking about here.
  6. Do NOT plan on using parity based virtual disks, they’re horrible at writes, as in 90MBps tops.  Use mirroring or nothing.
  7. Do NOT plan on using dedicated hot spares, instead plan on reserving free space in your storage pool.  This is one of many advantages to WSS.  It uses free space and ALL disks to rebuild your data.
    1. If you plan on using the “enclosure awareness”, you need to reserve a drive’s capacity of free space * the number of enclosures you have.  So if you have 4 enclosures and you’re using 4TB drives, you must reserve 16TB of space per storage pool spanned across those enclosures.
  8. Plan on taking point 7, and also ensuring there’s at least 20% free space in your pool.  I like to plan like this.
    1. Subtract 20% right off the top of your raw capacity.  So if you have 200TB raw, that’s 160TB.
    2. As mentioned in point 7\1, if you plan to use enclosure awareness, subtract a drive for each enclosure, otherwise, subtract at least one drive’s worth of capacity.  So if we had 4 enclosures and they had 4TB drives, that would be 160TB – 16TB = 144TB usable before mirroring.
  9. I recommend thick provisioned disk, unless you’re going to be VERY diligent about monitoring your pool’s free space.  Your storage pool will go OFFLINE if even one disk in the pool runs low on free space.  Thick provisioning prevents this, as the space will only allocate what it can reserve.
  10. Figure out your stripe width (number of disks in a span, which WSS calls columns) before building a virtual disk.  Plan this carefully because it can’t be reversed.  This has its own set of gotchas.
    1. This determines the increments of disks you need to expand your pool.  If you choose a 2 column mirror, that means you need 4 disks to grow the virtual disk.
    2. Enclosure awareness is also taken into account with columns.
    3. Performance is dependent on the number of columns.  If you have 80 disks, and you create a 2 column mirrored virtual disk, you’ll only have the read performance of 4 disks, and write performance of 2 disks (even with 80 disks).  You will however be able to grow at 4 disk increments.  However, if you create a virtual disk with 38 columns, you’ll have the read performance of 76 disks, and the write performance of 38, but you’ll need 76 disks to grow the pool.  So plan your balance of growth vs. performance.
  11. Find a VAR that has WSS experience to purchase your HW through.  Raid Inc. is who we used, but now other vendors such as Dell have also taken to selling WSS approved solutions.  I still prefer Raid Inc due to pricing, but if you want the warm and fuzzies, it’s good to know you now can go to Dell.
  12. Adding storage to an existing pool does not rebalance the data across the new disks.  The only way to do this is to create a new virtual disk, move the data over, and remove the original virtual disk.
    1. This is resolved in Server 2016.
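The capacity planning in points 7 and 8 and the column trade-off in point 10 are both simple arithmetic, so here is a small Python sketch you can adapt.  The function names are hypothetical, and the column math assumes a two-way mirror (two copies of every column), which is what the 2 column = 4 disk example in point 10 implies.

```python
def usable_before_mirroring(raw_tb, drive_tb, enclosures=1, free_pct=20):
    """Raw capacity minus the free-space cushion and the rebuild reserve."""
    after_free = raw_tb * (100 - free_pct) / 100
    # One drive's worth of space per enclosure, and never less than one drive.
    rebuild_reserve = drive_tb * max(enclosures, 1)
    return after_free - rebuild_reserve


def two_way_mirror_profile(columns):
    """Rough performance and growth figures for a two-way mirror."""
    return {
        "read_disks": columns * 2,        # reads can be served by both copies
        "write_disks": columns,           # every write lands on both copies
        "growth_increment": columns * 2,  # disks needed to extend the virtual disk
    }


# 200TB raw, 4 enclosures of 4TB drives: 160TB - 16TB = 144TB before mirroring
print(usable_before_mirroring(200, 4, enclosures=4))

# 2 columns: read 4, write 2, grow by 4.  38 columns: read 76, write 38, grow by 76.
print(two_way_mirror_profile(2))
print(two_way_mirror_profile(38))
```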

Those are most of the gotchas to consider.  I know it looks big, but I’m pretty confident that every storage vendor you look at has their own unique list of gotchas.  Heck, a number of these are actually very similar to NetApp/ZFS, and other tier 1 storage solutions.

We’ll nerd out in my next post on what HW we ended up getting, what it basically looks like and why.

Backup Storage Part 3c: Cloud Storage

Anyone who’s heard the word “cloud” as it pertains to technology knows it’s a fairly nebulous term.  About the only thing people seem to grasp is that it means “not in my datacenter or home”.  When we talk about “cloud storage” the term is still fuzzy, but at least we know it has something to do with storage.  For this post, I’m going to be writing about the two specific types of storage outlined below.

  • Online / Realtime storage:  This is just a name I’m going to use to describe the type of storage that you’d normally run cloud hosted VMs on.  I like to think of it as cloud SAN.  This type of storage is great for primary backup storage, both because of its performance capability and always being online.  However, due to its price, it may not be the most prudent cloud storage option for long term retention or for secondary copies.
  • Nearline or Archive storage: This storage is kind of like tape (and depending on the vendor may be tape). You don’t write to or read from this storage directly, and instead use either a virtual appliance, or some form of API / SW.  Archive and nearline storage are great options for secondary storage or for hosting secondary copies of your backups.  Other than how you interact with the storage, the only downside tends to be the potential for longer recovery times.

With two great options, what’s not to love?  You have the perfect mix of short term, fast storage for your more recent backups, and your cheap and deep storage for your secondary and extended retention copies.

Initially it sounded great, until we started digging deeper into it.  Ultimately we ended up having to pass on the solution for the reasons below.

  1. Cost:
    1. Let’s face it, unless you’re one of the lucky few, like us you have a tough enough time getting budget for things that might actually make your company more productive.  Trying to get your company to invest heavily in something they might never need to use, or rarely need to use is not likely to happen.
    2. This one’s a toss-up, but if your company prefers capex over opex, cloud is not going to be an easy battle for you.  In our case, capex is preferred, which gave cloud storage a big black eye.
    3. The storage is only one small piece of the cost of cloud options.
      1. Your network pipe is probably not big enough for all the data you need to send.  This mostly depends on your backup solution as a whole (software optimized backups with dedupe and compression may help), but I suspect you don’t have enough to send your weekly full, let alone try to recover a weekly full.  So on top of the never ending cost of cloud storage, you’re going to need to add a more expensive pipe to get the data there, and maybe even pull the data back.
      2. Potentially in addition to the network, or maybe instead of the network, you’ll need to invest in a cloud gateway or some other backup data optimizing software.  Either way, you’re going to be investing even more capex on top of your increasing opex.
  2. DR: Our new DR plan wasn’t finalized yet, which made it really difficult to pick a solution that we weren’t sure would make sense in two years.  For example, if we move our DR solution to the cloud, some of the considerations in point 1 go away, as we’ll be running our secondary site in the cloud anyway.  However, if we decided to stay in a colocation unit, while the cloud technically speaking could still work, it wouldn’t make sense to send our data to the cloud compared to just sending it to our DR site.
  3. Speed:  While point 1/3/1 (network), if invested in correctly, shouldn’t pose a bottleneck, it’s tough to argue against copying our data to tape from a performance perspective.  We easily saturate 5 LTO6 tape drives in parallel (off our new backup storage solution to be discussed in an upcoming post).  There’s no way that it would be cost effective to get that same level of performance out of cloud storage.
  4. Integration:  While more and more backup vendors are integrating cloud storage APIs, they’re not always free, and they’re not always good.  Veeam, as an example, had a cloud solution for their product, but it was terribly inefficient (as I recall).  There was no backup optimization to the cloud.  Veeam was simply copying files to the cloud for you.  It’s simple, but not efficient.  This also goes back to point 1/3/2.  We could have worked around this limitation with different backup SW, or cloud gateways, but again, we’re talking about adding cost to an already limited budget.
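
To put points 1/3/1 and 3 in perspective, here’s a rough back-of-the-envelope sketch in Python.  The 100 TB weekly full, the link speeds, and the 80% link efficiency are illustrative assumptions, not our measured numbers.  The 160 MB/s figure is the LTO-6 native (uncompressed) rate per drive.

```python
LTO6_NATIVE_MB_S = 160  # LTO-6 native (uncompressed) throughput per drive


def transfer_hours(data_tb, link_mbps, efficiency=0.8):
    """Hours to move `data_tb` decimal terabytes over a `link_mbps` link.

    `efficiency` is an assumed fraction of the link you can actually use
    for backup traffic (protocol overhead, other users, and so on).
    """
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600


def tape_equivalent_gbps(drives, mb_per_s=LTO6_NATIVE_MB_S):
    """Sustained link speed (Gbit/s) needed to match `drives` tape drives."""
    return drives * mb_per_s * 8 / 1000


for mbps in (100, 1000, 10000):
    print(f"{mbps} Mbit/s: {transfer_hours(100, mbps):.0f} hours for 100 TB")
print(f"5 x LTO6 is roughly {tape_equivalent_gbps(5):.1f} Gbit/s sustained")
```

Under those assumptions, a 100 TB full takes roughly 278 hours (over 11 days) on a 1 Gbit/s pipe, and matching five LTO6 drives at native speed means sustaining about 6.4 Gbit/s to the cloud.  Dedupe and compression change the picture, but that’s the gap you’re starting from.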

At the end of the day, I really wanted cloud, but financially speaking it doesn’t make much sense for us to back up to the cloud.  While the price of the storage itself is quite reasonable, the cost of the pipe or SW to get the data there is what kills the solution.  I’m not saying it doesn’t make sense for others, but for us, our data was too big and the cost would have been too high.

Factors that would change my view are below:

  1. If ISPs’ prices were dropping at the same rate as cloud vendors’ prices, I could see this making cloud more affordable.
  2. If the cloud providers themselves started offering deduplication targets as part of their storage offering I think this would make a big difference.  Perhaps instead of charging what we physically consume (per GB) they could instead charge what we logically consume.  This way they still win, and so do we.

Backup Storage Part 3a: Deduplication Targets

Deduplication and backup kind of go hand in hand, so we couldn’t evaluate backup storage and not check out this segment.  We had two primary goals for a deduplication appliance.

  1. Reduce rack space while enabling us to store more data.  As you know from part 1, we had a lot of rack space being consumed.  While we weren’t hurting for rack space in our primary DC, we were in our DR DC.
  2. We were hoping that something like a deduplication target would finally enable us to get rid of tape and replicate our data to our DR site (instead of sneakernet).

For those of you not particularly versed in deduplicated storage, there are a few things to keep in mind.

  • Backup deduplication and the deduplication you’ll find running on high performance storage arrays are a little different.  Backup deduplication tends to use either variable or much smaller block size comparisons.  As an example, your primary array might be looking for 32k blocks that are the same, whereas a deduplication target might be looking for 4k blocks that are the same.  That’s a huge difference in deduplication potential.  The point is, just because you have deduplication baked into your primary array does not mean it’s the same level of deduplication that’s used in a deduplication target.
  • Deduplication targets normally include compression as well.  Again, it’s not the same level of compression found in your primary storage array; it’s typically a more aggressive (CPU intensive) compression algorithm.
  • Deduplication targets tend to use in-line deduplication.  Not all do, but the majority of the ones I looked at did.  There are pros and cons to this that I’ll go into later.
  • In all the appliances I’ve looked at, every one of them had a primary access method of NFS/SMB.  Some of them also offered VTL, but the standard deployment method is for them to act as a file share.
  • Not all deduplication targets offer what’s referred to as global deduplication.  Depending on the target, you may only deduplicate at the share level.  This can make a big difference in your deduplication rates.  A true global deduplication solution will deduplicate data across the entire target, which is ideal.
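
To illustrate the block size point from the first bullet, here’s a toy sketch in Python.  It is a simple fixed-block model with made-up sample data, not how any particular appliance works (many targets use variable-length chunking), and the function names are hypothetical.  The sample data is built so that the same 4k chunks repeat, but never line up into identical 32k blocks.

```python
import hashlib


def dedupe_ratio(data: bytes, block_size: int) -> float:
    """Logical blocks divided by unique blocks for fixed-size blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


def sample_data() -> bytes:
    """256 KiB made of 4 KiB chunks that repeat at shifting offsets."""
    chunks = [bytes([i]) * 4096 for i in range(16)]
    # Each 32 KiB block is 8 consecutive chunks, starting one chunk later
    # than the previous block, so no two 32 KiB blocks are identical.
    return b"".join(chunks[k + j] for k in range(8) for j in range(8))


data = sample_data()
print("32k blocks:", round(dedupe_ratio(data, 32 * 1024), 2))
print(" 4k blocks:", round(dedupe_ratio(data, 4 * 1024), 2))
```

With this contrived data the 32k comparison finds nothing to deduplicate (a ratio of 1.0), while the 4k comparison lands a bit over 4:1.  Real data won’t be this clean, but it shows why the same bytes can dedupe very differently on a primary array and on a backup target.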

Now I’d like to elaborate a bit on the pros and cons of in-line vs. post process deduplication.

Pros of In-Line:

  • As the name implies, data is instantly deduplicated as it’s being absorbed.
  • You don’t need to worry about maintaining a buffer or landing zone space like post process appliances need.
  • Once an appliance has seen the data (meaning it’s getting a deduplication hit) writes tend to be REALLY fast since it’s just metadata updates.  In turn replication speed also goes through the roof.
  • You can start replication almost instantly or in real time depending on the appliance.  Post process can’t do this, because you need to wait for the data to be deduplicated.

Pros of Post Process:

  • Data written isn’t deduplicated right away, which means if you’re doing say a tape backup right afterwards, or a DB verification, you’re not having to rehydrate the data.  Basically they tend to deal with reads a lot better.
  • Some of them actually cache the data (un-deduplicated) so that restores and other actions are fast (even days later).
  • I know this probably sounds redundant, but random disk IO in general is much better on these devices.  A good use case example would be doing a Veeam VM verification.  So not only reads in general, but random writes.

Again, like most comparisons, you can draw the inverse of each device’s pros to figure out its cons.  Anyway, on to the devices we looked at.

There were three names that kept coming up in my research: EMC’s DataDomain, ExaGrid and Dell.  It’s not that they’re the only players in town; HP, Quantum, Sepaton, and a few others all had appliances.  However, EMC and ExaGrid were well known, and we’re a Dell shop, so we stuck with evaluating these three devices.

Dell DR series appliances (In-line):

After doing a lot of research, discussions and demos (the whole 9 yards), it became very clear that Dell wasn’t in the same league as the other solutions we looked at.  I’m not saying I wouldn’t recommend them, nor am I saying I wouldn’t reconsider them, but not yet, and not in the product’s current iteration.  That said, as of this writing, it’s clear Dell is investing in this platform, so it’s certainly worth keeping an eye on.

Below are the reasons we weren’t sold on their solution at the time of evaluation.

  • At the time, they had a fairly limited set of certified backup solutions.  We planned to dump SQL straight to these devices, and SQL wasn’t on the supported list.
  • They often compared their performance to EMC, except they were typically quoting their source side deduplicated protocol vs. EMC’s raw (unoptimized) throughput, meaning it wasn’t an apples to apples comparison.  When you’re planning on transferring 100TB+ of data on a weekly basis and not everything can use source side deduplication, this makes a huge difference.  At the time we were evaluating, Dell was comparing their DR4100 vs. a DD2500.  The reality is, the Dell DR6100 is a better match for the DD2500.  Regardless, we were looking at the DD4200, so we were way above what Dell could provide.
  • They would only back a 10:1 deduplication ratio.  Now this, I don’t have a problem with.  I’d much rather a vendor be honest than claim I can fit the moon in my pocket.
  • They didn’t do multi to multi replication.  Not the end of the world, but also kind of a bummer.  Once you pick a destination, that’s it.
  • Their deduplication was at a share level, not global.  If we wanted one share for our DBAs and one for us, there would be no shared deduplication between them.
  • They didn’t support snapshots.  Not the end of the world, but it’s 2015, and snapshots have kind of been a thing for 10+ years now.
  • Their source side deduplication protocol was only really suited to Dell products.   Given that we weren’t planning on going all in with Dell’s backup suite, this was a negative for us.
  • No one, and I mean no one was talking about them on the net.  With EMC or ExaGrid, it wasn’t hard at all to find some comments, even if they were negative.
  • They had a very limited amount of raw data (real usable capacity) that they could offer.  This is a huge negative when you consider that splitting off a new appliance means you just lost half or more of your deduplication potential.
  • There was no real analysis done to determine if they were even a good fit for our data.

ExaGrid (Post process):

I heard pretty good things about ExaGrid after having a chat with a former EMC storage contact of mine.  If EMC has one competitor in this space, it would be ExaGrid.  Like Dell, we spent time chatting with them, researching what others said, and really just mulling over the whole solution.  It’s kind of hard to solely place them in the deduplication segment as they’re also scale out storage to a degree, but I think this is a more appropriate spot for them.

Pros:

  • The post process is a bit of a double edged sword.  One of the pros that I outlined above, is that data is not deduplicated right away.  This means we could use this device as our primary and archive backup storage.
  • The storage scaled out linearly in both performance and capacity.  I really like the idea of not having to forklift upgrade our unit if we grew out of it.
  • They had what I’ll refer to as “backup specialists”.  These were techs that were well versed in the backup software we’d be using with ExaGrid.  In our case SQL and Veeam.  Point being, if we had questions about maximizing our backup app with ExaGrid, they’d have folks that know not just ExaGrid but the application as well.
  • The unit pricing wasn’t simply a “let’s get ’em in cheap and suck ’em dry later”.  Predictable (fair) pricing was part of who they are.

Cons:

  • As I mentioned, post process was a bit of a double edged sword.  One of the big negatives for us, was that their replication engine required waiting until a given file was fully deduplicated before it could begin.  So not only did we have to wait say 8 hours for a 4TB file server backup, but then we had to wait potentially another 8 hours before replication could begin.  Trying to keep any kind of RPO with that kind of variable is tough.
  • While they “scale out” their nodes, they’re not true scale out storage IMO.
    • Rather than pointing a backup target at a single share, and letting the storage figure everything out, we’d have to manually balance which backups go to which node.  With the number of backups we were talking about and the number of nodes there could be, this sounded like too much of a hassle to me.
    • The landing zone space (un-deduplicated storage) was not scale out, and was instead pinned to the local node.
    • There is no node resiliency.  Meaning if you lose one node, everything is down, or at least everything on that node.  While I’m not in love with giving up two or three nodes for parity, at least having it as an option would be nice.  IIRC (and I could be wrong) this also affected the deduplication part of the storage cluster.
    • Individual nodes didn’t have the best throughput IMO.  While it’s great that you can aggregate multiple nodes’ throughput, if I have a single 4TB backup, I need that to go as fast as possible and I can’t break that across multiple nodes.
  • I didn’t like that the landing zone : deduplication zone was manually managed on each node.  This just seemed to me like something that should be automated.
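
The replication delay in the first con is the one that hurt our RPO, so here’s a rough sketch of the timing in Python.  The 8 hour backup and 8 hour deduplication figures come from the example above; the 2 hour replication time is a made-up placeholder.  The in-line case assumes replication runs alongside the backup, which is a simplification.

```python
def hours_until_offsite(backup_h, dedupe_h, replicate_h, inline):
    """Hours from backup start until the offsite copy is complete.

    Post process: replication waits for the file to be fully written and
    then fully deduplicated.
    In-line (assumed): replication overlaps with ingest, so the offsite
    copy completes when the slower of the two finishes.
    """
    if inline:
        return max(backup_h, replicate_h)
    return backup_h + dedupe_h + replicate_h


post_process = hours_until_offsite(8, 8, 2, inline=False)
in_line = hours_until_offsite(8, 8, 2, inline=True)
print("post process:", post_process, "hours / in-line:", in_line, "hours")
```

With those placeholder numbers, the offsite copy of a single 4TB backup lands 18 hours after the job starts on a post process device, versus 8 hours when replication can run alongside the backup.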

EMC DataDomain (In-line):

All I can say is it’s no wonder they’re the leader in this segment.  Just an absolutely awesome product overall.  As many who know me can attest, I’m not a huge EMC (Expensive Machine Company) fan in general, but there are a few areas they do well and this is one of them.

Pros:

  • Snapshots, file retention policies, ACLs; they have all the basic file server stuff you’d want and expect.
  • Multi : Multi replication.
  • Very high throughput of non-source (DDBoost) optimized data and even better when it is source optimized.
  • Easy to use (based on demo) and intuitive interface.
  • The ability to store huge amounts of data in a single unit.  At times a head swap may be required, but having the ability to simply swap the head is nice.
  • Source based optimization baked into a lot of non-EMC products, SQL and Veeam in our case.
  • Archive storage as a secondary option for data not accessed frequently.
  • End to end data integrity.  These guys were the only ones that actually bragged about it.  When I asked this question to others, they didn’t exactly instill faith in their data integrity.
  • They actually analyzed all my backup data and gave me reasonably accurate predictions of what my dedupe rate would be and how much storage I’d need.  All in all, I can’t speak highly enough about their whole sales process.  Obviously everyone wants to win, but EMC’s process was very diplomatic, non-pushy and in general a good experience.

Cons:

  • EMC provided some great initial pricing for their devices, but any upgrades would be cost prohibitive.  That said, I at least appreciate that they were up front with the upgrade costs so we knew what we were getting into.  If you go down this path yourself, my suggestion is buy a lot more storage than you need.
  • They treat archive storage and backup storage differently and it needs to be manually separated.  For the price you pay for a solution like this, I’d like to think they could auto tier the data.
  • They license à la carte.  It’s not like there’s even a slew of options, so I don’t get why they don’t make things all inclusive.  It’s easier for the customer and it’s easier for them.
  • In general, the device is super expensive.  Unless you plan on storing 6+ months of data on the device, I’d bet you could do better with large cheap disks, or even something like a disk to tape tiering solution (SpectraLogic Black Pearl).  Add to that, unless your data deduplicates well, you’ll also be paying through the nose for storage.
  • Going off the above statement, if you’re only keeping a few weeks’ worth of data on disk, you can likely build a faster solution $ for $ than what’s offered by them.
  • No cloud option for replication.  I was specifically told they see AWS as competition, not as a partner. Maybe this will change in the future, but it wasn’t something we would have banked on.

All in all, the deduplication appliances were fun to evaluate.  However, cutting to the chase, we ended up not going with any of these solutions.  As for ROI, these devices are too specialized, and too expensive for what we were looking to accomplish.  I think if you’re looking to get rid of tape (and your employer is on board), EMC DataDomain would be my first stop.  Unfortunately, for our needs, tape was staying in the picture, which meant this storage type was not a good fit.

Next up, scale out storage…