Review: 5 years with CommVault

Introduction:

Backup and recovery is a rather dry topic, but it’s an important one. After all, what’s more critical to your company than its data? You can have the best products in the world, but if disaster strikes and you don’t have a good solution in place, it can make your recovery painful or even impossible. Still, many companies shirk investment in this segment. The good solutions (like the one I’m about to discuss) cost a pretty penny, and that’s capital that needs to be balanced against technology that makes or saves your company money. As a result, insurance (and that’s what backup is) is something that typically sits in the back of companies’ minds.

Finding the right product in this segment can be a challenge, not only because every vendor tries to convince you that they’ve cracked the nut, but because it seems like all the good solutions are expensive. Like many, our budget was initially constrained. We had an old investment in CV (CommVault), but had not reinvested in it over the years, and needed a new solution. We initially chose a more affordable combination of Veeam + Windows Storage Spaces to handle our backup duties. It was a terrible mistake, but you know, sometimes you have to fail to learn, and so we did.

After putting up with Veeam for a year, we threw in the towel and went back to CV with open arms. Our timing was great too, as Veeam had put a serious hurt on CV’s business and some of CV’s licensing had changed to accommodate that. We ultimately ended up with much better pricing than when we last looked at CV, and on top of that, we actually found their virtualization backup to be more affordable and in many ways more feature rich. CV isn’t perfect, as I’ll outline below, but they’re pretty much as close as you can get to perfection for a product that is the Swiss Army knife of backup.

CommVault Terms:

For those of you not super familiar with CV, you’ll find the following terms useful for understanding what I’m talking about.  There are a lot more components in CV, but these are the fundamental ones.

  • MA (Media Agent): Simply put, it’s a data mover.  It copies data to disk, tape, cloud, etc.
  • Agent: A client that is installed to back up an application or OS.
  • VSA (Virtual Server Agent): A client specially designed for virtualization backup.
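  • CC (CommCell): The central server that manages all the jobs, history, reporting, configuration, etc. This is the brains of the whole operation. (Strictly speaking, CV calls the server itself the CommServe and the environment it manages the CommCell, but I’ll use CC for the server throughout.)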

Our Environment:

  • We have five MA’s.
    • Two virtual MA’s that back up to a Quantum QXS SAN (DotHill). This was done because we were reusing an old pair of VM hosts and have a few other non-CV backup components running on these hosts.
      • The SAN has something like two pools of 80 disks. Not as fast as we’d like, but more than fast enough.  The QXS (DotHill) was our replacement for Storage Spaces.  Overall, better than Storage Spaces, but a lot of room for improvement.  The details of that are for another review.
    • Two physical MA’s with DAS. Each MA has 80 disks in a RAID 60, and yeah, it rips from a disk performance perspective 🙂. Multiple GBps.
    • One physical MA that’s attached to our tape library.
  • We have five VSA’s. I’ll go more into this below, but we’re not using five because I want to.
  • We have one CC, although we’ll be rolling out a second for resiliency and failover soon.
  • We have a number of agents
    • Several MS Exchange
    • Several MS Active Directory
    • Several Linux
    • The rest are file server / OS image agents.
  • In total, we have about a PB of total backup capacity between our SAN and DAS, but not all of that is consumed by CV (most is though).
  • We only use compression right now, no dedupe.
  • We only use active fulls (real fulls) not synthetics

Pros:

  • Backup:
    • CV can back up practically anything, and has a number of application specific agents as well. You can back up your entire enterprise with their solution. I would contend that with CV, there are very few cases where you’d need point tools anymore. Desktops, servers, virtualization, various applications and NAS devices are all systems that can be backed up by CV. Honestly, it’s hard to find a solution that is as comprehensive as theirs. That being said, I can imagine you’re wondering: if they do it all, can they do it well? I would say mostly. I have some deltas to go over a little farther down, but they do a lot, and do a lot of it well. It’s one of the reasons the solution was (and still is) expensive.
    • I went from having to babysit backups with Veeam, to having a solution that I almost never have to think about (other than swapping tapes). There were some initial pains as we learned CV’s way of doing virtualization backup, but we quickly got to a stable state.
  • Deployment / Scalability:
    • CommCell has a great deployment model that works well in single office locations all the way to globally distributed implementations. They’re able to accomplish all of this with a single pane of glass, which a number of vendors can’t claim to do.
    • Besides the size of the deployment, you’re not forced into using Windows only for most components of CV. A lot of the roles outlined above run on Linux or Windows.
    • CV is software based, and best of all, it’s an application that runs on an OS which you’re already comfortable with (Linux / Windows). Because of this, the HW that you deploy the solution on is really only limited by minimum specs, budget and your imagination. You can build a powerful and affordable solution on simple DAS, or you can go crazy and run on NVMe / all flash SANs. It also works in the cloud because again, it’s just SW inside a generic OS. I can’t tell you how many backup solutions I looked at that had zero cloud deployment capabilities.
    • There are so many knobs to turn in this solution, it’s pretty tough to run into a situation that you can’t tune for (there are a few though). Most of the out of box defaults are fine, but you’ll get the best performance when you dig in and optimize. Some find this overwhelming, and I’ll chat more about that in the cons, but with CV’s great support and a read through their documentation, it’s not as bad as it sounds. Ultimately the tunability is an incredible strength of this solution. I’ve been able to increase backup throughput from a few hundred MBps to a few GBps simply by changing the IO size that CV uses.
  • Support:
    • Overall, they have fantastic support. Any vendor’s support can vary, and CV is no different. Still, I can count on one hand the number of times support was painful, and even in those cases, we ultimately got the issue resolved.
    • For the most part, support knows the application they’re backing up pretty well. I had a VMware backup issue that we ran into with Veeam and that continued with CV. CV, while not able to directly solve the problem, provided significantly more data for me to hand off to VMware, which ultimately led to us finding a known issue. CV analyzed the VMware logs as best they could and found the relevant entries that they suspected were the issue. Veeam was useless.
    • Getting CV issues fixed is something else that’s great about CV. No vendor is perfect; that’s what hotfixes and service packs are for. CV has an amazing escalation process. I went from a bug to a hotfix that resolved the issue in under two weeks.
    • My experience with their support’s response time has been fantastic. I rarely go more than a few hours without hearing from them. They’re also not afraid to simply call you and work on the problem in real time. I don’t mind email responses for simple questions, but when you’re running into a problem, sometimes you just want someone to call you and hash it out in real time. I also like that most of the time you get the tech’s direct number if you need to call them.
  • Feature requests: A little hit or miss, but feature requests tend to get taken seriously with CV, especially if it’s something pretty simple.
  • Value: This one is a mixed bag. Thanks to Veeam eating their lunch, virtualization backup with CV has never been a better value. I could be wrong, but I actually think virtualization backup in CV rings in at a significantly lower price than Veeam. I would say at least 50% of our backups are virtualization. It’s our default backup method unless there is a compelling reason to use agents. This is ultimately what made CV an affordable backup solution for us. We were able to leverage their virtualization backup for most of our stuff, and utilize agents for the few things that really needed to be backed up at a file level or application level. The virtualization backup entitles you to all their premium features, which is why I think it’s a huge value add. That being said, I have some stuff to touch on in the cons with regards to the value.
  • Retention Management: Their retention management is a little tricky to get your head around, but it’s ultimately the right way to do retention. Their retention is based on a policy, not on the number of recovery points. You configure things like how many days of fulls you want and how many cycles you need. I can take a bazillion one-off backups and not have to worry about my recovery history being prematurely purged.
  • Copy management: They manage copies of data like a champ. Mix it with the above point, and you have all kinds of copies with different retentions and different locations, and it all works rock solid. You have control over what data gets copied. So your source data might have all your VM’s and you only want a second copy of select VM’s: not a problem for them. Maybe you want dedupe on some, compression on others, some on tape, some on disk, some on cloud: again, no issue at all.
  • Ahead of the curve: CV seems to be the most forward thinking when it comes to backup / recovery destinations and sources.  They had our Nimble SAN’s certified for backup LONG before our previous vendor.  They support all kinds of cloud destinations, the ability to recover VM’s from physical to virtual, virtual to cloud, etc.  This goes back to the holistic approach that I brought up.  They do a very good job of wrapping everything up, and creating a flexible ecosystem to work with.  You typically don’t need point solutions with them.
  • Storage Management: I love their disk pools, and the way they store their backup data. First and foremost, it’s tunable, so whether you want 512MB files or some other size, it’s an option. They shard the data across disks, etc. Frankly the way they store data is a no brainer. They also move jobs / data pretty easily from one disk to another, which is great. This type of flexibility is not only helpful for things like making it easier to fit your data on disparate storage, but also in ensuring your backups can easily be copied to unreliable destinations. Having to recopy a 512MB file is a lot better than having to recopy an 8TB file. CV can take that 8TB file if you want, and break it up into chunks of various sizes (default is 2GB).
  • Policies: Most things are defined using policies: schedules, retention, copies, etc. Not everything, but most things. This makes it easy to establish standards for how things should act, and it also makes it easier to change things.
  • CLI: They have a ton of capability with their CLI / API. Almost anything can be executed or configured. I actually developed a number of external workflows which call their CLI and it works well.
  • Tape Management:
    • They handle tapes like a librarian, minus the Dewey Decimal system. Seriously though, I haven’t worked with a solution that makes handling tapes as easy as they do.
    • If you happen to use Iron Mountain, they have integration for that too.
    • They’re pretty darn efficient with tape usage as well, which is mostly thanks to their “global copy” concept. We still have some white space issues, but it makes sense why.
    • They are very good at controlling tape drive and parallel job management. This allows you to balance how many tape drives are used for what jobs.
  • Documentation: They document everything, and for good reason, there is a lot their product does. This includes things like advanced features and most of the special tuning knobs as well.  It’s not always perfect, but it’s typically very good.
  • Recovery:
    • File level recovery from tape for VM backups, without having to recover the whole file. Need I say more? That means if I need one file off an 8TB backup VMDK, I don’t have to restore 8TB first.
    • Most application level backups offer some level of item level recovery. It’s not always straightforward, or quick, but it’s usually possible.
    • They’re smart with how they restore data. You can pick where you want the data recovered from (location and copy), and if it does need tapes, it tells you exactly what tapes you need.  No more throwing every single tape in and hoping that’s all you need.
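
To make the retention bullet above a bit more concrete, here is a minimal Python sketch of policy based retention. It assumes a simplified model in which a full backup only becomes eligible for pruning once it is older than the configured number of days AND at least the configured number of newer fulls (cycles) exist. The function name, parameters and dates are made up for illustration and are not CV’s API; real cycle accounting in CV has more nuance than this.

```python
from datetime import date


def prunable_fulls(full_dates, today, retain_days, retain_cycles):
    """Return the full backups that satisfy BOTH pruning conditions.

    Simplified, assumed model: a full is prunable only when it is older
    than retain_days and at least retain_cycles newer fulls exist.
    """
    ordered = sorted(full_dates)
    prunable = []
    for index, started in enumerate(ordered):
        age_days = (today - started).days
        newer_fulls = len(ordered) - index - 1  # fulls taken after this one
        if age_days > retain_days and newer_fulls >= retain_cycles:
            prunable.append(started)
    return prunable


# Weekly fulls, 14 days and 2 cycles of retention (hypothetical values).
weekly = [date(2017, 1, 1), date(2017, 1, 8), date(2017, 1, 15), date(2017, 1, 22)]
print(prunable_fulls(weekly, date(2017, 1, 23), retain_days=14, retain_cycles=2))
```

Under this model, a pile of one-off fulls taken yesterday does not push out a 9 day old full, because the days requirement has not been met yet. A scheme that simply keeps the last N recovery points would have purged it.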

Cons:

  • Backup:
    • Virtualization:
      • Their VMware backup in many ways isn’t as tunable as it should be. There are places where they don’t have stream limits where they really need them. For example, they lack a stream limit on the vCenter host, the ESXi host or even the VSA doing the backup. It’s honestly a little strange, as CV seems to offer a never-ending number of stream controls for other areas of their product. I bring this up as probably my number one issue with their VMware backup. It caused most of our initial problems with their solution. I would still say this is a glaring hole in their virtualization backup. I just looked up their CV11 SP7 and nothing has changed with regards to this, which is disappointing to say the least. This is one area that I think Veeam handles much better than them.
      • The performance of NBD (management network only) based backup is, bluntly, terrible. The only way we could get really good performance out of their product was to switch to hot add. Typically speaking I hate hot add for VMware backup. It takes forever to mount disks, and it makes the setup of VM backup more complicated than it needs to be. Not to mention if you do have an issue during the backup process (like vCenter dying) the cleanup of the backup is horrible.
      • They don’t pre-tune the VSA for hot add. Things like disabling initialize disk in Windows and whatnot.
      • Their inline compression throughput was also atrocious at first. We had to switch the algorithm used, which fixed the issue, but it required a non-GUI tweak to achieve, and it only happened because I asked if there was anything else they could do. It was actually good timing: the new algorithm had been released as experimental in the release we had just upgraded to.
      • Their default VM dispatch is, to me, less than ideal. Instead of balancing VM’s in a least load method across the VSA’s, they pick the VSA closest to the VM or datastore. I needed to go in and disable all of this.
    • Deployment / Scalability:
      • While I applaud their flexibility, the one area that I think still needs work is their dedupe. To me, they really need to focus on building a DataDomain level of solution that can scale to petabytes of logical data in a single media agent, and right now they can’t scale that big.  It seems like you need to have a bunch of mid sized buckets which is better than nothing, but still not as ideal as it should be.
      • Deployment for CV newbies is not straightforward. You’ll definitely need professional services to get most of the initial setup done, at least until you have time to familiarize yourself with it. You’ll also need training so that you actually know how to care for and grow the solution. I think CV could do a better job here, perhaps by implementing a more express setup just to get things up, and maybe even having a couple of intro / how-to videos to jump start the setup. It’s complicated because of its power, but I don’t think it needs to be. The knobs and tuning should be there to customize the solution to a person’s environment, but there should be an easy button that suits most folks out of the box.
    • Support: In general I love their support, but there are times where I’m pretty confident the folks doing the support don’t have at scale experience with the product. There are times when I’ve tried explaining the scaling issue we were having, and they couldn’t wrap their heads around it. They also tend to get wrapped up in the “this is the way it works” and not in the “this is the way it SHOULD work”, which again I think comes back to experience with the product at scale. This would tend to happen more when I was trying to explain why I set something up in a particular way that didn’t match their norm. For example, with VM backups, they like to pile everything into subclients. For a number of reasons I’m not going to go into in this blog post, that doesn’t work for us, and frankly it shouldn’t work for most folks. I was able to punch holes in why their design philosophy was off, but they were stuck on “this is the way it is”. The good news is you can typically escalate over shortsighted techs like this and get to someone who can think outside the box.
    • Value: This is a tough one. On one hand, I want good support and a feature rich product, but on the other hand, the cost of agent based backup is frankly stupid expensive. When my backup product costs more per TB than my SAN, that’s an issue. It’s one of the primary reasons we push towards VM based backups, as it’s honestly the only way we could afford their product. Even with huge discounts, the cost per TB is insane with their solution. In some cases, I would almost rather have a per agent cost rather than a per TB cost. I could see how that could get out of control, but I think each licensing model works better for some companies than for others. If I had thousands of servers, I could see where the per TB model might make more sense. This is one of the reasons we don’t back up SQL directly with CV: it just costs too much per TB. It’s cheaper for us to use a (still too expensive) file based agent to pick up SQL dump files.
    • Storage Management: Once data is stored on its medium, moving it off isn’t easy. If you have a mountpoint that needs to be vacated, you need to either aux copy data to a new storage copy, manually move the data to another mountpoint, or wait till it ages out. They really should have an option in their storage pool to simply right click the mountpoint and say “vacate”. This operation would then move all data / jobs to whatever mountpoints are left in the pool, similar to VMware’s SDRS. I would actually like to see this ability at an MA level as well.
    • CLI: I’ll knock any vendor that doesn’t have a PowerShell module, and CV is one of those vendors. Again, glad that you have API’s, but in an enterprise where Windows rules the house, PowerShell should be a standard CLI option.
    • Tape Management: As much as I think they do it better than anyone else, they could still improve the white space issue. I almost think they need a tapering off setting. Perhaps even a preemptive analysis of the optimal number of tapes and tape drives before the start of each new aux copy, re-run each time more data is detected that needs to be copied to tape. This way it could balance copy performance with tape utilization. Maybe even define a range of streams that can be used.
    • Documentation: As great as their documentation is, it needs someone to really organize it better, taking into account the differences in CV versions. I realize it’s probably a monumental task, but it can be really hard to find the right document for the right version of what you’re looking for. I’ve also found times where certain features are documented in older CV version docs, but not in newer ones (even though the features still exist). I guess you could argue that having so much documentation that it’s hard to find the right one beats not having any doc at all. When in doubt though, I contact support and they can generally point me in the right direction, or they’ll just answer the question.
    • Recovery:
      • Item level recovery that’s application based really needs a lot of work. One thing I’ll give Veeam is that they seem to have a far more feature rich and intuitive application item level recovery solution than CV.
        • Restoring Exchange at an item level is slow and involved (lots of components to install). I honestly still haven’t gotten it working.
        • AD item level recovery is incredibly basic and honestly needs a ton of work.
        • Linux requires a separate appliance, which IMO it shouldn’t. If Linux admins can write tools to read NTFS, why can’t a backup vendor write a Windows tool that can natively mount and read EXT3/4, ZFS, XFS, UFS, etc.?
      • P2P, V2V / P2V leaves a lot to be desired. If you plan to use this method, make sure you have an ISO that already works.  Otherwise you’ll be scrambling to recover bare metal when you need to.

Conclusion:

Despite CommVault’s cons, I still think it’s the best solution out there. It’s not perfect in every category, and that’s a typical problem with most do it all solutions, but it’s pretty damn good at most. It’s an expensive solution, and it’s complicated, but if you can afford it, and invest the time in learning it, I think you’ll fall in love with it, at least as much as one can with a backup tool.

Problem Solving: CommVault tape usage

Introduction:

I hate dealing with tapes, pretty much every aspect of them.  The tracking of them is a PITA, having to physically manage them is a PITA, dealing with tape library issues is a PITA, dealing with tape encryption is a PITA, running out of tapes is a PITA, dealing with legal hold for tapes is a PITA, and I could keep going on with the many ways that tape just sucks.  What makes matters worse is when you have to deal with MORE tapes.

Now that you know tapes are one of my personal seven levels of hell in IT, you’ll know why I put a bit of time into this solution.  Anything I can do to reduce the number of tapes getting exported every day, ultimately leads to some reduction in the PITA scale of tapes.

The issue:

To provide a better understanding of the issue at hand: for years I’ve been seeing way too many tapes being used by CV. We’d kick out tapes that had 5% or 10% consumption, and the number of tapes with that level of consumption varied based on what phase of our backup strategy we were in, and what day of the week it was. It could be anything from as few as 4 partially filled tapes to times where we had 10+ tapes that weren’t filled all the way up. If the consumed data should fit on 16 tapes, and we’re kicking out 26 tapes, that’s a problem IMO. I’m sure many of you out there have contended with this in CV specifically, and I’d bet those of you using other vendors’ products have run into this too. I’m going to first explain why the problem is occurring, and then I’ll go over how I’ve reduced most of the waste.

The Why?

In CV, we have storage policies, and rather than going into a full explanation of what they are for those not familiar with CV, just think of each one as an island of backup data. That island of data doesn’t co-mingle with other islands of data on disk, and tape is no exception. What that means is when you back up data to a storage policy and want to copy it to tape, the data getting copied will automatically reserve the entire tape being used. In turn, each storage policy reserves its own unique tapes so that data does not co-mingle. This means every storage policy you have is guaranteed to use at least one tape of its own. Now, each storage policy can have a number of streams configured. To keep things simple, let’s just ignore multiplexing for now. When a storage policy has a stream limit of 1, only 1 tape drive will be used; when it has a stream limit of 4, up to 4 tape drives will be used. Now, as you copy data to tape, you normally have more than one stream’s worth of data; you probably have at least one stream for each client in your environment (and likely many more than that). This is a good thing: having more streams means we can run data copy operations in parallel. In the case of the 4 streams example, that means we can use 4 tape drives in parallel to copy data for the example storage policy. What this also means is that, depending on circumstances, we could end up with 4 tapes not being filled all the way. Streams are optimized for performance, NOT for improving tape utilization. Now, imagine you have more than one storage policy, let’s just say 4 storage policies, each being its own island, and each with a stream limit of 2. That means you could end up with up to 8 tapes not being fully utilized. I’m also ignoring for now that in CV, you can separate incrementals and fulls into different storage policies, which exacerbates the problem further (taking one island and making it two).

In our case, we have 4 storage policies, and we had already gone through a process of merging our Fulls and Incs into a single storage policy to consolidate tapes. We have a total of 6 tape drives, which means if we just configured the storage policies to fight over the tape drives @6 streams each, we could end up in theory with 24 partially filled tapes. We’re smarter than that of course, so that wasn’t our problem. Our problem was finding the right balance between how many streams a storage policy needed to copy all its data in our window, and not making it so high that we ended up wasting tape. Pre-solution, we almost always had 4 – 6 tapes that were wasted, as in 100GB on a 2000GB tape. It was annoying and wasteful.
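
The stream math above is easy to sanity check. The following is a rough Python sketch, not anything CV provides; the function names are made up, and it assumes the worst case where every stream of every storage policy leaves one partially filled tape behind.

```python
def worst_case_partial_tapes(stream_limits):
    """Worst case count of partially filled tapes when each storage policy
    is its own island: every stream of every policy may leave one behind."""
    return sum(stream_limits)


def fill_percent(used_gb, capacity_gb):
    """How full a tape is, as a percentage of its capacity."""
    return 100.0 * used_gb / capacity_gb


# 4 storage policies with a stream limit of 2 each.
print(worst_case_partial_tapes([2, 2, 2, 2]))
# 4 storage policies all allowed to use 6 streams.
print(worst_case_partial_tapes([6, 6, 6, 6]))
# 100GB written to a 2000GB tape.
print(fill_percent(100, 2000))
```

The first two calls reproduce the 8 and 24 tape figures from the text, and the last one shows that a “wasted” tape in this scenario is only 5% full.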

Solution, problems again, improved solution:

There are two main components to the solution.

  • Scripting storage policy stream modification via task scheduler (MVP JAMS in our case).
  • CommVault introducing Global Tape Policies in v11
    • This allows tapes to be shared, no longer residing on an island as mentioned above. So storage policy 1, 2, 3 and 4 can all share the same tape.  Way more efficient.

In our case, when we saw the global tape policy, it was like a halo of light and angels singing going off in our heads. This was it, our problems were FINALLY solved. After going through the very tedious task of migrating to this solution, we found that we were still using 4 – 6 tapes a day more than we needed. The problem was not that data was not co-mingling, it was. No, the problem was that we set the global tape policy to 6 streams, and every day it was using 6 tape drives for backups. At first we tried to solve the problem by limiting the aux copy streams via a scheduled task in CV (start the job with 1 stream only, as an example), but we had 4 storage policies, so that only reduced the tape usage to 4. The problem again was that each storage policy was scheduled and run in parallel. So while we restricted any one storage policy, ultimately we were still letting more tape drives be used than needed, and in turn more tapes than needed. We had set 6 streams because we wanted to make sure that our FULL jobs had enough tape drives to complete over the weekend.

At this stage, I came to the conclusion that we needed a way to dynamically control the streams for the global tape policy, so that during the week days it was restricted to 1 tape drive (all we needed) and on the weekend, we could start out with 6 and slowly ramp back down to 1, and hopefully fill our tapes more fully. With a bit of research and some discussions with CV, I found out that they have a CLI option for controlling storage policy streams (found at https://documentation.commvault.com/commvault/v10/article?p=features/storage_policies/storage_policy_xml_edit.htm). Using my trusty scheduling tool, I set up a basic system where on Sunday @4PM we set the streams to “1”, on Friday @4PM we raise them to “6”, and on Saturday @7am we drop them to “2”. This basically solved our problem, and I’m happy to say that on week days, tapes are filled as much as is possible (1 – 2 tapes depending on which client ran a full), and on the weekend, 2 – 4 tapes are still being used. I’m still tuning the whole thing for the fulls (it’s a balance of utilization and performance), but it’s better than it’s ever been. It’s also worth noting that we went back and modified our aux copy schedules and told them to use all available streams, since we now choke it at the global tape policy. This allows any storage policy to go as fast as possible (although potentially blocking other ones).
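
The schedule itself boils down to a lookup from day and time to a stream limit. Here is a minimal Python sketch of that lookup using the three change points described above. The names are hypothetical, and the step that actually pushes the value to CV (the XML based CLI call from the linked documentation) is deliberately left out rather than guessed at.

```python
from datetime import datetime

# (weekday, hour, minute, streams); weekday follows Python's convention, Monday=0.
SCHEDULE = [
    (4, 16, 0, 6),  # Friday 4PM: open up all tape drives for the fulls
    (5, 7, 0, 2),   # Saturday 7AM: start ramping down
    (6, 16, 0, 1),  # Sunday 4PM: back to a single drive for the week
]


def minutes_into_week(weekday, hour, minute):
    """Minutes elapsed since Monday 00:00."""
    return (weekday * 24 + hour) * 60 + minute


def stream_limit(now):
    """Return the stream limit that should be in effect at the given time."""
    current = minutes_into_week(now.weekday(), now.hour, now.minute)
    changes = sorted((minutes_into_week(d, h, m), s) for d, h, m, s in SCHEDULE)
    # Before the first change of the week, last week's final change still applies.
    limit = changes[-1][1]
    for start, streams in changes:
        if start <= current:
            limit = streams
    return limit


# A Friday at 4PM, right when the weekend window opens.
print(stream_limit(datetime(2017, 3, 10, 16, 0)))
```

A scheduler job could call something like stream_limit() on each run and only issue the CLI change when the returned value differs from what is currently configured. Keeping the change points in one table also makes it easy to tune the weekend ramp without touching the logic.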

It’s a hack no doubt.  IMO, CV should develop this concept in their storage policies.  Basically creating a schedule window to dynamically control the queue depth.  For now, this is working well.

Problem Solving: Chasing SQL’s Dump

The Problem:

For years as an admin I’ve had to deal with SQL. At a former employer, our SQL environment / databases were small, and backup licensing was based on agents, not capacity. Fast forward to my current employer: we have a fairly decent sized SQL environment (60 – 70 servers), our backups are large, licensing is based on capacity, and we have a full time DBA crew that manage their own backup schedules and prefer that backups are managed by them. What that means is dealing with a ton of dumps. Read into that as you want 🙂

When I started at my current employer, the SQL server backup architecture was kind of a mess. To begin with, there were about 40 – 50 physical SQL servers back then. So when you’re picturing all of this, keep that in mind. Some of these issues don’t go hand in hand with physical design limitations, but some do.

  • DAS was used for not only storing the SQL log, DB and index, but also backups. Sometimes if the SQL server was critical enough, we had dedicated disks for backups, but that wasn’t typical. This of course is a problem for many reasons.
    • Performance for not only backups but the SQL service itself was often limited because they were sharing the same disks. So when a backup kicked off, SQL was reading from the same disks it was attempting to write to. This wasn’t as big of an issue for the few systems that had dedicated disks, but even there, sometimes they were sharing the same RAID card, which meant you’re still potentially bottlenecking one for the other.
    • Capacity was spread across physical servers. Some systems had plenty of space and others barely had enough. Islands are never easy to manage.
    • If that SQL server went down, so did its most recent backups. TL backups were also stored here (shudders).
    • Being a dev shop meant doing environment refreshes. This meant creating and maintaining share / NTFS permissions across servers. This by itself isn’t inherently difficult if it’s thought out ahead of time, but it wasn’t (not my design).
    • We were migrating to a virtual environment, and that virtual environment would be potentially vMotioning from one host to another. DAS was a solution that wouldn’t work long term.
  • The DBA’s managed their backup schedules, so it required us all to basically estimate the best time to pick up their DB’s. Sometimes we were too early and sometimes we could have started sooner.
  • Adding to the above points, if we had a failed backup overnight, or a backup that ran long, it had an effect on SQL’s performance during production hours. This put us in a position of choosing between giving up on backing some data up, or having performance degradation.
  • We didn’t know when they did fulls vs diffs. Which means we might be storing their DIFF files on what we considered “full” backup tapes. By itself that’s not an issue, except for the fact that we did monthly extended fulls, meaning we kept the first full backup of each month for 90 days. If the file we’re keeping is a diff file, that doesn’t do us any good. However, you can see below why it wasn’t as big of an issue in general.
  • Finally, the problem that I contended with on top of all of these is that because they were just keeping ALL files on disk in the same location, every time we did a full backup, we backed EVERYTHING up. Sometimes that was 2 weeks’ worth of data: TL’s, diffs and fulls. This meant we were storing their backup data multiple times over on both disk and tape.

I’m sure there’s more than a few of you out there with similar design issues.  I’m going to lay out how I worked around some of the politics and budget limitations.  I wouldn’t suggest this solution as a first choice, it’s really not the right way to tackle it, but it is a way that works well for us, and might for you.  This solution of course isn’t limited to SQL.  Really anything that uses a backup file scheme could fit right into this solution.

The solution:

I spent days’ worth of my personal time while jogging, lifting, etc. just thinking about how to solve all these problems.  Some of them were easy and some of them would be technically complex, but doable.  I also spent hours with our DBA team collaborating on the rough solution I came up with, and honing it to work for both of us.

Here is basically what I came to the table with wanting to solve:

  • I wanted SQL dumping to a central location, no more local SQL backups.
  • The DBAs wanted to simplify permissions for all environments to make DB refreshing easier.
  • I wanted to minimize or eliminate storing their backup data twice on disk.
  • I wanted them to have direct access to our agreed upon retention without needing to involve us for most historical restores.  Basically giving them self service recovery.
  • I wanted to eliminate backing up more data than we needed.
  • I wanted to know for sure when they were done backing up, and to know what type of backup they performed.

Honestly we needed the fix, as the reality was we were moving towards virtualizing our SQL infrastructure, and presenting local disk on SAN would be not only expensive, but also incredibly complex to contend with for 60+ SQL servers.

How we did it:

Like I said, some of it was an easy fix and some of it was more complex.  Let’s break it down.

The easy stuff:

Backup performance and centralization:

We bought an affordable backup storage solution.  At the time of this writing it was, and still is, Microsoft Windows Storage Spaces.  After making that mistake, we’re now moving on to what we hope is a more reliable and mostly simpler Quantum QXS (DotHill) SAN using all NL-SAS disks.  Point being, instead of having SQL dump to local disk, we set up a fairly high-performance file server cluster.  This gave us both high availability and, with the HW we implemented, very high performance as well.

New problem we had to solve:

Having something centralized means you also have to think about the possibility of needing to move it at some point.  Given that many processes would be written around this new network share, we needed to make sure we could move data around on the backend, update some pointers, and have things carry on without needing to make massive changes.  For that, we relied on DFS-N.  We had the SQL systems point at DFS shares instead of pointing at the raw share.  This is going to prove valuable as we move data very soon to the new SAN.

Reducing multiple disk copies and providing them direct access to historical backups:

The backup storage was sized to store ALL required standard retention, and we (SysAdmins) would continue managing extended retention using our backup solution.  For the most part this now means the DBAs have access to the data they need 99% of the time.  This solved the problem of storing the data more than once on disk, as we would no longer store their standard retention in CommVault, but instead rely on the SQL dumps they are already storing on disk (except extended retention).  They still get copied to tape and sent off site, in case you thought that wasn’t covered BTW.

Simplifying backup share permissions:

The DBAs wanted to simplify permissions, so we worked together and basically came up with a fairly simple folder structure.  We used the basic configuration below.

  • SQL backup root
    • PRD <-- DFS root / direct file share
      • example prd SQL server 1 folder
      • example prd SQL server 2 folder
      • etc.
    • STG <-- DFS root / direct file share
      • example stg SQL server 1 folder
      • etc.
    • etc.
  • Active Directory security group wise we set it up so that all prod SQL servers are part of a “prod” active directory group, all stage are part of a “stage” active directory group, etc.
  • The above AD groups were then assigned at the DFS root (Stg, prd, dev, uat) with the desired permissions.

With this configuration, it’s now as simple as dropping a SQL service account in one group, and it will automatically fall into the correct environment-level permissions.  In some cases it’s more permissive than it should be (prod has access to any prod server for example), but it kept things simple, and in our case, I’m not sure the extra security of per server / per environment really would have been a big win.

The harder stuff:

The only two remaining problems we had to solve were knowing what kind of backup the DBAs did, and making sure we were not backing up more data than we needed.  These were also the two most difficult problems to solve because there wasn’t a native way to do it (other than agent-based backup).  We had two completely disjointed systems AND processes that we were trying to make work together.  It took many miles of running for me to put all the pieces together, and it took a number of meetings with the DBAs to figure things out.  The good news is, both problems were solved by aspects of a single solution.  The bad news is, it’s a fairly complex process, but so far, it’s been very reliable.  Here’s how we did it.

 The DONE file:

Everything in the workflow is based on the presence of a simple file, what we refer to as the “done” file internally.  This file is used throughout the workflow for various things, and it’s the key to keeping the whole process working correctly.  Basically the workflow lives and dies by the DONE file.  The DONE file was also the answer to our knowing what type of backup the DBAs ran, so we could appropriately sync our backup type with them.

The DONE file follows a very rigid naming convention.  All of our scripts depend on this, and frankly naming standards are just a recommended practice (that’s for another blog post).

Our naming standard is simple:

%FourDigitYear%%2DigitMonth%%2DigitDay%_%24Hour%%Minute%_%JobName(usually the sql instance)%_%backuptype%.done

And here are a few examples:

  • Default Instance of SQL
    • 20150302_2008_ms-sql-02_inc.done
    • 20150302_2008_ms-sql-02_full.done
  • Stg instance of SQL
    • 20150302_2008_ms-sql-02stg_inc.done
    • 20150302_2008_ms-sql-02stg_full.done
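To make the convention concrete, here is a minimal Python sketch of how a done file name could be checked and pulled apart.  It is not the script we run; `parse_done_name` and `DoneFile` are hypothetical names, and it assumes the job name itself never contains an underscore, since the “_” is the field separator.

```python
import re
from datetime import datetime
from typing import NamedTuple

# date _ time _ job name _ backup type, e.g. 20150302_2008_ms-sql-02_full.done
DONE_PATTERN = re.compile(r"^(\d{8})_(\d{4})_([^_]+)_(full|inc)\.done$")


class DoneFile(NamedTuple):
    stamp: datetime    # date and 24-hour time taken from the name
    job_name: str      # usually the SQL instance
    backup_type: str   # "full" or "inc"


def parse_done_name(filename: str) -> DoneFile:
    """Validate a done file name and return its parts, or raise ValueError."""
    match = DONE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a valid done file name: {filename!r}")
    date_part, time_part, job_name, backup_type = match.groups()
    # strptime rejects impossible values such as month 13 or February 30.
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return DoneFile(stamp, job_name, backup_type)
```

Anything that does not match the pattern raises an error, which fits the “fail and get a human to look” approach used throughout the workflow.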
The backup folder structure:

Just as important as the done file is our folder structure.  Again, because this is a repeatable process, everything must follow a standard or the whole thing falls apart.

As you know we have a root folder structure that goes something like this “\\ShareRoot\Environment\ServerName”.  Inside the servername root I create four folders and I’ll explain their use next.

  • .\Servername\DropOff
  • .\Servername\Queue
  • .\Servername\Pickup
  • .\Servername\Recovery

Dropoff:  This is where the DBAs dump their backups initially.  The backups sit here and wait for our process to begin.

Queue:  This is a folder that we use to stage / queue the backups before the next phase.  Again I’ll explain in greater detail.  But the main point of this is to allow us to keep moving data out of the Dropoff folder to a temp location in the queue folder.  You’ll understand why in a bit.

Pickup:  This is where our tape jobs are configured to look for data.

Recovery:  This is the permanent resting place for the data until it reaches the end of its configured retention period.
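Since these four folders have to exist for every job, creating them is worth scripting.  A minimal sketch in Python, assuming the share layout described above; `create_job_folders` is a hypothetical helper and it does not set any permissions.

```python
from pathlib import Path

# The four per-server folders described above.
STAGE_FOLDERS = ("DropOff", "Queue", "Pickup", "Recovery")


def create_job_folders(share_root: Path, environment: str, server_name: str) -> list:
    """Create ShareRoot/Environment/ServerName plus the four stage folders."""
    server_root = share_root / environment / server_name
    created = []
    for name in STAGE_FOLDERS:
        folder = server_root / name
        # exist_ok makes the call safe to repeat for a server that is already set up.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling it with the environment share and a server name returns the four paths, which can then be fed into whatever job definitions need them.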

Stage 1: SQL side

Prerequisites:

  1. SQL needs a process that can check the DropOff folder for a done file, delete a done file and create a done file.  Our DBAs created a stored procedure with parameters to handle this, but you can tackle it however you want, so long as it can be executed in a SQL maintenance plan.
  2. For each “job” in SQL that you want to run, you’ll need to configure a “full” maintenance plan to run a full backup, and if you’re using SQL diffs, create an “inc” maintenance plan.  In our case, to try and keep things a little simple, we limited a “job” to a single SQL instance.

SQL maintenance plan work flow:

Every step in this workflow stops on an error; there is NO continuing or ignoring.

  1. First thing the plan does is check for the existence of a previous DONE file.
    1. If a DONE file exists, it’s deleted and an email is sent out to the DBAs and sysadmins informing them.  This is because it’s likely that a previous process failed to run.
    2. If a DONE file does not exist, we continue to the next step.
  2. Run our backup, whether it’s a full or inc.
  3. Once complete, we then create a new done file in the root of the DropOff directory.  This will either have a “full” or “inc” in the name depending on which maintenance plan ran.
  4. We purge backups in the Recovery folder that are past our retention period.

SQL side is complete.  That’s all the DBAs need to do.  The rest is on us.  From here you can see how they were able to tell us whether or not they ran a full via the done file.  You can also glean a few things about the workflow.

  1. We’re checking to see if the last backup didn’t process.
  2. We delete the done file before we start a new backup (you’ll read why in a sec).
  3. We create a new DONE file once the backups are done.
  4. We don’t purge any backups until we know we had a successful backup.
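Our DBAs did this with a stored procedure inside the maintenance plan, which I can’t reproduce here.  The ordering is the important part though, so here is a rough Python sketch of the same handshake.  Everything in it is an assumption for illustration: `run_sql_side`, the `run_backup` and `notify` callables, and using file modification time to decide what is past retention.

```python
import time
from pathlib import Path
from typing import Callable


def run_sql_side(dropoff: Path, recovery: Path, done_name: str,
                 run_backup: Callable[[], object],
                 notify: Callable[[str], object],
                 retention_days: int) -> None:
    """Stale-done check, backup, new done file, purge, strictly in that order."""
    # 1. A leftover done file means the previous run was never picked up.
    stale = sorted(dropoff.glob("*.done"))
    for path in stale:
        path.unlink()
    if stale:
        notify("Deleted stale done file(s): " + ", ".join(p.name for p in stale))

    # 2. Run the backup. An exception here stops everything below.
    run_backup()

    # 3. Only a finished backup gets a done file.
    (dropoff / done_name).touch()

    # 4. Purge old dumps last, so nothing is removed before a good backup exists.
    cutoff = time.time() - retention_days * 86400
    for path in list(recovery.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
```

Because nothing catches the exception from `run_backup`, a failed backup leaves no done file behind and purges nothing, which matches the stop-on-error rule.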
Stage 1: SysAdmin side

Our stuff is MUCH harder, so do your best to follow along and let me know if you need me to clarify anything.

  1. We need a stage 1 script created, and the stage 1 script will do the following in sequential order.
    1. It needs to know what job it’s looking for.  In our case with JAMS, we named our JAMS jobs based on the same pattern as the done file.  So when the job starts, the script reads information from the running job and basically fills in all the parameters like the folder location, job name, etc.
    2. The script looks for the presence of ANY done file in the job’s DropOff folder.
      1. If no done file exists, it goes into a loop and checks every 5 minutes (this minimizes slack time).
      2. If a done file does exist we…
        1. If there is more than one, we fail, as we don’t know for sure which file is correct.  This is a fail-safe.
        2. If there is only one, we move on.
    3. Splitting the done file name on the “_”, we make sure that it follows all our standards.  So for example, we check that the first split is a date, the second is a time, the third matches the job name in JAMS and the fourth is either an inc or full.  A failure in any one of these will cause the job to fail, and we’ll get notified to manually look into it.
    4. Once we verify the done file is good to go, we now have all we need to start the migration process.  So the next thing we do is use the date and time information to create a sub-folder in the Queue folder.
    5. Now we use robocopy to mirror the folder structure to .\Queue\Date_Time.
    6. Once that’s complete, we move all files EXCEPT the done file to the Date_Time folder.
    7. Once that’s complete, we then move the done file into said folder.
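The steps above can be sketched roughly like this in Python.  This is not our production script (that one is PowerShell driven by JAMS); the function names are hypothetical, the name validation from step 3 is left out, and it assumes DropOff and Queue sit on the same volume so a rename is enough to move a file.

```python
import time
from pathlib import Path
from typing import Optional


def find_single_done_file(dropoff: Path) -> Optional[Path]:
    """Return the one done file, None if there is none, fail if there are several."""
    done_files = sorted(dropoff.glob("*.done"))
    if len(done_files) > 1:
        raise RuntimeError(f"{len(done_files)} done files found, cannot tell which is correct")
    return done_files[0] if done_files else None


def wait_for_done_file(dropoff: Path, poll_seconds: float = 300.0) -> Path:
    """Check every five minutes (by default) until a done file shows up."""
    while True:
        done_file = find_single_done_file(dropoff)
        if done_file is not None:
            return done_file
        time.sleep(poll_seconds)


def stage_to_queue(dropoff: Path, queue: Path, done_file: Path) -> Path:
    """Move one backup set into Queue/<date>_<time>, done file last."""
    date_part, time_part = done_file.name.split("_")[:2]
    target = queue / f"{date_part}_{time_part}"
    target.mkdir(parents=True)  # fails if this set was already queued

    entries = list(dropoff.rglob("*"))
    # Mirror the directory structure first (the robocopy step in the text).
    for entry in entries:
        if entry.is_dir():
            (target / entry.relative_to(dropoff)).mkdir(parents=True, exist_ok=True)
    # Move every file except the done file.
    for entry in entries:
        if entry.is_file() and entry != done_file:
            entry.rename(target / entry.relative_to(dropoff))
    # The done file moves last, so stage 2 never sees a half-populated folder.
    done_file.rename(target / done_file.name)
    return target
```

Moving the done file last is what lets stage 2 treat “a done file is present” as “this folder is complete”.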

And that completes stage 1.  So now you’re probably wondering, why wouldn’t we just move that data straight to the pickup folder? A few reasons.

  • When the backup to tape starts we want to make sure no new files are  getting pumped into the pickup folder.  You could say well just wait until the backup’s done before you move data along. I agree and we sort of do that, but we do it in a way that keeps the pickup folder empty.
    • By moving the files to a queue folder, if our tape process is messed up (not running) we can keep moving data out of the DropOff folder into a special holding area, all the while still being able to keep track of the various backup sets (each new job would have a different date_timestamp folder in the queue folder).  Our biggest concern is missing a full backup.  Remember, if the SQL job sees a done file, it deletes it.  We really want to avoid that if possible.
    • We ALSO wanted to avoid a scenario where we were moving data into a queue folder while the second stage job tried to move data out of the queue folder.  Again, by having an individual queue folder for each job, we can keep track of all the moving pieces and make sure that we’re not stepping on toes.

Gotcha to watch out for with moving files:

If you didn’t pick up on it, I mentioned that I used robocopy to mirror the directory structure, but I did NOT mention using it for moving the files.  There’s a reason for that.  Robocopy’s move parameter actually does a copy + delete.  As you can imagine with a multi-TB backup, this process would take a while.  I built a custom “move-files” function in PowerShell that does a similar thing, and in that function I use the “Move-Item” cmdlet, which is a simple pointer update as long as the source and destination are on the same volume.  MUCH faster as you can imagine.
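To show the difference in a language-neutral way, here is a small Python sketch of a rename-based move.  It is an illustration, not my PowerShell function; `move_by_rename` is a hypothetical name.  On the same volume a rename only updates the directory entry, so the data blocks are never rewritten.

```python
import os
from pathlib import Path


def move_by_rename(src: Path, dest: Path) -> None:
    """Move a file by renaming it; the data itself is never copied."""
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    dest.parent.mkdir(parents=True, exist_ok=True)
    # os.rename raises OSError if src and dest are on different volumes,
    # instead of silently falling back to a slow copy + delete.
    os.rename(src, dest)
```

Failing on a cross-volume move is deliberate here: if the folders ever end up on different volumes, you want to find out from an error, not from a job that suddenly takes hours.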

Stage 2: SysAdmin Side

We’re using JAMS to manage this, and with that, this stage does NOT run, unless stage 1 is complete.  Keep that in mind if you’re trying to use your own work flow solution.

Ok, so at this point our pickup directory may or may not be empty.  That doesn’t matter.  What does matter is that we should have one or more jobs sitting in our .\Queue\xxx folder(s).  What you need next is a script that does the following.

  1. When it starts, it looks for any “DONE” file in the queue folder, basically doing a recursive search.
    1. If one or more files are found, we do a foreach loop over the done files found and…
      1. Mirror the directory structure using robocopy from Queue\date_time to the Pickup folder.
      2. Move the backup files to the Pickup folder.
      3. Move the done file to the Pickup folder.
      4. Confirm that Queue\date_time is empty and delete it.
      5. ***NOTE:  Notice how we look for a DONE file first.  This allows stage 1 to be populating a new Queue sub-folder while we’re working on this stage without inadvertently moving data that’s in use by another stage.  This is why there’s a specific order to when we move the done file in each stage.
    2. If NO done files are found, we assume maybe you’re recovering from a failed step and continue on to…
  2. Now that all files (dumps and done) are in the Pickup folder we…
    1. Look for all done files.  If any of them are full, the job will be a full backup.  If we find NO fulls, then it’s an inc.
    2. Kick off a backup using CommVault scripts.  Again, parameters such as the path, client, subclient, etc. are all pulled from JAMS in our case or already present in CommVault.  We use the job type determined in step 2.1 to decide what we’ll execute.  Again, this gives the DBAs the power to control whether a full backup or an inc is going to tape.
    3. As the backup job is running, we’re constantly checking the status of the backup, about once a minute, using a simple “while” statement.  If the job fails, our JAMS solution will execute the job two more times before letting us know and killing the job.
    4. If the job succeeds, we move on to the next step.
  3. Now we follow the same moving procedure we used above, except this time, we have no queue\date_time folder to contend with.
    1. Move the backup files from Pickup to the Recovery folder.
    2. Move the done files.
    3. Check that the Pickup folder is empty.
      1. If yes, we delete and recreate it.  Reason?  Simple, it’s the easiest way to deal with a changing folder structure.  If a DBA deletes a folder in the DropOff directory, we don’t want to continue propagating a stale object.
      2. If not, we bomb the script and request manual intervention.
  4. If all that works well, we just completed our backup process.
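Two pieces of stage 2 are easy to show in a few lines: deciding the job type from the done files, and the “check about once a minute” loop.  This is a hedged Python sketch, not our script.  The CommVault call itself is left out on purpose; `get_status` is a hypothetical callable standing in for whatever your backup product exposes, and the status strings are made up.

```python
import time
from pathlib import Path
from typing import Callable


def decide_backup_type(pickup: Path) -> str:
    """Any full done file makes the tape job a full; otherwise it is an inc."""
    types = {p.stem.rsplit("_", 1)[-1].lower() for p in pickup.rglob("*.done")}
    return "full" if "full" in types else "inc"


def wait_for_job(get_status: Callable[[], str], poll_seconds: float = 60.0) -> None:
    """Poll the backup job until it completes; raise if it reports failure."""
    while True:
        status = get_status()
        if status == "completed":
            return
        if status == "failed":
            # Retrying is left to the scheduler (JAMS in our case).
            raise RuntimeError("backup job failed")
        time.sleep(poll_seconds)
```

Raising on failure rather than retrying inside the script keeps the retry policy in one place, the scheduler.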

Issues?

You didn’t think I was going to say it was perfect did you?  Hey, I’m just as hard on myself as I am on vendors.  So here is what sucks with the solution.

  1. For the longest time, *I* was the only one that knew how to troubleshoot it.  After a bit of training, and running into issues, my team is mostly caught up on how to troubleshoot.  Still, this is the issue with home-brewed solutions, and being entirely scripted doesn’t help.
  2. Related to the above, if I leave my employer, I’m sure the script could be modified to serve other needs, but it’s not easy, and I’m sure it would take a bit of reverse engineering.  Don’t get me wrong, I commented the snot out of the script, but that doesn’t make it any easier to understand.
  3. It’s tough to extend.  I know I said it could be, but really, I don’t want to touch it unless I have to (other than parameters).
  4. When we do UAT refreshes, we need to disable production jobs so the DBAs have access to the production backups for as long as they need.  It’s not the end of the world, but it requires us to now be involved at a low level with development refreshes, whereas before there wasn’t any involvement on our side.
  5. We’ve had times where full backups have been missed tape side.  That doesn’t mean they didn’t get copied to tape, rather they were considered an “inc” instead of being considered a “full”.  This could easily be fixed simply by having the SQL stored procedure check whether the done file that’s about to be deleted is a full backup and, if so, replace it with a new full DONE file, but that’s not the way it is now, and that depends on the DBAs.  Maybe in your case, you can account for that.
  6. We’ve had cases where the DBAs do a UAT refresh and copy a backup file to the recovery folder manually.  When we go to move the data from the pickup folder to the recovery folder, our process bombs because it detects that the same file already exists.  Not the end of the world for sure, easy enough to troubleshoot, but it’s not seamless.  An additional workaround to this could be to do an md5 hash comparison.  If the file is the same, just delete it out of the pickup directory and move on.
  7. There are a lot of jobs to define and a lot of places to update.
    1. In JAMS we have to create 2 jobs + a workflow that links them per SQL job
    2. In CommVault we have to define the sub-client and all its settings.
    3. On the backup share, 4 folders need to be created per job.
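The md5 workaround from issue 6 is something we have not built, but a sketch of the idea in Python could look like this.  `file_md5` and `move_or_dedupe` are hypothetical names, and the move assumes Pickup and Recovery are on the same volume.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so multi-GB dumps never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_or_dedupe(src: Path, dest: Path) -> str:
    """Move src to dest; if an identical file is already there, drop src instead."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        src.rename(dest)
        return "moved"
    if file_md5(src) == file_md5(dest):
        src.unlink()
        return "duplicate-removed"
    # Same name but different content still needs a human to look at it.
    raise FileExistsError(f"{dest} exists with different content")
```

Keep in mind that hashing reads both files in full, so on multi-TB dumps this check is slow.  It only runs when a name collision actually happens, which in our experience is rare.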

Closing thoughts:

At first glance I know it’s REALLY convoluted looking.  A Rube Goldberg machine for sure.  However, when you really start digging into it, it’s not as bad as it seems.  In essence, I’m mostly using the same workflow multiple times and simply changing the source / destination.  There are places, for example when I’m doing the actual backup, where there’s more than the generic process being used, but it’s pretty repetitive otherwise.

In our case, JAMS is a very critical piece of software to making this solution work.  While you can do this without the software, it would be much harder for sure.

At this point, I have to imagine that you’re wondering if this is all worth it.  Maybe not to companies with deep pockets.  And being honest, this was actually one of those processes that I did in house and was frustrated that I had to do it.  I mean really, who wants to go through this level of hassle, right?  It’s funny, I thought THIS would be the process I was troubleshooting all the time, and NOT Veeam.  However, this process for the most part has been incredibly stable and resilient.  Not bragging, but it’s probably because I wrote the workflow.  The operational overhead I invested saved a TON of capex.  Backing up SQL natively with CommVault has a list price of 10k per TB, before compression.  We have 45TB of SQL data AFTER compression.  You do the math, and I’m pretty sure you’ll see why we took the path we did.  Maybe you’ll say that CommVault is too expensive, and to some degree that’s true, but even if you’re paying 1k per TB, and you’re being pessimistic and assuming that 45TB = 90TB before compression, I saved 90k + 20% maintenance each year.  And CommVault doesn’t come anywhere close to being as cheap as 1k per TB, so really, I saved a TON of bacon with the process.
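If you want to actually do the math, here it is as a tiny Python sketch.  The inputs are the numbers quoted above; the 2:1 compression ratio is the same guess (45TB = 90TB), and treating maintenance as 20% of the license price per year is my reading of how that fee works, so adjust both for your own quote.

```python
def license_cost(tb_before_compression: int, price_per_tb: int,
                 maintenance_percent: int = 20) -> tuple:
    """Return (one-time license price, yearly maintenance) in dollars."""
    license_price = tb_before_compression * price_per_tb
    yearly_maintenance = license_price * maintenance_percent // 100
    return license_price, yearly_maintenance


compressed_tb = 45
assumed_ratio = 2                        # guess: 45TB compressed = 90TB raw
raw_tb = compressed_tb * assumed_ratio

at_1k_per_tb = license_cost(raw_tb, 1_000)
at_list_price = license_cost(raw_tb, 10_000)
print(f"at 1k/TB:  license ${at_1k_per_tb[0]:,}, maintenance ${at_1k_per_tb[1]:,} per year")
print(f"at 10k/TB: license ${at_list_price[0]:,}, maintenance ${at_list_price[1]:,} per year")
```

At the hypothetical 1k per TB that works out to 90k up front and 18k per year; at the 10k list price it is ten times that.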

Besides the cost factor, it’s also enabled us to have a real grip on what’s happening with SQL backups.  Before, it was this black box that we had no real insight into.  You could contend that’s a political issue, but then I suspect lots of companies have political issues.  We now know that SQL ran a full backup 6 days ago.  We now have our backup workflow perfectly coordinated.  We’re not starting too early, and we’re kicking off within 5 minutes of them being done, so we’re not dealing with slack time either.  We’re making sure that our backup application + backup tape is being used in the most prudent way.  Best of all, our DBAs now have all their dump files available to them, their environment refreshes are reasonably easy, the backup storage is FAST, and we have backups centralized and not stored with the server.  All in all, the solution kicks ass in my not so humble opinion.  Would I have loved to do CommVault natively?  For sure, no doubt it’s ultimately the best solution, but this is a compromise that allowed us to continue using CommVault, save money and accomplish all our goals.