Tag Archives: SQL

Review: 5 years virtualizing Microsoft SQL Server


I know what you’re thinking, it’s 2017, why are you writing about virtualizing Microsoft SQL?  Most are doing it after all.  And even if they’re not, there’s this whole SQLaaS thing that’s starting to take off, so why would anyone care?  Well I’m writing this as more of a reflection on virtualizing SQL.  What works well, what doesn’t, what lessons I’ve learned, what I’m still learning, etc.

Like most things on the internet, I find that folks tend to share all the good, without sharing any of the bad (or vice versa).  There’s also just a lot of folks out there saying they’ve done it, without quantifying how well it’s working.  Sure, I’ve seen the cranky DBA say it’s the worst thing to happen, and I’ve seen the sysadmins say it’s the best thing that they ever did.  I find both types of feedback to be mostly useless, as they’re all missing context and depth.  This post is going to follow my typical review style, so I’ll outline things like the specs, the pros and cons, and share some general thoughts.


When I first started at ASI, I was told we’d never virtualize SQL.  It was the un-virtualizeable workload.  That was roughly six and a half years ago.  Fast forward to today, and we’ve been running a primarily virtualized SQL environment for close to five years.  It took a bit of convincing on my side, but this is basically how I convinced ASI to virtualize SQL.

  • Virtualizing SQL (and other big iron) was gaining a lot of popularity back in 2012
  • I had just completed my first successful POC of virtualizing a lot of other workloads at ASI.
  • We were running SQL on older physical systems and they were running adequately. The virtual hosts I was proposing were at the time two generations newer processor wise.  Meaning, if it was running ok on this dinosaur HW, it should run even better on this newer processor, regardless of whether it was virtual or not.
  • I did a ton of research, and a lot of political marketing / sales. Basically, compiling a list of things virtualization was going to fix in our current SQL environment.  Best of all, I was able to point at my POC as proof of these things.  For example, we had virtualized Exchange, and Exchange was a pretty big iron system that was running well. Many of the things I laid out as pros, I could point to Exchange as proof.

Basically, it was proposed as a win / win solution.  It wasn’t that I didn’t share the cons of virtualizing SQL, it was that I wasn’t as familiar with the cons until after virtualizing SQL.  This is going back to that whole lack of real-world feedback issue.  I brought up things like there would be some performance overhead, troubleshooting would be more difficult, and some of the more well-known issues.  But there was never a detailed list of gotcha’s.  No one that I was aware of had virtualized BIG SQL servers in the real world and then shared their experience in great detail.  Sure, I saw DBA’s complain a lot, but most of it was FUD (and still is).

Anyway, the point is, we did a 180, and went from not virtualizing any SQL, to virtualizing any and all SQL with the exception of one platform (more on that later).

The numbers and specs:

Bare in mind, this was five years ago, these were big numbers back then.

  • VMware cluster comprised of seven Dell r820’s
    • 32 total cores (quad socket 8 cores per)
    • 768GB of RAM
    • Quad 10gb networking
      • Two for the storage network
      • Two for all other traffic
    • Fusion IO drive 2 card
    • Fusion IO Io Turbine cache acceleration.
    • VMware ESXi 5.x – 6.x (over time)
  • Five Nimble cs460 SANs
  • Dual Nexus 5596 10Gb switches
  • Approximately 80 SQL servers (peak)
    • 20 – 30 of which were two node clusters
    • Started with Windows 2012 R1 + SQL 2012
    • Currently running Windows 2012 R2 + SQL 2014 and moving on to Windows 2016 + SQL 2017

To summarize, we have a dedicated VMware cluster for production SQL systems and another cluster (not detailed) for non-production workloads.  It didn’t start out that way, more on that later.


No surprise, but there are a lot of advantages to virtualizing SQL that even after five years I still think holds true.  Let’s dig into it.

  • The ability to expand resources with minimal disruption. I’m not talking about anything hot-add here, simply the fact that you can add resources.  In essence, give you the ability to right size each SQL server.
  • Through virtualization, you can run any number of OS + SQL version combinations that you need. Previously there was all kinds of instance stacking, OS + SQL version lag.  With virtualization if we want a specific OS + SQL combo, we spool up a new VM and away we go.
  • Virtualization made it easy for us to have a proper dev, stage, UAT and finally production environment for all systems. Before these would have been instances on existing SQL servers.
  • Physical hardware maintenance is mostly non-disruptive. By being able to easily move workloads (scheduled) to different physical hosts, we’re able to perform maintenance without risking data loss.  There’s also the added benefit that there’s basically no firmware or driver updates (other than VMware tools / version) to apply in the OS its self.  This make maintenance a lot easier for the SQL server its self.
  • Related to the above, hardware upgrades are as easy as a shutdown / power on. There’s no need to re-install and re-configure SQL on a new system.
  • We were able to build SQL VM’s for specific purposes rather than trying to co-mingle a bunch of databases on the same SQL server. Some might say six of one and half a dozen of another, but I disagree.  By making a specific SQL server virtual, it enabled us to migrate that workload to any number of virtual hosts.
  • With enterprise licensing, we could build as many SQL systems as we wanted within the confines of resources.
  • Migrating SQL data from one storage location to another was easy, but I won’t go so far as saying non-disruptive. Doing that on a physical SQL server, requires moving the data files manually.  With VMware, we just moved the virtual disk.
  • Better physical host utilization. This is a double-edged sword, but we’ve been able to more fully utilize our physical HW than before.  When you consider how much SQL licensing costs, that’s a pretty big deal.
  • Redundancy for older OS versions. Before Windows 2012, there was no official support for NIC teaming.  You could do it, but Microsoft wouldn’t support it.  With VMware, we had both NIC redundancy and host redundancy.  In a non-clustered SQL server, VMware’s HA could kick in as a backup for host failures.

Pretty much, all the standard pros you’d expect from a virtual environment, and a few SQL specific ones.


This is a tough one to admit, but there are a TON of cons to virtualizing SQL if a sysadmin has to deal with it at scale.

  • Troubleshooting just got tougher with a SQL. VMware will now always be suspect for any and all issues.  Some of it is justified, a lot of it not.  Still, trying to prove it’s not a VMware issue is tough.  You’re no longer simply looking at the OS stats, now you have to review the VM host stats, check for things like co-stop, wait, busy, etc.  Were there any noisy neighbors, anything in the VMware logs, etc.
  • Things behave have differently in a virtual world. In a physical world, “stuns” or “waits” don’t happen.  This is related to the above, but basically, for every simplicity that virtualization adds, it at least matches it with an equal or greater complexity.
  • The politics, OH the politics of a virtual SQL environment. If you don’t’ have a great relationship with your SQL team, I would say, don’t virtualize SQL.  It’s just not worth the pain and agony you’re going to go through.  It will only increase finger pointing.
  • DBA’s in charge of sizing VM’s on a virtual host you’re in charge of supporting. This is related to politics, but basically now that DBA’s know they can expand resources, you can bet your hind end your VM’s will get bigger and almost never shrink (we’ve gotten some resources back, so kudos to our DBA’s).  It doesn’t matter if you explain NUMA concerns, co-stop, etc.  It’s nothing more than “I want more CPU” or “I want more memory”.  Then a week later, when you have VM’s stepping on each other’s toes, it will be finger pointing back at you again.  I think what’s mostly happening here, is the DBA’s are focused on the individual server performance, whereas its difficult to convey the multi-server impact.
  • vMotion (host or storage) will cause interruptions. In a SQL cluster, you will have failovers.  At least that’s my experience.  Despite what VMware puts on their matrix, DON’T plan on using DRS.  Even if you can get the VM’s to migrate without a failover, the applications accessing SQL will slow down.  At least if your SQL VM’s are a decent size.  This was probably the number one disappointment with our SQL environment.
    • Once you can’t rely on DRS, managing VM’s across different hosts becomes a nightmare. You’ll either end up in CPU overload, or memory ballooning. I’ve never seen memory ballooning before virtualizing SQL, and that’s the last application you want to see ballooning and swapping.
    • Since you can’t vmotion VM’s to rebalance the cluster without causing disruptions (save for maybe non-clustered VMs) just keep piling on the struggle.
  • SQL VMware hosts are EXPENSIVE at least when you’re running a good number of big VM’s like we are. We actually maxed out our quad socket servers from a power perspective.  Even if we wanted to do something like add memory it’s not an option.  And when you want to talk about swapping in new hosts, it’s not some cheap 30k host, no it’s a host that probably costs close to 110k if not more.  Adding to that, you’re now tasked with trying to determine if you should stay with the same number of CPU cores, or try to make a case for more CPU cores, which now add SQL licensing costs.

I could probably keep going on, but the point is virtualizing SQL isn’t all sunshine and roses like it is for other workloads.

Lessons learned:

I’m thankful to have had this opportunity, because it’s enabled me to experience first-hand what it’s like virtualizing SQL in a shop where SQL is respectably large and critical.  In this time, I’ve learned a number of things.

  • DRS + SQL clusters = no go. Maybe it works for you and your puny 4 vCPU / 16GB VM, but for one of our vm’s with 24 vCPU and 228GB of RAM, you will cause failovers.  And no DBA wants a failover.
    • Actually DRS + any Windows cluster = no go, but that’s for another post.
  • If I had to do it over again, I would have gotten Dell r920’s instead of 820’s. While both were quad socket, I didn’t realize or appreciate the scalability difference between the 4600 and 8600 series xeons.  If I was building this today, I would go after hosts that are super dense.  Rather than relying on a scale out technique, I’d shoot for a scale up approach.  Most ideal would be something like the HPe SuperDome, but even getting a new M series Xeons with 128GB DIMMS would be a wise choice.  In essence, build a virtual platform just like you would a physical one So if you normally would have had three really big hosts, do the same in VMware.
  • Accept the fact that SQL VM’s are going to be larger than you think they should be. Some of this being fair is SysAdmins think they understand SQL, and we don’t.  There’s a lot more to SQL than CPU utilization.  For example, I’ve seen SQL queries that only used 25% of every CPU core they were running on, but the more vCPUs we allocated to the VM, the faster that query ran.  It was the oddest thing I had ever seen, but it also wasn’t the only application I’ve seen like this.  Likely, a disk bottleneck issue, or at least that’s my guess.
  • Just give SQL memory and be done with it. When we virtualized our first SQL cluster, the one thing we noticed was the disk IO on our SAN (and FusionIO card) was pretty impressive.  At first, it’s pretty cool to see 80k IOPS from a real workload, but then when you hear the DBA’s saying, “it’s slow” and you realize that if every SQL server you add needs this kind of disk IO, you’re going to run out of IOPS in no time.  We added something like 64GB of more memory to those nodes, and the disk IO went from 80k to 3k and the performance from the DBA’s perspective was back to what they expected.  There’s no replacement for memory.
  • Virtualizing SQL is complex. While it CAN be as simple as what you’re used to doing, once you start adding clustering, and managing a lot of monster VM’s on the same cluster, it’s a different kind of experience than you’re used to.  To me, it’s worth investing in VMware log insight for your SQL environment to make it easier to troubleshoot things.  I would also add ops manager as another potential value add.  At least these are things I’m thinking of pushing for.
  • Keep your environment as simple as possible. We started out with Fusion IO cards + Fusion IO caching software.  All that did was create a lot of headache, and once we increased the RAM in SQL, the disk bottleneck went away (mostly).  I could totally see using an Intel NVMe (or 3dxpoint) card for something like TempDB.  However, I would put the virtual disk on the drive directly, not use any sort of caching solution.
  • I would have broken our seven node cluster up into two or three two node clusters. This goes back to treating them like they’re physical servers.  Again, scaling up, much better, but if you’re going to use more, but smaller hosts, treat them like they’re physical.
    • We kind of do this now. Node 1’s on odd hosts, node 2’s on even hosts
  • We found that we ultimately didn’t need Vmware’s enterprise plus. We couldn’t vmotion, or use DRS, and the distributed switch was of little value, so we converted everything to standard edition.  Now, I have no clue what would happen if we wanted Ops Manager.  It used to be a la carte, but I’m not so sure anymore.
  • We originally had non-prod and prod on the same cluster. We eventually moved all of non-prod off.  This provided a little more breathing room, and now we have two out of seven hosts free to use for maintenance.  Before, they were partially consumed with non-prod SQL VM’s.
  • We made the mistake of starting with virtualizing big SQL severs and learning about Microsoft clustering + AlwaysOn Availability Groups at the same time. Not recommended J.  I don’t think it would have been easy to learn the lessons we did, even if it was difficult.
  • Just because VMware says something will work, doesn’t mean it will. I quadruple checked their clustering matrix and recommended practices guides.  We were doing everything they recommended and our clusters still failed over.
  • Big VM’s don’t behave the same way as little VM’s. I know it sounds like a no duh, but it’s really not something you think about.  This is especially true when it comes to vMotion or even trying to balance resources (manually) on different hosts.  You never realize how much you really appreciate DRS.
  • I’ve learned to absolutely despise Microsoft clustering when it’s virtualized. It just doesn’t behave well.  I think MS clustering is built for a physical world, where there are certain assumptions about how the host will react.  For the record, our physical SQL cluster is rock solid.  All our issues typically circle back to virtualization.
    • BTW, yes, we’ve tried tuning the subnet failover thresholds, no it doesn’t work, and no I can’t tell you why.
  • We’ve learned that VMware support just isn’t up to par, and that you’re really playing with fire if you’re virtualizing complex workloads like SQL. We can’t afford mission critical support, so maybe that’s what we need, but production support is basically useless if you need their help.
  • Having access to Microsoft’s premier support would be very beneficial in this environment. It’s probably something we should have insisted on.


Do I recommend virtualizing SQL?  I would say it depends, but mostly yes.  There are certainly days where I want to go back to physical, but then I think about all the things I would miss with our virtual environment.  And I’m sure if you asked our DBA’s, they too would admit to missing some of the pros as well.  Here are my final thoughts.

I would say if you’re a shop that has a lot of smaller SQL servers, and they’re non-clustered, virtualization is a no-brainer.  When SQL is small, and non-clustered, it mostly behaves about the same as other VM’s.  We never have issues with our dev or stage systems, and they’re all on the smaller side and they’re all non-clustered.  Even with our UAT environment, we almost never have issues, even though they are clustered.

For us, it seems to be the combination of a clustered and large SQL server where things start getting sketchy.  I don’t want to make it sound like we’re dealing with failovers all the time.  We’ve worked through most of our issues, and for the most part, things are stable.  We occasionally have random failovers, which is incredibly frustrating for all parties, but they’re rare now a day.

My suggestion is, if you do want to virtualize large clustered SQL systems, treat them like they’re physical.  Here are a few rough recommendations:

  • Avoid heavy CPU oversubscription. Shoot for something like less than 3:1, and more ideal being less than 2:1
  • Size your VM’s so they fit in a NUMA node. That would have been impossible back in the day, but now a day, we could probably do this.  Maybe some of you though, this will still be an issue.  Our largest VM’s (so far) are only 24 vCPU, so we can fit in a single NUMA node on newer HW.
  • Don’t cluster in VMware period. No HA, no DRS.  Keep your hosts standalone and manage your SQL VM’s just like you would if they were physical.  Meaning, plan the VMware host to accommodate the SQL VM(s).
  • Don’t intermix non-SQL VM’s with these systems. We didn’t do this, but I wanted to point it out.
  • Plan on a physical host that can scale up its memory if needed.
  • When doing VMware host maintenance, failover your SQL listeners / clusters before migrating the VMs.
    • BTW, it’s typically faster to shutdown a VM then vMotion it while powered on at the sizes we’re dealing with.

Finally, I wanted to close by pointing out, that performance was never an issue in our environment.  In fact, things got faster when we moved to the newer HW + SAN.  One of the biggest concerns I used to see with virtualizing SQL was performance, and yet it was everything else that no one mentioned that ended up being the issues.

Hope this helps someone else who hasn’t taken the plunge yet or is struggling themselves.

Naming Conventions: SQL Server Names


If you’re working in a Windows environment like me, you have to deal with 15 character limitations (at least if you care about NETBIOS resolution).  Honestly, really cramps my style, but these conventions were written with that limitation in mind.  If you’re using this for a Linux based server running MySQL or PostgreSQL, you might be able to get a little more detailed.

Multi Environment / Cluster or standalone:

At my employer ASI, we have to contend with multiple applications, multiple environments for each application and in a lot of cases clusters.  Trying to come up with a name that works for them all can be tough, but I think I’ve got a few good ones depending on your use cases.

Here is an MSSQL server naming convention:


  • cmp-sqlps1-01
  • cmp-sqlus1-01
  • cmp-sqlss1-01
  • cmp-sqlds1-01
  • cmp-sqlps1-02
  • cmp-sqlps2-02
  • cmp-sqlus1-02

Let’s break it down:

  • cmp = company.  Use whatever prefix you want, doesn’t need to be all letters and doesn’t need to be three, but I wouldn’t go beyond three.
  • (_) underscores = separators
  • SQL = SQL server
    • Next letter = environment
      • P = Prod
      • U = UAT
      • S = Stage
      • D = Dev
    • Next letter = Clustered or standalone
      • S = Standalone
      • C = Cluster ***More on this later
    • Following number = the number of SQL servers for this environment and application.  If you have multiple stand alone servers for “app1” this allows you to accommodate them.
  • The final number is a number to represent the application.  Doesn’t matter what the number is, but once its assigned, your other environments should fall into the same number.  I can look at the last number and quickly go “oh that’s app 20”.  Ok, maybe not quickly, but its easy enough with a matrix.

You can quickly determine if the servers function, environment, is it a cluster and the apps.


Cluster management resource name

  • cmp-sqlpc-01
  • cmp-sqluc-01

Cluster Node Name:

  • cmp-sqlpcn1-01
  • cmp-sqlpcn2-01
  • cmp-sqlucn1-01
  • cmp-sqlucn2-01

Cluster application resource name

SQL AAG Name / SQL Listener Name
  • cmp-sqlpcdg1-01
  • cmp-sqlpcdg2-01
  • cmp-sqlucdg1-01
  • cmp-sqlucdg2-01
SQL Traditional Failover Cluster Name
  • cmp-sqlpcdi1-01
  • cmp-sqlucdi1-01

Let’s break it down:

SQL clusters require a lot of computer accounts / names, and having a naming convention helps to keep them grouped and organized.

  • CMP-SQLPC = we should already know this based on the previous explanation.  This part tells me its a SQL production cluster.
    • If it goes straight to the application number, that means this is the cluster management resource.
    • If it goes to N1 or N2 = that’s the node of that cluster.  N1 = node 1, N2 = Node 2.
    • If it goes dg1 = Database Group 1, or SQL AAG1 and it belongs to this cluster.
    • If it goes DI1 = Database Instance 1, or SQL failover instance 1.
  • The following number of course is the application we’re assigning.


That’s pretty much it, short and sweet.  It’s not designed to scale to massive levels, but I think it will work for most environments.


To see other naming conventions or posts about naming conventions, head over here.

SQL Query: Microsoft – WSUS – Computers Update Status

Sometimes the WSUS console, just doesn’t give you the info you need, or it doesn’t provide it in a format you want.  This query is for one of those examples.  This query can be used in multiple ways to show the update status of a computer, computers or computer in a computer target.  For me, I wanted to see the update status, without worrying about what non-applicable updates were installed.  I also, didn’t care about updates that I didn’t approve, which was another reason I wrote this query.

First off, the query is located here on my GitHub page.  As time allows, I plan to update the read me on that section with more filters as I confirm how accurate they are and what value they may have.

All of the magic in this query is in the “where statement”. That will determine which updates you’re concerned about, which computers, which computer target groups, etc.

To begin with, even with lots of specifics in the “where statement”, this is a heavy query. I would suggest starting with a report about your PC or a specific PC, before using this to run a full report. It can easily take in excess of 30 minutes to an hour to run this report if you do NOT use any filters, and you have a reasonably large WSUS environment.   It’s also worth noting, in my own messing around, I’ve easily run out of memory / temp db space (over 25GB of tempdb).  It has the potential to beat the crap out of your SQL DB server, so proceed with caution.  My WSUS DB is on a fairly fast shared SQL server, so your mileage may vary.

Let’s go over a few way’s to filter data. First, the computer name column would be best served by using a wildcard (“%” not “*”) at the end of the computer name. Unless you’re going to use the FQDN of the computer name.  In other words, use ‘computername%” or ‘computername.domain.com’

Right now, I’m only showing updates that are approved to be installed. That is accomplished by the Where Action = ‘Install’ statement.

The “state” column is one that can quickly let you get down to the the update status you care about. In the case of the one below, we’re showing the update status for a computer called “computername”, but not showing non-applicable updates.

Where Action = ‘Install’ and [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like ‘computername%’ and state != 1

if we only wanted to see which updates were not installed, all we’d need to do is the following. By adding “state !=4” we’re saying only show updates that are applicable, and not currently installed.

Where Action = ‘Install’ and [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like ‘computername%’ and state != 1 and state != 4

If you want to see the complete update status of a computer, excluding only the non-applicable updates, this will do the trick.  That said, its a BIG query and take a long time.  As in, go get some coffee, chat with your buds and maybe play a round of golf.  You might run out of memory too with this query depending on your SQL server.  In case you didn’t notice, I took out the “where Action = ‘Install'”.  As in show me any update that’s applicable, with any status, and any approval setting.

Where ‘Install’ and [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like ‘pc-2158%’ and state != 1

Play around your self and I think you’ll see it’s pretty amazing all the different reports your can create.  I would love to say the WSUS DB was easy to read / figure out, but IMO, its probably one of the more challenging DB’s I’ve figured out.  There are sometimes multiple joins needed in order to link together data that you’d think would have been in a flat table.  I suspect that, combined with missing indexes is part of the reason the DB is so slow.  I wish MS would simplify this DB, but I’m sure there’as a reason its designed the way it is.

SQL Query: Microsoft – WSUS – Computers to Computer Target

This is a simple query you can use map your computers to their various targets in a nice easy to export table.  I’d love to say the script is sexier than that, but its really not.  You can find the sql query here.

There is only one section worth mentioning because it can change the way a computer is mapped to a computer target.

Where [SUSDB].[dbo].[tbExpandedTargetInTargetGroup].[IsExplicitMember] = 1

Powershell Scripting: Invoke-ECSSQLQuery

Quick Powershell post for those of you that may on occasion want to retrieve data out of a SQL table via Powershell.  I didn’t personally do most of the heavy lifting in this, I simply took some work that various folks out there did and put it into a repeatable function instead.

Firstly, head over to here to my GitHub if you want to grab it.  I’ll be keeping it updated as change requests come in, or as I get new ideas, so make sure if you do use my function, that you check back in on occasion for new versions.

The two examples are below:

Syntax example for windows authentication:

Invoke-ECSSQLQuery -DatabaseServer “ServerNameonly or ServerName\Instance” -DatabaseName “database” -SQLQuery “select column from table where column = ‘3’”

Syntax example for SQL authentication:

Invoke-ECSSQLQuery -DatabaseServer “ServerNameOnly or ServerName\Instance” -DatabaseName “database” -SQLUserID “SA” -SQLUserPassword “Password” -SQLQuery “select column from table where column = ‘3’”

There is also an optional “timeout” parameter that can be used for really long running queries.  By default its 30 seconds, you can set it as high as you want, or specify “0” if you don’t want any timeout.

Problem Solving: Chasing SQL’s Dump

The Problem:

For years as an admin I’ve had to deal with SQL.  At a former employer, our SQL environment / databases were small, and backup licensing was based on agents, not capacity.  Fast forward to my current employer, we have a fairly decent sized SQL environment (60 – 70 servers), our backup’s are large , licensing is based on capacity, and we have a full time DBA crew that manage their own backup schedules, and prefer that backup’s are managed by them.  What that means is dealing with a ton of dumps.  Read into that as you want 🙂

When I started at my current employer, the SQL server backup architecture was kind of a mess.  To being with, where were then was about 40 – 50 physical SQL servers.  So when you’re picturing all of this, keep that in mind.  Some of these issues don’t go hand in hand with physical design limitations, but some do.

  • DAS was used for not only storage the SQL log, DB and index, but also backup’s.  Sometimes if the SQL server was critical enough, we had dedicated disks for backups, but that wasn’t typical.  This of course is a problem for many reasons.
    • Performance for not only backup’s but the SQL service its self were limited often because they were sharing the same disks.  So when a backup kicked off, SQL was reading from the same disks it was attempting to write to.  This wasn’t as big of an issue for the few systems that had dedicated disks, but even there, sometimes they were sharing the same RAID card, which meant you’re still potentially bottlenecking one for the other.
    • Capacity was spread across physical servers.  Some systems had plenty of space and others barely had enough.  Islands are never easy to manage.
    • If that SQL server went down, so did its most recent backup’s.  TL backup’s were also stored here (shudders).
    • Being a dev shop meant doing environment refreshes.  This meant creating and maintaining share / NTFS permissions across servers.  This by its self isn’t inherently difficult if its thought out ahead of time, but it wasn’t (not my design).
    • We were migrating to a virtual environment, and that virtual environment would be potentially vMotioning from one host to another.  DAS was a solution that wouldn’t work long term.
  • The DBA’s managed their backup schedules so it required us all to basically estimate when the best time to pickup their DB’s.  Sometimes we were too early and sometimes we could have started sooner.
  • Adding to the above points if we had a failed backup over night, or a backup that ran long, it had an effect on SQL’s performance during production hours.  This put us in a position of choosing between giving up on backing some data up, or having performance degradation.
  • We didn’t know when they did full’s vs diffs.  Which means, we might be storing thier DIFF files on what we considered “full” backup taps.  By its self not an issue, except for the fact that we did monthly extended fulls.  Meaning we kept the first full backup of each month for 90 days.  If that file we’re keeping is a diff file, that’ doesn’t do us any good.  However, you can see below, why it wasn’t as big of an issue in general.
  • Finally, the the problem that I contended with besides all of these, is that because they were just keeping ALL files on disk in the same location, every time we did a full backup, we backed EVERYTHING up.  Sometimes that was 2 weeks worth of data, TL’s, Diff’s and and Fulls.  This meant we were storing their backup data multiple times over on both disk and tape.

I’m sure there’s more than a few of you out there with similar design issues.  I’m going to lay out how I worked around some of the politics and budget limitation.  I wouldn’t suggest this solution as a first choice, its really not the right way to tackle it, but it is a way that works well for us, and might for you.  This solution of course isn’t limited to SQL.  Really anything that uses a backup file scheme could fit right into this solution.

The solution:

I spent days worth of my personal time while jogging, lifting, etc. just thinking about how to solve all these problems.  Some of them were easy and some of them would be technically complex, but doable.  I also spent hours with our DBA team collaborating on the rough solution I came up with, and honing it to work for both of us.

Here is basically what I came to the table with wanting to solve:

  • I wanted SQL dumping to a central location, no more local SQL backups.
  • The DBA’s wanted to simplify permissions for all environments to make DB refreshing easier.
  • I wanted to minimize or eliminate storing their backup data twice on disk.
  • I wanted them to have direct access to our agreed upon retention without needing to involve us for most historical restores.  Basically giving them self service recovery.
  • I wanted to eliminate backing up more data then we needed
  • I wanted to know for sure when they were done backing up and knowing what type of backup they performed.

Honestly we needed the fix, as the reality was we were moving towards a virtualizing our SQL infrastructure, and presenting local disk on SAN would be both expensive, but also incredibly complex to contend with for 60+ SQL servers.

How we did it:

Like I said, some of it was an easy fix, and some of it more complex, let’s break it down.

The easy stuff:

Backup performance and centralization:

We bought an affordable backup storage solution.  At the time of this writing it was and still is Microsoft Windows Storage Spaces.  After making that mistake, we’re now moving on to what we hope is a more reliable and mostly more simplistic Quantum QXS (DotHill) SAN using all NL-SAS disks.  Point being, instead of having SQL dump to local disk, we setup a fairly high performant file server cluster.   This gave us both high availability, and with the HW we  implemented, very high performance as well.

New problem we had to solve:

Having something centralized means you also have to think about the possibility of needing to move it at some point.  Given that many processes would be written around this new network share, we needed to make sure we could move data around on the backend, update some pointers and things go on without needing to make massive changes.  For that, we relied on DFS-N.  We had the SQL systems point at DFS shares instead of pointing at the raw share.  This is going to prove valuable as we move data very soon to the new SAN.

Reducing multiple disk copies and providing them direct access to historical backups:

The backup storage was sized to store ALL required standard retention, and we (SysAdmins) would continue managing extended retention using our backup solution.  For the most part this now means the DBA’s had access to the data they needed 99% of the time.  This solved the storing the data more than once on disk problem as we would no longer store their standard retention in CommVault, but instead rely on the SQL dumps they already are storing on disk (except extended retention).  They still get copied to tape and sent off site in case you thought that wasn’t covered BTW.

Simplifying backup share permissions:

The DBA’s wanted to simplify permissions, so we worked together and basically came up with a fairly simple folder structure.  We used the basic configuration below.

  • SQL backup root
    • PRD <—- DFS root / direct file share
      • example prd SQL server 1 folder
      • example prd SQL server 2 folder
      • etc.
    • STG <—– DFS root / direct file share
      • example stg SQL server 1 folder
      • etc.
    • etc.
  • Active Directory security group wise we set it up so that all prod SQL servers are part of a “prod” active directory group, all stage are part of a “stage” active directory group, etc.
  • The above AD groups were then assigned at the DFS root (Stg, prd, dev, uat) with the desired permissions.

With this configuration, its now as simple as dropping a SQL service account in one group, it and will now automatically fall into the correct environment level permissions.  In some cases its more permissive then it should be (prod has access to any prod server for example), but it kept things simple, and in our case, I’m not sure the extra security of per server / per environment really would have been a big win.

The harder stuff:

The only two remaining problems we had to solve was knowing what kind of backup the DBA’s did, and making sure we were not backing up more data than we needed.  These were also the two most difficult problems to solve because there wasn’t native way to do it (other than agent based backup).  We had two completely disjointed systems AND processes that we were trying to make work together.  It took many miles of running for me to put all the pieces together and it took a number of meetings with the DBA’s  to figure things out.  The good news is, both problems were solved by aspects of a single solution.  The bad news is, its a fairly complex process, but so far, its been very reliable.  Here’s how we did it.

 The DONE file:

Everything in the work flow is based on the presence of a simple file, what we refer to as the “done” file internally.  This file is used throughout the work flow for various things, and its the key in keeping the whole process working correctly.  Basically the workflow lives and dies by the DONE file.  The DONE file was also the answer to  our knowing what type of backup the DBA’s ran, so we could appropriately sync out backup type with them.

The DONE file follows a very rigid naming convention.  All of our scripts depend on this, and frankly naming standard are just a recommend practice (that’s for another blog post).

Our naming standard is simple:

%FourDigitYear%%2DigitMonth%%2DigitDay%_%24Hour%%Minute%%JobName(usually the sql instance)%_%backuptype%.done

And here are a few examples:

  • Default Instance of SQL
    • 20150302_2008_ms-sql-02_inc.done
    • 20150302_2008_ms-sql-02_full.done
  • Stg instance of SQL
    • 20150302_2008_ms-sql-02stg_inc.done
    • 20150302_2008_ms-sql-02stg_inc.done
The backup folder structure:

Equally as important as the done file, is our folder structure.  Again because this is a repeatable process, everything must follow a standard or the whole thing fall apart.

As you know we have a root folder structure that goes something like this ” \\ShareRoot\Environment\ServerName”.  Inside the servername root I create four folders and I’ll explain their use next.

  • .\Servername\DropOff
  • .\Servername\Queue
  • .\Servername\Pickup
  • .\Servername\Recovery

Dropoff:  This is where the DBA’s dump their backups initially.  The backup’s sit here and wait for our process to begin.

Queue:  This is a folder that we use to stage / queue the backup’s before the next phase.  Again I’ll explain in greater detail.  But the main point of this is to allow us to keep moving data outside of the Dropoff folder to a temp location in the queue folder.  You’ll understand why in a bit.

Pickup:  This is where our tape jobs are configured to look for data.

Recovery:  This is the permanent resting place for the data until it reaches the end of its configured retention period.

Stage 1: SQL side


  1. SQL needs a process that can check the Pickup folder for a done file, delete a done file and create a done file.  Our DBA’s created a stored procedure with parameters to handle this, but you can tackle it however you want, so long as it can be executed in a SQL maintenance plan.
  2. For each “job” in sql that you want to run, you’ll need to configure a “full” maintenance plan to run a full backup, and if you’re using SQL diffs, create an “inc” maintenance plan.  In our case, to try and keep things a little simple, we limited a “job” to a single SQL instance.

SQL maintenance plan work flow:

Every step in this workflow will stop on an error, there is NO continuing or ignore.

  1. First thing the plan does is check for the existence of a previous DONE file.
    1. If a DONE file exists, its deleted and an email is sent out to the DBA’s and sysadmins informing them.  This is because its likely that a previous process failed to run
    2. If a DONE file does not exist, we continue to the next step.
  2. Run our backup, whether its a full or inc.
  3. Once complete, we then create a new done file in the root of the PickupFolder directory.  This will either have a “full” or “inc” in the name depending on which maintenance plan ran.
  4. We purge backup’s in the Recovery folder that are past our retention period.

SQL side is complete.  That’s all the DBA’s need to do.  The rest is on us.  From here you can see how they were able to tell us whether or not they ran a full via the done file.  You can also glean a few things about the workflow.

  1. We’re checking to see if the last backup didn’t process
  2. We delete the done file before we start a new backup (you’ll read why in a sec).
  3. We create a new DONE file once the backup’s are done
  4. We don’t purge any backup’s until we know we had a successful backup.
Stage 1: SysAdmin side

Our stuff is MUCH harder, so do your best to follow along and let me know if you need me to clarify anything.

  1. We need a stage 1 script created, and stage 1 script will do the following in sequential order.
    1. Will need to know what job its looking for.  In our case with JAMS, we named our JAMS jobs based on the same pattern as the done file.  So when the job starts the script reads information from the running job and basically fills in all the parameters like the folder location, job name, etc.
    2. The script looks for the presence of ANY done file in the specific folder.
      1. If no done file exists, it goes into a loop, and checks every 5 minutes (this minimizes slack time).
      2. If a done file does exists we…
        1. If there are more than 1, we fail.  As we don’t know for sure which file is correct.  This is a fail safe
        2. If there is only one, we move on.
    3. Using the “_” in the done file, we make sure that it follows all our standards.  So for example, we check that the first split is a date, the second is a time, the third matches the job name in JAMS and the fourth is either an inc or full.  A failure in any one of these, will cause the job to fail and we’ll get notified to manually look into it.
    4. Once we verify the done file is good to go, we now have all we need to start the migration process.  So the next thing we do is use the date and time information, to create a sub-folder in the Queue folder.
    5. Now we use robocopy to mirror the folder structure to the .\Queue\Date_Time
    6. Once that’s complete, we move all files EXCEPT the done file to the Date_Time folder.
    7. Once that’s complete, we then move the done file into said folder.

And that completes stage 1.  So now you’re probably wondering, why wouldn’t we just move that data straight to the pickup folder? A few reasons.

  • When the backup to tape starts we want to make sure no new files are  getting pumped into the pickup folder.  You could say well just wait until the backup’s done before you move data along. I agree and we sort of do that, but we do it in a way that keeps the pickup folder empty.
    • By moving the files to a queue folder, if our tape process is messed up (not running) we can keep moving data out of the pickup folder into a special holding area, all the while still being able to keep track of the various backup sets (each new job would have a different date_timestamp folder in the queue folder).  Our biggest concern is missing a full backup.  Remember, if the SQL job see’s a done file, it deletes it.  We really want to avoid that if possible.
    • We ALSO wanted to avoid a scenario where we were moving data into a queue folder while the second stage job tried to move data out of the queue folder.  Again, buy have an individual queue folder for each job, this allows us to keep track of all the moving pieces and make sure that we’re not stepping on toes.

Gotcha to watch out for with moving files:

If you didn’t pick up on it, I mentioned that I used robocopy to mirror the directory structure, but I did NOT mention using it for moving the files.  There’s a reason for that. Robocopy’s move parameter actually does a copy + delete.  As you can imagine with a multi-TB backup, this process would take a while.  I built a custom “move-files” function in powershell that does a similar thing, and in that function I use “move-file” cmdlet which is a simple pointer update.  MUCH faster as you can imagine.

Stage 2: SysAdmin Side

We’re using JAMS to manage this, and with that, this stage does NOT run, unless stage 1 is complete.  Keep that in mind if you’re trying to use your own work flow solution.

Ok so at this point our pickup directory may or may not be empty, doesn’t matter, what does matter is that we should have one or more jobs sitting in our .\Queue\xxx folder(s).  What you need next is a script that does the following.

  1. When it starts, it looks for any “DONE” file in the queue folder.  Basically doing a recursive search.
    1. If one or more files are found, we do a foreach loop for each done file found and….
      1. Mirror the directory structure using robocopy from queue\date_time to the PickupFolder
        1. Then move the backup files to the Pickup folder
        2. Move the done file to the Pickup Folder
        3. We then confirm the queue \date_time is empty and delete it.
        4. ***NOTE:  Notice how we look for a DONE file first.  This allows stage 1 to be populating a new Queue sub-folder while we’re working on this stage without inadvertently moving data that’s in use by another stage.  This is why there’s a specific order to when we move the done file in each stage.
    2. If NO done files are found, we assume maybe you’re recovering from a failed step and continue on to….
  2. Now that all files (dumps and done) are in the pickup folder we….
    1. Look for all done files.  if any of them are full, the job will be a full backup.  if we find NO fulls, then its an inc.
    2. Kick of a backup using a CommVault scripts.  Again parameters such as the path, client, subclient, etc. are all pulled from JAMS in our case or already present in CommVault.  We use the information determined about the job type in step 2\1 as for what we’ll execute.  Again, this gives the DBA’s the power to control whether a full backup or an inc is going to tape.
    3. As the backup job is running, we’re constantly checking the status of the backup, about once a minute using a simple “while” statement.  If the job fails, our JAMS solution will execute the job two more times before letting us know and killing the job.
    4. if the job succeeds, we move on to the next step
  3. Now we follow the same moving procedure we used above, except this time, we have no queue\date_time folder to contend with.
    1. Move the backup files from Pickup to the Recovery folder.
    2. Move the done files
    3. Check that the Pickup folder is empty
      1. If yes, we delete and recreate it.  Reason?  Simple, its the easiest way to deal with a changing folder structure.  if a DBA deletes a folder in the DropOff directory, we don’t want to continue propagating a stale object.
      2. If not we bomb the script and request manual intervention.
  4. if all that works well, we just completed out backup process.


You didn’t think I was going to say it was perfect did you?  Hey, I’m just as hard on myself as I am on vendors.  So here is what sucks with the solution.

  1. For the longest time, *I* was the only one that knew how to troubleshoot it.  After a bit of trainings, and running into issues though, my team is mostly caught up on how to troubleshoot.  Still, this is the issue with home brewed solutions, and ones entirely scripted, don’t help.
  2. Related to the above, if I leave my employer, I’m sure the script could be modified to serve other needs, but its not easy, and I’m sure it would take a bit of reverse engineering.  Don’t get me wrong, I commented the snot out of the script, but that doesn’t make it any easier to understand.
  3. Its tough to extend.  I know I said it could, but really, I don’t want to touch it unless I have to (other than parameters).
  4. When we do UAT refreshes, we need to disable production jobs so the DBA’s have access to the production backups for as long as they need.  its not the end of the world, but it requires us to now be involved at a low level with development refreshes, where as before that wasn’t any involvement on our side.
  5. We’ve had times where full backup’s have been missed tape side. That doesn’t mean they didn’t get copied to tape, rather they were considered an “inc” instead of being considered a “full”. This could easily be fixed simply by having the SQL stored procedure checking if the done file that’s about to be deleted is a full backup and if so, to replace it with a new full DONE file, but that’s not the way it is now, and that depends on the DBA’s.  Maybe in your case, you can account for that.
  6. We’ve had cases where the DBA’s do a UAT refresh and copy a backup file to the recovery folder manually.  When we go to move the data from the pickup folder to the recovery folder, our process bombs because it detects that the same file already exists.  Not the end of the world for sure, easy enough to troubleshoot, but its not seamless.  An additional workaround to this could be to do an md5 hash comparison.  If the file is the same, just delete it out of the pickup directory and move on.
  7. There are a lot of jobs to define and a lot of places to update.
    1. In JAMS we have to create 2 jobs + a workflow that links them per SQL job
    2. in CommVault we have to define the sub-client and all its settings.
    3. On the backup share, 4 folders need to be created per job.

Closing thoughts:

At first glance I know its REALLY convoluted looking.  A  Rube Goldberg for sure.  However, when you really start digging into it, its not as bad as it seems.  In essence, I’m mostly using the same workflow multiple times and simply changing the source / destination.  There are places  for example when I’m doing the actual backup, where there’s more than the generic process being used, but its pretty repetitive otherwise.

In our case, JAMS is a very critical peace of software to making this solution work.  While you can do this without the software, it would be much harder for sure.

At this point, I have to imagine that you’re wondering if this is all worth it?  Maybe not to companies with deep pockets.   And being honest, this was actually one of those processes that I did in house and was frustrated that I had to do it.  I mean really, who wants to go through this level of hassle right?  Its funny, I thought THIS would be the process i was troubleshooting all the time, and NOT Veeam.  However, this process for the most part has been incredibly stable and resilient.  Not bragging, but its probably because I wrote the workflow.  The operational overhead I invested saved a TON of capex.  Backing up SQL natively with CommVault has a list price of 10k per TB, before compression.  We have 45TB of SQL data AFTER compression.  You do the math, and I’m pretty sure you’ll see why we took the path we did.    Maybe you’ll say, that CommVault is too expensive, and to some degree that’s true, but even if you’re paying 1k per TB, if you’re being pessimistic and assuming that 45TB = 90TB before compression, I saved 90k + 20% maintenance each year, and CommVault doesn’t cost anywhere close to 1k per TB, so really, I saved a TON of bacon with the process.

Besides the cost factor, its also enabled us to have a real grip on what’s going happening with SQL backups.  Before it was this black box that we had no real insight into.  You could contend that’s a political issue, but then I suspect lots of companies have political issues.  We now know that SQL ran a full backup 6 days ago.  We now have our backup workflow perfectly coordinated.  We’re not starting to early, and we’re kicking off with in 5 minutes of them being done, so we’re not dealing with slack time either.  We’re making sure that our backup application + backup tape is being used in the most prudent way.  Best of all, our DBA’s now have all their dump files available to them, their environment refreshes are reasonable easy, the backup storage is FAST, we have backup’s centralized and not stored with the server.  All in all, the solution kicks ass in my not so humble opinion.  Would I have loved to do CommVault natively?  For sure, no doubt its ultimately the best solution, but this is a compromise that allowed us to continue using CommVault, save money and accomplish all our goals.