Problem Solving: Chasing SQL’s Dump

The Problem:

For years as an admin I’ve had to deal with SQL.  At a former employer, our SQL environment and databases were small, and backup licensing was based on agents, not capacity.  Fast forward to my current employer: we have a fairly decent sized SQL environment (60 – 70 servers), our backups are large, licensing is based on capacity, and we have a full time DBA crew that manage their own backup schedules and prefer that backups are managed by them.  What that means is dealing with a ton of dumps.  Read into that as you want 🙂

When I started at my current employer, the SQL server backup architecture was kind of a mess.  To begin with, there were about 40 – 50 physical SQL servers at the time.  So when you’re picturing all of this, keep that in mind.  Some of these issues don’t go hand in hand with physical design limitations, but some do.

  • DAS was used not only for storing the SQL logs, DBs and indexes, but also the backups.  Sometimes, if the SQL server was critical enough, we had dedicated disks for backups, but that wasn’t typical.  This of course is a problem for many reasons.
    • Performance for both the backups and the SQL service itself was often limited because they were sharing the same disks.  So when a backup kicked off, SQL was reading from the same disks it was attempting to write to.  This wasn’t as big of an issue for the few systems that had dedicated disks, but even there they sometimes shared the same RAID card, which meant you were still potentially bottlenecking one for the other.
    • Capacity was spread across physical servers.  Some systems had plenty of space and others barely had enough.  Islands are never easy to manage.
    • If that SQL server went down, so did its most recent backups.  TL (transaction log) backups were also stored here (shudders).
    • Being a dev shop meant doing environment refreshes.  This meant creating and maintaining share / NTFS permissions across servers.  This by itself isn’t inherently difficult if it’s thought out ahead of time, but it wasn’t (not my design).
    • We were migrating to a virtual environment, and that virtual environment would potentially be vMotioning from one host to another.  DAS was a solution that wouldn’t work long term.
  • The DBAs managed their own backup schedules, so we all basically had to estimate the best time to pick up their DBs.  Sometimes we were too early and sometimes we could have started sooner.
  • Adding to the above points, if we had a failed backup overnight, or a backup that ran long, it affected SQL’s performance during production hours.  This put us in a position of choosing between giving up on backing up some data or accepting performance degradation.
  • We didn’t know when they did fulls vs. diffs, which means we might be storing their DIFF files on what we considered “full” backup tapes.  By itself that’s not an issue, except for the fact that we did monthly extended fulls, meaning we kept the first full backup of each month for 90 days.  If the file we’re keeping is a diff file, that doesn’t do us any good.  However, you can see below why it wasn’t as big of an issue in general.
  • Finally, the problem I contended with besides all of these is that because they were keeping ALL files on disk in the same location, every time we did a full backup, we backed EVERYTHING up.  Sometimes that was two weeks’ worth of data: TLs, diffs and fulls.  This meant we were storing their backup data multiple times over on both disk and tape.

I’m sure there’s more than a few of you out there with similar design issues.  I’m going to lay out how I worked around some of the politics and budget limitations.  I wouldn’t suggest this solution as a first choice, it’s really not the right way to tackle it, but it is a way that works well for us, and might for you.  This solution of course isn’t limited to SQL.  Really, anything that uses a backup file scheme could fit right into this solution.

The solution:

I spent days’ worth of my personal time while jogging, lifting, etc., just thinking about how to solve all these problems.  Some of them were easy and some of them would be technically complex, but doable.  I also spent hours with our DBA team collaborating on the rough solution I came up with, and honing it to work for both of us.

Here is basically what I came to the table with wanting to solve:

  • I wanted SQL dumping to a central location, no more local SQL backups.
  • The DBAs wanted to simplify permissions for all environments to make DB refreshing easier.
  • I wanted to minimize or eliminate storing their backup data twice on disk.
  • I wanted them to have direct access to our agreed upon retention without needing to involve us for most historical restores.  Basically giving them self service recovery.
  • I wanted to eliminate backing up more data than we needed.
  • I wanted to know for sure when they were done backing up, and what type of backup they performed.

Honestly we needed the fix, as the reality was we were moving towards virtualizing our SQL infrastructure, and presenting local disk on SAN would be both expensive and incredibly complex to contend with for 60+ SQL servers.

How we did it:

Like I said, some of it was an easy fix and some of it more complex; let’s break it down.

The easy stuff:

Backup performance and centralization:

We bought an affordable backup storage solution.  At the time of this writing it was, and still is, Microsoft Windows Storage Spaces.  After making that mistake, we’re now moving on to what we hope is a more reliable and mostly simpler Quantum QXS (DotHill) SAN using all NL-SAS disks.  Point being, instead of having SQL dump to local disk, we set up a fairly high-performance file server cluster.  This gave us both high availability and, with the hardware we implemented, very high performance as well.

New problem we had to solve:

Having something centralized means you also have to think about the possibility of needing to move it at some point.  Given that many processes would be written around this new network share, we needed to make sure we could move data around on the backend, update some pointers, and carry on without needing to make massive changes.  For that, we relied on DFS-N.  We had the SQL systems point at DFS shares instead of pointing at the raw share.  This is going to prove valuable as we move data to the new SAN very soon.
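To give you a feel for how little has to change when the backend moves, here’s a minimal sketch using the DFSN PowerShell module.  The namespace and file server names (contoso.local, filer01, filer02) are made up for illustration; the point is that the folder target swaps out while the DFS path the SQL servers use stays the same.

```powershell
# Hypothetical namespace and file servers -- adjust to your environment.
# Publish the PRD backup share behind a DFS folder.
New-DfsnFolder -Path '\\contoso.local\SQLBackup\PRD' -TargetPath '\\filer01\SQLBackups\PRD'

# Later, when the data lands on the new SAN-backed file server, add the new
# target and retire the old one. The SQL servers keep using the same DFS path.
New-DfsnFolderTarget    -Path '\\contoso.local\SQLBackup\PRD' -TargetPath '\\filer02\SQLBackups\PRD'
Remove-DfsnFolderTarget -Path '\\contoso.local\SQLBackup\PRD' -TargetPath '\\filer01\SQLBackups\PRD'
```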

Reducing multiple disk copies and providing them direct access to historical backups:

The backup storage was sized to store ALL required standard retention, and we (SysAdmins) would continue managing extended retention using our backup solution.  For the most part this now means the DBAs have access to the data they need 99% of the time.  This solved the problem of storing the data more than once on disk, as we would no longer store their standard retention in CommVault, but instead rely on the SQL dumps they’re already storing on disk (except extended retention).  They still get copied to tape and sent off site in case you thought that wasn’t covered, BTW.

Simplifying backup share permissions:

The DBAs wanted to simplify permissions, so we worked together and basically came up with a fairly simple folder structure.  We used the basic configuration below.

  • SQL backup root
    • PRD <—- DFS root / direct file share
      • example prd SQL server 1 folder
      • example prd SQL server 2 folder
      • etc.
    • STG <—– DFS root / direct file share
      • example stg SQL server 1 folder
      • etc.
    • etc.
  • Active Directory security group-wise, we set it up so that all prod SQL servers are part of a “prod” Active Directory group, all stage servers are part of a “stage” Active Directory group, etc.
  • The above AD groups were then assigned at each DFS root (STG, PRD, DEV, UAT) with the desired permissions.

With this configuration, it’s now as simple as dropping a SQL service account into one group, and it will automatically fall into the correct environment-level permissions.  In some cases it’s more permissive than it should be (prod has access to any prod server, for example), but it kept things simple, and in our case I’m not sure the extra security of per-server permissions really would have been a big win.
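For what it’s worth, the day-to-day of that really is a one-liner.  The group and account names below are made up, but the idea is that membership in the environment group is the only permissioning you ever touch:

```powershell
# Assumed group / account names, for illustration only.
Import-Module ActiveDirectory

# New prod SQL server? Drop its service account into the prod group and it
# inherits the permissions assigned at the PRD DFS root.
Add-ADGroupMember -Identity 'SQL-Backup-PRD' -Members 'svc-mssql-prd03'
```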

The harder stuff:

The only two remaining problems we had to solve were knowing what kind of backup the DBAs did, and making sure we were not backing up more data than we needed.  These were also the two most difficult problems to solve because there wasn’t a native way to do it (other than agent based backup).  We had two completely disjointed systems AND processes that we were trying to make work together.  It took many miles of running for me to put all the pieces together, and it took a number of meetings with the DBAs to figure things out.  The good news is, both problems were solved by aspects of a single solution.  The bad news is, it’s a fairly complex process, but so far, it’s been very reliable.  Here’s how we did it.

 The DONE file:

Everything in the workflow is based on the presence of a simple file, what we refer to as the “done” file internally.  This file is used throughout the workflow for various things, and it’s the key to keeping the whole process working correctly.  Basically the workflow lives and dies by the DONE file.  The DONE file was also the answer to knowing what type of backup the DBAs ran, so we could appropriately sync our backup type with theirs.

The DONE file follows a very rigid naming convention.  All of our scripts depend on this, and frankly naming standards are just a recommended practice (that’s for another blog post).

Our naming standard is simple:

%FourDigitYear%%2DigitMonth%%2DigitDay%_%24Hour%%Minute%_%JobName(usually the SQL instance)%_%BackupType%.done

And here are a few examples:

  • Default Instance of SQL
    • 20150302_2008_ms-sql-02_inc.done
    • 20150302_2008_ms-sql-02_full.done
  • Stg instance of SQL
    • 20150302_2008_ms-sql-02stg_inc.done
    • 20150302_2008_ms-sql-02stg_full.done
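If you’re building the name in a script rather than hard coding it, it’s nearly a one-liner.  A minimal sketch, assuming the job name and backup type are handed in from whatever is driving the backup:

```powershell
# Build a DONE file name following the convention: date_time_jobname_type.done
$JobName    = 'ms-sql-02'        # usually the SQL instance
$BackupType = 'full'             # 'full' or 'inc'
$doneName   = '{0}_{1}_{2}.done' -f (Get-Date -Format 'yyyyMMdd_HHmm'), $JobName, $BackupType
$doneName                        # e.g. 20150302_2008_ms-sql-02_full.done
```
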
The backup folder structure:

Equally as important as the done file is our folder structure.  Again, because this is a repeatable process, everything must follow a standard or the whole thing falls apart.

As you know, we have a root folder structure that goes something like this: “\\ShareRoot\Environment\ServerName”.  Inside the server name root I create four folders, and I’ll explain their use next.

  • .\Servername\DropOff
  • .\Servername\Queue
  • .\Servername\Pickup
  • .\Servername\Recovery

Dropoff:  This is where the DBAs dump their backups initially.  The backups sit here and wait for our process to begin.

Queue:  This is a folder that we use to stage / queue the backups before the next phase.  Again, I’ll explain in greater detail later, but the main point of this is to allow us to keep moving data out of the Dropoff folder to a temporary location in the Queue folder.  You’ll understand why in a bit.

Pickup:  This is where our tape jobs are configured to look for data.

Recovery:  This is the permanent resting place for the data until it reaches the end of its configured retention period.
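Since those four folders have to exist for every job, it’s worth scripting their creation instead of clicking them out by hand.  A quick sketch (the share root and server names are placeholders):

```powershell
# Placeholders -- substitute your real share root and server list.
$shareRoot = '\\contoso.local\SQLBackup\PRD'
$servers   = 'ms-sql-01', 'ms-sql-02'

foreach ($server in $servers) {
    foreach ($folder in 'DropOff', 'Queue', 'Pickup', 'Recovery') {
        # -Force makes this idempotent if the folder already exists.
        New-Item -ItemType Directory -Path (Join-Path $shareRoot "$server\$folder") -Force | Out-Null
    }
}
```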

Stage 1: SQL side

Prerequisites:

  1. SQL needs a process that can check the Pickup folder for a done file, delete a done file and create a done file.  Our DBAs created a stored procedure with parameters to handle this, but you can tackle it however you want, so long as it can be executed in a SQL maintenance plan (there’s a rough sketch of the logic after this list).
  2. For each “job” in SQL that you want to run, you’ll need to configure a “full” maintenance plan to run a full backup, and if you’re using SQL diffs, create an “inc” maintenance plan.  In our case, to try and keep things a little simpler, we limited a “job” to a single SQL instance.
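Our DBAs implemented prerequisite 1 as a stored procedure, so I won’t pretend to show you their T-SQL, but the logic is small enough to sketch in PowerShell so you can see what the maintenance plan steps actually need to do.  The folder path, mail addresses and names below are assumptions:

```powershell
# Sketch of the check / delete / create logic the DBAs wrapped in a stored
# procedure. Paths, mail addresses and names are made-up examples.
param(
    [string]$DoneFolder,    # folder the workflow watches for DONE files
    [string]$JobName,       # usually the SQL instance, e.g. ms-sql-02
    [string]$BackupType     # 'full' or 'inc', matching the maintenance plan
)

# 1. A leftover DONE file means the downstream process probably never ran.
$stale = Get-ChildItem -Path $DoneFolder -Filter '*.done' -File
if ($stale) {
    $stale | Remove-Item
    Send-MailMessage -To 'dba@contoso.local', 'sysadmin@contoso.local' `
        -From 'sqlbackup@contoso.local' -SmtpServer 'smtp.contoso.local' `
        -Subject "Stale DONE file removed for $JobName"
}

# 2. ...the full or inc backup runs here as its own maintenance plan step...

# 3. Signal completion, and the backup type, in the file name.
$doneName = '{0}_{1}_{2}.done' -f (Get-Date -Format 'yyyyMMdd_HHmm'), $JobName, $BackupType
New-Item -ItemType File -Path (Join-Path $DoneFolder $doneName) | Out-Null
```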

SQL maintenance plan work flow:

Every step in this workflow will stop on an error; there is NO continuing or ignoring.

  1. First thing the plan does is check for the existence of a previous DONE file.
    1. If a DONE file exists, it’s deleted and an email is sent out to the DBAs and sysadmins informing them.  This is because it’s likely that a previous process failed to run.
    2. If a DONE file does not exist, we continue to the next step.
  2. Run our backup, whether its a full or inc.
  3. Once complete, we then create a new done file in the root of the PickupFolder directory.  This will either have a “full” or “inc” in the name depending on which maintenance plan ran.
  4. We purge backups in the Recovery folder that are past our retention period.

SQL side is complete.  That’s all the DBAs need to do.  The rest is on us.  From here you can see how they were able to tell us whether or not they ran a full via the done file.  You can also glean a few things about the workflow.

  1. We’re checking to see if the last backup didn’t process
  2. We delete the done file before we start a new backup (you’ll read why in a sec).
  3. We create a new DONE file once the backups are done.
  4. We don’t purge any backups until we know we had a successful backup.
Stage 1: SysAdmin side

Our stuff is MUCH harder, so do your best to follow along and let me know if you need me to clarify anything.

  1. We need a stage 1 script, and that script will do the following in sequential order (a condensed sketch follows this list).
    1. It needs to know what job it’s looking for.  In our case with JAMS, we named our JAMS jobs based on the same pattern as the done file.  So when the job starts, the script reads information from the running job and basically fills in all the parameters like the folder location, job name, etc.
    2. The script looks for the presence of ANY done file in the specific folder.
      1. If no done file exists, it goes into a loop and checks every 5 minutes (this minimizes slack time).
      2. If a done file does exist, we…
        1. If there is more than one, we fail, as we don’t know for sure which file is correct.  This is a fail-safe.
        2. If there is only one, we move on.
    3. Using the “_” in the done file, we make sure that it follows all our standards.  So for example, we check that the first split is a date, the second is a time, the third matches the job name in JAMS and the fourth is either an inc or a full.  A failure in any one of these will cause the job to fail, and we’ll get notified to manually look into it.
    4. Once we verify the done file is good to go, we have all we need to start the migration process.  So the next thing we do is use the date and time information to create a sub-folder in the Queue folder.
    5. Now we use robocopy to mirror the folder structure to .\Queue\Date_Time.
    6. Once that’s complete, we move all files EXCEPT the done file to the Date_Time folder.
    7. Once that’s complete, we then move the done file into said folder.
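To make that concrete, here’s a condensed sketch of the stage 1 logic.  In production the parameters come from JAMS and the failure paths notify us; here they’re plain parameters and throws, and the DONE file is assumed to land alongside the dumps in the watched folder (adjust the paths to wherever yours is written):

```powershell
# Condensed stage 1 sketch. $ServerRoot is the per-server folder on the share,
# e.g. \\contoso.local\SQLBackup\PRD\ms-sql-02 (a made-up example path).
param([string]$ServerRoot, [string]$JobName)

$dropOff = Join-Path $ServerRoot 'DropOff'

# Wait for a DONE file, checking every 5 minutes to minimize slack time.
do {
    $done = @(Get-ChildItem -Path $dropOff -Filter '*.done' -File)
    if ($done.Count -eq 0) { Start-Sleep -Seconds 300 }
} until ($done.Count -gt 0)

# Fail-safe: more than one DONE file means we can't tell which set is current.
if ($done.Count -gt 1) { throw 'More than one DONE file found; manual intervention required.' }

# Validate the naming standard: date_time_jobname_type.done
$parts  = [IO.Path]::GetFileNameWithoutExtension($done[0].Name) -split '_'
$parsed = [datetime]::MinValue
if ($parts.Count -ne 4 -or
    -not [datetime]::TryParseExact($parts[0], 'yyyyMMdd', $null, 'None', [ref]$parsed) -or
    $parts[2] -ne $JobName -or
    $parts[3] -notin 'full', 'inc') {
    throw "DONE file '$($done[0].Name)' does not match the naming standard."
}

# Stage the set into its own Queue\Date_Time folder: structure first,
# then the backup files, then the DONE file last.
$queueDir = Join-Path $ServerRoot ('Queue\{0}_{1}' -f $parts[0], $parts[1])
robocopy $dropOff $queueDir /e /xf * | Out-Null      # mirrors folders only, no files

Get-ChildItem -Path $dropOff -Recurse -File | Where-Object { $_.Extension -ne '.done' } | ForEach-Object {
    $relative = $_.FullName.Substring($dropOff.Length).TrimStart('\')
    Move-Item -LiteralPath $_.FullName -Destination (Join-Path $queueDir $relative)
}
Move-Item -LiteralPath $done[0].FullName -Destination $queueDir
```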

And that completes stage 1.  So now you’re probably wondering, why wouldn’t we just move that data straight to the pickup folder? A few reasons.

  • When the backup to tape starts we want to make sure no new files are getting pumped into the pickup folder.  You could say, well, just wait until the backup’s done before you move data along.  I agree, and we sort of do that, but we do it in a way that keeps the pickup folder empty.
    • By moving the files to a queue folder, if our tape process is messed up (not running) we can keep moving data out of the pickup folder into a special holding area, all the while still being able to keep track of the various backup sets (each new job would have a different date_timestamp folder in the queue folder).  Our biggest concern is missing a full backup.  Remember, if the SQL job sees a done file, it deletes it.  We really want to avoid that if possible.
    • We ALSO wanted to avoid a scenario where we were moving data into a queue folder while the second stage job tried to move data out of the queue folder.  Again, by having an individual queue folder for each job, we can keep track of all the moving pieces and make sure that we’re not stepping on toes.

Gotcha to watch out for with moving files:

If you didn’t pick up on it, I mentioned that I used robocopy to mirror the directory structure, but I did NOT mention using it for moving the files.  There’s a reason for that.  Robocopy’s move parameter actually does a copy + delete.  As you can imagine, with a multi-TB backup, that process would take a while.  I built a custom “move-files” function in PowerShell that does a similar thing, but in that function I use the “Move-Item” cmdlet, which on the same volume is a simple pointer update.  MUCH faster, as you can imagine.
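This isn’t our actual function, just a minimal sketch of the idea: walk the source tree and use Move-Item per file.  On the same volume that’s a metadata-only rename, so even multi-TB sets move in seconds; across volumes it degrades back to copy + delete, which is the thing we’re avoiding.

```powershell
# Minimal sketch of a "move, don't copy" helper. Same-volume moves with
# Move-Item are pointer updates; cross-volume moves still copy + delete.
function Move-BackupFiles {
    param([string]$Source, [string]$Destination)

    Get-ChildItem -Path $Source -Recurse -File | ForEach-Object {
        $relative = $_.FullName.Substring($Source.Length).TrimStart('\')
        $target   = Join-Path $Destination $relative

        # Make sure the target's parent folder exists, then move the file.
        New-Item -ItemType Directory -Path (Split-Path -Parent $target) -Force | Out-Null
        Move-Item -LiteralPath $_.FullName -Destination $target
    }
}

# Example: Move-BackupFiles -Source "$ServerRoot\Pickup" -Destination "$ServerRoot\Recovery"
```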

Stage 2: SysAdmin Side

We’re using JAMS to manage this, and with that, this stage does NOT run unless stage 1 is complete.  Keep that in mind if you’re trying to use your own workflow solution.

Ok, so at this point our pickup directory may or may not be empty; it doesn’t matter.  What does matter is that we should have one or more jobs sitting in our .\Queue\xxx folder(s).  What you need next is a script that does the following.

  1. When it starts, it looks for any “DONE” file in the queue folder.  Basically doing a recursive search.
    1. If one or more files are found, we do a foreach loop for each done file found and….
      1. Mirror the directory structure using robocopy from queue\date_time to the PickupFolder
        1. Then move the backup files to the Pickup folder
        2. Move the done file to the Pickup Folder
        3. We then confirm the queue\date_time folder is empty and delete it.
        4. ***NOTE:  Notice how we look for a DONE file first.  This allows stage 1 to be populating a new Queue sub-folder while we’re working on this stage without inadvertently moving data that’s in use by another stage.  This is why there’s a specific order to when we move the done file in each stage.
    2. If NO done files are found, we assume maybe you’re recovering from a failed step and continue on to….
  2. Now that all files (dumps and done) are in the pickup folder we….
    1. Look for all done files.  If any of them are full, the job will be a full backup.  If we find NO fulls, then it’s an inc.
    2. Kick off a backup using our CommVault scripts (there’s a rough sketch of this part after the list).  Again, parameters such as the path, client, subclient, etc. are all pulled from JAMS in our case or already present in CommVault.  We use the job type determined in step 2.1 to decide what we’ll execute.  Again, this gives the DBAs the power to control whether a full backup or an inc is going to tape.
    3. As the backup job is running, we’re constantly checking the status of the backup, about once a minute, using a simple “while” statement.  If the job fails, our JAMS solution will execute the job two more times before letting us know and killing the job.
    4. If the job succeeds, we move on to the next step.
  3. Now we follow the same moving procedure we used above, except this time, we have no queue\date_time folder to contend with.
    1. Move the backup files from Pickup to the Recovery folder.
    2. Move the done files
    3. Check that the Pickup folder is empty
      1. If yes, we delete and recreate it.  Reason?  Simple, it’s the easiest way to deal with a changing folder structure.  If a DBA deletes a folder in the DropOff directory, we don’t want to continue propagating a stale object.
      2. If not, we bomb the script and request manual intervention.
  4. If all that goes well, we’ve just completed our backup process.
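As promised above, here’s a rough sketch of the full-vs-inc decision and the status polling loop from step 2, assuming $ServerRoot as in the stage 1 sketch.  Start-TapeBackup and Get-TapeBackupStatus are placeholders, not a real CommVault API; swap in whatever your backup product’s scripts or CLI look like:

```powershell
# Sketch of the "full vs. inc" decision and the status-polling loop.
# Start-TapeBackup / Get-TapeBackupStatus are placeholder functions, not real
# CommVault commands -- substitute your own backup product's calls.
$pickup    = Join-Path $ServerRoot 'Pickup'
$doneFiles = Get-ChildItem -Path $pickup -Filter '*.done' -File

# A single full DONE file anywhere in the set makes the whole tape job a full.
$jobType = if ($doneFiles.Name -match '_full\.done$') { 'full' } else { 'inc' }

$job = Start-TapeBackup -Path $pickup -Type $jobType       # placeholder
do {
    Start-Sleep -Seconds 60
    $status = Get-TapeBackupStatus -Job $job               # placeholder
} while ($status -eq 'Running')

if ($status -ne 'Completed') { throw "Tape backup finished with status '$status'." }
# From here the same move routine shuffles everything from Pickup to Recovery.
```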

Issues?

You didn’t think I was going to say it was perfect, did you?  Hey, I’m just as hard on myself as I am on vendors.  So here is what sucks about the solution.

  1. For the longest time, *I* was the only one who knew how to troubleshoot it.  After a bit of training, and running into issues, my team is mostly caught up on how to troubleshoot it.  Still, this is the issue with home-brewed solutions, and entirely scripted ones don’t help.
  2. Related to the above, if I leave my employer, I’m sure the script could be modified to serve other needs, but it’s not easy, and I’m sure it would take a bit of reverse engineering.  Don’t get me wrong, I commented the snot out of the script, but that doesn’t make it any easier to understand.
  3. It’s tough to extend.  I know I said it could be, but really, I don’t want to touch it unless I have to (other than parameters).
  4. When we do UAT refreshes, we need to disable production jobs so the DBAs have access to the production backups for as long as they need.  It’s not the end of the world, but it requires us to be involved at a low level with development refreshes, whereas before there wasn’t any involvement on our side.
  5. We’ve had times where full backups have been missed tape side.  That doesn’t mean they didn’t get copied to tape, rather they were considered an “inc” instead of being considered a “full”.  This could easily be fixed by having the SQL stored procedure check whether the done file that’s about to be deleted is for a full backup and, if so, replace it with a new full DONE file, but that’s not the way it is now, and that depends on the DBAs.  Maybe in your case, you can account for that.
  6. We’ve had cases where the DBAs do a UAT refresh and copy a backup file to the recovery folder manually.  When we go to move the data from the pickup folder to the recovery folder, our process bombs because it detects that the same file already exists.  Not the end of the world for sure, and easy enough to troubleshoot, but it’s not seamless.  An additional workaround would be to do an MD5 hash comparison: if the file is the same, just delete it out of the pickup directory and move on.
  7. There are a lot of jobs to define and a lot of places to update.
    1. In JAMS we have to create 2 jobs + a workflow that links them per SQL job
    2. In CommVault we have to define the sub-client and all its settings.
    3. On the backup share, 4 folders need to be created per job.

Closing thoughts:

At first glance I know it’s REALLY convoluted looking.  A Rube Goldberg machine for sure.  However, when you really start digging into it, it’s not as bad as it seems.  In essence, I’m mostly using the same workflow multiple times and simply changing the source / destination.  There are places, for example when I’m doing the actual backup, where there’s more than the generic process being used, but it’s pretty repetitive otherwise.

In our case, JAMS is a very critical piece of software in making this solution work.  While you could do this without it, it would be much harder for sure.

At this point, I have to imagine that you’re wondering if this is all worth it.  Maybe not to companies with deep pockets.  And being honest, this was actually one of those processes that I did in house and was frustrated that I had to do it.  I mean really, who wants to go through this level of hassle, right?  It’s funny, I thought THIS would be the process I was troubleshooting all the time, and NOT Veeam.  However, this process for the most part has been incredibly stable and resilient.  Not bragging, but it’s probably because I wrote the workflow.  The operational overhead I invested saved a TON of capex.  Backing up SQL natively with CommVault has a list price of 10k per TB, before compression.  We have 45TB of SQL data AFTER compression.  You do the math, and I’m pretty sure you’ll see why we took the path we did.  Maybe you’ll say that CommVault is too expensive, and to some degree that’s true, but even if you’re paying 1k per TB, and being pessimistic and assuming that 45TB = 90TB before compression, I saved 90k + 20% maintenance each year.  And CommVault doesn’t cost anywhere close to 1k per TB, so really, I saved a TON of bacon with the process.

Besides the cost factor, it’s also enabled us to have a real grip on what’s happening with SQL backups.  Before, it was this black box that we had no real insight into.  You could contend that’s a political issue, but then I suspect lots of companies have political issues.  We now know, for example, that SQL ran a full backup 6 days ago.  We now have our backup workflow perfectly coordinated.  We’re not starting too early, and we’re kicking off within 5 minutes of them being done, so we’re not dealing with slack time either.  We’re making sure that our backup application + backup tape is being used in the most prudent way.  Best of all, our DBAs now have all their dump files available to them, their environment refreshes are reasonably easy, the backup storage is FAST, and we have backups centralized and not stored with the server.  All in all, the solution kicks ass in my not so humble opinion.  Would I have loved to do CommVault natively?  For sure, no doubt it’s ultimately the best solution, but this is a compromise that allowed us to continue using CommVault, save money and accomplish all our goals.

KUDOS: West Coast Technologies (WCT)

In general I have to say I’m not a fan of VARs, for many reasons, mostly because I find little “Value Add” in having them resell technology that would probably be cheaper if I wasn’t stuck going through them.  However, there’s always an exception to the rule, and after a recent mess up I wanted to throw a shout out to WCT for helping us out of a tough bind.

We ordered a new Quantum QXS (DotHill) SAN for some new backup storage.  There were supposed to be some SFPs in there for networking, but we couldn’t find anything.  It’s possible they got thrown out in all the excitement, or perhaps they were never put in the box.  Can’t say for sure.  What I can say is that when I called WCT (Dave Holloway to be specific) about the issue, he took care of it right away.  While working out the minutiae on the backend, they shipped us SFPs out of their stock so we could get moving.

Now I’m not saying there aren’t any other VAR’s out there with this level of service, but I didn’t cringe making the call to WCT because I know they’re real people and get that shit happens.

Anyway, this is just a public kudos to them, to basically say “I really appreciate what you did”.

Thinking out loud: The broken talent acquisition process

I was scanning through my LinkedIn feed the other day and stumbled across an interesting picture / chart, posted by a former colleague, showing common mistakes that interview candidates make.  To summarize, it was a compilation of physical characteristics (slouching, not smiling, clothing not too trendy but not too outdated, etc.), knowing a lot about the company and position, and various other common do’s and don’ts.

My first reaction was “no duh”, but then I paused for a moment and asked myself “Why?”  Do these criteria really lead to you getting the best candidate for the position, or are they actually filtering out the best candidate for the position?  It actually got me a little annoyed just thinking about how shallow someone recruiting must be if these were deciding factors.  Then I started thinking about other facets of recruiting that seem not only outdated, but also likely detrimental to a company finding the best possible candidates.  It’s bothered me enough that I felt compelled to write a blog post about what I think is wrong on not only the recruiting side, but also the candidate’s side too.

I realize a lot of the things I’m going to mention will have exceptions.  There are always exceptions.  I’m not writing to debate the exceptions, I’m writing to discuss the averages, which, contrary to most companies’ beliefs, is where they likely sit, and that’s not a bad thing.

First, let’s really think about what a company’s goal SHOULD be when attempting to fill a position.

  1. A person who has the best demonstrated skills in the chosen position you need filled.
  2. Ideally the person should be reliable.  You might also consider this a great work ethic.  It’s not an inherently easy trait to get out of an interview, but it’s an important one.
  3. They should be passionate about their career, and love their work.

With those three simple yet crucial traits, there’s no reason you should not be able to find the perfect candidate.  Now, I’m not saying they’re easy traits to determine, nor am I saying that the candidates who possess them are abundant.  However, if you put up superficial filters, you’re reducing your chances of finding them.

What’s broken in the talent acquisition process from a candidate’s view:

Where I sit as a former job seeker, these are things I saw that led me to not waste my time applying for a position, or just left me generally frustrated with the company I was trying to apply to.  I think more often than not, companies forget that employment is a two-way street.

  • Making applying for your position a lot of work.  This could be things like forcing candidates to go through lengthy online applications, or something as simple as not having a “click here to submit your resume”.  The application process should be as simple and as quick as possible.  I’m not condoning applicants shotgunning their resume out either, but I think the possibility of an applicant not applying, or forgetting to apply, outweighs the cost of sifting through more resumes than you’d like.
  • Related to the above, cover letters need to go.  I’ve started noticing them becoming less common, so hopefully that trend continues.  Seriously, it’s an old fashioned formality; 99% of the time the candidate is going to use a template, and I suspect most recruiters don’t read them anyway.
  • If salaries are not going to be in the job posting, they need to be discussed at a high level early on, not at the end.  It’s a waste of the company’s time, and the candidate’s time, if the salary the candidate is looking for is way off from what the company is willing to stretch to.
    • Adding to this, the candidate’s salary history is frankly none of the hiring company’s business.  Add to that, it really shouldn’t matter what I was paid; what should matter is what I’m willing to work for, and what you’re willing to compensate.  That’s it.
  • Not doing the majority of your interviews via phone IMO is a disservice to your candidate.  As a person that’s now been on both sides of the table, I can say without question that you know if you want to hire someone before you ever meet them in person.  To me, the in person interview is mostly a formality, and really just a chance to meet face to face and get an idea of the work environment.  Add to this, for the candidate, taking time off of work to go interview is WAY HARDER than, say, slipping out during lunch for a one hour conversation via phone.  Just think about it from the candidate’s view.  How would you like to tell your manager for the 5th time that you’re sick, or your car broke down, or whatever other lie you force the candidate to make so they can meet with you?
  • If you’re going to bring the candidate in for an in person interview, then make sure it’s a one-and-done thing, unless the candidate wants a second in person interview.  Again, the point above is the main reason, but also because it’s highly likely the candidate is burning up vacation time to come and interview with you.  I can think of one place I interviewed at where I literally went in 5 times, only to get told “no”.
  • Offer interview times after hours, during lunch, and heck, even over the weekend.  Forcing someone to interview during business hours makes it tough for the candidate to coordinate.  If you really want this person, and the person really wants to work for you, what’s the big deal with spending a non-work day in the office to meet and greet?  I’d even add that maybe neither you nor the candidate would feel as rushed as you might during the work week.
  • Be both flexible and understanding when it comes to someone showing up a little late or a little early.  I’m not talking about 30 minutes or an hour, but if they’re say 5 – 10 minutes late, so what.  I can tell you there were a number of times where I was basically racing from work to an interview (pre-GPS) and it was just tough to find the place.  Add to that, accidents and other things happen.  I get that being punctual is important, but I bet your average person (even you, the recruiter) is late at times, and it’s beyond their control.
  • Except in a few circumstances, judging a candidate in person based on their appearance and NOT focusing on their skill set and experience isn’t a determination that they’re a bad candidate, it’s a sign that you’re a poor interviewer.  Sure, maybe they showed up in a polo instead of a suit to interview at a company that claims to be “casual dress”, but does that really change that they’re a kick ass SysAdmin for the casual dress code position you’re trying to fill?  If the company dress code is business professional and they show up for the interview in a polo shirt, then I would say that might be an issue if they didn’t tell you ahead of time.  But for example, what if the candidate’s current employer’s dress code is casual?  Everyone wears jeans and t-shirts.  Now they need to go from that place to your business professional dress code.  You’re only leaving them with a few options, none of which are good for them.
    • Dress professionally while at their current employer; that won’t set off any signals that they’re going to an interview.
    • Change in a bathroom along the way there.  That’s not awkward for the candidate or anything.
  • Not letting the candidate know that you’re not moving forward with them via a phone call, or at the very least an email.  Getting some insincere letter in the mail weeks or months after you interviewed is just rude.
  • If you are still interested in them, but are still interviewing other candidates, a weekly update isn’t too much to ask for.
  • Requiring candidates to have a college degree when you know darn well that the degree isn’t required to accomplish the work.  At least when it comes to IT, a college degree does not inherently make you a great IT person.
  • Worrying about skills that are not critical to the position at hand.  For example, “SysAdmin must have excellent written and oral communication skills.”  Why must they be excellent?  Seriously, do you care that they can automate your entire infrastructure, or do you care that they can write a perfect announcement?  One skill actually is important to getting work done, the other is just looking for something to nitpick about.
  • Putting 100 skills that a candidate must possess in the posting, when any reasonable person could look at it and go “yeah, and unicorns are real”.  Seriously, someone with knowledge that wide and deep is so unlikely to exist, and if they do, you probably can’t afford them.  I’ve looked at job postings where they basically wanted an entire IT department’s worth of skill sets in a single person and wanted to pay them the salary of a Jr. Admin.  I’m not saying that well rounded candidates don’t exist, but expecting someone to be an expert in networking, virtualization, storage, Linux and Windows, sorry, ain’t gonna happen.  Sure, they might know some stuff in those areas, but they’re not going to be true experts in all of them.
  • Having unrealistic salary expectations is another issue I see a lot.  Contrary to popular belief, a GREAT SysAdmin (as an example) is VERY tough to find.  If you want the best talent, and not just a person that says they can do stuff, it’s going to cost more than you likely budgeted for.  And you know what, if they are that good, you’ll probably make back whatever extra salary overhead you think you’ll incur when they do their job about 3x better / faster than the cheaper candidate you wanted to hire.  Not to mention, golden shackles are a pretty powerful way to keep most good talent from leaving.
  • Interviewing with either too many people, not enough people, or sometimes the wrong people is also an issue I’ve seen.  I remember interviewing through a recruiter where the SVP of infrastructure wanted to interview me in person.  They insisted that the interview be both in person and on a date / time when they were around.  I had probably 2 – 3 different interviews scheduled that were canceled at the last minute because the SVP either had vacation or something came up.  When I finally did come in for the interview, I never met with them, but I met with every other person that reported to them.  What’s funny about this is, every person I interviewed with indirectly lamented how strict and demanding the SVP was.  So not only was I turned off by the fact that the SVP, after changing my schedule around about 3 times, didn’t bother to meet with me, but the people he delegated to do the interviewing basically convinced me (without knowing it) that there was no way in hell I wanted to work for this guy.  I suspect, had I actually met with the SVP, I may have picked up on the cues, but who knows, maybe not.  Either way, they lost me, and I know I could have turned things around for them.
  • Having a candidate fill out an application even during an interview is a waste of time.  I say, reserve the job application for when you think you’re ready to send them an offer letter.  Focus first on finding the candidate of your dreams, THEN go through the formalities once you’ve found them.
  • Asking stupid interview questions that have no chance of determining the quality of the candidate, or are not applicable to the job.  Most of the time I see these coming out of HR, but every once in a while I’ll see a hiring manager ask them too.  The types of questions I’m talking about are things like “why do you want to work here?” or “what are your 3 greatest strengths and your 3 greatest weaknesses?”.  Seriously, stop wasting my time and yours and let’s move on to questions that really determine if I’m a good candidate.  One example would be “what project were you most proud of in your career and why?”  Remember my “passion for their career” requirement: if they don’t light up when asked to brag about themselves and their career, there’s something wrong.  Or if they can’t explain anything of significance, I think you have your answer as to whether they’re a good fit.  Me personally, I could probably give you hundreds of projects that made me beam with pride.  Even technical questions can be trivia.  If someone claims to be an expert in a field, then sure, they should know the trivia, but if they’re not claiming to be an expert, don’t ask them something that you could just Google.
  • Can we just do away with the dumb requirement of bringing three printed resumes and references along?  First, it’s a waste of paper, and second, you can print out the resume that I sent in, or just look at it on your phone.

What’s broken in the talent acquisition process from a hiring manager’s view:

These are points where I see that the candidate has messed up, or is making my life tougher than they should.

  • Stop loading your resume up with stuff that you really didn’t do.  There are so many times where I would read through someone’s resume and start asking them to explain details about an accomplishment they listed, only to hear that really they just helped and someone else actually did all the complex parts of the project.  If you put “designed and implemented a virtual environment hosting over 500 VMs”, then I’m going to dig into that environment so that I know you actually did it.
  • Sort of related to the above, don’t apply for jobs that you know you’re not qualified for.  Just because your employer gave you an inflated title of *Senior* SysAdmin doesn’t make it so, when we both know at best you’re mid level, and more than likely not much better than a Jr.  Applying for a job that you’re not qualified for is only going to lead to you either getting declined (wasting both our time) or me having to let you go if you do talk a good game but can’t back it up.  Don’t get me wrong, there were times much later in my career where I realized I had to fake it till I made it, but I knew I had the skills to do the job.  Not just because I thought so, but because everyone I worked with told me I did.
  • Related to the above, I actually appreciate a person that admits “you know, I really don’t know how to do this”, so long as that’s not your answer to every question I ask, and so long as that’s not the answer to critical points of the job description.  Let’s be real here, if you DO say you know how to do something, I’m going to ask you about it.
  • Not having a passion for what you do is a huge negative for me.  Look, I’m glad that you enjoy your tomato garden, but I care a hell of a lot more that you love your job to the point where you’re going to stay up to date on your own.  Unless you just happen to have raw talent (and some do), being good in IT requires a lot of work, and there aren’t enough business hours to get work done and stay up to date.  If you love your job, you’ll never work a day in your life.  I live by that, and I want the people I look for to as well.  Think of it like this: do you want a surgeon operating on you who only cares about their job when they’re getting paid, or do you want someone that goes to seminars on their own, researches on their own, and in general wants to excel at what they do?  I’m not saying you should live to work, but reading a few blogs every night on the couch isn’t going to kill you, nor is thinking about how to architect solution x while you’re running / biking, etc.
  • You should be an expert at some things and pretty darn good at a lot of things.  If you say you’re an expert in Active Directory, I’m going to ask you about bridgehead servers and how they’re selected.  I’m going to ask you about the 5 FSMO roles and what they do.  If you say you’re a VMware expert, I’m going to ask you if HA requires vCenter, and I’m going to ask you if the vMotion kernel and the management kernel can co-exist on the same VLAN.
  • I want you to be able to talk about IT architecture, and how you’d solve certain problems and why you would use that tactic.  I might not agree with you, but if your answer is well thought out, we can always negotiate the tactics as long as the strategy ultimately solves the problem.
  • I want to see progressive experience and responsibility, and you should have the skills to back it up.  I realize the higher up you get the tougher it gets.  But if I see that it took you 10 years before you got your first sysadmin gig, I’m going to wonder if you really have what it takes.
    • Just because you’re a SysAdmin, doesn’t mean I don’t expect very good desktop management skills out of you.
  • I want to see fire and passion when I interview you.  If you disagree with me, diplomatically correct me.  I don’t know everything, and who knows, I might just be testing whether you do, and whether you have the non-technical skills to lead upwards.  You’re no good to me if you’d let me crash the Titanic into the iceberg because you saw the iceberg and didn’t say anything.  Besides, think of it from your view: do you really want to work with someone that isn’t open to discussions?  That doesn’t mean I won’t push back, but if your point is well thought out, I can at least respect your view.
  • If you need to dress down, I’m okay with it, but just let me know ahead of time before you show up at the interview.
  • If you’re calling me from a cell, try to make sure you have decent signal.
  • If you think you might be late, just let me know, I get it.

Closing thoughts:

Just remember, with all of this the point isn’t to make all kinds of crazy demands of either the candidate or the hiring company.  It’s about cutting through outdated, and in many cases proven ineffective, recruitment techniques.  I want to work for the best company, and in turn I want to be able to find the best candidates too.  The sooner we get rid of broken acquisition techniques and improve our process, the quicker we’ll all find what we’re looking for.

Review: 1 year with LogicMonitor

Disclaimer:

This is unsolicited feedback.  These are my own opinions and not those of my employer.

Terminology:

I know I’ve been lacking this in some of my other blog posts, so here’s a stab at making sure you have a rough idea of what certain words mean.

  • LM:  Acronym I use for LogicMonitor.
  • Collector:  A Windows or Linux box that grabs the data you want monitored from your other servers.  It’s an agent that sits on a server and allows LM to either remotely poll other servers, or locally collect data.
  • Datasource:  This is a term used to describe a “monitor”.  As an example, if I wanted to monitor a Windows logical disk perfmon counter, that would be a datasource.  If I have a script that checks the status of an Exchange database, that would be a datasource.
  • Instance:  A term used to describe a very specific item that’s monitored by a datasource.  For example, going back to the logical disk datasource: if you had a C:, E: and F: drive, each drive is an individual instance.  Some datasources are designed to collect multiple instances (like the example); others, like “ping”, are not.
  • Device:  Anything that you’re monitoring.  This is what LM licenses.  A few examples: a Windows server, Linux server, switch, PDU, etc.

Introduction:

If we’re all honest, I think we can all agree that monitoring solutions suck.  It’s a little crass I know, but I think it’s a fitting word for this software class as a whole.  Backup and monitoring are the two areas where I always hear folks say “they all suck, this one just sucks a little less”.  And that is really why we chose LogicMonitor.  It’s not that I’m blown away with it as a solution, but it sucks a lot less than the other options out there.  So you could say from my view they’re the best, or you could say they’re the least worst.  Either way, we chose them over a number of other vendors, including the oh so popular SolarWinds.

For us, we were coming from absolutely nothing (just some scripts and freebie tools) to try and monitor a 700+ device environment.  It was a huge undertaking to say the least.  We really didn’t have anything to compare anyone to.  I had some personal experience with WhatsUp and Solarwinds, but that was it, and it was in a much smaller environment with much simpler needs.

Who we evaluated:

Just going to provide a quick list of everyone we evaluated.  Ultimately the choice boiled down to Solarwinds vs. LogicMonitor for us.

  • Paessler’s PRTG
  • Ipswitch’s WhatsUp
  • ScienceLogic
  • Solarwinds
  • LogicMonitor
  • Microsoft Systems Center Ops Manager

Not the most comprehensive list of monitoring solutions, but we went after the well known ones, and we stayed away from the more complex and / or expensive ones.  Nagios as an example is one we skipped due to the complexity.  I’ve got about 500 other things to do with my time, and troubleshooting an open source monitoring solution isn’t one of them.

Pre-sales:

Besides the fact that I liked the way LogicMonitor functioned better than other solutions, it was ultimately the pre-sales experience that sold us on them.  It’s not that LM did anything that I wouldn’t normally expect out of a decent company.  As an example:

  1. Highly responsive to any questions and comments
  2. Had their better / best pre-sales engineer at our disposal
  3. Walked us through the product and actually was able to answer odd ball questions.
  4. Helped get us setup and worked with both our network team and my team individually to evaluate point cases.
  5. Helped us setup custom monitors.
  6. Worked with us on pricing
  7. Didn’t badger us, knowing that we were still evaluating other solutions.
  8. Specific to monitoring, didn’t tell us to reach out to a community for support.
  9. They were very gracious with extending our trial multiple times.  We got held up with other issues, or needed more time to test something and they were very easy going about it.

Surprisingly, LogicMonitor was one of the last vendors we evaluated.  Mostly due to the fact that I had never heard of them, and it took a bit of creative searching to find them.  I didn’t even really want to look for them, if we’re being honest.  Product wise, I thought I wanted Solarwinds.  Really, the only reason I went looking was that Solarwinds’ pre-sales was so horrible.  And when I say horrible, I’m not exaggerating.  Just to give a few examples.

  1. Sales guy pinged me almost every other day asking me when I was going to buy.
  2. Sales guy offered me this “Great Deal!” if I signed on the dotted line by COB Friday (it was Thursday when they sent the message).
    1. Sales guy told me I would never get said deal again
  3. Sales demo was done by sales guy, without a pre-sales engineer on the call.  Sales guy couldn’t answer technical questions.
  4. It took days to get in touch with a pre-sales engineer for any and all questions we had.  It wasn’t like we fired off an email and we’d get a response in a few hours, it was days.  Then if we needed to jump on a call, again it was days.
  5. When we did jump on a call, most recommendations for monitors were “go check out our community database, I’m sure we’ve got something that meets your needs”.  Not “let me go look into that…”
  6. They told me features existed in base level products, that ultimately required a different (more expensive) add on.
  7. We were NOT allowed to access generic tech support to evaluate them.  That was a big red flag for us.

So really, LM (LogicMonitor) just seemed like a breath of fresh air after dealing with Solarwinds.  Like I said, it’s not that LM did anything above and beyond.  The reality is, they just acted the way you’d expect a tier 1 solution provider to act, and that’s a good thing.

The product itself:

All in all, it’s a great product.  It still has a LONG way to go, but the good news is they’re on the agile model and it shows.  There are improvements and changes coming out all the time.  To get into the specifics, let’s start with what I like.

Pros:

  1. They have a huge list of pre-built monitors, way larger than Solarwinds or any other vendor we evaluated.  Granted, Solarwinds has a ton of community driven monitors, but LogicMonitor has a much larger database of monitors they built and support.  This is a huge plus, because we didn’t want to spend the next year developing and tweaking our own monitors.
  2. The basics of getting monitoring setup are easy and intuitive.  Define some credentials, setup some collectors, and BOOM! you’re off and monitoring.
  3. The monitors themselves are mostly comprehensive.  They all have a great set of pre-defined thresholds based on the vendor’s recommended practices.
  4. Building custom monitors (scripts, WMI) is fairly easy.  I had developed multiple PowerShell monitors within a couple of days.
  5. Building dashboards is easy, and reasonably intuitive
  6. There’s pretty much no limit to what you can monitor.  If it can be polled in some way, it can be monitored.
  7. They support a lot of out the box collection methods
    1. Powershell, VBS, CMD, Perl, Groovy, and other scripts
    2. SNMP
    3. WMI
    4. Perfmon
    5. JDBC
    6. JSON API
    7. Generic HTTP gets
    8. etc.
  8. There is only 1 license to worry about.  Whatever monitors you assign to a device don’t cost any extra.
    1. Again, it’s licensed by device, not by object.  So a switch with 96 ports only counts as one device.
  9. They keep reasonably good documentation on the monitors, although it’s been slacking off a bit as their database grows.
  10. They have multiple alert methods, including email, sms, phone, slack, pagerduty etc.
  11. They have a good set of integration with other third parties, like Slack, pagerduty, onelogin, and various other vendors.
  12. Other than the collector (poller), everything is hosted in the cloud.  So you do have some work to get things set up, but ultimately the hard work is taken care of by them.
  13. They’re constantly implementing new features and upgrades.  It’s slowly getting community driven too, which is a good / bad thing.
  14. Because it’s web driven I can log in from anywhere to check out what’s going on.  They also have a phone app, which still needs some work, but hey, it’s a nice option.
  15. If you happen to be an MSP, it looks like they have some nice features around that.
  16. Notifications are quick and reliable.
  17. If they don’t have a datasource, and you don’t have time to create one, you have two options.
    1. You can pay them, and it will get high priority.
    2. You can wait, get put into a queue, but ultimately you won’t have to do much and they’ll take care of it for you.
  18. The new GUI is based on HTML5, and as far as web consoles go, it’s very fast / snappy.
  19. They have a feature called “dynamic groups” which is pretty cool.  You can group devices based on a query.  The query can contain anything from the list of properties on the device.  So I might group all “physical” servers by looking for “poweredge” in the model as an example, or all production SQL servers based on a naming pattern.
  20. It’s rudimentary, but they do have built in external monitoring (global monitoring) of simple things like ping, web checks, etc.  Best of all, so far it’s unlimited.
  21. They can mix and match discovery and collection methods.  So if you want to, for example, use a script to find all Dell servers with a DRAC, you can then use their built in “ping” to do the actual monitoring.  Meaning, just because you use a script for the discovery of a device / instance doesn’t mean you need to use a script for the collection.
  22. They’re SaaS, so it’s pay as you go.

Before I get into the Cons, I just want to be clear that EVERY product has cons.  I have yet to use a perfect product.  The list may look large and ominous, but that’s frankly because it’s a lot easier to pick apart things you don’t like than to appreciate the things that work well.  That said, I’m still not going to pull any punches.  If there’s one tone I want to set with my blog, it’s not that I’m mean, but that I’m honest, even if it hurts.  Again, also keep the disclaimer in mind.  These are my opinions.

Cons:

  1. Collection / Polling has a few issues IMO.  Not deal breaker issues, just issues.
    1. The collectors are not the most reliable solution I’ve used.  We’ve run into dozens of cases where they get “overloaded” and we either need to restart them or add more collectors to balance out the load.  They will say 100 devices max per collector if it’s Windows or VMware; IMO that’s overly generous, more like 40 – 50 if you’re lucky.
    2. They don’t scale up very well, which means you need a lot of collectors to manage your load reliably (we run 10 for roughly 500 devices).  They’ll tell you they can scale up, and maybe that’s true, but it involves a lot of work getting folks on their end that know all the settings to tweak and whatnot.
      1. Update 03/01/2017:  They’ve released a method for having collector “sizes”.  I haven’t had a chance to validate it yet, but it looks like it enables the ability to scale up fairly well.
    3. The collection process is horribly inefficient IMO (but getting better).  If you have a switch with 100 switch ports, at a minimum you’re going to hit that switch with 100 instances * the number of multi-instance datasources you have.  For Windows boxes, a good example would be an individual WMI call per hard drive (instance), per datasource.  So the Logical Disk counter + Physical Disk counter would each generate unique polls * the number of disks.  If I have two disks per server, just to monitor Logical and Physical, I’m looking at 4 WMI calls just for that one device and those disks.
      1. They have released a newer script collection process which now does one poll per datasource, which is the way it probably should have been all along.  One poll per datasource per device.  This doesn’t fix the other 99% of their datasources, but it’s a start.
    4. You never really know when the collectors are overloaded until you log in to the console and start seeing “no data” warnings and stuff like that.  Then you contact them, they tell you it’s overloaded, and even that may not be true.
    5. Troubleshooting issues can be a bit of a pain.  Their collectors return errors that sometimes could mean one of a number of possible issues.  For example, specific to WMI, we’ve frequently been told “it’s clear you don’t have permissions set right based on our error”, only for us to prove that’s not the problem.  They then move on to “WMI must be broken”, which in turn we demonstrate isn’t the case, before finally realizing “oh, WMI is just timing out”.  Instead of having very specific and granular error messages, it appears they use very generic messages and then expect the tech / you and me to just take a few guesses as to what might be going wrong.
  2. They don’t appear to have any Powershell script datasources.  There are just certain things that can only effectively be monitored by Powershell.  Windows storage spaces at the time was one example.  To this day, I still don’t see them having any PS datasources.  They have a ton of datasources based on Groovy, which is fine for web and Linux, but really, they need to get some more Powershell experts on staff.
  3. They lack instance, datasource and device aggregation-based monitoring.  What I mean by that is, as an example, let’s say you have two PDU’s in a rack.  Ultimately you want to know the aggregate power consumed in a rack, not just by an individual PDU.  That is not something they can do without some fancy scripting on your part.
  4. They don’t support things like “sub-instances”.  For example, if I have a database server, that will have databases, and those databases will have files.  They don’t support displaying and monitoring the data in such a way that I can “drill” down from the SQL server to an individual file in a DB.  They CAN monitor individual files and individual DB’s, they just can’t aggregate them.
  5. You need professional services for things that frankly you shouldn’t need.  For example, if you want to bulk import a bunch of devices, you can’t do that on your own (at least not easily).  There’s no GUI to import a CSV or a JSON file.  I think at best we “might” have an API finally available that could be used, but even there, at least last time I checked, it’s not documented well.  Either way, tools should be easy to use, and for importing devices, a GUI should not be a problem at all.
  6. Setting up dashboards is easy, but tedious if you need to make multiple similar dashboards.  If I have 20 web clusters as an example, I might want a dashboard for each one.  There’s no easy way to create those dashboards other than simply cloning a template, and going through (painfully) one chart and one datapoint at a time and updating which device and label to use.  Again, going back to point 5, perhaps something via the API’s could make this much easier.
    1. Update 03/01/2017:  I haven’t had a chance to validate their dashboard enhancements yet, but it looks like they’ve added functionality that would make creating dashboards a lot easier.
  7. Their lack of AD-integrated authentication is frustrating.  Forcing us to use either MS SAML or some third-party solution just seems unneeded.  It’s a nice option for sure, but I’d much rather have an appliance that I run which enables AD auth.
  8. They lack a “poll now” feature which I loved in SolarWinds.  If I have a datasource (say CPU) set to poll every 5 minutes, I’m stuck waiting 5 minutes for that device to poll.  5 minutes can be a long time when you’re recovering from an outage.
    1. Update 03/01/2017: They have now added the poll now feature.
  9. They roll up historical data way too fast on average.  Within less than a day, our data went from per-poll datapoints (say every minute), to 8 minute averages.  I could understand if that happened after a week, but in less than a day, that’s just crazy.  Now, last I heard, they were looking to move to a different platform which would resolve this (again they continue to innovate / improve), but it’s a painful issue to live with when you need it.
    1. Update 10/04/2016: I forget the exact release (7.9 maybe?) but this is no longer an issue.  What you collect is what you keep for your agreed upon retention.
  10. They had this really simplistic, but slightly dated UI that they’re now abandoning for a newer GUI.  In many ways I DO like the new GUI, but simple things like finding threshold overrides, or datasources disabled at a group level are now way harder than they need to be.  It seems like the UI designers are spending too much time focusing on aesthetics vs. functionality.
  11. I appreciate that they have tech support, but honestly I feel like they’re reading a script sometimes.  Multiple cases where we’ve demonstrated that our device wasn’t having issues, and that it’s probably their collectors, they’ve come back and forced us to prove it time and time again.  Even when we come armed with info like “I logged into the poller and ran this command to grab WMI info on the remote problem system as the poller user”.
  12. I’m not going to go into great detail here, only to say that I think their architecture could be a little more secure with some simple changes.
  13. I HATE the way their datasources don’t have an “overrides” function.  What I mean by that is they have system datasources (ones they provide) and sometimes you need to make tweaks to them.  Well, in a few months they may have an update for that datasource you tweaked.  When you run the update, it whacks all your settings.  I wish they provided a read-only datasource, and that we had an ability to “link” a new datasource that had our customization, which overrode their datasource.  This would allow their new features and tweaks to roll in, but things that we’ve overridden would remain in place.
  14. When they build datasources, to put it bluntly, I think they cheap out on fully developing them.  For example, take an IronPort Email Security Appliance.  They may have a built-in datasource, but it may not have all the available OID’s set up or even all the datapoints.  I get that it means they might need to store more data on their end, but why not just bite the bullet and configure ALL MIBs as an example, and all datapoints.  Maybe even have an option where only what’s truly needed (subjective definition BTW) is enabled by default, but the other options are available, and as simple as enabling them.
  15. Going off of point 14, they don’t really seem to think like an admin of the thing they’re monitoring.  Or more specifically, in many cases they’re not fully monitoring the health of an application, just monitoring counters that exist.  Let me give you an example.  With Active Directory, one of the things you want to monitor is DFS-R health.  Simply monitoring counters doesn’t always provide insight into whether something like this is healthy or functioning as it should.  In my case, I wanted to know for a fact that replication was working in Active Directory, so I ended up having to write a Powershell script that regularly created / updated a health file (on each DC) and then monitored that a given health file was updated on all DC’s.  Now to be fair to LM, they never claimed to be a full-tilt application monitor, and I was able to do said monitoring with their product.  Still, if LM were to, say, hire or consult with an Active Directory engineer, there would be things like this that they could probably glean, that would benefit all of their customers.  You could also contend this is something that could be crowd sourced by the community, but TBH, I’d much rather see something provided by them, and supported by them.
  16. They still lack a comprehensive RestAPI.  Adding to that, they lack Powershell cmdlets for administering their products.  Again, I’m going to berate any vendor that doesn’t offer Powershell in 2016.  IMO if you’re catering to a windows crowd, not offering Powershell is just foolish on your part.  Again, why should 100’s to 1,000’s of your customers waste time figuring out your API’s to do common functions, when you could spend the time once, and 1,000’s of customers could benefit from your work.  Not to mention now there would be a standard (searchable) CLI / help doc for how to do something.
    1. Update 10/04/2016: As of 8.0, their API is pretty comprehensive if not complete.  I haven’t had a chance to use their API, so I can’t speak to things like how user friendly it is, but it does exist now.
  17. To say setting up, troubleshooting, and administering their alert / notification rules is a PITA is an understatement.  There are things like making sure rules are properly prioritized based on static numbers (not dynamic).  It works like an ACL.  Adding to that, it seems like there’s only one target per device.  Even though a device can be a member of multiple groups, only one group can actually be used to configure notifications / alerts.  For example, let’s say you have a group called “windows” and a sub-group called “SQL”.  And let’s just say for argument, you have a special SQL server called “specialSQL”.  Now let’s say by default you want all Windows admins to monitor all Windows boxes.  And you want DBA’s to get all SQL alerts (but Windows admins still need the alerts too), but you also have a special team that needs alerts for the specialSQL server.  You need three alert groups set up with members as outlined below.  So now just imagine a joint like us with 700 devices and 100 different folks that want different alerts for different systems?  Yikes!  And, just remember the beginning point, each one of these is a rule, with a different number, that needs to be configured correctly or one group may not end up getting a notification, ever.
    1. One for Windows servers, which contains Windows admins
    2. One for SQL servers, which contains Windows admins + DBA’s
    3. One for specialSQL, which contains Windows admins + DBA’s + the special team
  18. Their filters in many cases are “include” only and there’s no “exclude” option.  For example, if I want a dashboard that shows all Windows servers but no SQL servers, that’s not going to happen if the SQL server happens to have *windows* in its group name.  The only way to work around that would be, instead of using a wild card for group *windows*, to explicitly include *Exchange*, *IIS*, *File*, etc.  It’s a PITA.
    1. Update 10/04/2016:  I would not go so far as saying this is resolved completely, but they do offer some exclude capability now.  It appears to only be one level deep though.  As in, in the above example I could exclude .\windows\SQL by excluding *SQL*, but I couldn’t exclude .\windows\test\SQL with the same statement.  Still kind of lame IMO.  Frankly their operators and filter mechanism is over complicated.
  19. Their datasources, besides not always being documented well (or in many cases not at all), have absolutely no decent naming convention.  Instead of doing something like “Vendor_Product_Datasource” they just randomly choose vendor or product names and in many cases abbreviate them.  For example, one datasource might be named “winos” and another “wincpu”?
  20. Their threshold evaluation leaves a lot to be desired.  It works based on the following operators “=,<=,>=,<,>,!=, etc.”  Some devices have nice even number sequences where anything above 1 = warning until you get to 4 and then it’s an error.  In those cases their thresholds work fine.  However, if you have something tricky like 0 = good, 1 = verybad, 2 = warning, 3 = maybe something you should look into, 4 = error, there’s no good way to target numbers that are all over the place with their limitation.  It would be nice to see them add query based evaluation instead.  As in supporting “or” + “and” + grouping threshold conditions.  Like (=1 or = 2) or (>=7).  I suspect it might take some extra CPU power to make that evaluation, but it would be worth it (I’ve sketched out what I mean just after this list).
  21. Having agent based collection would be a nice option for those of us that don’t like having to dole out admin rights to our collectors.  We have a few cases where we installed the collector right on the server we wanted to monitor to avoid such things.  Perhaps we could do that across the board, but I haven’t looked into it and honestly that would be a hack.
  22. They’re SaaS, so you’re limited to a lot of their constraints (data retention as an example).
    1. Update 10/04/2016:  It looks like they now retain up to two years worth of data with certain contracts.  Better than before, but still not a good fit for long term retention.
  23. They’re not a replacement solution for all your deep web, application and network monitoring solutions.  Yes they have NetFlow, but I suspect Scrutinizer does it much better.
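
To make number 20 above a little more concrete, here’s a purely hypothetical sketch (this is NOT something they offer today) of what grouped, query style threshold evaluation could look like, expressed in Powershell terms and using the scattered status values from my example:

#Hypothetical only: grouped "or" conditions mapping scattered status codes to an alert level, instead of a single =, <, > style operator
#0 = good, 1 = verybad, 2 = warning, 3 = maybe something you should look into, 4 = error
function Get-AlertLevelSketch
{
    param([int]$StatusValue)

    if (($StatusValue -eq 1) -or ($StatusValue -ge 7)) {Return "Critical"}
    elseif ($StatusValue -eq 4) {Return "Error"}
    elseif (($StatusValue -eq 2) -or ($StatusValue -eq 3)) {Return "Warning"}
    else {Return "OK"}
}

#Example: returns "Critical"
Get-AlertLevelSketch -StatusValue 1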

Summary:

They’re not perfect as I alluded to, and believe me, I could probably keep going on with the cons (and pros, albeit harder).  However, it’s about context, and in the context of other solutions, I really like LogicMonitor.  They’re fairly new as a company, and given everything they’ve been able to do so far, I honestly can’t wait to see what they’ve got on the horizon.  I’ve found that most things I care about, I’ve been able to use LM to monitor.  Even things that you normally wouldn’t monitor, like job status in 3rd party applications.  They for the most part have been the one console I use to monitor everything.  It took a bit of time to roll up all my outlying functions into them, but now we have a great single pane of glass.  Give them a peek when you have some time.

Review: 1.5 years with Veeam

Disclaimer:

These are my opinions, these are not facts.  Take them with a grain of salt, and use your own judgement.  These are also my personal views, not those of my employer.

Also, no one is paying me for this, no one asked me to write this.  It’s not un-biased, but its also not a hired review.

Introduction:

As you may know from reading my backup storage series, we went through a backup migration from native CommVault to Veeam + CommVault.  I wanted to put together my experience living with Veeam as an almost complete backup solution replacement for the last year and a half.  Again, I’m always looking for this type of stuff, but it’s always hard to find.

The environment:

We’re backing up approximately 80TB in total (after compression / dedupe).  That’s what CommVault is copying to tape once a week.  Veeam probably makes up a good 50TB+ of that data set.

The types of VM’s we’re backing up are mostly file servers and generic servers.  However, we do have almost 6TB of Exchange data and a smattering of small SQL servers.

Source infrastructure specs:

  • 5 Nimble cs460  SAN’s
  • 10g networking (Nexus 5596) with no over subscription
  • 4 Dell r720 hosts with dual socket 12 core procs, and 768GB of RAM.

Backup infrastructure specs:

  • Veeam Infrastructure (started with v6.5   and currently running v8):
    • 1 Veeam server
    • 5 Veeam proxy servers
    • 3 VM’s with 8 vCPU’s and 32GB of vRAM
    • 1 Dell r710, dual socket 6 core 128GB of RAM
    • 1 Dell r720 , dual socket 8 core, 384GB of RAM
  • Windows storage spaces SAN:
    • Made up of the same two Dell proxies as above.
    • 80 4TB disks in a RAID 10 (verified 4GB per second read and 2GB per second write)
    • All 10g
  • CommVault infrastructure:
    • 1 VM running our CommCell
    • 1 Dell r710 attached to our Quantum tape library

All the infrastructure on this side was also connected to the same Nexus 5596 @10Gb, and again no over subscription.

Context is everything, which is why I shared my specs.

What we liked pre-migration and still do:

Let’s start with the pros that Veeam has to offer.  Again, we chose Veeam because it was working well as a point solution (replication), so they came in with a good reputation.

  • For the most part, Veeam is super easy to setup and configure.  While I did read the manuals at times, many things were very intuitive to setup.  There are areas that are not, but I’ll get more into that later.
  • I like that the backup files are all you need to recover out of.  So long as you have the VBK you can recover your last full, and if you have the vibs and vrb, you can get all your restore points too.  This in turn makes it very easy to move the Veeam backup data around, and import the data on other servers.  There’s no database to restore, no super complex recovery process if all you need is the data.
  • The proxy architecture for the most part scales out very well.  There’s little configuration needed, and jobs basically just balance across the proxies on their own.
  • VM restores are mostly intuitive (talking about an entire VM).  And guest file recovery if you know when the file was deleted is also easy to do (just not always quick).
  • I love that they have Powershell for almost everything.  It blows my mind that in 2016 there are vendors still lacking Powershell in a windows environment.  I guess the idea is they provide API’s and that’s good enough.  I disagree, but anyway, kudos to Veeam for having PS.
  • We get backup and replication in the same product.  The two are different, and while you can backup and then restore to achieve a replication type result, it’s easier to just say “hey, I want to replicate this here”.
  • It’s reasonably affordable (although their price is starting to get up there).  It’s backup, and it’s a tough win with management to get decent investment.  Having something affordable is a big pro.
  • Veeam is very quick to have product updates after a new ESXi release and I’ve found  that they’re pretty stable afterwards too.  It’s refreshing to see a vendor strive for supporting the latest and greatest as soon as possible.
  • For the most part, the product is very solid / reliable.  I can’t say I’ve seen major issues that were on Veeam’s side.  A few quirks here and there, but ultimately reliable.  The copy jobs are probably the only gripe I have reliability wise.
  • I like that they have application level recovery options baked into the product.  Things like restoring individual email items, AD objects, etc.
  • They try to stay on top of VMware issues related to backups, and come up with ways to work around them.  During the whole CBT debacle, I think they did a great job of helping out their customers.

What we didn’t know about, didn’t think about, and just in general don’t like about Veeam:

  1. It’s rare nowadays to find good tech support, and Veeam is no exception.  I know I’m not alone in my view; if you google objectively, you’ll find tons of people going on about how you need to escalate as soon as humanly possible if you want any chance at getting decent support, and I find this to be true.  Their level 1 support is fine for very simple things like “what’s the difference between a reverse and forward incremental?” but beyond that, good luck getting any decent troubleshooting out of them.  Even when you escalate to level 2, I contend that while they’re certainly better than level 1, it doesn’t inherently mean you’re dealing with the caliber of person you need to solve a tricky problem.  The biggest problem I’ve seen so far with Veeam, is they only know their product well, and if it comes to troubleshooting anything related to backup in VMware, you’re stuck opening a case with VMware and trying to juggle two vendors pointing fingers.  Me, I would have expected much deeper expertise out of Veeam when it comes to the hypervisor they’re trying to backup.  Maybe they do have it, and I just haven’t been escalated enough, but having a level 2 person tell me they’re not super familiar with reading VMware logs is disappointing.
  2. I honestly don’t feel like product suggestions are taken seriously unless they already align with something Veeam is working on.  I’d also add, there’s no good way to really see what features have been suggested already and to get an idea of how popular they are.  You’re basically stuck taking their word for it that a request is or is not a popular one.  One prime example of this is their SAN snapshot integration.  What a mess.
  3. We knew tape was bad before we got into Veeam, and up until v9 (which I haven’t looked at seriously yet), it didn’t get much better in the two revisions that we had.  It was for this reason that we were stuck keeping CommVault around.  Even with version 9, they’re still not good enough to copy non-Veeam data to tape, which to me, is a big con of the product.
  4. We have a love / hate relationship with their backup file format.  On one hand it’s nice and easy to follow, each backup job is a single file each time it executes.  So 1 full and 6 incs = 7 files.  However, the con of this really starts to show when you have a server that’s say 8TB.  That means the minimum contiguous allocation you can get away with is 8TB.  Now just imagine you have 3 jobs like that, plus a bunch of smaller ones.  What’s a better way you ask?  File sharding, and load balancing those file shards across multiple LUNs automatically (one of my favorite CommVault features).  There are other issues with this file format too; even if they had to keep the shards in the same contiguous space, sharding would still have more pros than cons.  For example, Windows dedupe works much better on smaller files.  Try deduping an 8TB file vs. a 4GB file.  One is not doable and the other is.  Here’s another one.  Try copying an 8TB file to tape without ever having a hiccup and needing to restart all over again.
  5. They don’t have global dedupe, and even worse, they don’t even dedupe inside a given job execution.  Each job execution is a unique dedupe container.  All that great disk saving you think you’ll get, will only come if you stack tons of VM’s into a single job, which ends up leading to problem 4.  Add to that, the only jobs that really see any kind of serious dedupe are fulls.
  6. VM backup of clustered applications is just bad, and it’s made worse in Veeam’s case when they don’t have SAN snapshot support for your vendor.  Excluding any specific application at the moment, to give you an idea of why it’s bad, picture this scenario.  You have a two node clustered application and you’re using a FSW for quorum.  You’ve followed best practices and only backup the passive node.  Windows patches roll around, and now you need to patch the active node, so you fail your cluster over to the passive node that you’re backing up.  In this scenario, it becomes problematic for two reasons.
    1. If you forget to disable your backup’s during the maint. window, there is a high degree of probability that your cluster will go offline.  All it takes is a backup kicking off while you’re rebooting your other node.  The stun of a clustered windows server causes the cluster service to hiccup a lot, and if the current node is hiccuping while your other node is offline, that’s not enough votes to satisfy quorum, and hence, your cluster is going down for the count.
    2. Take the above point, but now just imagine your secondary node is down for good.  Maybe a windows patch tanked, maybe your SAN bombed out.  Doesn’t matter.  Now you’re stuck with a single node cluster, and one that needs a disruptive backup running (more important than ever now).  It’s not a good place to be in.
  7. So what about backing up standalone applications?  Well at least with Exchange 2007, we’ve seen the following issues.
    1. The snapshot removal process has caused the disk to go offline in Windows.  You could blame this on VMware, and to a degree that’s fair.  Except that my entire backup solution is based on Veeam, and this is their only option to backup Exchange.
    2. I’ve seen data get corrupt in Exchange thanks to snapshot based backup’s.  Again, it’s technically a VMware issue, but again Veeam relies on VMware.
    3. I’ve seen Veeam end a job with a success status, only to discover my transaction logs are NOT truncating.  Looking in the event viewer, you’ll see “backup failed”, but not according to Veeam.  Opened a case about this, and Veeam basically blamed Microsoft.  Again, see point 1 about finger pointing with Veeam’s tech support.
    4. Backing up Exchange with Veeam is SLOW.  We’ve actually noticed that it gets really good if we reseed a database, but over time, as changes occur and fragmentation occurs, the jobs just get slower and slower.
  8. Veeam IMO lacks a healthy portfolio of supported SAN vendors for snapshot based backup.  Take a look at Veeam’s list, and then go take a look at CommVault’s list.  That’s pretty much all I have to say about that.
  9. Their “sure backup” and “labs” technology sounds great on paper, but if you have an even remotely complex network, trying to setup their NAT appliance is just a PITA.  I’m not knocking it per se, it’s a nice feature, but it’s not something that’s simple or straightforward to setup in an enterprise environment IMO.
  10. Kind of a no duh, but they only backup virtual machines.  You can’t call yourself an enterprise backup solution if you can’t support a multitude of systems and services.  Need to backup EMC Isilon?  Not going to happen with Veeam.
  11. This is anecdotal (as most of my views are), but I just find their backup solution to be slow.  The product always blames my source for being the bottleneck, yet my source is able to deliver far faster throughput than what they’re pulling.  I might get say 120MBps when my SAN can deliver 1024MBps without breaking a sweat.  It’s not a load issue either, as we’ve kicked off jobs during low activity times.  I don’t tend to see a high queue depth or a high latency, so I’m not sure what’s throttling it.  SIOC isn’t kicking in either.  Again, perhaps it’s just something about VMware based backup.  Also worth noting that for the most part, the IO is sequential, so it’s not a random IO issue usually.  Although FYI, snapshot commits are almost entirely random IO (at least for Exchange).
  12. Opening large backup jobs can be really slow.  I have a few file servers that are 4TB+, and it can take a good minute, maybe two just to open the backup file.  Now to be fair, we don’t index the file system in Veeam so I don’t know if that would help.  Add to that, closing said backup file can also be slow.
  13. I have no relationship with a sales rep.  Maybe for some folks that’s a good thing, but for me, when I spend money with a vendor, I think there should be some correspondence.  In the 3 years we’ve owned Veeam, the only time a sales person got in touch with me was because I was interested in looking at Veeam ONE.  Never to check in and see how things were going, or anything like that.  Our environment probably isn’t large from a Veeam view, so I’m not expecting lunches, or monthly calls, but at least once a year would be reasonable.  I have no idea who our sales rep is, just to give you an idea of how out of touch they are.  And I know most of my other vendors’ sales folks by first name.

I could probably keep going on, but I  think that’s enough for now.

Conclusion:

Veeam is a solution I might recommend if you’re a small shop that’s 100% virtual with modest needs.  If you’re a full scale enterprise, regardless of how virtualized you are, I wouldn’t recommend Veeam at this stage, even with the v9 improvements.

I’ve certainly heard the stories of larger shops switching to Veeam.  I suspect a lot of that was due to the expense of CommVault or EMC.  I get it, backup is a hard line item to justify expenditures in.  That’s why you see vendors like Veeam, Unitrends and other cheaper solutions filling the gap so well.  Veeam in our case even looked like it might be a better solution.  It was simple, and affordable.  However, you soon realize that at scale, the solution doesn’t match the power of your tried and true “legacy” solution.  Maybe that solution costs more in capex, but I suspect the opex of running that legacy solution far outweighs the capex you’re saving.  In our case, that’s an absolute truth.

I wouldn’t say run away from Veeam, but I would say think really hard before you ditch a tried and true enterprise application like CommVault, EMC Avamar, NetBackup, or Tivoli.    If you know me and follow my blog, you know I do have love for the underdog / cutting edge solutions.  This isn’t me looking down my nose at a newer player, this is me looking down my nose at an inferior solution.

Thinking out loud: The cloud (IaaS) delusion

Introduction:

Just so we’re all being honest here, I’m not going to sit here and lie about how I’m not biased and I’m looking at both sides 100% objectively.  I mean, I’m going to try to, but I have a slant towards on prem, and a lot of that is based on my experience and research with IaaS solutions as they exist now.  My view of course is subject to change as technology advances (as anyone’s should), and I think with enough time, IaaS will get to a point where it’s a no brainer, but I don’t think that time is yet for the masses.  Additionally, I think it’s worth noting that in general, like any technology, I’m a fan of what makes my life easier, what’s better for my employer, and what’s financially sound.  In many cases cloud fits those requirements, and I currently run and have run cloud solutions (long before it was trendy).  I’m not anti cloud, I’m anti throwing money away, which is mostly what IaaS amounts to.

Where is this stemming from?  After working with Azure for the past month, and reading why I’m a cranky old SysAdmin for not wanting to move my datacenter to the cloud, I wanted to speak up on why, on the contrary, I think you’re a fool if you do.  Don’t get me wrong, I think there are perfectly valid reasons to use IaaS, there are things that don’t make sense to do in house, but running a primary (and at times a DR) datacenter in the cloud is just wasting money and limiting your company’s capabilities.  Let’s dig into why…

Basic IaaS History:

Let’s start with a little history as I know it on how IaaS was initially used, and IMO, this is still the best fit for IaaS.

I need more power… Ok, I’m done, you can have it back.

There are companies out there (not mine) that do all kinds of crazy calculations, data crunching and other compute intensive operations.  They needed huge amounts of compute capacity for relatively short periods of time (or at least that was the ideal setup).  Meaning, they were striving to get the work done as fast as possible, and for argument’s sake, let’s just say their process scaled linearly as they added compute nodes.  There was only so much time, so much power, so much cooling, and so much budget to be able to house all these physical servers for solving what is in essence one big complex math equation.  What they were left with was a balancing act of buying as much compute as they could manage, without being excessively wasteful.  After all, if they purchased so much compute that they could solve the problem in a minimal amount of time, then unless they kept those servers busy once the problem was solved, it was a waste of capital.  About 10 years ago (taking a rough guess here), AWS released this awesome product capable of renting compute by the hour, and offering what’s basically unlimited amounts of CPU / GPU power.  Now all of a sudden a company that would have had to operate a massive datacenter had a new option of renting mass amounts of compute by the hour.  This company could fire up as many compute nodes as they could afford, and not only could they solve their problem quicker, but they only had to pay for the time they used.

I want to scale my web platform on demand…. and then shrink it, and then scale it, and then shrink it.

It evolved further: if it’s affordable for mass scale up and scale down for folks that fold genomes, or trend the stock market, why not for running things like next generation web scale architectures?  Sort of a similar principle, except that you run everything in the cloud.  To make it affordable, and scalable, they designed their web infrastructure so that it could scale out, and scale on demand.  Again, we’re not talking about a few massive database servers, and a few massive web servers, we’re talking about tons of smaller web infrastructure components, all broken out into smaller independently scalable components.  Again the cloud model worked brilliantly here, because it was built on a premise that you designed small nodes, and scaled them out on demand as load increased, and destroyed nodes as demand dwindled.  You could never have this level of dynamic capacity affordably on prem.

I want a datacenter for my remote office, but I don’t need a full server, let alone multiples for redundancy.

At this stage IaaS is working great for the DNA crunchers and your favorite web scale company, and all the while, it’s getting more and more development time, more functionality, and finally gaining the attention of more folks for different use cases.  I’m talking about folks that are sick of waiting on their SysAdmins to deploy test servers, or folks that needed a handful of servers in a remote location, folks that only needed a handful of small servers in general, and didn’t need a big expensive SAN or server.  Again, it worked mostly well for these folks.  They saved money by not needing to manage 20 small datacenters, or they were able to test that code on demand and on the platform they wanted, and things were good.

The delusion begins…

Fast forward to now, and everyone thinks that if the cloud worked for the genome folders, the web scale companies and finally for small datacenter replacements, then it must also be great for my relatively speaking static, large legacy enterprise environment.  At least that’s what every cloud peddling vendor and blogger would have you believe, and thus the cloud delusion was born.

Why do I call it the cloud delusion?  Simple, your enterprise architecture is likely NOT getting the same degrees of wins that these types of companies were/are getting out of IaaS.

Let’s break down the wins that cloud offered and still offers you.  In essence, if this is functionality that you need, then the cloud MAY make sense for you.

  1. Scale on demand:  Do you find yourself frequently needing to scale servers by the hundreds every day, week or even month?  Shucks, I’ll even give you some leeway and ask if you’re adding multiple hundred servers every year?  In turn are you finding that you are also destroying said servers in this quantity?  We’re trying to find out if you really need the dynamic scale on demand advantage that the cloud brings over your on prem solution.
  2. Programmatic Infrastructure:  Now I want to be very clear with this from the start, while on prem may not be as advanced as IaaS, infrastructure is mostly programmatic on prem, so weigh this pro carefully.  Do you find that you hate using a GUI to manage your infrastructure, or need something that can be highly repeatable, and fully configurable via a few JSON files and a few scripts?  I mean really think about that.  How many of you right now are just drowning because you haven’t automated your infrastructure, and are currently head first in automating every single task you do?  If so, the cloud may be a good fit, because practically everything can be done via a script and some config files (there’s a quick sketch of what that looks like right after this list).  If however, you’re still running through a GUI, or using a handful of simple scripts, and really have no intention of doing everything through a JSON file / script, it’s likely that IaaS isn’t offering you a big win here.  Even if you are, you have to question if your on prem solution offers similar capabilities, and if so, what’s the win that a cloud provider offers that your on prem does not.
  3. Supplement infrastructure personnel:    Do you find your infrastructure folks are holding you back?  If only they didn’t have to waste time on all that low level stuff like managing hypervisors, SANs, switches, firewalls, and other solutions, they’d have so much free time to do other things.  I’m talking about things like patching firmware, racking / unracking equipment, installing hypervisors, provisioning switch ports.  We’re talking about all of this consuming a considerable portion of your infrastructure teams time.  If they’re not spending that much time on this stuff (and chances are very high that they’re not), then  this is not going to be a big win for you.  Again, companies that would have teams busy with this stuff all the time, probably have problem number 1 that I identified.  I’d also like to add that even if this is an issue you have, there is still a limited amount of gain you’ll get out of this.  You’re still going to need to provision storage, networking and compute, but now instead of in the HW, it will simply be transferred to a CLI / GUI.  Mostly the same problem, just a different interface.  Again, unless you plan to solve this problem ALONG with problem 2, its not going to be a huge win.
  4. VM’s on demand for all:  Do you plan on giving all your folks (developers, DBA, QA, etc.) access to your portal to deploy VM’s?  IaaS has an awesome on demand capability that’s easy to delegate to folks.  If you’re needing something like this, without having to worry about them killing your production workload, then IaaS might be great for you.  Don’t get me wrong, we can do this on prem too, but there’s a bit more work and planning involved.  Then again, letting anyone deploy as much as they want can be an equally expensive proposition.  Also, let’s not forget problem number 2, chances are pretty high your folks need some pre-setup tasks performed, and unless you’ve got that problem figured out, VM’s on demand probably isn’t going to work well anywhere, let alone the cloud.
  5. At least 95% of your infrastructure is going to the cloud:  While the number may seem arbitrary (and to some degree it is a guess), you need a critical mass of some sort for it to make financial sense to send your infrastructure to the cloud (if you’re not fixing a point problem).  What good is it to send 70% of your infrastructure to the cloud if you have to keep 30% on prem?  You’re still dealing with all the on prem issues, but now your economies of scale are reduced.  If you can’t move the lion’s share of your infrastructure to the cloud, then what’s the point in moving random parts of it?  I’m not saying don’t move certain workloads to the cloud.  For example, if you have a mission critical web site, but everything else is ok to have an outage for, then move that component to the cloud.  However, if most of your infrastructure needs five 9’s, and you can only move 70% of it, then you’re still stuck supporting five 9’s on prem, so again, what’s the point?
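
For reference on number 2 above, this is roughly what “everything through a JSON file and a script” looks like in Azure’s ARM world.  This assumes the AzureRM Powershell module is installed; the resource group, location, and file names are just placeholders, not a real deployment:

#Log into Azure and create a resource group to deploy into (names are placeholders)
Login-AzureRmAccount
New-AzureRmResourceGroup -Name "rg-example" -Location "East US"

#The template and parameter JSON files describe the entire deployment; the script just pushes them up
New-AzureRmResourceGroupDeployment -ResourceGroupName "rg-example" -TemplateFile ".\azuredeploy.json" -TemplateParameterFile ".\azuredeploy.parameters.json"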

Disclaimer:  Extreme amounts of snark are coming, be prepared.

Ok, ok maybe you don’t need any of these features, but you’ve got money to burn, you want these features just because you might use them at some point, everyone else is “going cloud” so why not you, or who knows whatever reason you might be coming up with for why the cloud is the best decision.  What’s the big deal, I mean you’re probably thinking you lose nothing, but gain all kinds of great things.  Well that my friend is where you’d be wrong.  Now my talking points are going to be coming from my short experience with Azure, so I can’t say these apply to all clouds.

  1. No matter what, you still need on prem infrastructure.  Maybe it’s not a hoard of servers, but you’ll need stuff.
    1. Networking isn’t going anywhere (should have been a network engineer).  Maybe you won’t have as many datacenter switches to contend with (and you shouldn’t have a lot if your infrastructure is modern and not greater than a few thousand VM’s), but you’ll still need access switches for your staff.  You’re going to need VPN’s and routers.  Oh, and NOW you’re going to need a MUCH bigger router and firewall (err… more expensive).  All that data you were accessing locally now has to go across the WAN, and if you’re encrypting that data, that’s going to take more horsepower, and that means bigger badder WAN networking.
    2. You’re probably still going to have some form of servers on site.  In a windows shop that will be at least a few domain controllers, you’ll also have file server caching appliances, and possibly other WAN acceleration devices depending on what apps you’re running in the cloud.
    3. Well, you’ve got this super critical networking and file caching HW in place, you need to make sure it stays on.  That potentially is going to lead back to UPS’s at a minimum and maybe even a generator.  Then again, being fair, if the power is out, perhaps its out for your desktops too, so no one is working anyway.  That’s a call you need to make.
    4. Is your phone system moving to the cloud too?  No… guess you’re going to need to maintain servers and other proprietary equipment for that too.
    5. How about application “x”?  Can you move it to the cloud, will it even run in the cloud?  It’s based on Windows 2003, and Azure doesn’t support Windows 2003.  What are application “X”‘s dependencies, and how will they affect the application if they’re in the cloud?  That might mean more servers staying on prem.
  2. They told you it would be cheaper right, I mean the cloud saves you on so much infrastructure, so much personnel power, and it provides this unlimited flexibility and scalability that you don’t actually need.
    1. Every VM you build now actually has a hard cost.  Sorry, but there’s no such thing as “over provisioning” in the cloud.  Your cloud provider gets to milk that benefit out of you and make a nice profit.  Yeah, I can run a hundred small VM’s on a single host; in the cloud, I’d pay for each of those same VM’s individually.  But hey, it’s cheaper in the cloud, or so the cloud providers have told me.
    2. Well at least the storage is cheaper, except that to get decent performance in the cloud, you need to run on premium storage, and premium storage isn’t cheap (and not really all that premium either).  You don’t get to enjoy the nice low latency, high IOP, high throughput, adaptive caching (or all flash) that your on prem SAN provided.  And if you want to try and match what you can get on prem, you’ll need to over-provision your storage, and do crazy in-guest disk striping techniques.
    3. What about your networking?  I mean, what is one of the most expensive recurring networking costs to a business?  The WAN links… well they just got A LOT more expensive.  So on top of now spending more capex on a router and firewall, you also need to pump more money into the WAN link so your users have a good experience.  Then again, they’ll never have the same sub-millisecond latency that they had when the app was local to them.
      1. No problem you say, I’ll just move my desktop to the cloud, and then you remember that the latency still exists, it’s just been moved from between the client and the application to between the user and the client.  Not really sure which is worse.
        1. Even if you’re not deterred by this, now you’re incurring the costs of running your desktops in the cloud.  You know, the folks that you force 5 years or older desktops on.
    4. How many IP’s or how many NIC’s does your VM have?  I hope it’s one and one.  You see, there are limitations (in Azure) of one IP per NIC, and in order to run multiple NIC’s per server, you need a larger VM.  Ouch…
    5. I hope you weren’t thinking you’d run exactly 8 vCPU’s and 8GB of vRAM because that’s all your server needs.  Sorry, that’s not the way the cloud works.  You can have any size VM you want, as long as it’s one of the sizes your cloud provider offers.  So you may end up paying for a VM that has 8 vCPU and 64GB of RAM because that’s the closest fit.  But wait, there’s more…  what if you don’t need a ton of CPU or RAM, but you have a ton of data, say a file server?  Sorry, again, the cloud provider only enables a certain number of disks per vCPU, so you now need to bump up your VM size to support the disk size you need.
    6. At least with cloud, everything will be easy, I mean yeah it might cost more, but oh… the simplicity of it all.  Yep, because having a year 2005 limitation of 1TB disks just makes everything easy.  Hope you’re really good with dynamic disks, Windows storage spaces, or LVM (Linux) because you’re going to need them.  Also, I hope you have everything pre-thought out if you plan to stripe disks in guest (there’s a sketch of that right after this list).  MS has the most unforgiving disk striping capabilities if you don’t.
    7. Snapshots, they at least have snapshots… right?  Well sort of, except it’s totally convoluted, and not something you’d probably ever want to implement for fear of wrecking your VM (which is what you were trying to avoid with the snap, right?).
    8. Ok, ok, well how about dynamically resizing your VM’s?  They can at least do that right?  Yes, sort of, so long as you’re sizing up within a specific VM class.  Otherwise TMK, you have to rebuild once you outgrow a given VM class.  For example, the “D series” can be scaled until you reach the maximum for the “D”.  You can’t easily convert it to a “G” series in a few clicks to continue growing it.
    9. Changes are quick and non-disruptive right?  LOL, sure, with any other hypervisor they might be, but this is the cloud (Azure) and from what I can see, it’s iffy whether your VM’s will need to be shut down, or even worse, if you do something that is supported hot, you may see longer than normal stuns.
    10. Ever need to troubleshoot something in the console?  Me too, a shame because Azure doesn’t let you access the console.
    11. Well at least they have a GUI for everything right?  Nope, I found I need to go drop into PS more often than not.  Want to resize that premium storage disk?  That’s gonna take a Powershell cmdlet.  That’s good though right, I mean you like wasting time finding the disk GUID, digging into a CLI, just to resize one disk, which BTW is a powered off operation, WIN!
    12. You like being in control of maintenance windows right?  Of course you do, but with cloud you don’t get a say.
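
And since I mentioned in-guest striping in number 6 above, here’s a rough sketch of what that looks like with Windows storage spaces once you have a handful of data disks attached to the VM.  The pool name, column count and labels are just examples, and it assumes a single (default) storage subsystem and four poolable disks:

#In-guest striping sketch: pool the attached data disks, carve out a simple (striped) virtual disk, then bring it online as a volume
#Plan the column count up front, it can't be changed later without rebuilding the virtual disk
$PoolDisks = Get-PhysicalDisk -CanPool $true

New-StoragePool -FriendlyName "DataPool" -StorageSubSystemFriendlyName (Get-StorageSubSystem).FriendlyName -PhysicalDisks $PoolDisks

New-VirtualDisk -StoragePoolFriendlyName "DataPool" -FriendlyName "StripedData" -ResiliencySettingName Simple -ProvisioningType Fixed -NumberOfColumns 4 -UseMaximumSize

Get-VirtualDisk -FriendlyName "StripedData" | Get-Disk | Initialize-Disk -PartitionStyle GPT -PassThru | New-Partition -UseMaximumSize -AssignDriveLetter | Format-Volume -FileSystem NTFS -NewFileSystemLabel "Data"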

I could keep going on, but honestly I think you get the point.  There are caveats in spades when switching to the cloud as a primary (or even DR) datacenter.  It’s not a simple case of paying more for features you don’t need; you lose flexibility / performance, and you pay more for it too.

Alright, but what about all those bad things they say about on prem, or things like TCO they’re trying to woo you to the cloud with?  Well, let’s dig into it a bit.

  1. Despite what “they” tell you, they’re likely out of touch.  Most of the cloud folks you’re dealing with, have been chewing their own dog food so long, they don’t have a clue about what exists in the on prem world, let alone dealing with your infrastructure and all its nuances.  They might convince you they’re infrastructure experts, but only THEIR infrastructure, not yours and certainly not on prem in general.  Believe me, most of them have been in their bubble for half a decade at least, and we all know how fast things change in technology, they’re new school in cloud, but a dinosaur in on prem.  Don’t misunderstand me, I’m not saying they’re not smart, I’m saying I doubt they have the on prem knowledge you do, and if you’re smart, you’ll educate yourself in cloud so you’re prepared to evaluate if IaaS really is a good fit for you and your employer.
  2. Going cloud is NOT like virtualization.  With virtualization you didn’t change the app, you didn’t lose control, and more importantly it actually saved you money and DID provide more flexibility, scalability and simplicity.  Cloud does not guarantee any of those for a traditional infrastructure.  Or rather, it may offer different benefits that are not as universally needed.
  3. They’ll tell you the TCO for cloud is better, and they MAY be right if you’re doing foolish things like the following:
    1. Leasing servers and swapping them every three years.  A total waste of money.  There are very few good reasons not to finance a server (capex) and re-purpose that server through a proper lifecycle.  Five years should be the minimum life cycle for a modern server.  You have DR, and other things you can use older HW for.
    2. You’re not maxing out the cores in your server to get the most out of your licensing costs, reduce network connectivity costs, and also reduce power, cooling and rack space.  An average dual socket 18 core server can run 150 average VM’s without breaking a sweat.
    3. Your threshold for a maxed out cluster is too low.  There’s nothing wrong with a 10:1 or even a 15:1 vCPU to pCPU ratio so long as your performance is ok.  Your mileage may vary, but be honest with yourself before buying more servers based on arbitrary numbers like these.
    4. You take advice from a greedy VAR.  Do yourself a favor and just hire a smart person that knows infrastructure.  They’ll be cheaper than all the money you waste on a VAR, or cloud.  You should be pushing for someone that is borderline architect, if not an architect.
      1. FYI, I’m not saying all VARs are greedy, but more are than not.  I can’t tell you how many interviews I’ve had where I go “yeah, you got upsold”.
    5. Stop with this BS of only buying “EMC” or “Cisco” or “Juniper” or whatever your arbitrary preferred vendor is. Choose the solution based on price, reliability, performance, scalability and simplicity, not by its name.  I picked Nimble when NetApp would have been an easy, but expensive choice.  Again, see point 4 about getting the right person on staff.
    6. Total datacenter costs (power, UPS, generator and cooling) are worth considering, but are often not as expensive as the providers would have you think.  If this is the only cost savings point they have you sold on, you should consider colocation first, which takes care of some of that, but also incurs some of the same costs / caveats that come with cloud (but not nearly as many).  Again, I personally think this is FUD, and in a lot of cases, IT departments, let alone businesses, don’t even see the bill for all of this.  Even things like DC space, if you’re using newer equipment, the rack density you get through virtualization is astounding.
    7. You’re not shopping your solution, ever.  I know folks that just love to go out to lunch (takes one to know one), and their VAR’s and vendors are happy to oblige.  If your management team isn’t pushing back on price, and lets you run around throwing PO’s like monopoly money, there’s a good chance you’re paying more for something than you need to.
    8. You suck at your job, or you’ve hired the wrong person.  Sounds a little harsh, but again, going back to point 4: if you have the right people on staff, you’ll get the right solutions, they’ll be implemented better, and they’ll get implemented quicker.  Cloud, by the way, only fixes certain aspects of this problem.
  4. They’ll tell you you can’t do it better than them, they scale better, and it would cost you millions to get to their level.  They’re right, they can build a 100,000 VM host datacenter better than you or I, and they can run it better.  But you don’t need that scale, and more importantly, they’re not passing those economies of scale on to you.  That’s their profit margin.  Remember, they’re not doing this to save you money, they’re doing this to make money.  In your case, if your DC is small enough (but not too small) you can probably do it MUCH cheaper than what you’d pay for in a cloud, and it will likely run much better.
  5. They’ll tell you you’ll be getting rid of a SysAdmin or two thanks to cloud.  Total BS… An average sysadmin (contrary to marketing slides) does not spend a ton of time with the mundane tasks of racking HW, patching hypervisors (unless its Microsoft :-)), etc.  They spend most of their time managing the OS layer, doing deployments, etc, which BTW all still need to be done in the cloud.

For now, that’s all I’ve got.  I wrote this because I was so tired of hearing folks spew pro cloud dogma from their mouths without even having a simplistic understanding of what it takes to run infrastructure in the cloud or on prem.  Maybe I am the cranky main frame guy, and maybe I’m the one who is delusional and wrong.  I’m not saying the cloud doesn’t have its place, and I’m not even saying that IaaS won’t be the home of my DC in ten years.  What I am saying is right now, at this point in time, I see moving to the cloud as a big expensive mistake if your goal is to simply replace your on prem DC.  If you’re truly being strategic with what you’re using IaaS for, and there are pain points that are difficult to solve on prem, then by all means go for it.  Just don’t tell me that IaaS is ready for the general masses, because IMO, it has a long ways to go yet.

Powershell Scripting: Get-ECSVMwareVirtualDiskToWindowsLogicalDiskMapping

Building off my last function, Get-ECSPhysicalDiskToLogicalDiskMapping, which took a Windows physical disk and mapped it to a Windows logical disk, this function will take a VMware virtual disk and map it to a Windows logical disk.

This function has the following dependencies and assumptions:

  • It depends on my Windows physical to logical function and all its dependencies.
  • It assumes your VM name matches your Windows server name.
  • It assumes you’ve pre-loaded the VMware Powershell snap-in.

The basic way the function works, is it starts by getting the Windows physical to logical mapping and storing that in an array.  This array houses two key pieces of information.

  • The physical disk serial number
  • The logical disk name (C:, D:, etc.)

Then we get a list of all of the VM’s disks, which of course has the same exact serial number, just formatted a little differently (which I convert in the function).

Finally, we’re going to map the Windows physical disk serial number to the VMware virtual disk serial number, add the VMware virtual disk name and the Windows logical disk name (we don’t care about the Windows physical disk, it was just used for the mapping) into the final array, and echo them out for your use.
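
If it helps to see that flow in code form, here’s a heavily trimmed down sketch.  To be clear, this is not the actual function (the one on GitHub is the real deal); it assumes PowerCLI is loaded, the VM name matches the Windows name, and that stripping the dashes and spaces out of the virtual disk’s backing UUID is the only formatting conversion needed:

#Sketch only: map VMware virtual disks to Windows logical disks by serial number
function Get-VMwareVirtualDiskToLogicalDiskSketch
{
    param([Parameter(Mandatory=$true)][string]$ComputerName)

    #Step 1: grab the Windows physical to logical mapping from the earlier function
    $WindowsDisks = Get-ECSPhysicalDiskToLogicalDiskMapping -ComputerName $ComputerName

    $Results = @()

    #Step 2: grab the VM's virtual disks and normalize each backing UUID so it looks like the Windows serial number (assumption: dashes and spaces are the only difference)
    foreach ($VirtualDisk in (Get-VM -Name $ComputerName | Get-HardDisk))
    {
        $NormalizedUuid = $VirtualDisk.ExtensionData.Backing.Uuid.Replace("-","").Replace(" ","")

        #Step 3: match on the serial number and keep only the two names we care about
        foreach ($WindowsDisk in $WindowsDisks)
        {
            if ($WindowsDisk.PhysicalDiskDiskSerialNumber -eq $NormalizedUuid)
            {
                $Results += New-Object PSObject -Property @{
                    VMwareVirtualDisk = $VirtualDisk.Name
                    WindowsLogicalDisk = $WindowsDisk.LogicalDiskLetter
                }
            }
        }
    }

    $Results
}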

See below for an example:

Get-ECSVMwareVirtualDiskToWindowsLogicalDiskMapping -ComputerName “YourComputerName”

VMwareVirtualDisk WindowsLogicalDisk
—————– ——————
Hard disk 1 C:
Hard disk 2 K:
Hard disk 3 F:
Hard disk 4 N:
Hard disk 5 J:
Hard disk 6 V:
Hard disk 7 W:
Hard disk 8 G:

… and there you have it, a very quick way to figure out which VMware virtual disk houses your Windows drive.  You’ll find the most recent version of the function here.

Powershell Scripting: Get-ECSPhysicalDiskToLogicalDiskMapping

I figured it was about time to knock out something a little technical for a change, and I figured I’d start with this little function, which is part of a larger script that I’ll talk more about later.

There may be a time where you need to find the relationship between a physical disk drive and your logical drive.  In my case, I had a colleague ask me if there was an easy way to match a VMware disk to a Windows disk so he could extend the proper drive.  After digging into it a bit, I determined it was possible, but it was going to take a little work.  One of the prerequisites is to first find which drive letter belongs to which physical disk (from Windows’ view).

For this post, I’m going to go over the prerequisite function I built to figure out this portion.  In a later post(s) we’ll put the whole thing together.  Rather than burying this into a larger overall script (which is the way it started), I broke it out and modularized it so that you may be able to use it for other purposes.

First, to get the latest version of the function, head on over to my GitHub project located here.  I’m going to be using GitHub for all my scripts, so if you want to stay up to date with anything I’m writing, that would be a good site to bookmark.  I’m very new to Git, so bear with me as I learn.

To start, as you can tell by looking at the function, it’s pretty simple by and large.  It’s 100% using WMI.  A lot of this functionality exists natively in Windows 2012+, but I wanted something that would work with 2003+.  Now that you know it’s based on WMI, there are two important notes to bear in mind:

  1. WMI does not require admin rights for local computers, but it does for remote.  Keep this in mind if you’re planning to use this function for remote calls.
  2. WMI also requires that you have the correct FW ports open for remote access.  Again, I’m not going to dig into that.  I’d suspect if you’re doing monitoring, or any kind of bulk administration, you probably already have those ports open.
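
If you want to sanity check both of those notes before pointing the function at a remote box, a quick WMI call is enough to prove out permissions and firewall access.  The computer name below is obviously just an example:

#Quick remote WMI sanity check for permissions / firewall before using the function remotely
try
{
    Get-WmiObject -Class Win32_OperatingSystem -ComputerName "pc-2158" -ErrorAction Stop | Out-Null
    Write-Output "Remote WMI looks good, the function should work."
}
catch
{
    Write-Output "Remote WMI failed: $($_.Exception.Message)"
}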

Microsoft basically has the mapping in place within WMI, the only problem is you need to connect a few different layers to get from Physical to Logical mapping.  In reading my function, you’ll see that I’m doing a number of nested foreach loops, and that’s the way I’m connecting things together.  Basically it goes like this….

  1. First we need to find the physical disk to partition mappings doing the following:  Win32_DiskDrive property DeviceID is connected to Win32_DiskDriveToDiskPartition property Antecedent.  ***NOTE: the DeviceID needed to be formatted so that it matched the Antecedent by adding extra backslashes “\”.
  2. Now we need to map the partition(s) on the physical disk to the logical disk(s) that exist doing the following: Win32_DiskDriveToDiskPartition property Dependent is connected to Win32_LogicalDiskToPartition property Antecedent.
  3. Now that we know what logical drives exist on the physical disk, we can use the following to grab all the info we want about the logical drive: Win32_LogicalDiskToPartition property Dependent maps to Win32_LogicalDisk property “__PATH” (see the sketch just below this list).
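
If that reads like word soup, here’s a boiled down sketch of just that chaining.  It is not the full function (the real one on GitHub grabs a lot more properties), it just shows how the three classes hook together:

#Sketch only: chain Win32_DiskDrive -> Win32_DiskDriveToDiskPartition -> Win32_LogicalDiskToPartition -> Win32_LogicalDisk
function Get-PhysicalToLogicalDiskSketch
{
    param([string]$ComputerName = $env:COMPUTERNAME)

    $Results = @()
    $PhysicalDisks = Get-WmiObject -Class Win32_DiskDrive -ComputerName $ComputerName

    foreach ($PhysicalDisk in $PhysicalDisks)
    {
        #DeviceID comes back as \\.\PHYSICALDRIVE0, but the Antecedent reference escapes the backslashes, so double them up before matching
        $EscapedDeviceID = $PhysicalDisk.DeviceID.Replace("\","\\")
        $PartitionLinks = Get-WmiObject -Class Win32_DiskDriveToDiskPartition -ComputerName $ComputerName | Where-Object {$_.Antecedent -like "*$EscapedDeviceID*"}

        foreach ($PartitionLink in $PartitionLinks)
        {
            #The partition (Dependent) is the Antecedent of the logical disk association
            $LogicalLinks = Get-WmiObject -Class Win32_LogicalDiskToPartition -ComputerName $ComputerName | Where-Object {$_.Antecedent -eq $PartitionLink.Dependent}

            foreach ($LogicalLink in $LogicalLinks)
            {
                #Resolve the logical disk by matching its __PATH to the association's Dependent
                $LogicalDisk = Get-WmiObject -Class Win32_LogicalDisk -ComputerName $ComputerName | Where-Object {$_.__PATH -eq $LogicalLink.Dependent}

                #Note: Win32_DiskDrive's SerialNumber isn't populated on every OS version
                $Results += New-Object PSObject -Property @{
                    ComputerName = $ComputerName
                    PhysicalDiskNumber = $PhysicalDisk.Index
                    PhysicalDiskDiskSerialNumber = $PhysicalDisk.SerialNumber
                    LogicalDiskLetter = $LogicalDisk.DeviceID
                }
            }
        }
    }

    $Results
}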

That’s the basic method for connecting a physical disk to a logical disk.  You can see that I use an array to store results, and I’ve picked a number of properties from the physical disk and the logical disk that I needed.  You could easily add other properties to serve your needs.  And if you do… please contribute to the function.
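If you’d rather see those three hops in miniature before opening the full function on GitHub, below is a stripped-down sketch of the logic described above.  To be clear, this is not the actual function, just an illustration of the WMI chaining; it only handles the local computer, and the output properties are a subset of what the real function returns.

# Illustrative sketch of the physical-to-logical disk mapping via WMI (not the full Get-ECSPhysicalDiskToLogicalDiskMapping)
$PhysicalDisks      = Get-WmiObject -Class Win32_DiskDrive
$DiskToPartition    = Get-WmiObject -Class Win32_DiskDriveToDiskPartition
$PartitionToLogical = Get-WmiObject -Class Win32_LogicalDiskToPartition
$LogicalDisks       = Get-WmiObject -Class Win32_LogicalDisk

foreach ($PhysicalDisk in $PhysicalDisks)
{
    # Step 1: DeviceID looks like \\.\PHYSICALDRIVE0; double the backslashes so it matches the Antecedent string
    $EscapedDeviceID = $PhysicalDisk.DeviceID -replace '\\', '\\'

    foreach ($DiskPartitionLink in ($DiskToPartition | Where-Object {$_.Antecedent -like "*$EscapedDeviceID*"}))
    {
        # Step 2: the link's Dependent (a Win32_DiskPartition reference) matches Win32_LogicalDiskToPartition's Antecedent
        foreach ($PartitionLogicalLink in ($PartitionToLogical | Where-Object {$_.Antecedent -eq $DiskPartitionLink.Dependent}))
        {
            # Step 3: that link's Dependent (a Win32_LogicalDisk reference) matches the logical disk's __PATH
            $LogicalDisk = $LogicalDisks | Where-Object {$_.__PATH -eq $PartitionLogicalLink.Dependent}

            [PSCustomObject]@{
                ComputerName         = $PhysicalDisk.SystemName
                PhysicalDiskNumber   = $PhysicalDisk.Index
                PhysicalDiskModel    = $PhysicalDisk.Model
                LogicalDiskLetter    = $LogicalDisk.DeviceID
                LogicalDiskSize      = $LogicalDisk.Size
                LogicalDiskFreeSpace = $LogicalDisk.FreeSpace
            }
        }
    }
}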

As for how to use the function, it’s simple.

For the local computer, simply run the function with no parameters:

Get-ECSPhysicalDiskToLogicalDiskMapping
LogicalDiskSize : 1000198832128
PhysicalDiskController : 0
ComputerName : PC-2158
PhysicalDiskControllerPort : 1
LogicalDiskFreeSpace : 946930122752
PhysicalDiskNumber : 0
PhysicalDiskSize : 1000194048000
LogicalDiskLetter : E:
PhysicalDiskModel : Intel Raid 1 Volume
PhysicalDiskDiskSerialNumber : ARRAY

LogicalDiskSize : 255533772800
PhysicalDiskController : 0
ComputerName : PC-2158
PhysicalDiskControllerPort : 0
LogicalDiskFreeSpace : 185611300864
PhysicalDiskNumber : 1
PhysicalDiskSize : 256052966400
LogicalDiskLetter : C:
PhysicalDiskModel : Samsung SSD 840 PRO Series
PhysicalDiskDiskSerialNumber : S12RNEACC99205W

For a remote system, simply specify the “-ComputerName” parameter:

Get-ECSPhysicalDiskToLogicalDiskMapping -ComputerName “pc-2158”
LogicalDiskSize : 1000198832128
PhysicalDiskController : 0
ComputerName : PC-2158
PhysicalDiskControllerPort : 1
LogicalDiskFreeSpace : 946930122752
PhysicalDiskNumber : 0
PhysicalDiskSize : 1000194048000
LogicalDiskLetter : E:
PhysicalDiskModel : Intel Raid 1 Volume
PhysicalDiskDiskSerialNumber : ARRAY

LogicalDiskSize : 255533772800
PhysicalDiskController : 0
ComputerName : PC-2158
PhysicalDiskControllerPort : 0
LogicalDiskFreeSpace : 185611976704
PhysicalDiskNumber : 1
PhysicalDiskSize : 256052966400
LogicalDiskLetter : C:
PhysicalDiskModel : Samsung SSD 840 PRO Series
PhysicalDiskDiskSerialNumber : S12RNEACC99205W

Hope that helps you down the road.  Again, this is going to be part of a slightly larger script that will ultimately map a Windows Logical Disk to a VMware Virtual Disk to make finding which disk to expand easier.

Review: 2.5 years with Nimble Storage

Disclaimer: I’m not getting paid for this review, nor have I been asked to do this by anyone.  These views are my own, not my employer’s, and they’re opinions, not facts.

Intro:

To begin with, as you can tell, I’ve been running Nimble Storage for a few years at this point, and I felt like it was time to provide a review of both the good and bad.  When I was looking at storage a few years ago, it was hard to find reviews of vendors; the ones I did find were very short, uninformative, clearly paid for, or posts by obvious fanboys.

Ultimately Nimble won us over against the various storage lines listed below.  It’s not a super huge list, as there was only so much time and budget that I had to work with.  There were other vendors I was interested in, but the cost would have been prohibitive, or the solution would have been too complex.  At the time, Tintri and Tegile never showed up in my search results, but ultimately Tintri wouldn’t have worked (and still doesn’t) and Tegile is just not something I’m super impressed with.

  • NetApp
  • X-IO
  • Equallogic
  • Compellent
  • Nutanix

After a lot of discussions and research, it basically boiled down to NetApp vs. Nimble Storage, with Nimble obviously winning us over.  While I made the recommendation with a high degree of trepidation, and even after a month with the storage wondered if I had totally made an expensive mistake, I’m happy to say it was, and still is, a great storage decision.  I’m not going into why I chose Nimble over NetApp, perhaps some other time; for now this post is about Nimble, so let’s dig into it.

When I’m thinking about storage, the following are the high level areas that I’m concerned about.  This is going to be the basic outline of the review.

  • Performance / Capacity ratios
  • Ease of use
  • Reliability
  • Customer support
  • Scaling
  • Value
  • Design
  • Continued innovation

Finally, for your reference, we’re running five of their CS460s, which sit between their CS300 and CS500 platforms, and these are hybrid arrays.

Performance / Capacity Ratios

Good performance, like a lot of things, is in the eye of the beholder.  When I think of what defines storage as being fast, it’s IOPS, throughput and latency.  Depending on your workload, one of those may matter more to you than the others, or maybe you just need something that does OK across all of them without being awesome in any one area.  To me, Nimble falls into the general purpose array category; it doesn’t do any one thing great, but it does a lot of things very well.

Below you’ll find a breakdown of our workloads and capacity consumers.

IO breakdown (estimates):

  • MS SQL (50% of our total IO)
    • 75% OLTP
    • 25% OLAP
  • MS Exchange (30% of total IO)
  • Generic servers (15% of total IO)
  • VDI (5% of total IO)

Capacity consuming apps:

  • SQL (40TB after compression)
  • File server (35TB after compression)
  • Generic VM’s (16TB after compression)
  • Exchange (8TB after compression)

Compression?  yeah, Nimble’s got compression…

Nimble’s probably telling you that compression is better than dedupe; they even have all kinds of great marketing literature to back it up.  The reality, like anything, is that it all depends.  I will start by saying that if you need a general purpose array, and can only get one or the other, there’s only one case where I would choose dedupe over compression: data sets mostly consisting of operating system and application installer data.  The biggest example of that would be VDI, but basically wherever your data mostly consists of the same data over and over.  Dedupe will always reduce better than compression in these cases.  Everything else, you’re likely better off with compression.  At this point, compression is pretty much a commodity, but if you’re still not a believer, below you can see my numbers.  Basically, Nimble (and everyone else using compression) delivers on what they promise.

  • SQL: Compresses very well, right now I’m averaging 3x.  That said, there is a TON of white space in some of my SQL volumes.  The reality is, I normally get a minimum of 1.5x and usually end up more along the 2x range.
  • Exchange 2007: Well this isn’t quite as impressive, but anything is better than nothing; 1.3x is about what we’re looking at.  Still not bad…
  • Generic VM’s: We’re getting about 1.6x, so again, pretty darn good.
  • Windows File Servers: For us it’s not entirely fair to just use the general average, as we have a TON of media files that are pre-compressed.  What I’ll say is our generic user / department file server gets about 1.6x – 1.8x reduction.

Show me the performance…

Ok, so great, we can store a lot of data, but how fast can we access it?  In general, pretty darn fast…

The first thing I did when we got the arrays was fire up IOMeter and try trashing the array with a 100% random read 8k IO profile (500GB file), and you know what, the array sucked.  I mean I was getting something like 1,200 IOPS with really high latency, and I was utterly disappointed almost instantly.  In hindsight, that test was unrealistic and unfair to some extent.  Nimble’s caching algorithm is based on random in, random out, and IOMeter was sequential in (ignored for caching) and then attempting random out.  For me, what was more bothersome at the time, and still is to some degree, is that it took FOREVER before the cache hit ratio got high enough that I was starting to get killer performance.  It’s actually pretty simple to figure out how long it would take a cold dataset like that to completely heat up: 524,288,000 KB (500GB converted to KB) divided by 9,600 KB/s (8k * 1,200 IOPS, the approximate throughput) comes out to roughly 15 hours.
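If you want to sanity check that math yourself, here it is as a quick PowerShell scratchpad.  The 8k IO size and 1,200 IOPS figure are just the numbers from my synthetic test, so treat it as a back-of-the-envelope estimate:

$WorkingSetKB   = 500GB / 1KB    # 524,288,000 KB of cold, uncompressed data
$ThroughputKBps = 8 * 1200       # 8KB IOs at ~1,200 IOPS is roughly 9,600 KB/s
$WorkingSetKB / $ThroughputKBps / 3600    # comes out to roughly 15 hours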

So you’re probably thinking all kinds of doom and gloom, and wondering how I could recommend Nimble with such a long theoretical warm up time?  Well, let’s dig into why:

  • That’s a synthetic test and a worst case test.  That’s 500GBs of 100% random, non-compressed data.  If that data was compressed for example to 250GB, it would “only” take 7.5 hours to copy into cache.
  • On average only 10% – 20% of your total dataset is actually hot.  If that file was compressed to 250GB, worst case you’re probably looking at 50GB that’s hot, and more realistically 25GB.
  • That was data that was written 100% sequentially and then read back 100% randomly.  It’s not a normal data pattern.
  • That time is how long it takes for 100% of the data to get a 100% cache hit.  The reality is, it’s not too long before you’re starting to get cache hits and that 1,200 IOPS starts looking a lot higher (depending on your model).

There are a few example cases where that IO pattern is realistic:

  • TempDB: When we were looking at Fusion-io cards, the primary workload that folks used them for in SQL was TempDB.  TempDB can be such a varied workload that it’s really tough to tune for, unless you know your app.  Having sequential in, random out in TempDB is a very realistic scenario.
  • Storage Migrations:  Whether you use Hyper-V or VMware, when you migrate storage, that storage is going to be cold all over again with Nimble.  Storage migrations tend to be sequential write.
  • Restoring backup data:  Most restores tend to be sequential in nature.  With SQL, if you’re restoring a DB, that DB is going to be cold.

If you recall, I highlighted that my IOMeter test was unrealistic except in a few circumstances, and one of those realistic circumstances can be TempDB, and even that’s a big “it depends”.  But what if you did have such a circumstance?  Well, any good array should have some knobs to turn, and Nimble is no different.  Nimble now has two ways to solve this:

  • Cache Pinning: This feature was released in NOS 2.3; basically, volumes that are pinned run entirely out of flash.  You’ll never have a cache miss.
  • Aggressive caching: Nimble has had this from day one, and it was reserved for cases like this.  Basically when this is turned on (volume or performance policy granularity, to my knowledge), Nimble caches any IO coming in or going out.  While it doesn’t guarantee 100% cache hit ratios, in the case of TempDB it’s highly likely the data will have a very high cache hit ratio.

Performance woes:

That said, Nimble suffers the same issue that any hybrid array does, which is that a cache miss will make it fall on its face, and that’s further amplified in Nimble’s case by having a weak disk subsystem IMO.  If you’re not seeing at least a 90% cache hit ratio, you’re going to start noticing pretty high latency.  While their SW can do a lot to defy physics, random reads from disk are one area they can’t cheat.  When they re-assure you that you’ll be just fine with 12 7k drives, they’re mostly right, but make sure you don’t skimp on your cache.  When they size your array, they’ll likely suggest anywhere between 10% and 20% of your total data set size.  Go with 20% of your data set size or higher; you’ll thank me.  Also, if you plan to do pinning or anything like that, account for that on top of the 20%.  When in doubt, add cache.  Yes it’s more expensive, but it’s also still cheaper than buying NetApp, EMC, or any other overpriced dinosaur of an array.

The only other area where I don’t see screaming performance is situations where 50% sequential read + 50% sequential write is going on.  Think of something like copying a table from one DB to another.  I’m not saying it’s slow, in fact it’s probably faster than most, but it’s not going to hit the numbers you see when it’s closer to 100% in either direction.  Again, I suspect part of this has to do with the NL-SAS drives and only having 12 of them.  Even with coalesced writes, they still have to commit at some point, which means you have to stop reading data for that to happen, and since sequential data comes off disk by design, you end up with disk contention.

Performance, the numbers…

I touched on it above, but I’ll basically summarize what Nimble’s IO performance specs look like in my shop.  Again, remember I’m running their slightly older CS460s; if these were CS500s or CS700s, all these numbers (except cache misses) would be much higher.

  • Random Read:
    • Cache hit: Smoking fast (60k IOPS)
    • Cache miss: dog slow (1.2k IOPS)
  • Random Write: fast (36k IOPS)
  • Sequential
    • 100% read: smoking fast (2GBps)
    • 100% write: fast (800MBps – 1GBps)
    • 50%/50%: not bad, not great (500MBps)

Again, these are rough numbers; I’ve seen higher numbers in all the categories, and I’ve seen lower, but these are very realistic numbers that I see.

Ease of use:

Honestly the simplest SAN I’ve ever used, or at least mostly.  Carving up volumes and setting up snapshots and replication has all been super easy and intuitive.  While Nimble provided training, I would contend it’s easy enough that you likely don’t need it.  I’d even go so far as saying you’ll probably think you’re missing something.

Growing the HW has been simple as well.  Adding a data shelf or cache shelf has been as simple as a few cables and clicking “activate” in the GUI.

Why do I say mostly?  Well, if you care about not wasting cache and optimizing performance, you do need to adapt your environment a bit.  Things like transaction logs vs. DB, SQL vs. Exchange: they all should have separate volume types.  Depending on your SAN, this is either commonplace or completely new.  I came from an Equallogic shop, where all you did was carve up volumes.  With Nimble you can do that too, but you’re not maximizing your investment, nor would you be maximizing your performance.

Troubleshooting performance can take a bit of storage knowledge in general (can’t fault Nimble for that per se) and also a good understanding of Nimble itself.  That being said, I don’t think they do as good of a job as they could in presenting performance data in a way that would make it easier to pin down the problem.  From the time I purchased Nimble till now, everything I’ve been requesting is being siloed into this tool they call “Infosite”, and the important data that you need to troubleshoot performance is in many ways still kept under lock and key by them, or is buried in a CLI.  Yeah, you can see IOPS, latency, throughput and cache hits, but you need to do a lot of correlations.  For example, they have a line graph showing total read / write IOPS, but they don’t tell you in the line graph whether it was random or sequential.  So when you see high latency, you now need to correlate that with the cache hits and throughput to make a guess as to whether the latency was due to a cache miss, or a high queue depth sequential workload.  Add to that, you get no view of the CPU, average IO size, or other things that are helpful for troubleshooting performance.  Finally, they roll up the performance data so fast that if you’re out to lunch when there was a performance problem, it’s hard to find, because the data is averaged way too quickly.

Reliability:

Besides disk failures (commonplace), we’ve had two controller failures.  Probably not super normal, but nonetheless, not a big deal.  Nimble failed over seamlessly, and replacing them was super simple.

Customer Support:

I find that their claim of having engineers staffing support to be mostly true.  By and large, their support is responsive, very knowledgeable and if they don’t know the answer, they find it out.  It’s not always perfect, but certainly better than other vendors I’ve worked with.

Scaling:

I think Nimble scales fantastically so long as you have the budget.  At first, when they didn’t have data or cache shelves, I would have said they had some limits, but nowadays, with their ability to scale in any direction, it’s hard to argue that they can’t adapt to your needs.

That said, there is one area where I’m personally very disappointed in their scaling, which is going from an older generation of controllers to a newer one.  In our case, running the CS460s requires a disruptive upgrade to go to the CS500s or CS700s.  They’ll tell me it’s non-disruptive if I move my volumes to a new pool, but that first assumes I have SAN groups, and second assumes I have the performance and capacity to do that.  So I would say this is mostly true, but not always.

Value / Design:

The hard parts of Nimble…

If we just take them at face value, and compare them based on performance and capacity to their competitors, they’re a great value.  If you open up the black box, though, and start really looking at the HW you’re getting, you start to realize Nimble’s margins are made up in their HW.  A few examples…

  • Using Intel S3500s (or comparable) with SAS interposers instead of something like a STEC or HGST SAS based SSD.
  • Supermicro HW instead of something rebranded from Dell or HP.  The build quality of Supermicro just doesn’t compare to the others.  Again, I’ve had two controller failures in 2 years.
  • Crappy rail system.  I know it’s kind of petty, but honestly they have some of the worst rails I’ve seen, next to maybe Dell’s EQL 6550 series.  Tool-less kits have been a thing for many years now; it would be nice to see Nimble work on this.
  • Lack of cable management, seriously, they have nothing…

Other things that bug me about their HW design…

It’s tough to understand how to power off / on certain controllers without looking in the manual.  Again, not something you’re going to be doing a lot, but still, it could be better.  Their indicator lights are also slightly misleading, with a continually blinking amber-orange light on their chassis.  The color initially suggests that perhaps an issue is occurring.

While I like the convenience of the twin controller chassis, and understand why they, and many other vendors, use it, I’d really like to see a full sized dual 2U rack mount server chassis.  Not because I like wasting space, but because I suspect it would actually allow them to build a faster array.  It’s only slightly more work to unrack a full sized server, and the reality is I’d trade that any day for better performance and scalability (more IO slots).

I would also like to see a more space conscious JBOD.  Given that they oversubscribe the SAS backplane anyway, they might as well do it while saving space.  Unlike my controller argument, where more space would equal more performance, they’re offering a configuration that chews up more space with no other value add, except maybe having front facing HDD’s.  I have 60 bay JBODs for backup that fit in 4U.  I would love to see that option for Nimble; that would be 4 times the amount of storage in about the same amount of space.

Its time to talk about the softer side of Nimble….

The web console, to be blunt, is a POS.  It’s slow, buggy, unstable, and really, I hate using it.  To be fair, I’m biased against web consoles in general, but if they’re done well, I can live with them.  Is it usable?  Sure, but I certainly don’t like living in it.  If I had a magic wand, I would actually do away with the web console on the SAN itself and instead produce two things:

  • A C# client that mimics the architecture of VMware.  VMware honestly had the best management architecture I’ve seen (until they shoved the web console down my throat).  There really is no need for a web site running on the SAN.  The SAN should be locked down to CLI only, with the only web traffic being API calls.  Give me a C# client that I can install on my desktop, and that can connect directly to the SAN or to my next idea below.  I suspect that Nimble could ultimately display a lot more useful information if this was the case, and things would work much faster.
  • Give me a central console (like vCenter) to centrally manage my arrays.  I get that you want us to use Infosite, and while it’s gotten better, it’s still not good enough.  I’m not saying do away with Infosite, but let me have a central, local, fast solution for my arrays.  Heck, if you still want to do a web console option, this would be the perfect place to run it.

The other area I’m not a fan of right now is their intelligent MPIO.  I mean, I like it, but I find it’s too restrictive.  Being enabled on the entire array or not at all is just too extreme.  I’d much rather see it at the volume level.

Finally, while I love the Windows connection manager, it still needs a lot of work.

  • NCM should be forwards and backwards compatible, at least to some reasonable degree.  Right now it’s expected to match the SAN’s firmware version, and that’s not realistic.
  • NCM should be able to kick off on demand snaps (in guest) and offer a snapshot browser (meaning show me all snaps of the volume).
  • If Nimble truly wants to say they can replace my backup with their snapshots, then make accessing the data off them easier.  For example, if I have a snap of a DB, I should be able to right click that DB and say “mount a snapshot copy of this DB, with this name”, and then Nimble goes off and runs some sort of workflow to make that happen.  Or just let us browse the snaps’ data almost like a UNC share.

The backup replacement myth…

Nimble will tell you in some cases that they have a combined backup and primary storage solution.  IMO, that’s a load of crap.  Just because you take a snapshot doesn’t mean you’ve backed up the data.  Even if you replicate that data, it still doesn’t count as a backup.  To me, Nimble can say they’ve solved the backup dilemma with their solution when they can do the following:

  • Replicate your data to more than one location
  • Replicate your data to tape every day and send it offsite.
  • Provide an easy straight forward way to restore data out of the snapshots.
  • Truncate transaction logs after a successful backup.
  • Provide a way of replicating the data to non-Nimble solution, so the data can be restored anywhere.  Or provide something like a “Nimble backup / recovery in the cloud” product.

Continued Innovation:

I find Nimble’s innovation to be on the slow side, but steady, which is a good thing.  I’d much rather have a vendor be slow to release something because they’re working on perfecting it.  In the time I’ve been a customer, they’ve released the following features post purchase:

  • Scale out
  • Scale deep
  • External flash expansion
  • Cache Pinning
  • Virtual Machine IOPS break down per volume
  • Intelligent MPIO
  • QOS
  • RestAPI
  • RBAC
  • Refreshed generation of SANs (faster)
  • Larger and larger cache and disk shelves

It’s not a huge list, but I also know what they’re currently working on, and all I can say is, yeah, they’re pretty darn innovative.

Conclusion and final thoughts:

Nimble is honestly my favorite general purpose array right now.  Coming from Equallogic, and having looked at much bigger / badder arrays, I honestly find them to be the best bang for the buck out there.  They’re not without faults, but I don’t know of an array out there that’s perfect.  If you’re worried they’re not “mature enough”, I’ll tell you, you have nothing to fear.

That said, it’s almost 2016, and with flash prices being where they are now, I personally don’t see a very long life for hybrid arrays going forward, at least not as high performance mid-size to enterprise storage arrays.  Flash is getting so cheap that it’s practically not worth the savings you get from a hybrid compared to the guaranteed performance you get from an all flash array.  Hybrids were really filling a niche until all flash became more attainable, and that attainable day is here IMO.  Nimble thankfully has announced that an AFA is in the works, and I think that’s a wise move on their part.  If you have the time, I would honestly wait out your next SAN purchase until their AFA’s are out; I suspect they’ll be worth the wait.

Backup Storage Part 5: Realization of a failure

No one likes admitting they’re wrong, and I’m certainly no different.  Being a mature person means being able to admit you’re wrong, even if it means doing it publicly, and that is what I’m about to do.

I’ve been writing this series slowly over the past few months, and during that time, I’ve noticed an increasing number of instances where the NTFS file system on my Storage Spaces virtual disks would go corrupt.  Basically, I’d see Veeam errors writing to our repository, and when investigating, I would find files not deleting (old VBK’s).  When trying to manually delete them, they would either throw some error, or they would act like they were deleted (they’d disappear), but then return only a second later.  The only way to fix this (temporarily) was to do a check disk, which requires taking the disk offline.  When you have a number of backup jobs going at any time, this means something is going to crash, and it was my luck that it was always in the middle of a 4TB+ VM.
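For the curious, the temporary fix looked something like the lines below (the drive letter is a placeholder).  It’s the offline requirement that made this so painful, because dismounting the repository volume kills whatever backup jobs are writing to it:

# Offline scan and repair (Windows Server 2012+); dismounts the volume while it runs
Repair-Volume -DriveLetter V -OfflineScanAndFix

# The old school equivalent, which will also ask to dismount the volume first
chkdsk V: /f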

Basically, what I’m saying is that, as of this date, I can no longer recommend NTFS running on Storage Spaces, at least not on bare metal HW.  My best guess is we were suffering from bit rot, but who knows, since Storage Spaces / NTFS can’t tell me otherwise, or at least I don’t know how to figure it out.

All that said, I suspect I wouldn’t have run into these issues had I been running ReFS.  ReFS has online scrubbing, and it’s looking for things like failed CRC checks (and auto repairs them).  At this point, I’m burnt out on running Storage Spaces, so I’m not going to even attempt to try ReFS.  Enough v1 product evals in prod for me :-).

Fortunately I knew this might not have worked out, so my back out plan is to take the same disks / JBODs and attach them to a few RAID cards.  Not exactly thrilled about it, but hopefully it will bring a bit more consistency / reliability back to my backup environment.  Long term I’m looking at getting a SAN implemented for this, but that’s for a later time.

It’s a shame, as I really had high hopes for Storage Spaces, but like many MS products, I should have known better than to go with their v1 release.  At least it was only backups and not prod…

Update (09/13/2016):

I wanted to add a bit more information.  At this point it’s theory, but whether or not this article is dissuading you from doing Storage Spaces, it’s worth noting some additional information.

We had two NTFS volumes, each 100TB in size: one for Veeam and one for our SQL backup data.  We never had problems with the SQL backup volume (probably luck), but the Veeam volume certainly had issues.  Anyway, after tearing it all down, I was still bugged about the issue, and felt really disappointed about the whole thing.  In some random Google search, I stumbled across this link going over some of NTFS’s practical maximums.  In theory at least, we went over the tested (recommended) max volume size.  Again, I’m not one to hide things and I fess up when I screw up.  Some of the Storage Spaces issues may have been related to us exceeding the recommended size, such that NTFS couldn’t proactively fix things in the background.  I don’t know for sure, and I really don’t have the appetite to try it again.  I know it sounds crazy to have a 100TB volume, but we had 80TB of data stored in there.  In other words, most smaller companies won’t hit that size limit, but we had no problem at all exceeding it.  If you’re wondering why we made such a large volume, it really boiled down to wanting to maximize contiguous space while not wasting space.  Storage Spaces doesn’t let you thin provision storage when it’s clustered, so if we had, for example, created five 20TB LUNs instead, the contiguous space would have been much smaller and ultimately more difficult to manage with Veeam.  We don’t have that issue anymore with CommVault, as it can deal with lots of smaller volumes with ease.
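Just to make that trade-off concrete: had we gone the other way, the five-smaller-volumes layout is only a few lines of the standard storage cmdlets on a bare metal box.  This is purely a hypothetical sketch (the pool name, resiliency setting, sizes and labels are made up for illustration), not what we actually ran:

# Hypothetical carve-up: five 20TB fixed (thick) virtual disks from an existing pool instead of one 100TB volume
foreach ($i in 1..5)
{
    New-VirtualDisk -StoragePoolFriendlyName "BackupPool" -FriendlyName "VeeamRepo$i" `
        -Size 20TB -ProvisioningType Fixed -ResiliencySettingName Parity | Out-Null

    Get-VirtualDisk -FriendlyName "VeeamRepo$i" | Get-Disk |
        Initialize-Disk -PartitionStyle GPT -PassThru |
        New-Partition -AssignDriveLetter -UseMaximumSize |
        Format-Volume -FileSystem NTFS -AllocationUnitSize 64KB -NewFileSystemLabel "VeeamRepo$i"
}

The obvious downside is exactly the one I mentioned above: without thin provisioning, every one of those volumes becomes its own hard-capacity island, which is why we went big in the first place.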

Anyway, while I would love to say MS shouldn’t let you format a volume larger than what they’ve tested (and they shouldn’t without at least a warning), ultimately the blame falls on me for not digging into this a bit more.  Then again, try as I may, I’ve been unable to validate the information posted on the linked blog above.  I don’t doubt the accuracy of the information; often I find fellow bloggers do a better job of explaining how to do something or conveying real world limits than the vendor.

Best of luck to you if you do go forward with Storage Spaces.  If you have questions, let me know; I worked with it in production for over a year, at a decent scale.