Backup Storage Part 1: Why it needed to change


This is going to be a multi-part series where I walk you through the whole process we took in evaluating, implementing, and living with a new backup storage solution.  While it's not a perfect solution, given the parameters we had to work within, I think we ended up with some very decent storage.


Last year my team and I began a project to overhaul our backup solution, and part of that solution involved researching some new storage options.  At the time, we ran what most folks on a budget would run, which is simply a server with some local DAS.  It was a Dell 2900 (IIRC) with 6 MD1000's and 1 MD1200.  The solution was originally designed to manage about 10TB of data, and really was never expected to handle what ultimately ended up being much greater than that.  The solution was less than ideal for a lot of reasons that I'm sharing below.

  • It wasn't just a storage unit; it also ran the backup server (CommVault) on top of it, and our tape drives were locally attached as well.  Basically, a single box to handle a lot of data and ALL aspects of managing it.
  • The whole solution was a single point of failure, and many of its sub-components were single points of failure.
  • The solution was 7 years old and had grown organically, one JBOD at a time.  This had a few pitfalls:
    • JBODs were daisy-chained off each other, which meant that while we added more capacity and spindles, throughput was ultimately limited to 12Gbps for each chain (SAS x4 at 3Gbps per lane).  We only had two chains for the MD1000's and one chain / JBOD for the MD1200.
    • The JBODs were carved up into independent LUNs, which from CommVault's view was fine (awesome SW), but it left potential IOPS on the table.  So as we added JBODs, the IOPS didn't increase linearly, per se.  Sure, the "aggregate" IOPS increased, but a single job was limited to the speed of a 15-disk RAID 6, instead of the potential of, say, a 60-disk RAID 60.
  • The disks at the time were fast (for SATA drives, that is), but compared to modern NL-SAS drives they had much lower throughput and density.
  • The PCI bus and the FSB (this server still had an FSB) were overwhelmed.  Remember, this was doing tape AND disk copies.  I know a lot of less seasoned folks think it's hard to overwhelm a PCI bus, but that's not actually true (more on that later), even more so when your PCI bus is version 1.x.
  • This solution consumed a TON of rack space: each JBOD was 3U and we had 7 of them (the MD1200 was 2U).  And each drive was only 1TB, so best case with a 15-disk RAID 6, we were looking at 13TB usable per shelf.  By today's standards, this TB per RU is terrible even for tier 1 storage, let alone backup.
  • We were using RAID 6 instead of RAID 10.  I know what some of you are thinking: it's backup, so why would you use RAID 10?  Backups are probably every bit as disk intensive as your production workloads, and likely more so.  On top of that, we force them to do all their work in what's typically a very constrained window of time.  RAID 6, while great for sequential / random reads, does horribly at writes in comparison to RAID 10 (I'm excluding fancy file systems like CASL from this generalization).  Unless you're running one backup at a time, you're likely throwing tons of parallel writes at this storage.  And while each stream may be sequential in nature, the aggregation of them looks random to the storage.  At the end of the day, disk is cheap, and cheap disk (NL-SAS) is even cheaper, so splurge on RAID 10.
    • This was also compounded by the point I made above about one LUN per JBOD.
  • It was locked into what is, IMO, a less than ideal Dell (LSI) RAID card.  Again, I know what some of you are thinking: "HW" RAID is SO much better than SW RAID.  It's a common and pervasive myth.  EMC, NetApp, HP, IBM, etc. are all simply x86 servers with really fancy SW RAID.  SW RAID is fine, so long as the SW that's doing the RAID is good.  In fact, SW RAID is not only fine; in many cases, it's FAR better than a HW RAID card.  Now, I'm not saying LSI sucks at RAID; they're fine cards.  But SW has come a long way, and I really see it as the preferred solution over HW RAID.  I'm not going to go into the WHYs in this post, but if you're curious, do some research of your own until I have time to write another "vs" article.
  • Using Dell, HP, IBM, etc. for bulk storage is EXPENSIVE compared to what I'll call 2nd tier solutions.  Think of these as somewhere in between Dell and home-brewing your own solution.
    • Add to this that manufacturers only want you running "their" disks in "their" JBODs, which means not only are you stuck paying a lot for a false sense of security, you're also incredibly limited in what options you have for your storage.
  • All the HW was approaching EOL.
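
To put rough numbers on the RAID 6 vs. RAID 10 write argument and the capacity math above, here's a quick back-of-the-napkin sketch.  The write-penalty figures (2 for RAID 10, 6 for RAID 6) are the textbook values, and the ~75 IOPS per disk is an assumed ballpark for 7.2K drives; real controllers and caches will shift these numbers.

```python
# Back-of-the-napkin RAID math for the points above.
# Write penalty = physical I/Os generated per logical random write
# (textbook figures: RAID 10 = 2, RAID 6 = 6; real controllers vary).

def effective_write_iops(disks, iops_per_disk, write_penalty):
    """Approximate aggregate random-write IOPS a RAID set can sustain."""
    return disks * iops_per_disk // write_penalty

# Assume ~75 random IOPS per 7.2K SATA/NL-SAS-class disk.
raid6_15disk = effective_write_iops(15, 75, 6)    # 15-disk RAID 6
raid10_14disk = effective_write_iops(14, 75, 2)   # RAID 10 needs an even count

# Usable capacity: a 15-disk RAID 6 loses two disks' worth to parity.
usable_tb = (15 - 2) * 1  # 1TB drives -> 13TB usable per shelf

print(raid6_15disk, raid10_14disk, usable_tb)  # 187 525 13
```

Even with one fewer disk, the RAID 10 set sustains roughly 3x the random-write IOPS, which is exactly the pain point when many parallel backup streams hit the storage at once.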

There are probably a few more reasons why this solution was no longer ideal, but you get the point.  The reality is, our backup solution was changing, and it was a perfect time to re-think our storage.  In part 2, we'll get into what we thought we wanted, what we needed, and what we had budget for.

Thinking out loud: CLI vs. GUI, a pointless debate

I see this come up occasionally, and I really don't get why it's always an "x is better than y" debate.  A lot of the time, folks are talking about what works best for them in their world, and for all intents and purposes stating that if it's best for them, it's best for all.  I'd like to challenge this reasoning and defend the notion that both a CLI and a GUI have their place.

The GUI: It’s pretty and functional

I'm not sure where all this hate on GUIs comes from, but I can tell you that it stereotypically comes from either the *NIX or network engineer crowds.  Thinking about a GUI from their point of view, I can completely understand why they're not huge fans.  The folks in these crowds spend most of their day buried in a CLI, for the simple reason that no GOOD GUI exists for most of what they're doing.  KDE?  GNOME?  Some crappy web interface (this one's a toss-up depending on whose interface it is)?  I wouldn't want to live day in and day out with half of the GUIs they're using either.  So what's the problem with their GUIs?

  • In my admittedly limited experience with Linux GUIs, most of them offer limited functionality.  They can do the basics, but when you really want to get in and configure something, it almost always leads to firing up a console.  If everything in your ecosystem relies on you ending up in the CLI, pretty soon you're going to skip the GUI.  Even with Apple's OS X I've found this to be the case.  I wanted to change my mouse wheel's scrolling direction, and that required going into the CLI (seriously).
  • Their GUIs are not always intuitive, and honestly, this is supposed to be one of the value-adds of a good GUI.
  • Their GUIs sometimes implement sloppy / bad configurations.  One area I recall hearing this a lot about is the Cisco ASA, but I'm sure it occurs on more solutions than Cisco's firewalls.
  • Not so much a problem of their GUIs, but there's just a negative stigma attached to anyone in these crowds using a GUI.  To paraphrase: if you use a GUI, you're not a good/real admin (a load of crap, BTW).
  • Again, not so much a problem of the GUI, but likely due to the above stigma, there really aren't a lot of how-tos for using a GUI.  You ultimately end up firing up the CLI, because that's the way things are documented.
  • A lot of GUIs don't do bulk or automated administration well.  I think this is pretty true across the board.  That said, I have used purpose-built automation tools, but they were for very specific use cases.  98% of the time, I go to a CLI for automation / bulk tasks.
  • Depending on the task you're doing, GUIs can be slower IF you're familiar with the CLI (commands and proper syntax).

Clearly there are a lot of cons for the GUI, but I think a lot of them pertain more to *NIX and network engineers.  It's not that a GUI by nature is bad; it's just that their GUIs are bad.  In case you haven't guessed, I'm a Windows admin, and with that, I enjoy an OS that was specifically designed with a great GUI in mind (even Windows 8 / 2012).  This is really the key: if the GUI is good, then your desire to use the GUI instead of the CLI will increase.  We went over what the problems are with a GUI, so why not go over what's good about a GOOD GUI?

  • They're easy and intuitive to use.
  • They present data in a way that's easier to analyze visually.  This is important, and I really think it's overlooked a lot of the time.  Sure, you CAN analyze data in the CLI, but the saying "a picture is worth a thousand words" isn't a hollow phrase.
  • When they're designed well, single-task items (non-bulk) are quick and easy.  It's debatable whether a CLI is quicker, and there are A LOT of factors that go into that.
  • GUIs by nature allow multi-tasking.  Now, I get that you can technically have multiple PuTTY windows open (that's a GUI, BTW), but it's not quite the same as a single application, interactivity-wise.
  • They're pretty.  I know that's not exactly a great reason, but let's all be honest, a pretty GUI is a lot nicer than a cold, dark CLI.
  • Options and parameters tend to be fully displayed, making it easier to see all the possible options and requirements.  (I know the PowerShell ISE now offers this; very sweet.)
  • Some newer GUIs, like MS's SQL and Exchange management consoles, even show you the CLI commands, so you can use them as a foundation for scripts.  Meaning, GUIs can help you learn the CLI.
  • Certain highly complex tasks are simplified by using a GUI.  Things that may have taken multiple commands or digging deep into an object can be accomplished with a few clicks.

The CLI: Lean, mean automation machine

Like I said, I'm a Windows guy, but at the same time, I have a healthy appreciation and love for the CLI.  I'm lazy by nature and HATE routine tasks.  Speaking of hate, I see a lot of Windows admins doling out an equal amount of hate on the CLI, or more specifically on being forced into PowerShell.  I get it; after all, it wasn't until PowerShell came out that Microsoft actually had a decent CLI / scripting language.  There was VBScript, and it worked, but man was it a lot of work (and not really interactive).  Nonetheless, PowerShell has been out for what's got to be close to ten years now, and there's still pushback from some of the Windows crowd.  There's no need for the resistance (it's futile); your life will be better with a CLI AND the GUI you know and love.  So let's go into why you're still not using a CLI, and why it's all a bunch of nonsense.  This is going to be targeted mostly at Windows admins, specifically those still avoiding the CLI.  It's kind of redundant to go over the pros / cons of the CLI, because they're mostly the inverse of what I mentioned about the GUI.

  • It's a PITA to learn and a GUI is easy, or at least that's what you tell yourself.  The reality: PowerShell is EASY to use and learn.
  • You don't have time to learn how to script.  After all, you could spend 30 minutes clicking, or spend 4 hours trying to write a script.  Sure, the first time you attempt to write a new script, it might take you 16 hours, but the second script might take you 4 hours, and the third might take you 5 minutes.  The more you familiarize yourself with the syntax, commands, functions, methods, etc., the easier it will be to solve the next problem.
  • You're afraid that your script might purge 500 mailboxes in a single key tap.  This is absolutely a possibility, and you know what, mistakes happen (even GOOD admins make really dumb mistakes).  But that's why you start out small and learn how to target.  That's also why you have backups 🙂
  • You're afraid you'll automate yourself out of a job.  That's not likely to happen.  Someone still needs to maintain the automation logic (it's never perfect, OR things change), and it frees you up to do more interesting (fun) things.
  • Once you learn one CLI, a lot of it is transferable to other CLIs.  For me, going from PowerShell to T-SQL was actually pretty easy.  I'm not a pro with T-SQL, but a lot of the concepts were similar.  I also found that as I learned how to do things in T-SQL, it helped me with other problems in PowerShell (see how that works?).  I don't have a lot of experience with *NIX CLIs, but I'd be willing to bet I could figure them out.

I probably did a bad job of being unbiased, but I really did try.  I truly see a place for both administration methods in IT, and I hope you do too.

Thinking out loud: Core vs. Socket Licensing

I recall when Microsoft changed their SQL licensing model from per socket to per core.  It was not a well-received model, and it certainly wasn't following the industry standard.  I'd even say it ranked up there with the VMware vRAM tax debacle, the big difference being that VMware reversed their licensing model.  To this day, other than perhaps Oracle, I don't know of any other product / manufacturer using this model.  And you know what?  It's a shame.

I bet you weren't expecting that, were you?  No, I don't own stock in Microsoft, and yes, I do like per socket (at times), but I honestly feel that in more cases than not, per core is a better licensing model for everyone.

I didn't always feel this way; like most of you, I was pretty ticked off when Microsoft changed to this model.  After all, cores per socket are getting denser and denser, and now that we're finally starting to get an incredible number of cores per socket, here's Microsoft (and only Microsoft, BTW) changing their model.  Clearly it must be driven by greed.

No, I don't think it's greed; in fact, I think it's an evolutionary model that was needed for the good of us consumers and of the manufacturers.  Let's go into why I feel this way.

To start, I'm going to use my company's environment as an example.  We have what I would personally consider a respectably sized environment.

In my production site, I currently have the following setup:

  • 4 Dell R720’s with the v2 processors (dual socket, 12 cores per, 768GB of RAM)
    • General purpose VM’s
  • 10 Dell R720’s with the v1 processors (dual socket, 8 cores per, 384GB of RAM)
    • Web infrastructure
  • 7 Dell R820’s with v1 processors (quad socket, 8 cores per, 768GB of RAM)
    • Microsoft SQL only

In my sister companies site we have the following setup:

  • 3 Dell R710's (dual socket, quad core, 256GB of RAM)
    • Remote office setup

In my DR site, we have the following setup (there are more servers, they're just not on VMware yet):

  • 4 Dell R710's (dual socket, quad core, 256GB of RAM)
    • DR servers (more to come).

As you can see, beefy hardware and a pretty wide array of different configurations for different uses.  All of these are licensed per socket, with one caveat: my SQL cluster is also licensed per core.

What got me thinking about this whole licensing model to begin with is that I'd like to refresh my DR site and take the opportunity to re-provision our prod site in a more holistic way.  The biggest problem here is SQL, because it's per core and everything else is per socket, which really limits my options to the following:

  1. Have two separate clusters, one for SQL and one for everything else.  Then I can design my HW and licensing as makes the most sense for SQL, and also maximize my VMs per socket in my other cluster.
  2. Take a density hit on the number of VMs per socket, and run a single cluster with quad-socket 8-core procs.
  3. Have my SQL VMs take a clock-rate hit (i.e., make them slower) and adopt a dual-socket 16-core setup.

So what does any of this have to do with core vs. socket?  It's pretty simple, actually, and if you don't get it, go back and re-read my current setup, and then the design questions I'm juggling.  Really think about it…

Let me highlight a few things about my current infrastructure.  I'm going to attack the principle of per socket, and why it's a futile licensing model (even if you don't run anything else that's licensed per core).

Every server in the infrastructure I laid out above is licensed on a flat per-socket model, regardless of the number of cores.  That means my dual-socket quad-core procs cost me the same amount to license as my dual-socket 12-core procs.  They're doing less work and aren't able to support the same workloads, yet they cost me the same amount.  Now, let's look at it from the manufacturer's view: at the time of pricing, my quad-core proc was a pretty fair deal, but as cores per socket have increased, their revenue potential has decreased.  Depending on which side you're on, someone is losing out.

Here's another example of how per socket isn't fair to us.  Remember my SQL hosts, quad-socket 8-core?  Remember how I said I was thinking about dual-socket 16-core procs, but at the expense of a really low clock rate?  How is that fair to me?  The reality is, I'd have the same number of cores, yet if I design my solution based on performance, it costs me twice as much in licensing compared to a model that's mostly about density.
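
To make that distortion concrete, here's a rough sketch comparing the two models across my two SQL design options.  The dollar figures are placeholders I made up for illustration, not real Microsoft or VMware list prices.

```python
# Hypothetical license prices -- placeholders for illustration only.
PER_SOCKET_PRICE = 7000   # flat cost per populated socket
PER_CORE_PRICE = 1800     # cost per physical core

def socket_model_cost(sockets):
    """License cost under a flat per-socket model."""
    return sockets * PER_SOCKET_PRICE

def core_model_cost(sockets, cores_per_socket):
    """License cost under a per-core model."""
    return sockets * cores_per_socket * PER_CORE_PRICE

# Option A: quad-socket, 8 cores each (32 cores, density-oriented)
# Option B: dual-socket, 16 cores each (32 cores, faster clocks available)
print(socket_model_cost(4), core_model_cost(4, 8))    # 28000 57600
print(socket_model_cost(2), core_model_cost(2, 16))   # 14000 57600
```

The point: under per-socket, the exact same 32 cores cost twice as much just because of how they're packaged, while under per-core the two designs price identically, so you're free to pick the layout that performs best.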

I'd like to share a final example of why I think per socket isn't relevant any longer.  This should hit home a lot more with SMBs and those who try to convince SMBs to go virtual.  We have a sister company that we manage, and they have all of about 20 servers.  That doesn't mean virtualization isn't a good fit for them, and why should they be forced into something less optimal like Hyper-V?  Yet a simple 3-host, dual-socket config for them would cost an arm and a leg, the reason being that VMware charges them the same price they'd charge a large enterprise.  You know what would level the playing field?  Per core instead of per socket.

In my opinion, per core would be a win for both the consumers and the manufacturers.  It provides the most flexibility and the fairest pricing, and it doesn't force you into building your environment simply to maximize your licensing ROI (you do take application performance into consideration, right?).

Finally, if we were to adopt a per-core model, there's one thing I would insist manufacturers do to be fair.  I'm calling Microsoft out on this, because I think it's very underhanded of them to put this in their EULA.  Modern BIOSes have the ability to limit the number of cores presented to the OS; you can effectively turn a 12-core proc into a quad-core proc if you want.  Microsoft specifically doesn't allow a company to use this feature, and still requires that all physically installed cores be licensed.  This is so wrong on their part; it's not fair, it's not right, and the reality is, it just makes the model look worse than it really is.  So with that being said, if a manufacturer were to switch to this model, I'd implore them to offer a few different options to let me buy more physical cores than I'm licensed for, but still be compliant.

  1. Let me use the BIOS feature to limit the number of cores presented to the OS.  Do you really think I'm going to shut down all my servers before an audit and lower their core counts?  Really, what good would that do?  If my server needs 8 cores, it needs 8 cores; lowering it to 4 cores for the audit would likely have a more devastating effect on my business than properly licensing the server.
  2. Design your product so that with a simple license key, or license augmentation, you can logically go from "x" cores to "y" cores.  Meaning, the product self-restricts the number of physical cores used in a system.  If I have dual-socket 12-core procs and I'm only licensed for 8 cores, only utilize 8 of the 24 cores.
    1. Heck, take NUMA minimums into consideration.  Meaning, if I have a quad-socket box, at least 4 cores (one per socket) are required, and if I have a dual-socket box, at least 2 cores are required.
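
A minimal sketch of how that self-restriction might compute the core count to expose, assuming a floor of one active core per socket for the NUMA minimum.  This is my own reading of the idea, not any vendor's actual enforcement logic.

```python
def licensed_core_limit(sockets, cores_per_socket, licensed_cores):
    """Cores the product should allow itself to use: the licensed count,
    clamped between one core per socket (NUMA minimum) and the physical total."""
    physical_total = sockets * cores_per_socket
    numa_minimum = sockets  # at least one active core per populated socket
    return max(numa_minimum, min(licensed_cores, physical_total))

# Dual-socket 12-core box (24 physical), licensed for 8 cores -> use 8 of 24
print(licensed_core_limit(2, 12, 8))    # 8
# Quad-socket box licensed for only 2 cores -> floor of 4 (one per socket)
print(licensed_core_limit(4, 8, 2))     # 4
# License covers more cores than exist -> capped at the physical total
print(licensed_core_limit(2, 12, 100))  # 24
```

The clamp is the whole trick: you can over-buy hardware today and unlock cores later with a license change, while the vendor still gets paid for exactly what's in use.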

What are your thoughts?