Backup Storage Part 3d: Traditional SAN / NAS

Of all the types of storage we looked at, this was the one type that I kept coming back to.  It was a storage technology that I had the most personal experience with, and having just finished implementing my companies first SAN, I was pretty well versed in what was out on the market.

There were other reasons as well.  Traditional storage is arguably the most familiar storage technology out there.  Even if you don’t understand SAN, you probably have a basic understanding of local RAID and to some degree, that knowledge is far more transferable to traditional storage, than say something like a scale out solution.  More disks generally = faster, RAID 10 = good for writes and 7k drives are slower than 10k drives.  I’m over simplifying on purpose, but you can see where I’m coming from.    The point is, adopting a storage solution where existing skills can transfer is a big win when you don’t have time to be an expert in everything.  On top of that, you can NOT beat the cost per GB of traditional storage nor the performance edge of traditional storage compared to all its fancier alternatives.  I’m not saying there aren’t cases where deduplication targets will offer better costs per GB, nor am I saying scale out or cloud don’t have their other winning use cases.  However, as a general purpose, do a little bit of everything good and few things great, you can’t beat traditional storage.

By now, its probably pretty apparent that we went with a traditional storage solution, and I suspect your curious as to whose solution.  Well, I think you’ll find the answer a little surprising, but let first dig in to who/what we looked at.

  1. Dell Equallogic (6510’s):  I actually have a very good bit of experience with Equallogic.  In fact at my previous employer we had been running Equallogic for a good 6 years, so there wasn’t a whole lot of research we needed to do here.  Ultimately, Equallogic, while a great SAN (during its hay day), just could’t compete with what was out in the 2014 time frame.
    1. 15TB LUN limit in the year 2014 is just stupid.  Yes I know I could have spanned volumes in the OS, but that’s messy and frankly, I shouldn’t need to do it.
    2. The cost while better than most, was still on the pricey side for backup.  Things like the lack of 4TB drive support, not being able to add trays of storage, and instead being forced to buy an entire san to expand, just make the solution more expensive then it was worth.
    3. I didn’t like that it was either RAID 10 or RAID 50.  Again its 2014, and they had no support for RAID 60.  Who in their right mind would use RAID 50 with drives greater than 1TB?  Yes I know they have RAID 6, but again, who is going to want a wide disk span with a RAID 6?  That might be fine for 24 400GB drives, but that’s totally not cool with 48 (-2 for hot spares) 3TB drives.
  2. Nimble CS 200 series:  This is our current tier 1 SAN vendor and  still my favorite all around enterprise SAN.  I REALLY wanted to use them for backup, but they ultimately weren’t a good fit.
    1. If I had to pick a single reason, it would be price.  The affordability of them as a storage vendor isn’t just performance, its also that they do inline compression.  The problem is, the data I’d be storing on them would already be compressed, so I wouldn’t be taking advantage of “usable” capacity price point and instead would be left with their raw price point.  I even calculated adding shelves of storage, 100% maxed out, and the price point was still way above what we could afford.  Add to all of that, that in reality, for backup storage, they really didn’t have a ton of capacity at the time.  24TB for the head SAN, and 33TB for each shelf, with a 3 shelf maximum.  That’s 123TB usable, at 100% max capacity.  It would have taken up 12 rack units as well, and that’s if we stacked them on top of each other (which we don’t).
    2. Performance *may* have been an issue with them for backup.  Most of Nimble performance point is based on an assumption that your working data set lives in flash. Well my backup’s working data set is somewhere between 30TB and 40TB over a weekend, good luck with that.  What this means is the spindles alone would need to be able to support the load.  While the data set would be mostly sequential, which is generally easy for disks, it would be A LOT of sequential workloads in parallel, which Nimbles architecture just wouldn’t be able to handle well IMO.  If it was all write or even all read, that may have been different, but this would be 50% to even 75% write, with 25% – 50% reads.  Add to that, some of our work load would be random, but not accessed frequently enough (or small enough) to fit in cache, so again, left to the disks to handle.  There’s a saying most car guys know, which is “there’s no replacement for displacement”, and in the case of backup, you can’t beat lots of disks.
  3. ZFS:  While I know its not true for everyone, I’d like to think most folks in IT have at least heard of ZFS by now.  Me personally, I heard about it almost 8 years ago on a message board, when a typical *NIX zealot was preaching about how horrible NTFS was, and that they were using the most awesome filesystem ever, and it was called ZFS.  Back then I think I was doing tech support, so I didn’t pay it much mind as file system science was too nerdy for my interests.  Fast forward to my SAN evaluation, and good ‘ol ZFS ended up in my search results.  Everything about it sounded awesome about it, except that it was not windows admin friendly.  Don’t mis-understand me, I’m sure there are tons of windows admins that would have no problem with ZFS, and no GUI, but I had to make sure whatever we implemented was easy enough for average admin to support. Additionally, I ideally wanted something with a phone number that I could call for support,and I really wanted something that supported had HA controllers built in.  After a lot of searching, it was clear there’s at least 50 different ways to implement ZFS, but there were only a few what I would consider enterprise implementations of ZFS.
    1. Nexenta:  I looked at these guys during my SAN evaluation.  Pretty awesome product, that was unfortunately axed from my short list almost right away when I saw that they licensed their storage based on RAW capacity and not usable capacity.  While there was tiered based pricing (the more you buy, the cheaper it gets) it was still way too expensive for backup grade storage.  For the price we would have paid for their solution, plus HW, we would have been better off with Nimble.
    2. TrueNAS:  I had heard a lot about and even tried out FreeNAS.  TrueNAS was an enterprise supported version of FreeNAS.  Unfortunately I didn’t give them much thought because I honestly couldn’t find one damn review about them on the web.  Add to that, I honestly found FreeNAS to be totaly unpolished as a solution.  So many changes required restarting services, which led to storage going offline.  Who knows, maybe these services were replaced with more enterprise frinedly services in the TrueNAS version, but I doubted it.
    3. Napp-IT:  I tried running this on OpenIndiana, and some Joyent distribution of Unix (can’t recall its name).  In either case, while I had no doubt I could get things running.  I found the GUI so un-intuitive, that I just reverted to the CLI for configuring things.  This of course defeats the purpose of looking at Napp-IT to begin with.  The only postive thing I could say about it, is that it did make it a little easier to see what the lable were for the various drives in the system.  On top of all this, no HA was natively built in to this solution (not without a third party solution) so this was pretty much doomed to fail from the begining.  if we were a shop that had a few full time *nix admins, I probably would have been all over it, but I didn’t know enough to feel 100% comfortable supporting it, and I couldn’t expect others in my team to pick it up either.
    4. Tegile:  Not exactly the first name you think of when you’re looking up ZFS, but they are in fact based on ZFS.  Again, they were a nice product, but way too expensive for backup storage.
    5. Oracle:  Alas, in desperation, knowing there was one place left to look, one place that I knew would check all the boxes, I hunkered down and called Oracle.  Now let me be clear, *I* do not have a problem with Oracle, but pretty much everyone at my company hates them.  In fact, I’m pretty sure everyone that was part of the original ZFS team hates Oracle too for obvious reasons.  Me, I was personally calling them as a last resort, because I just assumed pricing would be stupid expensive and huge waste of my time.  I thought wrong!  Seriously, at the end these guys were my second overall pick for backup storage, and they’re surprisingly affordable.  As far as ZFS is goes, if you’re looking for an enterprise version of ZFS, join the dark side and buy into Oracle.  You get enterprise grade HW 100% certified to work (and with tech support / warranty), a kick ass GUI, and compared to the likes of Tegile, NetApp and even Nexenta, they’re affordable.  Add to that, you’re now dealing with a company that owns ZFS (per say) so you’re going to end up with decent development and support.  There were only a few reason we didn’t go with Oracle
      1. My company hated Oracle.  Now, if everything else was equal, I’m sure I could work around this, but it was a negative for them right off the get go.
      2. We have a huge DC and rack space isn’t a problem in general, but man do they chew up a ton of rack space.  24 drives per 4u, wasn’t the kind of density I was looking for.
      3. They’re affordable, but only at scale.  I think when comparing them against my first choice, we didn’t break even until I hit six JBOD filled with 4TB drives.  And that assumed that my first choice was running brand new servers.
      4. Add to point 3, I couldn’t fill all their JBODs.  They either have 20 drive options or 24 drive options.  I needed to allow room for ZIL drives, which means at a minimum, one JBOD would be only populated with 20 4TB drives.
      5. Those that work with ZFS know you need LOTs of memory for metadata/caching.  In my case, we were looking at close to 400TB of raw capacity, which meant close to 400GB’s of DRAM or DRAM + SSD.  Oracle specifically in their design doesn’t use shared read cache and instead populates each controller with read cache.  In my case, that meant I was paying for cache in another controller that would rarely get used and those cache drives weren’t cheap.
    6. Windows Storage Spaces (2012 R2):  I know what you’re thinking, and let me be clear, despite what some may say, this is not the same solution as old school Windows RAID.  Windows Storage Spaces is surprisingly a very decent storage solution, as long as you understand its limitations, and work within them.  It’s very much their storage equivalent of Hyper-V.  It’s not as polished as other solution, but in many cases its likely good enough, and in our case, it was perfect for backup storage.  They ultimately beat out all the other storage solutions we looked at, which if you’ve been following this series is a lot.  As for why?  Here are a few reasons:
      1. We were able to use commodity HW, which  I know sounds so cliche, but honestly, its a big deal when you’re trying to keep the cost per TB down.  I love Dell, but their portfolio was not only too limiting, but its also expensive.  I still use Dell servers, but everything else is generic (LSI, Segate) HW.
      2. With the money we saved going commodity, we were able to not only buy more storage, but also design our storage so that it was resilient.  There’s not a single component in our infrastructure that’s not resilient.
        1. Dual Servers (Failover clustering)
        2. Dual port  10g NICS with teaming
        3. Dual Quad port SAS cards with MPIO for SAS failover in each server
        4. A RAID 10 distributed across 4 JBOD’s in a way that we can survive and entire JBOD going offline.
      3. Because these are servers running windows, we could also dual purpose them for other uses, in our case, we were able to use them for Veeam Proxy’s.  Storage Spaces is very low in memory and CPU consumption.
      4. Storage spaces is portable.  if we find that we grow out of the JBODs or the servers, its as simple as attaching the storage to new servers, or moving the disks from one JBOD into a newer one.  Windows will import the storage pool and virtual disk, and you’ll be back and running in no time.
      5. Storage spaces offered a good compromise of enabling us to build a solution on our own, but still allowed us to have a vendor that we could call in the event of a serious issue.  Sure there is more than one neck to choke now, but honestly, its not that much different than Vmware and Dell.
      6. Its surprisingly fast, or at least windows doesn’t cause overhead like you might think.  I’ll go more into my HW spec’s later, but basically I’m limited by my HW, not Windows.  As a teaser, I’m getting over 5.8GBps sequential read (that’s Bytes with a big B).  Normally I’m not a big fan of sequential IO as a benchmark, but for backup, most of what goes on is sequential IO, so its a pretty good IO pattern to demonstrate.  Again, its not the fastest solution out there, but its fast enough.

This is the final post on the types of storage we looked at.  All in all, it was pretty fun to check out all of the solutions out there, even if it was for something that was kind of on the boring side of IT.  Coming up, I’m going start going over how we tackled going from nothing, to a fairly large, fast and highly resilient storage spaces solution.