Backup Storage Part 5: Realization of a failure

No one likes admitting they’re wrong, and I’m certainly no different.  Being a mature person means being able to admit you’re wrong, even if it means doing it publicly, and that is what I’m about to do.

I’ve been writing this series slowly over the past few months, and during that time, I’ve noticed an increasing number of instances where the NTFS file system on my Storage Spaces virtual disks would become corrupt. Basically, I’d see Veeam errors writing to our repository, and when investigating, I would find files that wouldn’t delete (old VBKs). When trying to delete them manually, they would either throw an error, or they would act like they were deleted (they’d disappear), only to return a second later. The only way to fix this (temporarily) was to run a check disk, which requires taking the disk offline. When you have a number of backup jobs running at any given time, that means something is going to crash, and with my luck it was always in the middle of a 4TB+ VM.
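For what it’s worth, the repair cycle looked roughly like the sketch below. The drive letter is a made-up placeholder, and you’d want any Veeam jobs pointed at the repository paused first; treat this as an illustration of the workflow, not a copy/paste fix.

```powershell
# Hypothetical example: R: is the NTFS volume backing the Veeam repository.

# Quick online scan first, to confirm NTFS actually thinks something is wrong
Repair-Volume -DriveLetter R -Scan

# The real fix needs the volume offline (this is the part that killed running backup jobs)
Repair-Volume -DriveLetter R -OfflineScanAndFix

# Sanity check afterwards
Get-Volume -DriveLetter R | Select-Object DriveLetter, FileSystem, HealthStatus
```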

Basically, what I’m saying is that as of this date, I can no longer recommend NTFS running on Storage Spaces, at least not on bare metal HW. My best guess is that we were suffering from bit rot, but who knows; Storage Spaces / NTFS can’t tell me one way or the other, or at least I don’t know how to figure it out.

All that said, I suspect I wouldn’t have run into these issues had I been running ReFS. ReFS has online scrubbing, and it looks for things like failed CRC checks (and auto-repairs them). At this point, I’m burnt out on running Storage Spaces, so I’m not going to even attempt to try ReFS. Enough v1 product evals in prod for me :-).
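For anyone who does want to go the ReFS route I’m skipping, turning on integrity streams looks roughly like the sketch below. I never ran this in production, so it’s just a sketch based on the documented cmdlets, and the drive letter and paths are placeholders.

```powershell
# Format the Storage Spaces virtual disk with ReFS and integrity streams on (hypothetical E: volume).
# Note: auto-repair of a failed checksum needs a resilient (mirrored) space to pull a good copy from.
Format-Volume -DriveLetter E -FileSystem ReFS -SetIntegrityStreams $true

# Integrity can be checked or toggled per file / folder after the fact
Get-FileIntegrity -FileName 'E:\VeeamRepo\backup.vbk'
Set-FileIntegrity -FileName 'E:\VeeamRepo' -Enable $true
```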

Fortunately, I knew this might not work out, so my back-out plan is to take the same disks / JBODs and attach them to a few RAID cards. Not exactly thrilled about it, but hopefully it will bring a bit more consistency / reliability back to my backup environment. Long term, I’m looking at getting a SAN implemented for this, but that’s for a later time.

It’s a shame, as I really had high hopes for Storage Spaces, but like many MS products, I should have known better than to go with their v1 release. At least it was only backups and not prod…

Update (09/13/2016):

I wanted to add a bit more information. At this point it’s just a theory, but whether or not this article is dissuading you from doing Storage Spaces, it’s worth noting.

We had two NTFS volumes, each 100TB in size: one for Veeam and one for our SQL backup data. We never had problems with the SQL backup volume (probably luck), but the Veeam volume certainly had issues. Anyway, after tearing it all down, I was still bugged by the issue and felt really disappointed about the whole thing. In some random googling, I stumbled across this link going over some of NTFS’s practical maximums. In theory at least, we went over the tested (recommended) maximum volume size. Again, I’m not one to hide things, and I fess up when I screw up: some of the Storage Spaces issues may have been related to us exceeding the recommended size, leaving NTFS unable to proactively fix things in the background. I don’t know for sure, and I really don’t have the appetite to try it again.

I know it sounds crazy to have a 100TB volume, but we had 80TB of data stored in there. In other words, most smaller companies never hit that size limit, but we have no problem at all exceeding it. If you’re wondering why we made such a large volume, it really boiled down to wanting to maximize contiguous space while not wasting space. Storage Spaces doesn’t let you thin provision storage when it’s clustered, so if we had created five 20TB LUNs instead, the contiguous space would have been much smaller and ultimately more difficult to manage with Veeam. We don’t have that issue anymore with CommVault, as it can deal with lots of smaller volumes with ease.

Anyway, while I would love to say MS shouldn’t let you format a volume larger than what they’ve tested (and they shouldn’t, at least not without a warning), ultimately the blame falls on me for not digging into this a bit more. Then again, try as I might, I’ve been unable to validate the information posted in the blog linked above. I don’t doubt its accuracy; I often find fellow bloggers do a better job of explaining how to do something, or conveying real world limits, than the vendor.

Best of luck to you if you do go forward with Storage Spaces, and if you have questions, let me know; I worked with it in production for over a year, at a decent scale.

Backup Storage Part 4a: Windows Storage Spaces Gotchas

The pro and the con of software-defined storage is that it’s not a turnkey solution. Not only do you have the power to customize your solution, you also have no choice but to design it. With Storage Spaces, we figured most of this stuff out before selecting it as our new backup storage solution. At the time, there was some documentation on Storage Spaces, but it was very much a learning process. Most how-tos were demonstrated inside labs, so I found some aspects of the documentation to be useless, but I was able to glean enough information to at least know what to think about.

So to get started: if you ended up here because you want to build a solution like this, I would encourage you to start with the FAQ that MS has put together. A lot of the questions I had were answered there.

I want to go over a number of gotchas with WSS before you take the plunge:

  1. If you’re used to and demand a solution that notifies you automatically of HW failures, this solution may not be right for you. Don’t get me wrong, you can tell that things are going bad, but you’ll need to monitor them yourself. MS has written a health script, and I was also able to put together a very simple health script of my own (I’ll post it once I get my GitHub page up, but there’s a rough sketch of the idea after this list).
  2. WSS only polls HW health once every 30 minutes, and you can’t change this. That means if you rip a power supply out of your enclosure, it will take up to 30 minutes before the enclosure goes into an unhealthy state. I confirmed this with MS’s product manager.
  3. Disk rebuilds are not automatic, nor are they what I would call intuitive. You shouldn’t just rip a disk out when it’s bad, plop a new one in, and walk away. There is a process that must be followed in order to replace a failed disk, and as of now that process is all PowerShell based (there’s a sketch of the steps after this list).
  4. Do NOT cheap out on consumer grade HW. Stick to the MS HCL found here. There have been a number of stability problems reported with WSS, and it almost always has to do with not sticking to the HCL, not with WSS’s reliability.
  5. This isn’t specific to WSS, but do not plan on mixing SATA and SAS drives on the same SAS backplane. Either go all SAS or go all SATA; avoid mixing. For HDDs specifically, the cost difference between SATA and SAS is so negligible that I would personally recommend just sticking with SAS, unless you never plan to cluster.
    1. Do NOT use SAS expanders either. Again, stop cheaping out; this is your data we’re talking about here.
  6. Do NOT plan on using parity-based virtual disks; they’re horrible at writes, as in 90MBps tops. Use mirroring or nothing.
  7. Do NOT plan on using dedicated hot spares; instead, plan on reserving free space in your storage pool. This is one of many advantages of WSS: it uses free space and ALL disks to rebuild your data.
    1. If you plan on using the “enclosure awareness” feature, you need to reserve one drive’s worth of free space multiplied by the number of enclosures you have. So if you have 4 enclosures and you’re using 4TB drives, you must reserve 16TB of space per storage pool spanned across those enclosures.
  8. Take point 7, and also plan on keeping at least 20% free space in your pool. I like to plan it like this:
    1. Subtract 20% right off the top of your raw capacity. So if you have 200TB raw, that’s 160TB.
    2. As mentioned in point 7.1, if you plan to use enclosure awareness, subtract a drive for each enclosure; otherwise, subtract at least one drive’s worth of capacity. So if we had 4 enclosures with 4TB drives, that would be 160TB – 16TB = 144TB usable before mirroring.
  9. I recommend thick provisioned virtual disks, unless you’re going to be VERY diligent about monitoring your pool’s free space. Your storage pool will go OFFLINE if even one disk in the pool runs low on free space. Thick provisioning prevents this, since you can only allocate what the pool can actually reserve.
  10. Figure out your stripe width (the number of columns, i.e. disks in a span) before building a virtual disk. Plan this carefully because it can’t be reversed. This has its own set of gotchas (there’s a sketch of creating a virtual disk with a specific column count after this list).
    1. This determines the increment of disks you need to expand the virtual disk. If you choose a 2-column mirror, that means you need 4 disks at a time to grow it.
    2. Enclosure awareness is also taken into account with columns.
    3. Performance depends on the number of columns. If you have 80 disks and you create a 2-column mirrored virtual disk, you’ll only have the read performance of 4 disks and the write performance of 2 disks (even with 80 disks in the pool). You will, however, be able to grow in 4 disk increments. If instead you create a virtual disk with 38 columns, you’ll have the read performance of 76 disks and the write performance of 38, but you’ll need 76 disks to grow that virtual disk. So plan your balance of growth vs. performance.
  11. Find a VAR with WSS experience to purchase your HW through. Raid Inc. is who we used, but now other vendors such as Dell have also taken to selling WSS approved solutions. I still prefer Raid Inc. due to pricing, but if you want the warm and fuzzies, it’s good to know you can now go to Dell.
  12. Adding storage to an existing pool does not rebalance the data across the new disks.  The only way to do this is to create a new virtual disk, move the data over, and remove the original virtual disk.
    1. This is resolved in Server 2016.
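As promised in point 1, here’s the general shape of the simple health check script. It’s a minimal sketch, not the full script; the point is just that all of this health data is exposed through the storage cmdlets, but you have to go poll it (and alert on it) yourself.

```powershell
# Minimal Storage Spaces health check sketch: collect anything WSS considers unhealthy.
# Run from a scheduled task and bolt your own alerting (email, etc.) onto the output.

$problems = @()

# Physical disks that are failed, missing, or otherwise not healthy
$problems += Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }

# Pools (ignoring the primordial pool) and virtual disks that have degraded
$problems += Get-StoragePool -IsPrimordial $false | Where-Object { $_.HealthStatus -ne 'Healthy' }
$problems += Get-VirtualDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }

# Enclosure hardware (fans, power supplies, etc.) - remember WSS only refreshes this every ~30 minutes
$problems += Get-StorageEnclosure | Where-Object { $_.HealthStatus -ne 'Healthy' }

if ($problems) {
    $problems | Format-Table FriendlyName, HealthStatus, OperationalStatus -AutoSize
}
```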
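And here’s the rough shape of the failed disk replacement process from point 3. The pool name is made up, and this is a sketch of the steps rather than a runbook; in particular, make sure the repair jobs actually finish before you remove anything from the enclosure.

```powershell
# Sketch: replacing a failed disk in a pool named 'BackupPool' (hypothetical name).

# 1. Find the bad disk and retire it so nothing new gets written to it
$bad = Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }
Set-PhysicalDisk -InputObject $bad -Usage Retired

# 2. Repair the virtual disks in the pool; data rebuilds into the pool's reserved free space
Get-StoragePool -FriendlyName 'BackupPool' | Get-VirtualDisk | Repair-VirtualDisk -AsJob

# 3. Watch the rebuild and wait for it to complete
Get-StorageJob

# 4. Once rebuilt, remove the retired disk from the pool, then physically swap it
Remove-PhysicalDisk -StoragePoolFriendlyName 'BackupPool' -PhysicalDisks $bad

# 5. Add the replacement disk (it should show up with CanPool = True) back into the pool
$new = Get-PhysicalDisk -CanPool $true
Add-PhysicalDisk -StoragePoolFriendlyName 'BackupPool' -PhysicalDisks $new
```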
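Points 9 and 10 come together when you actually create the virtual disk, since the provisioning type and the column count are both locked in at creation time. Something along these lines, with made-up names and sizes:

```powershell
# Sketch: a fixed (thick) provisioned, two-way mirrored virtual disk with 4 columns.
# 4 columns x 2 data copies = 8 disks per stripe, which is also your expansion increment.
# Add -IsEnclosureAware $true if you're spreading copies across enclosures (point 7.1).
New-VirtualDisk -StoragePoolFriendlyName 'BackupPool' `
    -FriendlyName 'VeeamRepo01' `
    -ResiliencySettingName Mirror `
    -NumberOfDataCopies 2 `
    -NumberOfColumns 4 `
    -ProvisioningType Fixed `
    -Size 20TB
```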

Those are most of the gotchas to consider. I know the list looks big, but I’m pretty confident that every storage vendor you look at has their own unique list of gotchas. Heck, a number of these are actually very similar to NetApp/ZFS and other tier 1 storage solutions.

We’ll nerd out in my next post on what HW we ended up getting, what it basically looks like and why.