Naming Conventions: Windows Local Administrator Group via Active Directory

Introduction:

One thing that has always kind of sucked with Windows servers (and it's even worse with Linux) is figuring out how to manage local administrator rights.  Now, I realize a lot of you just add yourself to Domain Admins and use that account (not a security best practice) to administer your servers, but what about everyone and everything else that might need local admin access?  What I have for you here is how I not only created a new naming convention, but also how I tackled managing local administrator access in a Windows domain environment.

What you need:

  1. This is dependent on Active Directory. If you're dealing with non-domain systems, this post isn't really for you.
  2. You're going to need a Group Policy Object, and your servers / desktops are going to need to support Group Policy Preferences (2003/XP and above).

What you’ll want:

  1. More than likely a provisioning process that automates something which GPO/GPP can’t. For example, a simple PowerShell script that you run once the server is provisioned to finish the little things.

The naming convention:

I use a pretty simple naming convention for local administrator groups.  There are three main patterns, which are below.

  • _asi_lad_%Server or Computer Name%
  • _asi_lad_All Servers in AD
  • _asi_lad_All Desktops in AD

Naming convention breakdown:

  1. The underscores are used as separators. At some point you may need to run queries from scripts, and knowing that an "_" represents a separator makes it easy to know where to start looking for variables.  For example, if you just looked in AD for something like *ServerName* you might get back more than one AD object, but if you look for *_lad_ServerName you're only going to get back local administrator groups (there's a quick query sketch after this list).
  2. "ASI" in my case stands for Advertising Specialty Institute, which is an acronym for my company name. I used to work at the Pew Charitable Trusts and used "PCT" for them.  I start everything with a company identifier because down the road, you never know who you're going to merge with.  While there is always a chance that ASI could merge with a company named "Advanced Internetworked Servers" or something like that, it's not likely.  Plus, if you're running any sort of multi-tenancy, this allows you to keep different companies separate.  It doesn't need to be an acronym, nor does it need to be three letters, but it should be specific to a company.  Heck, if a small GUID makes sense, go for it; just keep in mind the 64-character limit AD puts on group names (yes, I've hit it).
  3. "LAD", as you can guess, stands for "local administrator".  Again, it doesn't need to be three letters, but it should be consistent.
  4. The final part identifies the scope:
    1. The server (or computer) name scopes the group to that one machine.
    2. "All Servers in AD" covers every server in the domain.
    3. "All Desktops in AD" covers every desktop in the domain.
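Here's a quick, hypothetical PowerShell sketch of the kind of query the separators make possible (the group and server names are made up, and it assumes the ActiveDirectory RSAT module is available):

Import-Module ActiveDirectory

#Every local admin group in the domain, regardless of company prefix or target
Get-ADGroup -Filter 'Name -like "*_lad_*"' | Select-Object Name

#Just the local admin group(s) for one specific server
Get-ADGroup -Filter 'Name -like "*_lad_cmp-servername-01"' | Select-Object Name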

 

You could of course take it much further than this if you want; it really depends on how granular you want to get.  I try to keep things balanced.  If your servers themselves have a good naming convention, you can likely already key off of that.

How it works:

  1. I created a GPO for my servers and a different one for my desktops. To be VERY clear, these GPOs are not dedicated to this purpose.  I try to use one GPO to rule them all as much as possible, so this setting just gets added to the GPO that already applies to all servers, and likewise to the one that applies to all desktops.
  2. Create a computer GPP setting which adds the respective "_company_lad_all servers" and "_company_lad_all desktops" group to the local Administrators group. This gives you an easy way to drop a service account into a group that grants it local admin access to all servers WITHOUT giving it domain admin rights.
    1. As an aside, we also create a GPP setting that removes Domain Admins.
  3. Now, for the servers specifically (we don't do this for desktops), I create a server-specific group (via a script) and add it to that server's local Administrators group (also via a script). You now have the ability to add a user to a server's local admins without ever needing to log in to that server (see the example just below).
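For example, once the server-specific group exists, granting someone local admin is just an AD change.  A minimal, hypothetical sketch (the group and user names are made up):

Import-Module ActiveDirectory
Add-ADGroupMember -Identity "_asi_lad_cmp-servername-01" -Members "jdoe"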

 

Powershell Snippet for creating a group and adding it to the local admins:

#Create and assign default local admin group

#Parameters
$ServerName = "cmp-servername-01"
$GroupName = "_lad_cmp_" + $ServerName
$OUPath = "OU=YourOUPath,DC=Domain,DC=Name"
$FQDN = "Domain.name"

#Create the AD group (GroupCategory 1 = Security, GroupScope 1 = Global)
New-ADGroup -DisplayName $GroupName -GroupCategory "1" -GroupScope "1" -Name $GroupName -SamAccountName $GroupName -Path $OUPath

#Give AD a few seconds to catch up before referencing the new group
Start-Sleep -Seconds 15

#Add the group to the server's local administrators
$de = [ADSI]"WinNT://$ServerName/administrators,group"
$de.psbase.Invoke("Add",([ADSI]"WinNT://$FQDN/$GroupName").path)

Closing:

That’s it really.  You can of course take it further, but this is a simple way to get a handle on your local administrators, and centrally manage them all from AD going forward.  You also now have a naming convention which makes it easy to track them down and automate various tasks.

Index:

To see other naming conventions or posts about naming conventions, head over here.

 

Naming Conventions: Introduction

Introduction:

If there was one area in IT that I tend to be fanatical about (some would say fanatical isn't a strong enough word), I would have to say it's naming conventions.  For me, learning about naming conventions in school and then seeing them put into practice at my first real gig really reinforced the concept that naming conventions are important.  I actually consider myself pretty lucky in that respect, because I don't think a lot of people have had the fortune of studying the concept, let alone working for a company that had a decent naming convention.  After walking fresh into two different employers and seeing the mess they had made, and also working through a few mergers, I know that good naming conventions don't always exist.

This post is going to be refreshingly short, and instead really serve as an introduction to a new series I'm going to start.  The series will be a collection of naming conventions for different things in IT.  I manage a ton of different things, so there's always an opportunity for a new naming convention.

Why do you need a naming convention?

Here's the thing: you don't need a naming convention, but you should want one.  Not caring about the names of objects makes life hell not only for you, but also for everyone that either replaces you or works with you.  The end goal of a naming convention is to put a little effort in up front, so you can save a ton of time down the line.

How does a naming convention save you time?

Let me give a few examples:

  1. Once you have an established naming convention for an object, naming the next related object usually requires little thought.
  2. It keeps you and your peers organized.
  3. It provides consistency, and consistency leads to intuition, and intuition basically means second nature. If something is second nature, you’re not really thinking about it, and thus saving time.
  4. Getting a little more technical, naming conventions allow you to search and filter predictably.
    1. Being able to do this means you can now effectively use scripts with deterministic targets.
    2. Writing reports is easier.
  5. Identification of what something is becomes easier.
  6. When done right, they’re forward thinking so they can be easily updated to reflect new changes.

When don't naming conventions make sense?

Almost never.  I don't care if you have one AD group for managing a single server.  You might not need a super complex naming convention in that case, but you should still have something predictable.

That being said, tags are slowly but surely becoming the new way to identify objects.  It makes a lot of sense, and to some degree it does reduce the need for a monolithic naming convention.  Beyond that, even the best naming conventions will at times run into scaling issues.  Tags are infinitely more flexible, and I'm looking forward to their eventual ubiquity in every aspect of technology.  Even so, keep in mind that tags are just as susceptible to bad naming conventions.  Your tag names and categories should have well thought out standards to ensure consistency as well.

Closing:

Naming conventions to me are right up there with regular maintenance, patching and the other important but boring aspects of IT.  You need a plan, and you need to stick to it.  Don't be one of those lazy IT people that can't think further into the future than right now.  Saying you "don't have time" is a load of bytes; there is ALWAYS time for a naming convention.

Index:

This post is also where I plan to host the table of contents for all naming convention posts.  If you think you missed a post, or can't find one, track back here to look for it.  I plan to include a link to this post at the end of every naming convention post.

Windows:

Quicky Review: GPO/GPP vs. DSC

Introduction:

If you're not in a DevOps based shop, or you're living under a rock, you may not know that Microsoft has been working on a solution that by all accounts sounds like it's poised to usurp GPO / GPP.  The solution I'm talking about is Desired State Configuration, or DSC. According to all the marketing hype, DSC is the next best thing for IT since virtualization.  If the vision comes to fruition, GPO and GPP will be a legacy solution: enterprise mobility management will be used for desktops and DSC will be used for servers.  Given that I currently manage 700 VMs and about an equal number of desktops, I figured why not take it for a test drive. So I stood up a simplistic environment, played around with it for a full week, and my conclusion is this.

I can totally see why DSC is awesome for non-domain joined systems, but in its current iteration it's absolutely not a good replacement for domain joined systems. Does that mean you should shun it if all your systems are domain joined?  That depends on the size of your environment and how much you care about consistency and automation.  Below are all my unorganized thoughts on the subject.

The points:

DSC can do everything GPO can do, but the reverse is not true. At first that sounds like DSC is clearly the winner, but I say no.  The reality is, GPO does what it was meant to do, and it does it well.  Reproducing what you've already done in GPO, while certainly doable, has the potential of making your life hell.  Here are a few fun facts about DSC.

  1. The DSC “agent” runs as local system. This means it only has local computer rights, and nothing else.
  2. Every server that you want to manage with DSC needs its own unique config file built. That means if you have 700 servers like me and you want to manage them with DSC, each one is going to have a unique config file (there's a rough sketch of what a per-node config looks like after this list).  Don't get me wrong, you can create a standard config and duplicate it "x" times, but nonetheless it's not like GPO where you just drop the computer in an OU and walk away.  That being said, and to be fair, there's no reason you couldn't automate the DSC config build process to do just that.
    1. DSC has no concept of "inheritance / merging" like you're used to with GPO. Each config must be built to encompass all of the things that GPO would normally handle in a very easy way.  DSC does have config merges in the sense that you can have a partial config for, say, your OS team, your SQL team and maybe some other team, so they can "merge" configs and work on them independently (awesome).  However, if the DBA config and the OS config conflict, errors are thrown and someone has to figure it out.  Maybe not a bad thing at all, but nonetheless it's a different mindset, and there is certainly potential for conflicts to occur.
  3. A DSC configuration needs to store user credentials for a lot of different operations. It stores these credentials in a config file that is hosted both on a pull server (SMB share / HTTPS site) and on the local host.  What this means is you need a certificate to encrypt the config file, and of course for the agent to decrypt the config file.  You thought managing certificates was a pain for a few Exchange servers and some web sites?  Ha! Now every server and the build server need certs.  In the most ideal scenario, you're using some sort of PKI infrastructure.  This is just the start of the complexity.
    1. You of course need to deploy said certificate to the DSC system before the DSC config file can be applied. In case you can't figure it out by now, this is a bootstrap problem you have to solve on your own if you don't use GPO.  You could use the same certificate and bake it into an image.  That certainly makes your life easier, but it's also going to make your life that much harder when it comes to replacing those certs on 700 systems.  Not to mention, a paranoid security nut would argue how terrible that potentially is.
  4. The DSC agent of course needs to be configured before it knows what to do. You can "push" configurations, which does mitigate some of these issues, but the preferred method is "pull".  So that means you need to find a way (bootstrapping again) to configure your DSC agent so that it knows where to pull its config from, and which certificate thumbprint to use.
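To make the per-node config point a little more concrete, here's a rough sketch of a DSC configuration, not a drop-in example.  It only uses the built-in PSDesiredStateConfiguration resources, and the node names, feature and group member are all made up; compiling it produces one MOF file per node, which is exactly the "each server gets its own config file" behavior described above.

Configuration BaselineServer
{
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node @('cmp-servername-01','cmp-servername-02')
    {
        #Something GPO can't do: install a role/feature
        WindowsFeature Backup
        {
            Name   = 'Windows-Server-Backup'
            Ensure = 'Present'
        }

        #Something GPO/GPP already does well: manage local admins
        Group LocalAdmins
        {
            GroupName        = 'Administrators'
            MembersToInclude = @('DOMAIN\some-admin-group')
            Ensure           = 'Present'
        }
    }
}

#Compiling spits out one .mof per node, e.g. .\BaselineServer\cmp-servername-01.mof
BaselineServer -OutputPath .\BaselineServer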

Based on the above points, you probably think DSC is a mess, and to some degree it is. However, a few other thoughts.

  1. It’s a new solution, so it still needs time to mature. GPO has been in existence since 2000, and DSC, I’m going to guess, since maybe 2012.  GPO is mature, and DSC is the new kid.
  2. Remember when I wrote that DSC can do everything that GPO can do, but not the reverse? Well, let's dig into that.  Let's just say you still manage Exchange on premises, or more likely, you manage some IIS / SQL systems.  DSC has the potential to make setting those up and administering them significantly easier.  DSC can manage not only the simple stuff that GPO does, but also things way beyond that.  For example, here are just a few things.
    1. For Exchange:
      1. DSC could INSTALL Exchange for you.
      2. Configure all your connectors, including not only creating them, but defining all the "allowed to relay" entries and whatnot.
      3. Configure all your web settings (think removing the default domain\username).
      4. Install and configure your Exchange certificate in IIS.
      5. Configure all your DAG relationships.
      6. Set up your disks and folders.
    2. For SQL:
      1. DSC could INSTALL SQL for you.
      2. Configure your max and min memory.
      3. Configure your TempDB requirements.
      4. Set up all your SQL jobs and other default DBs.
    3. Pick another MS app, and there’s probably a series of DSC resources for it…
  3. DSC lets you know when things aren't compliant, and it can automatically attempt to remediate them. It can even handle things like auto reboots if you want it to.  GPO can't do this.  To the above point, what I like about DSC is that I'll know if someone went into my receive connector and added an unauthorized IP, and even better, DSC will whack it and set it back to what it should be.
  4. Part of me thinks that while DSC is cool, I wish Microsoft would just extend GPO to encompass the things that DSC does that GPO doesn't. I know it's because the goal is to start with non-domain joined systems, but nonetheless GPO works well and honestly, I think most people would rather use GPO over DSC if both were equally capable.

Conclusion:

Should you use DSC for domain joined systems?  I think so, or at least I think it would be a great way to learn DSC.  I currently look at DSC as a great addition to GPO, not a replacement.  My goal is going to be to use GPO to manage the DSC dependencies (like the certificates, as one example) and then use DSC for specific systems where I need consistency, like our Exchange, SQL and web servers.  At this stage, unless you have a huge non-domain joined infrastructure, and you NEED to keep it that way, I wouldn't use DSC to replace GPO.

 

Review: 4.5 years with Dell Poweredge Servers

Disclaimer:

Typical stuff: these are my views, not my employer's; they're opinions, not facts (mostly at least); use your own best judgement; and take my views with a quarry full of salt.

Introduction:

When I started at my current employer, one of the things I wasn’t super keen on was having to deal with Dell hardware.   I had some previous experience with Dell servers, none of which was good.  My former employer was an HP shop that had converted from Dell, and early on in my SysAdmin gig, I had the misfortune of having to work with some of the older Dell servers (r850 as an example).

You’re probably thinking this means I’ll be going over why one vendor is better than the other?  Nope, not at all.  I don’t have the current experience with the HP ecosystem to draw any conclusions like that.  I provided that information so that you the reader know that I write this all as a former HP fanboy.  I hope you’ll find it objective and informative.  I also hope that someone with some sway at Dell is reading this, so they can look for ways to improve their product platform.

Pros:

As always, I like to start out with the pros of a solution.

  • Hardware lifecycle: I find that Dell has a very good HW lifecycle.  They’re very quick to release new servers after Intel releases a new chip.  Maybe not a big deal for some, but for us, it’s a nice win.  In turn I also find that Dell keeps their current and former server models around for a very respectable amount of time.  If you’re in a situation where you’re trying to keep a cluster in a homogeneous HW configuration, this is advantageous.
  • Support: For general HW issues I find Dell to have fantastic support.  We use Pro Support for everything, which I'm sure helps.  When I say general HW support, I'm talking about getting hard drives or memory replaced.  To some degree they're even pretty good at troubleshooting more complex HW issues.  When you get above the HW stack, well, I'll chat about that a little bit later.  It's also worth mentioning that from what I can tell, support is 100% based in the US, which is surprising but certainly appreciated.
  • HW Design / Build: When you're comparing Dell to something like a Supermicro server, it's a night and day difference.  I find that the overall build quality of Dell's rackmount solutions is excellent.  Internally, most things are easy to get to and replace.  The cover is labeled well, as are things like the memory, which makes it easy to track down a problem DIMM.  Outside the server, the cable management arms are nice (if you use them).  Overall, the server is built sturdy, the edges are typically smooth (not sharp) and pretty much everything is tool-less.  I only have two gripes, which I'll add here instead of duplicating them in the cons.  I find their bezels to be somewhat useless.  We've stopped using them because we've run into cases where it's actually hard to get them off.  Also, depending on your chassis config, the bezel blocks some idiot lights.  Finally, and admittedly this is more of a personal preference, I HATE Dell's rails.  I think the theory is they're supposed to be installed vertically starting in the rear, and then the server lays into the rails.  The problem is the rails are just too damn flimsy when they're extended (all rails are; this isn't just Dell).  I personally prefer the slide-in style found on HPs.
  • Purchasing configuration: One thing I love with Dell is that each server is 100% customizable, or at least reasonably customizable.  I’ve worked with other vendors where you were forced to use cookie cutter templates or wait a month+ for a custom build.
  • Price: I can't say they're cheaper than every vendor out there, but if you compare them to HP, Cisco or Lenovo, I find that they tend to be the more affordable solution. For us, we go Dell direct and we're considered an enterprise customer.  Your mileage may vary, of course.
  • Sales team: I can't typically say this about many vendors, but I generally love working with Dell's sales team.  They're all very friendly and very responsive to requests.  It's one of those things where it's just nice to see a vendor meet what I'll call minimum expectations.  Dell's sales team does this for sure, and in many cases exceeds them.

And that is pretty much where the pros end.

Cons:

Like I've said in previous reviews, it's always easier to pick out the cons of a solution.  Chalk it up to taking things for granted, loss of perspective, etc.  I would say take some of these cons with a grain of salt; no vendor is perfect.  Some of this stuff I'm not just outlining for your information, but also so that perhaps a product manager for Dell's server solutions gets some much needed, unfiltered feedback.  I say unfiltered because, as a customer, I honestly feel like this stuff never makes it up to the right people, or when it does, it's watered down.

  • Innovation: Dell has absolutely ZERO innovation capabilities across their entire portfolio, and the PowerEdge line is no exception.  I've often joked that when Dell buys a company, that's where it goes to die a slow, non-innovative death.  Look at what they've done with EQL, Compellent, Ocarina, and Quest.  Seriously, the whole lineup is a joke compared to what's out and about nowadays.   I know this is supposed to be about PowerEdge servers (and I'll get to that), but if Dell keeps any of those storage products around post EMC merger, they're fools.  How does this apply to servers?  Take a look at Cisco UCS and then take a look at Dell.  Dell's server solution is fine if you have, say, fewer than 50 or so servers and most of them are virtual hosts.  The instant you start going above that, the more value there is in a solution like Cisco UCS.  If I was still running a 100+ physical server environment, no way I'd run Dell.  Why do I categorize this under innovation? Because there is nothing coming out from Dell that I feel compelled to write about.  There is no "wow" factor with Dell servers. When I first saw a UCS demo, THAT was a wow factor.  I don't like the Cisco architecture, but no one can say they're not innovative, and I'd totally jump in with Cisco if my environment was larger.
  • Central Management (physical servers): Getting into a little more detail on why I wouldn't buy Dell if I had a larger shop: they have a half-assed, practically non-existent server management solution.  They have this POS called Dell OpenManage Essentials that's supposed to take care of pretty much everything one would need to take care of with a Dell server.  The problem is it doesn't do a whole lot, and what it does do, it doesn't do very well.
    • Monitoring: OME would only tell us that there was a problem; it wouldn't tell us what the problem was.  So we'd still have to log into each server to find out what the specific problem was.  That wouldn't have been a big deal, except 95% of the time it was something dumb like the drivers / FW being out of date (really, do we need a warning about that?).
    • Remote FW and driver updates: This never worked.  I tried it a million times, and it would either only partially work or not work at all.  Installing things like drivers or new tools would just show "failed" with some cryptic reason.  Manually download the same update and run it by hand, and it works fine.
    • Server profiles: I didn't even try these, because honestly nothing else worked well, and it looks like a PITA to configure.  There is clearly no "vision" at Dell for how to make managing servers easy.  I don't mind doing some prep work here, but it's not like we're talking about standing up an OS + application; we're talking about a few server settings, some DRAC settings, and, icing on the cake, configuring and managing the local RAID cards.  I know it's probably not fair to put it down, and maybe it's a diamond of a solution baked into a crappy product, but I doubt it.
  • Central Management (VMware): They actually do "ok" here, but it's not great.  When the solution works (hit or miss), it does work pretty well.  Still, I'm only using it to manage FW updates, and even that only works part of the time.  There have been countless cases where I've needed to kill outstanding FW update jobs and soft reset the DRAC to shake things loose.  There are also times where I need to reboot the FW update appliance, and sometimes I have to reboot vCenter + do all of the above, because who knows what the problem is.
  • Non-standard HW: It's beyond frustrating that I can't say I want all Intel SSDs of a specific model.  Dell uses vague terms (low, mid or high write duty) to describe their SSD drives.  There's no way for me to know if I'm getting an SSD that can do 35k IOPS or 100k IOPS.  With HDDs it's mostly a commodity, but with SSDs there are very much big pros and cons depending on which SSD you use.  IMO, with HCI (however overrated the architecture is) becoming the hot new thing, having a standard SSD and HDD is a must.  You're now building around local storage characteristics, and you need predictability.
  • SSDs and HDDs are stupid expensive: Just a matter of opinion, but their prices are insane, especially for the SSDs.  I get that they might wear out prematurely, but then put in some clause that they're simply good for "x" writes or whatever.
  • DRACs: Their remote management cards have gotten better over the years, but they're still nothing compared to iLO.  They can still be somewhat unreliable, and they're only now starting to release an HTML5 interface instead of Java.  Adding to that, as mentioned above, I've seen multiple cases where FW updates stop working, and it's almost always the DRAC that's the issue.
  • Documentation and downloads: Dell is almost as bad as Microsoft when it comes to documentation and downloads.  Things are just scattered all over the place and lack consistency.  Yes, I can go to the drivers and downloads section for a server model, but there have been many times where I've seen the latest version of OpenManage Server Administrator (OMSA) on, say, the Dell PowerEdge r730 page, but NOT on the r720 page.

With things like the Dell Openmanage vCenter appliance, I also find it hard to find the latest updates or to know where to download my serial keys.  Documentation about the product is also difficult to find at times.

  • Standalone FW updates: Dell has methods for updating the HW on standalone systems, but they're all overly complex or not refined.  One method is booting off an SBU disc and then swapping CDs (ISOs) to load your repo.  It works (most of the time), but it's a PITA.  The other option is using their repository manager (another half-baked solution) to create a standalone FW update ISO that just blindly updates all HW that you've loaded a FW for.  Neither solution is seamless or refined.  I cringe every time I have to use either.
  • Support: I know I marked this as a pro, but I wanted to elaborate a bit here on the cons of support.  I know in many cases they're probably no different than other vendors, but I'm so sick of hearing "we can't do x or y until you've updated all your FW and drivers".  To me, I don't care what vendor you are, this is lazy, kick-the-can troubleshooting.  It's one thing if you can point to a release note in a FW or driver that describes the specific problem I'm having, but if you're not even looking at the release notes and just blindly telling me to update components x, y and z, that's you being lazy.  Here is what normally ends up happening anyway: I call you, you tell me to update x, y and z.  I begrudgingly do it after debating with you, the problem is NOT solved, and now you've just pissed me off and wasted my time (and depending on the server, other teams' time).

Conclusion:

At the end of the day, I'm sure you're thinking I hate Dell servers.  I don't, but they leave a lot to be desired.  I look at Dell as a high end Supermicro server.  I know I'll get good support from them, a solid server and, if I don't buy their disks, a reasonably priced server.  Because our environment is relatively small (35 servers or so), I can deal with the cons.  If I had a larger physical server environment, I would probably lean more towards Cisco, but at this size and below it's not worth the added cost or complexity of UCS.  That being said, if cost was no object, or if Cisco were to offer their solution at a more affordable price, I don't know why anyone would buy Dell.

Thinking out loud: Hyper converged storages missing link

Introduction:

In general, I'm not a huge fan of hyper converged infrastructure.  To me, it's more "hype" than substance at the moment.  It was born out of web scale infrastructure like Google, Facebook, etc., and IMO that is still the area where it's better suited.  The only enterprise layer where I see HCI being a good fit is VDI; other than that, almost every other enterprise workload would be better suited to new school shared storage.  I could probably go into a ton of reasons why I personally see shared storage still being the preferred architecture for enterprises, but instead I'll focus on one area that, if adopted, might change my view (slightly).  You see, there is a balance between the best and good enough.  Shared storage IMO is the best, but HCI could be good enough.

What’s missing?

What is the missing link (pun intended)?  IMO, it's external / independent DAS.  Can't see where this is going?  Follow along for why I think external DAS would make hyper converged storage good enough for almost anyone's environment.

Scaling Deep:  Right now the average server tops out at 24 2.5" drives, and fewer for 3.5" drives.  In a lot of larger shops, that would mean running more hosts in order to meet your storage requirements, and that comes at the cost of paying for more CPU, memory and licensing than you should have to.  Just imagine a typical 1U r630 + a 2U 60-drive JBOD!  That's a lot more storage than you can fit in a single host, and it would only consume one more rack unit than a typical r730.  Add to this, theoretically speaking, the number of drives you could add to a single host could go beyond a single JBOD.  A quad port SAS HBA could have four 60-drive enclosures attached, and that's a single HBA.

Storage independence:  Having the storage outside the server also makes that storage infinitely more flexible.  This is even true when you're building vendor-homogeneous solutions.  Take Dell for example: typically speaking, their enclosures are movable between different server generations.  With the storage stuck inside the chassis, it gets really messy (support wise), and in many cases isn't doable, to move the storage from one chassis to another, especially if you're talking about going from an older generation server to a newer generation.

Adding to this, depending on your confidence, white boxing also allows you to cut the server vendor out of the costliest part of the solution, which is the disks themselves.  Go with an enclosure from someone like RAID Inc., DataOn, Quanta QCT, Seagate, etc., add in a generic LSI (sorry Avago, oh sorry again, Broadcom) HBA, and now you have a solution that is likely good enough supportability wise.  JBODs tend to be pretty dumb and reliable, which just leaves the LSI card (a well known, established vendor) and your SSDs / HDDs.

Why do you want to move the storage anyway?  Simple: I'd bet a nice steak dinner that you'll want to upgrade or replace your compute long before you need to replace your storage. If you're simply replacing your compute (not adding a new node, but swapping one), then moving a SAS card + DAS is far more efficient than re-buying the storage, or moving the internal storage into a new host (remember, warranty gets messy).  Simply vacate the host like you would with internal storage, shut down, rip the HBA out, swap the server, put the existing HBA back in, done.

If you're adding a new host, depending on your storage, you may have the option of buying another enclosure and spreading the disks you have evenly across all hosts again.  So if, for example, you had 50 disks in each of 4 hosts (200 disks total) and you add a fifth host, one option could be to simply remove 10 disks from each current node and place them in the new node. Your only additional cost is the JBOD enclosure, and you continue to get value out of your current investment in disks (with flash, that would be the expensive part).

Mix and match 3.5" / 2.5" drives:  Right now with internal storage, you are either running a 3.5" chassis, which doesn't hold a lot of drives but CAN support 2.5" drives with a sled, or you are running a 2.5" chassis, which guarantees no 3.5" drives.  External DAS could mean one of two options:

  1. Use a denser 3.5” JBOD (say 60 disks) and use 2.5” sleds when you need to.
  2. Use one JBOD for 3.5” drives and a different one for 2.5” drives.

Again it comes down to flexibility.

Performance upgrades:  Now this is a big "it depends".  Hypothetically, if there were no SW-imposed bottlenecks (which there are), one of your likely bottlenecks (with all flash at least) is going to be either how many drives you have per SAS lane, or how many drives you have per SAS card.  For example, if your SAS card is PCIe 3.0 but the PCIe bus is 4.0, there's a chance you could upgrade your server to a newer / better storage controller card.   More so, even if you were stuck on PCIe 3 (as an example), there would be nothing stopping you from slicing your JBOD in half and using two HBAs to double your throughput.  Before you even go there, yes, I do know the 730xd has an option for two RAID cards, glad you brought that up.  Guess what: with external DAS, you're only limited by your budget, the number of PCIe slots you have, and the constraints of your HCI vendor.  I, for example, could have 4 SAS cards and 2 JBODs, each partially filled and each sliced in half.  You don't have that flexibility with internal storage.

In the case of white boxing your storage, this also means that, to the extent of the HCL, you can run what you want.  So if you want to use all Intel DC S3700s, you can.  Heck, they're even starting to make JBOF (just a bunch of flash) enclosures for NVMe, which again would be REALLY fast.

Conclusion:

I say external DAS support is the missing link because it's what would allow HCI to offer scaling flexibility similar to what exists with SAN/NAS.  I still think the HCI industry is at least 3 – 5 years out from matching the performance, scalability and features we've come to expect from enterprise storage, but external storage support would knock a big hole in a large facet of the scalability win SAN/NAS currently holds.

Wish list: Nimble Storage (2016)

Introduction:

These are just some random things that I would love to see implemented by Nimble Storage in their solution.  I don’t actually expect most of this stuff to happen in a year, but it would be interesting to see how many do.  Some are more realistic than others of course, but part of the fun in this is asking for some stuff that perhaps is a bit of a stretch.

Hardware:

  • Better than 10G networking: Given that we now have SSDs becoming the storage of choice, it's more than likely the networking that will bottleneck the SAN. I understand Nimble offers four 10G links per controller, but this is less ideal than two 40G links.  40G is now available in all respectable servers and switches; I see no reason why Nimble shouldn't offer the same.
  • Denser JBODs: I think Nimble has decent capacity per rack unit, but they could do a lot better. Modern JBODs offer as many as 84 disks in 5U and 60 disks in 4U for 3.5 inch drives, and 60 disks in 2U for 2.5 inch drives.  I would love to see Nimble offer better density.  Just like they do now with SSD cache, I see no reason why the JBOD needs to be filled at time of purchase.  Make 15-disk packs available that are installable by customers, then simply add an "activate" option for the new disk span.
  • Cheap and Deep: Nimble now has a great storage line for most tier 0 – 3 workloads, but I still feel they're missing one additional important tier: backup. Yes, you could use their hybrid arrays for backup, but at the current cost structure, and based on the problem I raised above (rack units), I don't feel Nimble's CS series is a prudent use of resources for backup.  Most backup data is already compressed, and random IO tends not to be a constraining factor.  I'd love to see an array designed for high throughput sequential workloads, with lots of non-compressed usable capacity.  Personally, I'd even be ok if they offered a detuned CS series that disabled caching and compression to ensure the array is used for its designed purpose.  What I'm NOT asking for is yet another dedupe target.
  • More scalable architecture: Just my opinion, but I think the whole idea of keeping the controllers in the chassis, while great for smaller workloads, doesn't make as much sense for larger storage implementations.  There is a reason EMC, NetApp and others have external controllers.  It allows much greater expansion, and theoretically better performance.  I get the idea of having a series that can go from the smallest to the largest with a simple hot swappable controller, but with external controllers that would be doable as well, and it would more than likely work across completely different generations.  This would also allow more powerful controllers (quad socket as an example).  If CPUs are truly the bottleneck, then why limit the number of cores and sockets you have access to by using a chassis that restricts such things?
  • All NVMe array: I know the tech is still new, but it would be awesome to see an NVMe array in the near future.

Hardware / Software:

  • Storage Tiers: It's either all flash or hybrid, but why not offer both under the same controller?  Simply rate the controller based on peak IOPS, and then let us pick what storage tier(s) we want to run underneath.  Any spinning disk would go into a hybrid tier and any flash would go into a flash tier.  I'm not even looking for automated tiering, just straight up manual tiers.

Software:

  • Monitoring: I would really like to have more metrics available via SNMP that are typical on other arrays. Queue depth, average IO size, per disk metrics, CPU, RAM, etc.
  • File Protocols: Now that you have FC + iSCSI, why not add SMB + NFS as well?  I would love to see Tintri like features in Nimble.
  • Replica destinations: I’ve been asking for more than one destination for 4 years, so here it goes again.  I would like to replicate a volume to more than one destination.  If I have a traditional failover cluster, I want to replicate it locally and to DR via a snapshot.  Right now, that’s not doable TMK.
  • Full Clones: Linked clones are great for cases where you think you’re going to get rid of a volume after X period of time, but they really don’t make sense long term.  I would love an option to create full clones, or even promote existing linked clones into full clones.  Maybe I want to get rid of the parent volume and keep the clone, I can’t do that right now.
  • Full featured client agent: I would LOVE for clients to have the ability to do the following actions.  This would allow my DBAs to do their own test / dev refreshes without me needing to delegate them access to the Nimble console.
    • Initiate a new snapshot (either at a volume group level or an individual volume)
    • Pick from a library of existing snapshots to mount
    • Clone a volume and mount it automatically
    • Delete a volume / snapshot / clone.
    • Swap a volume group (this would be a great workflow feature for UAT refreshes).
  • Virtual Appliance GA'ed:  How about letting me have access to a crippled virtual appliance so I can test out some things that might be a little risky to try in prod?  Everyone in IT knows it's not a good practice to test in prod, and yet with storage, we have no choice but to do just that.
  • Better Volume Management:  If there is one model that everyone should try to duplicate when it comes to management, it's Microsoft's ACL propagation and inheritance structure.  I'm tired of going into individual volumes to change things like ACLs.  Even using volume groups, while it helps, doesn't eliminate the need.  I would much rather group volumes in folders (or even better, tags) and apply things like protection policies and ACLs globally, rather than contend with trying to script static entries.  Additionally, being able to use tags is, IMO, far smarter than using folders.  Folders are rigid, tags are flexible.  Basically, a tag can do everything a folder can do, but the reverse is not true.
  • Update Picker:  I like running the latest version of NOS, but for consistency's sake, I also like having my SANs on the same revision.  There have been times where we were in the middle of a series of SAN upgrades and you released a new minor version.  When I go to download the NOS version for the SAN I'm updating, it no longer matches the rev on the SANs I've already upgraded.  So ultimately I end up having to call support, request the older rev, and manually download it.
  • Centralized Console:  It would be great if you guys came out with something like a vCenter server for Nimble: a central on premises console that I could use to do everything across all my SANs.  This would mitigate the need to use groups for management reasons, and instead would allow me to manage every SAN.  I could easily see this console being where things such as updates are pushed out, new volume ACLs are created, performance monitoring is done, etc.
  • Aggressive Caching in GUI:  Offer an option right in the GUI to use default (random only), sequential and random, or pin to cache.
  • Web UI improvements:
    • Switch to a nice and fast HTML5 console.
    • Show random / sequential and write / read in the time line graph, and show it as a stacked graph instead of a line graph.
    • Don’t show cache misses for sequential IO.
    • Show CPU usage
    • Show the top 5 volumes for real time, last minute, last 5 minutes, last 15 minutes, last hour and last day for the following metrics.
      • Cache misses
      • Total IOPS
      • Total throughput
      • CPU usage
  • Use GUIDs, not friendly names:  I would actually like to see Nimble switch to using a generic GUID for the volume ID, and then have a simple friendly name associated with it.  There are times where I've wanted to change my naming convention, but doing so would require detaching the volume from the host that's serving it.
  • Per Volume / Per Host MPIO policy:  Right now it seems the entire array needs to be enabled for intelligent MPIO.  I would like to see an additional option to only do this for certain initiator groups or certain volumes.
  • Snapshot browser:  I would love a tool that would allow us to open a snapshot through the management interface and browse for files to recover, rather than having to mount the snapshot to an OS.  Even better would be if I could open a VMware snapshot and then open a VMDK as well.

Problem Solving: CommVault tape usage

Introduction:

I hate dealing with tapes, pretty much every aspect of them.  The tracking of them is a PITA, having to physically manage them is a PITA, dealing with tape library issues is a PITA, dealing with tape encryption is a PITA, running out of tapes is a PITA, dealing with legal hold for tapes is a PITA, and I could keep going on with the many ways that tape just sucks.  What makes matters worse is when you have to deal with MORE tapes.

Now that you know tapes are one of my personal seven levels of hell in IT, you’ll know why I put a bit of time into this solution.  Anything I can do to reduce the number of tapes getting exported every day, ultimately leads to some reduction in the PITA scale of tapes.

The issue:

To provide a better understanding of the issue at hand: for years I've been seeing way too many tapes being used by CV.  We'd kick out tapes that had 5% or 10% consumption, and the number of tapes with that level of consumption varied based on what phase of our backup strategy we were in and what day of the week it was.  It could be as little as 4 partially filled tapes, or as many as 10+ tapes that weren't filled all the way.  If the consumed data should fit on 16 tapes and we're kicking out 26 tapes, that's a problem IMO.  I'm sure many of you out there have contended with this in CV specifically, and I'd bet those of you using other vendors' products have run into this too.  I'm going to first explain why the problem occurs, and then I'll go over how I've reduced most of the waste.

The Why?

In CV, we have storage policies, and short of going into an explanation of what they are for those not familiar with CV, just think of a storage policy as an island of backup data.  That island of data doesn't co-mingle with other islands of data on disk, and tape is no exception.  What that means is when you back up data to a storage policy and want to copy it to tape, the data getting copied to tape will automatically reserve the entire tape being used.  In turn, each storage policy reserves its own unique tapes so that data does not co-mingle.  This means for every storage policy you have, you're guaranteed at least one unique tape at a minimum.  Now, each storage policy can have a number of streams configured.  To keep things simple, let's just ignore multiplexing for now.  When a storage policy has a stream limit of 1, that means only 1 tape drive will be used; when it has a stream limit of 4, that means 4 tape drives will be used.  Now, as you copy data to tape, you normally have more than 1 stream's worth of data; you probably have at least one stream for each client in your environment (and likely many more than that).  This is a good thing: having more streams means we can run data copy operations in parallel.  In the case of the 4 streams example, that means we can use 4 tape drives in parallel to copy data for the example storage policy.  What this also means is that, depending on circumstances, we could end up with 4 tapes not being filled all the way.  Streams are optimized for performance, NOT for improving tape utilization.  Now, imagine you have more than one storage policy, let's just say 4 storage policies, each being their own island, and each with a stream limit of 2.  That means you could end up with up to 8 tapes not being fully utilized.  I'm also ignoring for now that in CV, you can separate incrementals and fulls into different storage policies, which exacerbates the problem further (taking one island and making it two).

In our case, we have 4 storage policies, and we had already gone through a process of merging our fulls and incrementals into a single storage policy to consolidate tapes.  We have a total of 6 tape drives, which means if we just configured the storage policies to fight over the tape drives at 6 streams each, we could in theory end up with 24 partially filled tapes.  We're smarter than that of course, so that wasn't our problem.  Our problem was finding the right balance between how many streams a storage policy needed to copy all its data within our window, and not making it so high that we ended up wasting tape.  Pre-solution, we almost always had 4 – 6 tapes that were wasted, as in 100GB on a 2000GB tape.  It was annoying and wasteful.

Solution, problems again, improved solution:

There are two main components to the solution.

  • Scripting storage policy stream modification via task scheduler (MVP JAMS in our case).
  • CommVault introducing Global Tape Policies in v11
    • This allows tapes to be shared, no longer residing on an island as mentioned above. So storage policy 1, 2, 3 and 4 can all share the same tape.  Way more efficient.

In our case, when we saw the global tape policy, it was like a halo of light and angels singing going off in our heads.  This was it, our problems were FINALLY solved.  After going through the very tedious task of migrating to this solution, we found that we were still using 4 – 6 tapes a day more than we needed.  The problem was not that data wasn't co-mingling; it was.  No, the problem was that we set the global tape policy to 6 streams, and every day it was using 6 tape drives for backups.   At first we tried to solve the problem by limiting the aux copy streams via a scheduled task in CV (start the job with 1 stream only, as an example), but we had 4 storage policies, so that only reduced the tape usage to 4.  The problem again was that each storage policy was scheduled and run in parallel.  So while we restricted any one storage policy, ultimately we were still letting more tape drives be used than needed, and in turn more tapes than needed.  We had set 6 streams because we wanted to make sure that our FULL jobs had enough tape drives to complete over the weekend.

At this stage, I came to the conclusion that we needed a way to dynamically control the streams for the global tape policy so that during the weekdays it was restricted to 1 tape drive (all we needed), and on the weekend we could start out with 6 and slowly ramp back down to 1, and hopefully fill our tapes more completely.  With a bit of research and some discussions with CV, I found out that they have a CLI option for controlling storage policy streams (found here: https://documentation.commvault.com/commvault/v10/article?p=features/storage_policies/storage_policy_xml_edit.htm).  Using my trusty scheduling tool, I set up a basic system where on Sunday @4PM we set the streams to "1", on Friday @4PM we raise them to "6", and on Saturday @7AM we drop them to "2" (a rough sketch of that scheduling logic is below).  This basically solved our problem, and I'm happy to say that on weekdays tapes are filled as much as possible (1 – 2 tapes depending on which client ran a full), and on the weekend 2 – 4 tapes are still being used.  I'm still tuning the whole thing for the fulls (it's a balance of utilization and performance), but it's better than it's ever been.  It's also worth noting that we went back and modified our aux copy schedules and told them to use all available streams, since we now choke-point everything at the global tape policy.  This allows any storage policy to go as fast as possible (although potentially blocking other ones).
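For what it's worth, here's a rough PowerShell sketch of the day-of-week logic only; it's not a working CommVault integration.  The stream counts match the schedule above, but the XML request file is something you'd build yourself from the storage policy XML edit process CommVault documents (linked above), the file path and names are made up, and it assumes you've already authenticated to the CommServe with qlogin.

#Pick the stream count for today; do nothing on days we don't change anything
$schedule = @{
    Sunday   = 1   #back down to a single tape drive for the week
    Friday   = 6   #open things up for the weekend fulls
    Saturday = 2   #start ramping back down
}

$today = (Get-Date).DayOfWeek.ToString()
if (-not $schedule.ContainsKey($today)) { return }

#Hypothetical pre-built XML request per stream count, created per the CommVault doc above
$xmlTemplate = "C:\Scripts\CV\SetGlobalTapePolicyStreams_$($schedule[$today]).xml"

#Hand the request to the CommVault CLI (qlogin beforehand, qlogout after)
& qoperation execute -af $xmlTemplate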

It's a hack, no doubt.  IMO, CV should build this concept into their storage policies: basically, a schedule window that dynamically controls the queue depth.  For now, this is working well.

SQL Query: Microsoft – WSUS – Computers Update Status

Sometimes the WSUS console just doesn't give you the info you need, or it doesn't provide it in a format you want.  This query is for one of those cases.  It can be used in multiple ways to show the update status of a computer, a set of computers, or the computers in a computer target group.  For me, I wanted to see the update status without worrying about which non-applicable updates were installed.  I also didn't care about updates that I hadn't approved, which was another reason I wrote this query.

First off, the query is located here on my GitHub page.  As time allows, I plan to update the read me on that section with more filters as I confirm how accurate they are and what value they may have.

All of the magic in this query is in the “where statement”. That will determine which updates you’re concerned about, which computers, which computer target groups, etc.

To begin with, even with lots of specifics in the "where statement", this is a heavy query. I would suggest starting with a report about your PC or a specific PC before using this to run a full report. It can easily take in excess of 30 minutes to an hour to run if you do NOT use any filters and you have a reasonably large WSUS environment.   It's also worth noting that in my own messing around, I've easily run out of memory / tempdb space (over 25GB of tempdb).  It has the potential to beat the crap out of your SQL DB server, so proceed with caution.  My WSUS DB is on a fairly fast shared SQL server, so your mileage may vary.

Let's go over a few ways to filter data. First, the computer name column is best served by using a wildcard ("%", not "*") at the end of the computer name, unless you're going to use the FQDN of the computer.  In other words, use 'computername%' or 'computername.domain.com'.

Right now, I’m only showing updates that are approved to be installed. That is accomplished by the Where Action = ‘Install’ statement.

The "state" column is one that can quickly let you get down to the update status you care about. In the case of the example below, we're showing the update status for a computer called "computername", but not showing non-applicable updates.


Where Action = 'Install' and [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like 'computername%' and state != 1


If we only wanted to see which updates were not installed, all we'd need to do is the following. By adding "state != 4" we're saying only show updates that are applicable and not currently installed.


Where Action = 'Install' and [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like 'computername%' and state != 1 and state != 4


If you want to see the complete update status of a computer, excluding only the non-applicable updates, this will do the trick.  That said, it's a BIG query and takes a long time.  As in, go get some coffee, chat with your buds and maybe play a round of golf.  You might run out of memory too with this query, depending on your SQL server.  In case you didn't notice, I took out the "where Action = 'Install'" entirely.  As in, show me any update that's applicable, with any status and any approval setting.


Where [SUSDB].[PUBLIC_VIEWS].[vComputerTarget].[Name] like 'pc-2158%' and state != 1
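If you'd rather run one of these from PowerShell and hand the results off as a CSV, here's a hedged example.  Invoke-Sqlcmd comes with the SqlServer (or older SQLPS) module; the file paths and server name below are made up, and the .sql file is just the query from the GitHub page saved locally.

$query = Get-Content -Path "C:\Scripts\WSUS-ComputerUpdateStatus.sql" -Raw
Invoke-Sqlcmd -ServerInstance "sql-wsus-01" -Database "SUSDB" -Query $query -QueryTimeout 3600 |
    Export-Csv -Path "C:\Reports\wsus-update-status.csv" -NoTypeInformation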


Play around yourself and I think you'll see it's pretty amazing all the different reports you can create.  I would love to say the WSUS DB was easy to read / figure out, but IMO it's probably one of the more challenging DBs I've had to figure out.  There are sometimes multiple joins needed in order to link together data that you'd think would have been in a flat table.  I suspect that, combined with missing indexes, is part of the reason the DB is so slow.  I wish MS would simplify this DB, but I'm sure there's a reason it's designed the way it is.

SQL Query: Microsoft – WSUS – Computers to Computer Target

This is a simple query you can use to map your computers to their various target groups in a nice, easy to export table.  I'd love to say the script is sexier than that, but it's really not.  You can find the SQL query here.

There is only one section worth mentioning because it can change the way a computer is mapped to a computer target.


Where [SUSDB].[dbo].[tbExpandedTargetInTargetGroup].[IsExplicitMember] = 1

Powershell Scripting: Invoke-ECSSQLQuery

A quick Powershell post for those of you that may on occasion want to retrieve data out of a SQL table via Powershell.  I didn't personally do most of the heavy lifting on this; I simply took some work that various folks out there did and put it into a repeatable function.

First, head over here to my GitHub if you want to grab it.  I'll be keeping it updated as change requests come in, or as I get new ideas, so if you do use my function, make sure you check back in on occasion for new versions.

The two examples are below:

Syntax example for windows authentication:


Invoke-ECSSQLQuery -DatabaseServer "ServerNameOnly or ServerName\Instance" -DatabaseName "database" -SQLQuery "select column from table where column = '3'"


Syntax example for SQL authentication:


Invoke-ECSSQLQuery -DatabaseServer "ServerNameOnly or ServerName\Instance" -DatabaseName "database" -SQLUserID "SA" -SQLUserPassword "Password" -SQLQuery "select column from table where column = '3'"


There is also an optional "timeout" parameter that can be used for really long running queries.  By default it's 30 seconds; you can set it as high as you want, or specify "0" if you don't want any timeout.
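To close this out, here's a hedged usage sketch showing what consuming the output might look like.  The parameters match the syntax examples above, but the server, database, query and export path are all made up, and how you handle the returned rows is entirely up to you.

$rows = Invoke-ECSSQLQuery -DatabaseServer "ServerName\Instance" -DatabaseName "SUSDB" -SQLQuery "select Name from PUBLIC_VIEWS.vComputerTarget"

$rows | Format-Table -AutoSize
$rows | Export-Csv -Path "C:\Reports\query-results.csv" -NoTypeInformation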