Thinking out loud: Why do server vendors still struggle with driver and firmware management?

History:

Let me give you a little back story before digging into the meat of this post.  My team and I make a very concerted effort to keep our servers firmware and drivers updated.  We’ve gone so far as to purchase software from Dell, implement a process on how firmware / drivers are to be updated, and ensuring that its routinely done every quarter.  We do this because in general it’s a best practice, but also because we’ve run into too many occasions where troubleshooting with a vendor stops (if it ever starts) very quickly if the drivers / firmware isn’t recent.  In essence, we’re doing our best to be diligent and proactive with keeping our servers healthy, secure and updated.

Late last year we ran into two issues, both of which are related to drivers / firmware.

  1. A Broadcom NIC causing a purple screen of death (PSOD) in ESXi. This was a server that was freshly rebuilt and all drivers (or so we thought) and firmware updated.  Turns out the driver we were running was more than two years old and the PSOD we were having was a resolved issue in a newer driver.
  2. An Intel x710 quad port 10Gb NIC causing packets to black hole for certain VM’s on the same vLAN. Again, these were new hosts that were patched, firmware updated and in theory up to date.  This issue is what really triggered us to start evaluating other server vendors and their solutions.

 

Of those two issues above, only one was solved with a simple driver update, and the other, we just gave up on and switched to a different NIC (x520).

The Issue:

If you can’t see where this post is going already, let me lay it out.  Server vendors still can’t properly manage their own drivers, firmware and vendor specific software. I know what you’re thinking, you have tools that the vendor provided, you’re using them and you’re fine.  I hate to be the bearer of bad news, but I doubt it.  We just got done a rigorous evaluation of Dell, HP and Cisco.  None of them have a complete solution.  Don’t get me wrong, some of them are getting there, but no one has the problem solved.  If you’re wondering what the specific problems are, see the bullets below.

  • Server vendors would like that you keep your firmware, drivers and tools up to date.
  • OS vendors (and sometimes server vendors) require that the driver and firmware have a certified pairing. It is NOT good enough to simply have the latest driver and the latest firmware.  This of course may vary slightly depending on which server and OS vendor.  VMware though as an example, absolutely has driver and firmware paring that’s required.
    • This driver and firmware pairing is typically worked out between the server and OS vendor.
    • VMware has a strict HCL for this use case, and TMK, MS has an HCL, although they’re a little more forgiving when it comes to the pairing. I can’t speak about Linux, Solaris or other OS’s.

 

Think about this, when was the last time you did the following?

  • Retrieved an inventory of all your hardware firmware revisions and driver revisions.
    • Do you even know how to do this? Probably not as easy as you think.
  • Logged into your OS vendors HCL and one hardware item at a time, checked if you are running the latest driver and firmware, and that the pairing is also certified.
    • With VMware, you can use the device vendor ID, device ID and sub vendor ID to find the specific hardware in question on their HCL. Just remember its hex values.

I bet you’re either relying on the following.

  • VMware update manager, and vendor provided depots (if they exist).
  • Vendor supplied firmware management solutions.
    • Some may have driver management for select OS, but no one does it all.
  • Vendor custom ISOs / install discs.

I’m going to suspect if you go and check your VMware HCL, you’re out compliance in one way or another, or something is woefully out of date.

Solution:

Let’s share a pipe for a second and dream about what it should look like.

 

Server Vendor:

  • It should be a central console.
  • It should handle downloading all firmware, drivers and vendor specific tools.
  • It should use a concept of baselines. A baseline being defined as.
    • OS and server model specific.
    • Based on a release date. The baseline should define an approved pairing of drivers, firmware and vendor tools for a given month, quarter or however often the server vendor feels the need to establish a new baseline.
    • The baseline should support the concept of cumulative updates / hotfixes.
  • It should support grouping servers, and applying baselines to the server groups.
  • It should support compliance checking for the baselines. Not simply deploying the drivers and firmware and assuming everything is ok.  This would let you know if an admin went rouge and manually updated or downgraded firmware / drivers or tools.
  • It should support rolling back drivers, firmware or tools if it is determined to be too far ahead.
  • Provide very verbose information as to why an update process failed.
  • Bonus points
    • Support a multi-site architecture. Meaning be able to have a cached copy of the repo and a local server to perform the actual update and auditing process.
    • Auto discover servers

OS Vendor:

  • Should provide a comprehensive API
    • Look up hardware and driver pairings.
    • Enable the download of the driver or firmware directly would be nice.

Conclusion:

Coming back to reality a bit, what can you do?  Use the tools you have to the best of your ability, script what you can, and manually deal with the gaps.  That said, I’m working on the auditing part of the problem, at least with VMware. Hope to have blog post about it and a new GitHub commit in the coming month or so, stay tuned.

Oh, one final thing you can do, start bugging your server vendor sales team about the issue. If enough people raise the issue, it will get the attention it desperately needs.

Thinking out loud: What HP + Nimble means to me

Disclaimer:

These are opinions, not facts and these opinions are mine, not my employers.

Introduction:

Upon receiving the news that HP was to acquire Nimble, I can’t say I was exactly thrilled.  Nothing personal against HP, they make great servers, but I like Nimble the way it is right now.  None the less, I know the industry is moving in a direction where its either get big, or get out.  There is a huge storm “cloud” looming, and if you’re an on premises solution, its going to be a scary time in the coming years.

I was thinking about what would be some of the pros and cons of the HP acquisition and this is what I’ve come up with.

Pros:

  1. HP is a big company and an established one at that. We’ll focus on the pros being big/established in this section.
    1. HP will likely have an easier time pitching Nimble into companies that would not have given them a second look. HP is established, so there’s a perception that Nimble is established.  This leads to better market penetration.  Nimble getting better market penetration means Nimble makes HP more money and if Nimble makes HP more money, HP invests more into Nimble.  Hopefully the circle of money keeps snow balling and we all win.
      1. HP is a world wide company and while Nimble has done a great job so far, HP is going to take them into more countries faster than they could on their own. If you’ve had difficulty getting Nimble equipment purchased “in country”, I can see this getting easier long term.
    2. Obviously HP has more capital at their disposal than Nimble did. If invested correctly, I could see this accelerating Nimble’s innovation.
    3. HP has more purchasing power than Nimble does, this could lead to Nimble’s margins being better, which in turn may lead to us having a more affordable product (or more profit for HP).
  2. Look, we all know why tech companies pick Supermicro, and its not quality, its affordability. HP makes some pretty kick ass hardware, so if we were to see Nimble’s hardware platform change from Supermicro to HP equipment, not only would my datacenter look a little sexier, I wouldn’t cut my fingers trying to rack Nimble anymore.
  3. If you’re a current HP customer, I can see two nice integration points.
    1. Infosight for other HP solutions.
    2. Nimble integration into OneView.

Cons:

  1. As mentioned in the pros, HP is a large established company. While this in its self can have some pros, it also has the potential for a number of cons.
    1. Big companies tend to move slow, bureaucracy and over analysis being suspect causes. Nimble had far fewer hoops to jump through before making a decision.  Just remember, deadlines and accomplishments drift a day at a time.  Days become weeks and weeks become months, and you get the picture.
    2. Every company is profit conscious, but some larger companies will kill any sliver of waste, even at the cost of productivity or customer satisfaction. I’m not saying it will happen with HP, only that it could.
    3. While HP will open a lot of new doors for Nimble, it has the potential to close a lot of existing ones too. There are a lot of companies that have had bad experiences with HP and this may be enough for them to drop Nimble.  That said, being realistic, it seems one way or another, you’re going to be purchasing storage from some big vendor, and it may not be the same as the one you purchase servers from.
    4. If HP tries to assimilate Nimble into their ways, I can see this being bad for Nimble customers. Nimble for example has a great support experience.  If HP tries to force Nimble to adopt their triage and support structure, that would be a quick way to devalue Nimble.  There’s other things too, like getting stuck speaking with a generic HP sales rep and sales engineer, instead of having direct access to a Nimble SE and a Nimble sales person, or other typical large company sales and support processes.
  2. HP hasn’t made a great name for themselves here of late. We know they’ve split the company in half and sold off a lot of assets.  It’s hard to say if it’s too little too late, or if it was the right move and just in time.  Regardless, HP to me is a company that’s walking a fine line of a falling giant, or one that’s getting back on its feet.  If HP goes down, Nimble goes with it, and that’s not good for Nimble customers.
  3. HP isn’t exactly synonymous with innovation, at least not any more. I fear that HP has the potential of choking the life out of Nimble.  In my opinion, 3Par was a great storage solution.  Part of me wonders if HP couldn’t make that work, what makes them think Nimble will be any different?  Meaning, are they going to turn Nimble into the next Equallogic?

Other thoughts:

I think deep down everyone knew Nimble wanted to get bought.  Me personally, I was REALLY hoping Cisco was going to buy Nimble.  In my opinion, Cisco + Nimble would go together like peanut butter and fluff.  HP already has a storage company that’s flailing.  I don’t want Nimble to follow suite.  People like to remind me about about Whiptail and how bad that was.  I look at that as a rash move on Cisco part (the solution was doomed to fail), but Nimble would be a pick that no one could blame Cisco for.  Best of all, Cisco doesn’t have any competing products (other than Hyperflex, but that’s a different type of solution).  This would have led to a much stronger and untied focus on pushing Nimble.  From Nimble’s view, it would have solidified them as being established (opening the closed doors), and for Cisco, it would have given them a proven storage startup that’s on fire.  Honestly, if I was Cisco’s CEO, I would be doing everything I could to steal the deal from HP.  If it was a matter of HP vs. Dell vs. Cisco, and Cisco was the one with Nimble, IMO, Cisco would crush the other two like a ten-ton hammer.

Conclusion:

This is obviously all speculation at this point, just thinking out loud.  I hope all the pros of what I pointed out occur with the acquisition and none of the cons.  I wish both vendors the best of luck, and until proven otherwise, I’m still a diehard Nimble (HP) fan.

Naming Conventions: Server Names

Introduction:

One of the things I’m struggling with, is how to balance the number of posts per naming convention.  It would be easy in some ways to use a single post per server type, but it would also be overly redundant in many ways.  I originally wrote a dedicated post to SQL server naming conventions, and realized that the logic behind its name is ultimately a similar logic for other server names.  With that said, I’ve decided to create a consolidated post for server names.  I’ll rehash the overall structure used for the SQL naming convention, and show you how its reusable for other servers.

Limitations:

To begin with, 15 characters is a length limitation I would always suggest maintaining.  The only exception is in the following circumstances.  If you’re building a server that isn’t Windows, and will NEVER need to join a Windows domain, then and only then can you make names longer than 15.  Microsoft in their perpetual need to maintain excessive backwards compatibility, still hasn’t dropped NetBIOS out of its architecture.  Even if you build a Microsoft domain, 100% running on DNS resolution, they still truncate names for backwards compatibility.  I wish they would provide a naming resolution compatibility mode that would in essence switch the domain from NetBIOS supported to DNS only, but that’s a whole different blog post.

My naming conventions are designed to scale for smaller to mid-sized companies.  If you’re dealing with 10s of thousands of servers, this naming convention won’t scale to your needs.  My names are designed to give you a hint of what the server does.  When you’re at the 10s of thousands size, you need a whole new way of dealing with server names.

Other stuff:

I used to really get hung up on server numbering.  What I mean by that, is if I was running a smaller shop with say two domain controllers.  Let’s call them DC1 and DC2 for simplicity.  I would want to keep those names every time I did a major upgrade.  Ultimately it led to a lot of shuffling, and a lot of time spent for something that was ultimately cosmetic and not important.  Point being, if you are or were like me, learn to let it go.  Sometimes you will have a DC3 and a DC4 when there are no DC1 and no DC2.

When you’re dealing with systems that need to connect to other systems by name, learn that CNAME records can be your best friend. I strongly suggest to avoid pointing things directly at a server name, if whatever you’re pointing is mostly a generic service.  For example, its very common to have many things pointing at your DC’s for LDAP lookups.  Rather than doing something like pointing at DC1 and DC2, create a CNAME record for something like “ldap1.domain.com” and “ldap2.domain.com”.  Then when you change your DC’s, you only have two records to change.

It’s expensive, but load balancers can also help with renaming / moving things.  They help because of their ability to create a “virtual IP” and redirect that traffic to any real IP as needed.  In the case of DNS, it would enable you to move your DNS functionality to a new server without having to change the IP you have configured across all your systems.

My naming convention basics:

There are a few basics planned into my naming conventions.  These basics make it easier to organize servers, and ultimately find / figure out what a server’s purpose is.  Obviously with a 15-character limit, there are going to be a lot of abbreviations.  In my opinion, so long as you’re consistent, even vague abbreviations will eventually be memorable or make sense.

Hyphens:

I use hyphens to separate may of the naming conventions purposes.  I know its potentially wasting very precious characters, but at our scale, we can mostly afford to do it, in exchange for making it easier to programmatically find servers.  You don’t need to use hyphens, I do it because its easier for me to script with.  Plus, they make for a consistent separator.  Ultimately though, consistency as I stated above, is what’s important.

Location variable:

I prefer to start all names with either a location variable.  At two completely different companies, I inherited naming conventions that used their company’s acronym for their primary site, and DR for their disaster recovery site servers.  There are two problems with using something this descriptive.

  1. I’ve worked for a company that flipped the location of their DR and primary site. It meant for a very confusing period of time where some servers might have said “DR” but were actually now in the primary headquarters, and vice versa.
  2. If you’re company has more than one office or more than one DR site, the naming convention kind of falls apart.

The main goal for the server location should be generic, but consistent.  Using something like AA1, is just as likely to suffer from the issue above in point 1, but it does solve the issue of problem 2.  If you have multiple locations, you just keep incrementing the number.  So AA2, AA3 ….AA9, and then increment the letter to AB1.  It leaves a TON of room for different / unique locations.  Heck, maybe you don’t even need the double letter.  Math isn’t my strength, but if my calculations are correct even something like one letter + one number = 234 locations, and that assumes you never use the number “0”, in which case it would be 260 locations.

Application:

I like to use something short (as in 3 letters or less) to tell me something about the application or purpose of the server.  For example, I might use “SQL” for a SQL Database Server, or RMQ for a Rabbit MQ server, or EXM for an Exchange Mailbox server.  It can get a little tricky of course, after all you have MySQL and MSSQL, but maybe that doesn’t matter.  After all, a SQL DB is a SQL DB.  The reason I keep it three letters or shorter (on average) is to leave room for a clustering naming conventions (coming up).  Of course if you’re not limited by the 15 characters, you can get a lot more verbose, but at least for those of us in windows shops, that’s tough.

Environment:

This one is short and easy, I use a few letters to denote the environment of the server.

  • P = Production
  • S = Stage
  • U = UAT
  • D = Dev
  • T = Test
  • X = Sandbox

Clustered or Standalone:

I like to denote if this is a clustered system or a standalone system.  The standalone part of is pretty easy, I just use an “S”.  Sometimes I’ll trail it with a number, like S1, or S2 to denote that the application isn’t clustered but that they’re related (you’ll see an example later).

For the cluster part, it gets a little more involved and varies a little bit based on the application.  We start out with a simple “C” to denote clustered, but then I like to use another trailing letter / number as well.  Let’s look at a few cluster examples.

  • CN1 = Clustered Node 1
  • CN2 = Clustered Node 2
  • CDI1 = Clustered Database instance
  • CDG1 = Clustered Database Group 1

Application group number:

This is the final number that really ties everything together.  I use a simple “01”, “02” or whatever number really to tie all clustered nodes or even standalone (related) systems together.

Putting it all together:

Here are a few practical examples to give you an idea of how it all goes together.

Example 1:  A SQL environment to support the widgets application.  This SQL environment will have a full development lifecycle environment and UAT will mirror Production exactly.  We’ll be using SQL AAG’s.

  1. Dev = a1-sqlds-01
  2. Stage = a1-sqlss-01
  3. UAT =
    1. Nodes
      1. A1-sqlucn1-01
      2. A1-sqlucn2-01
    2. Clustered Named Object (management)
      1. A1-sqluc-01
      2. SQL AAG listener names
        1. A1-sqlucdg1-01
        2. A1-sqlucdg2-01
      3. Prod =
        1. Nodes
          1. A1-sqlpcn1-01
          2. A1-sqlpcn2-01
        2. Clustered Named Object (management)
          1. A1-sqlpc-01
  • SQL AAG listener names
    1. A1-sqlpcdg1-01
    2. A1-sqlpcdg2-01

Notice how the last number glues everything together.  Also notice how everything is built off a consistent naming standard.  If we deployed a new application

Example 2: How about something simple like a domain controller environment for 3 sites?  Let’s just say there will be a production environment and a test environment.

  1. Test
    1. Site a1
      1. A1-dctcn1-01
      2. A1-dctcn2-01
    2. Site a2
      1. A2-dctcn1-01
      2. A2-dctcn2-01
    3. Site a3
      1. A3-dctcn1-01
      2. A3-dctcn2-01
    4. Production:
      1. Site a1
        1. A1-dcpcn1-01
        2. A1-dcpcn2-01
      2. Site a2
        1. A2-dcpcn1-01
        2. A2-dcpcn2-01
      3. Site a3
        1. A3-dcpcn1-01
        2. A3-dcpcn2-01

Again, notice how I use a single final number to glue an entire “purpose” together.  If I built a second discrete domain, I would likely change the last number to “02” which would quickly tell me that the domain controller is part of a separate domain.  Also notice how you can easily tell which node a DC is, which site a DC is in, and what its environment is.

Example 3:  A non-clustered exchange server environment that’s serving the same company.

  1. Mailbox servers:
    1. A1-exmps1-01
    2. A1-exmps2-01
  2. CAS Nodes (load balanced)
    1. A1-excpcn1-01
    2. A1-excpcn2-01
  3. CAS VIP
    1. A1-excpc-01_vip

Here you can see how the standalone mailbox servers are working for the same purpose, but ultimately they’re not clustered together.  With the CAS servers, you can see that they are in fact clustered and that we even created a VIP DNS name that helps you understand how everything is related.

Other examples:

At this stage I’m just going to bullet a list of various options, I’ll use a single destination and a single application number since we’ve already gone over that.

File Servers:

  1. Clusters
    1.  Nodes
      1. a1-fspcn1-01
      2. a1-fspcn2-01
    2. Cluster Resource
      1. a1-fspc-01
    3. Clustered SMB share
      1. a1-fspcsmb1-01
      2. a1-fspcsmb2-01
    4. Clustered NFS share
      1. a1-fspcnfs1-01
      2. a1-fspcnfs1-01
  2. Standalone
    1. a1-fsps-01

CommVault:

  1. Comcell
    1. a1-cvccps1-01
  2. MediaAgent
    1. a1-cvmaps1-01
    2. a1-cvmaps2-01
  3. Virtual Server Agent (for dedicated VM’s)
    1. a1-cvvsaps1-01
    2. a1-cvvsaps2-01

DHCP:

  1. Clusters ***Note: because DHCP consumes 4 characters, I leave the “c” off the name. 
    1. a1-dhcppn1-01
    2. a2-dhcppn2-01
  2. Standalone
    1. a1-dhcpps1-01

Review: 1.5 years with MVP Systems Job Automation Scheduler (JAMS)

Introduction:

I wrote a really quick review here about MVP systems JAMS product about 1 year or so ago (maybe a little less).  At the time, I was in search of a solution that could help me glue together several disjointed systems in a workflow.  Specifically, we were trying to integrate Veeam and CommVault backup’s together.  Veeam was of course doing the VM backup’s, and CommVault was copying the Veeam files to tape.  We’ve since moved on from Veeam, but JAMS has continued to be a vital part of our infrastructure.

What is JAMS?

The simple answer is it’s a centralized task scheduler, the long answer is its not only that, but a whole lot more.  This is a solution that replaces cron, windows task scheduler, SQL agent jobs, or pretty much anything else that you would normally use to schedule and execute something.

What makes up a JAMS solution?

There are four main components.

  • The JAMS server: This is clusterable component that schedules, queues and executes any jobs or workflows.
  • The JAMS client: This is the administration GUI.  Kind of self-explanatory, but this is where you would configure all of the settings for the various jobs, and server settings.
    • For windows, this also includes a Powershell module for CLI administration. Pretty sure they have a generic API, but I never bothered to look since PS was available.
  • The JAMS agent: This is a component that is installed on a system where you want to execute jobs.  All kinds of OS’s supported.
  • Microsoft SQL server: Check with MVP systems if other DB’s are supported, but we’re a MS shop, and SQL is on their list.  This is used to store the job history, job status, and pretty much the entire server configuration.  If this goes down, you have big issues to deal with J.  And yes, a clustered SQL server IS supported.

All in all, the infrastructure is pretty simple to understand and for smaller use cases, these roles can all be installed on the same system.

History:

I didn’t start out with JAMS, in fact, they were nowhere in sight when the initial problem came to fruition.  I figured this would be a relatively trivial Powershell solution, and started down the path of trying to write a quick workflow.  Building the logic for the workflow was actually pretty easy, but what I kept running into was the good ‘ol Kerberos double hop condition.  Never heard about it?   Read about it here.  In order to centralize the solution, I basically tried to build my own poor mans centralized task scheduler.  In order to keep it central, I was utilizing “invoke-command” to execute scripts on our Veeam server and our CommVault server.  With Veeam, our database was stored on a different server, so when my “invoke-command” executed against Veeam, my credentials were never passed along to the SQL server.  I was able to work around it by using CredSSP, but it wasn’t reliable.  Sometimes it would work, and sometimes I guess it would timeout or something similar (don’t really remember to be honest).  Then there was the issue with CommVault.  See they used old fashion EXE’s to start jobs (we were on v8) from the command line.  The commands I needed to run had to be executed in sequential order.  Anyone who has worked with Powershell’s “start-process” via invoke-command, knows that the “-wait” parameter is ignored.  I don’t recall the reason, but it was lame on MS part.  Ultimately, it was this that was the deal breaker, and so started the search for some sort of centralized task scheduler.
We ultimately landed on a cheap’o but well known solution called “VisualCron”.  I’ve got nothing against the solution, but after working with it for a few days, not only was it very hacked together, but it wasn’t the most user friendly solution.  So the search continued and we ultimately stumbled across JAMS.  It took a lot of creative searching to find them, but I’m glad we did.  After installing the trial, we knew it was the solution we were looking for, and the rest as they say, is history.

The pros:

  • Easy solution: Pretty easy to install the solution and understand the components. Unlike some other solutions we’ve installed, JAMS takes care of installing any pre-requisites and also has an easy to understand architecture.
  • You get tech support: Normally not something to write home about, but we leveraged their support quite a bit at first, and they were normally helpful.  As simple as JAMS is, it can do a lot of stuff, and that’s where support can (and is) a huge help.  I remember one part of a solution where were trying to pass a variable from one job to another.  Called up support, and sure enough, JAMS could do it and they showed us how.  How about bulk creating a bunch of jobs via PS?  Yep, support had an example of that too.
  • The GUI: This is one where I have pros and cons.   We’re in the pros section, so that’s what I’ll focus on here.
    • I’ve never worked with a GUI that was capable of bulk edits, but JAMS is and it rocks. Just imagine wanting to change the start time on 60 jobs.  You could write a script to do it, or you could highlight the 60 jobs in a folder, right click and basically change the value of one field (time) to another value.  Then BAM! It goes and changes the time for all highlighted jobs.  Pretty much any column you can add to the GUI has this functionality and it rocks.
    • Easy to see all jobs scheduled to run, running or failed in one view.
    • It keeps a detailed log of each job execution. If you write output to the host (think write-host in PowerShell or REM in batch), that output gets logged to a file and stored for historical purposes.  So as long as your script has verbose output, you’ll know exactly what happened in your job.
    • Sort of related to the above, it keeps a history of all executed jobs and their final status. It also tracks things like when it ran, how long it ran, how many resources it consumed, etc.
    • They have some pretty neat dashboards (once you figure them out). There are a few cool built in ones (like projected schedule) too.
    • Last but not least, it’s a pretty easy GUI to use. I won’t say it doesn’t have any learning curve, but I think the learning curve is really more related to the solution than the client its self.
  • Scripting engines: The agent can execute all kinds of scripts.
    • Powershell
    • Batch
    • Bash
    • SSH
    • T-SQL
  • Agent OS Support: The agent can be installed on different OS’s, so this isn’t a 100% windows only solution.
  • Workflows (setups): You can build “setups” (workflows) that tie jobs together. The jobs themselves can run on completely different systems.  In our case, we had a setup which had a “job” that ran on a veeam server and a different job that ran on the CV server.  The Setup was configured to wait until the first job completed with a success before moving on.
  • Job Queueing: It supports queueing jobs. Probably not an issue for many folks, but we used to limit the number of tape backup’s running in parallel.  What’s great is each “job” in jams can share a queue or have different queues.  This allows a Setup to execute the first job (as an example) and if needed, the second job will queue.  We typically had 50 setups running in parallel, but only 4 tape jobs that were allowed to run in parallel.  JAMS would execute all 50 setups in parallel, but when it came time to run the tape portion of a setup, the tape jobs would go into a queue and trickle out as others completed (or failed).  This didn’t stop the first jobs (the backup its self) from completing, so it ultimately kept things moving at a great pace.
  • PowerShell: Being able to admin JAMS through PS is a huge win in my book.  You can create, modify and delete, jobs, setups, queues, etc.  Everything in the GUI can be done in PS.  It’s sad in 2017 that I even have to list this as a pro of a solution. None the less, it’s not as common as it should be, and it’s a win for JAMS.
  • Different Licensing: With a lot of solutions, there’s only one licensing strategy. I found that JAMS had several, and they do so to accommodate differing needs and purposes.
  • Sales team: The sales team I worked with was friendly, knowledgeable and not pushy in any kind of way.  Additionally, what I think is worth noting, is while it felt like we were shopping for a Ferrari, they understood we were on a Corvette budget, and worked with us to find a licensing model (and some pricing breaks) to let us drive home in a solution we really wanted.

The cons:

  • Price: I’m not saying it’s overpriced, all I’m saying is its not cheap.  I would love to use this solution for my whole environment, but it’s not cheap to do that.  I’m not saying they won’t work with you (they will), but to scale the solution, you will be digging deeper in your wallet.
  • The GUI: I think the GUI has some great design characteristics, but I also think it has some flaws too.
    • They recently updated the GUI look. I’m personally not a fan.  It’s a matter of opinion of course, but I find it harder to see what I need to see now.
    • I don’t like the way they separate jobs from setups. I wish they just used a different icon, or a value in a field to separate them.  There are plenty of times I click on a folder and forgot that I’m in the “setup view” when I’m looking for a job.
    • They don’t support right click for certain job management features. I intuitively want to right click in the jobs window and select “new job” or something related, but that’s not the way the GUI is designed.
    • When you bulk submit jobs, they ask you if you want to, for each job. That means if you selected 25 jobs, you’re clicking “submit” 25 times afterwards.
  • Their security design: I found that their security model didn’t work quite like one might think.  I remember working with a tech to do something simple like let our DBA’s manage jobs (execute and read) and something as simple as that required what seemed like a million hoops to jump through.  Ultimately IIRC (it’s been a while) we ended up needing to grant them more rights than I would have wanted in order to accomplish what seemed like a trivial task.  I gave up on it because I didn’t want to create a solution that going to be too complex to manage.
  • Overlapping job detection: I remember when I first started with their solution, we had run into a few cases where jobs (or setups) were overlapping on themselves.  Meaning Job A from Monday night was still running and Tuesday nights job started up and started running.  When I asked support about this, they handed me a script that would nuke the Tuesday job, but ultimately didn’t solve my issue of needing Tuesday’s job to just wait.  I ultimately ended up writing a pre-check job that would detect if any of the same jobs were running and if so, to go into a loop where it checks every minute, waiting for the previous job to complete.  What sucks about this and the script they gave me, is every job I launch that has a pre-check job, this ends up burning my job count.  To me, this just seems like something that should be built into the solution.
  • Maintenance Mode: They don’t seem to have a maintenance mode option.  What I mean by that, is being able to put JAMS into a paused state.  I think you can stop a service on the windows hosts, but honestly that’s a hack.  They should just have a maintenance mode option built right into the GUI.   I could see something like having a few options like, queue any new jobs that start, or let existing jobs finish, but queue anything new, or don’t let any jobs start at all.  Bonus points if this could be done at a folder level.

Conclusion:

Ultimately after living with JAMS for almost 1.5 years, I think they really rock as a solution.  I can’t say I have any experience with any other enterprise job scheduling solutions, but my overall experience with JAMS has been a pleasure.  No solution is perfect, and theirs is no exception, but the great news is they have a solution that is ultimately awesome, with a few negatives, which is a far cry from other vendor’s solutions I’ve used.  My suggestion is if you’re looking for something to replace SQL jobs, task scheduler, cron or any other isolated solution, to give them a look, I think you’ll be pleased.

Naming Conventions: SQL Server Names

Introduction:

If you’re working in a Windows environment like me, you have to deal with 15 character limitations (at least if you care about NETBIOS resolution).  Honestly, really cramps my style, but these conventions were written with that limitation in mind.  If you’re using this for a Linux based server running MySQL or PostgreSQL, you might be able to get a little more detailed.

Multi Environment / Cluster or standalone:

At my employer ASI, we have to contend with multiple applications, multiple environments for each application and in a lot of cases clusters.  Trying to come up with a name that works for them all can be tough, but I think I’ve got a few good ones depending on your use cases.

Here is an MSSQL server naming convention:

Standalones:

  • cmp-sqlps1-01
  • cmp-sqlus1-01
  • cmp-sqlss1-01
  • cmp-sqlds1-01
  • cmp-sqlps1-02
  • cmp-sqlps2-02
  • cmp-sqlus1-02

Let’s break it down:

  • cmp = company.  Use whatever prefix you want, doesn’t need to be all letters and doesn’t need to be three, but I wouldn’t go beyond three.
  • (_) underscores = separators
  • SQL = SQL server
    • Next letter = environment
      • P = Prod
      • U = UAT
      • S = Stage
      • D = Dev
    • Next letter = Clustered or standalone
      • S = Standalone
      • C = Cluster ***More on this later
    • Following number = the number of SQL servers for this environment and application.  If you have multiple stand alone servers for “app1” this allows you to accommodate them.
  • The final number is a number to represent the application.  Doesn’t matter what the number is, but once its assigned, your other environments should fall into the same number.  I can look at the last number and quickly go “oh that’s app 20”.  Ok, maybe not quickly, but its easy enough with a matrix.

You can quickly determine if the servers function, environment, is it a cluster and the apps.

Clusters:

Cluster management resource name

  • cmp-sqlpc-01
  • cmp-sqluc-01

Cluster Node Name:

  • cmp-sqlpcn1-01
  • cmp-sqlpcn2-01
  • cmp-sqlucn1-01
  • cmp-sqlucn2-01

Cluster application resource name

SQL AAG Name / SQL Listener Name
  • cmp-sqlpcdg1-01
  • cmp-sqlpcdg2-01
  • cmp-sqlucdg1-01
  • cmp-sqlucdg2-01
SQL Traditional Failover Cluster Name
  • cmp-sqlpcdi1-01
  • cmp-sqlucdi1-01

Let’s break it down:

SQL clusters require a lot of computer accounts / names, and having a naming convention helps to keep them grouped and organized.

  • CMP-SQLPC = we should already know this based on the previous explanation.  This part tells me its a SQL production cluster.
    • If it goes straight to the application number, that means this is the cluster management resource.
    • If it goes to N1 or N2 = that’s the node of that cluster.  N1 = node 1, N2 = Node 2.
    • If it goes dg1 = Database Group 1, or SQL AAG1 and it belongs to this cluster.
    • If it goes DI1 = Database Instance 1, or SQL failover instance 1.
  • The following number of course is the application we’re assigning.

Closing:

That’s pretty much it, short and sweet.  It’s not designed to scale to massive levels, but I think it will work for most environments.

Index:

To see other naming conventions or posts about naming conventions, head over here.

Naming Conventions: Windows Local Administrator Group via Active Directory

Introduction:

One thing that always kind of sucked with windows servers (and its even worse with Linux) is dealing with how to manage local administrator rights.  Now I realize a lot of you just add yourself to the domain admins and use that account (not a security best practices) to administer your servers, but what about everyone and everything else that might need local admin access?  What I have for you here is how I not only created a new naming conventions, but also how I tackled managing local administrator access in windows domain environment.

What you need:

  1. This is dependent on active directory. If you’re dealing with non-domain systems, this post isn’t really for you.
  2. You’re going to need a group policy object and your servers / desktops are going to need to support group policy preferences (2003/XP and above).

What you’ll want:

  1. More than likely a provisioning process that automates something which GPO/GPP can’t. For example, a simple PowerShell script that you run once the server is provisioned to finish the little things.

The naming convention:

I use a pretty simple naming convention for local administrators.  There are three main conventions I used which are below.

  • _asi_lad_%Server or Computer Name%
  • _asi_lad_All Servers in AD
  • _asi_lad_All Desktops in AD

Naming convention breakdown:

  1. The underscores are used as separators. At some point you may need to do queries with scripts, and knowing that an “_” represents a separator will make it easy to know where to start looking for variables.  For example, if you just looked in AD for something like *ServerName* you might get back more than one AD object, but if you look for *_lad_Servername you’re only going to get back local administrator groups.
  2. “ASI” in my case stands for Advertising Specialty Institute, which is an acronym for my company name. I used to work at the Pew Charitable Trusts and used “PCT” for them.  I use a company name to start out everything because down the road, you never know who you’re going to merge with.  While there is always a chance that ASI could merge with a company named “Advanced Internetworked Servers” or something like that, its not likely.  Plus, if you’re running a sort of multi-tenancy, this allows you to keep different companies separate.  It doesn’t need to be an acronym, nor does it need to be three letters, but it should be specific to a company.  Heck, if a small GUID makes sense go for it, just keep in mind that 64 character limit with AD (yes I’ve hit it with group names).
  3. “LAD” as you can guess stands for Local Ad Again, it doesn’t need to be three, but it should be consistent.
  4. The final part
    1. The server name is kind of obvious.
    2. All servers is kind of obvious
    3. All desktops is kind of obvious

 

You could of course take it much further than this if you want, it really depends on how granular you want to get.  I try to keep things balanced.  If your servers themselves have a good naming convention, you can likely already key off of that.

How it works:

  1. I created a GPO for my servers and a different one for my desktops. To be VERY clear, this GPO is not dedicated to this purpose.  I try to use one GPO to rule them all as much as possible.  So I add this to my GPO which applies to all servers and the one for all desktops for other things.
  2. Create a computer GPP setting which adds the respective “_company_lad_all server” and “_company_lad_all desktops”. This gives you an easy way to drop a service account into a group that will gives them local admin access to all servers WITHOUT giving them domain admin rights.
    1. As an aside, we also create a GPP that removes domain admins.
  3. Now, for the servers specifically (we don’t do this for desktops) I create a server specific group (via a script) and add it to that server (via script). You now have an ability to add a user to the local admins without ever needing to login to that local server anymore.

 

Powershell Snippet for creating a group and adding it to the local admins:

#Create and assign default local admin group

#Parameters

$ServerName = “cmp-servername-01”

$GroupName = “_lad_cmp_” + $ServerName

$OUPath = “OU=YourOUPath,DC=Domain,DC=Name”

$FQDN = “Domain.name”

#Creating group

New-ADGroup -DisplayName $GroupName -GroupCategory “1” -GroupScope “1” -Name $GroupName -SamAccountName $GroupName -Path $OUPath

Start-Sleep -Seconds 15

#Add Group to servers local administrators

$de = [ADSI]”WinNT://$ServerName/administrators,group”

$de.psbase.Invoke(“Add”,([ADSI]”WinNT://$FQDN/$GroupName”).path)

Closing:

That’s it really.  You can of course take it further, but this is a simple way to get a handle on your local administrators, and centrally manage them all from AD going forward.  You also now have a naming convention which makes it easy to track them down and automate various tasks.

Index:

To see other naming conventions or posts about naming conventions, head over here.

 

Naming Conventions: Introduction

Introduction:

If there was one area in IT that I tend to be fanatical about (some would say fanatical isn’t a strong enough word), I would have to say its naming conventions.  For me, learning about naming conventions in school and then seeing it put into practice in my first real gig, really enforced the concept that naming conventions are important.  I actually consider myself pretty lucky in that respect because I don’t think a lot of people have had the fortune of studying about the concept, let alone working for a company that had a decent naming convention.  After walking fresh into two different employers and seeing the mess they made, and also working through a few mergers, I know that good naming conventions don’t always exist.

This post is going to be refreshingly short, and instead really serve as an introduction to a new series I’m going to start.  The series will be a culmination of naming conventions for different things in IT.  I manage a ton of different things, so there’s always an opportunity for a new naming convention.

Why do you need a naming convention?

Here’s the thing, you don’t need a naming convention, but you should want them.  Not caring about names of objects not only makes your life hell, but also everyone that either replaces you or works with you.    The end goal of a naming convention is to put a little effort up front, so you can save a ton of time down the line.

How does a naming convention save you time?

Let me give a few examples:

  1. Once you have an established naming convention for an object, naming the next related object usually requires little thought.
  2. It keeps you and your peers organized.
  3. It provides consistency, and consistency leads to intuition, and intuition basically means second nature. If something is second nature, you’re not really thinking about it, and thus saving time.
  4. Getting a little more technical, naming conventions allow you to search and filter predictably.
    1. Being able to do this, means you can now effectively use scripts with deterministic targets.
    2. Writing reports is easier
  5. Identification of what something is becomes easier
  6. When done right, they’re forward thinking so they can be easily updated to reflect new changes.

When naming conventions don’t make sense?

Almost never, I don’t care if you have one AD group for managing a single server.  Now you might not need a super complex naming convention in that case, but you should still have something predictable.

That being said, Tags are slowly but surely becoming the new way to identify objects.  It makes a lot of sense, and to some degree it does invalidate the need to have monolithic naming convention.  Beyond that, even the best naming conventions will at times run into scaling issues.  Tags are infinitely more flexible and I’m looking forward to their eventual ubiquity in every aspect of technology.  Even with this, you should still keep in mind that tags are just as easily susceptible to bad naming conventions.  Your tag names and categories should have well thought out standards to ensure consistency as well.

Closing:

Naming conventions to me are right up there with regular maintenance, patching and other important but boring aspect of IT.  You need a plan, and you need to stick to it.  Don’t be one of those lazy IT people that can’t think further into the future than now.  Saying you “don’t have time” is a load of bytes, there is ALWAYS time for a naming convention.

Index:

This post is also where I plan to host the table of contents for all naming conventions posts.  If you think you missed a post, or can’t find it, track back here to look for it.  I plan to include a link to this post in the end of every naming convention post.

Windows:

Quicky Review: GPO/GPP vs. DSC

Introduction:

If you’re not in a DevOps based shop, or living under a rock, you may not know that Microsoft has been working on a solution that by all accounts sounds like its poised to usurp GPO / GPP.  The solution I’m talking about is Desired State Configuration, or DSC. According to all the marketing hype, DSC is the next best thing for IT since virtualization.  If the vision comes to fruition, GPO and GPP will be a legacy solution.  Enterprise mobility management will be used for desktops and DSC will be used for servers.  Given that I currently manage 700 VM’s and about an equal number of desktops, I figured why not take it for a test drive. So I stood up a simplistic environment, and played around with it for a full week and my conclusion is this.

I can totally see why DSC is awesome for non-domain joined systems, but its absolutely not a good replacement in todays iteration for domain joined systems. Does that mean you should shun it since all your systems are domain joined?  That depends on the size of your environment and how much you care about consistency and automation.  Below are all my unorganized thoughts on the subject.

The points:

DSC can do everything GPO can do, but the reverse is not true. At first that sounds like DSC is clearly a winner, but I say no.  The reality is, GPO does what it was meant to do, and it does it well.  To reproduce what you’ve already done in GPO while certainly doable, has the potential of making your life hell.  Here are a few fun facts about DSC.

  1. The DSC “agent” runs as local system. This means it only has local computer rights, and nothing else.
  2. Every server that you want to manage with DSC, needs its own unique config file built. That means if you have 700 servers like me, and you want to manage them with DSC, they each are going to have a unique config file.  Don’t get me wrong, you can create a standard config, and duplicated it “x” times, but none the less, its not like GPO where you just drop the computer in an OU and walk away.  That being said, and to be fair, there’s no reason you couldn’t automate DSC config build process to do just that.
    1. DSC has no concept of “inheritance / merging” like you’re used to with GPO. Each config must be built to encompass all of those things that GPO would normally handle in a very easy way.  DSC does have config merges in the sense that you can have a partial config for say your OS team, your SQL team and maybe some other team.  So they can “merge” configs, and work on them independently (awesome).  However, if the DBA config and the OS config, conflict, errors are thrown, and someone has to figure it out.  Maybe not a bad thing at all, but none the less, it’s a different mindset, and there is certainly potential for conflicts to occur.
  3. A DSC configuration needs to store user credentials for a lot of different operations. It stores these credentials in a config file that hosted both on a pull server (SMB share / HTTPs site) and on the local host.  What this means is you need a certificate to encrypt the config file and then of course for the agent to decrypt the config file.  You thought managing certificates was a pain for a few exchange servers and some web sites?  Ha! now every server and the the build server need certs.  In the most ideal scenario, you’re using some sort of PKI infrastructure.  This is just the start of the complexity.
    1. You of course need to deploy said certificate to the DSC system before the DSC config file can be applied. In case you can’t figure it out by now, this is a boot strap solution you have to implement on your own if you don’t use GPO.  You could use the same certificate and bake it into an image.  That certainly makes your life easier, but its also going to make your life that much harder when it comes to replacing those certs on 700 systems.  Not to mention, a paranoid security nut would argue how terrible that potentially is.
  4. The DSC agent of course need to be configured before it knows what to do. You can “push” configurations, which does mitigate some of these issues, but the preferred method is “pull”.  So that means you need to find a way (boot strap again) to configure your DSC agent so that it knows where to pull its config from, and what certificate thumbprint to use.

Based on the above point, you probably think DSC is a mess, and to some degree it is. However, a few other thoughts.

  1. It’s a new solution, so it still needs time to mature. GPO has been in existence since 2000, and DSC, I’m going to guess, since maybe 2012.  GPO is mature, and DSC is the new kid.
  2. Remember when I wrote that DSC can do everything that GPO can do, but not the reverse? Well, lets dig into that.  Let’s just say you still manage Exchange on premises, or more likely, you manage some IIS / SQL systems.  DSC has the potential to make setting those up and administering them, significantly easier.  DSC can manage not only the simple stuff that GPO does, but also way beyond that.  For example, here are just a few things.
    1. For exchange:
      1. DSC could INSTALL exchange for you
      2. Configure all your connectors, including not only creating them, but defining all the “allowed to relay” and what not.
      3. Configure all your web settings (think removing the default domain\username).
      4. Install and configure your exchange certificate in IIS
      5. Configure all your DAG relationships
      6. Setup your disks and folders
    2. For SQL
      1. DSC could INSTALL sql for you.
      2. Configure your max member min memory
      3. Configure your TempDB requirements
      4. Setup all your SQL jobs and other default DB’s
    3. Pick another MS app, and there’s probably a series of DSC resources for it…
    4. DSC let’s you know when things are compliant, and it can automatically attempt to remediate them. It can even handle things like auto reboots if you want it to.  GPO can’t do this.  To the above point, what I like about DSC, is I’ll know if someone went in to my receive connector and added an unauthorized IP, and even better, DSC will whack it and set it back to what it should be.
    5. Part of me thinks that while DSC is cool, I wish Microsoft would just extend GPO to encompass the things that DSC does that GPO doesn’t. I know its because the goal is to start with non-domain joined systems, but none the less, GPO works well and honestly, I think most people would rather use GPO over DSC if both were equally capable.

Conclusion:

Should you use DSC for domain joined systems?  I think so, or at least I think it would be a great way to learn DSC.  I currently look at DSC as being a great addition to GPO, not a replacement.  My goal is going to be to use GPO to manage the DSC dependencies (like the certificates as one example) and then use DSC for specific systems where I need consistency, like our exchange, SQL and web servers.  At this stage, unless you have a huge non-domain joined infrastructure, and you NEED to keep it that way, I wouldn’t use DSC to replace GPO.

 

Review: 4.5 years with Dell Poweredge Servers

Disclaimer:

Typical stuff, these are my views not my employers, they’re opinions not facts (mostly at least), use your own best judgement, take my views with a quarry full of salt.

Introduction:

When I started at my current employer, one of the things I wasn’t super keen on was having to deal with Dell hardware.   I had some previous experience with Dell servers, none of which was good.  My former employer was an HP shop that had converted from Dell, and early on in my SysAdmin gig, I had the misfortune of having to work with some of the older Dell servers (r850 as an example).

You’re probably thinking this means I’ll be going over why one vendor is better than the other?  Nope, not at all.  I don’t have the current experience with the HP ecosystem to draw any conclusions like that.  I provided that information so that you the reader know that I write this all as a former HP fanboy.  I hope you’ll find it objective and informative.  I also hope that someone with some sway at Dell is reading this, so they can look for ways to improve their product platform.

Pros:

As always, I like to start out with the pros of a solution.

  • Hardware lifecycle: I find that Dell has a very good HW lifecycle.  They’re very quick to release new servers after Intel releases a new chip.  Maybe not a big deal for some, but for us, it’s a nice win.  In turn I also find that Dell keeps their current and former server models around for a very respectable amount of time.  If you’re in a situation where you’re trying to keep a cluster in a homogeneous HW configuration, this is advantageous.
  • Support: For general HW issues I find Dell to have fantastic support.  We use pro support for everything which I’m sure helps.  When I say general HW support, I’m talking about getting hard drives, or memory replaced.  To some degree I’ll even say troubleshooting more complex HW issues they tend to be pretty good at.  When you get above the HW stack, I’ll chat about that a little bit later.  It also worth mentioning that from what I can tell, support is 100% based in the US which is surprising but certainly appreciated.
  • HW Design / Build: When you’re comparing Dell to something like a Supermicro server, it’s a night and day difference.  I find that the overall build quality of Dell’s rackmount solutions are excellent.  Internally, most things are easy to get to and replace.  The cover is labeled well as are things like the memory which makes it easy to track down a problem DIMM.  Outside the server, the cable management arms are nice (if you use them).  Overall, the server is built sturdy, the edges are typically smooth (not sharp) and pretty much everything is tooless.  I only have two gripes, which I’ll add here instead of duplicating it in the cons.  I find their bezels to be somewhat useless.  We’ve stopped using them because we’ve run into cases where its actually hard to get them off.  Also, depending on your chassis config, the bezel blocks some idiot lights.  Finally, and admittedly this is more of a personal preference, I HATE Dell’s rails.  I think the theory is they’re supposed to be installed vertically starting in the rear, and then lay into the rails.  The problem is the rails are just too damn flimsy when they’re extended (all rails are; this isn’t just Dell).  I personally prefer the slide in style like found on the HP’s.
  • Purchasing configuration: One thing I love with Dell is that each server is 100% customizable, or at least reasonably customizable.  I’ve worked with other vendors where you were forced to use cookie cutter templates or wait a month+ for a custom build.
  • Price: I can’t say they’re cheaper than every vendor out there, but I suspect if you compare them to HP, Cisco or Lenovo I find that they tend to be a more affordable solution. For us, we go Dell direct and we’re considered an enterprise customer.  Your mileage may vary in this case of course.
  • Sales team: I can’t typically say this about many vendors, but I generally love working with Dell’s sales team.  They’re all very friendly and very responsive to requests.  Its one of those things where its just nice to see a vendor meet what I’ll call minimum expectations.  Dells sales team does this for sure and in many cases exceeds them.

And that is pretty much where the pros end.

Cons:

Like I’ve said in previous reviews, its always easier to pick out the cons of a solution.  Chalk it up to taking things for granted, loss of perspective, etc.  I would say take some of these cons with a grain of salt, no vendor is perfect.  Some of this stuff I’m not just outlining for your information, but also so that perhaps a product manager for Dell’s server solutions gets some much needed, unfiltered feedback.  I say unfiltered, because as a customer, I honestly feel like this stuff never makes it up to the right people, or when it does, its watered down.

  • Innovation: Dell has absolutely ZERO innovation capabilities across their entire portfolio, and the Poweredge line is no exception.  I’ve often joked that when Dell buys a company, that’s where they’re going die a slow non-innovative death.  Look at what they’ve done with EQL, Compellent, Ocarina, and Quest.  Seriously, the whole line up is a joke compared to what’s out and about now a day.   I know this is supposed to be about Poweredge servers (and I’ll get to that) but if Dell keeps any of those storage products around post EMC merger, they’re fools.  How does this apply to servers?  Take a look at Cisco UCS and then take a look at Dell.  Dells server solution is fine if you have say less than 50 or so servers and most of them are virtual hosts.  The instant you start going above that, the more value there is in a solution like Cisco UCS.  If I was still running a 100+ real server environment, no way I’d run Dell.  Why do I categorize this under innovation? Because there is nothing coming out from Dell that I feel compelled to write about.  There is no “wow” factor with Dell servers. When I first saw a UCS demo, THAT was a wow factor.  I don’t like the Cisco architecture, but no one can say that they’re not innovative, and I’d totally jump in with Cisco if my environment was larger.
  • Central Management (physical servers): Getting into a little bit more details of why I wouldn’t buy Dell if I had a larger shop.  They have a half assed / practically non-existent server management solution.  They have this POS called Dell Open Manage Essentials, that’s supposed to take care of pretty much everything one would need to take care of with a Dell server.  The problem is it doesn’t do a whole lot and what it does do, it doesn’t do very well.
    • Monitoring: OME only told us that there is a problem, but it wouldn’t tell you what the problem was.  So we’d still have to log into each server to find out what the specific problem was.  That wouldn’t have been a big deal, except 95% of the time it was something dumb like the drivers / FW being out of date (really, do we need a warning about that?).
    • Remote FW and Driver updates: This never worked, tried it a million times, and it would either only partially work, or not work at all.  Installing things like drivers or new tools would just show “failed” with some cryptic reason.  Manually download the same update, and run it manually, and it works fine.
    • Server profiles: I didn’t even try these because honestly nothing else worked well, and it looks like a PITA to configure.  There is clearly no “vision” for how to make managing servers easy at Dell.  I don’t mind doing some prep work here, but its not like we’re talking about standing up an OS + Application, we’re talking about a few server settings, some drac settings and icing on the cake would be configuring and managing the local RAID cards.  I know its probably not fair to put it down, maybe it’s a diamond of a solution baked into a crappy product, but I doubt it.
  • Central Management (Vmware): They actually do “ok” here, but its not great.  When the solution works (hit or miss) it does work pretty well.  Still I’m only using it to manage FW updates, and even with that it only works part of the time.  There have been countless cases where I’ve needed to kill outstanding FW update jobs and soft reset the drac to shake things loose.  There’s also times where I need to reboot the FW update appliance, and sometimes I have to reboot vCenter + do all the above, because who knows what the problem is.
  • Non-standard HW: It’s beyond frustrating that I can’t say I want all Intel SSD of a specific model.  Dell uses vague terms (low, mid or high write duty) to describe their SSD drives.  There’s no way for me to know if I’m getting an SSD that can do 35k IOPS or 100k IOPS.  With HDD’s its mostly a commodity, but with SSD’s, there are very much big pros and cons depending on which SSD you use.  IMO, with HCI (however over rate the architecture is) becoming a hot new thing, having a standard SSD and HHD IMO is a must.  You’re now building around local storage characteristics, and you need predictability.
  • SSD and HDD are stupid expensive: Just a matter of opinion, but their prices are insane, especially for the SSD’s.  I get that they might wear out prematurely, but then put some clause that they’re simply good for “x” writes are whatever.
  • DRACs: There remote management cards have gotten better over the years, but they’re still nothing compared to iLO.  They can still be somewhat unreliable and they’re only now starting to release an HTML5 interface instead of Java.  Adding to that, as mentioned above, I’ve seen multiple cases where FW update stop working and its almost always the DRAC that’s the issue.
  • Documentation and downloads: Dell is almost as bad as Microsoft when it comes to documentation and downloads.  Things are just scattered all over the place and it lacks consistency.  Yes, I can go to the drivers and downloads section for a server model, but there have been many times where I’ve seen Openmanage Systems Administrator (OSMA) latest version on say a Dell poweredge r730 page, but NOT on a poweredge r720.

With things like the Dell Openmanage vCenter appliance, I also find it hard to find the latest updates or to know where to download my serial keys.  Documentation about the product is also difficult to find at times.

  • Standalone FW updates: Dell has methods updating the HW for standalone system, but they’re all overly complex, or not refined.  One method is booting of an SBU disk and then swapping CD’s (ISOs) to load your repo.  It works (most of the time) but it’s a PITA.  The other option is using their repository manager (another half-baked solution) to create a standalone FW update ISO that just blindly updates all HW that you’ve loaded a FW for.  Neither solution is seamless or refined.  I cringe every time I have to use either.
  • Support: I know I marked this as a pro, but I wanted to elaborate a bit here on the cons of support.  I know in many cases they’re probably not different than other vendors, but I’m so sick of hearing “we can’t do x or y until you’ve updated all your FW and drivers”.  To me, I don’t care what vendor you are, this is lazy, kick the can troubleshooting.  Its one thing if you can point to a release note in a FW or driver that describes the specific problem I’m having, but if you’re not even looking at the release notes, and blindly telling me to update component x,y and z, that’s just you being lazy.  Here is normally what ends up happening anyway.  I call you, you tell me to update x,y and z.  I begrudgingly do it after debating with you, problem is NOT solved and now you’ve just pissed me off and wasted my time (and depending on the server, other teams time).

Conclusion:

At the end of the day, I’m sure you’re thinking I hate Dell servers.  I don’t, but they leave a lot to be desired.  I look at Dell as a high end Supermicro server.  I know I’ll get good support from them, a solid server and if I don’t buy their disks, a reasonably priced server.  Because our environment is relatively small (35 servers or so), I can deal with their cons.  If I had a larger physical server environment, I would probably lean more towards Cisco, but at this size and below its not worth the added cost or complexity of UCS.  That being said, if cost was no option, or if Cisco were to offer their solution at a more affordable price, I don’t know why anyone would buy Dell.

Thinking out loud: Hyper converged storages missing link

Introduction:

In general, I’m not a huge fan of hyper converged infrastructure.  To me, its more “hype” than substance at the moment.  It was born out of web scale infrastructure like Google, Facebook, etc. and IMO, that is still the area where it’s better suited.    The only enterprise layer where I see HCI being a good fit is VDI, other than that, almost every other enterprise workload would be better suited on new school shared storage.  I could probably go into a ton of reasons why I personally see shared storage still being the preferred architecture for enterprises, but instead I’ll focus on one area that if adopted might change my view (slightly).  You see, there is a balance between the best and good enough.  Shared storage IMO is the best, but HCI could be good enough.

What’s missing?

What is the missing link (pun intended)?  IMO, its external / independent DAS.  Can’t see where this is going?  Follow along on why I think external DAS will make hyper converged storage good enough for almost anyone’s environment.

Scaling Deep:  Right now the average server tops out at 24 2.5” drives and less for 3.5” drives.  In a lot of larger shops, that would mean running more hosts in order to meet your storage requirements, and that will come at the cost of paying for more CPU, memory and licensing then you should have to.  Just imagine a typical 1ru r630 + a 2u 60 drive JBOD!  That’s a lot more storage that you can fit under a single host, and it would only consume one more rack unit than a typical r730.  Add to this, theoretically speaking, the number of drives you could add to a single host would go beyond a single JBOD.  A quad port SAS HBA could have four 60 drive enclosures attached, and that’s a single HBA.

Storage independence:  Having the storage outside the server also makes that storage infinitely more flexible.  This is even true when you’re building vendor homogeneous solutions.  Take Dell for example.  Typically speaking their enclosures are movable between different server generations.  Currently with the storage stuck in the chassis, it gets really messy (support wise) and in many cases not doable, to move the storage from one chassis to another, especially if you’re talking about going from an older generation server to a newer generation.

Adding to this, depending on your confidence, white boxing also allows you to cut a server vendor out of the costliest part of the solution, which is the disks themselves.  Going with an enclosure from someone like RAID inc. or DataOn, Quanta QCT, Seagate, etc.  Add in a generic LSI (sorry Avago, oh sorry again, Broadcom) HBA, and now you have a solution that is likely good enough supportability wise.  JBODs tend to be pretty dumb and reliable, which just leaves the LSI card (well known established vendor) and your SSD / HDD.

Why do you want to move the storage anyway?  Simple, I’d bet a nice steak dinner that you want to upgrade or replace your compute long before you need to replace your storage. If you’re simply replacing your compute (not adding a new node but swapping it) then moving a SAS card + DAS is far more efficient than rebuying the storage, or moving the internal storage into a new host (remember warranty gets messy).  Simply vacate the host like you would with internal storage, shutdown, rip the hba out, swap server, put existing HBA back in, done.

If you’re adding a new host, depending on your storage, you may have the option of buying another enclosure and spreading the disks you have evenly across all hosts again.  So if for example, you had 50 disks in 4 hosts (total 200 disks) and you add a fifth host.  One option could be you simply remove 10 disks from each current node and place them in the new node. Your only additional cost was JBOD enclosure, and you now continue to keep your current investment in disks (with flash, that would be the expensive part).

Mix and match 3.5 / 2.5 drives:  Right now with internal storage, you are either running a 3.5” chassis, which doesn’t hold a lot of drives, but CAN support 2.5” drives with a sled.  Or you are running a 2.5” chassis which guarantees no 3.5” drives.  External DAS could mean one of two options:

  1. Use a denser 3.5” JBOD (say 60 disks) and use 2.5” sleds when you need to.
  2. Use one JBOD for 3.5” drives and a different one for 2.5” drives.

Again it comes down to flexibility.

Performance upgrades:  Now this is a big “it depends”.  Hypothetically if there were no SW imposed bottlenecks (which there are), one of your likely bottlenecks (with all flash at least) are going to be either how many drives you have per SAS lane, or how many drives you have per SAS card.  For example, if your SAS card is PCIe 3.0 internally, but the PCIe bus is 4.0, there’s a chance you could upgrade your server to a newer / better storage controller card.   More so, even if you were stuck on PCIe 3 (as an example).  There would be nothing stopping you from slicing your JBOD in half, and using two HBA to double your throughput.  Before you even go there, yes I do know the 730xd has an option for two RAID cards, glad you brought that up.  Guess what, with external DAS, you’re only limited by your budget, the number of PCIe slots you have and the constrains of your HCI vendor.  I for example could have 4 SAS cards, and 2 JBODs partially filled and each sliced in half.  You don’t have that flexibility with internal storage.

With the case of white boxing your storage, this also means to the extent of the HCL, you can run what you want.  So if you want to use all Intel dc3700’s, you can.  Heck, they’re even starting to make JBOF (just a bunch of flash) enclosures for NVMe, which again, would be REALLY fast.

Conclusion:

I say external DAS support is the missing link because it is what would allow HCI to offer similar scaling flexibility that exists in SAN/NAS.  I still think the HCI industry is at least 3 – 5 years out from matching the performance, scalability and features we’ve come to expect in enterprise storage, but external storage support would knock a big hole in a large facet of the scalability win with SAN/NAS.