Tag Archives: vmware

VMware DRS default memory load balancing

In general, I find DRS does a fantastic job of keeping VM’s happy.  However, in the past, I’ve seen a number of unexplained situations where hosts in a cluster run out of memory when a VM goes from idle, to busy all of a sudden.  In fact, this happened three times to us in a dedicated cluster for our SQL VM’s.  What was unexplained wasn’t how the host ran out of memory, that was pretty easy to track down.  What was unexplained was why DRS didn’t move any of the VM’s.  In this cluster we had a lot of VM’s with restrictive rules, but there were plenty of VM’s on this host that could have been easily relocated to prevent the over-committing of memory.

We were so perplexed by the situation that we called VMware support.  We explained the situation to them, and asked what we could do to mitigate it from happening.  I had the idea of using memory reservations, and that maybe DRS wouldn’t move VM’s to a host that didn’t have enough memory to back the configured vRAM.  Turns out out memory reservations in VMware aren’t exactly “reservations” per se.  They only reserve memory once it’s active.  So even if you say “I want to reserve 100% of the configured memory” VMware doesn’t actually do that.  It’s more of a “once I use it, I don’t give it back” functionality. As an aside, they have a new advanced setting in 6.5 that does pre-allocate the memory reservation. Given the confusion around memory reservations, I’d love to see that option exposed in the GUI with a brief comment about how it works.

Regardless, it wasn’t a mitigating solution recommended by support. In fact they suspected it would make things worse. Anyway, after looking through our logs, and chatting with their colleagues, the tech basically came to the conclusion that our cluster was overloaded and we needed more memory.  At the time, I was a little skeptical, but I had to concede our cluster was pretty full.  I didn’t think it was loaded so bad that DRS couldn’t shuffle things around, but we took the answer as is and came up with a different solution. We ended up disabling DRS since it was causing more issues than it was solving.

Fast forward to this week, we had a completely different cluster with a host on the verge of running out of memory.  It was at 96% all the while other nodes were chilling at ~50-60%.  Unlike the above scenario, there was TON of memory to balance things out, and there weren’t super restrictive DRS rules either.  I was poking around DRS, as I recall there being some DRS enhancements in 6.5.  I noticed this one setting.  “Load balance based on consumed memory instead of active memory”.  And so the light bulb went off.  I Googled the setting to make sure I understood what it meant and came across this article.  It did exactly what I thought it was going to, which honestly made me wonder why it’s not the default.  I ALSO noticed in this article that we probably could have influenced DRS in our SQL cluster to mitigate the overcommit.  A bit of a bummer, but life goes on.

In closing, I think this will be our default tweak whenever we setup a new cluster.  Within a few minute of enabling that feature,  I watched our host that was consuming 96% of it’s memory, vacated a number of VM’s and things looked a lot healthier.  I’m not suggesting that you should do this, but I might suggest that you consider it if you’re the type of shop that doesn’t like to overcommit memory.

Review: 5 years virtualizing Microsoft SQL Server

Introduction:

I know what you’re thinking, it’s 2017, why are you writing about virtualizing Microsoft SQL?  Most are doing it after all.  And even if they’re not, there’s this whole SQLaaS thing that’s starting to take off, so why would anyone care?  Well I’m writing this as more of a reflection on virtualizing SQL.  What works well, what doesn’t, what lessons I’ve learned, what I’m still learning, etc.

Like most things on the internet, I find that folks tend to share all the good, without sharing any of the bad (or vice versa).  There’s also just a lot of folks out there saying they’ve done it, without quantifying how well it’s working.  Sure, I’ve seen the cranky DBA say it’s the worst thing to happen, and I’ve seen the sysadmins say it’s the best thing that they ever did.  I find both types of feedback to be mostly useless, as they’re all missing context and depth.  This post is going to follow my typical review style, so I’ll outline things like the specs, the pros and cons, and share some general thoughts.

Background:

When I first started at ASI, I was told we’d never virtualize SQL.  It was the un-virtualizeable workload.  That was roughly six and a half years ago.  Fast forward to today, and we’ve been running a primarily virtualized SQL environment for close to five years.  It took a bit of convincing on my side, but this is basically how I convinced ASI to virtualize SQL.

  • Virtualizing SQL (and other big iron) was gaining a lot of popularity back in 2012
  • I had just completed my first successful POC of virtualizing a lot of other workloads at ASI.
  • We were running SQL on older physical systems and they were running adequately. The virtual hosts I was proposing were at the time two generations newer processor wise.  Meaning, if it was running ok on this dinosaur HW, it should run even better on this newer processor, regardless of whether it was virtual or not.
  • I did a ton of research, and a lot of political marketing / sales. Basically, compiling a list of things virtualization was going to fix in our current SQL environment.  Best of all, I was able to point at my POC as proof of these things.  For example, we had virtualized Exchange, and Exchange was a pretty big iron system that was running well. Many of the things I laid out as pros, I could point to Exchange as proof.

Basically, it was proposed as a win / win solution.  It wasn’t that I didn’t share the cons of virtualizing SQL, it was that I wasn’t as familiar with the cons until after virtualizing SQL.  This is going back to that whole lack of real-world feedback issue.  I brought up things like there would be some performance overhead, troubleshooting would be more difficult, and some of the more well-known issues.  But there was never a detailed list of gotcha’s.  No one that I was aware of had virtualized BIG SQL servers in the real world and then shared their experience in great detail.  Sure, I saw DBA’s complain a lot, but most of it was FUD (and still is).

Anyway, the point is, we did a 180, and went from not virtualizing any SQL, to virtualizing any and all SQL with the exception of one platform (more on that later).

The numbers and specs:

Bare in mind, this was five years ago, these were big numbers back then.

  • VMware cluster comprised of seven Dell r820’s
    • 32 total cores (quad socket 8 cores per)
    • 768GB of RAM
    • Quad 10gb networking
      • Two for the storage network
      • Two for all other traffic
    • Fusion IO drive 2 card
    • Fusion IO Io Turbine cache acceleration.
    • VMware ESXi 5.x – 6.x (over time)
  • Five Nimble cs460 SANs
  • Dual Nexus 5596 10Gb switches
  • Approximately 80 SQL servers (peak)
    • 20 – 30 of which were two node clusters
    • Started with Windows 2012 R1 + SQL 2012
    • Currently running Windows 2012 R2 + SQL 2014 and moving on to Windows 2016 + SQL 2017

To summarize, we have a dedicated VMware cluster for production SQL systems and another cluster (not detailed) for non-production workloads.  It didn’t start out that way, more on that later.

Pros:

No surprise, but there are a lot of advantages to virtualizing SQL that even after five years I still think holds true.  Let’s dig into it.

  • The ability to expand resources with minimal disruption. I’m not talking about anything hot-add here, simply the fact that you can add resources.  In essence, give you the ability to right size each SQL server.
  • Through virtualization, you can run any number of OS + SQL version combinations that you need. Previously there was all kinds of instance stacking, OS + SQL version lag.  With virtualization if we want a specific OS + SQL combo, we spool up a new VM and away we go.
  • Virtualization made it easy for us to have a proper dev, stage, UAT and finally production environment for all systems. Before these would have been instances on existing SQL servers.
  • Physical hardware maintenance is mostly non-disruptive. By being able to easily move workloads (scheduled) to different physical hosts, we’re able to perform maintenance without risking data loss.  There’s also the added benefit that there’s basically no firmware or driver updates (other than VMware tools / version) to apply in the OS its self.  This make maintenance a lot easier for the SQL server its self.
  • Related to the above, hardware upgrades are as easy as a shutdown / power on. There’s no need to re-install and re-configure SQL on a new system.
  • We were able to build SQL VM’s for specific purposes rather than trying to co-mingle a bunch of databases on the same SQL server. Some might say six of one and half a dozen of another, but I disagree.  By making a specific SQL server virtual, it enabled us to migrate that workload to any number of virtual hosts.
  • With enterprise licensing, we could build as many SQL systems as we wanted within the confines of resources.
  • Migrating SQL data from one storage location to another was easy, but I won’t go so far as saying non-disruptive. Doing that on a physical SQL server, requires moving the data files manually.  With VMware, we just moved the virtual disk.
  • Better physical host utilization. This is a double-edged sword, but we’ve been able to more fully utilize our physical HW than before.  When you consider how much SQL licensing costs, that’s a pretty big deal.
  • Redundancy for older OS versions. Before Windows 2012, there was no official support for NIC teaming.  You could do it, but Microsoft wouldn’t support it.  With VMware, we had both NIC redundancy and host redundancy.  In a non-clustered SQL server, VMware’s HA could kick in as a backup for host failures.

Pretty much, all the standard pros you’d expect from a virtual environment, and a few SQL specific ones.

Cons:

This is a tough one to admit, but there are a TON of cons to virtualizing SQL if a sysadmin has to deal with it at scale.

  • Troubleshooting just got tougher with a SQL. VMware will now always be suspect for any and all issues.  Some of it is justified, a lot of it not.  Still, trying to prove it’s not a VMware issue is tough.  You’re no longer simply looking at the OS stats, now you have to review the VM host stats, check for things like co-stop, wait, busy, etc.  Were there any noisy neighbors, anything in the VMware logs, etc.
  • Things behave have differently in a virtual world. In a physical world, “stuns” or “waits” don’t happen.  This is related to the above, but basically, for every simplicity that virtualization adds, it at least matches it with an equal or greater complexity.
  • The politics, OH the politics of a virtual SQL environment. If you don’t’ have a great relationship with your SQL team, I would say, don’t virtualize SQL.  It’s just not worth the pain and agony you’re going to go through.  It will only increase finger pointing.
  • DBA’s in charge of sizing VM’s on a virtual host you’re in charge of supporting. This is related to politics, but basically now that DBA’s know they can expand resources, you can bet your hind end your VM’s will get bigger and almost never shrink (we’ve gotten some resources back, so kudos to our DBA’s).  It doesn’t matter if you explain NUMA concerns, co-stop, etc.  It’s nothing more than “I want more CPU” or “I want more memory”.  Then a week later, when you have VM’s stepping on each other’s toes, it will be finger pointing back at you again.  I think what’s mostly happening here, is the DBA’s are focused on the individual server performance, whereas its difficult to convey the multi-server impact.
  • vMotion (host or storage) will cause interruptions. In a SQL cluster, you will have failovers.  At least that’s my experience.  Despite what VMware puts on their matrix, DON’T plan on using DRS.  Even if you can get the VM’s to migrate without a failover, the applications accessing SQL will slow down.  At least if your SQL VM’s are a decent size.  This was probably the number one disappointment with our SQL environment.
    • Once you can’t rely on DRS, managing VM’s across different hosts becomes a nightmare. You’ll either end up in CPU overload, or memory ballooning. I’ve never seen memory ballooning before virtualizing SQL, and that’s the last application you want to see ballooning and swapping.
    • Since you can’t vmotion VM’s to rebalance the cluster without causing disruptions (save for maybe non-clustered VMs) just keep piling on the struggle.
  • SQL VMware hosts are EXPENSIVE at least when you’re running a good number of big VM’s like we are. We actually maxed out our quad socket servers from a power perspective.  Even if we wanted to do something like add memory it’s not an option.  And when you want to talk about swapping in new hosts, it’s not some cheap 30k host, no it’s a host that probably costs close to 110k if not more.  Adding to that, you’re now tasked with trying to determine if you should stay with the same number of CPU cores, or try to make a case for more CPU cores, which now add SQL licensing costs.

I could probably keep going on, but the point is virtualizing SQL isn’t all sunshine and roses like it is for other workloads.

Lessons learned:

I’m thankful to have had this opportunity, because it’s enabled me to experience first-hand what it’s like virtualizing SQL in a shop where SQL is respectably large and critical.  In this time, I’ve learned a number of things.

  • DRS + SQL clusters = no go. Maybe it works for you and your puny 4 vCPU / 16GB VM, but for one of our vm’s with 24 vCPU and 228GB of RAM, you will cause failovers.  And no DBA wants a failover.
    • Actually DRS + any Windows cluster = no go, but that’s for another post.
  • If I had to do it over again, I would have gotten Dell r920’s instead of 820’s. While both were quad socket, I didn’t realize or appreciate the scalability difference between the 4600 and 8600 series xeons.  If I was building this today, I would go after hosts that are super dense.  Rather than relying on a scale out technique, I’d shoot for a scale up approach.  Most ideal would be something like the HPe SuperDome, but even getting a new M series Xeons with 128GB DIMMS would be a wise choice.  In essence, build a virtual platform just like you would a physical one So if you normally would have had three really big hosts, do the same in VMware.
  • Accept the fact that SQL VM’s are going to be larger than you think they should be. Some of this being fair is SysAdmins think they understand SQL, and we don’t.  There’s a lot more to SQL than CPU utilization.  For example, I’ve seen SQL queries that only used 25% of every CPU core they were running on, but the more vCPUs we allocated to the VM, the faster that query ran.  It was the oddest thing I had ever seen, but it also wasn’t the only application I’ve seen like this.  Likely, a disk bottleneck issue, or at least that’s my guess.
  • Just give SQL memory and be done with it. When we virtualized our first SQL cluster, the one thing we noticed was the disk IO on our SAN (and FusionIO card) was pretty impressive.  At first, it’s pretty cool to see 80k IOPS from a real workload, but then when you hear the DBA’s saying, “it’s slow” and you realize that if every SQL server you add needs this kind of disk IO, you’re going to run out of IOPS in no time.  We added something like 64GB of more memory to those nodes, and the disk IO went from 80k to 3k and the performance from the DBA’s perspective was back to what they expected.  There’s no replacement for memory.
  • Virtualizing SQL is complex. While it CAN be as simple as what you’re used to doing, once you start adding clustering, and managing a lot of monster VM’s on the same cluster, it’s a different kind of experience than you’re used to.  To me, it’s worth investing in VMware log insight for your SQL environment to make it easier to troubleshoot things.  I would also add ops manager as another potential value add.  At least these are things I’m thinking of pushing for.
  • Keep your environment as simple as possible. We started out with Fusion IO cards + Fusion IO caching software.  All that did was create a lot of headache, and once we increased the RAM in SQL, the disk bottleneck went away (mostly).  I could totally see using an Intel NVMe (or 3dxpoint) card for something like TempDB.  However, I would put the virtual disk on the drive directly, not use any sort of caching solution.
  • I would have broken our seven node cluster up into two or three two node clusters. This goes back to treating them like they’re physical servers.  Again, scaling up, much better, but if you’re going to use more, but smaller hosts, treat them like they’re physical.
    • We kind of do this now. Node 1’s on odd hosts, node 2’s on even hosts
  • We found that we ultimately didn’t need Vmware’s enterprise plus. We couldn’t vmotion, or use DRS, and the distributed switch was of little value, so we converted everything to standard edition.  Now, I have no clue what would happen if we wanted Ops Manager.  It used to be a la carte, but I’m not so sure anymore.
  • We originally had non-prod and prod on the same cluster. We eventually moved all of non-prod off.  This provided a little more breathing room, and now we have two out of seven hosts free to use for maintenance.  Before, they were partially consumed with non-prod SQL VM’s.
  • We made the mistake of starting with virtualizing big SQL severs and learning about Microsoft clustering + AlwaysOn Availability Groups at the same time. Not recommended J.  I don’t think it would have been easy to learn the lessons we did, even if it was difficult.
  • Just because VMware says something will work, doesn’t mean it will. I quadruple checked their clustering matrix and recommended practices guides.  We were doing everything they recommended and our clusters still failed over.
  • Big VM’s don’t behave the same way as little VM’s. I know it sounds like a no duh, but it’s really not something you think about.  This is especially true when it comes to vMotion or even trying to balance resources (manually) on different hosts.  You never realize how much you really appreciate DRS.
  • I’ve learned to absolutely despise Microsoft clustering when it’s virtualized. It just doesn’t behave well.  I think MS clustering is built for a physical world, where there are certain assumptions about how the host will react.  For the record, our physical SQL cluster is rock solid.  All our issues typically circle back to virtualization.
    • BTW, yes, we’ve tried tuning the subnet failover thresholds, no it doesn’t work, and no I can’t tell you why.
  • We’ve learned that VMware support just isn’t up to par, and that you’re really playing with fire if you’re virtualizing complex workloads like SQL. We can’t afford mission critical support, so maybe that’s what we need, but production support is basically useless if you need their help.
  • Having access to Microsoft’s premier support would be very beneficial in this environment. It’s probably something we should have insisted on.

Conclusion:

Do I recommend virtualizing SQL?  I would say it depends, but mostly yes.  There are certainly days where I want to go back to physical, but then I think about all the things I would miss with our virtual environment.  And I’m sure if you asked our DBA’s, they too would admit to missing some of the pros as well.  Here are my final thoughts.

I would say if you’re a shop that has a lot of smaller SQL servers, and they’re non-clustered, virtualization is a no-brainer.  When SQL is small, and non-clustered, it mostly behaves about the same as other VM’s.  We never have issues with our dev or stage systems, and they’re all on the smaller side and they’re all non-clustered.  Even with our UAT environment, we almost never have issues, even though they are clustered.

For us, it seems to be the combination of a clustered and large SQL server where things start getting sketchy.  I don’t want to make it sound like we’re dealing with failovers all the time.  We’ve worked through most of our issues, and for the most part, things are stable.  We occasionally have random failovers, which is incredibly frustrating for all parties, but they’re rare now a day.

My suggestion is, if you do want to virtualize large clustered SQL systems, treat them like they’re physical.  Here are a few rough recommendations:

  • Avoid heavy CPU oversubscription. Shoot for something like less than 3:1, and more ideal being less than 2:1
  • Size your VM’s so they fit in a NUMA node. That would have been impossible back in the day, but now a day, we could probably do this.  Maybe some of you though, this will still be an issue.  Our largest VM’s (so far) are only 24 vCPU, so we can fit in a single NUMA node on newer HW.
  • Don’t cluster in VMware period. No HA, no DRS.  Keep your hosts standalone and manage your SQL VM’s just like you would if they were physical.  Meaning, plan the VMware host to accommodate the SQL VM(s).
  • Don’t intermix non-SQL VM’s with these systems. We didn’t do this, but I wanted to point it out.
  • Plan on a physical host that can scale up its memory if needed.
  • When doing VMware host maintenance, failover your SQL listeners / clusters before migrating the VMs.
    • BTW, it’s typically faster to shutdown a VM then vMotion it while powered on at the sizes we’re dealing with.

Finally, I wanted to close by pointing out, that performance was never an issue in our environment.  In fact, things got faster when we moved to the newer HW + SAN.  One of the biggest concerns I used to see with virtualizing SQL was performance, and yet it was everything else that no one mentioned that ended up being the issues.

Hope this helps someone else who hasn’t taken the plunge yet or is struggling themselves.

Thinking out loud: VMware, this is what I want from you

Warning:

This post is clicking in at 6k words.  If you are looking for a quick read, this isn’t for you.

Disclaimer:

Typical stuff, these are my personal views, not views of my employers.  These are not facts, merely opinions and random thoughts I’m writing down.

Introduction:

I don’t know about all of you, but for me, VMware has been an uninspiring company over the last couple of years.  VMworld was a time when I used to get excited.  It used to mean big new features were coming, and the platform would evolve in nice big steps.  However, over the last 5 – 7 years, VMware has gotten progressively disappointing.  My disappointment however is not limited to the products alone, but the company culture as well.

This post will not follow a review format like many of you are used to seeing, but instead, will be more of a pointed list of the areas I feel need improvement.

With that in mind, let it go on the record, that in my not so humble option, VMware is still the best damn virtualization solution.  I bring these points up not to say that the product / company sucks, but rather to outline that in many ways, VMware has lost its mojo, and IMO some of these areas would be good steps in recovering that.

The products:

The death of ESXi:

You know, there are a lot of folks out there that want to say the hypervisor is a commodity.  Typically, those folks are either pitching or have switched to a non-VMware hypervisor.  To me, they’re suffering from Stockholm’s syndrome.  Here’s the deal, ESXi kicks so much ass as a hypervisor.  If you try to compare Hyper-V, KVM, Xen or anything else to VMware’s full featured ESXi, there is no competition.  I don’t give a crap about anything you will try to point out, you’re wrong, plain and simple.  Any argument you make will get shot down in a pile of flames.  Even if you come at me with the “product x is free” I’m still going to shoot you down.

With that out of the way, it’s a no wonder that everyone is chanting the hypervisor commodity myth.  I mean, let’s be real here, what BIG innovation has been released to the general ESXi platform without some up charge?  You can’t count vSAN because that’s a separate “product” (more on the quotes later).  vVOLs you say?  Yeah, that’s a nice feature, only took how long?

So, what else?  How about the lack of trickle down and the elimination of Enterprise edition? There was a time in VMware’s history when features trickle down from Enterprise Plus > Enterprise > Standard.  Usually it occurred each year, so by the time year three rolled around, that one feature in Enterprise Plus you were waiting for, finally got gifted to Standard edition.  The last feature I recall this happening too, was the MPIO provider support, and that was ONLY so they could support vVOLS on Standard edition (TMK).

Here is my view on this subject, VMware is making the myth of a commoditized hypervisor a self-fulfilling prophecy.  Not only is there a complete lack innovation, but there’s no trickle down occurring.

If you as a customer, have gone from receiving regular (significant) improvements as part of your maintenance agreement, to basically nothing year over year, why would you want to continue to invest in that product?  Believe me, the thought has crossed my mind more than once.

From what I understand, VMware’s new business plan, is to make “products” like vSAN that depend on ESXi, but that aren’t included with the ESXi purchase.  Thus, a new revenue stream for VMware and renewed dependence on ESXi.  First glance says it working, at least sort of, but is it really doing as well as it could?  While it sounds like a great business model, if you’re just comparing whether you’re black / red, what about the softer side of things?  What is the customer perception of moving innovations to an al a carte model?  For me, I wonder if they took the approach below, would it have had the same revenue impact they were looking for, while at the same time, also enabling a more positive customer perception?  I think so…

  1. First and foremost, VMware needs to make money. I know I just went through that whole diatribe above, but hear me out.  This whole “per socket” model is dead.  It’s just not a sustainable licensing model for anyone.  Microsoft started with SQL and has finally moved Windows to a per core model.  In my opinion, VMware needs to evolve its licensing model in two directions.
    1. Per VM: There are cases, where you’re running monster VMs, and while you’re certainly taking advantage of VMware’s features, you’re not getting anywhere near the same vale add as someone who’s running 20, 30, 50, 100 VM’s per host.  Allowing customers to allocate per VM licenses to single host or an entire cluster would be a fair model for those that aren’t using virtualization for the overcommit, but for the flexibility.
    2. Per Core: I know this is probably the one I’m going to get the most grief from, but let’s be real, YOU KNOW it’s fair.  Let’s just pretend, VMware wasn’t the evil company that Microsoft is, and actually let you license as few as 2 cores at a time?  For all of you VARs that have to support small businesses, or for all of you smaller business out there, how much likelier would you have just done a full blow ESXi implementation for your clients?  Let’s just say VMware charged $165 per core for ESXi standard edition and your client had a quad core server.  Would you think $659 would be a reasonable price?  I get that number simply by taking VMware’s list price and dividing by 8 cores, which is exactly how Microsoft arrived at their trade-ins for SQL and Windows.  NOW, let’s also say you’re a larger company like mine and you’re running enterprise plus.  The new 48 core server I’m looking at would normally cost $11,238 at list for Enterprise Plus.  However, if we take my new per core model, that server would now cost ($703 per core) $33,714.  That’s approximately $22k that VMware is losing out on for just ONE server.  I know what you’re thinking, Eric, why in the world would you want to pay more?  I don’t, but I also don’t want a company that makes a kick ass product to stagnate, or worse crumble.  I’ve invested in a platform, and I want that platform to evolve.  In order for VMware to evolve, it needs capital.
  2. Ok, now that we have the above out of the way, I want a hell of a lot more out of VMware for that kind of cash, so let’s dig into that.
    1. vSAN should have never been a separate product. Including vSAN into that per core or per VM cost just like they do with Horizon, would add value into the platform.  Let’s be real, not everyone is going to use every feature of VMware.  I’m personally not a fan of vSAN, but that doesn’t mean I don’t think I should be entitled to it.  This could easily be something that is split among Standard and Enterprise plus editions.
      1. Yes, that also means the distributed switch would trickle down into Standard edition, which it should be by now.
    2. Similar to vSAN, NSX should really be the new distributed switch. I’m not sure exactly how to split it across the editions, but I think some form of NSX should be included with Standard, and the whole darn thing for Enterprise Plus.
    3. At this stage, I think it’s about time for Standard edition to really become the edition of the 80%. Meaning, 80% of the companies would have their needs met by Standard edition, and Enterprise plus is truly reserved for those that need the big bells and whistles.  A few notable things I would like to trickle down to Standard Edition are as follows.
      1. DRS (Storage and Host)
      2. Distributed Switch (as pointed out in 2ai)
      3. SIOC and NIOC
      4. NVIDIA Grid
  3. As for Enterprise Plus, and Enterprise Plus with Ops manager, those two should merge and be sold at the same price as Enterprise plus. I would also like to see some more of the automation aspects from the cloud suite brought into the Enterprise Plus edition as well.  I kind of view Enterprise Plus edition, as being an edition that focuses on all the automation goodies, that smaller companies don’t need.
  4. IMO, selling vCenter as separate SKU is just silly. So as part of all of this, I would like to see vCenter simply included with your per core or per VM licenses.  At the end of the day, a host can only be connected to one vCenter at a time anyway.
  5. Include a log insight licenses for every ESXi host sold, strictly used for collecting and managing a hosts VMware logs, including the VM’s running on top of them. I don’t mean inside the OS, rather things like the vmware.log as an example.

Evolving the features:

vCenter changes:

I know I was a little tough on VMware in the intro, and while I still stand behind my assertion in their lack of innovation, what they’ve done with the VCSA, it’s pretty kick ass.  I would say it’s long overdue, but at least it finally here.  That said, there’s still a ton of things VMware could be doing better with vCenter.

  1. If you have ever tried to setup a simplistic, but secure profile for some self-service VM management, you know that it’s nightmare. 99% of that problem is attributed to VMware’s very shitty ACL scheme.  The way permission entitlements work is confusing, conflicting, and ultimately leads to having more access granted, so you can get things to work.  It shouldn’t be this difficult to setup a small resource pool, a dedicated datastore and a dedicated network, and yet it is.  I would love to see VMware duplicate the way Microsoft handles ACLS, because to be 100% honest, they’ve nailed it.
  2. In general, the above point wouldn’t even be an issue, if VMware would just create a multi-tenancy ability. I’m not talking about wanting a “private cloud”.  This isn’t a desire for more automation or the like, simply a built-in way, to securely carve up logical resources, and allocated them to others.  I would LOVE to have an easy way for my Dev, QA and DBAs to all have access discrete buckets of resources.
  3. So, I generally hate web clients, and nothing enforced that more than VMware. Don’t get me wrong, web clients can be great, but the vSphere web client is not.  Here is what I would like to see, if you’re going to cram a web client down my throat.
    1. Finish the HTML5, before ripping the c# away from us. The flash client is terrible.
    2. Whoever did the UI design for the c# client, mostly got it right the first time. The web client should be duplicated aspects of the c# client that worked well.  Things like the right click menu, the color schemes and icons.  I have no problem with seeing a UI evolve over time, but us old heads, like things where they were.  The web clients feel like developers just moved shit around for no reason.  The manage vs. monitor tab gets a big thumb up from me, but it’s after that where it starts to fall apart.  Finding simple things like the storage paths, which used to be a simple right click on the datastore have moved to who knows where.  Take a lesson from Windows 8 and 10, because those UI’s are a disaster.  Moving shit around for the sake of moving it around is the wrong.  Apples OS X UI is the right way to progress change.
  4. The whole PSC + vCenter integration, feels half assed if you ask me. I think for a lot of admins, they have no clue why these roles should be separate, how to properly admin the PSC’s, and if shit break, good luck.  It was like one day you only had vCenter, and the next thing you know, there’s this SSO thing that who knows what about, and then the PSC pops out of nowhere.  It wasn’t a gradual migration, rather this huge burst of changes to authentication, permissions and certificate management.  I would say there a better understanding of the PSC’s at this point, but it wasn’t executed in a good way.  Ultimately though, I still think the PSC’s need some TLC.  Here are a few things l’d like to see.
    1. You guys need to make vCenter and the like smart enough to not need a load balancer in front of the PSC’s. When vCenter joins a PSC domain, it should become aware of all PSC’s that exist, and have automated failover.
    2. There should be PowerCLI for managing the PSC’s, and I mean EVERYTHING about them. Even the stuff where you might run for troubleshooting.
    3. There should be a really friendly UI that walks you through a few scenarios.
      1. Removing a PSC cleanly.
      2. Removing an orphaned PSC controllers or other components (like vCenter).
      3. Putting a PSC into maintenance mode. (which means a maintenance mode should exist)
      4. Troubleshooting replication.
        1. Show the status
        2. Let us force a replication
      5. Rolling back / restoring items, like users or certs.
      6. Re-linking a vCenter that’s orphaned, or even transferring a vCenter persona to a new vCenter environment.
      7. How about some really good health monitors? As in like single API / PowerCLI command type of stuff.
      8. Generating an overall status report.
  5. Update manager, while an awesome feature, hasn’t seen much love over the years, and what I’d really like to see are as follows.
    1. Let me remove an individual update, and provide an option to delete the patch on disk, or simply remove the update from the DB.
    2. Scan the local repo for orphaned patches (think in the above scenario where someone deletes a patch from update manager, without removing it from the file system).
    3. Add the dynamic ability baselines to all classifications of updates, not just updates themselves. Right now, we can’t create a dynamic extensions baseline.
    4. Give me PowerCLI admin abilities. I’d love to be able to use PowerClI to do all the things I can do in the GUI.  Anything from uploading a patch, to creating baselines.
    5. Open the product up, so that vendors could integrate firmware remediation abilities.
    6. Have an ability to check the VMware HCL for updated VIBs, that are certified to work with the current firmware we’re running. This would make managing drivers in ESXi so much easier.
    7. Offer a query derived baseline. Meaning let us use things like a SQL query to determine what a baseline should be.
    8. Check if a VIB is applicable before installing it, or have an option for it. Things like, “hey, you don’t have this NIC, so you don’t need this driver”.  I’ve seen drivers installed, that had nothing to do with the HW I had, actually cause outages.
  6. There are still so many things that can’t be adminsterd using PowerCLI, at least not without digging into extension data or using methods. Keep building the portfolio of cmdlets.  I want to be able to do everything in PowerCLI that I can in the GUI.  Starting with the admin stuff, but also on top of that, doing vCenter type tasks like repointing or other troubleshooting tasks.
  7. How about overhauled host profiles?
    1. Provide a Microsoft GPO like function. Basically, present me a template that shows “not configured” for everything and explain what the default setting is.  Then let me choose whatever values are supported then apply that vCenter wide, datacenter wide, folder / cluster wide or host specific.
      1. Similar feature for VM settings.
      2. Support the concept of inheritance, blocking and over rides.
    2. Let me create a host independent profile, and perhaps support the concept of sub-profiles for cases where we have different hosts. Basically, let me start with a blank canvas and enable what I want to control through the profile.
  8. Let us manage ESXi local users / groups and permissions from vCenter its self. In fact, having the ability to automatically create local users / groups via a GPO like policy would be great.
  9. I had an issue where a 3rd party plugin kept crashing my entire vSphere web client. Why in the world can a single plugin, crash my soon to be only admin interface?  That’s a very bad design.  Protect the admin interface, if you have to kill something, kill the plugins, and honestly, I’d much rather see you simply kill the troublesome plugin.  Adding to that, actually have some meaningful troubleshooting abilities for plugins.  Like “hey, I needed more memory, and there wasn’t enough”.
  10. vCenter should serve as a proxy for all ESXi access. Meaning if I want to upload an ISO, or connect to a VM’s console, proxies those connections through vCenter.  This allows me to keep ESXi more secure, while still allowing developers and other folks to have basic access to our VMware environment.
  11. Despite its maturity, I think vMotion and DRS need some love too.
    1. Resource pools basically get ripped apart during maintenance mode evacuations or moving VM’s (if you’re not careful). VMware should develop a similar wizard to what’s done when you move storage.  That is, default to leaving a VM in a resource pool when we switch hosts, but ask if we’d like to switch it to a resource pool.
    2. I would love to see a setting or setting(s) where we can influence DRS decision a bit more in a heavily loaded cluster. For example, I’ve personally had vCenter move VM’s to hosts that didn’t have enough physical memory to back the allocated memory, and guess what happened?  Ballooning like a kid’s birthday party.  Allow us to have a tick box or something that prevents VM’s from moving to hosts that don’t have enough physical memory to back the allocated + overhead memory of the VM’s.
    3. Would love to see fault zones added to compute. For example, maybe I want my anti-affinity rules to not only be host aware, but fault zone aware as well.
      1. Have a concept of dynamic fault zones based on host values / parameters. For example, the rack that a host happens to run in.
    4. Show me WHY you moved my VM’s around in the vMotion history.
  12. How about a mobile app for basic administration and troubleshooting? I shouldn’t need a third party to make that happen.  And for the record I know you have one, I want it to be good though.  I shouldn’t need to add servers manually, just let me point at vCenter(s) and bring everything in.

SDRS, vVOLS, vSAN and storage in general:

If I had to pick a weak spot of VMware, it would be storage.  It’s not that its bad, it’s just that it seems slow to evolve.  I get it, it’s super critical to your environment, but in the same tone, it’s super critical to my environment, and that means I need them to keep up with demand.  Here is some example.

  1. Add support for tape drives, and I mean GOOD support / GOOD performance. This way my tape server can finally be virtualized too without the need to do things like remote iSCSI, or SR-IOV.  I know what some of you might be thinking, tape is dead.  Wish it were true, but it’s not.  What I really want to see VMware do, is have some sort of library certification process, and then enable the ability to present a physical library as a virtual one to my VM.  Either that, or related to that, let me do things like raw device mappings of tape drives.  Give me like a virtual SAS or fiber channel card, that can do a raw mapping of a table library.  Even cooler, would be enabling me to have those libraries be part of a switch, and enabling vMotion too.
  2. I still continue to sweat bullets about the amount of open storage I have on a given host, or at least when purchasing new hosts. It’s 2017, a period of time where data has been growing at incredible rates, and the default ESXi is still tuned for 32TB of open storage?  I know that sounds like a lot, but it really isn’t.  To make matters worse, the tuning parameters to enable more open storage (VMDK’s on VMFS) is buried in an advanced setting and not documented very well.  If the memory requirements are negligible, ESXi should be tuned for the max open storage it can support.  Beyond that, VMware should throw a warning if the amount of open storage exceeds the configured storage pointer cache.  Why burry something so critical and make an admin dig through log messages to know what’s going on (after the fact mind you)?
    1. Related to the above, why is ESX even limited to 128TB (pointer cache)? Don’t get me wrong, it’s a lot of storage, but it’s not like a wow factor.  A PB of open storage would be a more reasonable maximum IMO.   If it’s a matter of consuming more memory (and not performance) make that an admin choice.
  3. RDM’s via local RAID should be a generally supported ability. I know it CAN work in some cases, but it’s not a generally supported configuration.  There are times where an RDM makes sense, and local RAID could very much be one of those cases.  I should be able to carve up vDisks and present them to a VM directly.
  4. How about better USB disk support? It’s more of a small business need, but a need none the less.  In fact, I would say being even more generic, removable disks in general.
  5. Why in the world is removing a disk/LUN such an involved task still? There should literally be a right click, delete disk, and then the whole work flow kicks off in the background.  Needing to launch PowerCLI, do an unmount, detach process is just a PITA.  There shouldn’t even need to be an order of operations.  I mean, in windows I can just rip the disk out and no issues occur (presuming nothings on the disk of course).  I don’t mind VMware making some noise about a disk being removed, but then make it an easy process to say “yeah, that disk is dead, whack it from your memory”.
  6. Pretty much everything on my vSAN / what’s missing in HCI posts has gone unimplemented in vSAN. You can check that out here and here.  That said, they have added a few things like parity and compression / dedupe, but that’s nothing in the grand scheme of things.
    1. What I really wished vSAN was / is, is a non-hyperconverged storage solution. As in, I wish I could install vSAN as a standalone solution on storage, and use it as a generic SAN for anything, without needing to share it with compute.  Hedvig storage has the right idea.  Don’t know what I’m talking about, go check them out here.  Just imagine what vSAN could do with all that potential CPU power, if it didn’t have to hold its self-back for the sake of the VM’s.  And yes, THIS would be worth of a separate product SKU.
  7. SDRS:
    1. I wish VMware would let you create fault zones with SDRS. This way when I create VM anti-affinity rules and specific different fault zones, I’d sleep better at night knowing my two domain controllers weren’t running on the same SAN, BUT, that they could move wherever they needed to.
    2. It would be really great to see SDRS have the ability to balance VM’s across ANY storage type. And have expanded use to local storage as well.  For example, I would love to see vVOLs have SDRS in front of it.  So, my VM’s could still float from SAN to SAN, even if they’re a vVOL.  For the local storage bit, what if I have a few generic local non-san luns.  I could still see there being value in pooling that storage from an automation standpoint.
    3. I would love to see a DRS integration for non-shared storage DRS. I know it would be REALLY expensive to move VM’s around.  But in the case of things like web servers, where shared storage isn’t needed, and vSAN just adds complexity, I could see this being a huge win.  If nothing else, it would make putting a host into maintenance mode a lot easier.
    4. Let me have affinity rules in standard edition of VMware. This way I can at least be warned that I have two VM’s comingling on the same host that shouldn’t be.
  8. vFlash (or whatever it’s called)
    1. It would be nice to see VMware actually continue to innovate this. For example.
      1. Support for multiple flash drives per host and LARGE flash drives per host.
      2. Cache a data store instead of a single VM. This way the cache is used more efficiently.  Or make it part of a storage policy / profile.
      3. Do away with static capacity amounts per VMDK. In essence offer a dynamic cache ability based on the frequency of the data access patterns.
      4. I would also suggest write caching, but let’s get decent read caching first.

ESXi itself:

The largest stagnation in the platform has been ESXi its self.  You can’t count vSAN or NSX if you’re going to sell it as a separate product.  Here are some areas I would like to see improved.

  • I would love to see the installation wizard ask more questions early on, so that when they’re all answered, my host is closer to being provisioned. I understand that’s what the host deploy is for, but that’s likely overkill for a lot of customers.
    • ASK me for my network settings and verify they work.
    • ASK me if I want to join vCenter and if so, where I want the host located
    • ASK me if I want to provision this host straight to a distributed switch so I don’t need to go through the hassle of migrating to one later.
  • Let the free edition be joined to vCenter. This way we can at least move a vm (shutdown) from one host to another, and also be able to keep them updated.  I could see a great use case for this if developers want / need dedicated hosts, but we need to keep them patched.  I’m not asking for you do anything, other than let us patch them, move vm, and be able to monitor their basic health of the host.  Keep all the other limits in place.
  • Give us an option to NEVER overcommit memory. I’d rather see a VM fail to power on, not migrate or anything if it’s going to risk memory swapping / ballooning.
  • Make reservations an actual “reservation” If I say I want the whole VM’s memory reserved, pre-reserve the whole memory space for that VM, regardless of whether the VM is using it.
  • Support for virtualizing other types of HW, like SSL offload cards and presenting them to VMs. I suspect this would also involve support from the card vendors of course, but it would still be a useful thing to see.  For example, SSL offloading in our virtual F5’s.
  • I want to see EVERYTHING that can done in an ESX CLI and other troubleshooting / config tools also be available in PowerCLI.
  • Have a pre-canned command I can run to report on all hardware, its drivers, firmware and modules.
  • I think it would be kind of slick to run ESXi as a container. Perhaps I want to carve up a single physical ESXi host, into a couple of smaller ESXi hosts and use the same license.  Again, developers would be a potentially great use case for this.
  • I would like to see an ability to export and import and ESXi image to another physical server. Simple use case would be migrating a server from one host to another.  Maybe even have a wizard for remapping resources such as the NICS, and the log location.  I’m not talking about a host backup, more like a host migration wizard.
  • Actually, get ESXi joining to an Active Directory working reliably.
  • How about showing us active NFC connections, how much memory they’re consuming and the last time they were used. While we’re at it, how about supporting MORE NFC connections.
  • Create a new kernel for NFC and cold migration traffic with a related friendly name.
  • Help us detect performance issues easier with top. Meaning, if there are particular metrics that have crossed well known thresholds, maybe raise an event or something in the logs.  Related though, perhaps offing a GUI (or PowerCLI) related option for creating / scheduling an ESXTOP trace and storing the results in a CSV.

Evolving the company:

Documentation:

Look, almost everyone hates being stuck with documenting things, or at least I do.  However, it’s something that everyone relies on, and when done well, it’s very useful.   I get that VMware is large and complex, so I have to imagine documentation is a tough job.  Still, I think they need to do better at it.  Here is what I see that’s not working well.

  • KB articles aren’t kept up to date as new ESXi versions are released. Is that limitation still applicable?  I don’t know, the documentation doesn’t tell me.
  • There is a lack of examples on changing a particular setting. For example, they may show a native ESXCLI method, while completely leaving out PowerCLI and the GUI.
  • There is a profound lack of good documentation on designing and tuning ESXi for more extreme situation. Things like dealing with very large VM’s, designing for high IOPS or high throughput, large memory and vCPU VM’s.  I don’t know, maybe the thought is you should engage professional services (or buy a book), but that seems overkill to me.
  • Tuning and optimizing for specific application workloads. For example, Microsoft Clustering on top of VMware.  Yeah they have a doc, but no it’s not good.  Most of their testing is under best case scenarios, small VM’s, minimal load, empty ESXi servers, etc.  It’s time for VMware to start building documentation based on reality.  To use a lazy excuse like “everyone’s environment is different” doesn’t absolve even an attempt at more realistic simulations.  For example, I would love to see them test a 24 vCPU, 384GB of vRAM VM with other similarlay sized VM’s on the same host, under some decent load.  I think they’d find, vMotion causes a lot of headaches at that scale.
  • Related to above, I find their documentation a little untrustworthy when they say “x” is supported. Supported in what way?  Is vMotion not supposed to cause a failover, or do you simply mean, the vMotion operation will complete?  Even still, there are SO many conflicting sub-notes it’s just confusing to know what restrictions exist and what doesn’t.  It’s almost like the writer doesn’t understand the application they’re documenting.

Support:

If there is one thing that has taken a complete downward spiral, it’s support.  Like, the VMware execs basically decided customers don’t need good support and decided to outsource it to the cheapest entity out there.  Let me be perfectly clear, VMware support sucks, big time, and I’m talking about production support just to be clear.  Sure, I occasionally get in touch with someone that knows the product well, communicates clearly, and actually corresponds within a reasonable time, but that’s a rarity.  Here are just a few examples of areas that they drop the ball in.

  • Many times, they don’t contact you within your time zone. Meaning, if I work 9 – 5 and I’m EST, I might get a call at 5 or 6, or an email at 4am.
  • Instead of coordinating a time with you, they just randomly call and hope you’re there, otherwise its “hey, get back to me when you’re ready”, which is followed by another 24-hour delay (typically). Sometimes attempts to coordinate a time with them works, other times it doesn’t.
  • I have seen plenty of times where they say they’ll get back to you the next day, and a week or more goes by.
  • Re-Opening cases, has led to me needing to work with a completely different tech. A tech that didn’t bother reading the former case notes, or contacting the original owner to get the back story.  In essence, I might as well have opened a completely new case.
  • Communication is hit or miss. Sometimes, they communicate well, other times, there’s a huge breakdown.  It’s not so much understanding words, but an inability to understand tone, the severity of the situation, or other related factors.
  • Being trained in products that have been out for months. I remember when I called about some issues with a PSC appliance 6 MONTHS after vSphere 6 was released, and the tech didn’t have a clue on how the PSC’s worked.  I had to explain to him the basics, it was a miserable experience.
  • Having a desire to actually figure out an issue, or really solve a problem. It’s like they read from a book, and if the answer isn’t there, they don’t know how to think beyond that.

While we’re still on the support topic, this whole notion of business critical and mission critical support is a little messed up.  I guess VMware basically wants us to fund the salary of an entire TAM or something like that, which is bluntly stupid.  It doesn’t matter if I’m a company with one socket of Enterprise Plus, or a company with 100 sockets, we apparently all pay the same price.  I don’t entirely have a problem with pay a little extra to get access to better support, but then it should be something that’s an upgrade to my production support per socket, not a flat fee. Again, it should be based around fair consumption.

Sales:

You know when I hear from my sales team, when they want to sell me something.  They don’t call to check-in and see if I’m happy.  They’re not calling to go over the latest features included with products I own to make sure I’m maximizing value, none of that happens.  All that kind of stuff is reactive at best.  It’s ME reaching out to learn about something new, or ME reaching out to let them know support is really dropping the ball.  I spend a TON of money on VMware, I’d like to see some better customer service out of my reps.  I have vendors that reach out to me all the time, just to make sure things are going ok.  A little effort like that, goes a long way in keeping a relationship healthy.

Website:

I want to pull my hair out with your website.  Finding things is so tough, because your marketing team is so obsessed with big stupid graphics, and trying to shove everything and anything down my throat.  You’re a company that sells lean and mean software, and your website should follow the same tone.  Everything is all over the place with your site.  Also, it’s 2017, having a proper mobile optimized site would be nice too.

Finally, you guys run blogs, but one thing I’ve noticed is you stop allowing new comments after “x” time.  Why do you do this?  I might need further clarification on a topic that was written, even if it’s years ago.

Cloud and innovation:

This one is a tough area, I’m not sure what to say, other than I hope you’re not the next Novell.  You guys had a pretty spectacular fail at cloud, and I could probably go into a lot of reasons, and most of them wouldn’t be related to Microsoft or AWS being too big to beat.  I suspect part of it was you guys got fat, lazy and way too cocksure.  It’s ok, it happens to a lot of companies, and professionals alike.  While it’s hard for me to forsee someone wanting to consume a serverless platform from you guys, I wouldn’t find it hard to believe that someone might want to consume a better IaaS platform than what’s offered by Microsoft or AWS.  While they have great automation, their fundamental platform still leaves a lot to be desired.  That to me, is an area that you guys could still capture.  I could foresee a great use case for a virtual colocation + all the IaaS scalability and automation abilities.  I still have to shutdown an Azure VM for what feels like every operation, need I say more?

Closing:

Look I could probably keep going on, and one may wonder why stop, I’m already at 6,000 plus words.  I will say kudos to you, if you’ve actually read this far and didn’t simply skip down.  However, the point of this post wasn’t to tear down VMware, nor was it to go after writing my longest post ever.  I needed to vent a little bit, and wanted VMware to know that I’m frustrated with them and what they could do to fix that.  I suspect a lot of my view points aren’t shared by all, but in turn, I’m sure some are.  VMware was the first tech company that I was truly inspired by.  To me, they exemplified what a tech company should strive to be, and somewhere along the way, they lost it.  Here’s to hoping VMware will be with us for the long haul, and that what’s going on now, is simply a bump in the road.

 

Powershell Scripting: Get-ECSESXHostVIBToPCIDevices

Introduction:

If you remember a little bit ago, I said I was trying to work around the lack of driver management with vendors.  This function is the start of a few tools you can use to potentially make your life a little easier.

VMware’s drivers are VIBS (but not all VIBS are drivers).  So the key to knowing if you have the correct drivers is to find which VIB matches which PCI device.  This function does that work for you.

How it works:

First, I hate to be the bearer of bad news, but if you’re running ESXi 5.5 or below, this function isn’t going to work for you.  It seems the names of the modules  and vibs don’t line up via ESXCLI in 5.5, but they do in 6.0.  So if you’re running 6.0 and above, you’re in luck,.

As for how it works, its actually pretty simple.

  1. Get a list of all PCI devices
  2. Get a list of all modules (which aren’t the same as VIBS).
  3. Get a list of all VIBs.
  4. Loop through each PCI device
    1. See if we find a matching module
      1. Take the module and see if we find a VIB that matches it.
  5. Take the results of each loop, add it to an array
  6. Spit out the array once the function is done running
  7. Your results should be present.

How to execute it:

Ok, to begin with, I’m not doing fancy pipelining or anything like that.  Simply enter the name of the ESXi host as it is in vCenter and things will work just fine.  There is support for verbose output if you want to see all the PCI devices, modules and vibs that are being looped through.

Get-ECSESXHostVIBToPCIDevices -VMHostName "ServerNameAsItIsInvCenter"

If you want to do something like loop through a bunch of hosts in a cluster, that’s awesome, you can write that code :).

How to use the output:

Ok great, so now you’ve got all this output, now what?  Well, this is where we’re back to the tedious part of managing drivers.

  1. Fire up the VMware HCL web site and go to the IO devices section
  2. Now, there are three main columns from my output that you need to find the potential list of drivers.  Yeah, even with an exact match, there maybe anywhere from 0 devices listed (take that as you’re running the latest) to having on or more hits.
    1. PCIDeviceSubVendorID
    2. PCIDeviceVendorID
    3. PCIDeviceDeviceID
  3. Those three columns are are all you need.  Now a few notes with this.
    1. if there are less than four characters, VMware will add leading zeros on their web drop down picker.  For example, if my output shows “e3f”, on VMwares drop down picker, you want to look for “0e3f”.
    2. if you get a lot of results, what I suggest doing next, is seeing if the vendor matches your server vendor.  If you find a server vendor match and there are still more than one result, see if its something like the difference between a dual port or single port card.  If you don’t see your server vendor listed, see if the card vendor is listed.  For example, in UCS servers, instead of seeing Cisco for a RAID controller, you would likely find a match for “Avago” or “Broadcom”.  Yeah, it totally gets confusing with HW vendors buying each other LOL.
  4. Once you find a match, the only thing left to do, is look at the output of column “ModuleVibVersion” in my script and see if you’re running the latest driver available, or if it at least is recent.  Just keep in mind, if you update the driver, make sure the FW you’re running is also certified for that driver.

Where’s the code?

Right here

What’s next / missing?

Well, a few things:

  1. I haven’t found a good way yet to loop through each PCI device and see its FW version.  That’s a pretty critical bit of info as I’ve said before.
  2. Even if i COULD find the firmware version for you, you’re still going to need to cross reference it against your server vendor.  Without an API, this is also going to be a tedious process.
  3. You need to manually check the HCL because in 2017, VMware still doesn’t have an API, let alone a restful one to do the query.  If we had that, the next logical step would be to take this output and query an API to find a possible match(es).  For now, you’ll need to do it manually.
    1. Ideally, the same API would let you download a driver if you wanted.
  4. VMware lacks an ability to add VIBS via PowerCLI or really manage baselines and what not.  So again, VMware really dropping the ball here.  This time it’s the “Update Manger” team.

Conclusion:

Hope this helps a bit, it’s far from perfect, but I’ve used it a few times, and found a few NIC drivers and RAID controllers that had older drivers.

Thinking out loud: Why do server vendors still struggle with driver and firmware management?

History:

Let me give you a little back story before digging into the meat of this post.  My team and I make a very concerted effort to keep our servers firmware and drivers updated.  We’ve gone so far as to purchase software from Dell, implement a process on how firmware / drivers are to be updated, and ensuring that its routinely done every quarter.  We do this because in general it’s a best practice, but also because we’ve run into too many occasions where troubleshooting with a vendor stops (if it ever starts) very quickly if the drivers / firmware isn’t recent.  In essence, we’re doing our best to be diligent and proactive with keeping our servers healthy, secure and updated.

Late last year we ran into two issues, both of which are related to drivers / firmware.

  1. A Broadcom NIC causing a purple screen of death (PSOD) in ESXi. This was a server that was freshly rebuilt and all drivers (or so we thought) and firmware updated.  Turns out the driver we were running was more than two years old and the PSOD we were having was a resolved issue in a newer driver.
  2. An Intel x710 quad port 10Gb NIC causing packets to black hole for certain VM’s on the same vLAN. Again, these were new hosts that were patched, firmware updated and in theory up to date.  This issue is what really triggered us to start evaluating other server vendors and their solutions.

 

Of those two issues above, only one was solved with a simple driver update, and the other, we just gave up on and switched to a different NIC (x520).

The Issue:

If you can’t see where this post is going already, let me lay it out.  Server vendors still can’t properly manage their own drivers, firmware and vendor specific software. I know what you’re thinking, you have tools that the vendor provided, you’re using them and you’re fine.  I hate to be the bearer of bad news, but I doubt it.  We just got done a rigorous evaluation of Dell, HP and Cisco.  None of them have a complete solution.  Don’t get me wrong, some of them are getting there, but no one has the problem solved.  If you’re wondering what the specific problems are, see the bullets below.

  • Server vendors would like that you keep your firmware, drivers and tools up to date.
  • OS vendors (and sometimes server vendors) require that the driver and firmware have a certified pairing. It is NOT good enough to simply have the latest driver and the latest firmware.  This of course may vary slightly depending on which server and OS vendor.  VMware though as an example, absolutely has driver and firmware paring that’s required.
    • This driver and firmware pairing is typically worked out between the server and OS vendor.
    • VMware has a strict HCL for this use case, and TMK, MS has an HCL, although they’re a little more forgiving when it comes to the pairing. I can’t speak about Linux, Solaris or other OS’s.

 

Think about this, when was the last time you did the following?

  • Retrieved an inventory of all your hardware firmware revisions and driver revisions.
    • Do you even know how to do this? Probably not as easy as you think.
  • Logged into your OS vendors HCL and one hardware item at a time, checked if you are running the latest driver and firmware, and that the pairing is also certified.
    • With VMware, you can use the device vendor ID, device ID and sub vendor ID to find the specific hardware in question on their HCL. Just remember its hex values.

I bet you’re either relying on the following.

  • VMware update manager, and vendor provided depots (if they exist).
  • Vendor supplied firmware management solutions.
    • Some may have driver management for select OS, but no one does it all.
  • Vendor custom ISOs / install discs.

I’m going to suspect if you go and check your VMware HCL, you’re out compliance in one way or another, or something is woefully out of date.

Solution:

Let’s share a pipe for a second and dream about what it should look like.

 

Server Vendor:

  • It should be a central console.
  • It should handle downloading all firmware, drivers and vendor specific tools.
  • It should use a concept of baselines. A baseline being defined as.
    • OS and server model specific.
    • Based on a release date. The baseline should define an approved pairing of drivers, firmware and vendor tools for a given month, quarter or however often the server vendor feels the need to establish a new baseline.
    • The baseline should support the concept of cumulative updates / hotfixes.
  • It should support grouping servers, and applying baselines to the server groups.
  • It should support compliance checking for the baselines. Not simply deploying the drivers and firmware and assuming everything is ok.  This would let you know if an admin went rouge and manually updated or downgraded firmware / drivers or tools.
  • It should support rolling back drivers, firmware or tools if it is determined to be too far ahead.
  • Provide very verbose information as to why an update process failed.
  • Bonus points
    • Support a multi-site architecture. Meaning be able to have a cached copy of the repo and a local server to perform the actual update and auditing process.
    • Auto discover servers

OS Vendor:

  • Should provide a comprehensive API
    • Look up hardware and driver pairings.
    • Enable the download of the driver or firmware directly would be nice.

Conclusion:

Coming back to reality a bit, what can you do?  Use the tools you have to the best of your ability, script what you can, and manually deal with the gaps.  That said, I’m working on the auditing part of the problem, at least with VMware. Hope to have blog post about it and a new GitHub commit in the coming month or so, stay tuned.

Oh, one final thing you can do, start bugging your server vendor sales team about the issue. If enough people raise the issue, it will get the attention it desperately needs.