
Thinking out loud: VMware, this is what I want from you

Warning:

This post is clocking in at 6k words.  If you are looking for a quick read, this isn’t for you.

Disclaimer:

Typical stuff: these are my personal views, not the views of my employer.  These are not facts, merely opinions and random thoughts I’m writing down.

Introduction:

I don’t know about all of you, but for me, VMware has been an uninspiring company over the last couple of years.  VMworld used to be a time when I would get excited.  It used to mean big new features were coming, and the platform would evolve in nice big steps.  However, over the last 5 – 7 years, VMware has gotten progressively more disappointing.  My disappointment, however, is not limited to the products alone, but extends to the company culture as well.

This post will not follow a review format like many of you are used to seeing, but instead, will be more of a pointed list of the areas I feel need improvement.

With that in mind, let it go on the record that, in my not so humble opinion, VMware is still the best damn virtualization solution.  I bring these points up not to say that the product / company sucks, but rather to outline that in many ways, VMware has lost its mojo, and IMO addressing some of these areas would be a good step toward recovering it.

The products:

The death of ESXi:

You know, there are a lot of folks out there that want to say the hypervisor is a commodity.  Typically, those folks are either pitching or have switched to a non-VMware hypervisor.  To me, they’re suffering from Stockholm syndrome.  Here’s the deal: ESXi kicks so much ass as a hypervisor.  If you try to compare Hyper-V, KVM, Xen or anything else to VMware’s full-featured ESXi, there is no competition.  I don’t give a crap about anything you will try to point out, you’re wrong, plain and simple.  Any argument you make will get shot down in a pile of flames.  Even if you come at me with the “product x is free” angle, I’m still going to shoot you down.

With that out of the way, it’s no wonder that everyone is chanting the hypervisor commodity myth.  I mean, let’s be real here, what BIG innovation has been released to the general ESXi platform without some upcharge?  You can’t count vSAN because that’s a separate “product” (more on the quotes later).  vVOLs you say?  Yeah, that’s a nice feature, but how long did it take?

So, what else?  How about the lack of trickle down and the elimination of Enterprise edition?  There was a time in VMware’s history when features trickled down from Enterprise Plus > Enterprise > Standard.  Usually it occurred each year, so by the time year three rolled around, that one feature in Enterprise Plus you were waiting for finally got gifted to Standard edition.  The last feature I recall this happening to was the MPIO provider support, and that was ONLY so they could support vVOLs on Standard edition (TMK).

Here is my view on this subject: VMware is making the myth of a commoditized hypervisor a self-fulfilling prophecy.  Not only is there a complete lack of innovation, but there’s no trickle down occurring either.

If you, as a customer, have gone from receiving regular (significant) improvements as part of your maintenance agreement to basically nothing year over year, why would you want to continue to invest in that product?  Believe me, the thought has crossed my mind more than once.

From what I understand, VMware’s new business plan is to make “products” like vSAN that depend on ESXi, but that aren’t included with the ESXi purchase.  Thus, a new revenue stream for VMware and renewed dependence on ESXi.  First glance says it’s working, at least sort of, but is it really doing as well as it could?  While it sounds like a great business model if you’re just looking at whether you’re in the black or the red, what about the softer side of things?  What is the customer perception of moving innovations to an a la carte model?  For me, I wonder if they took the approach below, would it have had the same revenue impact they were looking for, while at the same time also enabling a more positive customer perception?  I think so…

  1. First and foremost, VMware needs to make money. I know I just went through that whole diatribe above, but hear me out.  This whole “per socket” model is dead.  It’s just not a sustainable licensing model for anyone.  Microsoft started with SQL and has finally moved Windows to a per core model.  In my opinion, VMware needs to evolve its licensing model in two directions.
    1. Per VM: There are cases where you’re running monster VMs, and while you’re certainly taking advantage of VMware’s features, you’re not getting anywhere near the same value add as someone who’s running 20, 30, 50, 100 VMs per host.  Allowing customers to allocate per VM licenses to a single host or an entire cluster would be a fair model for those that aren’t using virtualization for the overcommit, but for the flexibility.
    2. Per Core: I know this is probably the one I’m going to get the most grief for, but let’s be real, YOU KNOW it’s fair.  Let’s just pretend VMware wasn’t the evil company that Microsoft is, and actually let you license as few as 2 cores at a time.  For all of you VARs that have to support small businesses, or for all of you smaller businesses out there, how much likelier would you have been to just do a full-blown ESXi implementation for your clients?  Let’s just say VMware charged $165 per core for ESXi Standard edition and your client had a quad core server.  Would you think $659 would be a reasonable price?  I get that number simply by taking VMware’s list price and dividing by 8 cores, which is exactly how Microsoft arrived at their trade-ins for SQL and Windows.  NOW, let’s also say you’re a larger company like mine and you’re running Enterprise Plus.  The new 48 core server I’m looking at would normally cost $11,238 at list for Enterprise Plus.  However, if we take my new per core model, that server would now cost ($703 per core) $33,714.  That’s approximately $22k that VMware is losing out on for just ONE server.  I know what you’re thinking: Eric, why in the world would you want to pay more?  I don’t, but I also don’t want a company that makes a kick ass product to stagnate, or worse, crumble.  I’ve invested in a platform, and I want that platform to evolve.  In order for VMware to evolve, it needs capital.
  2. Ok, now that we have the above out of the way, I want a hell of a lot more out of VMware for that kind of cash, so let’s dig into that.
    1. vSAN should have never been a separate product. Including vSAN in that per core or per VM cost, just like they do with Horizon, would add value to the platform.  Let’s be real, not everyone is going to use every feature of VMware.  I’m personally not a fan of vSAN, but that doesn’t mean I shouldn’t be entitled to it.  This could easily be something that is split between Standard and Enterprise Plus editions.
      1. Yes, that also means the distributed switch would trickle down into Standard edition, which it should have by now.
    2. Similar to vSAN, NSX should really be the new distributed switch. I’m not sure exactly how to split it across the editions, but I think some form of NSX should be included with Standard, and the whole darn thing for Enterprise Plus.
    3. At this stage, I think it’s about time for Standard edition to really become the edition of the 80%. Meaning, 80% of the companies would have their needs met by Standard edition, and Enterprise plus is truly reserved for those that need the big bells and whistles.  A few notable things I would like to trickle down to Standard Edition are as follows.
      1. DRS (Storage and Host)
      2. Distributed Switch (as pointed out in 2ai)
      3. SIOC and NIOC
      4. NVIDIA Grid
  3. As for Enterprise Plus and Enterprise Plus with Ops Manager, those two should merge and be sold at the same price as Enterprise Plus. I would also like to see some more of the automation aspects from the cloud suite brought into the Enterprise Plus edition as well.  I kind of view Enterprise Plus as the edition that focuses on all the automation goodies that smaller companies don’t need.
  4. IMO, selling vCenter as a separate SKU is just silly. So as part of all of this, I would like to see vCenter simply included with your per core or per VM licenses.  At the end of the day, a host can only be connected to one vCenter at a time anyway.
  5. Include a Log Insight license for every ESXi host sold, strictly used for collecting and managing a host’s VMware logs, including those of the VMs running on top of it. I don’t mean logs inside the guest OS, rather things like the vmware.log as an example.

Evolving the features:

vCenter changes:

I know I was a little tough on VMware in the intro, and while I still stand behind my assertion about their lack of innovation, what they’ve done with the VCSA is pretty kick ass.  I would say it’s long overdue, but at least it’s finally here.  That said, there’s still a ton of things VMware could be doing better with vCenter.

  1. If you have ever tried to set up a simplistic but secure profile for some self-service VM management, you know that it’s a nightmare. 99% of that problem is attributed to VMware’s very shitty ACL scheme.  The way permission entitlements work is confusing, conflicting, and ultimately leads to granting more access than you should, just so you can get things to work.  It shouldn’t be this difficult to set up a small resource pool, a dedicated datastore and a dedicated network, and yet it is.  I would love to see VMware duplicate the way Microsoft handles ACLs, because to be 100% honest, they’ve nailed it.
  2. In general, the above point wouldn’t even be an issue if VMware would just create a multi-tenancy ability. I’m not talking about wanting a “private cloud”.  This isn’t a desire for more automation or the like, simply a built-in way to securely carve up logical resources and allocate them to others.  I would LOVE to have an easy way for my Dev, QA and DBAs to all have access to discrete buckets of resources.
  3. So, I generally hate web clients, and nothing reinforced that more than VMware. Don’t get me wrong, web clients can be great, but the vSphere web client is not.  Here is what I would like to see, if you’re going to cram a web client down my throat.
    1. Finish the HTML5 client before ripping the C# client away from us. The Flash client is terrible.
    2. Whoever did the UI design for the C# client mostly got it right the first time. The web client should have duplicated the aspects of the C# client that worked well.  Things like the right click menu, the color schemes and icons.  I have no problem with seeing a UI evolve over time, but us old heads like things where they were.  The web clients feel like developers just moved shit around for no reason.  The manage vs. monitor tab gets a big thumbs up from me, but it’s after that where it starts to fall apart.  Simple things like the storage paths, which used to be a right click on the datastore, have moved to who knows where.  Take a lesson from Windows 8 and 10, because those UIs are a disaster.  Moving shit around for the sake of moving it around is the wrong approach.  Apple’s OS X UI is the right way to progress change.
  4. The whole PSC + vCenter integration feels half-assed if you ask me. I think a lot of admins have no clue why these roles should be separate, how to properly admin the PSCs, and if shit breaks, good luck.  It was like one day you only had vCenter, and the next thing you know, there’s this SSO thing that nobody quite understands, and then the PSC pops out of nowhere.  It wasn’t a gradual migration, rather this huge burst of changes to authentication, permissions and certificate management.  I would say there’s a better understanding of the PSCs at this point, but it wasn’t executed in a good way.  Ultimately though, I still think the PSCs need some TLC.  Here are a few things I’d like to see.
    1. You guys need to make vCenter and the like smart enough to not need a load balancer in front of the PSCs. When vCenter joins a PSC domain, it should become aware of all PSCs that exist, and have automated failover.
    2. There should be PowerCLI for managing the PSCs, and I mean EVERYTHING about them. Even the stuff you might only run for troubleshooting.
    3. There should be a really friendly UI that walks you through a few scenarios.
      1. Removing a PSC cleanly.
      2. Removing an orphaned PSC or other components (like vCenter).
      3. Putting a PSC into maintenance mode. (which means a maintenance mode should exist)
      4. Troubleshooting replication.
        1. Show the status
        2. Let us force a replication
      5. Rolling back / restoring items, like users or certs.
      6. Re-linking a vCenter that’s orphaned, or even transferring a vCenter persona to a new vCenter environment.
      7. How about some really good health monitors? As in like single API / PowerCLI command type of stuff.
      8. Generating an overall status report.
  5. Update Manager, while an awesome feature, hasn’t seen much love over the years, and what I’d really like to see is as follows.
    1. Let me remove an individual update, and provide an option to delete the patch on disk, or simply remove the update from the DB.
    2. Scan the local repo for orphaned patches (think in the above scenario where someone deletes a patch from update manager, without removing it from the file system).
    3. Add the dynamic baseline ability to all classifications of updates, not just patches themselves. Right now, we can’t create a dynamic extensions baseline.
    4. Give me PowerCLI admin abilities. I’d love to be able to use PowerCLI to do all the things I can do in the GUI.  Anything from uploading a patch, to creating baselines.
    5. Open the product up, so that vendors could integrate firmware remediation abilities.
    6. Have an ability to check the VMware HCL for updated VIBs, that are certified to work with the current firmware we’re running. This would make managing drivers in ESXi so much easier.
    7. Offer a query derived baseline. Meaning let us use things like a SQL query to determine what a baseline should be.
    8. Check if a VIB is applicable before installing it, or at least have an option for it. Things like, “hey, you don’t have this NIC, so you don’t need this driver”.  I’ve seen drivers that had nothing to do with the HW I had get installed and actually cause outages.
  6. There are still so many things that can’t be administered using PowerCLI, at least not without digging into extension data or using methods (see the sketch after this list for what that workaround looks like today). Keep building the portfolio of cmdlets.  I want to be able to do everything in PowerCLI that I can in the GUI.  Starting with the admin stuff, but also on top of that, doing vCenter type tasks like repointing or other troubleshooting tasks.
  7. How about overhauled host profiles?
    1. Provide a Microsoft GPO-like function. Basically, present me a template that shows “not configured” for everything and explains what the default setting is.  Then let me choose whatever values are supported and apply that vCenter wide, datacenter wide, folder / cluster wide or host specific.
      1. Similar feature for VM settings.
      2. Support the concept of inheritance, blocking and overrides.
    2. Let me create a host independent profile, and perhaps support the concept of sub-profiles for cases where we have different hosts. Basically, let me start with a blank canvas and enable what I want to control through the profile.
  8. Let us manage ESXi local users / groups and permissions from vCenter itself. In fact, having the ability to automatically create local users / groups via a GPO-like policy would be great.
  9. I had an issue where a 3rd party plugin kept crashing my entire vSphere web client. Why in the world can a single plugin crash my soon-to-be-only admin interface?  That’s a very bad design.  Protect the admin interface; if you have to kill something, kill the plugins, and honestly, I’d much rather see you simply kill the troublesome plugin.  Adding to that, actually have some meaningful troubleshooting abilities for plugins.  Like “hey, I needed more memory, and there wasn’t enough”.
  10. vCenter should serve as a proxy for all ESXi access. Meaning if I want to upload an ISO, or connect to a VM’s console, proxy those connections through vCenter.  This allows me to keep ESXi more secure, while still allowing developers and other folks to have basic access to our VMware environment.
  11. Despite its maturity, I think vMotion and DRS need some love too.
    1. Resource pools basically get ripped apart during maintenance mode evacuations or when moving VMs (if you’re not careful). VMware should develop a wizard similar to what’s done when you move storage.  That is, default to leaving a VM in its resource pool when we switch hosts, but ask if we’d like to switch it to a different resource pool.
    2. I would love to see a setting or settings where we can influence DRS decisions a bit more in a heavily loaded cluster. For example, I’ve personally had vCenter move VMs to hosts that didn’t have enough physical memory to back the allocated memory, and guess what happened?  Ballooning like a kid’s birthday party.  Allow us to have a tick box or something that prevents VMs from moving to hosts that don’t have enough physical memory to back the allocated + overhead memory of the VMs.
    3. Would love to see fault zones added to compute. For example, maybe I want my anti-affinity rules to not only be host aware, but fault zone aware as well.
      1. Have a concept of dynamic fault zones based on host values / parameters. For example, the rack that a host happens to run in.
    4. Show me WHY you moved my VM’s around in the vMotion history.
  12. How about a mobile app for basic administration and troubleshooting? I shouldn’t need a third party to make that happen.  And for the record I know you have one, I want it to be good though.  I shouldn’t need to add servers manually, just let me point at vCenter(s) and bring everything in.
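
For anyone wondering what “digging into extension data or using methods” actually means, here’s a minimal, read-only sketch of the workaround as it stands today.  The server and host names are hypothetical; the point is simply that anything without a first-class cmdlet means dropping down to the raw vSphere API object sitting behind the PowerCLI output.

```powershell
# A minimal sketch of the ExtensionData workaround (names are hypothetical).
Connect-VIServer -Server 'vcenter.lab.local'

$vmhost   = Get-VMHost -Name 'esx01.lab.local'
$hostView = $vmhost.ExtensionData        # the raw HostSystem API object behind the cmdlet output

# Read-only examples: properties with no dedicated cmdlet, pulled straight from the API view.
$hostView.Hardware.BiosInfo.BiosVersion
$hostView.Config.PowerSystemInfo.CurrentPolicy.ShortName

# API methods work the same way when no cmdlet exists; you call them directly on the view object.
```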

SDRS, vVOLS, vSAN and storage in general:

If I had to pick a weak spot of VMware, it would be storage.  It’s not that it’s bad, it’s just that it seems slow to evolve.  I get it, it’s super critical to your environment, but by the same token, it’s super critical to my environment, and that means I need them to keep up with demand.  Here are some examples.

  1. Add support for tape drives, and I mean GOOD support / GOOD performance. This way my tape server can finally be virtualized too without the need to do things like remote iSCSI, or SR-IOV.  I know what some of you might be thinking, tape is dead.  Wish it were true, but it’s not.  What I really want to see VMware do is have some sort of library certification process, and then enable the ability to present a physical library as a virtual one to my VM.  Either that, or related to that, let me do things like raw device mappings of tape drives.  Give me something like a virtual SAS or Fibre Channel card that can do a raw mapping of a tape library.  Even cooler would be enabling me to have those libraries be part of a switch, and enabling vMotion too.
  2. I still continue to sweat bullets about the amount of open storage I have on a given host, or at least when purchasing new hosts. It’s 2017, a period of time where data has been growing at incredible rates, and ESXi is still tuned for 32TB of open storage by default?  I know that sounds like a lot, but it really isn’t.  To make matters worse, the tuning parameter to enable more open storage (VMDKs on VMFS) is buried in an advanced setting and not documented very well (see the sketch after this list for what poking at it looks like).  If the memory requirements are negligible, ESXi should be tuned for the max open storage it can support.  Beyond that, VMware should throw a warning if the amount of open storage exceeds the configured storage pointer cache.  Why bury something so critical and make an admin dig through log messages to know what’s going on (after the fact mind you)?
    1. Related to the above, why is ESXi even limited to 128TB (pointer cache)? Don’t get me wrong, it’s a lot of storage, but it’s not exactly a wow factor.  A PB of open storage would be a more reasonable maximum IMO.   If it’s a matter of consuming more memory (and not performance), make that an admin choice.
  3. RDM’s via local RAID should be a generally supported ability. I know it CAN work in some cases, but it’s not a generally supported configuration.  There are times where an RDM makes sense, and local RAID could very much be one of those cases.  I should be able to carve up vDisks and present them to a VM directly.
  4. How about better USB disk support? It’s more of a small business need, but a need nonetheless.  In fact, to be even more generic, removable disks in general.
  5. Why in the world is removing a disk/LUN still such an involved task? There should literally be a right click, delete disk, and then the whole workflow kicks off in the background.  Needing to launch PowerCLI and do an unmount, detach process is just a PITA.  There shouldn’t even need to be an order of operations.  I mean, in Windows I can just rip the disk out and no issues occur (presuming nothing’s on the disk of course).  I don’t mind VMware making some noise about a disk being removed, but then make it an easy process to say “yeah, that disk is dead, whack it from your memory”.
  6. Pretty much everything on my vSAN / what’s missing in HCI posts has gone unimplemented in vSAN. You can check that out here and here.  That said, they have added a few things like parity and compression / dedupe, but that’s nothing in the grand scheme of things.
    1. What I really wish vSAN was / is, is a non-hyperconverged storage solution. As in, I wish I could install vSAN as a standalone solution on storage, and use it as a generic SAN for anything, without needing to share it with compute.  Hedvig storage has the right idea.  Don’t know what I’m talking about?  Go check them out here.  Just imagine what vSAN could do with all that potential CPU power if it didn’t have to hold itself back for the sake of the VMs.  And yes, THIS would be worthy of a separate product SKU.
  7. SDRS:
    1. I wish VMware would let you create fault zones with SDRS. This way when I create VM anti-affinity rules and specify different fault zones, I’d sleep better at night knowing my two domain controllers weren’t running on the same SAN, BUT, that they could move wherever they needed to.
    2. It would be really great to see SDRS have the ability to balance VMs across ANY storage type, and have its use expanded to local storage as well.  For example, I would love to see vVOLs have SDRS in front of them.  So, my VMs could still float from SAN to SAN, even if they’re vVOLs.  For the local storage bit, what if I have a few generic local non-SAN LUNs?  I could still see there being value in pooling that storage from an automation standpoint.
    3. I would love to see a DRS integration for non-shared storage DRS. I know it would be REALLY expensive to move VM’s around.  But in the case of things like web servers, where shared storage isn’t needed, and vSAN just adds complexity, I could see this being a huge win.  If nothing else, it would make putting a host into maintenance mode a lot easier.
    4. Let me have affinity rules in the Standard edition of VMware. This way I can at least be warned that I have two VMs commingling on the same host that shouldn’t be.
  8. vFlash (or whatever it’s called)
    1. It would be nice to see VMware actually continue to innovate this. For example.
      1. Support for multiple flash drives per host and LARGE flash drives per host.
      2. Cache a data store instead of a single VM. This way the cache is used more efficiently.  Or make it part of a storage policy / profile.
      3. Do away with static capacity amounts per VMDK. In essence offer a dynamic cache ability based on the frequency of the data access patterns.
      4. I would also suggest write caching, but let’s get decent read caching first.
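
To put a finer point on the “buried advanced setting” complaint earlier in this list, here’s roughly what checking and raising the open storage limit looks like today.  This is a minimal sketch, assuming the VMFS3.MaxAddressableSpaceTB advanced setting that governs the pointer block cache on the builds I’ve been running; the host name and values are illustrative.

```powershell
# Check and (optionally) raise the pointer block cache limit that caps how much open
# VMDK capacity a host is tuned for. Host name and values are illustrative.
$vmhost  = Get-VMHost -Name 'esx01.lab.local'
$setting = Get-AdvancedSetting -Entity $vmhost -Name 'VMFS3.MaxAddressableSpaceTB'

"{0} is currently tuned for {1} TB of open storage" -f $vmhost.Name, $setting.Value

# Raise it toward the max if you expect to exceed the default (it consumes a bit more memory).
$setting | Set-AdvancedSetting -Value 128 -Confirm:$false
```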

ESXi itself:

The largest stagnation in the platform has been ESXi itself.  You can’t count vSAN or NSX if you’re going to sell them as separate products.  Here are some areas I would like to see improved.

  • I would love to see the installation wizard ask more questions early on, so that when they’re all answered, my host is closer to being provisioned. I understand that’s what the host deploy is for, but that’s likely overkill for a lot of customers.
    • ASK me for my network settings and verify they work.
    • ASK me if I want to join vCenter and if so, where I want the host located
    • ASK me if I want to provision this host straight to a distributed switch so I don’t need to go through the hassle of migrating to one later.
  • Let the free edition be joined to vCenter. This way we can at least move a VM (shut down) from one host to another, and also be able to keep the hosts updated.  I could see a great use case for this if developers want / need dedicated hosts, but we need to keep them patched.  I’m not asking you to do anything other than let us patch them, move VMs cold, and be able to monitor the basic health of the host.  Keep all the other limits in place.
  • Give us an option to NEVER overcommit memory. I’d rather see a VM fail to power on, not migrate or anything if it’s going to risk memory swapping / ballooning.
  • Make reservations an actual “reservation”. If I say I want the whole VM’s memory reserved, pre-reserve the whole memory space for that VM, regardless of whether the VM is using it.  (The closest thing today is manually setting a full per-VM reservation; see the sketch after this list.)
  • Support for virtualizing other types of HW, like SSL offload cards and presenting them to VMs. I suspect this would also involve support from the card vendors of course, but it would still be a useful thing to see.  For example, SSL offloading in our virtual F5’s.
  • I want to see EVERYTHING that can be done in ESXCLI and the other troubleshooting / config tools also be available in PowerCLI.
  • Have a pre-canned command I can run to report on all hardware, its drivers, firmware and modules.
  • I think it would be kind of slick to run ESXi as a container. Perhaps I want to carve up a single physical ESXi host, into a couple of smaller ESXi hosts and use the same license.  Again, developers would be a potentially great use case for this.
  • I would like to see the ability to export and import an ESXi image to another physical server. A simple use case would be migrating an ESXi install from one physical box to another.  Maybe even have a wizard for remapping resources such as the NICs and the log location.  I’m not talking about a host backup, more like a host migration wizard.
  • Actually get ESXi joining Active Directory working reliably.
  • How about showing us active NFC connections, how much memory they’re consuming and the last time they were used. While we’re at it, how about supporting MORE NFC connections.
  • Create a new VMkernel traffic type for NFC and cold migration traffic, with a related friendly name.
  • Help us detect performance issues more easily with esxtop. Meaning, if there are particular metrics that have crossed well known thresholds, maybe raise an event or something in the logs.  Related though, perhaps offer a GUI (or PowerCLI) option for creating / scheduling an ESXTOP trace and storing the results in a CSV.
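
On the reservation point above, the closest thing available today is manually pinning a full memory reservation per VM, something like the sketch below.  The VM name and size are hypothetical; it takes ballooning and swapping off the table for that VM, though it still doesn’t pre-fault the memory the way I’d like.

```powershell
# A sketch of today's closest approximation: a full memory reservation per VM so its
# memory is never ballooned or swapped. VM name and size are hypothetical.
Get-VM -Name 'app01' |
    Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemReservationGB 16   # match the VM's configured memory
```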

Evolving the company:

Documentation:

Look, almost everyone hates being stuck with documenting things, or at least I do.  However, it’s something that everyone relies on, and when done well, it’s very useful.   I get that VMware is large and complex, so I have to imagine documentation is a tough job.  Still, I think they need to do better at it.  Here is what I see that’s not working well.

  • KB articles aren’t kept up to date as new ESXi versions are released. Is that limitation still applicable?  I don’t know, the documentation doesn’t tell me.
  • There is a lack of examples on changing a particular setting. For example, they may show a native ESXCLI method, while completely leaving out PowerCLI and the GUI.
  • There is a profound lack of good documentation on designing and tuning ESXi for more extreme situations. Things like dealing with very large VMs, designing for high IOPS or high throughput, and large memory and vCPU VMs.  I don’t know, maybe the thought is you should engage professional services (or buy a book), but that seems overkill to me.
  • Tuning and optimizing for specific application workloads. For example, Microsoft Clustering on top of VMware.  Yeah they have a doc, but no, it’s not good.  Most of their testing is under best case scenarios: small VMs, minimal load, empty ESXi servers, etc.  It’s time for VMware to start building documentation based on reality.  Using a lazy excuse like “everyone’s environment is different” doesn’t absolve them from even attempting more realistic simulations.  For example, I would love to see them test a 24 vCPU, 384GB of vRAM VM with other similarly sized VMs on the same host, under some decent load.  I think they’d find vMotion causes a lot of headaches at that scale.
  • Related to the above, I find their documentation a little untrustworthy when they say “x” is supported. Supported in what way?  Is vMotion not supposed to cause a failover, or do you simply mean the vMotion operation will complete?  Even still, there are SO many conflicting sub-notes that it’s just confusing to know what restrictions exist and which don’t.  It’s almost like the writer doesn’t understand the application they’re documenting.

Support:

If there is one thing that has taken a complete downward spiral, it’s support.  Like, the VMware execs basically decided customers don’t need good support and decided to outsource it to the cheapest entity out there.  Let me be perfectly clear, VMware support sucks, big time, and I’m talking about production support just to be clear.  Sure, I occasionally get in touch with someone that knows the product well, communicates clearly, and actually corresponds within a reasonable time, but that’s a rarity.  Here are just a few examples of areas that they drop the ball in.

  • Many times, they don’t contact you within your time zone. Meaning, if I work 9 – 5 and I’m EST, I might get a call at 5 or 6, or an email at 4am.
  • Instead of coordinating a time with you, they just randomly call and hope you’re there, otherwise it’s “hey, get back to me when you’re ready”, which is followed by another 24-hour delay (typically). Sometimes attempting to coordinate a time with them works, other times it doesn’t.
  • I have seen plenty of times where they say they’ll get back to you the next day, and a week or more goes by.
  • Re-opening cases has led to me needing to work with a completely different tech. A tech that didn’t bother reading the former case notes, or contacting the original owner to get the back story.  In essence, I might as well have opened a completely new case.
  • Communication is hit or miss. Sometimes, they communicate well, other times, there’s a huge breakdown.  It’s not so much understanding words, but an inability to understand tone, the severity of the situation, or other related factors.
  • A lack of training in products that have been out for months. I remember when I called about some issues with a PSC appliance 6 MONTHS after vSphere 6 was released, and the tech didn’t have a clue how the PSCs worked.  I had to explain the basics to him; it was a miserable experience.
  • A lack of desire to actually figure out an issue, or really solve a problem. It’s like they read from a book, and if the answer isn’t there, they don’t know how to think beyond that.

While we’re still on the support topic, this whole notion of business critical and mission critical support is a little messed up.  I guess VMware basically wants us to fund the salary of an entire TAM or something like that, which is, to put it bluntly, stupid.  It doesn’t matter if I’m a company with one socket of Enterprise Plus, or a company with 100 sockets, we apparently all pay the same price.  I don’t entirely have a problem with paying a little extra to get access to better support, but then it should be something that’s an upgrade to my production support per socket, not a flat fee. Again, it should be based around fair consumption.

Sales:

You know when I hear from my sales team?  When they want to sell me something.  They don’t call to check in and see if I’m happy.  They’re not calling to go over the latest features included with products I own to make sure I’m maximizing value; none of that happens.  All that kind of stuff is reactive at best.  It’s ME reaching out to learn about something new, or ME reaching out to let them know support is really dropping the ball.  I spend a TON of money on VMware, and I’d like to see some better customer service out of my reps.  I have vendors that reach out to me all the time, just to make sure things are going ok.  A little effort like that goes a long way in keeping a relationship healthy.

Website:

I want to pull my hair out with your website.  Finding things is so tough, because your marketing team is so obsessed with big stupid graphics, and trying to shove everything and anything down my throat.  You’re a company that sells lean and mean software, and your website should follow the same tone.  Everything is all over the place with your site.  Also, it’s 2017, having a proper mobile optimized site would be nice too.

Finally, you guys run blogs, but one thing I’ve noticed is you stop allowing new comments after “x” time.  Why do you do this?  I might need further clarification on a topic that was written, even if it’s years ago.

Cloud and innovation:

This one is a tough area, and I’m not sure what to say, other than I hope you’re not the next Novell.  You guys had a pretty spectacular fail at cloud, and I could probably go into a lot of reasons, and most of them wouldn’t be related to Microsoft or AWS being too big to beat.  I suspect part of it was you guys got fat, lazy and way too cocksure.  It’s ok, it happens to a lot of companies, and professionals alike.  While it’s hard for me to foresee someone wanting to consume a serverless platform from you guys, I wouldn’t find it hard to believe that someone might want to consume a better IaaS platform than what’s offered by Microsoft or AWS.  While they have great automation, their fundamental platform still leaves a lot to be desired.  That to me is an area that you guys could still capture.  I could foresee a great use case for a virtual colocation + all the IaaS scalability and automation abilities.  I still have to shut down an Azure VM for what feels like every operation, need I say more?

Closing:

Look, I could probably keep going on, and one may wonder why stop now; I’m already at 6,000 plus words.  I will say kudos to you if you’ve actually read this far and didn’t simply skip down.  However, the point of this post wasn’t to tear down VMware, nor was it to go after writing my longest post ever.  I needed to vent a little bit, and wanted VMware to know that I’m frustrated with them and what they could do to fix that.  I suspect a lot of my viewpoints aren’t shared by all, but in turn, I’m sure some are.  VMware was the first tech company that I was truly inspired by.  To me, they exemplified what a tech company should strive to be, and somewhere along the way, they lost it.  Here’s to hoping VMware will be with us for the long haul, and that what’s going on now is simply a bump in the road.

 

Thinking out loud: Why do server vendors still struggle with driver and firmware management?

History:

Let me give you a little back story before digging into the meat of this post.  My team and I make a very concerted effort to keep our servers’ firmware and drivers updated.  We’ve gone so far as to purchase software from Dell, implement a process for how firmware / drivers are to be updated, and ensure that it’s routinely done every quarter.  We do this because in general it’s a best practice, but also because we’ve run into too many occasions where troubleshooting with a vendor stops (if it ever starts) very quickly if the drivers / firmware isn’t recent.  In essence, we’re doing our best to be diligent and proactive with keeping our servers healthy, secure and updated.

Late last year we ran into two issues, both of which are related to drivers / firmware.

  1. A Broadcom NIC causing a purple screen of death (PSOD) in ESXi. This was a server that was freshly rebuilt and all drivers (or so we thought) and firmware updated.  Turns out the driver we were running was more than two years old and the PSOD we were having was a resolved issue in a newer driver.
  2. An Intel x710 quad port 10Gb NIC causing packets to black hole for certain VMs on the same VLAN. Again, these were new hosts that were patched, firmware updated and in theory up to date.  This issue is what really triggered us to start evaluating other server vendors and their solutions.

 

Of those two issues above, only one was solved with a simple driver update, and the other, we just gave up on and switched to a different NIC (x520).

The Issue:

If you can’t see where this post is going already, let me lay it out.  Server vendors still can’t properly manage their own drivers, firmware and vendor specific software. I know what you’re thinking: you have tools that the vendor provided, you’re using them and you’re fine.  I hate to be the bearer of bad news, but I doubt it.  We just finished a rigorous evaluation of Dell, HP and Cisco.  None of them have a complete solution.  Don’t get me wrong, some of them are getting there, but no one has the problem solved.  If you’re wondering what the specific problems are, see the bullets below.

  • Server vendors would like you to keep your firmware, drivers and tools up to date.
  • OS vendors (and sometimes server vendors) require that the driver and firmware have a certified pairing. It is NOT good enough to simply have the latest driver and the latest firmware.  This of course may vary slightly depending on the server and OS vendor.  VMware, as an example, absolutely has a required driver and firmware pairing.
    • This driver and firmware pairing is typically worked out between the server and OS vendor.
    • VMware has a strict HCL for this use case, and TMK, MS has an HCL, although they’re a little more forgiving when it comes to the pairing. I can’t speak about Linux, Solaris or other OS’s.

 

Think about this, when was the last time you did the following?

  • Retrieved an inventory of all your hardware firmware revisions and driver revisions.
    • Do you even know how to do this? It’s probably not as easy as you think.
  • Logged into your OS vendor’s HCL and, one hardware item at a time, checked whether you are running the latest driver and firmware, and that the pairing is also certified.
    • With VMware, you can use the device vendor ID, device ID and sub vendor ID to find the specific hardware in question on their HCL. Just remember the HCL wants hex values; see the sketch below this list for one way to pull all of this with PowerCLI.
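
Since I mentioned a sketch: here’s roughly what that inventory looks like with PowerCLI today, which is the piece I’m scripting for the auditing problem mentioned later in this post.  The host name is hypothetical and the exact property names can shift between ESXi / PowerCLI versions, so treat this as a starting point rather than gospel.

```powershell
# A rough sketch of a per-host firmware / driver / PCI ID inventory.
# Host name is hypothetical; property names may differ slightly by version.
$vmhost = Get-VMHost -Name 'esx01.lab.local'
$esxcli = Get-EsxCli -VMHost $vmhost -V2

# Driver and firmware version per NIC (the PowerCLI equivalent of 'esxcli network nic get').
foreach ($nic in $esxcli.network.nic.list.Invoke()) {
    $info = $esxcli.network.nic.get.Invoke(@{nicname = $nic.Name}).DriverInfo
    "{0}: driver {1} v{2}, firmware {3}" -f $nic.Name, $info.Driver, $info.Version, $info.FirmwareVersion
}

# PCI IDs for matching devices against the VMware HCL ('esxcli hardware pci list').
# The HCL search wants hex, so normalize whatever form the IDs come back in.
$esxcli.hardware.pci.list.Invoke() |
    Select-Object DeviceName,
        @{N = 'VID';  E = { '{0:x4}' -f [int]$_.VendorId }},
        @{N = 'DID';  E = { '{0:x4}' -f [int]$_.DeviceId }},
        @{N = 'SVID'; E = { '{0:x4}' -f [int]$_.SubVendorId }}
```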

I bet you’re relying on one or more of the following.

  • VMware update manager, and vendor provided depots (if they exist).
  • Vendor supplied firmware management solutions.
    • Some may have driver management for select OS, but no one does it all.
  • Vendor custom ISOs / install discs.

I suspect that if you go and check your VMware HCL, you’re out of compliance in one way or another, or something is woefully out of date.

Solution:

Let’s share a pipe for a second and dream about what it should look like.

 

Server Vendor:

  • It should be a central console.
  • It should handle downloading all firmware, drivers and vendor specific tools.
  • It should use a concept of baselines. A baseline being defined as.
    • OS and server model specific.
    • Based on a release date. The baseline should define an approved pairing of drivers, firmware and vendor tools for a given month, quarter or however often the server vendor feels the need to establish a new baseline.
    • The baseline should support the concept of cumulative updates / hotfixes.
  • It should support grouping servers, and applying baselines to the server groups.
  • It should support compliance checking for the baselines. Not simply deploying the drivers and firmware and assuming everything is ok.  This would let you know if an admin went rogue and manually updated or downgraded firmware / drivers or tools.
  • It should support rolling back drivers, firmware or tools if they are determined to be too far ahead.
  • Provide very verbose information as to why an update process failed.
  • Bonus points
    • Support a multi-site architecture. Meaning be able to have a cached copy of the repo and a local server to perform the actual update and auditing process.
    • Auto discover servers

OS Vendor:

  • Should provide a comprehensive API
    • Look up hardware and driver pairings.
    • Enabling the download of the driver or firmware directly would be nice.

Conclusion:

Coming back to reality a bit, what can you do?  Use the tools you have to the best of your ability, script what you can, and manually deal with the gaps.  That said, I’m working on the auditing part of the problem, at least with VMware. I hope to have a blog post about it and a new GitHub commit in the coming month or so, stay tuned.

Oh, one final thing you can do, start bugging your server vendor sales team about the issue. If enough people raise the issue, it will get the attention it desperately needs.

Thinking out loud: What HP + Nimble means to me

Disclaimer:

These are opinions, not facts and these opinions are mine, not my employers.

Introduction:

Upon receiving the news that HP was to acquire Nimble, I can’t say I was exactly thrilled.  Nothing personal against HP, they make great servers, but I like Nimble the way it is right now.  Nonetheless, I know the industry is moving in a direction where it’s either get big or get out.  There is a huge storm “cloud” looming, and if you’re an on premises solution, it’s going to be a scary time in the coming years.

I was thinking about what would be some of the pros and cons of the HP acquisition and this is what I’ve come up with.

Pros:

  1. HP is a big company and an established one at that. We’ll focus on the pros of being big/established in this section.
    1. HP will likely have an easier time pitching Nimble into companies that would not have given them a second look. HP is established, so there’s a perception that Nimble is established.  This leads to better market penetration.  Nimble getting better market penetration means Nimble makes HP more money, and if Nimble makes HP more money, HP invests more into Nimble.  Hopefully the circle of money keeps snowballing and we all win.
      1. HP is a worldwide company, and while Nimble has done a great job so far, HP is going to take them into more countries faster than they could on their own. If you’ve had difficulty getting Nimble equipment purchased “in country”, I can see this getting easier long term.
    2. Obviously HP has more capital at their disposal than Nimble did. If invested correctly, I could see this accelerating Nimble’s innovation.
    3. HP has more purchasing power than Nimble does, this could lead to Nimble’s margins being better, which in turn may lead to us having a more affordable product (or more profit for HP).
  2. Look, we all know why tech companies pick Supermicro, and it’s not quality, it’s affordability. HP makes some pretty kick ass hardware, so if we were to see Nimble’s hardware platform change from Supermicro to HP equipment, not only would my datacenter look a little sexier, I wouldn’t cut my fingers trying to rack Nimble anymore.
  3. If you’re a current HP customer, I can see two nice integration points.
    1. Infosight for other HP solutions.
    2. Nimble integration into OneView.

Cons:

  1. As mentioned in the pros, HP is a large established company. While this in itself can have some pros, it also has the potential for a number of cons.
    1. Big companies tend to move slowly, with bureaucracy and over-analysis being suspect causes. Nimble had far fewer hoops to jump through before making a decision.  Just remember, deadlines and accomplishments drift a day at a time.  Days become weeks and weeks become months, and you get the picture.
    2. Every company is profit conscious, but some larger companies will kill any sliver of waste, even at the cost of productivity or customer satisfaction. I’m not saying it will happen with HP, only that it could.
    3. While HP will open a lot of new doors for Nimble, it has the potential to close a lot of existing ones too. There are a lot of companies that have had bad experiences with HP and this may be enough for them to drop Nimble.  That said, being realistic, it seems one way or another, you’re going to be purchasing storage from some big vendor, and it may not be the same as the one you purchase servers from.
    4. If HP tries to assimilate Nimble into their ways, I can see this being bad for Nimble customers. Nimble for example has a great support experience.  If HP tries to force Nimble to adopt their triage and support structure, that would be a quick way to devalue Nimble.  There are other things too, like getting stuck speaking with a generic HP sales rep and sales engineer, instead of having direct access to a Nimble SE and a Nimble sales person, or other typical large company sales and support processes.
  2. HP hasn’t made a great name for themselves here of late. We know they’ve split the company in half and sold off a lot of assets.  It’s hard to say if it’s too little too late, or if it was the right move and just in time.  Regardless, HP to me is a company that’s walking a fine line of a falling giant, or one that’s getting back on its feet.  If HP goes down, Nimble goes with it, and that’s not good for Nimble customers.
  3. HP isn’t exactly synonymous with innovation, at least not anymore. I fear that HP has the potential of choking the life out of Nimble.  In my opinion, 3Par was a great storage solution.  Part of me wonders, if HP couldn’t make that work, what makes them think Nimble will be any different?  Meaning, are they going to turn Nimble into the next EqualLogic?

Other thoughts:

I think deep down everyone knew Nimble wanted to get bought.  Me personally, I was REALLY hoping Cisco was going to buy Nimble.  In my opinion, Cisco + Nimble would go together like peanut butter and fluff.  HP already has a storage company that’s flailing.  I don’t want Nimble to follow suit.  People like to remind me about Whiptail and how bad that was.  I look at that as a rash move on Cisco’s part (the solution was doomed to fail), but Nimble would be a pick that no one could blame Cisco for.  Best of all, Cisco doesn’t have any competing products (other than Hyperflex, but that’s a different type of solution).  This would have led to a much stronger and more united focus on pushing Nimble.  From Nimble’s view, it would have solidified them as being established (opening the closed doors), and for Cisco, it would have given them a proven storage startup that’s on fire.  Honestly, if I was Cisco’s CEO, I would be doing everything I could to steal the deal from HP.  If it was a matter of HP vs. Dell vs. Cisco, and Cisco was the one with Nimble, IMO, Cisco would crush the other two like a ten-ton hammer.

Conclusion:

This is obviously all speculation at this point, just thinking out loud.  I hope all the pros of what I pointed out occur with the acquisition and none of the cons.  I wish both vendors the best of luck, and until proven otherwise, I’m still a diehard Nimble (HP) fan.

Thinking out loud: Hyper converged storage’s missing link

Introduction:

In general, I’m not a huge fan of hyper converged infrastructure.  To me, it’s more “hype” than substance at the moment.  It was born out of web scale infrastructure like Google, Facebook, etc., and IMO, that is still the area where it’s better suited.  The only enterprise layer where I see HCI being a good fit is VDI; other than that, almost every other enterprise workload would be better suited on new school shared storage.  I could probably go into a ton of reasons why I personally see shared storage still being the preferred architecture for enterprises, but instead I’ll focus on one area that, if adopted, might change my view (slightly).  You see, there is a balance between the best and good enough.  Shared storage IMO is the best, but HCI could be good enough.

What’s missing?

What is the missing link (pun intended)?  IMO, it’s external / independent DAS.  Can’t see where this is going?  Follow along on why I think external DAS would make hyper converged storage good enough for almost anyone’s environment.

Scaling Deep:  Right now the average server tops out at 24 2.5” drives, and fewer for 3.5” drives.  In a lot of larger shops, that would mean running more hosts in order to meet your storage requirements, and that will come at the cost of paying for more CPU, memory and licensing than you should have to.  Just imagine a typical 1RU R630 + a 2U 60 drive JBOD!  That’s a lot more storage than you can fit under a single host today, and it would only consume one more rack unit than a typical R730.  Add to this, theoretically speaking, the number of drives you could add to a single host would go beyond a single JBOD.  A quad port SAS HBA could have four 60 drive enclosures attached, and that’s a single HBA.

Storage independence:  Having the storage outside the server also makes that storage infinitely more flexible.  This is even true when you’re building vendor homogeneous solutions.  Take Dell for example.  Typically speaking their enclosures are movable between different server generations.  Currently with the storage stuck in the chassis, it gets really messy (support wise) and in many cases not doable, to move the storage from one chassis to another, especially if you’re talking about going from an older generation server to a newer generation.

Adding to this, depending on your confidence, white boxing also allows you to cut a server vendor out of the costliest part of the solution, which is the disks themselves.  Go with an enclosure from someone like RAID Inc., DataOn, Quanta QCT, Seagate, etc., add in a generic LSI (sorry Avago, oh sorry again, Broadcom) HBA, and now you have a solution that is likely good enough supportability wise.  JBODs tend to be pretty dumb and reliable, which just leaves the LSI card (a well known, established vendor) and your SSDs / HDDs.

Why do you want to move the storage anyway?  Simple, I’d bet a nice steak dinner that you want to upgrade or replace your compute long before you need to replace your storage. If you’re simply replacing your compute (not adding a new node but swapping it), then moving a SAS card + DAS is far more efficient than rebuying the storage, or moving the internal storage into a new host (remember, warranty gets messy).  Simply vacate the host like you would with internal storage, shut down, rip the HBA out, swap the server, put the existing HBA back in, done.

If you’re adding a new host, depending on your storage, you may have the option of buying another enclosure and spreading the disks you have evenly across all hosts again.  So if, for example, you had 50 disks in each of 4 hosts (200 disks total) and you add a fifth host, one option could be to simply remove 10 disks from each current node and place them in the new node. Your only additional cost was the JBOD enclosure, and you now continue to keep your current investment in disks (with flash, that would be the expensive part).

Mix and match 3.5 / 2.5 drives:  Right now with internal storage, you are either running a 3.5” chassis, which doesn’t hold a lot of drives but CAN support 2.5” drives with a sled, or you are running a 2.5” chassis, which guarantees no 3.5” drives.  External DAS could mean one of two options:

  1. Use a denser 3.5” JBOD (say 60 disks) and use 2.5” sleds when you need to.
  2. Use one JBOD for 3.5” drives and a different one for 2.5” drives.

Again it comes down to flexibility.

Performance upgrades:  Now this is a big “it depends”.  Hypothetically, if there were no SW imposed bottlenecks (which there are), one of your likely bottlenecks (with all flash at least) is going to be either how many drives you have per SAS lane, or how many drives you have per SAS card.  For example, if your SAS card is PCIe 3.0 internally, but the PCIe bus is 4.0, there’s a chance you could upgrade your server to a newer / better storage controller card.   More so, even if you were stuck on PCIe 3 (as an example), there would be nothing stopping you from slicing your JBOD in half and using two HBAs to double your throughput.  Before you even go there, yes I do know the 730xd has an option for two RAID cards, glad you brought that up.  Guess what, with external DAS, you’re only limited by your budget, the number of PCIe slots you have and the constraints of your HCI vendor.  I for example could have 4 SAS cards, and 2 JBODs partially filled and each sliced in half.  You don’t have that flexibility with internal storage.

With the case of white boxing your storage, this also means to the extent of the HCL, you can run what you want.  So if you want to use all Intel dc3700’s, you can.  Heck, they’re even starting to make JBOF (just a bunch of flash) enclosures for NVMe, which again, would be REALLY fast.

Conclusion:

I say external DAS support is the missing link because it is what would allow HCI to offer scaling flexibility similar to what exists in SAN/NAS.  I still think the HCI industry is at least 3 – 5 years out from matching the performance, scalability and features we’ve come to expect in enterprise storage, but external storage support would knock a big hole in one of SAN/NAS’s biggest scalability advantages.

Thinking out loud: The cloud (IaaS) delusion

Introduction:

Just so we’re all being honest here, I’m not going to sit here and lie about how I’m not biased and I’m looking at both sides 100% objectively.  I mean, I’m going to try to, but I have a slant towards on prem, and a lot of that is based on my experience and research with IaaS solutions as they exist now.  My view of course is subject to change as technology advances (as anyone’s should), and I think with enough time, IaaS will get to a point where it’s a no brainer, but I don’t think that time is here yet for the masses.  Additionally, I think it’s worth noting that in general, like any technology, I’m a fan of what makes my life easier, what’s better for my employer, and what’s financially sound.  In many cases cloud fits those requirements, and I currently run and have run cloud solutions (long before it was trendy).  I’m not anti cloud, I’m anti throwing money away, which is what IaaS mostly amounts to.

Where is this stemming from?  After working with Azure for the past month, and reading why I’m a cranky old SysAdmin for not wanting to move my datacenter to the cloud, I wanted to speak up on why, on the contrary, I think you’re a fool if you do.  Don’t get me wrong, I think there are perfectly valid reasons to use IaaS; there are things that don’t make sense to do in house.  But running a primary (and at times a DR) datacenter in the cloud is just wasting money and limiting your company’s capabilities.  Let’s dig into why…

Basic IaaS History:

Let’s start with a little history as I know it on how IaaS was initially used, and IMO, this is still the best fit for IaaS.

I need more power… Ok, I’m done, you can have it back.

There are companies out there (not mine) that do all kinds of crazy calculations, data crunching and other compute intensive operations.  They needed huge amounts of compute capacity for relatively short periods of time (or at least that was the ideal setup).  Meaning, they were striving to get the work done as fast as possible, and for argument’s sake, let’s just say their process scaled linearly as they added compute nodes.  There was only so much time, so much power, so much cooling, and so much budget to be able to house all these physical servers for solving what is in essence one big complex math equation.  What they were left with was a balancing act of buying as much compute as they could manage, without being excessively wasteful.  After all, if they purchased so much compute that they could solve the problem in a minimal amount of time, then unless they were keeping those servers busy once the problem was solved, it was a waste of capital.  About 10 years ago (taking a rough guess here), AWS released this awesome product capable of renting compute by the hour, and offering what’s basically unlimited amounts of CPU / GPU power.  Now all of a sudden, a company that would have had to operate a massive datacenter had a new option of renting mass amounts of compute by the hour.  This company could fire up as many compute nodes as it could afford, and not only could it solve its problem quicker, but it only had to pay for the time it used.

I want to scale my web platform on demand…. and then shrink it, and then scale it, and then shrink it.

It evolved further: if it's affordable for mass scale-up and scale-down for folks that fold genomes or trend the stock market, why not for running things like next-generation web-scale architectures?  Sort of a similar principle, except that you run everything in the cloud.  To make it affordable and scalable, they designed their web infrastructure so that it could scale out on demand.  Again, we're not talking about a few massive database servers and a few massive web servers; we're talking about tons of smaller web infrastructure components, all broken out into smaller, independently scalable pieces.  The cloud model worked brilliantly here, because it was built on the premise that you design small nodes, scale them out on demand as load increases, and destroy nodes as demand dwindles.  You could never have this level of dynamic capacity affordably on prem.

I want a datacenter for my remote office, but I don’t need a full server, let alone multiples for redundancy.

At this stage IaaS is working great for the DNA crunchers and your favorite web-scale company, and all the while, it's getting more and more development time, more functionality, and finally gaining the attention of more folks with different use cases.  I'm talking about folks that are sick of waiting on their SysAdmins to deploy test servers, folks that need a handful of servers in a remote location, or folks that only need a handful of small servers in general and don't need a big expensive SAN or server.  Again, it worked mostly well for these folks.  They saved money by not needing to manage 20 small datacenters, or they were able to test that code on demand and on the platform they wanted, and things were good.

The delusion begins…

Fast forward to now, and everyone thinks that if the cloud worked for the genome folders, the web-scale companies and finally for small datacenter replacements, then it must also be great for my (relatively speaking) static, large, legacy enterprise environment.  At least that's what every cloud-peddling vendor and blogger would have you believe, and thus the cloud delusion was born.

Why do I call it the cloud delusion?  Simple: your enterprise architecture is likely NOT getting the same degree of wins that these types of companies were (and are) getting out of IaaS.

Let's break down the wins that cloud offered, and still offers, you.  In essence, if this is functionality that you need, then the cloud MAY make sense for you.

  1. Scale on demand:  Do you find yourself frequently needing to scale servers by the hundreds every day, week or even month?   Shucks, I'll even give you some leeway and ask if you're adding multiple hundreds of servers every year?   In turn, are you finding that you are also destroying servers in that quantity?  We're trying to find out if you really need the dynamic scale-on-demand advantage that the cloud brings over your on prem solution.
  2. Programmatic Infrastructure:  Now I want to be very clear about this from the start: while on prem may not be as advanced as IaaS, infrastructure is mostly programmatic on prem too, so weigh this pro carefully.  Do you find that you hate using a GUI to manage your infrastructure, or need something that can be highly repeatable and fully configurable via a few JSON files and a few scripts (there's a minimal sketch of what that looks like right after this list)?  I mean really think about that.  How many of you right now are just drowning because you haven't automated your infrastructure, and are currently head first into automating every single task you do?  If so, the cloud may be a good fit, because practically everything can be done via a script and some config files.  If, however, you're still running through a GUI, or using a handful of simple scripts, and really have no intention of doing everything through a JSON file / script, it's likely that IaaS isn't offering you a big win here.  Even if you are, you have to question whether your on prem solution offers similar capabilities, and if so, what's the win that a cloud provider offers that your on prem does not.
  3. Supplement infrastructure personnel:  Do you find your infrastructure folks are holding you back?  If only they didn't have to waste time on all that low-level stuff like managing hypervisors, SANs, switches, firewalls, and other solutions, they'd have so much free time for other things.  I'm talking about things like patching firmware, racking / unracking equipment, installing hypervisors, and provisioning switch ports.  We're talking about all of this consuming a considerable portion of your infrastructure team's time.  If they're not spending that much time on this stuff (and chances are very high that they're not), then this is not going to be a big win for you.  Again, companies that have teams busy with this stuff all the time probably have problem number 1 that I identified.  I'd also like to add that even if this is an issue you have, there is still a limited amount of gain you'll get out of it.  You're still going to need to provision storage, networking and compute, but now instead of in the HW, it will simply be done through a CLI / GUI.  Mostly the same problem, just a different interface.  Again, unless you plan to solve this problem ALONG with problem 2, it's not going to be a huge win.
  4. VMs on demand for all:  Do you plan on giving all your folks (developers, DBAs, QA, etc.) access to your portal to deploy VMs?  IaaS has an awesome on-demand capability that's easy to delegate to folks.  If you need something like this, without having to worry about them killing your production workload, then IaaS might be great for you.  Don't get me wrong, we can do this on prem too, but there's a bit more work and planning involved.  Then again, letting anyone deploy as much as they want can be an equally expensive proposition.  Also, let's not forget problem number 2: chances are pretty high your folks need some pre-setup tasks performed, and unless you've got that problem figured out, VMs on demand probably isn't going to work well anywhere, let alone in the cloud.
  5. At least 95% of your infrastructure is going to the cloud:  While the number may seem arbitrary (and to some degree it is a guess), you need a critical mass of some sort for it to make financial sense to send your infrastructure to the cloud (if you're not fixing a point problem).  What good is it to send 70% of your infrastructure to the cloud if you have to keep 30% on prem?  You're still dealing with all the on prem issues, but now your economies of scale are reduced.  If you can't move the lion's share of your infrastructure to the cloud, then what's the point in moving random parts of it?  I'm not saying don't move certain workloads to the cloud.  For example, if you have a mission-critical web site, but everything else is OK to have an outage for, then move that component to the cloud.  However, if most of your infrastructure needs five 9's, and you can only move 70% of it, then you're still stuck supporting five 9's on prem, so again, what's the point?
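
To make point 2 a bit more concrete, here's a minimal sketch of what "a few JSON files and a few scripts" can look like against Azure, using the AzureRM PowerShell module of this era.  The config file, resource group, vnet, image and VM sizes are all made up for illustration, and it leans on the module defaulting to a managed OS disk; the point is the shape of the workflow, not a copy/paste recipe.

```powershell
# Hypothetical vm-config.json describing what we want deployed:
# {
#   "resourceGroup": "rg-lab",
#   "location": "eastus2",
#   "vnet": "vnet-lab",
#   "vms": [
#     { "name": "web01", "size": "Standard_D2_v2" },
#     { "name": "web02", "size": "Standard_D2_v2" }
#   ]
# }

$config = Get-Content -Raw .\vm-config.json | ConvertFrom-Json
$cred   = Get-Credential   # local admin credentials for the new VMs
$vnet   = Get-AzureRmVirtualNetwork -ResourceGroupName $config.resourceGroup -Name $config.vnet

foreach ($vmDef in $config.vms) {
    # Each VM needs a NIC; drop it in the first subnet of the existing vnet.
    $nic = New-AzureRmNetworkInterface -ResourceGroupName $config.resourceGroup `
               -Location $config.location -Name "$($vmDef.name)-nic" `
               -SubnetId $vnet.Subnets[0].Id

    # Build the VM definition entirely from the config file, then deploy it.
    $vm = New-AzureRmVMConfig -VMName $vmDef.name -VMSize $vmDef.size |
          Set-AzureRmVMOperatingSystem -Windows -ComputerName $vmDef.name -Credential $cred |
          Set-AzureRmVMSourceImage -PublisherName MicrosoftWindowsServer `
              -Offer WindowsServer -Skus 2012-R2-Datacenter -Version latest |
          Add-AzureRmVMNetworkInterface -Id $nic.Id

    New-AzureRmVM -ResourceGroupName $config.resourceGroup -Location $config.location -VM $vm
}
```

The on prem equivalent (PowerCLI against vCenter, for example) reads almost identically, which is exactly why I say to weigh this "win" carefully.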

Disclaimer:  Extreme amounts of snark are coming, be prepared.

Ok, ok, maybe you don't need any of these features, but you've got money to burn, you want these features just because you might use them at some point, everyone else is "going cloud" so why not you, or whatever other reason you might be coming up with for why the cloud is the best decision.  What's the big deal?  I mean, you're probably thinking you lose nothing but gain all kinds of great things.  Well, that my friend is where you'd be wrong.  Now, my talking points are coming from my short experience with Azure, so I can't say they apply to all clouds.

  1. No matter what, you still need on prem infrastructure.  Maybe it's not a hoard of servers, but you'll need stuff.
    1. Networking isn't going anywhere (should have been a network engineer).  Maybe you won't have as many datacenter switches to contend with (and you shouldn't have a lot if your infrastructure is modern and not greater than a few thousand VMs), but you'll still need access switches for your staff.  You're going to need VPNs and routers.  Oh, and NOW you're going to need a MUCH bigger router and firewall (err… more expensive).  All that data you were accessing locally now has to go across the WAN, and if you're encrypting that data, that's going to take more horsepower, which means bigger, badder WAN networking.
    2. You're probably still going to have some form of servers on site.  In a Windows shop that will be at least a few domain controllers; you'll also have file server caching appliances, and possibly other WAN acceleration devices depending on what apps you're running in the cloud.
    3. Well, you've got this super critical networking and file caching HW in place, so you need to make sure it stays on.  That potentially leads back to UPSs at a minimum and maybe even a generator.  Then again, being fair, if the power is out, perhaps it's out for your desktops too, so no one is working anyway.  That's a call you need to make.
    4. Is your phone system moving to the cloud too?  No… guess you’re going to need to maintain servers and other proprietary equipment for that too.
    5. How about application "X"?  Can you move it to the cloud?  Will it even run in the cloud?  It's based on Windows 2003, and Azure doesn't support Windows 2003.  What are application "X"'s dependencies, and how will they affect the application if they're in the cloud?  That might mean more servers staying on prem.
  2. They told you it would be cheaper, right?  I mean, the cloud saves you so much infrastructure, so much personnel power, and it provides this unlimited flexibility and scalability that you don't actually need.
    1. Every VM you build now actually has a hard cost.  Sorry, but there's no such thing as "over provisioning" in the cloud.  Your cloud provider gets to milk that benefit out of you and make a nice profit.  Yeah, I can run a hundred small VMs on a single host; in the cloud I'd pay for each of those VMs individually.  But hey, it's cheaper in the cloud, or so the cloud providers have told me.
    2. Well, at least the storage is cheaper… except that to get decent performance in the cloud, you need to run on premium storage, and premium storage isn't cheap (and not really all that premium either).  You don't get to enjoy the nice low latency, high IOPS, high throughput, adaptive caching (or all flash) that your on prem SAN provided.  And if you want to try to match what you can get on prem, you'll need to over-provision your storage and do crazy in-guest disk striping techniques.
    3. What about your networking?  I mean, what is one of the most expensive recurring networking costs to a business?  The WAN links… well, they just got A LOT more expensive.  So on top of now spending more capex on a router and firewall, you also need to pump more money into the WAN link so your users have a good experience.  Then again, they'll never have the same sub-millisecond latency that they had when the app was local to them.
      1. No problem, you say, I'll just move my desktops to the cloud… and then you remember that the latency still exists; it's just been moved from between the client and the application to between the user and the client.  Not really sure which is worse.
        1. Even if you're not deterred by this, now you're incurring the costs of running your desktops in the cloud.  You know, for the folks you force five-year-old (or older) desktops on.
    4. How many IPs or how many NICs does your VM have?  I hope it's one and one.  You see, there is a limitation (in Azure) of one IP per NIC, and in order to run multiple NICs per server, you need a larger VM.  Ouch…
    5. I hope you weren't thinking you'd run exactly 8 vCPUs and 8GB of vRAM because that's all your server needs.  Sorry, that's not the way the cloud works.  You can have any size VM you want, as long as it's one of the sizes your cloud provider offers.  So you may end up paying for a VM that has 8 vCPUs and 64GB of RAM because that's the closest fit.  But wait, there's more…  What if you don't need a ton of CPU or RAM, but you have a ton of data, say a file server?  Sorry, again, the cloud provider only allows a certain number of data disks per VM size, so you now need to bump up your VM size to support the disk capacity you need.
    6. At least with cloud, everything will be easy.  I mean, yeah, it might cost more, but oh… the simplicity of it all.  Yep, because having a year-2005 limitation of 1TB disks just makes everything easy.  Hope you're really good with dynamic disks, Windows Storage Spaces, or LVM (Linux), because you're going to need them.  Also, I hope you have everything thought out ahead of time if you plan to stripe disks in guest; MS has the most unforgiving disk striping capabilities if you don't (there's a short Storage Spaces sketch after this list).
    7. Snapshots… they at least have snapshots, right?  Well, sort of, except it's totally convoluted, and not something you'd ever want to rely on for fear of wrecking your VM (which is what the snapshot was supposed to protect against, right?).
    8. Ok, ok, well how about dynamically resizing your VMs?  They can at least do that, right?  Yes, sort of, so long as you're sizing up within a specific VM class.  Otherwise, TMK, you have to rebuild once you outgrow a given VM class.  For example, the "D series" can be scaled until you reach the maximum for the "D".  You can't easily convert it to a "G" series in a few clicks to continue growing it.
    9. Changes are quick and non-disruptive, right?  LOL, sure, with any other hypervisor they might be, but this is the cloud (Azure), and from what I can see, it's iffy whether your VMs will avoid a shutdown, and even worse, if you do something that is supported hot, you may see longer than normal stuns.
    10. Ever need to troubleshoot something in the console?  Me too; a shame, because Azure doesn't give you access to the console.
    11. Well, at least they have a GUI for everything, right?  Nope, I found I need to drop into PS more often than not.  Want to resize that premium storage disk?  That's gonna take a PowerShell cmdlet (a rough example follows this list).  That's good though, right?  I mean, you like wasting time finding the disk GUID and digging into a CLI just to resize one disk, which BTW is a powered-off operation.  WIN!
    12. You like being in control of maintenance windows, right?  Of course you do, but with cloud you don't get a say.
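
Since in-guest striping comes up twice above, here's a minimal sketch, assuming a Windows guest that already has four identical, empty data disks attached, of what pooling and striping them with Storage Spaces looks like.  The pool and volume names and the disk count are made up; the thing to take away is that the column count has to be decided up front.

```powershell
# Pool every attached disk that is eligible for pooling
# (assumes four identical, empty data disks are already attached to the VM).
New-StoragePool -FriendlyName "DataPool" `
                -StorageSubSystemFriendlyName "Windows Storage*" `
                -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

# Carve a "Simple" (striped, no resiliency) virtual disk across the pool, then
# initialize, partition and format it in one pipeline.  NumberOfColumns should
# match the number of disks you intend to stripe across, and it cannot be
# changed later without rebuilding the virtual disk.
New-VirtualDisk -StoragePoolFriendlyName "DataPool" `
                -FriendlyName "DataStripe" `
                -ResiliencySettingName Simple `
                -NumberOfColumns 4 `
                -UseMaximumSize |
    Initialize-Disk -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "Data"
```

Growing it later generally means adding disks in multiples of the column count, which is exactly the kind of pre-planning you never had to do when the SAN just handed you a bigger LUN.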

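And for point 11, this is roughly the resize dance for a premium (unmanaged) data disk with the AzureRM module, reconstructed from memory: the resource group, VM and disk names are invented, and the exact property paths may differ by module version.  Note the deallocate up front, which is the powered-off part I was complaining about.

```powershell
# Resizing a data disk is an offline operation: deallocate the VM first.
Stop-AzureRmVM -ResourceGroupName "rg-lab" -Name "sql01" -Force

# Pull the VM object, find the disk by name, and bump its size
# (1023 GB was the ceiling for a single disk at the time).
$vm = Get-AzureRmVM -ResourceGroupName "rg-lab" -Name "sql01"
($vm.StorageProfile.DataDisks | Where-Object { $_.Name -eq "sql01-data1" }).DiskSizeGB = 1023

# Push the change and power the VM back on.
Update-AzureRmVM -ResourceGroupName "rg-lab" -VM $vm
Start-AzureRmVM -ResourceGroupName "rg-lab" -Name "sql01"
```
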
I could keep going, but honestly I think you get the point.  There are caveats in spades when switching to the cloud as a primary (or even DR) datacenter.  It's not a simple case of paying more for features you don't need; you also lose flexibility and performance, and you pay more on top of that.

Alright, but what about all those bad things they say about on prem, or things like TCO that they're trying to woo you to the cloud with?  Well, let's dig into it a bit.

  1. Despite what "they" tell you, they're likely out of touch.  Most of the cloud folks you're dealing with have been chewing their own dog food so long, they don't have a clue about what exists in the on prem world, let alone how to deal with your infrastructure and all its nuances.  They might convince you they're infrastructure experts, but only in THEIR infrastructure, not yours, and certainly not on prem in general.  Believe me, most of them have been in their bubble for half a decade at least, and we all know how fast things change in technology; they're new school in cloud, but dinosaurs in on prem.  Don't misunderstand me, I'm not saying they're not smart, I'm saying I doubt they have the on prem knowledge you do, and if you're smart, you'll educate yourself in cloud so you're prepared to evaluate whether IaaS really is a good fit for you and your employer.
  2. Going cloud is NOT like virtualization.  With virtualization you didn't change the app, you didn't lose control, and more importantly it actually saved you money and DID provide more flexibility, scalability and simplicity.  Cloud does not guarantee any of those for a traditional infrastructure.  Or rather, it may offer different benefits that aren't needed to the same degree.
  3. They'll tell you the TCO for cloud is better, and they MAY be right if you're doing foolish things like the following:
    1. Leasing servers and swapping them every three years.  A total waste of money.  There are very few good reasons not to finance a server (capex) and re-purpose it through a proper lifecycle.  Five years should be the minimum lifecycle for a modern server; you have DR and other things you can use older HW for.
    2. You're not maxing out the cores in your servers to get the most out of your licensing, reduce network connectivity costs, and cut power, cooling and rack space.  An average dual-socket, 18-core server can run 150 average VMs without breaking a sweat (there's some quick math after this list).
    3. Your threshold for a maxed-out cluster is too low.  There's nothing wrong with a 10:1 or even a 15:1 vCPU to pCPU ratio so long as your performance is ok.  Your mileage may vary, but be honest with yourself before buying more servers based on arbitrary numbers like these.
    4. You take advice from a greedy VAR.  Do yourself a favor and just hire a smart person that knows infrastructure.  They'll be cheaper than all the money you waste on a VAR, or cloud.  You should be pushing for someone that is borderline architect, if not an architect.
      1. FYI, I'm not saying all VARs are greedy, but more are than not.  I can't tell you how many interviews I've had where I go "yeah, you got upsold".
    5. Stop with this BS of only buying "EMC" or "Cisco" or "Juniper" or whatever your arbitrary preferred vendor is.  Choose the solution based on price, reliability, performance, scalability and simplicity, not by its name.  I picked Nimble when NetApp would have been an easy but expensive choice.  Again, see point 4 about getting the right person on staff.
    6. Total datacenter costs (power, UPS, generator and cooling) are worth considering, but they're often not as expensive as the providers would have you think.  If this is the only cost-savings point they have you sold on, you should consider colocation first, which takes care of some of that, but also incurs some of the same costs / caveats that come with cloud (though not nearly as many).  Again, I personally think this is FUD, and in a lot of cases IT departments, let alone businesses, don't even see the bill for all of this.  Even things like DC space: if you're using newer equipment, the rack density you get through virtualization is astounding.
    7. You're not shopping your solution, ever.  I know folks that just love to go out to lunch (takes one to know one), and their VARs and vendors are happy to oblige.  If your management team isn't pushing back on price, and lets you run around throwing POs like Monopoly money, there's a good chance you're paying more for something than you need to.
    8. You suck at your job, or you've hired the wrong person.  Sounds a little harsh, but again, going back to point 4: if you have the right people on staff, you'll get the right solutions, they'll be implemented better, and they'll get implemented quicker.  Cloud, by the way, only fixes certain aspects of this problem.
  4. They'll tell you that you can't do it better than them, that they scale better, and that it would cost you millions to get to their level.  They're right: they can build a 100,000-host datacenter better than you or I can, and they can run it better.  But you don't need that scale, and more importantly, they're not passing those economies of scale on to you.  That's their profit margin.  Remember, they're not doing this to save you money, they're doing this to make money.  In your case, if your DC is small enough (but not too small), you can probably do it MUCH cheaper than what you'd pay for in a cloud, and it will likely run much better.
  5. They'll tell you that you'll be getting rid of a SysAdmin or two thanks to cloud.  Total BS…  An average SysAdmin (contrary to marketing slides) does not spend a ton of time on the mundane tasks of racking HW, patching hypervisors (unless it's Microsoft :-)), etc.  They spend most of their time managing the OS layer, doing deployments, and so on, all of which BTW still needs to be done in the cloud.
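
To back up the core-count and consolidation-ratio points above (3.2 and 3.3), here's the back-of-envelope math as a throwaway snippet.  The 2 vCPU per VM average is an assumption of mine; plug in your own numbers, and remember RAM or storage is often the real ceiling before CPU is.

```powershell
# Rough consolidation math for a single host -- assumed averages, adjust for your workloads.
$coresPerSocket  = 18
$sockets         = 2
$vCpuToPCpuRatio = 10      # the conservative end of the 10:1 - 15:1 range above
$avgVcpuPerVm    = 2       # assumed average VM size

$physicalCores = $sockets * $coresPerSocket                  # 36 physical cores
$vCpuBudget    = $physicalCores * $vCpuToPCpuRatio           # 360 vCPUs at 10:1
$vmsPerHost    = [math]::Floor($vCpuBudget / $avgVcpuPerVm)  # ~180 average VMs

"{0} cores at {1}:1 = {2} vCPUs, or roughly {3} average VMs per host" -f `
    $physicalCores, $vCpuToPCpuRatio, $vCpuBudget, $vmsPerHost
```

Even at a 2 vCPU average, one host clears the 150-VM figure above with headroom to spare, which is exactly the kind of density the per-VM cloud bill never lets you capture.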

For now, that's all I've got.  I wrote this because I was so tired of hearing folks spew pro-cloud dogma without even a basic understanding of what it takes to run infrastructure in the cloud or on prem.  Maybe I am the cranky mainframe guy, and maybe I'm the one who is delusional and wrong.  I'm not saying the cloud doesn't have its place, and I'm not even saying that IaaS won't be the home of my DC in ten years.  What I am saying is that right now, at this point in time, I see moving to the cloud as a big, expensive mistake if your goal is simply to replace your on prem DC.  If you're truly being strategic with what you're using IaaS for, and there are pain points that are difficult to solve on prem, then by all means go for it.  Just don't tell me that IaaS is ready for the general masses, because IMO, it has a long way to go yet.