All posts by Eric Singer

Powershell Scripting: Microsoft Exchange, Configure client-specific message size limits

Introduction:

If you don’t know by now, I’m a huge PowerShell fan. It’s my go to scripting language for anything related to Microsoft (and non-Microsoft) automation and administration. So when it came time to automating post exchange cumulative update setting, I was a bit surprised to see some of the code examples from Microsoft, not containing any PowerShell example. Surprised is probably the wrong word, how about annoyed? I mean, after all, this is not only the company that shoved this awesome scripting language down our throat, but also the very team that was the first one to have a comprehensive set of admin abilities via PowerShell. So if that’s the case, why in the world, don’t they have a single PS example for configuring client-specific message size limits?

Not to be discouraged, I said screw appcmd, I’m PS’ing this stuff, because it’s 2017 and PS / DSC is what we should be using. Here’s how I did it

The settings:

If you’re looking for where the setting are that I’m speaking of / about, check out this link here. That’s how you do it in the “old school” way.

The new school way:

My example below is for EWS, you need to adjust this if you want to also include EAS.


     Write-Host "Attempting to set EWS settings"
    Write-Host "Starting with the backend ews custom bindings"
    $AllBackendEWSCustomBindingsWebConfigProperties = Get-WebConfigurationProperty -Filter "system.serviceModel/bindings/custombinding/*/httpsTransport" -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -ErrorAction Stop | Where-Object {$_.ItemXPath -like "*EWS*https*/httpstransport"} 
    Foreach ($BackendEWSCustomBinding in $AllBackendEWSCustomBindingsWebConfigProperties)
        {
        Set-WebConfigurationProperty -Filter $BackendEWSCustomBinding.ItemXPath -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -value 209715200 -ErrorAction Stop
        }
    Write-Host "Finished the backend ews custom bindings"
    
    Write-Host "Starting with the backend ews web http bindings"
    $AllBackendEWwebwebHttpBindingWebConfigProperties = Get-WebConfigurationProperty -Filter "system.serviceModel/bindings/webHttpBinding/*" -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -ErrorAction Stop | Where-Object {$_.ItemXPath -like "*EWS*"} 
    Foreach ($BackendEWSHTTPmBinding in $AllBackendEWwebwebHttpBindingWebConfigProperties)
        {
        Set-WebConfigurationProperty -Filter $BackendEWSHTTPmBinding.ItemXPath -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -value 209715200 -ErrorAction Stop
        }
    Write-Host "Finished the backend ews web http bindings"

    Write-Host "Starting with the back end ews request filtering"
    Set-WebConfigurationProperty -Filter "/system.webServer/security/requestFiltering/requestLimits" -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxAllowedContentLength -value 209715200 -ErrorAction Stop
    Write-Host "Finished the back end ews request filtering"

    Write-Host "Starting with the front end ews request filtering"
    Set-WebConfigurationProperty -Filter "/system.webServer/security/requestFiltering/requestLimits" -PSPath "MACHINE/WEBROOT/APPHOST/Default Web Site/EWS" -Name maxAllowedContentLength -value 209715200 -ErrorAction Stop
    Write-Host "Finished the front end ews request filtering" 

Is it technically better than appcmd?  Yes, of course, what did you think I was going to say?  It’s PS, of course it’s better than CMD.

As for how it works, I mean it’s pretty obvious, I don’t think there’s any good reason to go into a break down.  I took what MS did with AppCMD and just changed it to PS, with a foreach loop in the beginning to have even a little less code 🙂

You should be able to take this, and easily adapt it to other IIS based web.config settings.  My Get-WebConfigurationProperty in the very beginning, is a great way to explore any web.config via the IIS cmdlets.

Anyway, hope this helps someone.

***Update 07/29/2017:

So we did our exchange 2013 cu15 upgrade, and everything went well with the script, except for one snag.  My former script had an incorrect filter that added an “https” binding to an “http”  path.  EWS didn’t like that very much (as we found out the hard way).  Anyway, should be fixed now.  I updated the script.  Just so you know which line was affected you can see the before and after below.  Basically my original filter grabbed both the http and https transports.  I guess technically each web property has the potential for both.  My new filter goes after only https EWS configs + https transports.


#I changed this:

$AllBackendEWSCustomBindingsWebConfigProperties = Get-WebConfigurationProperty -Filter "system.serviceModel/bindings/custombinding/*/httpsTransport" -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -ErrorAction Stop | Where-Object {$_.ItemXPath -like "*EWS*"}

#To this

$AllBackendEWSCustomBindingsWebConfigProperties = Get-WebConfigurationProperty -Filter "system.serviceModel/bindings/custombinding/*/httpsTransport" -PSPath "MACHINE/WEBROOT/APPHOST/Exchange Back End/ews" -Name maxReceivedMessageSize -ErrorAction Stop | Where-Object {$_.ItemXPath -like "*EWS*https*/httpstransport"}

Powershell Scripting: Get-ECSESXHostVIBToPCIDevices

Introduction:

If you remember a little bit ago, I said I was trying to work around the lack of driver management with vendors.  This function is the start of a few tools you can use to potentially make your life a little easier.

VMware’s drivers are VIBS (but not all VIBS are drivers).  So the key to knowing if you have the correct drivers is to find which VIB matches which PCI device.  This function does that work for you.

How it works:

First, I hate to be the bearer of bad news, but if you’re running ESXi 5.5 or below, this function isn’t going to work for you.  It seems the names of the modules  and vibs don’t line up via ESXCLI in 5.5, but they do in 6.0.  So if you’re running 6.0 and above, you’re in luck,.

As for how it works, its actually pretty simple.

  1. Get a list of all PCI devices
  2. Get a list of all modules (which aren’t the same as VIBS).
  3. Get a list of all VIBs.
  4. Loop through each PCI device
    1. See if we find a matching module
      1. Take the module and see if we find a VIB that matches it.
  5. Take the results of each loop, add it to an array
  6. Spit out the array once the function is done running
  7. Your results should be present.

How to execute it:

Ok, to begin with, I’m not doing fancy pipelining or anything like that.  Simply enter the name of the ESXi host as it is in vCenter and things will work just fine.  There is support for verbose output if you want to see all the PCI devices, modules and vibs that are being looped through.

Get-ECSESXHostVIBToPCIDevices -VMHostName "ServerNameAsItIsInvCenter"

If you want to do something like loop through a bunch of hosts in a cluster, that’s awesome, you can write that code :).

How to use the output:

Ok great, so now you’ve got all this output, now what?  Well, this is where we’re back to the tedious part of managing drivers.

  1. Fire up the VMware HCL web site and go to the IO devices section
  2. Now, there are three main columns from my output that you need to find the potential list of drivers.  Yeah, even with an exact match, there maybe anywhere from 0 devices listed (take that as you’re running the latest) to having on or more hits.
    1. PCIDeviceSubVendorID
    2. PCIDeviceVendorID
    3. PCIDeviceDeviceID
  3. Those three columns are are all you need.  Now a few notes with this.
    1. if there are less than four characters, VMware will add leading zeros on their web drop down picker.  For example, if my output shows “e3f”, on VMwares drop down picker, you want to look for “0e3f”.
    2. if you get a lot of results, what I suggest doing next, is seeing if the vendor matches your server vendor.  If you find a server vendor match and there are still more than one result, see if its something like the difference between a dual port or single port card.  If you don’t see your server vendor listed, see if the card vendor is listed.  For example, in UCS servers, instead of seeing Cisco for a RAID controller, you would likely find a match for “Avago” or “Broadcom”.  Yeah, it totally gets confusing with HW vendors buying each other LOL.
  4. Once you find a match, the only thing left to do, is look at the output of column “ModuleVibVersion” in my script and see if you’re running the latest driver available, or if it at least is recent.  Just keep in mind, if you update the driver, make sure the FW you’re running is also certified for that driver.

Where’s the code?

Right here

What’s next / missing?

Well, a few things:

  1. I haven’t found a good way yet to loop through each PCI device and see its FW version.  That’s a pretty critical bit of info as I’ve said before.
  2. Even if i COULD find the firmware version for you, you’re still going to need to cross reference it against your server vendor.  Without an API, this is also going to be a tedious process.
  3. You need to manually check the HCL because in 2017, VMware still doesn’t have an API, let alone a restful one to do the query.  If we had that, the next logical step would be to take this output and query an API to find a possible match(es).  For now, you’ll need to do it manually.
    1. Ideally, the same API would let you download a driver if you wanted.
  4. VMware lacks an ability to add VIBS via PowerCLI or really manage baselines and what not.  So again, VMware really dropping the ball here.  This time it’s the “Update Manger” team.

Conclusion:

Hope this helps a bit, it’s far from perfect, but I’ve used it a few times, and found a few NIC drivers and RAID controllers that had older drivers.

Problem Solving: WSUS failing for Windows 10 with error 8024401c

Hi Folks,

After updating WSUS to support Windows 10 newer update format, we noticed that our Windows 10 client weren’t working. The error they were getting was 8024401c whenever we checked for updates (post WSUS upgrade).  Initially we thought it was related to the WSUS upgrade, but found out that most of our systems hadn’t been updating for a while.  So we moved on to troubleshooting the client further.  We found that the following GPO “Do not connect to any Windows Update Internet locations”  was not configured.  After doing some digging we determined that this was put in place to prevent our clients from downloading updates from MS directly, which was originally happening.  The weird thing to me was why were our clients going to MS anyway?  We have WSUS, that is the point of WSUS?  Disabling the setting resulted in us getting updates, but now they were coming from MS directly and not WSUS.  Enabling or setting it to “not configured” resulted in the lovely error.

Example snippet of log file below.

2017/06/13 10:08:31.2836183 676 10488 WebServices WS error: There was an error communicating with the endpoint at ‘http://%ServerName%/ClientWebService/client.asmx’.
2017/06/13 10:08:31.2836186 676 10488 WebServices WS error: There was an error receiving the HTTP reply.
2017/06/13 10:08:31.2836189 676 10488 WebServices WS error: The operation did not complete within the time allotted.
2017/06/13 10:08:31.2836280 676 10488 WebServices WS error: The operation timed out
2017/06/13 10:08:31.2836379 676 10488 WebServices Web service call failed with hr = 8024401c.

After a ton of Google Fu, I stumbled on to this article https://blogs.technet.microsoft.com/windowsserver/2017/01/09/why-wsus-and-sccm-managed-clients-are-reaching-out-to-microsoft-online/  Before you start reading, make sure you’re relaxed and read through it carefully, because the answer is there, but you have make sure you’re not just skimming.

Here is the main section and highlighted points that you need to glean from that article.


Ensure that the registry HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate doesn’t reflect any of these values.

  • DeferFeatureUpdate
  • DeferFeatureUpdatePeriodInDays
  • DeferQualityUpdate
  • DeferQualityUpdatePeriodInDays
  • PauseFeatureUpdate
  • PauseQualityUpdate
  • DeferUpgrade
  • ExcludeWUDriversInQualityUpdate

What just happened here? Aren’t these update or upgrade deferral policies?

Not in a managed environment. These policies are meant for Windows Update for Business (WUfB). Learn more about Windows Update for Business.

Windows Update for Business aka WUfB enables information technology administrators to keep the Windows 10 devices in their organization always up to date with the latest security defenses and Windows features by directly connecting these systems to Windows Update service.

We also recommend that you do not use these new settings with WSUS/SCCM.

If you are already using an on-prem solution to manage Windows updates/upgrades, using the new WUfB settings will enable your clients to also reach out to Microsoft Update online to fetch update bypassing your WSUS/SCCM end-point.

To manage updates, you have two solutions:

  • Use WSUS (or SCCM) and manage how and when you want to deploy updates and upgrades to Windows 10 computers in your environment (in your intranet).
  • Use the new WUfB settings to manage how and when you want to deploy updates and upgrades to Windows 10 computers in your environment directly connecting to Windows Update.

So, the moment any one of these policies are configured, even if these are set to be “disabled”, a new behavior known as Dual Scan is invoked in the Windows Update agent.

When Dual Scan is engaged, the following change in client behavior occur:

  • Whenever Automatic Updates scans for updates against the WSUS or SCCM server, it also scans against Windows Update, or against Microsoft Update if the machine is configured to use Microsoft Update instead of Windows Update. It processes any updates it finds, subject to the deferral/pausing policies mentioned above.

Some Windows Update GPOs that can be configured to better manage the Windows Update agent. I recommend you test them in your environment


After reading that, I went back in our GPO and did some more digging, since all our WSUS client settings are defined in GPO, turns out we have the “Do not include drivers with…” setting enabled.  So ultimately it was this setting that led to the whole “Dual Scan” mode being enabled, which led to us downloading MS updates (needed to happen anyway), which led to us disabling that, which led to WSUS not being used at all. So after setting both settings to not configured and doing a lot of GPUpdates / restarting of the windows update services, eventually I went from getting that error, to everything being back to normal.  That is, my client connecting to WSUS and downloading updates the right way.

Lessons learned besides not just randomly enabling WSUS settings, is that Microsoft in my not so humble opinion, needs to do a better job with the entire WSUS client control.  This is just stupid behaviour to be blunt.    What I would suggest that MS do is as follows.

  1. For Pete’s sake, have a damn setting that controls whether we want updates via WSUS, WUFB, or neither.  I mean it seems like such an obvious thing.  Clearly implied settings conflict.  If you have to write a damn article explaining all the gotcha’s you failed at building an user friendly solution.
  2. Group settings that are Windows update for business specific in their own damn GPO folder and their own damn reg key.  This way there’s no question these are for WUfB only.  Similar to WSUS.
  3. If WSUS is enabled, ignore WUfB settings and vice versa.

Anyway, hope that helps any other poor souls out there.

Review: 5 years with CommVault

Introduction:

Backup and recovery is a rather dry topic, but it’s an important one.  After all, what’s more critical to your company than their data?  You can have the best products in the world, but if disaster strikes and you don’t have a good solution in place, it can make your recovery painful or even impossible.  Still, many companies shirk investment in this segment.  The good solutions (like the one I’m about to discuss) cost a pretty penny, and that’s capital that needs to be balanced with technology that makes or saves your company money.  Still, insurance (and that’s what backup is) is something that’s typically on the back of companies minds.

Finding the right product in this segment can be a challenge, not only because every vendor tries to convince you that they’ve cracked the nut, but because it seems like all the good solutions are expensive.  Like many, our budget was initially constrained.  We had an old investment in CV (CommVault), but had not reinvested in it over the years, and needed a new solution.  We initially chose a more affordable Veeam + Windows Storage Spaces to handle our backup duties.  It was a terrible mistake, but you know, sometimes you have to fail to learn, and so we did.

After putting up with Veeam for a year, we threw in the towel and and went back to CV with open arms.  Our timing was also great too, as Veeam had put a serious hurt on their business and some of their licensing changed, to accommodate that.  We ultimately ended up with much better pricing than when we last looked at CV, and on top of that, we actually found their virtualization backup to be more affordable and in many ways more feature rich.  CV isn’t perfect as I’ll outline below, but they’re pretty much as close as you can get to perfection for a product that is the swiss army knife of backup.

CommVault Terms:

For those of you not super familiar with CV, you’ll find the following terms useful for understanding what I’m talking about.  There are a lot more components in CV, but these are the fundamental ones.

  • MA (Media Agent): Simply put, it’s a data mover.  It copies data to disk, tape, cloud, etc.
  • Agent: A client that is installed to backup an application or OS.
  • VSA (Virtual Server Agent): A client specially designed to for virtualization backup.
  • CC (CommCell): The central server that manages all the jobs, history, reporting, configuration, etc.  This is the brains of the whole operation.

Our Environment:

  • We have five MA’s.
    • Two virtual MA’s that backup to a Quantum QXS SAN (DotHill). This was done because we were reusing an old pair of VMhost and have a few other non-CV backup components running on these hosts.
      • The SAN has something like two pools of 80 disks. Not as fast as we’d like, but more than fast enough.  The QXS (DotHill) was our replacement for Storage Spaces.  Overall, better than Storage Spaces, but a lot of room for improvement.  The details of that are for another review.
    • Two physical MA’s with DAS, each MA has 80 disks in a RAID 60, yeah it rips from a disk performance perspective J. Multiple GBps
    • One physical MA that’s attached to our tape library.
  • We have five VSA’s, I’ll go more into this, but we’re not using five because I want to.
  • We have one CC, although we’ll be rolling out a second for resiliency and failover soon.
  • We have a number of agents
    • Several MS Exchange
    • Several MS Active Directory
    • Several Linux
    • The rest are file server / OS image agents.
  • In total, we have about a PB of total backup capacity between our SAN and DAS, but not all of that is consumed by CV (most is though).
  • We only use compression right now, no dedupe.
  • We only use active fulls (real fulls) not synthetics

Pros:

  • Backup:
    • CV can backup practically anything, and also has a number of application specific agents as well. You can backup your entire enterprise with their solution.  I would contend with CV, there are very few cases that you’d need point tools anymore.  Desktops, servers, virtualization, various applications and NAS devices are all systems that can be backed up by CV.  Honestly, it’s hard to find a solution that is as comprehensive as them.  That being said, I can imagine you’re wondering if they do it all, can they do it well?  I would say mostly.  I have some deltas to go over a little farther down, but they do a lot and a lot well.  It’s one of the reasons the solution was (and still is) expensive.
    • I went from having to babysit backup’s with Veeam, to having a solution that I almost never had to think about anymore (other than swapping tapes). There were some initial pains at first as we learned CV’s way of doing virtualization backup, but we quickly got to a stable state.
  • Deployment / Scalability:
    • CommCell has a great deployment model that works well in single office locations all the way to globally distributed implementations. They’re able to accomplish all of this with a single pane of glass, which a number of vendors can’t claim to do.
    • Besides the size of the deployment, you’re not forced into using Windows only for most components of CV. A lot of the roles outlined above run on Linux or Windows.
    • CV is software based, and best of all, its an application that runs on an OS which you’re already comfortable with (Linux / Windows). Because of this, the HW that you deploy the solution on is really only limited by minimum specs, budget and your imagination.  You can build a powerful and affordable solution on simple DAS, or you can go crazy and run on NVMe / all flash SANs.  It also works in the cloud because again, it’s just SW inside a generic OS.  I can’t tell you how many backup solutions I looked at that had zero cloud deployment capabilities.
    • There are so many knobs to turn in this solution, it’s pretty tough to run into a situation that you can’t tune for (there are a few though). Most of the out of box defaults are fine, but you’ll get the best performance when you dig in an optimize.  Some find this overwhelming and I’ll chat more about that in the cons, but with CV’s great support and reading their documentation, it’s not as bad as it sounds.  Ultimately the tuneablity is an incredible strength of this solution.  I’ve been able to increase backup throughput from a few hundred MBps to a few GBps simply by changing the IO size that CV uses.
  • Support:
    • Overall, they have fantastic support. Like any vendors support, it can vary and CV is no different.  Still, I can count on my hand the number of times support was painful, and even of those times, ultimately we got the issue resolved.
    • For the most part, support knows the application they’re backing up pretty well. I had a VMware backup issue that we ran into with Veeam and continued with CV.  CV while not being able to directly solve the problem, provided significantly more data for me to hand off to VMware, which ultimately led to us finding a known issue.   CV analyzed the VMware logs best they could and found the relevant entries that they suspected were the issue.  Veeam, was useless.
    • Getting CV issues fixed is something else that’s great about CV. No vendor is perfect, that’s what hotfixes and service packs are for.  CV, has an amazing escalation process.  I went from a bug, to a hotfix that resolved the issue in under two weeks.
    • My experience with their supports response time is fantastic. I rarely find a time where I don’t hear from them for a few hours.  They’re also not afraid to simply call you and work on the problem real time. I don’t mind email responses for simple questions, but when you’re running into a problem, sometimes you just want someone to call you and hash it out in real time.  I also like that most of the time you get the tech’s direct number if you need to call them.
  • Feature requests: A little hit or miss, but feature requests tend to get taken seriously with CV, especially if it’s something pretty simple.
  • Value: This one is a mixed bag.  Thanks to Veeam eating their lunch, virtualization backup with CV has never been a better value.  I could be wrong, but I actually think virtualization backup in CV rings in at a significantly lower price than Veeam.  I would say at least 50% of our backup’s are virtualization.  It’s our default backup method unless there is a compelling reason to use agents.   This is ultimately what made CV an affordable backup solution for us.  We were able to leverage their virtualization backup for most of our stuff, and utilize agents for the few things that really needed to be backed up at a file level or application level.  The virtualization backup entitles you to all their premium features, which is why I think it’s a huge value add.  That being said, I have some stuff to touch on in the cons with regards to the value.
  • Retention Management: Their retention management is a little tricky to get your head around, but it’s ultimately the right way to do retention.  Their retention is based on a policy, not based on the number of recovery point.    You configure things like how many days of fulls you want and how many cycles you need.  I can take a bazillion one off backup’s and not have to worry about my recovery history being prematurely purged.
  • Copy management: They manage copies of data like a champ.  Mix it with the above point, and you have all kinds of copies with different retentions, different locations, and it all works rock solid.  You have control over what data get’s copied.  So your source data might have all your VM’s and you only want a second copy of select VM’s, not problem for them.  Maybe you want dedupe on some, compression on other, some on tape, some on disk, some on cloud, again, no issue at all.
  • Ahead of the curve: CV seems to be the most forward thinking when it comes to backup / recovery destinations and sources.  They had our Nimble SAN’s certified for backup LONG before our previous vendor.  They support all kinds of cloud destinations, the ability to recover VM’s from physical to virtual, virtual to cloud, etc.  This goes back to the holistic approach that I brought up.  They do a very good job of wrapping everything up, and creating a flexible ecosystem to work with.  You typically don’t need point solutions with them.
  • Storage Management: I love their disk pools, and the way they store their backup data.  First and foremost, it’s tunable, so if you want 512MB files to whatever size files, it’s an option.  They shard the data across disks, etc.  Frankly the way they store data is a no brainer.  They also move jobs / data pretty easily from one disk to another which is great.  This type of flexability is not only helpful for things like making it easier to fit your data on disparate storage, but also in ensuring your backup’s can easily be copied to unreliable destinations.  Having to recopy a 512MB file is a lot better than having to recopy an 8TB file.  CV can take that 8TB file if you want, and break it up into various sized (default is 2GB).
  • Policies: Most of the way things are defined, are defined using policies.  Schedules, retention, copies, etc.  Not everything, but most things.  This makes it easy to establish standards for how things should act, and it also makes it easier to change thing.
  • CLI: They have a ton of capability with their CLI / API.  Almost anything can be executed or configured.  I actually developed a number of external work flows which call their CLI and it works well.
  • Tape Management:
    • They handle tapes like a librarian, minus the dewy decimal system. Seriously though, I haven’t worked with a solution that makes handling tapes as easy as they do.
    • If you happen to use Iron Mountain, they have integration for that too.
    • They’re pretty darn efficient with tape usage as well, which is mostly thanks to their “global copy” concept. We still have some white space issues, but it makes sense why
    • They are very good at controlling tape drive and parallel job management. This allows you to balance how many tape drives are used for what jobs.
  • Documentation: They document everything, and for good reason, there is a lot their product does. This includes things like advanced features and most of the special tuning knobs as well.  It’s not always perfect, but it’s typically very good.
  • Recovery:
    • File level recovery from tape for VM backups, without having to recover the whole file, need I say more. That means if I need one file off an 8TB backup VMDK, I don’t have to restore 8TB first.
    • Most application level backup’s offer some level of item level recovery. It’s not always straight forward, or quick, but its usually possible.
    • They’re smart with how they restore data. You can pick where you want the data recovered from (location and copy), and if it does need tapes, it tells you exactly what tapes you need.  No more throwing every single tape in and hoping that’s all you need.

Cons:

  • Backup:
    • Virtualization:
      • Their VMware backup in many ways isn’t as tunable as it should be. There are places where they don’t have stream limits where they really need them.  For example, they lack a stream limit on a the vCenter host, the ESXi host or even the VSA doing the backup.  It’s honestly a little strange as CV seems to offer a never-ending number of stream controls for other areas of their product.  I bring this up as probably my number one issue with their VMware backup.  This led us to have the most initial problems with their solution.  I would still say this is a glaring hole in their virtualization backup.  I just looked up their CV11 SP7 and nothing has changed with regards to this, which is disappointing to say the least.  This is one area that I think Veeam handles much better than them.
      • The performance of NBD (management network only) based backup is bluntly terrible. The only way we could get really good performance out of their product was to switch to hot add.  Typically speaking I hate hot add for Vmware backup.  It takes forever to mount disks, and it makes the setup of VM backup more complicated than it needs to be.  Not to mention if you do have an issue during the backup process (like vCenter dying) the cleanup of the backup is horrible.
      • They don’t pre-tune VSA for hot add. Things like disabling initialize disk in windows and what not.
      • Their inline compression throughput was also atrocious at first. We had to switch the algorithm used which fixed the issue, but it required a non-gui tweak to achieve and me asking if there was anything else they could do.  It was actually timely that the new algorithm had been released as experimental in the release we just upgraded to.
      • Their default VM dispatch to me is less than ideal. Instead of balancing VM’s in a least load method across the VSA’s, they pick the VSA closest to the VM or datastore.  I needed to go in and disable all of this.
    • Deployment / Scalability:
      • While I applaud their flexibility, the one area that I think still needs work is their dedupe. To me, they really need to focus on building a DataDomain level of solution that can scale to petabytes of logical data in a single media agent, and right now they can’t scale that big.  It seems like you need to have a bunch of mid sized buckets which is better than nothing, but still not as ideal as it should be.
      • Deployment for CV newbies is not straight forward. You’ll definitely need professional services to get most of the initial setup done, at least until you have time to familiarize yourself with it.  You’ll also need training so that you actually know how to care for and grow the solution.  I think CV could do a better job with perhaps implementing a more express setup just to get things up, and maybe even have a couple of into / how to videos to jump start the setup.  It’s complicated, because of it’s power, but I don’t think it needs to be.  The knobs and tuning should be there to customize the solution to a person’s environment, but there should be an easy button that suites most folks out of the box.
    • Support: In general I love their support, but there are times where I’m pretty confident the folks doing the support, don’t have at scale experience with the product.  There are times when I’ve tried explaining the scaling issue we were having, and they couldn’t wrap their heads around the issue.  They also tend to get wrapped up in the “this is the way it works” and not in the “this is the way it SHOULD work”.  Which again I think comes back to the experience with product at scale.  This would tend to happen more when I was trying to explain why I setup something in a particular way, and a way that didn’t match their norm.  For example, VM backups, they like to pile everything into subclients.  For more than a number of reasons I’m not going to go into in this blog post, that doesn’t work for us, and frankly it shouldn’t work for most folks.  I was able to punch holes in why their design philosophy was off, but they were stuck on “this is the way it is”.  The good news is you can typically escalate over short sited techs like this and get to someone who can think outside the box.
    • Value: This is a tough one.  On one hand, I want good support and a feature rich product, but on the other hand, the cost of agent based backup is frankly stupid expensive.  When the cost of my backup product costs more per TB than my SAN, that’s an issue.  It’s one of the primary reasons we push towards VM based backup’s as its honestly the only way we could afford their product.  Even with huge discounts, the cost per TB is insane with their solution.  In some cases, I would almost rather have a per agent cost rather than a per TB cost.  I could see how that could get out of control, but I think there are cases where each licensing model works better for each company.  If I had thousands of servers, I could see where the per TB model might make more sense.  This is one of the reasons we don’t backup SQL direct with CV, it just costs too much per TB.  It’s cheaper for us to use a (still too expensive) file based agent to pick up SQL dump files.
    • Storage Management: Once data is stored on its medium, moving it off isn’t easy.  If you have a mountpoint that needs to be vacated, you need to either aux copy data to a new storage copy, manually move the data to another mountpoint, or wait till it ages out.  They really should have an option in their storage pool to simply right click the mountpoint and say “vacate”.  This operation would then move all data/jobs to whatever mountpoints are left in the whole pool.  Similar to VMwares SDRS.  I would actually like to see this ability at a MA level as well too.
    • CLI: I’ll knock any vendor that doesn’t have a Powershell module and CV is one of those vendors.  Again, glad that you have API’s, but in an enterprise where Windows rules the house, Powershell should be standard CLI option.
    • Tape Management: As much as I think they do it better than anyone else, they could still improve the white space issue.  I almost think they need a tapering off setting.  Perhaps maybe even a preemptive analysis of the optimal number of tapes and tape drives before the start of each new aux copy, and re-analyze that each time you detect more data that needs to be copied to tape.  This way it could balance copy performance with tape utilization.  Maybe even define a range of streams that can be used.
    • Documentation: As great as their documentation is, it needs someone to really organize it better.  Taking into account the differences in CV versions.  I realize it’s probably a monumental task, but it can be really hard to find the right document to the right version of what you’re looking for.  I’ve also found times where certain features are documented in older CV version docs, but not in newer ones (but they do exist).  I guess you could argue at least they have so much documentation that it’s just hard to find the right one, vs. not having any doc at all.  When in doubt though, I contact support and they can generally point me in the right direction, or they’ll just answer the question.
    • Recovery:
      • Item level recovery that’s application based really needs a lot of work. One thing I’ll give Veeam is they seems to have a far more feature rich and intuitive application item level recovery solution than CV.
        • Restoring exchange at an item level is slow and involved (lots of components to install). I honestly still haven’t gotten it working.
        • AD item level recovery is incredibly basic and honestly needs a ton of work.
        • Linux requires a separate appliance, which IMO it shouldn’t. If Linux admins can write tools to read NTFS, why can’t a backup vendor write a Windows tool that can natively mount and ready EXT3/4, ZFS, XFS, UFS, etc.
      • P2P, V2V / P2V leaves a lot to be desired. If you plan to use this method, make sure you have an ISO that already works.  Otherwise you’ll be scrambling to recover bare metal when you need to.

Conclusion:

Despite CommVaults cons, I still think it’s the best solution out there.  It’s not perfect in every category, and that’s a typical problem with most do it all solution, but it’s pretty damn good at most.  It’s an expensive solution, and its complicated, but if you can afford it, and invest the time in learning it, I think you’ll fall in love with it, at least as much as one can with a backup tool.

Thinking out loud: Why do server vendors still struggle with driver and firmware management?

History:

Let me give you a little back story before digging into the meat of this post.  My team and I make a very concerted effort to keep our servers firmware and drivers updated.  We’ve gone so far as to purchase software from Dell, implement a process on how firmware / drivers are to be updated, and ensuring that its routinely done every quarter.  We do this because in general it’s a best practice, but also because we’ve run into too many occasions where troubleshooting with a vendor stops (if it ever starts) very quickly if the drivers / firmware isn’t recent.  In essence, we’re doing our best to be diligent and proactive with keeping our servers healthy, secure and updated.

Late last year we ran into two issues, both of which are related to drivers / firmware.

  1. A Broadcom NIC causing a purple screen of death (PSOD) in ESXi. This was a server that was freshly rebuilt and all drivers (or so we thought) and firmware updated.  Turns out the driver we were running was more than two years old and the PSOD we were having was a resolved issue in a newer driver.
  2. An Intel x710 quad port 10Gb NIC causing packets to black hole for certain VM’s on the same vLAN. Again, these were new hosts that were patched, firmware updated and in theory up to date.  This issue is what really triggered us to start evaluating other server vendors and their solutions.

 

Of those two issues above, only one was solved with a simple driver update, and the other, we just gave up on and switched to a different NIC (x520).

The Issue:

If you can’t see where this post is going already, let me lay it out.  Server vendors still can’t properly manage their own drivers, firmware and vendor specific software. I know what you’re thinking, you have tools that the vendor provided, you’re using them and you’re fine.  I hate to be the bearer of bad news, but I doubt it.  We just got done a rigorous evaluation of Dell, HP and Cisco.  None of them have a complete solution.  Don’t get me wrong, some of them are getting there, but no one has the problem solved.  If you’re wondering what the specific problems are, see the bullets below.

  • Server vendors would like that you keep your firmware, drivers and tools up to date.
  • OS vendors (and sometimes server vendors) require that the driver and firmware have a certified pairing. It is NOT good enough to simply have the latest driver and the latest firmware.  This of course may vary slightly depending on which server and OS vendor.  VMware though as an example, absolutely has driver and firmware paring that’s required.
    • This driver and firmware pairing is typically worked out between the server and OS vendor.
    • VMware has a strict HCL for this use case, and TMK, MS has an HCL, although they’re a little more forgiving when it comes to the pairing. I can’t speak about Linux, Solaris or other OS’s.

 

Think about this, when was the last time you did the following?

  • Retrieved an inventory of all your hardware firmware revisions and driver revisions.
    • Do you even know how to do this? Probably not as easy as you think.
  • Logged into your OS vendors HCL and one hardware item at a time, checked if you are running the latest driver and firmware, and that the pairing is also certified.
    • With VMware, you can use the device vendor ID, device ID and sub vendor ID to find the specific hardware in question on their HCL. Just remember its hex values.

I bet you’re either relying on the following.

  • VMware update manager, and vendor provided depots (if they exist).
  • Vendor supplied firmware management solutions.
    • Some may have driver management for select OS, but no one does it all.
  • Vendor custom ISOs / install discs.

I’m going to suspect if you go and check your VMware HCL, you’re out compliance in one way or another, or something is woefully out of date.

Solution:

Let’s share a pipe for a second and dream about what it should look like.

 

Server Vendor:

  • It should be a central console.
  • It should handle downloading all firmware, drivers and vendor specific tools.
  • It should use a concept of baselines. A baseline being defined as.
    • OS and server model specific.
    • Based on a release date. The baseline should define an approved pairing of drivers, firmware and vendor tools for a given month, quarter or however often the server vendor feels the need to establish a new baseline.
    • The baseline should support the concept of cumulative updates / hotfixes.
  • It should support grouping servers, and applying baselines to the server groups.
  • It should support compliance checking for the baselines. Not simply deploying the drivers and firmware and assuming everything is ok.  This would let you know if an admin went rouge and manually updated or downgraded firmware / drivers or tools.
  • It should support rolling back drivers, firmware or tools if it is determined to be too far ahead.
  • Provide very verbose information as to why an update process failed.
  • Bonus points
    • Support a multi-site architecture. Meaning be able to have a cached copy of the repo and a local server to perform the actual update and auditing process.
    • Auto discover servers

OS Vendor:

  • Should provide a comprehensive API
    • Look up hardware and driver pairings.
    • Enable the download of the driver or firmware directly would be nice.

Conclusion:

Coming back to reality a bit, what can you do?  Use the tools you have to the best of your ability, script what you can, and manually deal with the gaps.  That said, I’m working on the auditing part of the problem, at least with VMware. Hope to have blog post about it and a new GitHub commit in the coming month or so, stay tuned.

Oh, one final thing you can do, start bugging your server vendor sales team about the issue. If enough people raise the issue, it will get the attention it desperately needs.

Thinking out loud: What HP + Nimble means to me

Disclaimer:

These are opinions, not facts and these opinions are mine, not my employers.

Introduction:

Upon receiving the news that HP was to acquire Nimble, I can’t say I was exactly thrilled.  Nothing personal against HP, they make great servers, but I like Nimble the way it is right now.  None the less, I know the industry is moving in a direction where its either get big, or get out.  There is a huge storm “cloud” looming, and if you’re an on premises solution, its going to be a scary time in the coming years.

I was thinking about what would be some of the pros and cons of the HP acquisition and this is what I’ve come up with.

Pros:

  1. HP is a big company and an established one at that. We’ll focus on the pros being big/established in this section.
    1. HP will likely have an easier time pitching Nimble into companies that would not have given them a second look. HP is established, so there’s a perception that Nimble is established.  This leads to better market penetration.  Nimble getting better market penetration means Nimble makes HP more money and if Nimble makes HP more money, HP invests more into Nimble.  Hopefully the circle of money keeps snow balling and we all win.
      1. HP is a world wide company and while Nimble has done a great job so far, HP is going to take them into more countries faster than they could on their own. If you’ve had difficulty getting Nimble equipment purchased “in country”, I can see this getting easier long term.
    2. Obviously HP has more capital at their disposal than Nimble did. If invested correctly, I could see this accelerating Nimble’s innovation.
    3. HP has more purchasing power than Nimble does, this could lead to Nimble’s margins being better, which in turn may lead to us having a more affordable product (or more profit for HP).
  2. Look, we all know why tech companies pick Supermicro, and its not quality, its affordability. HP makes some pretty kick ass hardware, so if we were to see Nimble’s hardware platform change from Supermicro to HP equipment, not only would my datacenter look a little sexier, I wouldn’t cut my fingers trying to rack Nimble anymore.
  3. If you’re a current HP customer, I can see two nice integration points.
    1. Infosight for other HP solutions.
    2. Nimble integration into OneView.

Cons:

  1. As mentioned in the pros, HP is a large established company. While this in its self can have some pros, it also has the potential for a number of cons.
    1. Big companies tend to move slow, bureaucracy and over analysis being suspect causes. Nimble had far fewer hoops to jump through before making a decision.  Just remember, deadlines and accomplishments drift a day at a time.  Days become weeks and weeks become months, and you get the picture.
    2. Every company is profit conscious, but some larger companies will kill any sliver of waste, even at the cost of productivity or customer satisfaction. I’m not saying it will happen with HP, only that it could.
    3. While HP will open a lot of new doors for Nimble, it has the potential to close a lot of existing ones too. There are a lot of companies that have had bad experiences with HP and this may be enough for them to drop Nimble.  That said, being realistic, it seems one way or another, you’re going to be purchasing storage from some big vendor, and it may not be the same as the one you purchase servers from.
    4. If HP tries to assimilate Nimble into their ways, I can see this being bad for Nimble customers. Nimble for example has a great support experience.  If HP tries to force Nimble to adopt their triage and support structure, that would be a quick way to devalue Nimble.  There’s other things too, like getting stuck speaking with a generic HP sales rep and sales engineer, instead of having direct access to a Nimble SE and a Nimble sales person, or other typical large company sales and support processes.
  2. HP hasn’t made a great name for themselves here of late. We know they’ve split the company in half and sold off a lot of assets.  It’s hard to say if it’s too little too late, or if it was the right move and just in time.  Regardless, HP to me is a company that’s walking a fine line of a falling giant, or one that’s getting back on its feet.  If HP goes down, Nimble goes with it, and that’s not good for Nimble customers.
  3. HP isn’t exactly synonymous with innovation, at least not any more. I fear that HP has the potential of choking the life out of Nimble.  In my opinion, 3Par was a great storage solution.  Part of me wonders if HP couldn’t make that work, what makes them think Nimble will be any different?  Meaning, are they going to turn Nimble into the next Equallogic?

Other thoughts:

I think deep down everyone knew Nimble wanted to get bought.  Me personally, I was REALLY hoping Cisco was going to buy Nimble.  In my opinion, Cisco + Nimble would go together like peanut butter and fluff.  HP already has a storage company that’s flailing.  I don’t want Nimble to follow suite.  People like to remind me about about Whiptail and how bad that was.  I look at that as a rash move on Cisco part (the solution was doomed to fail), but Nimble would be a pick that no one could blame Cisco for.  Best of all, Cisco doesn’t have any competing products (other than Hyperflex, but that’s a different type of solution).  This would have led to a much stronger and untied focus on pushing Nimble.  From Nimble’s view, it would have solidified them as being established (opening the closed doors), and for Cisco, it would have given them a proven storage startup that’s on fire.  Honestly, if I was Cisco’s CEO, I would be doing everything I could to steal the deal from HP.  If it was a matter of HP vs. Dell vs. Cisco, and Cisco was the one with Nimble, IMO, Cisco would crush the other two like a ten-ton hammer.

Conclusion:

This is obviously all speculation at this point, just thinking out loud.  I hope all the pros of what I pointed out occur with the acquisition and none of the cons.  I wish both vendors the best of luck, and until proven otherwise, I’m still a diehard Nimble (HP) fan.

Naming Conventions: Server Names

Introduction:

One of the things I’m struggling with, is how to balance the number of posts per naming convention.  It would be easy in some ways to use a single post per server type, but it would also be overly redundant in many ways.  I originally wrote a dedicated post to SQL server naming conventions, and realized that the logic behind its name is ultimately a similar logic for other server names.  With that said, I’ve decided to create a consolidated post for server names.  I’ll rehash the overall structure used for the SQL naming convention, and show you how its reusable for other servers.

Limitations:

To begin with, 15 characters is a length limitation I would always suggest maintaining.  The only exception is in the following circumstances.  If you’re building a server that isn’t Windows, and will NEVER need to join a Windows domain, then and only then can you make names longer than 15.  Microsoft in their perpetual need to maintain excessive backwards compatibility, still hasn’t dropped NetBIOS out of its architecture.  Even if you build a Microsoft domain, 100% running on DNS resolution, they still truncate names for backwards compatibility.  I wish they would provide a naming resolution compatibility mode that would in essence switch the domain from NetBIOS supported to DNS only, but that’s a whole different blog post.

My naming conventions are designed to scale for smaller to mid-sized companies.  If you’re dealing with 10s of thousands of servers, this naming convention won’t scale to your needs.  My names are designed to give you a hint of what the server does.  When you’re at the 10s of thousands size, you need a whole new way of dealing with server names.

Other stuff:

I used to really get hung up on server numbering.  What I mean by that, is if I was running a smaller shop with say two domain controllers.  Let’s call them DC1 and DC2 for simplicity.  I would want to keep those names every time I did a major upgrade.  Ultimately it led to a lot of shuffling, and a lot of time spent for something that was ultimately cosmetic and not important.  Point being, if you are or were like me, learn to let it go.  Sometimes you will have a DC3 and a DC4 when there are no DC1 and no DC2.

When you’re dealing with systems that need to connect to other systems by name, learn that CNAME records can be your best friend. I strongly suggest to avoid pointing things directly at a server name, if whatever you’re pointing is mostly a generic service.  For example, its very common to have many things pointing at your DC’s for LDAP lookups.  Rather than doing something like pointing at DC1 and DC2, create a CNAME record for something like “ldap1.domain.com” and “ldap2.domain.com”.  Then when you change your DC’s, you only have two records to change.

It’s expensive, but load balancers can also help with renaming / moving things.  They help because of their ability to create a “virtual IP” and redirect that traffic to any real IP as needed.  In the case of DNS, it would enable you to move your DNS functionality to a new server without having to change the IP you have configured across all your systems.

My naming convention basics:

There are a few basics planned into my naming conventions.  These basics make it easier to organize servers, and ultimately find / figure out what a server’s purpose is.  Obviously with a 15-character limit, there are going to be a lot of abbreviations.  In my opinion, so long as you’re consistent, even vague abbreviations will eventually be memorable or make sense.

Hyphens:

I use hyphens to separate may of the naming conventions purposes.  I know its potentially wasting very precious characters, but at our scale, we can mostly afford to do it, in exchange for making it easier to programmatically find servers.  You don’t need to use hyphens, I do it because its easier for me to script with.  Plus, they make for a consistent separator.  Ultimately though, consistency as I stated above, is what’s important.

Location variable:

I prefer to start all names with either a location variable.  At two completely different companies, I inherited naming conventions that used their company’s acronym for their primary site, and DR for their disaster recovery site servers.  There are two problems with using something this descriptive.

  1. I’ve worked for a company that flipped the location of their DR and primary site. It meant for a very confusing period of time where some servers might have said “DR” but were actually now in the primary headquarters, and vice versa.
  2. If you’re company has more than one office or more than one DR site, the naming convention kind of falls apart.

The main goal for the server location should be generic, but consistent.  Using something like AA1, is just as likely to suffer from the issue above in point 1, but it does solve the issue of problem 2.  If you have multiple locations, you just keep incrementing the number.  So AA2, AA3 ….AA9, and then increment the letter to AB1.  It leaves a TON of room for different / unique locations.  Heck, maybe you don’t even need the double letter.  Math isn’t my strength, but if my calculations are correct even something like one letter + one number = 234 locations, and that assumes you never use the number “0”, in which case it would be 260 locations.

Application:

I like to use something short (as in 3 letters or less) to tell me something about the application or purpose of the server.  For example, I might use “SQL” for a SQL Database Server, or RMQ for a Rabbit MQ server, or EXM for an Exchange Mailbox server.  It can get a little tricky of course, after all you have MySQL and MSSQL, but maybe that doesn’t matter.  After all, a SQL DB is a SQL DB.  The reason I keep it three letters or shorter (on average) is to leave room for a clustering naming conventions (coming up).  Of course if you’re not limited by the 15 characters, you can get a lot more verbose, but at least for those of us in windows shops, that’s tough.

Environment:

This one is short and easy, I use a few letters to denote the environment of the server.

  • P = Production
  • S = Stage
  • U = UAT
  • D = Dev
  • T = Test
  • X = Sandbox

Clustered or Standalone:

I like to denote if this is a clustered system or a standalone system.  The standalone part of is pretty easy, I just use an “S”.  Sometimes I’ll trail it with a number, like S1, or S2 to denote that the application isn’t clustered but that they’re related (you’ll see an example later).

For the cluster part, it gets a little more involved and varies a little bit based on the application.  We start out with a simple “C” to denote clustered, but then I like to use another trailing letter / number as well.  Let’s look at a few cluster examples.

  • CN1 = Clustered Node 1
  • CN2 = Clustered Node 2
  • CDI1 = Clustered Database instance
  • CDG1 = Clustered Database Group 1

Application group number:

This is the final number that really ties everything together.  I use a simple “01”, “02” or whatever number really to tie all clustered nodes or even standalone (related) systems together.

Putting it all together:

Here are a few practical examples to give you an idea of how it all goes together.

Example 1:  A SQL environment to support the widgets application.  This SQL environment will have a full development lifecycle environment and UAT will mirror Production exactly.  We’ll be using SQL AAG’s.

  1. Dev = a1-sqlds-01
  2. Stage = a1-sqlss-01
  3. UAT =
    1. Nodes
      1. A1-sqlucn1-01
      2. A1-sqlucn2-01
    2. Clustered Named Object (management)
      1. A1-sqluc-01
      2. SQL AAG listener names
        1. A1-sqlucdg1-01
        2. A1-sqlucdg2-01
      3. Prod =
        1. Nodes
          1. A1-sqlpcn1-01
          2. A1-sqlpcn2-01
        2. Clustered Named Object (management)
          1. A1-sqlpc-01
  • SQL AAG listener names
    1. A1-sqlpcdg1-01
    2. A1-sqlpcdg2-01

Notice how the last number glues everything together.  Also notice how everything is built off a consistent naming standard.  If we deployed a new application

Example 2: How about something simple like a domain controller environment for 3 sites?  Let’s just say there will be a production environment and a test environment.

  1. Test
    1. Site a1
      1. A1-dctcn1-01
      2. A1-dctcn2-01
    2. Site a2
      1. A2-dctcn1-01
      2. A2-dctcn2-01
    3. Site a3
      1. A3-dctcn1-01
      2. A3-dctcn2-01
    4. Production:
      1. Site a1
        1. A1-dcpcn1-01
        2. A1-dcpcn2-01
      2. Site a2
        1. A2-dcpcn1-01
        2. A2-dcpcn2-01
      3. Site a3
        1. A3-dcpcn1-01
        2. A3-dcpcn2-01

Again, notice how I use a single final number to glue an entire “purpose” together.  If I built a second discrete domain, I would likely change the last number to “02” which would quickly tell me that the domain controller is part of a separate domain.  Also notice how you can easily tell which node a DC is, which site a DC is in, and what its environment is.

Example 3:  A non-clustered exchange server environment that’s serving the same company.

  1. Mailbox servers:
    1. A1-exmps1-01
    2. A1-exmps2-01
  2. CAS Nodes (load balanced)
    1. A1-excpcn1-01
    2. A1-excpcn2-01
  3. CAS VIP
    1. A1-excpc-01_vip

Here you can see how the standalone mailbox servers are working for the same purpose, but ultimately they’re not clustered together.  With the CAS servers, you can see that they are in fact clustered and that we even created a VIP DNS name that helps you understand how everything is related.

Other examples:

At this stage I’m just going to bullet a list of various options, I’ll use a single destination and a single application number since we’ve already gone over that.

File Servers:

  1. Clusters
    1.  Nodes
      1. a1-fspcn1-01
      2. a1-fspcn2-01
    2. Cluster Resource
      1. a1-fspc-01
    3. Clustered SMB share
      1. a1-fspcsmb1-01
      2. a1-fspcsmb2-01
    4. Clustered NFS share
      1. a1-fspcnfs1-01
      2. a1-fspcnfs1-01
  2. Standalone
    1. a1-fsps-01

CommVault:

  1. Comcell
    1. a1-cvccps1-01
  2. MediaAgent
    1. a1-cvmaps1-01
    2. a1-cvmaps2-01
  3. Virtual Server Agent (for dedicated VM’s)
    1. a1-cvvsaps1-01
    2. a1-cvvsaps2-01

DHCP:

  1. Clusters ***Note: because DHCP consumes 4 characters, I leave the “c” off the name. 
    1. a1-dhcppn1-01
    2. a2-dhcppn2-01
  2. Standalone
    1. a1-dhcpps1-01

Review: 1.5 years with MVP Systems Job Automation Scheduler (JAMS)

Introduction:

I wrote a really quick review here about MVP systems JAMS product about 1 year or so ago (maybe a little less).  At the time, I was in search of a solution that could help me glue together several disjointed systems in a workflow.  Specifically, we were trying to integrate Veeam and CommVault backup’s together.  Veeam was of course doing the VM backup’s, and CommVault was copying the Veeam files to tape.  We’ve since moved on from Veeam, but JAMS has continued to be a vital part of our infrastructure.

What is JAMS?

The simple answer is it’s a centralized task scheduler, the long answer is its not only that, but a whole lot more.  This is a solution that replaces cron, windows task scheduler, SQL agent jobs, or pretty much anything else that you would normally use to schedule and execute something.

What makes up a JAMS solution?

There are four main components.

  • The JAMS server: This is clusterable component that schedules, queues and executes any jobs or workflows.
  • The JAMS client: This is the administration GUI.  Kind of self-explanatory, but this is where you would configure all of the settings for the various jobs, and server settings.
    • For windows, this also includes a Powershell module for CLI administration. Pretty sure they have a generic API, but I never bothered to look since PS was available.
  • The JAMS agent: This is a component that is installed on a system where you want to execute jobs.  All kinds of OS’s supported.
  • Microsoft SQL server: Check with MVP systems if other DB’s are supported, but we’re a MS shop, and SQL is on their list.  This is used to store the job history, job status, and pretty much the entire server configuration.  If this goes down, you have big issues to deal with J.  And yes, a clustered SQL server IS supported.

All in all, the infrastructure is pretty simple to understand and for smaller use cases, these roles can all be installed on the same system.

History:

I didn’t start out with JAMS, in fact, they were nowhere in sight when the initial problem came to fruition.  I figured this would be a relatively trivial Powershell solution, and started down the path of trying to write a quick workflow.  Building the logic for the workflow was actually pretty easy, but what I kept running into was the good ‘ol Kerberos double hop condition.  Never heard about it?   Read about it here.  In order to centralize the solution, I basically tried to build my own poor mans centralized task scheduler.  In order to keep it central, I was utilizing “invoke-command” to execute scripts on our Veeam server and our CommVault server.  With Veeam, our database was stored on a different server, so when my “invoke-command” executed against Veeam, my credentials were never passed along to the SQL server.  I was able to work around it by using CredSSP, but it wasn’t reliable.  Sometimes it would work, and sometimes I guess it would timeout or something similar (don’t really remember to be honest).  Then there was the issue with CommVault.  See they used old fashion EXE’s to start jobs (we were on v8) from the command line.  The commands I needed to run had to be executed in sequential order.  Anyone who has worked with Powershell’s “start-process” via invoke-command, knows that the “-wait” parameter is ignored.  I don’t recall the reason, but it was lame on MS part.  Ultimately, it was this that was the deal breaker, and so started the search for some sort of centralized task scheduler.
We ultimately landed on a cheap’o but well known solution called “VisualCron”.  I’ve got nothing against the solution, but after working with it for a few days, not only was it very hacked together, but it wasn’t the most user friendly solution.  So the search continued and we ultimately stumbled across JAMS.  It took a lot of creative searching to find them, but I’m glad we did.  After installing the trial, we knew it was the solution we were looking for, and the rest as they say, is history.

The pros:

  • Easy solution: Pretty easy to install the solution and understand the components. Unlike some other solutions we’ve installed, JAMS takes care of installing any pre-requisites and also has an easy to understand architecture.
  • You get tech support: Normally not something to write home about, but we leveraged their support quite a bit at first, and they were normally helpful.  As simple as JAMS is, it can do a lot of stuff, and that’s where support can (and is) a huge help.  I remember one part of a solution where were trying to pass a variable from one job to another.  Called up support, and sure enough, JAMS could do it and they showed us how.  How about bulk creating a bunch of jobs via PS?  Yep, support had an example of that too.
  • The GUI: This is one where I have pros and cons.   We’re in the pros section, so that’s what I’ll focus on here.
    • I’ve never worked with a GUI that was capable of bulk edits, but JAMS is and it rocks. Just imagine wanting to change the start time on 60 jobs.  You could write a script to do it, or you could highlight the 60 jobs in a folder, right click and basically change the value of one field (time) to another value.  Then BAM! It goes and changes the time for all highlighted jobs.  Pretty much any column you can add to the GUI has this functionality and it rocks.
    • Easy to see all jobs scheduled to run, running or failed in one view.
    • It keeps a detailed log of each job execution. If you write output to the host (think write-host in PowerShell or REM in batch), that output gets logged to a file and stored for historical purposes.  So as long as your script has verbose output, you’ll know exactly what happened in your job.
    • Sort of related to the above, it keeps a history of all executed jobs and their final status. It also tracks things like when it ran, how long it ran, how many resources it consumed, etc.
    • They have some pretty neat dashboards (once you figure them out). There are a few cool built in ones (like projected schedule) too.
    • Last but not least, it’s a pretty easy GUI to use. I won’t say it doesn’t have any learning curve, but I think the learning curve is really more related to the solution than the client its self.
  • Scripting engines: The agent can execute all kinds of scripts.
    • Powershell
    • Batch
    • Bash
    • SSH
    • T-SQL
  • Agent OS Support: The agent can be installed on different OS’s, so this isn’t a 100% windows only solution.
  • Workflows (setups): You can build “setups” (workflows) that tie jobs together. The jobs themselves can run on completely different systems.  In our case, we had a setup which had a “job” that ran on a veeam server and a different job that ran on the CV server.  The Setup was configured to wait until the first job completed with a success before moving on.
  • Job Queueing: It supports queueing jobs. Probably not an issue for many folks, but we used to limit the number of tape backup’s running in parallel.  What’s great is each “job” in jams can share a queue or have different queues.  This allows a Setup to execute the first job (as an example) and if needed, the second job will queue.  We typically had 50 setups running in parallel, but only 4 tape jobs that were allowed to run in parallel.  JAMS would execute all 50 setups in parallel, but when it came time to run the tape portion of a setup, the tape jobs would go into a queue and trickle out as others completed (or failed).  This didn’t stop the first jobs (the backup its self) from completing, so it ultimately kept things moving at a great pace.
  • PowerShell: Being able to admin JAMS through PS is a huge win in my book.  You can create, modify and delete, jobs, setups, queues, etc.  Everything in the GUI can be done in PS.  It’s sad in 2017 that I even have to list this as a pro of a solution. None the less, it’s not as common as it should be, and it’s a win for JAMS.
  • Different Licensing: With a lot of solutions, there’s only one licensing strategy. I found that JAMS had several, and they do so to accommodate differing needs and purposes.
  • Sales team: The sales team I worked with was friendly, knowledgeable and not pushy in any kind of way.  Additionally, what I think is worth noting, is while it felt like we were shopping for a Ferrari, they understood we were on a Corvette budget, and worked with us to find a licensing model (and some pricing breaks) to let us drive home in a solution we really wanted.

The cons:

  • Price: I’m not saying it’s overpriced, all I’m saying is its not cheap.  I would love to use this solution for my whole environment, but it’s not cheap to do that.  I’m not saying they won’t work with you (they will), but to scale the solution, you will be digging deeper in your wallet.
  • The GUI: I think the GUI has some great design characteristics, but I also think it has some flaws too.
    • They recently updated the GUI look. I’m personally not a fan.  It’s a matter of opinion of course, but I find it harder to see what I need to see now.
    • I don’t like the way they separate jobs from setups. I wish they just used a different icon, or a value in a field to separate them.  There are plenty of times I click on a folder and forgot that I’m in the “setup view” when I’m looking for a job.
    • They don’t support right click for certain job management features. I intuitively want to right click in the jobs window and select “new job” or something related, but that’s not the way the GUI is designed.
    • When you bulk submit jobs, they ask you if you want to, for each job. That means if you selected 25 jobs, you’re clicking “submit” 25 times afterwards.
  • Their security design: I found that their security model didn’t work quite like one might think.  I remember working with a tech to do something simple like let our DBA’s manage jobs (execute and read) and something as simple as that required what seemed like a million hoops to jump through.  Ultimately IIRC (it’s been a while) we ended up needing to grant them more rights than I would have wanted in order to accomplish what seemed like a trivial task.  I gave up on it because I didn’t want to create a solution that going to be too complex to manage.
  • Overlapping job detection: I remember when I first started with their solution, we had run into a few cases where jobs (or setups) were overlapping on themselves.  Meaning Job A from Monday night was still running and Tuesday nights job started up and started running.  When I asked support about this, they handed me a script that would nuke the Tuesday job, but ultimately didn’t solve my issue of needing Tuesday’s job to just wait.  I ultimately ended up writing a pre-check job that would detect if any of the same jobs were running and if so, to go into a loop where it checks every minute, waiting for the previous job to complete.  What sucks about this and the script they gave me, is every job I launch that has a pre-check job, this ends up burning my job count.  To me, this just seems like something that should be built into the solution.
  • Maintenance Mode: They don’t seem to have a maintenance mode option.  What I mean by that, is being able to put JAMS into a paused state.  I think you can stop a service on the windows hosts, but honestly that’s a hack.  They should just have a maintenance mode option built right into the GUI.   I could see something like having a few options like, queue any new jobs that start, or let existing jobs finish, but queue anything new, or don’t let any jobs start at all.  Bonus points if this could be done at a folder level.

Conclusion:

Ultimately after living with JAMS for almost 1.5 years, I think they really rock as a solution.  I can’t say I have any experience with any other enterprise job scheduling solutions, but my overall experience with JAMS has been a pleasure.  No solution is perfect, and theirs is no exception, but the great news is they have a solution that is ultimately awesome, with a few negatives, which is a far cry from other vendor’s solutions I’ve used.  My suggestion is if you’re looking for something to replace SQL jobs, task scheduler, cron or any other isolated solution, to give them a look, I think you’ll be pleased.

Naming Conventions: SQL Server Names

Introduction:

If you’re working in a Windows environment like me, you have to deal with 15 character limitations (at least if you care about NETBIOS resolution).  Honestly, really cramps my style, but these conventions were written with that limitation in mind.  If you’re using this for a Linux based server running MySQL or PostgreSQL, you might be able to get a little more detailed.

Multi Environment / Cluster or standalone:

At my employer ASI, we have to contend with multiple applications, multiple environments for each application and in a lot of cases clusters.  Trying to come up with a name that works for them all can be tough, but I think I’ve got a few good ones depending on your use cases.

Here is an MSSQL server naming convention:

Standalones:

  • cmp-sqlps1-01
  • cmp-sqlus1-01
  • cmp-sqlss1-01
  • cmp-sqlds1-01
  • cmp-sqlps1-02
  • cmp-sqlps2-02
  • cmp-sqlus1-02

Let’s break it down:

  • cmp = company.  Use whatever prefix you want, doesn’t need to be all letters and doesn’t need to be three, but I wouldn’t go beyond three.
  • (_) underscores = separators
  • SQL = SQL server
    • Next letter = environment
      • P = Prod
      • U = UAT
      • S = Stage
      • D = Dev
    • Next letter = Clustered or standalone
      • S = Standalone
      • C = Cluster ***More on this later
    • Following number = the number of SQL servers for this environment and application.  If you have multiple stand alone servers for “app1” this allows you to accommodate them.
  • The final number is a number to represent the application.  Doesn’t matter what the number is, but once its assigned, your other environments should fall into the same number.  I can look at the last number and quickly go “oh that’s app 20”.  Ok, maybe not quickly, but its easy enough with a matrix.

You can quickly determine if the servers function, environment, is it a cluster and the apps.

Clusters:

Cluster management resource name

  • cmp-sqlpc-01
  • cmp-sqluc-01

Cluster Node Name:

  • cmp-sqlpcn1-01
  • cmp-sqlpcn2-01
  • cmp-sqlucn1-01
  • cmp-sqlucn2-01

Cluster application resource name

SQL AAG Name / SQL Listener Name
  • cmp-sqlpcdg1-01
  • cmp-sqlpcdg2-01
  • cmp-sqlucdg1-01
  • cmp-sqlucdg2-01
SQL Traditional Failover Cluster Name
  • cmp-sqlpcdi1-01
  • cmp-sqlucdi1-01

Let’s break it down:

SQL clusters require a lot of computer accounts / names, and having a naming convention helps to keep them grouped and organized.

  • CMP-SQLPC = we should already know this based on the previous explanation.  This part tells me its a SQL production cluster.
    • If it goes straight to the application number, that means this is the cluster management resource.
    • If it goes to N1 or N2 = that’s the node of that cluster.  N1 = node 1, N2 = Node 2.
    • If it goes dg1 = Database Group 1, or SQL AAG1 and it belongs to this cluster.
    • If it goes DI1 = Database Instance 1, or SQL failover instance 1.
  • The following number of course is the application we’re assigning.

Closing:

That’s pretty much it, short and sweet.  It’s not designed to scale to massive levels, but I think it will work for most environments.

Index:

To see other naming conventions or posts about naming conventions, head over here.