
Boxstarter 2.5 Released


This week I released Boxstarter 2.5.10. Although there are no new features introduced in this release or really the last several releases, it seemed appropriate to bump the minor version after so many bug fixes, stabilizations and small experience improvement tweaks in v2.4. The focus of this release has been to provide improved isolation between the version of Chocolatey used by boxstarter from the version previously (or subsequently) installed by the user.

In this post I'd like to provide further details on why boxstarter uses its own version of Chocolatey and also share some thoughts on where I see boxstarter heading in the future.

Why is boxstarter running its own version of Chocolatey?

When Chocolatey v0.9.9 was released, it introduced a complete rewrite of chocolatey, transforming it from a set of powershell scripts and functions to a C# based executable. This fundamentally changed the way that boxstarter is able to interact with chocolatey and hook into the package installation sequence. Boxstarter essentially "Monkey Patches" chocolatey in order to intercept certain package installation commands. It needs to do this for a few reasons - the most important is to be able to detect pending reboots prior to any install operation and reboot if necessary.
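To give a flavor of what such a check involves, here is a minimal sketch of pending reboot detection, assuming only the common registry flags (boxstarter's actual Test-PendingReboot inspects more locations than these):

function Test-PendingRebootSketch {
    # component based servicing and windows update both flag pending reboots in the registry
    $cbs = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending'
    $wua = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired'
    $sm  = 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager'
    (Test-Path $cbs) -or (Test-Path $wua) -or
        ($null -ne (Get-ItemProperty $sm -ErrorAction SilentlyContinue).PendingFileRenameOperations)
}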

Boxstarter now installs the last powershell incarnation of chocolatey (v0.9.8.33) in its own install location ($env:appdata\Boxstarter\Chocolatey). However, it keeps the chocolatey lib and bin directories in the standard chocolatey locations inside $env:ProgramData\Chocolatey so that all applications installed with either version of chocolatey are kept in the same repository. Boxstarter continues to intercept choco commands and reroute them to its own chocolatey.ps1. It's important to note this only happens when installing chocolatey packages via boxstarter.
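Conceptually, the rerouting works because a powershell function defined in the session takes precedence over an exe of the same name on the PATH. A purely illustrative sketch, not boxstarter's real code (the helper and exact script path are stand-ins):

function choco {
    # check for pending reboots before any install work (hypothetical helper from the sketch above)
    if (Test-PendingRebootSketch) { Invoke-Reboot }
    # hand the original arguments to boxstarter's private chocolatey script
    & "$env:AppData\Boxstarter\Chocolatey\chocolateyInstall\chocolatey.ps1" @args
}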

Boxstarter cannot achieve the same behavior with the latest version of chocolatey without some significant changes to boxstarter and likely some additional changes contributed to Chocolatey. I did some spiking on this while on vacation a few months ago but have not found the time to dig in further since. However, it is the highest priority item lined up in boxstarter feature work, standing just behind critical bug fixes.

Windows protected your PC?

If you install boxstarter using the "click-once" based web installer, you may see a banner in windows 8/2012 and up when installing boxstarter. This is because I have had to obtain a new code signing certificate (an annual event) and it has not yet gained enough time/downloads to establish complete trust from windows. I do not know what the actual algorithm for gaining this trust is, but last year these warnings went away after just a few days. In the meantime, just click the "more info" link and then choose "run anyway."

Why fewer features?

Over the last year I have taken a bit of a break from boxstarter development in comparison to 2013 and early 2014. This is partly because my development has been much less windows-centric and I've been exploring the rich set of automation tool chains that have deeper roots in linux but have been gaining a foothold in windows as well. These include tools such as Vagrant, Packer, Chef, Docker and Ansible. This exploration has been fascinating. (Side note: another reason for less boxstarter work is my commitment to this blog. These posts consume a lot of time but I find it worth it).

I strongly believe that in order to provide sound automation tools for windows and truly understand the landscape, one must explore how these tools have been built and consumed in the linux world.  Many of these tools have some overlap with boxstarter and I have no desire to reinvent the wheel. I also think I now have a more mature vision of how to focus future boxstarter contributions.

What's ahead for boxstarter?

First and foremost - compatibility with the latest chocolatey. This will mean some possibly gut wrenching (the good kind) changes to boxstarter architecture.

Integrating with other ecosystems

I'd like to focus on taking the things that make boxstarter uniquely valuable and making them accessible to other tools. This also means cutting some boxstarter features where other tools do a better job delivering the same functionality.

For instance, I'd like to make boxstarter provisioners for Vagrant and Packer. This means people can more easily set up windows boxes and images with boxstarter while leveraging Vagrant and Packer to interact with the Hypervisor and handle image file manipulation. With that, I could deprecate the Azure and Hyper-V boxstarter modules because Vagrant is already doing that work.

I'd also like to see better interoperability between boxstarter and other configuration management tools like chef, puppet and DSC.

Decoupling Features

Boxstarter has some features which I find very useful like its streaming scheduled task output, hang detection, and ability to wrap a script with reboot resiliency. There have been many times that I would have liked to have been able to reuse these features in different contexts, but they are currently hard wired and not intended to be easily consumable on their own. I'd like to break this functionality out which would not only make these features more accessible outside of boxstarter but also improve the boxstarter architecture.

Hoping for a more active year to come

I'm unfortunately still a bit time starved but I am really hoping to knock out the chocolatey compatibility work and get started on some new boxstarter features and integrations over the next year.


Creating a Hyper-V Vagrant box from a VirtualBox vmdk or vdi image


I personally use Hyper-V as my hypervisor on windows, but I use VirtualBox on my Ubuntu work laptop, so it is convenient for me to create boxes in both formats. However, it is a major headache to do so, especially when preparing Windows images. I either have to run through creating the image twice (once for VirtualBox and again for Hyper-V) or I have to copy multi-gigabyte files across my computers and then convert them.

This post will demonstrate how to automate the conversion of a VirtualBox disk image to a Hyper-V compatible VHD and create a Vagrant .box file that can be used to fire up a functional Hyper-V VM. This can be entirely done on a machine without Hyper-V installed or enabled. I will be demonstrating on a Windows box using Powershell scripts but the same can be done on a linux box using bash.

The environment

You cannot have VirtualBox and Hyper-V comfortably coexist on the same machine. I have Hyper-V enabled on my personal windows laptop and I needed some "bare metal" to install VirtualBox. Well, as luck would have it, I was able to salvage my daughter's busted old laptop. Its video is hosed which is just fine for my purposes. I'll be running it as a headless VM host. Here is how I set it up:

  • Repaved with Windows 8.1 with Update 1 Professional, fully patched
  • Enabled RemoteDesktop
  • Installed VirtualBox, Vagrant, and Packer with Chocolatey

Of course I used Boxstarter to set it all up. After all I was not born in a barn.
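For the curious, the boxstarter package for a host like this can be tiny. A sketch (the chocolatey package ids are the public feed's; treat as illustrative):

# enable RDP so the box can run headless
Enable-RemoteDesktop
# install the toolchain from chocolatey
cinst virtualbox
cinst vagrant
cinst packer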

Creating the VirtualBox image

I am currently working on another post (hopefully due to publish this week) that will go into detail on creating Windows images with Packer and Boxstarter and cover many gory details around unattend.xml files, sysprep, and tips to get the image as small as possible. This post will not cover the creation of the image. However, if you are interested in this topic, I'm assuming you are able to create a VirtualBox VM. That is all you need to do to get started here. The guest OS does not matter. You just need to get a .VDI or .VMDK file generated, which is what automatically happens when you create a VirtualBox VM.

Converting the virtual hard disk to VHD

This is a simple one liner using VBoxManage.exe which installs with VirtualBox. You just need the path to the VirtualBox VM's hard disk.

# $baseDir is assumed to point at the root of your build output
$vboxDisk = Resolve-Path "$baseDir\output-virtualbox-iso\*.vmdk"
$hyperVDir = "$baseDir\hyper-v-output\Virtual Hard Disks"
$hyperVDisk = Join-Path $hyperVDir 'disk.vhd'
$vbox = "$env:programfiles\oracle\VirtualBox\VBoxManage.exe"
.$vbox clonehd $vboxDisk $hyperVDisk --format vhd

This uses the clonehd command and takes the location of the VirtualBox disk and the path of the vhd to be created. The vhd --format is also supplied. The conversion takes a few minutes to complete on my system.

Note that this may likely produce a vhd file that is much larger than the vmdk or vdi image from which the conversion took place. That's OK. My 3.8GB vmdk produced a 9GB vhd. This is simply because the VHD is uncompressed, and we'll take care of that in the last step. It is highly advisable that you "0 out" all unused disk space as the last step of the image creation to assist the compression. On windows images, the Sysinternals tool sdelete does a good job at that.
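For example, once sdelete is on the box:

# zero out free space on C: so the final compression can do its job
sdelete.exe /accepteula -z c: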

Laying out the Hyper-V and Vagrant metadata

So you now have a vhd file that can be sucked into any Hyper-V VM. However, just the vhd alone is not enough for Vagrant to produce a Hyper-V VM. There are 2 key bits that we need to create: a Hyper-V xml file that defines the VM metadata, and vagrant metadata json, as well as an optional Vagrantfile. These all have to be archived in a specific folder structure.

The Hyper-V XML

Prior to Windows 10, Hyper-V stored virtual machine metadata in an XML file. Vagrant expects this XML file and inspects it for several bits of metadata that it uses to create a new VM when you "vagrant up". It's still completely compatible with Windows 10 since Vagrant will simply use the Hyper-V Powershell cmdlets to create a new VM. However, it does mean that exporting a Windows 10 Hyper-V vm will no longer produce this xml file. Instead it produces two binary files in vmcx and vmrs formats that are no longer readable or editable. However, if you happen to have access to an xml vm file exported from v8.1/2012R2 or earlier, you can still use that.

Here I am using just that: an xml file exported from a 2012R2 VM. It's fairly large so I will not display it in its entirety here, but you can view it on github here. You can edit things like vm name, available ram, switches, etc. This one should be compatible out of the box on any Hyper-V host. The most important thing is to make sure that the file name of the hard disk vhd referenced in this metadata matches the filename of the vhd you produced above. Here is the disk metadata:

<controller0>
  <drive0>
    <iops_limit type="integer">0</iops_limit>
    <iops_reservation type="integer">0</iops_reservation>
    <pathname type="string">C:\dev\vagrant\Win2k12R2\.vagrant\machines\default\hyperv\disk.vhd</pathname>
    <persistent_reservations_supported type="bool">False</persistent_reservations_supported>
    <type type="string">VHD</type>
    <weight type="integer">100</weight>
  </drive0>
  <drive1>
    <pathname type="string"></pathname>
    <type type="string">NONE</type>
  </drive1>
</controller0>

Another thing to take note of here is the "subtype" of the VM. This is known to most as the "generation." Later Hyper-V versions support both generation one and generation two VMs. If you are from the future, there may be more. Typically vagrant boxes will do fine on generation 1 and will also be more portable and accessible to older hosts, so I make sure this is specified.

<properties>
  <creation_time type="bytes">j3iz2977zwE=</creation_time>
  <global_id type="string">0AA394FA-7C4A-4070-BA32-773D43B28A68</global_id>
  <highly_available type="bool">False</highly_available>
  <last_powered_off_time type="integer">130599959406707888</last_powered_off_time>
  <last_powered_on_time type="integer">130599954966442599</last_powered_on_time>
  <last_state_change_time type="integer">130599962187797355</last_state_change_time>
  <name type="string">2012R2</name>
  <notes type="string"></notes>
  <subtype type="integer">0</subtype>
  <type_id type="string">Virtual Machines</type_id>
  <version type="integer">1280</version>
</properties>

Again, it's the subtype that is of relevance here. 0 is generation 1. Also, here is where you would change the name of the vm if desired.

If you are curious about exactly how Vagrant parses this file to produce a VM, here is the powershell that does that. I don't think there is any reason why this same xml file would not work for other OSes, but you would probably want to change the vm name. In a bit, I'll show you where this file goes.

Vagrant metadata.json

This is super simple but 100% required. The packaged vagrant box file needs a metadata.json file in its root. Here are the exact contents of the file:

{
  "provider": "hyperv"
}

Optional Vagrantfile

This is not required but especially for windows boxes it may be helpful to include an embedded Vagrantfile. If you are familiar with vagrant, you know that the Vagrantfile contains all of the config data for your vagrant box. Well if you package a Vagrantfile with the base box as I will show here, the config in this file will be inherited by any Vagrantfile that consumes this box. Here is what I typically add to windows images:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure(2) do |config|
  config.vm.guest = :windows
  config.vm.communicator = "winrm"

  config.vm.provider "virtualbox" do |vb|
    vb.gui = true
    vb.memory = "1024"
  end

  config.vm.provider 'hyperv' do |hv|
    hv.ip_address_timeout = 240
  end
end

First, this is pure ruby. That likely does not matter at all, and those of you unfamiliar with ruby can hopefully grok what this file is specifying, but if you want to go crazy with ifs, whiles, and dos, go right ahead.

I find this file provides a better windows experience across the board by specifying the following:

  • As long as winrm is enabled on the base image, vagrant can talk to it on the right port.
  • VirtualBox will run the vm in a GUI console.
  • Hyper-V typically takes more than 2 minutes to become accessible and this prevents timeouts.

Laying out the files

Here is how all of these files should be structured in the end:
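It should look roughly like this, with box.xml standing in for the exported Hyper-V metadata file described above (the exact xml file name does not matter, but the vhd file name must match the pathname in that metadata):

hyper-v-output
  Virtual Machines
    box.xml
  Virtual Hard Disks
    disk.vhd
  metadata.json
  Vagrantfile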

Creating the .box file

In the end vagrant consumes a tar.gz file with a .box extension. We will use 7zip to create this file.

."$env:chocolateyInstall\tools\7za.exe" a -ttar package-hyper-v.tar hyper-v-output\*
."$env:chocolateyInstall\tools\7za.exe" a -tgzip package-hyper-v.box package-hyper-v.tar

If you have chocolatey installed, rejoice! You already have 7zip. If you do not have chocolatey, you should feel bad.

This simply creates the tar archive and then compacts it. It takes about 20 minutes on my machine.

Testing the box

Now copy the final box file to a machine that runs Hyper-V and has vagrant installed.

C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant box add test-box .\package-hyper-v.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'test-box' (v0) for provider:
    box: Unpacking necessary files from: file://C:/dev/vagrant/Win2k12R2/exp/Win-2012R2/package-hyper-v.box
    box: Progress: 100% (Rate: 98.8M/s, Estimated time remaining: --:--:--)
==> box: Successfully added box 'test-box' (v0) for 'hyperv'!
C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant init test-box
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
C:\dev\vagrant\Win2k12R2\exp\Win-2012R2> vagrant up
Bringing machine 'default' up with 'hyperv' provider...
==> default: Verifying Hyper-V is enabled...
==> default: Importing a Hyper-V instance
    default: Cloning virtual hard drive...
    default: Creating and registering the VM...
    default: Successfully imported a VM with name: 2012R2Min
==> default: Starting the machine...
==> default: Waiting for the machine to report its IP address...
    default: Timeout: 240 seconds
    default: IP: 192.168.1.11
==> default: Waiting for machine to boot. This may take a few minutes...
    default: WinRM address: 192.168.1.11:5985
    default: WinRM username: vagrant
    default: WinRM transport: plaintext
==> default: Machine booted and ready!
==> default: Preparing SMB shared folders...
    default: You will be asked for the username and password to use for the SMB
    default: folders shortly. Please use the proper username/password of your
    default: Windows account.
    default:
    default: Username: matt
    default: Password (will be hidden):
==> default: Mounting SMB shared folders...
    default: C:/dev/vagrant/Win2k12R2/exp/Win-2012R2 => /vagrant

Enjoy your boxes!

Creating windows base images using Packer and Boxstarter


I've written a couple posts on how to create vagrant boxes for windows and how to try to get the base image as small as possible. These posts explain the multiple steps of preparing the base box and then the process of converting that box to the appropriate format of your preferred hypervisor. This can take hours. I cringe whenever I have to refresh my vagrant boxes. There is considerable time involved in each step: downloading the initial ISO from microsoft, installing the windows os, installing updates, cleaning up the image, converting the image to any alternate hypervisor formats (I always make both VirtualBox and Hyper-V images), compacting the image to the vagrant .box file, testing and uploading the images.

Even if all goes smoothly, the entire process can take around 8 hours. There is a lot of babysitting done along the way. Often there are small hiccups that require starting over or backtracking.

This post shows a way that ends much of this agony and adds considerable predictability to a successful outcome. The process can still be lengthy, but there is no babysitting. Type a command, go to sleep and you wake up with new windows images ready for consumption. This post is a culmination of my previous posts on this topic but on steroids...wait...I mean a raw foods diet - it's 100% automated and repeatable.

High level tool chain overview

Here is a synopsis of what this post will walk through:

  • Downloading free evaluation ISOs (180 day) of Windows
  • Using a Packer template to load the ISO in a VirtualBox VM and customize it using a Windows Autounattend.xml file.
  • Optimizing the image with Boxstarter, installing all windows updates, and shrinking it as much as possible.
  • Outputting a vagrant .box file for creating new VirtualBox VMs with this image.
  • Sharing the box with others using Atlas.Hashicorp.com

tl;dr

If you want to quickly jump into things and forgo the rest of this post or just read it later, just read the bit about installing packer or go right to their download page and then clone my packer-templates github repo. This has the packer template and all the other artifacts needed to build a windows 2012 R2 VirtualBox based Vagrant box for windows. You can study this template and its supporting files to discover what all is involved.

Want Hyper-V?

For the 7 other folks out there that use Hyper-V to consume their devops tool artifacts, I recently blogged how to create a Hyper-V vagrant .box file from a VirtualBox hard disk (.vdi or .vmdk). This post will focus on creating the VirtualBox box file, but you can read my earlier post to easily turn it into a Hyper-V box.

Say hi to Packer

Hi Packer.

In short, Packer is a tool that assists in the automation of machine images. Many are familiar with Vagrant, made by the same outfit - Hashicorp - which has become an extremely popular platform for creating VMs. Making the spinning up of VMs easy to do and easy to share has been huge. But there has remained a somewhat uncharted frontier - automating the creation of the images that make up the VM.

So much of today's automation tooling focuses on bringing a machine from bare OS to a known desired state. However, there is still the challenge of obtaining a bare OS, and perhaps one that is not so "bare". Rather than spending so many cycles building the base OS image up to desired state on each VM creation, why not just do this once and bake it into the base image?...oh wait...we used to do that and it sucked.

We created our "golden" images. We bowed down before them and worshiped at the alter of the instant environment. And then we forgot how we built the environment and we wept.

Just like Vagrant makes building a VM a source controlled artifact, Packer does the same for VM templates, and like Vagrant, it's built on a plugin architecture, allowing for the creation of just about all the common formats including containers.

Installing Packer

Installing packer is super simple. It's just a collection of .exe files on windows, and regardless of platform, it's just an archive that needs to be extracted; then you simply add the extraction target to your path. If you are on Windows, do yourself a favor and install it using Chocolatey.

choco install packer -y

That's it. There should be nothing else needed. Especially since release 0.8.1 (the current release at the time of this post), Packer has everything needed to build windows images with no need for SSH.

Packer Templates

At the core of creating images using Packer is the packer template file. This is a single json file that is usually rather small. It orchestrates the entire image creation process which has three primary components:

  • Builders - a set of pluggable components that create the initial base image. Often this is just the bare installation media booted to a machine.
  • Provisioners - these plugins can attach to the built, minimal image above and bring it forward to a desired state.
  • Post-Processors - components that take the provisioned image and usually bring it to its final usable artifact. We will be using the vagrant post-processor here which converts the image to a vagrant box file.

Here is the template I am using to build my windows vagrant box:

{
    "builders": [{
    "type": "virtualbox-iso",
    "vboxmanage": [
      [ "modifyvm", "{{.Name}}", "--natpf1", "winrm,tcp,,55985,,5985" ],
      [ "modifyvm", "{{.Name}}", "--memory", "2048" ],
      [ "modifyvm", "{{.Name}}", "--cpus", "2" ]
    ],
    "guest_os_type": "Windows2012_64",
    "iso_url": "iso/9600.17050.WINBLUE_REFRESH.140317-1640_X64FRE_SERVER_EVAL_EN-US-IR3_SSS_X64FREE_EN-US_DV9.ISO",
    "iso_checksum": "5b5e08c490ad16b59b1d9fab0def883a",
    "iso_checksum_type": "md5",
    "communicator": "winrm",
    "winrm_username": "vagrant",
    "winrm_password": "vagrant",
    "winrm_port": "55985",
    "winrm_timeout": "5h",
    "guest_additions_mode": "disable",
    "shutdown_command": "C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown",
    "shutdown_timeout": "15m",
    "floppy_files": [
      "answer_files/2012_r2/Autounattend.xml",
      "scripts/postunattend.xml",
      "scripts/boxstarter.ps1",
      "scripts/package.ps1"
    ]
  }],
    "post-processors": [
    {
      "type": "vagrant",
      "keep_input_artifact": true,
      "output": "windows2012r2min-{{.Provider}}.box",
      "vagrantfile_template": "vagrantfile-windows.template"
    }
  ]
}

As stated in the tl;dr above, you can view the entire repository here.

Let's walk through this.

First you will notice the template includes a builder and a post-processor as mentioned above. It does not use a provisioner. Those are optional and we'll be doing most of our "provisioning" in the builder. I explain more about why later. Note that one can include multiple builders, provisioners, and post-processors. This one is pretty simple but you can find lots of more complex examples online.

  • type - This uses the virtualbox-iso builder which takes an ISO install file and produces VirtualBox .ovf and .vmdk files. (see packer documentation for information on the other built in builders).
  • vboxmanage - config sent directly to Virtualbox:
    • natpf1 - very helpful if building on a windows host where the winrm ports are already active. Allows you to define port forwarding to winrm port.
    • memory and cpus - these settings will speed things up a bit.
  • iso_url: The url of the iso file to load. This is the windows install media. I downloaded an eval version of server 2012 R2 (discussed below). This can be either an http url that points to the iso online or an absolute file path or file path relative to the current directory. I keep the iso file in an iso directory but because it is so large I have added that to my .gitignore.
  • iso_checksum and iso_checksum_type: These serve to validate the iso file contents. More on how to get these values later.
  • winrm values: as of version 0.8.0, packer comes with a winrm communicator. This means no need to install an SSH server. I use the vagrant username and password because this will be a vagrant box and those are the default credentials used by vagrant. Note that above in the vboxmanage settings I forward guest port 5985 to host port 55985. 5985 is the default http winrm port, so I need to specify 55985 as the winrm port packer should use. The reason I am using a non default port is because the host I am using has winrm enabled and listening on 5985. If you are not on a windows host, you can probably just use the default port, but that would conflict on my host. I specify a 5 hour timeout for winrm. This is the amount of time that packer will wait at most for winrm to be available. This is very important and I will discuss why later.
  • guest_additions_mode - By default, the virtualbox-iso builder will upload the latest virtualbox guest additions to the box. For my purposes I do not need this and it just adds extra time, takes more space and I have also had intermittent errors while the file is uploaded which hoses the entire build.
  • shutdown_command: This is the command to use to shutdown the machine. Different operating systems may require different commands. I am invoking sysprep which will shutdown the machine when it ends. Sysprep, when called with /generalize like I am doing here, will strip the machine of security identifiers, machine name and other elements that make it unique. This is particularly useful if you plan to use the image in an environment where many machines may be provisioned from this template and they need to interact with one another. Without doing this, all machines would share the same name and user SIDs, which could cause problems especially in domain scenarios.
  • floppy_files: an array of files to be added to a floppy drive and made accessible to the machine. Here these include an answer file, and other files to be used throughout the image preparation.

Obtaining evaluation copies of windows

Many do not know that you can obtain free and fully functioning versions of all the latest versions of windows from Microsoft. These are "evaluation" copies and only last 180 days. However, if you only plan to use the images for testing purposes like I do, that should be just fine.

You can get these from https://msdn.microsoft.com/en-us/evalcenter.aspx. You will just need to regenerate the image at least every 180 days. You are not bound to purchase after 180 days and can easily just download a new ISO.

Finding the iso checksums

As shown above, you will need to provide the checksum and checksum type for the iso file you are using. You can do this using a utility called fciv.exe. You can install it from chocolatey:

choco install fciv -y

Now you can call fciv and pass it the path to any file, and it will produce the checksum of that file in md5 format by default, but you can pass a different format if desired.
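For example, assuming a hypothetical local path for the downloaded iso:

fciv.exe C:\iso\server2012r2.iso
# prints a header followed by a line like:
# 5b5e08c490ad16b59b1d9fab0def883a c:\iso\server2012r2.iso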

Significance of winrm availability and the winrm_timeout

The basic flow of the virtualbox packer build is as follows:

  1. Boot the machine with the ISO
  2. Wait for winrm to be accessible
  3. As soon as it is accessible via winrm, shut down the box
  4. Start the provisioners
  5. When the provisioners are complete, run the post-processors

In a perfect world I would have the windows install process enable winrm early on in my setup. This would result in a machine restart and then invoke provisioners that would perform all of my image bootstrap scripts. However, there are problems with that sequence on windows, and the primary culprit is windows updates. I want to have all critical updates installed ASAP. However, windows updates cannot be easily run via a remote session, which is how provisioners operate, and once updates complete, you will want to reboot the box, which can cause the provisioners to fail.

Instead, I install all updates during the initial boot process. This allows me to freely reboot the machine as many times as I need since packer is just waiting for winrm and will not touch the box until that is available so I make sure not to enable winrm until the very end of my bootstrap scripts.

Also these scripts run locally so windows update installs can be performed without issue.

Bootstrapping the image with an answer file

Since we are feeding virtualbox an install iso, if we did nothing else, it would prompt the user for all of the typical windows setup options like locale, admin password, disk partition, etc. Obviously this is all meant to be scripted and unattended, and those prompts would just hang the install. This is what answer files are for. Windows uses answer files to automate the setup process. There are all sorts of options one can provide to customize this process.

My answer file is located in my repo here. Note that it is named AutoUnattend.xml and added to the floppy drive of the booted machine. Windows will load any file named AutoUnattend.xml in the floppy drive and use that file as the answer file. I am not going to go through every line here, but do know there are additional options beyond what I have that one can specify. I will cover some of the more important parts.

<UserAccounts>
  <AdministratorPassword>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </AdministratorPassword>
  <LocalAccounts>
    <LocalAccount wcm:action="add">
      <Password>
        <Value>vagrant</Value>
        <PlainText>true</PlainText>
      </Password>
      <Group>administrators</Group>
      <DisplayName>Vagrant</DisplayName>
      <Name>vagrant</Name>
      <Description>Vagrant User</Description>
    </LocalAccount>
  </LocalAccounts>
</UserAccounts>
<AutoLogon>
  <Password>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </Password>
  <Enabled>true</Enabled>
  <Username>vagrant</Username>
</AutoLogon>

This creates the administrator user vagrant with the password vagrant. This is the default vagrant username and password, so setting up this admin user will make vagrant box setups easier. This also allows the initial boot to auto logon as this user instead of prompting.

<FirstLogonCommands>
  <SynchronousCommand wcm:action="add">
    <CommandLine>cmd.exe /c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -File a:\boxstarter.ps1</CommandLine>
    <Order>1</Order>
  </SynchronousCommand>
</FirstLogonCommands>

You can specify multiple SynchronousCommand elements containing commands that should be run when the user very first logs on. This file is already difficult to read, so I find it easier to specify just one powershell file to run and have that file orchestrate the entire bootstrapping.

This file boxstarter.ps1 is another file in my scripts directory of the repo that I add to the virtualbox floppy. We will look closely at that file in just a bit.

<settings pass="specialize">
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-ServerManager-SvrMgrNc" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <DoNotOpenServerManagerAtLogon>true</DoNotOpenServerManagerAtLogon>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-IE-ESC" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <IEHardenAdmin>false</IEHardenAdmin>
    <IEHardenUser>false</IEHardenUser>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-OutOfBoxExperience" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <DoNotOpenInitialConfigurationTasksAtLogon>true</DoNotOpenInitialConfigurationTasksAtLogon>
  </component>
</settings>

In short, this customizes the image in such a way that will make it far less likely to cause you to kill yourself or others while actually using the image. So you are totally gonna want to include this.

This prevents the "server manager" from opening on startup and allows IE to actually open web pages without the need to ceremoniously click thousands of times.

A boxstarter bootstrapper

So given the above answer file, windows will install and reboot and then the vagrant user will auto logon. Then a powershell session will invoke boxstarter.ps1. Here it is:

# Stop winlogon from automatically logging the vagrant user back on;
# see the note below on why this must be removed before boxstarter runs
$WinlogonPath = "HKLM:\Software\Microsoft\Windows NT\CurrentVersion\Winlogon"
Remove-ItemProperty -Path $WinlogonPath -Name AutoAdminLogon
Remove-ItemProperty -Path $WinlogonPath -Name DefaultUserName

# Download and install the latest boxstarter modules
iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/mwrock/boxstarter/master/BuildScripts/bootstrapper.ps1'))
Get-Boxstarter -Force

$secpasswd = ConvertTo-SecureString "vagrant" -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential ("vagrant", $secpasswd)

# Run the real bootstrapping script (package.ps1) under boxstarter
Import-Module $env:appdata\boxstarter\boxstarter.chocolatey\boxstarter.chocolatey.psd1
Install-BoxstarterPackage -PackageName a:\package.ps1 -Credential $cred

This downloads the latest version of boxstarter, installs the boxstarter powershell modules, and finally installs package.ps1 via a boxstarter install run. You can visit boxstarter.org for more information regarding all the details of what boxstarter does. The key feature is that it can run a powershell script and handle machine reboots by making sure the user is automatically logged back in and that the script (package.ps1 here) is restarted.

Boxstarter also exposes many commands that can tweak the windows UI, enable/disable certain windows options and also install windows updates.

Note the winlogon registry edit at the beginning of the boxstarter bootstrapper. Without this, boxstarter will not turn off the autologin when it completes. This is only necessary if running boxstarter from an autologin session like this one. Boxstarter takes note of the current autologin settings before it begins and restores those once it finishes. So in this unique case it would restore to an autologin state.

Here is package.ps1, the meat of our bootstrapping:

Enable-RemoteDesktop
Set-NetFirewallRule -Name RemoteDesktop-UserMode-In-TCP -Enabled True

Write-BoxstarterMessage "Removing page file"
$pageFileMemoryKey = "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
Set-ItemProperty -Path $pageFileMemoryKey -Name PagingFiles -Value ""

Update-ExecutionPolicy -Policy Unrestricted

Write-BoxstarterMessage "Removing unused features..."
Remove-WindowsFeature -Name 'Powershell-ISE'
Get-WindowsFeature | 
? { $_.InstallState -eq 'Available' } | 
Uninstall-WindowsFeature -Remove

Install-WindowsUpdate -AcceptEula
if(Test-PendingReboot){ Invoke-Reboot }

Write-BoxstarterMessage "Cleaning SxS..."
Dism.exe /online /Cleanup-Image /StartComponentCleanup /ResetBase

@(
    "$env:localappdata\Nuget",
    "$env:localappdata\temp\*",
    "$env:windir\logs",
    "$env:windir\panther",
    "$env:windir\temp\*",
    "$env:windir\winsxs\manifestcache"
) | % {
        if(Test-Path $_) {
            Write-BoxstarterMessage "Removing $_"
            Takeown /d Y /R /f $_
            Icacls $_ /GRANT:r administrators:F /T /c /q  2>&1 | Out-Null
            Remove-Item $_ -Recurse -Force -ErrorAction SilentlyContinue | Out-Null
        }
    }

Write-BoxstarterMessage "defragging..."
Optimize-Volume -DriveLetter C

Write-BoxstarterMessage "0ing out empty space..."
wget http://download.sysinternals.com/files/SDelete.zip -OutFile sdelete.zip
[System.Reflection.Assembly]::LoadWithPartialName("System.IO.Compression.FileSystem")
[System.IO.Compression.ZipFile]::ExtractToDirectory("sdelete.zip", ".") 
./sdelete.exe /accepteula -z c:

mkdir C:\Windows\Panther\Unattend
copy-item a:\postunattend.xml C:\Windows\Panther\Unattend\unattend.xml

Write-BoxstarterMessage "Recreate [agefile after sysprep"
$System = GWMI Win32_ComputerSystem -EnableAllPrivileges
$System.AutomaticManagedPagefile = $true
$System.Put()

Write-BoxstarterMessage "Setting up winrm"
Set-NetFirewallRule -Name WINRM-HTTP-In-TCP-PUBLIC -RemoteAddress Any
Enable-WSManCredSSP -Force -Role Server

Enable-PSRemoting -Force -SkipNetworkProfileCheck
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

The goal of this script is to fully patch the image and then get it as small as possible. Here is the breakdown:

  • Enable Remote Desktop. We do this with a bit of shame but so be it.
  • Remove the page file. This frees up about a GB of space. At the tail end of the script we turn it back on which means the page file will restore itself the first time this image is run after this entire build.
  • Update the powershell execution policy to unrestricted because who likes restrictions? You can set this to what you are comfortable with in your environment but if you do nothing, powershell can be painful.
  • Remove all windows features that are not enabled. This is a new feature in 2012 R2 called feature on demand and can save considerable space.
  • Install all critical windows updates. There are about 118 at the time of this writing.
  • Restart the machine if reboots are pending and the first time this runs, they will be.
  • Run the DISM cleanup that cleans the WinSxS folder of rollback files for all the installed updates. Again this is new in 2012 R2 and can save quite a bit of space. Warning: it also takes a long time but not nearly as long as the updates themselves.
  • Remove some of the random cruft. This is not a lot but why not get rid of what we can?
  • Defragment the hard drive and 0 out empty space. This will allow the final act of compression to do a much better job compressing the disk.
  • Lastly, and it is important this is last, enable winrm. Remember that once winrm is accessible, packer will run the shutdown command and in our case here that is the sysprep command:
C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown

This will cause a second answer file to fire after the machine boots next. That will not happen until after this image is built, likely just after a "vagrant up" of the image. That file can be found here, and it's much smaller than the initial answer file that drove the windows install. This second unattend file mainly ensures that the user will not have to reset the admin password at initial startup.

Packaging the vagrant file

So now we have an image and the builder's job is done. On my machine this all takes just under 5 hours. Your mileage may vary, but make sure that your winrm_timeout is set appropriately. Otherwise, if the timeout is less than the length of the build, the entire build will fail and be forever lost.

The vagrant post-processor is what generates the vagrant box. It takes the artifacts of the builder and packages them up. There are other post-processors available, you can string several together, and you can create your own custom post-processors. You can make your own provisioners and builders too.

One post-processor worth looking at is the atlas post-processor, which you can add after the vagrant post-processor. It will upload your vagrant box to atlas, where you can share it with others.
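A sketch of what that chaining might look like in the template (the atlas artifact name is hypothetical and the option list is from memory, so check the packer docs before relying on it):

"post-processors": [
  [
    {
      "type": "vagrant",
      "output": "windows2012r2min-{{.Provider}}.box"
    },
    {
      "type": "atlas",
      "artifact": "mwrock/windows2012r2min",
      "artifact_type": "vagrant.box",
      "metadata": { "provider": "virtualbox" }
    }
  ]
]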

Next steps

I have just gotten to the point where this is all working. So this is rough around the edges but it does work. There are several improvements to be made here:

  • Make a boxstarter provisioner for packer. This could run after the initial os install in a provisioner and AFTER winrm is enabled. I would have to leverage boxstarter's remoting functionality so that it does not fail when interacting with windows updates over a winrm session. One key advantage is that the boxstarter output would bubble up to the console running packer giving much more visibility to what is happening with the build as it is running. As it stands now, all output can only be seen in the virtualbox GUI which will not appear in headless environments like in an Atlas build.
  • As stated at the beginning of this post, I create both virtualbox and Hyper-V images. The Hyper-V conversion could use its own post-processor. Now I simply run this in a separate powershell command.
  • Use variables to make the scripts reusable. I'm just generating a single box now but I will definitely be generating more: client SKUs, server core, nano, etc. Packer allows you to specify variables and better templatize the build.

Thanks Matt Fellows and Dylan Meissner

I started looking into all of this a couple months ago and then got sidetracked until last week. In my initial investigation in May, packer 0.8.1 was not released and there was no winrm support out of the box. Matt and Dylan both have contributed to those efforts and also provided really helpful pointers as I was having issues getting things working right.

Detecting a hung windows process


A couple years ago, I was adding remoting capabilities to boxstarter so that one could set up any machine in their network and not just their local environment. As part of this effort, I ran all MSI installs via a scheduled task because some installers (mainly from Microsoft) would pull down bits via windows update, and that always fails when run via a remote session. Sure, the vast majority of MSI installers will not need a scheduled task, but it was just easier to run all of them through a scheduled task instead of maintaining a whitelist.
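The idea, very roughly, looks like this (a minimal sketch, not boxstarter's actual implementation; the task name, msi path, and $password variable are made up):

# create and fire a one-shot task that runs the installer in a local session
schtasks /CREATE /TN 'boxstarter-task' /SC ONCE /ST 00:00 /F `
    /RL HIGHEST /RU "$env:USERNAME" /RP $password `
    /TR 'msiexec /i C:\temp\installer.msi /quiet'
schtasks /RUN /TN 'boxstarter-task'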

I have run across a couple installers that were not friendly to being run silently in the background. They would throw up some dialog prompting the user for input and then hang forever until I forcibly killed them. I wanted a way to automatically tell when "nothing" was happening, as opposed to an installer actually chugging away doing stuff.

It's all in the memory usage

This has to be more than a simple timeout based solution because there are legitimate installs that can take hours. One that jumps to mind rhymes with shequel shurver. We have to be able to tell if the process is actually doing anything without the luxury of human eyes staring at a progress bar. The solution I came up with and am going to demonstrate here has performed very reliably for me and uses a couple memory counters to track when a process is truly idle for long periods of time.

Before I go into further detail, lets look at the code that polls the current memory usage of a process:

function Get-ChildProcessMemoryUsage {
    param(
        $ID=$PID,
        [int]$res=0
    )
    # sum PrivateMemorySize + WorkingSet for every child process, recursively
    Get-WmiObject -Class Win32_Process -Filter "ParentProcessID=$ID" | % { 
        if($_.ProcessID -ne $null) {
            $proc = Get-Process -ID $_.ProcessID -ErrorAction SilentlyContinue
            $res += $proc.PrivateMemorySize + $proc.WorkingSet
            # recurse into this child's own children
            $res += (Get-ChildProcessMemoryUsage $_.ProcessID)
        }
    }
    $res
}

Private Memory and Working Set

You will note that the memory usage "snapshot" I take is the sum of PrivateMemorySize and WorkingSet. Such a sum on its own really makes no sense, so let's see what that means. Private memory is the total number of bytes that the process is occupying in windows, and WorkingSet is the number of bytes resident in physical memory and not paged to disk. The working set largely overlaps the private memory it reports on, so why add them?

I don't really care at all about determining how much memory the process is consuming at a single point in time. What I do care about is how much memory is ACTIVE relative to a point in time just before or after each of these snapshots. The idea is that if these numbers are changing, things are happening. I tried JUST looking at private memory and JUST looking at working set, and I got lots of false positive hangs - the counts would remain the same over fairly long periods of time but the install was progressing.

So after experimenting with several different memory counters (there are a few others), I found that if I looked at BOTH of these counts together, I could more reliably use them to detect when a process is "stuck."
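For example, a single point-in-time reading for one process is just (both counters hang off the object Get-Process returns):

$p = Get-Process -Id $PID
# bytes; the number only means something relative to surrounding readings
$p.PrivateMemorySize + $p.WorkingSet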

Watching the entire tree

You will notice in the above code that the Get-ChildProcessMemoryUsage function recursively tallies up the memory usage counts for the process in question and all child processes and their children, etc. This is because the initial installer process tracked by my program often launches one or more subprocesses that do various bits of work. If I only watch the initial root process, I will get false hang detections again because that process may do nothing for long periods of time as it waits on its child processes.

Measuring the gaps between changes

So we have seen how to get the individual snapshots of memory being used by a tree of processes. As stated before, these are only useful in relation to other snapshots. If the snapshots fluctuate frequently, then we believe things are happening and we should wait. However, if we get a long string where nothing happens, we have reason to believe we are stuck. The longer this string, the more likely our stuckness is a true hang.

Here are three consecutive snapshots where the memory counts of the processes do not change:

VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 0
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 1
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 2

These are 3 snapshots of the child processes of the 2012 Sql Installer on an Azure instance being installed by Boxstarter via Powershell remoting. The memory usage is the same for all processes so if this persists, we are likely in a hung state.

So what is long enough?

Good question!

Having played with and monitored this for quite some time, I have come up with 120 seconds as my threshold - that's 2 minutes for those not paying attention. I think quite often that number can be smaller, but I am willing to err conservatively here. Here is the code that looks for a run of inactivity:

function Test-TaskTimeout($waitProc, $idleTimeout) {
    if($memUsageStack -eq $null){
        # script-scoped stack holding the current run of identical snapshots
        $script:memUsageStack=New-Object -TypeName System.Collections.Stack
    }
    if($idleTimeout -gt 0){
        $lastMemUsageCount=Get-ChildProcessMemoryUsage $waitProc.ID
        $memUsageStack.Push($lastMemUsageCount)
        # any variation from the previous snapshots resets the run
        if($lastMemUsageCount -eq 0 -or (($memUsageStack.ToArray() | ? { $_ -ne $lastMemUsageCount }) -ne $null)){
            $memUsageStack.Clear()
        }
        # a run longer than the timeout means we consider the task hung
        if($memUsageStack.Count -gt $idleTimeout){
            KillTree $waitProc.ID
            throw "TASK:`r`n$command`r`n`r`nIs likely in a hung state."
        }
    }
    Start-Sleep -Seconds 1
}

This creates a stack of these memory snapshots. If the last snapshot is identical to the one just captured, we add the one just captured to the stack. We keep doing this until one of two things happens (see the usage sketch after the list):

  1. A snapshot is captured that varies from the last one recorded in the stack. At this point we clear the stack and continue.
  2. The number of snapshots in the stack exceeds our limit. Here we throw an error - we believe we are hung.
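Here is a hypothetical usage loop; KillTree and $command are assumed to be defined in the surrounding script, as they are in boxstarter:

$proc = Start-Process 'setup.exe' -ArgumentList '/quiet' -PassThru
while(!$proc.HasExited) {
    # throws if memory counts stay flat for more than 120 consecutive seconds
    Test-TaskTimeout $proc 120
}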

I hope you have found this information to be helpful and I hope that none of your processes become hung!

A dirty check for Chef cookbooks: testing everything that has changed and only when it is changed


I was recently pinged by a #cheffriend of mine, Alex Vinyar. A few months back, I sent him a bit of code that I wrote at CenturyLink Cloud that runs on our build agents to determine which cookbooks to test after a cookbook commit is pushed to source control. He just mentioned to me that he has gotten some positive feedback on that code so I thought I'd blog on the details of that for others who may be interested. This can also be considered a follow up to my post from almost a year ago on cookbook dependency management where I shared some theory I was pondering at the time on this subject. This post covers how I followed up on that theory with working code.

I'm going to cover two key strategies:

  1. Testing all cookbooks that depend on a changed cookbook
  2. Ensuring that once a cookbook is tested and succeeds, it is never tested again as long as it and its dependencies do not change

I'll share the same code I shared with Alex (and more) in a bit.

Not enough to test just the cookbook changed

Let's consider this scenario: You manage infrastructure for a complex distributed application. You maintain about a baker's dozen cookbooks - see what I did there?...baker...chef...look at me go girl! You have just committed and pushed a change to one of your lower level library cookbooks that exposes functions used by several of your other cookbooks. We'd expect your pipeline to run tests on that library cookbook, but is that enough? What if your upstream cookbooks that depend on the library cookbook "blow up" when they converge with the new changes? This is absolutely possible and not just for library cookbooks.

I wanted a way for a cookbook commit to trigger not only its own chefspecs and test-kitchen tests but also the tests of other cookbooks that depend on it.

Kitchen tests provide slow but valuable feedback, don't run more than needed

Here is another scenario: You are now testing the entire dependency tree touched by a change...yay! But failure happens and your build fails. However the failure may not have occurred at the root of your tree (the cookbook that actually changed). Here are some possibilities:

  1. Maybe there was some transient network glitch that caused an upstream cookbook to fail
  2. Gem dependency changes outside of your own managed cookbook collection (you know - the dark side of semver - minor build bumps that are really "major")
  3. Your build has batched more than one root cookbook change. One breaks but the others pass

So an hour's worth (or more) of testing 10 cookbooks results in a couple broken kitchen tests. You make some fixes, push again and have to wait for the entire set to build/test all over again.

I have done this before. Just look at my domain name!

However, many of these tests may be unnecessary now. Some of the tests that succeeded in the failed batch may be unaffected by the fix.

Finding dependencies

So here is the class we use at CenturyLink Cloud to determine when one cookbook has been affected by a recent change. This is what I sent to Alex. I'll walk you through some of it later.

require 'berkshelf'
require 'chef'
require 'digest/sha1'
require 'json'

module Promote
  class Cookbook
    def initialize(cookbook_name, config)
      @name = cookbook_name
      @config = config
      @metadata_path = File.join(path, 'metadata.rb')
      @berks_path = File.join(path, "Berksfile")
    end

    # List of all dependent cookbooks including transitive dependencies
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def dependencies
      @dependencies ||= get_synced_dependencies_from_lockfile
    end

    # List of cookbooks dependencies explicitly specified in metadata.rb
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def metadata_dependencies
      metadata.dependencies
    end

    def version(version=nil)
      metadata.version
    end

    def version=(version)
      version_line = raw_metadata[/^\s*version\s.*$/]
      current_version = version_line[/('|").*("|')/].gsub(/('|")/,"")
      if current_version != version.to_s
        new_version_line = version_line.gsub(current_version, version.to_s)
        new_content = raw_metadata.gsub(version_line, new_version_line)
        save_metadata(new_content)
      end
      version
    end

    # Stamp the metadata with a SHA1 of the last git commit hash in a comment
    #
    # @param [String] SHA1 hash of commit to stamp
    def stamp_commit(commit)
      commit_line = "#sha1 '#{commit}'"
      new_content = raw_metadata.gsub(/#sha1.*$/, commit_line)
      if new_content[/#sha1.*$/].nil?
        new_content += "\n#{commit_line}"
      end
      save_metadata(new_content)
    end

    def raw_metadata
      @raw_metadata ||= File.read(@metadata_path)
    end

    def metadata
      @metadata ||=get_metadata_from_file
    end

    def path
      File.join(@config.cookbook_directory, @name)
    end

    # Performs a berks install if never run, otherwise a berks update
    def sync_berksfile(update=false)
      return if berksfile.nil?

      update = false unless File.exist?("#{@berks_path}.lock")
      update ? berksfile_update : berksfile.install
    end

    # Determine if our dependency tree has changed after a berks update
    #
    # @return [TrueClass,FalseClass]
    def dependencies_changed_after_update?
      old_deps = dependencies
      sync_berksfile(true)
      new_deps = get_dependencies_from_lockfile

      # no need for further inspection if the
      # number of dependencies has changed
      return true unless old_deps.count == new_deps.count

      old_deps.each do |k,v|
        if k != @name
          return true unless new_deps[k] == v
        end
      end

      return false
    end

    # Hash of all dependencies. Guaranteed to change if
    # dependencies or their versions change
    #
    # @param [environment_cookbook_name] Name of environment cookbook
    # @return [String] The generated hash
    def dependency_hash(environment_cookbook_name)
      # sync up with the latest on the server
      sync_latest_app_cookbooks(environment_cookbook_name)
      
      hash_src = ''
      dependencies.each do | k,v |
        hash_src << "#{k}::#{v}::"
      end
      Digest::SHA1.hexdigest(hash_src)
    end

    # sync our environment cookbook dependencies
    # with the latest on the server
    #
    # @return [Hash] the list of cookbooks updated
    def sync_latest_app_cookbooks(environment_cookbook_name)
      result = {}

      #Get a list of all the latest cookbook versions on the server
      latest_server_cookbooks = chef_server_cookbooks(1)
      
      env_cookbook = Cookbook.new(environment_cookbook_name, @config)
      
      # iterate each cookbook in our environment cookbook
      env_cookbook.metadata_dependencies.each do |key|
        cb_key = key[0] # cookbook name

        # skip if it's this cookbook or if it's not on the server
        next if cb_key == @name || !latest_server_cookbooks.has_key?(cb_key)
        latest_version = latest_server_cookbooks[cb_key]['versions'][0]['version']
        
        # if the server has a later version than this cookbook
        # we do a berks update to true up with the server
        if dependencies.keys.include?(cb_key) &&
          latest_version > dependencies[cb_key].to_s
          berksfile_update(cb_key)
          result[cb_key] = latest_version
        end
      end

      # return the updated cookbooks
      result
    end

    private

    def chef_server_cookbooks(versions = 'all')
      Chef::Config[:ssl_verify_mode] = :verify_none
      rest = Chef::REST.new(
        @config.chef_server_url, 
        @config.node_name, 
        @config.client_key
      )
      rest.get_rest("/cookbooks?num_versions=#{versions}")
    end

    def berksfile
      @berksfile ||= get_berksfile
    end

    def get_berksfile
      return unless File.exist?(@berks_path)
      Berkshelf.set_format :null
      Berkshelf::Berksfile.from_file(@berks_path)
    end

    def save_metadata(content)
      File.open(@metadata_path, 'w') do |out|
        out << content
      end

      # force subsequent usages of metadata to refresh from file
      @metadata = nil
      @raw_metadata = nil
    end

    def get_metadata_from_file
      metadata = Chef::Cookbook::Metadata.new
      metadata.from_file(@metadata_path)
      metadata
    end

    def get_synced_dependencies_from_lockfile
      # install to make sure we have a lock file
      berksfile.install

      get_dependencies_from_lockfile
    end

    # get the version of every dependent cookbook from berkshelf
    def get_dependencies_from_lockfile
      berks_deps = berksfile.list
      deps = {}

      berks_deps.each {|dep| deps[dep.name] = dep.locked_version}
      deps
    end

    def berksfile_update(cookbooks = nil)
      cookbooks.nil? ? berksfile.update : berksfile.update(cookbooks)
      @dependencies = nil
    end
  end
end

This class is part of a larger "promote" gem that we use to promote cookbooks along our delivery pipeline. The class represents an individual cookbook and can be instantiated given a cookbook name and a config.

The config class (not shown here) is an extremely basic data class that encapsulates the properties of the chef environment.
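
For illustration, here is a minimal sketch of what that config class might look like; the property names are assumptions inferred from how the Cookbook class above uses them:

class Config
  # properties consumed by the Cookbook class shown above
  attr_accessor :cookbook_directory, :chef_server_url,
                :node_name, :client_key
end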

Some assumptions

This code will only work if a couple of conditions are true. One is fairly common and the other not so much. Even if these conditions do not exist in your environment, you should be able to grok the logic.

Using Berkshelf

You are using berkshelf to manage your dependencies. This class uses the berkshelf API to find dependencies of other cookbooks.

Using the Environment Cookbook Pattern

We use the environment cookbook pattern to govern which cookbooks are considered part of a deployable application. You can read more about this pattern here, but an environment cookbook is very simple. It is just a metadata.rb and a Berksfile.lock. The metadata file includes a depends entry for each of my cookbooks responsible for infrastructure used in the app. The Berksfile.lock file uses that to build the entire tree of not only "my own" cookbooks but all third party cookbooks as well.
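
For illustration, a hypothetical environment cookbook metadata.rb might look like this (cookbook names borrowed from the LKG example below):

name 'my_app_environment'
version '1.0.0'

depends 'p_rabbitmq'
depends 'p_elasticsearch'
depends 'p_haproxy'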

The dependency hash

For the purposes of this post, the dependency_hash method of the above class is the most crucial. This hash represents all cookbook dependencies and their versions at the current point in time. If I were to persist this hash (more on that in a bit) and later regenerate it, getting a hash that is at all different from the persisted one, I know something has changed. I don't really care where. Any change at all anywhere in a cookbook's dependency tree means it is at risk of breaking and should be tested.
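
In other words, the eventual check is trivial; here is a minimal sketch, assuming the persisted hash has already been read back into last_known_good_hash (persistence is covered next):

current_hash = cookbook.dependency_hash('my_app_environment')
needs_testing = current_hash != last_known_good_hash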

The "Last Known Good" environment

Every time we test a cookbook and it passes, we store its current dependency hash in a special chef environment called LKG (last known good). This is really a synthetic environment with the same structure as any chef environment but the version constraints may look a bit strange:

{
  name: "LKG_platform",
  chef_type: "environment",
  json_class: "Chef::Environment",
  cookbook_versions: {
    p_rabbitmq: "1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba",
    p_elasticsearch: "1.1.79.2dd2472fbd47bf32db4ad83be45acb810b788a90",
    p_haproxy: "1.1.47.75dfc55b772fa0eb08b6fbdee02f40849f37ca77",
    p_win: "1.1.121.fa2a465e8b964a1c0752a8b59b4a5c1db9074ee5",
    p_couchbase: "1.1.92.325a02ab22743eccc17756539d3fac57ce5ef952",
    elasticsearch_sidec: "0.1.7.baab30d04181368cd7241483d9d86d2dffe730db",
    p_orchestration: "0.1.48.3de2368d660a4c7dfd6dba2bc2c61beeea337f70"
  }
}

We update this synthetic environment any time a cookbook test runs and passes. Here is a snippet of our Rakefile that does this:

cookbook = Promote::Cookbook.new(cookbook_name, config)
environment = Promote::EnvironmentFile.new("LKG", config)
lkg_versions = environment.cookbook_versions
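# 'hash' is assumed to be the dependency hash computed earlier in the
# task via cookbook.dependency_hash(environment_cookbook_name)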
cookbook_version = "#{cookbook.version.to_s}.#{hash}"

lkg_versions[cookbook_name] = cookbook_version

environment.write_cookbook_versions(lkg_versions)

This is just using the cookbook and environment classes of our Promote gem to write out the new versions. Following this, the environment, like all of our environments, is committed to source control.

The dirty check

We have now seen how we generate a dependency hash and how we persist that on successful tests. Here is where we actually compare our current dependency hash and the last known hash. We need to determine, before running integration tests on a given cookbook, if it is any different than the last time we successfully tested it. If we previously tested it and neither it nor any of its dependencies (no matter how deep) have changed, we really do not need or want to test it again.

def is_dirty(
  source_environment,
  lkg_environment,
  cookbook_name,
  dependency_hash = nil
)
  # Get the environment of the last good tests
  lkg_env_file = EnvironmentFile.new(lkg_environment, config)

  # if this cookbook is not in the last good tests, we are dirty
  unless lkg_env_file.cookbook_versions.has_key?(cookbook_name)
    return true
  end

  # get the current environment under test
  source_env_file = EnvironmentFile.new(source_environment, config)
  
  # get the version of this cookbook from the last good tests
  # it will look like 1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba
  # this is the normal version followed by the dependency hash tested
  lkg_parts = lkg_env_file.cookbook_versions[cookbook_name].split('.')

  # Get the "plain" cookbook version
  lkg_v = lkg_parts[0..2].join('.')
  
  # the dependency hash last tested
  lkg_hash = lkg_parts[3]
  
  # we are dirty if our plain version or dependency hash has changed
  return source_env_file.cookbook_versions[cookbook_name] != lkg_v || 
    dependency_hash != lkg_hash
end

This method takes the environment currently under test, our LKG environment that has all the versions and their dependency hashes last tested, the cookbook to be tested and its current dependency hash. It returns whether or not our cookbook is dirty and needs testing.
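
To tie it together, here is a hypothetical call site in a Rake task; run_integration_tests is illustrative and not part of the gem:

hash = cookbook.dependency_hash(environment_cookbook_name)
if is_dirty('test', 'LKG', cookbook_name, hash)
  run_integration_tests(cookbook_name)
else
  puts "#{cookbook_name} is clean - skipping tests"
end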

Prior to calling this code, our build pipeline edits our "test" environment (the "source_environment" above). This environment holds the cookbook versions of all cookbooks being tested, including the last batch of commits. We use a custom kitchen provisioner plugin to ensure Test-Kitchen uses these versions and not just the versions in each cookbook's Berksfile lock file. As I discussed in my dependency post from last year, it's important that all changes are tested together.

So there you have it. Now Hurry up and Wait for your dirty cookbooks to run their tests!

I'm so excited. I'm about to join Chef and I think I like it!

A little over a year ago, a sparkly devops princess named Jennifer Davis came to my workplace at CenturyLink Cloud and taught me a thing or two about this thing she called Chef. I had been heads down for the previous 3 years trying to automate windows. While wandering in the wilderness of windows automation I had previously bumped into Chocolatey where I met my good friend Rob Reynolds who had told me stories about these tools called Puppet, Chef, and Vagrant. Those were the toys that the good little girls and boys on the other side of the mountain (i.e. linux land) used for their automation. Those tools sounded so awesome, but occupied with my day job at Microsoft that was not centered around automation, and with my nights and weekends developing Boxstarter, I had yet to dabble in these tools myself.

So when the opportunity came to join a cloud computing provider, CenturyLink Cloud, and focus on data center automation for a mixed windows and linux environment, I jumped at it. That's where Jennifer entered the scene and opened a portal door leading from my desert to a land of automation awe and wonder and oh the things I saw. Entire communities built, thriving and fighting all about transforming server configuration from lines of code. To these people, "next" and "finish" were commands and not buttons. These were my people and with them I wanted to hang.

There were windows folk I spied through the portal. Jeffrey Snover had been there for quite some time in full robe and sandals pulling windows folk kicking and screaming through to the other side of the portal. Steven Murawski was there too, translating Jeffrey's teachings and pestering Jeffrey's disciples to write tests because that's how it's done. Rob had been sucked through the portal door at least a year prior and was working for Puppet. Hey look, he's carrying a Mac just like all the others. Is that some sort of talisman? Is it the shared public key needed to unveil their secrets? I don't know...but it's so shiny.

So what the heck, I thought. Tonight we'll put all other things aside and I followed Jennifer through the portal and never looked back. There was much ruby to be learned and written. It's not that different from powershell and had many similarities to C#, but it was indeed its own man and would occasionally hit back when I cursed it. One thing I found was that even more fun than using Chef to automate CenturyLink data centers, was hacking on and around the automation tooling like Chef, Vagrant, Ruby's WinRM implementation, Test-Kitchen, etc.

There was so much work to be done here to build a better bridge between windows and the automation these tools had to offer. This was so exciting and Chef seemed to be jumping all in to this effort as if proclaiming I know, I know, I know, I know, I want you, I want you. Before long Steven took up residence with Chef and Snover seemed to be coming over for regular visits where he was working with Chef to feed multitudes on just a couple loaves of DSC. I helped a little here and there along with a bunch of other smart and friendly people.

Well meanwhile on the CenturyLink farm, we made a lot of progress and added quite a bit of automation to building out our data centers. Recently things have been winding down for me on that effort and it seemed a good time to assess my progress and examine new possibilities. Then it occurred to me, if Jennifer can be a sparkly devops princess at chef, could I be one too?

We may not know the answer to that question yet, but I intend to explore it and enter into the center sphere of its celestial force.

So I'll see my #cheffriends Monday. I'm so excited and I just cant hide it!

Troubleshooting Ruby hangs on windows with windbg

Ahhh windbg...those who are familiar with it know that you can't live with it and you can't live without it. Unless you are a hardened windows c++ dev (I am not), if you have used windbg, it was in a moment of desperation when all of the familiar tools just wouldn't cut it. So now whenever you hear the word "windbg", it conjures up the archetype of hard to use, arcane engineering tooling. Step aside git and vi, windbg kicks your ass when it comes to commands that were intended to be forgotten. Yet here's the thing, windbg has saved your butt. It allowed you to see things nothing else could show you. Did git or vi provide this value?...ok...maybe they did...but still!!

Many of us have personal windbg stories. These usually involve some horrendous production outage and are mixed with the drama of figuring out a tool that was not meant to just be picked up and mastered. I have several of these stories in my high volume web development days. Here are a couple:

  1. A frequently called function uses a new regex causing CPU to spike across all web nodes.
  2. Web site becomes unresponsive because the asp.net thread pool is saturated with calls to a third party web service that is experiencing latency issues.

I survived these events and others thanks to windbg. Thanks windbg! But I still hate you and I think I always will.

windbg and ruby?

So why am I talking about windbg and ruby in the same post? Are they not from two different worlds that should remain apart? Well I have now had two separate incidents in the past few months where I have needed to use windbg with a problem in ruby code. Both incidents differ from my previous run-ins with windbg, where I was debugging managed code, and both were related to a similar Chef problem.

When it comes to ruby, it's all unmanaged code. You may not know the difference between managed and unmanaged code in the windows world and that is totally ok, but if you do, I can tell you there are extensions you can use with windbg (like sos.dll) to make your life easier. Sorry, those do not apply here. It's all native from here on out!

I've blogged long ago about debugging native .net code with windbg, but one, that's a super old post and two, I had to discover new "go to" commands for dealing with unmanaged memory dumps.

Let's talk like we are 2

As in two years old. First, two year olds are adorable. But the real reason is that's pretty much the level where I operate and whatever windbg maturity I gained over the past few days will be lost the next time I need it. I'm writing this post to my future self, who has fewer fast twitching brain cells than I have now and is in a bind and needs a clear explanation of how to navigate in a sea of memory addresses.

So to those who are more seasoned, this may not be for you and you may know a bunch more tricks to make this all simpler and for you god created commenting engines.

The problem

So here is the scenario. You notice that your windows nodes under Chef management are not converging. Maybe they are supposed to converge every few minutes, but they have not converged for hours.

So you figure something has happened to cause your nodes to fail and expect to find chef client logs showing chef run after run with the same failures. Nothing new, you will debug the error and fix it. So you open the log and see the latest chef run has just started a new run. Then you look at the time stamp and notice it started hours ago but never moved past finding its runlist. It's just sitting there hung.

No error message to troubleshoot. There is something wrong but the data stops there. What do you do?

Two words...windbg my friend...windbg. Ok it's not two words but it kind of has a two word ring to it.

What is this windbg?

Windbg is many things really. At its core, it's simply a debugger and many use it to attach to live processes and step through execution. That's not usually a good idea when troubleshooting a multithreaded application like a web site but may not be bad for a chef run. However, I have never used it in this manner.

Another very popular usage is to take a snapshot of a process, also called a memory dump, and use it to deeply examine exactly what was going on in the system at that point in time.

The great thing is that this snapshot is very complete. It has access to all memory in the process, all threads and all stack traces. However the rub is that it is very raw data. It's just raw memory, a bunch of addresses, pointers and hex values that may cause more confusion than help.

There are several commands to help sort out the sea of data but they are likely far less familiar, intuitive or efficient than your day to day dev tools.

This is one reason why I write this post and why I wrote my last windbg post: I can never remember this stuff, and the act of committing my immediate memory to writing and having a permanent record of this learning will help me when I inevitably have another similar problem in the future.

Taking a dump

What?!

Oh...god...no.

That's really what we call this. Seriously. I sit with straight-faced, well paid, professional adults and ask them to take dumps and give them to me.

Oh stop it.

Seriously though, this is the first stumbling point in the debugging endeavor. There are several kinds of memory dumps (crash dumps, hang dumps, minidumps, user mode dumps, kernel dumps, etc), each have their merits and there are different ways to obtain them and some means are more obscure than others.

For debugging ruby hangs, we generally just need a user mode dump of the ruby.exe process. This is not going to be a thorough discussion on all the different types of dumps and the means to produce them but I will cover a couple options.

Task Manager

Recent versions of windows come equipped with a crash dump generation tool that anyone can easily access. Simply right click on the process you want to examine in Task Manager and select "Create dump file". This generates a user mode dump file of the selected process. There are a couple downsides to collecting dumps in this manner:

1. These dumps do not include the handle table and therefore any use of the !handle command in windbg will fail.

2. On a 64 bit machine, unless you explicitly invoke the 32 bit Task Manager, you will get a 64 bit dump even of 32 bit processes. This is not a big deal really and we'll talk about how to switch to 32 bit mode in a 64 bit dump.

ProcDump

There is a sysinternals tool, ProcDump, that can be used to generate a dump. This tool allows you to set all kinds of thresholds in order to capture a dump file at just the right time based on CPU or memory pressure as well as other factors. You can also simply capture a dump immediately. I typically run:

procdump -ma <PID> <path of dump file>

So in the event of analyzing a ruby process dump, I simply give it the process id of ruby.exe. This will include the handle table and is smart enough to produce a 32 bit dump for 32 bit processes.

There are plenty of other ways to get dumps. I used to use AdPlus.vbs years ago and I *think* that will still work and ships with the debugging tools for windows that also include windbg. However you get your dump file, it will end in a .dmp extension. If supplying an output path to procdump or another generator, make sure you use that extension when naming the file.

When to take the dump

In this post we are discussing hang scenarios. So when analyzing hangs, take the dump when your process is hanging. If we were analyzing a CPU pinning scenario, we would take the dump when the process was pinning the CPU. This can be easier said than done, especially in production environments where hangs and pins may not surface on demand and may take hours or days to occur after a process starts. If you find yourself in such a situation where it's tricky to generate the dump file at just the right time, have a look at the various procdump switches that can wait for just the right moment based on several different triggers.
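
For example, in a CPU pinning scenario, something like the following (thresholds here are illustrative) should wait until CPU stays above 90% for 15 consecutive seconds before writing a full dump:

procdump -ma -c 90 -s 15 <PID> <path of dump file>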

Simulating high memory pressure

In the scenario described here, one of the hang events was preceded by high memory pressure on the chef node as seen from various flavors of OutOfMemoryExceptions on preceding chef runs. To reproduce the scenario, I needed to create an environment where there was little memory available on the box. For the benefit of my future self and others, here is how I did that:

  1. Created a VM with relatively low memory. 512MB was adequate in my case.
  2. Shrunk the paging file down to about 128MB
  3. Used the sysinternals tool, TestLimit to artificially leak memory on the box running: TestLimit64.exe -d 1 -c

At this point, based on analyzing previous dumps, I knew exactly where in the chef run the problem was occurring so I just created a little ruby program to run the problem call in a loop until the ruby process hung.
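
That little program was essentially a loop around the WMI query identified later in this post; here is a rough sketch, where the query and connection details are assumptions based on what ohai does:

require 'win32ole'

# connect to WMI the same way the wmi-lite gem does
locator = WIN32OLE.new('WbemScripting.SWbemLocator')
service = locator.ConnectServer('.', 'root\cimv2')

i = 0
loop do
  i += 1
  puts "query ##{i}"
  # the same style of query ohai issues via wmi-lite
  service.ExecQuery('select * from Win32_PnPSignedDriver').each { |d| d }
end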

Getting windbg

Windbg is included in Microsoft's debugging toolset, which ships with the Windows SDK. From the Microsoft download site, you can download the entire SDK or choose only the debugging tools.

Another option is chocolatey. I maintain the windbg chocolatey package and this package will download and install the windows debugging tools and does a few extra helpful things:

  1. Adds a Microsoft symbol server path environment variable that windbg will pick up. Assigning the path symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols to the _NT_SYMBOL_PATH variable. This is important if you want to be able to make much sense of the dump files. Without symbols, you are left with raw memory addresses and no specific information on the structure and calling patterns of key classes and structures in the call stack.
  2. Copies the sos.dll from your local .net installation to the same folder where windbg is installed. This comes in handy for managed code debugging not covered here.
  3. Adds separate shortcut icons for the 64 bit and 32 bit versions of windbg. It can be important, depending on the type of process dumped, to use the corresponding version of windbg. It really does not matter so much for this post.

Simply run:

choco install windbg -y

to install the chocolatey windbg package. I do this on all of my dev machines.

Analyzing dump files

So now you have the tool you need, windbg, and a dump file of a problem process to analyze. The remainder of this post will cover how to make sense of the dump with windbg. I will by no means be covering all the commands possible. In fact I'm just gonna cover a handful which should be all you need, but I'd encourage anyone to at least browse over this helpful command summary since there very well may be a command better suited for your scenario. I'd also point out that I am NOT a windbg guru and if you see a command or strategy I am missing here that you think would be helpful, please provide a comment.

Loading the dump file

Once windbg is open, navigate to File -> Open Crash Dump and browse to the .DMP file you captured. I know...your process may not have actually "crashed", but this is the correct menu option.

So now your dump is loaded into windbg and the entire memory heap of the process is at your fingertips. Here is the typical path I walk at this point under unmanaged scenarios like a ruby hang.

Which thread do I care about?

Chances are nearly certain that there is more than a single thread captured in the dump and in fact depending on the nature of the app, there can be quite a few. However it is likely you only care about one. At least that was the case in my scenario. I want to examine the thread that is hung.

I'll start by running !runaway. Not because that's the thing I want to do most at this time. Rather, this will list all threads and how much user time they have consumed:

0:000> !runaway
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078

Here there are only two threads. I am more interested in the one that was running for 20 minutes: thread 0. Now the times displayed here may be lower, perhaps dramatically lower, than you expect. My process had been running for a couple hours, what's up with 20 minutes!? It's even possible that your process has been hung for days, and the time displayed here will be mere milliseconds.

It's important to note that this is "User Time" and is the amount of time that the thread was actually working. To get actual "elapsed time" you can add a parameter to !runaway. I'll run !runaway 7 to get user, kernel and elapsed time:

0:000> !runaway 7
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078
 Kernel Mode Time
  Thread       Time
   0:734       0 days 0:13:19.437
   1:94        0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:734       0 days 2:58:53.443
   1:94        0 days 2:58:53.412

This is from the dump where I artificially created the high memory pressure environment and the same ruby thread was looping for hours. Of the 3 hours it was looping, the thread itself was only doing work for 20 minutes. Some of the production dumps I looked at had a much higher elapsed time (days) and lower user time (just milliseconds) because Chef had started a ruby process that immediately invoked a blocking call to an external process and then waited for days, during which time it did no work. Lazy bum.

Call stacks

Now let's look at some call stacks. Even though I'm pretty sure at this point that thread 0 is the one I'm interested in, I'll first browse the stack of all threads using: ~*k

0:000> ~*k

.  0  Id: 90c.734 Suspend: 0 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for win32ole.so - 
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msvcrt-ruby200.dll - 
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
*** ERROR: Module load completed but symbols could not be loaded for ruby.exe
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

   1  Id: 90c.94 Suspend: 0 Teb: 7ffda000 Unfrozen
ChildEBP RetAddr  
022cfec8 77452cc7 ntdll!NtWaitForSingleObject+0xc
022cff3c 77452c02 KERNELBASE!WaitForSingleObjectEx+0x99
022cff50 668fe676 KERNELBASE!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
022cff80 77087c04 msvcrt_ruby200!rb_thread_list+0x10d6
022cff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
022cffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
022cffec 00000000 ntdll!_RtlUserThreadStart+0x1b

OK, so the ~*k semantics were probably obvious but remember, we are 2. Let's walk through each character:

~ - Typing this alone would have simply listed 2 lines: each thread and their status.

* - This returned all threads. We could have supplied a number instead to get a single thread like ~0k to get just thread 0.

k - this produced the call stack.

There are more single character commands we can use after the 'k', each equally intuitive and discoverable. We will look at a couple in just a bit.

So here we have call stacks for each thread with a line for each frame in the stack. One can kind of grok the gist of the execution here: a bunch of ruby C calls and then an RPC call is sent and it waits for that call to return. The two addresses at the front of each frame are likely not much help for us. They refer to the base pointer of the frame and the return address where execution is to resume.

So now you are disappointed. You were probably thinking you'd be seeing an actual ruby stack trace. Yeah. That would be cool. So one challenge we face is finding what actual ruby code, perhaps that we wrote, is at play here. However, if you find yourself needing to debug at this level, there are things happening outside of ruby that are of real interest.

Another thing to note is that the ruby frames seem brief compared to the COM frames above. Later we'll see how examining the COM frames is much easier than the ruby ones. This is because we lack symbols for the compiled ruby runtime. I have not tried it myself, but we could compile ruby ourselves and produce symbols, then capture a dump running on that ruby. This would likely be helpful but not practical in a production environment.

Switching to the correct bitness

Maybe you are seeing nothing like this at all. If you are seeing something like this:

  0  Id: 90c.734 Suspend: 0 Teb: 00000000`7ffdb000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0008ee08 00000000`77581e66 wow64cpu!CpupSyscallStub+0x2
00000000`0008ee10 00000000`7754219a wow64cpu!WaitForMultipleObjects32+0x1d
00000000`0008eec0 00000000`775420d2 wow64!RunCpuSimulation+0xa
00000000`0008ef10 00007ffc`7f383a15 wow64!Wow64LdrpInitialize+0x172
00000000`0008f450 00007ffc`7f362f1e ntdll!LdrpInitializeProcess+0x1591
00000000`0008f770 00007ffc`7f2d8ebe ntdll!_LdrpInitialize+0x8a00e
00000000`0008f7e0 00000000`00000000 ntdll!LdrInitializeThunk+0xe

This is what a 32 bit thread looks like in a 64 bit dump. That's ok. You can switch inside of this same dump to 32 bit mode by running:

!wow64exts.sw

Now run ~0k and you will see a stack like the larger one before.

Going deeper into the stack

So I'll be honest, the ruby frames here make no sense to me. This may be due to the lack of symbols, but the last ruby frame would indicate that an argument error is being thrown. However, it is called by rb_f_send which does not call rb_error_arity according to the ruby source, and rb_error_arity does not call into win32ole. So I don't know what's going on here. Now I do know that the ruby win32ole class manages WMI queries and I also know that wbem is the prefix of the wmi scripting library in windows. Further, WMI queries are handled out of process, which would explain the RPC call. In this case, the hang is occurring in a Chef client run but before the cookbook recipes actually run. This is when chef "ohai" data is collected, the data that describes the current state of a machine, and windows nodes leverage WMI for much of that data.

So is there any more info that might lend us some clues here? Well one step we can take is to query the thread 0 stack again and this time try to get the actual parameters being passed into the function calls. That's done by adding...wait...I won't tell...you guess...

...that's right! Of course...'b'. We add b to the ~0k command so we run ~0kb:

0:000> ~0kb
ChildEBP RetAddr  Args to Child              
0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 00000002 0028ebe0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 00000001 0028ec80 ffffffff user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c 0028ec80 00000001 0028ec7c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- -------- -------- -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- -------- -------- -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 0028ef84 005c9c54 00505170 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- -------- -------- -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d 005c9c54 0028ef84 0028ef4c combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- -------- -------- -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d 005c9c54 0028ef84 0028ef4c combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 005c9c54 0028ef84 0028ef4c combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 02b0f2c4 0028efb0 76dc5740 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b 29b71661 76f113e0 0028f3d0 rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e 74ba39b0 74ba4ce0 0028f3d0 rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f 0028f3d0 00000003 0028f440 combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 02b0f2c4 005bc634 ffffffff combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f 005bc690 ffffffff 00000001 fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 02b7b668 0028f4f8 0051197c wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 02b7b668 00000024 00000004 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f 004f54ec 02b7b668 00000001 oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e 02b7b678 00000001 6ab991f4 wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 02b7b668 00000001 6ab991f4 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde 00000003 00000000 01dc6590 win32ole+0x101e2
0028f988 668e3539 00000001 00433098 02386500 win32ole+0x10bde
0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 0043308c 00433088 00000004 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd 00375750 004b2f30 02419480 msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 01d5af70 00000004 02487ed5 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 00000001 00000000 01dda1b8 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 02522f58 00000000 ffffffff msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d 0028fe7c 00000000 0028fe88 msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f 02522f58 00372cb8 0028ff80 msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa 00000002 00372cb8 00371550 ruby+0x287f
0028ff80 77087c04 7ffde000 77087be0 28433cec ruby+0x13fa
0028ff94 7765ad1f 7ffde000 28241b9c 00000000 kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ffffffff 7764024a 00000000 ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 004014e0 7ffde000 00000000 ntdll!_RtlUserThreadStart+0x1b

This adds three more memory addresses to each frame and they point to the first three arguments passed to the function on the stack.

So these may be hex representations of actual values or pointers to other data structures. If we have symbols for the code being called, things get a lot easier but let's look at a ruby call first. I'd like to check out the args passed to the last ruby call in the stack.

Examining raw address values

So the command to display memory in windbg is 'd'. I know, that's really kind of a given but remember we are 2. Also, it's not just d. d is always accompanied by another character to let windbg know how to display that value. Here are some I find myself using a lot:

  • dd - DWORD (4 bytes). By default it spits out 32 of them but you can limit the number with the L parameter (ex. dd 7746112f L2)
  • dc - Same output as dd but adds an ASCII representation to the right
  • da - displays the memory at the address as an ASCII string
  • du - displays the memory as a unicode string

There is another one, dt, that is super helpful and we will look at that in just a bit.

So one thing I am curious about here is the arguments passed to the last ruby frame:

0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129

Remember, it's the 2nd - 4th addresses that contain the first three arguments. You always see 3 addresses regardless of the number of arguments the method takes, so these may point to nowhere useful. Looking at the first argument, dc 0262556e produces:

0:000> dc 0262556e 
0262556e  336e6977 6e705f32 67697370 6464656e  win32_pnpsignedd
0262557e  65766972 26340072 5f564552 335c3130  river.4&REV_01\3
0262558e  37363226 36313641 26332641 00003833  &267A616A&3&38..
0262559e  3d740241 1b8612c8 43508000 45565c49  A.t=......PCI\VE
026255ae  30385f4e 44263638 375f5645 26323931  N_8086&DEV_7192&
026255be  53425553 305f5359 30303030 26303030  SUBSYS_00000000&
026255ce  5f564552 335c3330 37363226 36313641  REV_03\3&267A616
026255de  26332641 00003030 3d7d0000 1c011331  A&3&00....}=1...

OK, this is interesting and curious. win32_pnpsigneddriver is a valid WMI class. However, this would not be a valid argument for rb_error_arity, which takes 3 ints. So now I'm just chalking the entire ruby stack up to lack of debug symbols, but I feel confident a WMI call is a factor. Here is another clue: the last argument on the second win32ole frame:

0:000> dc 02386500 
02386500  0000000c 02361db0 00000000 6ab834f0  ......6......4.j
02386510  0241c9d8 00000000 09402005 01d5a820  ..A...... @. ...
02386520  00000023 02571888 00000023 00000000  #.....W.#.......
02386530  0000a007 01d528a0 02386560 00000000  .....(..`e8.....
02386540  00000000 00000000 09424005 01d5a820  .........@B. ...
02386550  63657845 72657551 00000079 00000000  ExecQuery.......
02386560  00502005 01d5a820 00000023 02625560  . P. ...#...`Ub.
02386570  0000003e 00000000 08202005 01d5a820  >........  . ...

ExecQuery is the WMI method name that performs queries on WMI classes.

So searching the chef ohai gem I come across:

drivers = wmi.instances_of('Win32_PnPSignedDriver')

This instances_of method uses the wmi-lite gem calling ExecQuery on the locator returned from:

locator = WIN32OLE.new("WbemScripting.SWbemLocator")

So now I have a pretty good idea what application code is triggering the hang. As I stated earlier in this post, WMI queries create an RPC call to a process named WmiPrvSE.exe. It's not uncommon to see more than one of these processes on a machine. So now I'm wondering if it might be possible to track which of these processes this thread is waiting on. This gets us to a couple other useful windbg commands.

Navigating types

So we do have symbols for the COM code handling the RPC message. That being the case, there is another stack trace command we can use to get us the parameters in a more friendly way - ~0kP

0:000> ~0kP
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn(
            void ** ahEvent = 0x0028ec80, 
            unsigned long cEvents = 1, 
            unsigned long * lpdwSignaled = 0x0028ec7c)+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive(
            class CMessageCall * pCall = 0x00505170, 
            struct _GUID * rmoid = 0x00509fb4 {70622196-bea4-9302-ac70-cabc069d1e23})+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pstatus = 0x0028ef4c)+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive(
            class CAptRpcChnl * pChannel = 0x005c9c54, 
            struct tagRPCOLEMESSAGE * pMsg = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive(
            void * pThis = 0x02b0f2c4, 
            struct _MIDL_STUB_MESSAGE * pStubMsg = 0x0028efb0)+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient(
            void * ParamAddress = 0x0028f3d0, 
            long Method = 0n3)+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless(void)+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

Maybe still not the most beautiful output, but look at those combase calls. We have types and parameter names! Amazing! So to get the process ID of the service handling the call, we want to look at the CMessageCall passed to ThreadSendReceive. We can dig into CMessageCall using the dt command. It takes the symbol to inspect and its address:

0:000> dt CMessageCall 0x00505170
Ambiguous matches found for CMessageCall (dumping largest sized):
    combase!CMessageCall {0x004 bytes}
    combase!CMessageCall {0x100 bytes}
combase!CMessageCall
   +0x000 __VFN_table : 0x76efb4c0 
   =7703e09c s_uCurrentCallTraceId : 0x23ac7b
   +0x004 __VFN_table : 0x76efb53c 
   +0x008 __VFN_table : 0x76efb52c 
   +0x00c _pNextMessage    : (null) 
   =76ef0000 Kind             : 0x905a4d (No matching name)
   +0x010 _callcat         : 1 ( CALLCAT_SYNCHRONOUS )
   +0x014 _iFlags          : 0x202
   +0x018 _hResult         : 0n0
   +0x01c _hEvent          : 0x00000260 Void
   +0x020 _hWndCaller      : (null) 
   +0x024 _ipid            : _GUID {0009042d-030c-0000-5ed7-a7ed59ca66b0}
   +0x034 _hSxsActCtx      : 0xffffffff Void
   +0x038 _server_fault    : 0
   +0x03c _destObj         : CDestObject
   +0x048 _pHeader         : 0x005bfdb8 Void
   +0x04c _pLocalThis      : 0x005bfdd8 WireLocalThis
   +0x050 _pLocalThat      : (null) 
   +0x054 _pHandle         : 0x005941a0 CChannelHandle
   +0x058 _hRpc            : 0x02b68f18 Void
   +0x05c _pContext        : (null) 
   +0x060 _dwCOMHeaderSize : 0x30
   +0x064 message          : tagRPCOLEMESSAGE
   +0x090 _iidWire         : _GUID {423ec01e-2e35-11d2-b604-00104b703efd}
   +0x0a0 _guidCausality   : _GUID {1a786d47-0267-499f-b9e9-6678d3e29858}
   +0x0b0 _dwServerPid     : 0x30c
   +0x0b4 _dwServerTid     : 0
   +0x0b8 _iMethodWire     : 3
   +0x0bc _dwErrorBufSize  : 0
   +0x0c0 _pASTAIncomingCallData : (null) 
   +0x0c4 m_ulCancelTimeout : 0xffffffff
   +0x0c8 m_dwStartCount   : 0xfb0049
   +0x0d0 m_pClientCtxCall : (null) 
   +0x0d4 m_pServerCtxCall : (null) 
   +0x0d8 _bCheckedClientIsAppContainer : 0y0
   +0x0d8 _bClientIsAppContainer : 0y0
   +0x0d8 _bServerIsAppContainer : 0y0
   +0x0d8 _bCheckedClientIsRpcss : 0y0
   +0x0d8 _bClientIsRpcss  : 0y0
   +0x0d8 _bCheckedClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientCalledGetBuffer : 0y1
   +0x0d9 _bClientCalledSendOrSendReceive : 0y1
   +0x0d9 _bClientCalledReceiveOrSendReceive : 0y1
   +0x0d9 _bClientReceiveOrSendReceiveCompletedSuccessfully : 0y0
   +0x0d9 _bSuppressClientStopTracing : 0y0
   +0x0dc _uCallTraceId    : 0x23ac7b
   +0x0e0 _hrClient        : 0
   +0x0e4 _sourceOfHResult : 0 ( Unknown )
   +0x0e8 _pInBiasContainer : (null) 
   +0x0ec _pOutBiasContainer : (null) 
   +0x0f0 _pushTlsMitigation : PushTlsPreventRundownMitigation
   +0x0f8 _bEnabledInprocOutParameterMitigation : 0y0
   +0x0f8 _bCheckedConditionsForOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bShouldEnableOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bScheduledAcknowlegmentOfOutParameterMarshalingSet : 0y0
   +0x0fc _pServerOXIDEntry : 0x004fbe68 OXIDEntry

So the key property here is _dwServerPid with a value of 0x30c, which translates to a decimal value of 780. This was a svchost.exe process which can host several windows services. This process was hosting the Windows Management Instrumentation service.
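
If, like me, you cannot convert hex in your head, windbg will do it for you with the ? expression evaluator (it interprets numbers as hex by default):

0:000> ? 30c
Evaluate expression: 780 = 0000030c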

To be honest, at this point my energy had fizzled for diving into that process to see if I could find which spawned WmiPrvSE.exe was at fault. There had been a lot of other troubleshooting beyond what we have looked at here leading me to believe that there was likely nothing helpful on the WmiPrvSE side. The actionable bit of data I could walk away with here is to include a timeout in the wmi-lite query so that if WMI queries hang in the future, the process will at least fail, allowing subsequent chef runs to proceed.

Before ending this post, I would like to cover a couple other points that may be helpful in other investigations.

Waiting handles

One dead giveaway from this stack that things are stuck is the WaitForMultipleObjects call at the very top.

0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc

Just google this name and you will find reams of posts, some well worth the read, talking about hangs. We can even examine the waiting handle. It's the second parameter sent to that call. First we need the handle address that argument points to. We will use dd 0028ebe0 L1 to get the very first dword:

0:000> dd 0028ebe0 L1
0028ebe0  00000260

This gives us the handle, which we can examine using the !handle command:

0:000> !handle 00000260  f
Handle 00000260
  Type          Event
  Attributes    0
  GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount   2
  PointerCount  35085
  Name          <none>
  Object specific information
    Event Type Auto Reset
    Event is Waiting

You definitely want to use the f parameter, otherwise you just get the handle type. So we see here the thread is waiting.

Locks, lock owners and orphans

In another hang I had some dumps of the WmiPrvSE processes on the machine at the time that ruby was hanging. In this dump was the following thread:

   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen
ChildEBP RetAddr  Args to Child              
0385dfa4 7724df87 00000380 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0385e018 7724de0b 00000001 0366e230 0366e260 ntdll!RtlpWaitOnCriticalSection+0xd0
0385e040 7724de35 0385e064 73f33450 0366e260 ntdll!RtlpEnterCriticalSectionContended+0xa0
0385e048 73f33450 0366e260 02e1bb48 0366e230 ntdll!RtlEnterCriticalSection+0x42
0385e064 73f33721 3a369589 00000000 00000000 signdrv!CWhqlObj::WalkTree+0x18
0385e930 73f3364d 02df2da8 02e1bb48 00000000 signdrv!CWhqlObj::CreateList+0x4b
0385e97c 00106d8a 0366e238 02e1b394 00000000 signdrv!CWhqlObj::CreateInstanceEnumAsync+0xdd
0385e9d8 00106a26 00000000 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::Helper_CreateInstanceEnumAsync+0x2c1
0385ea34 75c8676f 02dd8b70 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0xf6
0385ea5c 75d071fc 00106930 0385eca8 00000005 rpcrt4!Invoke+0x34
0385f0e0 74fa868d 02dbd098 02dc4154 02db95b4 rpcrt4!NdrStubCall2+0x32c
0385f12c 75c6147c 02dbd098 02db95b4 02dc4154 combase!CStdStubBuffer_Invoke+0x99 [d:\9147\com\combase\ndr\ndrole\stub.cxx @ 1590]
0385f148 74072f08 02dbd098 02db95b4 02dc4154 rpcrt4!CStdStubBuffer_Invoke+0x2c
0385f164 74fa853f 02dbda6c 02db95b4 02dc4154 fastprox!CBaseStublet::Invoke+0x38
0385f1f0 74e8c086 02dbb228 000f3888 02dc4154 combase!SyncStubInvoke+0x14c [d:\9147\com\combase\dcomrem\channelb.cxx @ 1664]
(Inline) -------- -------- -------- -------- combase!StubInvoke+0x9e [d:\9147\com\combase\dcomrem\channelb.cxx @ 1957]
0385f31c 74fa9600 02dc4154 02db95b4 02dbda6c combase!CCtxComChnl::ContextInvoke+0x236 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 1377]
(Inline) -------- -------- -------- -------- combase!DefaultInvokeInApartment+0xffffffe8 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 2716]
0385f3c4 74fa8de0 02dbda6c 02dd8b70 02de2de0 combase!AppInvoke+0x415 [d:\9147\com\combase\dcomrem\channelb.cxx @ 1481]
0385f520 74fa964f 74fa9710 00000000 02dc0ed0 combase!ComInvokeWithLockAndIPID+0x605 [d:\9147\com\combase\dcomrem\channelb.cxx @ 2311]
0385f57c 75c867cd 02dc0ed0 3c14e0c5 00000000 combase!ThreadInvoke+0x694 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5488]
0385f5c0 75c8722e 74fa9710 02dc0ed0 0385f670 rpcrt4!DispatchToStubInCNoAvrf+0x4d
0385f630 75c86ff7 02dc0ed0 00000000 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x13e
0385f6c4 75c86d3c 0385f6f4 00000000 02dc0e18 rpcrt4!LRPC_SCALL::DispatchRequest+0x226
0385f720 75c86a14 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SCALL::HandleRequest+0x31c
0385f75c 75c8538f 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x1fc
0385f824 75c85101 00000000 75c85070 02d9a0c0 rpcrt4!LRPC_ADDRESS::ProcessIO+0x481
0385f860 77250b10 0385f8e4 02db39c4 02d9a0c0 rpcrt4!LrpcIoComplete+0x8d
0385f898 77250236 0385f8e4 02d9a0c0 00000000 ntdll!TppAlpcpExecuteCallback+0x180
0385fa38 755d7c04 02d981d8 755d7be0 3cb70784 ntdll!TppWorkerThread+0x33c
0385fa4c 7728ad1f 02d981d8 3ee82643 00000000 kernel32!BaseThreadInitThunk+0x24
0385fa94 7728acea ffffffff 7727022d 00000000 ntdll!__RtlUserThreadStart+0x2f
0385faa4 00000000 77244a00 02d981d8 00000000 ntdll!_RtlUserThreadStart+0x1b

This thread jumped out at me because it was waiting to obtain a lock to run a critical section. We can get more info on the critical section by using the !critsec command with the address of the first parameter passed to RtlEnterCriticalSection:

0:000> !critsec 0366e260

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

We can also use the !locks command:

0:000> !locks

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

Scanned 13 critical sections

So the kind of obvious next question one might ask themselves now is: what is thread 2f84 doing? Probably nothing productive. However, a listing of threads:

0:000> ~
.  0  Id: 1dd0.9b4 Suspend: 0 Teb: 7f3be000 Unfrozen
   1  Id: 1dd0.2c64 Suspend: 0 Teb: 7f3bb000 Unfrozen
   2  Id: 1dd0.3148 Suspend: 0 Teb: 7f289000 Unfrozen
   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen

shows that the thread has gone missing. This probably means one of two things:

  1. We are in the middle of a context switch and owner 2f84 has finished.
  2. 2f84 encountered an access violation and died leaving our thread here orphaned. Tragic but very possible.

Coincidentally, more inspection of this thread shows that it was doing something with our WMI class friend Win32_PNPSignedDriver. This very well could have been why the ruby thread in this case was hanging. It was hard to say because other hangs did not reveal this same pattern.

So I know I've looked at just about enough memory dumps to last me a good while and I hope that this has been equally satisfying to other readers.

Oh...and...hi future self! I'm thinking about you and wishing you a quick investigation.

A Packer template for Windows Nano server weighing 300MB

Since the dawn of time, humankind has struggled to produce Windows images under a gigabyte and failed. We have all read the stories from the early Upanishads, we have studied the Zoroastrian calculations, recited the talmudic laws governing SxS, yet continue to grow ever older as we wait for our windows pets to find an IP for us to RDP to. Well hopefully these days are nearing an end. I think it's pretty encouraging that I can now package a windows VM in a 300MB vagrant package.

This post is going to walk through the details and pitfalls of creating a Packer template for Windows Nano Vagrant boxes. I have already posted on the basics of Packer templates and Vagrant box packaging. This post will assume some knowledge of Packer and Vagrant basic concepts.

Windows Nano, a smaller windows

Windows nano finally brings us VM images of similar relative size to their linux cousins. The one I built for VirtualBox is about 307MB. This is 10x smaller than the smallest 2012R2 box I have packaged, at around 3GB.

Why so much smaller?

Here are a few highlights:

  • No GUI. Really this time. No notepad and no cmd.exe window. It's windows without windows.
  • No SysWow64. Nano completely abandons 32 bit compatibility, but I'm bummed there will be no SysWowWow128.
  • Minimal packages and features in the base image. The windows team has stripped down this OS to a minimal set of APIs and features. You will likely find some of your go to utilities missing here, but that's ok because it likely has another and probably better API that accomplishes the same functionality.

Basically Microsoft is letting backwards compatibility slide on this one and producing an OS that does not try to support legacy systems, but is far more effective at managing server cattle.

Installation challenges

Windows Nano does not come packaged in a separate ISO nor does it bundle as a separate image inside the ISO like most of the other server SKUs such as Standard Server or Data Center. Instead you need to build the image from bits in the installation media and extract that.

If you want to host Nano on Hyper-V, running the scripts to build and extract this image is shockingly easy. Even if you want to build a VirtualBox VM, things are not so bad. However, there are more moving parts and some elusive gotchas when preparing a Packer template.

Just show me the template

Before I go into detail, mainly as a cathartic act of self governed therapy to recover from the past week of yak shaving, let's just show how to start producing and consuming packer templates for Nano images today. The template can be found here in my packer-templates repository. I'm going to walk through the template and the included scripts, but that is optional reading.

I'm running Packer 0.8.2 and Virtualbox 5.0.4 on Windows 8.1 to build the template.

Known Issues

There were several snags but here are a couple items that just didn't work and may trip you up when you first try to build the template or Vagrant up:

  1. I had upgraded to the latest Packer version, 0.8.6 at the time of this post, and had issues with WinRM connectivity, so I reverted back to 0.8.2. I do plan to investigate that and alter the template to comply with the latest version or file issue(s) and/or PRs if necessary.
  2. Vagrant up will fail but may succeed to the extent that you need it to. It will fail to establish a WinRM connection with the box but it will create a connectable box and can also destroy it. This does mean that you will not have luck using any vagrant provisioners or packer provisioners. For me, that's fine for now.

The reason for the latter issue is that the WinRM service in nano expects requests to use codepage 65001 (UTF-8) and will refuse requests that do not. The WinRM ruby gem used by Vagrant uses codepage 437 and you will see exceptions when it tries to connect. Previous windows versions have accepted both codepages and I have heard that this will be the case with nano by the time it officially ships.

Connecting and interacting with the Nano Server

I have been connecting via powershell remoting. That of course assumes you are connecting from Windows. Despite what I said above about the limitations of the ruby WinRM gem, it does have a way to override the 437 codepage. However, doing so is not particularly friendly and means you cannot use a lot of the helper methods in the gem.

To connect via powershell, run:

# Enable powershell remoting if it is not already enabled
Enable-PSRemoting -Force

# You may change "*" to the name or IP of the machine you want to connect to
Set-Item "wsman:\localhost\client\trustedhosts" -Value "*" -Force

# the password is vagrant
$creds = Get-Credential vagrant

# this assumes you are using NAT'd network which is the Virtualbox default
# Use the computername or IP of the machine and skip the port arg
# if you are using Hyper-V or another non NAT network
Enter-PSSession -Computername localhost -Port 55985 -Credential $creds

If you do not have a windows environment from which to run a remote powershell session, you can just create a second VM.

Deploying Nano manually

Before going through the packer template, it would be helpful to understand how one would build a nano server without packer, by hand. It's a bit more involved than giving packer an answer file. There are a few different ways to do this and some paths work better for different scenarios. I'll just lay out the procedure for building Nano on virtualbox.

From Windows hosts

Ben Armstrong has a great post on creating nano VMs for Hyper-V. If you are on Windows and want to create Virtualbox VMs, the instructions for creating the nano image are nearly identical. The key change is to specify -OEMDrivers instead of -GuestDrivers in the New-NanoServerImage command. GuestDrivers have the minimal set of drivers needed for Hyper-V. While -GuestDrivers can also create a VirtualBox image that loads and shows the initial nano login screen, I was unable to actually log in. Using -OEMDrivers adds a larger set of drivers and allows the box to function in VirtualBox. It's interesting to note that a Hyper-V Vagrant box built using GuestDrivers is 60MB smaller than one using OEMDrivers.

Here is a script that will pop out a VHD after you mount the Windows Server 2016 Technical Preview 3 ISO:

cd d:\NanoServer
. .\new-nanoserverimage.ps1
mkdir c:\dev\nano
$adminPassword = ConvertTo-SecureString "Pass@word1" -AsPlainText -Force

New-NanoServerImage `
  -MediaPath D:\ `
  -BasePath c:\dev\nano\Base `
  -TargetPath c:\dev\nano\Nano-image `
  -ComputerName Nano `
  -OEMDrivers `
  -ReverseForwarders `
  -AdministratorPassword $adminPassword

Now create a new Virtualbox VM and attach it to the VHD created above.

From Mac or Linux hosts

You have no powershell here so the instructions are different. Basically you need to either create or use an existing windows VM. Make sure you have a shared folder set up so that you can easily copy the nano VHD from the windows VM to your host, and then create the Virtualbox VM using that VHD as its storage.
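In case it is helpful, here is a rough sketch of the VBoxManage commands for that last step. The VM name, OS type, memory size, and VHD path here are placeholder assumptions, not values taken from the template:

# Create a VM, give it a SATA controller and attach the copied Nano VHD
VBoxManage createvm --name "Nano" --ostype Windows2012_64 --register
VBoxManage modifyvm "Nano" --memory 1024
VBoxManage storagectl "Nano" --name "SATA" --add sata --controller IntelAhci
VBoxManage storageattach "Nano" --storagectl "SATA" --port 0 --device 0 --type hdd --medium Nano-image.vhd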

That all seems easy, why Packer?

So you may very well be wondering at this point, "Its just a handful of steps to create a nano VM. Your packer template has multiple scripts and probably 100 lines of powershell. What is the advantage of using Packer?"

First there might not be one. If you want to create one instance and play around on the same host and don't care about supporting other instances on other hosts or have scenarios where you need to ensure that multiple nodes come from an identically built image, then packer may not be the right tool for you.

Here are some scenarios where packer shines:

  • Guaranteed identical images - If all images come from the same template, you know that they are all the same and you have "executable documentation" on how they were produced.
  • Immutable Infrastructure - If I have production clusters that I routinely tear down and rebuild/replace or a continuous delivery pipeline that involves running tests on ephemeral VMs that are freshly built for each test suite, I can't be futzing around on each node, copying WIMs and VHDs.
  • Multi-Platform - If I need to create both linux and windows environments, I'd prefer to use a single tool to pump out the base images.
  • Single click, low friction box sharing - For the thousands and thousands of vagrant users out there, many of whom do not spend much time on windows, giving them a vagrant box is the best way to ensure they have a positive experience provisioning the right image and Packer is the best tool for creating vagrant boxes.

Walking through the template

So now we will step through the key parts of the template and scripts highlighting areas that stray from the practices you would normally see in windows template work and dwelling on nano behavior that may catch you off guard.

High level flow

First a quick summary of what the template does:

  1. Installs Windows Server 2016 Core on a new Virtualbox VM
  2. Powershell script is launched from the answer file that creates the Nano image, mounts it, copies it to an empty partition and then updates the default boot record to boot from that partition.
  3. Machine reboots into nano
  4. Some winrm tweaks are made, the Windows Server 2016 partition is removed and the nano partition extended over it.
  5. "Zap" unused space on disk.
  6. Packer archives the VM to vmdk and packages to a .box file.

Three initial disk partitions

We assume that there is no windows anywhere (because this reflects many build environments) so we will be installing two operating systems: the larger Windows Server 2016, and Nano. We build nano from the former. Our third partition is a system partition. It's easier to have a separate partition for the master boot record that we don't have to touch or move around in the process.

It is important that the Windows Server 2016 Partition be physically located at the end of the disk. Otherwise we will be stuck with a gap in the disk after we remove it.
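To make that concrete, here is a minimal sketch of the kind of cleanup that happens later (steps 4 and 5 in the flow below), assuming the throwaway Windows Server 2016 partition ended up as D: and Nano boots from C:; the real logic lives in the template's cleanup script:

# Drop the Server 2016 partition and grow the Nano partition over the freed space
Remove-Partition -DriveLetter D -Confirm:$false
$max = (Get-PartitionSupportedSize -DriveLetter C).SizeMax
Resize-Partition -DriveLetter C -Size $max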

One may find it odd that our Autounattend.xml file installs Server 2016 from an image named "Windows Server 2012 R2 SERVERDATACENTERCORE." It is odd but correct. That's cool. This is all beta still and I'm sure this is just one detail yet to be ironed out. There is probably some horrendously friction laden process involved to change the image name. One thing that tripped me up a bit is that there are 4 images in the ISO:

C:\dev\test> Dism /Get-ImageInfo /ImageFile:d:\sources\install.wim

Deployment Image Servicing and Management tool
Version: 10.0.10240.16384

Details for image : d:\sources\install.wim

Index : 1
Name : Windows Server 2012 R2 SERVERSTANDARDCORE
Description : Windows Server 2012 R2 SERVERSTANDARDCORE
Size : 9,621,044,487 bytes

Index : 2
Name : Windows Server 2012 R2 SERVERSTANDARD
Description : Windows Server 2012 R2 SERVERSTANDARD
Size : 13,850,658,303 bytes

Index : 3
Name : Windows Server 2012 R2 SERVERDATACENTERCORE
Description : Windows Server 2012 R2 SERVERDATACENTERCORE
Size : 9,586,595,551 bytes

Index : 4
Name : Windows Server 2012 R2 SERVERDATACENTER
Description : Windows Server 2012 R2 SERVERDATACENTER
Size : 13,847,190,006 bytes

The operation completed successfully.

Images 3 and 4, the DataCenter ones, are the only ones installable from an answer file.

Building Nano

I think .\scripts\nano_create.ps1 is pretty straightforward. We build the nano image as discussed earlier in this post and copy it to a permanent partition.

What might seem odd is the last few lines that set up winrm. Why do we do this when we are about to blow away this OS and never use winrm? We do this because of the way that the VirtualBox builder works in packer. It is currently waiting for winrm to become available before moving forward in the build process. So this is done simply as a signal to packer. A signal to what?

The Virtualbox builder will now invoke any "provisioners" in the template and then issue the template's shutdown command. We don't use any provisioners, which brings us to our first road bump.
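For reference, the "signal" itself needs nothing exotic. A minimal sketch of what the end of nano_create.ps1 can do (the actual script may differ in its details):

# Enable WinRM on the temporary Server 2016 OS purely so that the Packer
# VirtualBox builder sees a connectable endpoint and proceeds
cmd.exe /c winrm quickconfig -q
netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow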

Nano forces a codepage incompatible with packer and vagrant

On the one hand it is good to see Nano using a UTF-8 code page (65001). However, previous versions of Windows have traditionally used the old MS-DOS code page (437), and both the ruby WinRM gem used by Vagrant and the Go WinRM package used by Packer are hard coded to use 437. At this time, Nano will not accept 437, so any attempt to establish WinRM communication by Vagrant and Packer will fail with this error:

An error occurred executing a remote WinRM command.

Shell: powershell
Command: hostname
if ($?) { exit 0 } else { if($LASTEXITCODE) { exit $LASTEXITCODE } else { exit 1 } }
Message: [WSMAN ERROR CODE: 2150859072]: <f:WSManFault Code='2150859072' Machine='192.168.1.130' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message><f:ProviderFault path='%systemroot%\system32\winrscmd.dll' provider='Shell cmd plugin'>The WinRS client cannot process the request. The server cannot set Code Page. You may want to use the CHCP command to change the client Code Page to 437 and receive the results in English. </f:ProviderFault></f:Message></f:WSManFault>

This means packer provisioners will not work and we need to take a different route to provisioning.

One may think this is a show stopper for provisioning Windows images, and it is for some scenarios, but for my initial packer use case that's OK, and I hear that Nano will accept 437 before it "ships." Note that this only seems to be the case with Nano and not Windows Server 2016.

Cut off from Winrm Configuration APIs

Both Vagrant and Packer expect to communicate over unencrypted WinRM using Basic Authentication. I know I just said that Vagrant and Packer can't talk WinRM at all, but I ran into a challenge with WinRM before discovering the codepage issue. When trying to allow unencrypted WinRM and basic auth, I found that the two most popular methods for tweaking winrm were not usable on nano.

These methods include:

  1. Using the winrm command line utility
  2. Using the WSMan Powershell provider

The first simply does not exist. Normally the winrm command is c:\windows\system32\winrm.cmd, a tiny wrapper around cscript.exe, the scripting engine used to run VBScripts. Well, there is no cscript or wscript, so no visual basic runtime at all. Interestingly, winrm.vbs does exist. Feels like a sick joke.

So we could use the COM API to do the configuration. If you like COM constants and HRESULTS, this is totally for you. The easier approach at least for my purposes is to simply flip the registry keys to get the settings I want:

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v allow_unencrypted /t REG_DWORD /d 1 /f

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v auth_basic /t REG_DWORD /d 1 /f

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Client /v auth_basic /t REG_DWORD /d 1 /f

No modules loaded in Powershell scripts run from SetupComplete.cmd

SetupComplete.cmd is a special file that can sit in windows\setup\scripts and, if it does, will be run on first boot and then never again. We use this because, as mentioned before, we can't use Packer provisioners since winrm is not an option. I have never used this file before so it's possible this is not specific to nano, but that would be weird. I was wondering why the powershell script I called from this file was not being called at all. Everything seemed to go fine, no errors, but my code was definitely not being called. Kinda like debugging scheduled tasks.

First, Start-Transcript is not present on Nano. So that was to blame for the lack of errors. I switched to old school redirection:

cmd.exe /c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -command . c:\windows\setup\scripts\nano_cleanup.ps1 > c:\windows\setup\scripts\cleanup.txt

Next I started seeing errors about other missing cmdlets like Out-File. I thought that seemed strange and had the script run Get-Module. The result was an empty list of modules so I added loading of the basic PS modules and the storage module, which would normally be auto loaded into my session:

Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Microsoft.PowerShell.Utility\Microsoft.PowerShell.Utility.psd1
Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Microsoft.PowerShell.Management\Microsoft.PowerShell.Management.psd1
Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Storage\Storage.psd1

Not everything you expect is on Nano but likely everything you need

As I mentioned above, Start-Transcript and cscript.exe are missing, but those are not the only things. Here are some other commands I noticed were gone:

  • diskpart
  • bcdboot
  • Get-WMIObject
  • Restart-Computer

I'm sure there are plenty of others, but these all have alternatives that I could use.

Different arguments to powershell.exe

A powershell /? will reveal a command syntax slightly different from what one is used to:

C:\dev\test> Enter-PSSession -ComputerName 192.168.1.134 -Credential $c
[192.168.1.134]: PS C:\Users\vagrant\Documents> powershell /?
USAGE: powershell [-Verbose] [-Debug] [-Command] <CommandLine>

  CoreCLR is searched for in the directory that powershell.exe is in,
  then in %windir%\system32\CoreClrPowerShellExt\v1.0\.

No -ExecutionPolicy, no -File, and others are missing too. I imagine this could break some existing scripts.

No 32 bit

I knew this going in but was still caught off guard when sdelete.exe failed to work. I use sdelete, a sysinternals utility, for zeroing out free space on disk, which leads to a dramatically smaller image size when we are done. Well, I'm guessing it was compiled for 32 bit because I got complaints about the executable image being incompatible with nano.

In the end this turned out to be for the best: I found a pure powershell alternative to sdelete which I adapted for my limited needs:

$FilePath="c:\zero.tmp"
$Volume= Get-Volume -DriveLetter C
$ArraySize= 64kb
$SpaceToLeave= $Volume.Size * 0.05
$FileSize= $Volume.SizeRemaining - $SpacetoLeave
$ZeroArray= new-object byte[]($ArraySize)
$Stream= [io.File]::OpenWrite($FilePath)
try {
   $CurFileSize = 0
    while($CurFileSize -lt $FileSize) {
        $Stream.Write($ZeroArray,0, $ZeroArray.Length)
        $CurFileSize +=$ZeroArray.Length
    }
} 
finally {
    if($Stream) {
        $Stream.Close()
    }
}
Del $FilePath

Blue Screens of Death

So I finally got the box built and was generally delighted with its size (310MB). However when I launched the vagrant box, the machine blue screened reporting that a critical process had died. All of the above issues had made this a longer haul than I expected but it turned out that troubleshooting the bluescreens was the biggest time suck and sent me on hours of wild goose chases and red herrings. I almost wrote a separate post dedicated to this issue, but I'm gonna try to keep it relatively brief here (not a natural skill).

What was frustrating here is I knew this could work. I had several successful tests but with slightly different execution flows which I was tweaking along the way, but it certainly did not like my final template and scripts. I would get the CRITICAL_PROCESS_DIED blue screen twice and then it would stop at a display of error code 0xc0000225 and the message "a required device isn't connected or can't be accessed."

Based on some searching I thought that there was something wrong somewhere in the boot record. After all, I was messing with deleting and resizing partitions and changing the boot record, compounded by the fact that I am not an expert in that area. However, lots of futzing with diskpart, bcdedit, bcdboot, and bootrec got me nowhere. I also downloaded the Technical Preview 3 debug symbols to analyze the memory dump but there was nothing interesting there. Just a report that the process that died was wininit.exe.

Trying to manually reproduce this, I found that the final machine produced by packer was just fine. Packer exports the VM to a new .vmdk virtual disk. Trying to create a machine from that would produce blue screens. Further, manually cloning a .vdi had the same effect - more blue screens. Finally, I tried attaching a new VM to the same disk that worked and made sure the VM settings were identical to the working machine. This failed too, which seemed very odd. I then discovered that removing the working machine and manually editing the broken machine's .vbox XML to have the same UUID as the working one fixed things. After more researching, I found out that Virtualbox has a modifiable setting called a Hardware UUID. If none is supplied, it uses the machine's UUID. So I cloned another box from the working machine, validated that it blue screened, and then ran:

vboxmanage modifyvm <machine name> --hardwareuuid "{same uuid as the working box}"

Voila! The box came to life. So I could fix this by telling the packer template to stamp an artificial guid at startup:

    "vboxmanage": [
      [ "modifyvm", "{{.Name}}", "--natpf1", "guest_winrm,tcp,,55985,,5985" ],
      [ "modifyvm", "{{.Name}}", "--memory", "2048" ],
      [ "modifyvm", "{{.Name}}", "--vram", "36" ],
      [ "modifyvm", "{{.Name}}", "--cpus", "2" ],
      [ "modifyvm", "{{.Name}}", "--hardwareuuid", "02f110e7-369a-4bbc-bbe6-6f0b6864ccb6" ]
    ],

then add the exact same guid to the Vagrantfile template:

config.vm.provider "virtualbox" do |vb|
  vb.customize ["modifyvm", :id, "--hardwareuuid", "02f110e7-369a-4bbc-bbe6-6f0b6864ccb6"]
  vb.gui = true
  vb.memory = "1024"
 end

This ensures that vagrant "up"s the box with the same hardware UUID that it was created with. The actual id does not matter and I don't think there is any harm, at least for test purposes, in having duplicate hardware uuids.

I hoped that a similar strategy would work for Hyper-V by changing its BIOSGUID using a powershell script like this:

#Virtual System Management Service
$VSMS = Get-CimInstance -Namespace root/virtualization/v2 -Class Msvm_VirtualSystemManagementService
#Virtual Machine
$VM = Get-CimInstance -Namespace root/virtualization/v2  -Class Msvm_ComputerSystem -Filter "ElementName='Demo-VM'"
#Setting Data
$SD = $vm | Get-CimAssociatedInstance -ResultClassName Msvm_VirtualSystemSettingData -Association Msvm_SettingsDefineState

#Update bios uuid
$SD.BIOSGUID = "some guid"
 
#Create embedded instance
$cimSerializer = [Microsoft.Management.Infrastructure.Serialization.CimSerializer]::Create()
$serializedInstance = $cimSerializer.Serialize($SD, [Microsoft.Management.Infrastructure.Serialization.InstanceSerializationOptions]::None)
$embeddedInstanceString = [System.Text.Encoding]::Unicode.GetString($serializedInstance)
 
#Modify the system settings
Invoke-CimMethod -CimInstance $VSMS -MethodName ModifySystemSettings @{SystemSettings = $embeddedInstanceString}

Thanks to this post for an example. It did not work, but the Hyper-V blue screens seem to be "self healing". I posted more details in the vagrant box readme on atlas.

No sysprep

I suspect that the above is somehow connected to OS activation. I saw lots of complaints on google about needing to do something similar with the hardware uuid in order to preserve the windows activation of a cloned machine. I also noted that if I cloned the box manually before packer rebooted into nano, thereby letting the clone run the initial setup, things worked.

Ideally, the fix here would be to leave the hardware uuid alone and just sysprep the machine as the final packer step. This is what I do for 2012R2. However from what I can tell, there is no sysprep available for nano. I really hope that there will be.

Finally a version of windows for managing cattle servers

You may have heard the cattle vs. pets analogy that compares "special snowflake" servers to cloud based clusters. The idea is that one perceives cloud instances like cattle. There is no emotional attachment or special treatment given to one server over another. At this scale, one can't afford to. We don't have the time or resources to be accessing our instances via a remote desktop and clicking buttons or dragging windows. If one becomes sick, we put it out of its misery quickly and replace it.

Nano is lightweight and made to be accessed remotely. I'm really interested to see how this progresses.


Understanding and troubleshooting WinRM connection and authentication: a thrill seeker's guide to adventure


Connecting to a remote windows machine is often far more difficult than one would have expected. This was my experience years ago when I made my first attempt to use powershell remoting to connect to an Azure VM. At the time, powershell 2 was the hotness and many were talking up its remoting capabilities. I had been using powershell for about a year at the time and thought I'd give it a go. It wasn't simple at all and took a few hours to finally succeed.

Now, armed with 2012R2 and more knowledge, it's simpler, but let's say you are trying to connect from a linux box using one of the open source WinRM ports: there are several gotchas.

I started working for Chef about six weeks ago and it is not at all uncommon to find customers and fellow employees struggling with failure to talk to a remote windows node. I'd like to lay out in this post some of the fundamental moving parts as well as the troubleshooting decision tree I often use to figure out where things are wrong and how to get connected.

I'll address cross platform scenarios using plain WinRM, powershell remoting from windows and some Chef specific tooling using the knife-windows gem.

Connecting and Authenticating

In my experience these are the primary hurdles to WinRM sweet success. First is connecting. Can I successfully establish a connection on a WinRM port to the remote machine? There are several things to get in the way here. Then a yak shave or two later you get past connectivity but are not granted access. What's that you say? You are signing in with admin credentials to the box?...I'm sorry say that again?...huh?...I just can't hear you.

TL;DR - A WinRM WTF checklist:

I am going to go into detail in this post on the different gotchas and their accompanying settings needed to successfully connect and execute commands on a remote windows machine using WinRM. However, if you are stuck right now and don't want to sift through all of this, here is a cheat sheet list of things to set to get you out of trouble:

On the remote windows machine:

  • Run Enable-PSRemoting
  • Open the firewall with: netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow
  • Accessing via cross platform tools like chef, vagrant, packer, ruby or go? Run these commands:
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

Note: DO NOT use the above winrm settings on production nodes. They should be used on test instances only for troubleshooting WinRM connectivity.

This checklist is likely to address most trouble scenarios when accessing winrm over HTTP. If you are still stuck or want to understand this domain more, please read on.

Barriers to entry

Let's talk about connectivity first. Here are the key issues that can prevent connection attempts to a WinRM endpoint:

  • The Winrm service is not running on the remote machine
  • The firewall on the remote machine is refusing connections
  • A proxy server stands in the way
  • Improper SSL configuration for HTTPS connections

We'll address each of these scenarios but first...

How can I know if I can connect?

It can often be unclear whether we are fighting a connection or authentication problem. So I'll point out how you can determine if you can eliminate connectivity as a potential issue.

On Mac/Linux:

$ nc -z -w1 <IP or host name> 5985;echo $?

This uses netcat available on the mac and most linux distros. Assuming you are using the default HTTP based WinRM port 5985 (more on determining the correct port in just a bit), if the above returns 0, you know you are getting through to a listening WinRM endpoint on the other side.

On Windows:

Test-WSMan -ComputerName <IP or host name>

Again this assumes you are trying to connect over the default HTTP WinRM port (5985), if not add -UseSSL. This should return some non-error response that looks something like:

wsmid           : http://schemas.dmtf.org/wbem/wsman/identity/1/wsmanidentity.xsd
ProtocolVersion : http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
ProductVendor   : Microsoft Corporation
ProductVersion  : OS: 0.0.0 SP: 0.0 Stack: 3.0

WinRM Ports

The above commands used the default WinRM HTTP port to attempt to connect to the remote WinRM endpoint - 5985. WinRM is a SOAP based HTTP protocol.

Side Note: In 2002, I used to car pool to my job in Sherman Oaks California with my friend Jimmy Bizzaro and would kill time by reading "Programming Web Services with SOAP" an O'Reilly publication. This was cutting edge, cool stuff. Java talking to .net, Java talking to Java but from different machines. This was the future. REST was done in a bed or on a toilet. So always remember, today's GO and Rust could be tomorrow's soap.

Anyhoo...WinRM can talk HTTP and HTTPS. The default ports are 5985 and 5986 respectively. However, the default ports can be changed. Usually the change is driven by network address translation. Sure, these ports can be changed locally too, but in my experience if you need to access WinRM on ports other than 5985 or 5986 it's usually to accommodate NAT. So check your Virtualbox NAT config or your Azure or EC2 port mappings to see if there is a port forwarding to 5985/6 on the VM. Those would be the ports you need to use. The Test-WSMan cmdlet also takes a -Port parameter where you can provide a non-standard WinRM port.
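For example, assuming a VirtualBox NAT rule that forwards host port 55985 to guest port 5985, the check from the host would be:

Test-WSMan -ComputerName localhost -Port 55985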

So now you know the port to test, but you are getting a non-zero netcat response or an error thrown from Test-WSMan. Now what?

Is WinRM turned on?

This is the first question I ask. If winrm is not listening for requests, then there is nothing to connect to. There are a couple of ways to do this. What you usually do NOT want to do is simply start the winrm service. Not that that is a bad thing, it's just not likely going to be enough. The two best ways to "turn on" WinRM are:

winrm quickconfig

or the powershell approach:

Enable-PSRemoting

For default windows 2012R2 installs (not altered by group policy), this should be on by default. However, windows 2008R2 and client SKUs have it turned off until explicitly enabled.

Wall of fire

In some circles called a firewall. This can often be a blocker. For instance, while winrm is on by default on 2012R2, its firewall rules will block public traffic from outside its own subnet. So if you are trying to connect to a server in EC2 or Azure for example, opening this firewall restriction is important and can be done with:

HTTP:

netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow

HTTPS:

netsh advfirewall firewall add rule name="WinRM-HTTPS" dir=in localport=5986 protocol=TCP action=allow

This also affects client SKUs which by default do not open the firewall to any public traffic. If you are on a client version of windows 8 or higher, you can also use the -SkipNetworkProfileCheck switch when enabling winrm via Enable-PSRemoting which will at least open public traffic to the local subnet and may be enough if connecting to a machine on a local hypervisor.

Proxy Servers

As already stated, WinRM runs over http. Therefore if you have a proxy server sitting between you and the remote machine you are trying to connect to, you need to make sure that the request goes through that proxy server. This is usually not an issue if you are on a windows machine and using a native windows API like powershell remoting or winrs to connect. They will simply use the proxy settings in your internet settings.

Ruby tooling like Chef and Vagrant uses a different mechanism. If the tool is using the WinRM ruby gem, like chef and vagrant do, it relies on the HTTP_PROXY environment variable instead of the local system's internet settings. As of knife-windows 1.1.0, the http_proxy settings in your knife.rb config file will make their way to the HTTP_PROXY environment variable. You can manually set this as follows:

Mac/Linux:

$ export HTTP_PROXY="http://<proxy server>:<proxy port>/"

Windows Powershell:

$env:HTTP_PROXY="http://<proxy server>:<proxy port>/"

Windows Cmd:

exit

Friends don't let friends use cmd.exe and you are my friend.

SSL

I'm saving SSL for the last connection issue because it is more involved (which is why folks often opt for HTTP over the more secure HTTPS). There is extra configuration required on both the remote and local sides, and that can vary by local platform. Let's first discuss what must be done on the remote WinRM endpoint.

Create a self signed certificate

Assuming you have not purchased an SSL certificate from a valid certificate authority, you will need to generate a self signed certificate. If you are on a 2012R2 windows OS version or later, this is trivial:

$c = New-SelfSignedCertificate -DnsName "<IP or host name>" -CertStoreLocation cert:\LocalMachine\My

Read ahead for issues with New-SelfSignedCertificate and certificate verification with openssl libraries.

Creating a HTTPS WinRM listener

Now WinRM needs to be configured to respond to https requests. This is done by adding an https listener and associating it with the thumbprint of the self signed cert you just created.

winrm create winrm/config/Listener?Address=*+Transport=HTTPS "@{Hostname=`"<IP or host name>`";CertificateThumbprint=`"$($c.ThumbPrint)`"}"

Adding firewall rule

Finally enable winrm https requests through the firewall:

netsh advfirewall firewall add rule name="WinRM-HTTPS" dir=in localport=5986 protocol=TCP action=allow

SSL client configuration

At this point you should be able to reach a listening WinRM endpoint on the remote server. On a mac or linux box, a netcat check on the https winrm port should be successful:

$ nc -z -w1 <IP or host name> 5986;echo $?

On Windows, running Test-NetConnection (a welcome alternative to telnet on windows 8/2012 or higher) should show an open TCP port:

C:\> Test-netConnection <IP> -Port 5986

ComputerName           : <IP>
RemoteAddress          : <IP>
RemotePort             : 5986
InterfaceAlias         : vEthernet (External Virtual Switch)
SourceAddress          : <local IP>
PingSucceeded          : True
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : True

However, trying to establish a WinRM connection will likely fail with a certificate validation error unless you install that same self signed cert you created on the remote endpoint.

If you try to test the connection on windows using Test-WSMan as we saw before, you would receive this error:

Test-WSMan -ComputerName 192.168.1.153 -UseSSL
Test-WSMan : <f:WSManFault
xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="12175"
Machine="ultrawrock"><f:Message>The server certificate on the destination
computer (192.168.1.153:5986) has the following errors:
The SSL certificate is signed by an unknown certificate authority.
The SSL certificate contains a common name (CN) that does not match the
hostname.     </f:Message></f:WSManFault>
At line:1 char:1
+ Test-WSMan -ComputerName 192.168.1.153 -UseSSL
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (192.168.1.153:String) [Test-W
   SMan], InvalidOperationException
    + FullyQualifiedErrorId : WsManError,Microsoft.WSMan.Management.TestWSManC
   ommand

Now you have a few options depending on your platform and needs:

  • Do not install the certificate and disable certificate verification (not recommended)
  • Install to the windows certificate store if you are on windows and need to use native windows APIs like powershell remoting
  • Export the certificate to a .pem file for use with ruby based tools like chef

Ignoring certificate validation errors

This is equivalent to browsing the internet in a standard browser and trying to view an https based site with an invalid cert: the browser gives you a scary warning that you are about to go somewhere potentially dangerous, but gives you the option to go there anyway even though that's probably a really bad idea.

If you are testing, especially using a local hypervisor, the risk of a man in the middle attack is pretty small, but you didn't hear that from me. If you do not want to go through the trouble of installing the certificate (we'll go through those steps shortly), here is what you need to do:

Powershell Remoting:

$options=New-PSSessionOption -SkipCACheck -SkipCNCheck
Enter-PSSession -ComputerName <IP or host name> -Credential <user name> -UseSSL -SessionOption $options

WinRM Ruby Gem:

irb
require 'winrm'
w=WinRM::WinRMWebService.new('https://<ip or host>:5986/wsman', :ssl, :user => '<user>', :pass => '<password>', :no_ssl_peer_verification => true)

Chef's knife-windows

knife winrm -m <ip> ipconfig -x <user> -P <password> -t ssl --winrm-ssl-verify-mode verify_none

Installing to the Windows Certificate store

This is the more secure route and will allow you to interact with the machine via powershell remoting without being nagged that your certificate is not valid.

The first thing to do is download the certificate installed on the remote machine:

$webRequest = [Net.WebRequest]::Create("https://<ip or host>:5986/wsman")
try { $webRequest.GetResponse() } catch {}
$cert = $webRequest.ServicePoint.Certificate

Now we have an X509Certificate instance of the certificate used by the remote winrm HTTPS listener. So we install it in our local machine certificate store along with the other root certificates:

$store = New-Object System.Security.Cryptography.X509Certificates.X509Store `
  -ArgumentList  "Root", "LocalMachine"
$store.Open('ReadWrite')
$store.Add($cert)
$store.Close()

Having done this, we can now validate the SSL connection with Test-WSMan:

C:\> Test-WSMan -ComputerName 192.168.43.166 -UseSSL
wsmid           : http://schemas.dmtf.org/wbem/wsman/identity/1/wsmanidentity.xsd
ProtocolVersion : http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
ProductVendor   : Microsoft Corporation
ProductVersion  : OS: 0.0.0 SP: 0.0 Stack: 3.0

Now we can use tools like powershell remoting or winrs to talk to the remote machine.

Exporting the certificate to a .pem/.cer file

The above certificate store solution works great on windows for windows tools, but it won't help for many cross platform scenarios like connecting from non-windows or using chef tools like knife-windows. The WinRM gem used by tools like Chef and Vagrant takes a certificate file, which is expected to be a base64 encoded, public key only certificate file. It commonly has a .pem, .cer, or .crt extension.

On windows you can export the X509Certificate we downloaded above to such a file by using the following lines of powershell:

"-----BEGIN CERTIFICATE-----" | Out-File cert.pem -Encoding ascii
[Convert]::ToBase64String($cert.Export('cert'), 'InsertLineBreaks') |
  Out-File .\cert.pem -Append -Encoding ascii
"-----END CERTIFICATE-----" | Out-File cert.pem -Encoding ascii -Append

With this file you could use Chef's knife winrm command from the knife-windows gem to run commands on a windows node:

knife winrm -m 192.168.1.153 ipconfig -x administrator -P Pass@word1 -t ssl -f cert.pem

Problems with New-SelfSignedCertificate and openssl

If the certificate on the server was generated using New-SelfSignedCertificate, cross platform tools that use openssl libraries may fail to verify the certificate unless New-SelfSignedCertificate was used with the -CloneCert argument and passed a certificate that includes a Basic Constraints extension identifying it as a CA. Viewing the certificate's properties in the certificate manager GUI, the certificate should contain this:

[Screenshot: the certificate's Basic Constraints extension, with Subject Type=CA]

There are several alternatives to the convenient New-SelfSignedCertificate cmdlet if you need a cert that must be verified with openssl libraries:

  1. Disable peer verification (not recommended) as shown earlier
  2. Create a private/public key certificate using openssl's req command and then use openssl pkcs12 to combine those 2 files to a pfx file that can be imported to the winrm listener's certificate store. Note: Make sure to include the "Server Authentication" Extended Key Usage (EKU) not added by default
  3. Use the handy New-SelfSignedCertificateEx, available from the Technet Script Center, which provides finer grained control of the certificate properties; make sure to use the -IsCA argument:
. .\New-SelfSignedCertificateEx.ps1
New-SelfsignedCertificateEx -Subject "CN=$env:computername" `
 -EKU "Server Authentication" -StoreLocation LocalMachine `
 -StoreName My -IsCA $true

Exporting the self signed certificate on non-windows

If you are not on a windows machine, all this powershell is not going to do you much good. However, it's actually even simpler to do this with the openssl s_client command:

openssl s_client -connect <ip or host name>:5986 -showcerts </dev/null 2>/dev/null|openssl x509 -outform PEM >mycertfile.pem

The output mycertfile.pem can now be passed to the -f argument of knife-windows commands to execute commands via winrm:

mwrock@ubuwrock:~$ openssl s_client -connect 192.168.1.155:5986 -showcerts </dev/null 2>/dev/null|openssl x509 -outform PEM >mycertfile.pem
mwrock@ubuwrock:~$ knife winrm -m 192.168.1.155 ipconfig -x administrator -P Pass@word1 -t ssl -f ~/mycertfile.pem
WARNING: No knife configuration file found
192.168.1.155
192.168.1.155 Windows IP Configuration
192.168.1.155
192.168.1.155
192.168.1.155 Ethernet adapter Ethernet:
192.168.1.155
192.168.1.155    Connection-specific DNS Suffix  . :
192.168.1.155    Link-local IPv6 Address . . . . . : fe80::6c3f:586a:bdc0:5b4c%12
192.168.1.155    IPv4 Address. . . . . . . . . . . : 192.168.1.155
192.168.1.155    Subnet Mask . . . . . . . . . . . : 255.255.255.0

Authentication

As you can probably tell so far, a lot can go wrong and there are several moving parts to establishing a successful connection with a remote windows machine over WinRM. However, we are not there yet. Most of the gotchas here arise when you are using HTTP instead of HTTPS and you are not domain joined. This tends to describe 95% of the dev/test scenarios I come in contact with.

As we saw above, there is quite a bit of ceremony involved in getting SSL just right and running WinRM over HTTPS. Let's be clear: it's the right thing to do, especially in production. You can avoid the ceremony, but that just means there are other ceremonial sacrifices to be made. At this point, if you are connecting over HTTPS, authentication is pretty straightforward. If not, there are often additional steps to take. However, these additional steps tend to be less friction laden, but more heinous security-wise, than the SSL setup.

HTTP, Basic Authentication and cross-platform

Neither the Ruby WinRM gem nor the Go winrm package interacts with the native windows APIs needed to make Negotiate authentication possible, and therefore both must use Basic Authentication when using the HTTP transport. So unless you are either using native windows WinRM via winrs or powershell remoting, or using knife-windows on a windows client (more on this in a bit), you must tweak some of the WinRM settings on the remote windows server to allow plain text basic authentication over HTTP.

Here are the commands to run:

winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

One bit of easy guidance here is that if you can't use Negotiate authentication, you really really should be using HTTPS with verifiable certificates. However if you are just trying to get off the ground with local Vagrant boxes and you find yourself in a situation getting WinRM Authentication errors but know you are passing the correct credentials, please try running these on the remote machine before inflicting personal bodily harm.

I always include these commands in windows packer test images because that's what packer and vagrant need to talk to a windows box since they always use HTTP and are cross platform without access to the Negotiate APIs.

This is quite the security hole indeed but usually tempered by the fact that it is on a test box in a NATed network on the local host. Perhaps we are due for a vagrant PR allowing one to pass SSL options in the Vagrantfile. That would be simple to add.

Chef's winrm-s gem using windows negotiate on windows

Chef uses a separate gem that mostly monkey patches the WinRM gem if it sees that winrm is authenticating from windows to windows. In this case it leverages win32 APIs to use Negotiate authentication instead of Basic Authentication and therefore the above winrm settings can be avoided. However, if accessing from a linux client, it will drop to Basic Authentication and the settings shown above must then be present.

Local user accounts

Windows remote communication tends to be easier when you are using domain accounts. This is because domains create implicit trust boundaries so windows adds restrictions when using local accounts. Unfortunately the error messages you can sometimes get do not at all make it clear what you need to do to get past these restrictions. There are two issues with local accounts that I will mention:

Qualifying user names with the "local domain"

One thing that has previously tripped me up, and that I have seen others struggle with, is related to authenticating local users. You may have a local user (not a domain user) that is getting access denied errors trying to log in. However, if you prefix the user name with './', the error is resolved. The './' prefix is equivalent to '<local host or ip>\<user>'. Note that the './' prefix may not work in a windows login dialog box. In that case use the host name or IP address of the remote machine instead of '.'.
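For example, using a hypothetical local 'vagrant' account:

# './vagrant' resolves to <remote host>\vagrant on the remote machine
$creds = Get-Credential './vagrant'
Enter-PSSession -ComputerName 192.168.1.153 -Credential $creds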

Setting the LocalAccountTokenFilterPolicy registry setting

This does not apply to the built in administrator account. So if you only log on as administrator, you will not run into this. However, let's say I create a local mwrock account and even add this account to the local Administrators security group. If I try to connect remotely with this account using the default remoting settings on the server, I will get an Access Denied error if using powershell remoting or a WinRMAuthentication error if using the winrm gem. This is typically only visible on 2012R2. By default, the winrm service is running on a newly installed 2012R2 machine with an HTTP listener but without the LocalAccountTokenFilterPolicy enabled, while 2008R2 and client SKUs have no winrm service running at all. Running winrm quickconfig or Enable-PSRemoting on any OS will enable the LocalAccountTokenFilterPolicy, which will allow local accounts to log on. This simply sets the LocalAccountTokenFilterPolicy subkey of HKLM\software\Microsoft\Windows\CurrentVersion\Policies\system to 1.
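If you would rather flip just this one switch instead of running the full quickconfig, a minimal sketch of that registry change looks like this:

# Allow non built-in local administrators to authenticate remotely
Set-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System' `
  -Name LocalAccountTokenFilterPolicy -Value 1 -Type DWord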

Trusted Hosts with HTTP, non domain joined powershell remoting

There is an additional security restriction imposed by powershell remoting when connecting over HTTP in a non domain joined (workgroup) environment. You need to add the host name of the machine you are connecting to to the list of trusted hosts. This is a white list of hosts you consider OK to talk to. If there are many, you can comma delimit the list. You can also include wildcards for domains and subdomains:

Set-Item "wsman:\localhost\client\trustedhosts" -Value 'mymachine,*.mydomain.com' -Force

Setting your trusted hosts list to a single wildcard would allow all hosts:

Set-Item "wsman:\localhost\client\trustedhosts" -Value '*' -Force

You would only do this if you simply interact with local test instances and even that is suspect.

Double-Hop Authentication

Let's say you want to access a UNC share on the box you have connected to, or in any other way use your current credentials to access another machine. This will typically fail with the ever informative Access Denied error. You can enable what's called credential delegation by using a different type of authentication mechanism called CredSSP. This is only available using Powershell remoting and requires extra configuration on both the client and remote machines.

On the remote machine, run:

Enable-WSManCredSSP -Role Server

On the client there are a few things to set up. First, similar to the server, you need to enable it, but also specify a white list of endpoints. This is formatted similarly to the trusted hosts discussed above:

Enable-WSManCredSSP -Role Client -DelegateComputer 'my_trusted_host.me.org'

Next you need to edit the local security policy on the machine to allow delegation to specific endpoints. In the gpedit GUI, navigate to Computer Configuration > Administrative Templates > System > Credential Delegation and enable "Allow Delegating Fresh Credentials". Further, you need to add the endpoints you authorize delegation to. You can add WSMAN\*.my_domain.com to allow all endpoints in the my_domain.com domain. You can add as many entries as you need.
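Once both sides are configured, tell powershell remoting to use CredSSP explicitly when connecting. For example:

$creds = Get-Credential my_domain\me
Enter-PSSession -ComputerName my_trusted_host.me.org -Authentication CredSSP -Credential $creds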

Windows Nano TP 3

As of the date of this post, Microsoft has released Technical Preview 3 of its new Windows Nano flavored server OS. I have previously blogged about this super lightweight OS, but here is a winrm related bit of info that is unique to nano, as of this version at least: there are no tools to tweak the winrm settings. Neither the winrm command nor the WSMan powershell provider is present.

In order to make changes, you must edit the registry directly. These settings are located at:

 HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN
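So, for example, the unencrypted and basic auth tweaks shown earlier become direct registry edits on Nano:

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v allow_unencrypted /t REG_DWORD /d 1 /f

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v auth_basic /t REG_DWORD /d 1 /f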

Other Caveats

I've written an entire post on this topic and will not go into the same detail here. Basically I have found that once winrm is correctly configured, there is still a small subset of operations that will fail in a remote context. Any interaction with wsus is an example but please read my previous post for more. When you hit one of these road blocks, you typically have two options:

  1. Use a Scheduled Task to execute the command in a local context (see the sketch after this list)
  2. Install an SSH server and use that
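Here is a rough sketch of the scheduled task approach from option 1; the task name and script path are hypothetical:

schtasks /create /tn "LocalContextWork" /tr "powershell.exe -ExecutionPolicy Bypass -File c:\scripts\work.ps1" /sc once /st 23:59 /ru SYSTEM /f
schtasks /run /tn "LocalContextWork"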

The second option appears to be imminent and in the end will make all of this easier and perhaps render this post irrelevant.

Fixing - WinRM Firewall exception rule not working when Internet Connection Type is set to Public


You may have seen the following error when either running Enable-PSRemoting or Set-WSManQuickConfig:

Set-WSManQuickConfig : <f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="2150859113"
Machine="localhost"><f:Message><f:ProviderFault provider="Config provider"
path="%systemroot%\system32\WsmSvc.dll"><f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault"
Code="2150859113" Machine="win81"><f:Message>WinRM firewall exception will not work since one of the network
connection types on this machine is set to Public. Change the network connection type to either Domain or Private and
try again. </f:Message></f:WSManFault></f:ProviderFault></f:Message></f:WSManFault>
At line:1 char:1
+ Set-WSManQuickConfig -Force
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Set-WSManQuickConfig], InvalidOperationException
    + FullyQualifiedErrorId : WsManError,Microsoft.WSMan.Management.SetWSManQuickConfigCommand

This post will explain how to get around this error. There are different ways to do this depending on your operating system version. Windows 8/2012 workarounds are fairly sane while windows 7/2008R2 may seem a bit obtuse.

This post is inspired by an experience I had this week trying to get a customer's Chef node able to connect via WinRM over SSL. Her test node was running Windows 7 and she was getting the above error when trying to enable WinRM. Windows 7 has no way to change the connection type via a native powershell cmdlet. I had done this via the command line before on Windows 7 but did not have the commands handy. Further, it had been so long since I last changed the connection type on windows 7 via the GUI that I had to fire up my own win 7 VM and run through it just so I could relay proper instructions to this very patient customer.

So I write this to run through the different nuances of connection types on different operating systems but primarily to have a known place on the internet where I can stash the commands.

TL;DR for Windows 7 or 2008R2 Clients:

If you don't care about anything else other than getting past this error on windows 7 or 2008R2 without ceremonial pointing and clicking, simply run these commands:

$networkListManager = [Activator]::CreateInstance([Type]::GetTypeFromCLSID([Guid]"{DCB00C01-570F-4A9B-8D69-199FDBA5723B}")) 
$connections = $networkListManager.GetNetworkConnections() 

# Set network location to Private for all networks 
$connections | % {$_.GetNetwork().SetCategory(1)}

This works on windows 8/2012 and up as well, but there are much friendlier commands you can run instead. Unless you are partial to GUIDs.

Connection Types - what does it mean?

Windows provides different connection type profiles (or Network Locations), each with different levels of restrictions on what connections can be granted to the local computer on the network.

I have personally always found these types to be confusing yet well meaning. You are perhaps familiar with the message presented the first time you log on to a machine asking if you would like the computer to be discoverable on the internet. If you choose no, the network interface is given a public internet connection profile. If you choose "yes", then it is private. For me the confusion is that I equate "public" with "publicly accessible", but here the opposite applies.

Public network locations have Network Discovery turned off and restrict your firewall for some applications. You cannot create or join Homegroups with this setting. WinRM firewall exception rules also cannot be enabled on a public network. Your network location must be private in order for other machines to make a WinRM connection to the computer.

Domain Networks

If your computer is on a domain, that is an entirely different network location type. On a domain network, the accessibility of the machine is governed by your domain policies. This network location type is automatically chosen if your machine is part of an Active Directory domain.

Working around Public network restrictions on Windows 8 and up

When enabling WinRM, client SKUs of windows (8, 8.1, 10) expose an additional setting that allows the machine to be discoverable over WinRM publicly, but only on the same subnet. By using the -SkipNetworkProfileCheck switch of Enable-PSRemoting or Set-WSManQuickConfig you can still allow connections to your computer, but those connections must come from other machines on the same subnet.

Enable-PSRemoting -SkipNetworkProfileCheck

So this can work for local VMs but will still be restrictive for cloud based VMs.

Changing the Network Location

Once you answer yes or no to the initial question of whether you want to be discovered, you are never prompted again, but you can change this setting later.

This is a family friendly blog so I am not going to cover how to change the Network Location via the GUI. It can be done but you are a dirty person for doing so (full disclosure - I have been guilty of doing this).

Windows 8/2012 and up

Powershell version 3 and later expose cmdlets that allow you to see and change your Network Location. Get-NetConnectionProfile shows you the network location of all network interfaces:

PS C:\Windows\system32> Get-NetConnectionProfile


Name             : Network  2
InterfaceAlias   : Ethernet
InterfaceIndex   : 3
NetworkCategory  : Private
IPv4Connectivity : Internet
IPv6Connectivity : LocalNetwork

Note the NetworkCategory above. The Network Location is private.

Use Set-NetConnectionProfile to change the location type:

Set-NetConnectionProfile -InterfaceAlias Ethernet -NetworkCategory Public

or

Get-NetConnectionProfile | Set-NetConnectionProfile -NetworkCategory Private

The latter is convenient if you want to ensure that all network interfaces are set to a particular location.

Windows 7 and 2008R2

You will not have the above cmdlets available on Windows 7 or 2008R2. You can still change the location on the command line but the commands are far less intuitive. As shown in the tl;dr, here is the command:

$networkListManager = [Activator]::CreateInstance([Type]::GetTypeFromCLSID([Guid]"{DCB00C01-570F-4A9B-8D69-199FDBA5723B}")) 
$connections = $networkListManager.GetNetworkConnections() 

# Set network location to Private for all networks 
$connections | % {$_.GetNetwork().SetCategory(1)}

First we get a reference to a COM instance of an INetworkListManager which naturally has a Class ID of DCB00C01-570F-4A9B-8D69-199FDBA5723B. We then grab all the network connections and finally set them all to the desired location:

  • 0 - Public
  • 1 - Private
  • 2 - Domain

Released Boxstarter 2.6 now with an embedded Chocolatey 0.9.10 Beta


Today I released Boxstarter 2.6. This release brings Chocolatey support up to the latest beta release of the Chocolatey core library. In March of this year, Chocolatey released a fully rewritten version 0.9.9. Prior to this release, Chocolatey was released as a Powershell module. Boxstarter intercepted every Chocolatey call and could easily maintain state as both chocolatey and boxstarter coexisted inside the same powershell process. With the 0.9.9 rewrite, Chocolatey ships as a .Net executable and creates a separate powershell process to run each package. So there has been a lot of work to create a different execution flow, requiring changes to almost every aspect of Boxstarter. While this may not introduce new boxstarter features, it does mean one can now take advantage of all the new features in Chocolatey today.

A non breaking release

This release should not introduce any breaking functionality from previous releases. I have tested many different usage scenarios. I also increased the overall unit and functional test coverage of boxstarter in this release to be able to more easily validate the impact of the Chocolatey overhaul. That all said, there have been a lot of changes to how boxstarter and chocolatey interact, and it's always possible that some bugs have hidden themselves away. So please report issues on github as soon as you encounter problems and I will do my best to resolve them quickly. Pull requests are welcome too!

Where is Chocolatey?

One thing that may come as a surprise to some is that Boxstarter no longer installs Chocolatey. One may wonder, how can this be? Well, Chocolatey now exposes its core functionality via an API (chocolatey.dll). So if you are setting up a new machine with boxstarter, you will still find the Chocolatey repository in c:\ProgramData\Chocolatey, but no choco.exe. Further, the typical chocolatey commands (choco, cinst, cup, etc.) will not be recognized on the command line after the Boxstarter run unless you explicitly install Chocolatey.

You may do just that: install chocolatey inside a boxstarter package if you would like the machine setup to include a working chocolatey command line.

iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))

You'd have to be nuts NOT to want that.

Running 0.9.10-beta-20151210

When I say Boxstarter is running the latest Chocolatey, I really mean the latest prerelease. Why? Because it has a working version of the WindowsFeatures chocolatey feed. When the new version of Chocolatey was released, the WindowsFeatures source feed did not make it in. However, it has recently been added, and because it is common to want to toggle windows features when setting up a machine and many Boxstarter packages make use of it, I consider it an important feature to include.
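For example, a Boxstarter package script can enable a windows feature through that source like this (the feature name is just an illustration; use the DISM feature name you actually want):

cinst IIS-WebServerRole -source windowsfeatures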


Hosting a .Net 4 runtime inside Powershell v2


My wish for the readers of this post is that you find this completely irrelevant and wonder why folks would wish to inflict powershell v2 on themselves now that we are on a much improved v5. However, the reality is that many, many machines are still running windows 7 and server 2008R2 without an upgraded powershell.

As I was working on Boxstarter 2.6 to support Chocolatey 0.9.9, which now ships as a .net 4 assembly, I had to be able to load it inside of Powershell 2 since I still want to support virgin win7/2008R2 environments. Without "help", this will fail because Powershell 2 hosts .Net 3.5. I really don't want to ask users to install an updated WMF prior to using Boxstarter because that violates the core mission of Boxstarter, which is to setup a machine from scratch.

Adjusting CLR version system wide

So after some investigation I found several posts telling me what I already knew, which included the following solutions:

  1. Upgrade to WMF 3 or higher
  2. Create or edit a Powershell.exe.config file in C:\WINDOWS\System32\WindowsPowerShell\v1.0 setting the supportedRuntime to .net 4
  3. Edit the hklm\software\microsoft\.netframework registry key to only use the latest CLR

I have already mentioned why option 1 was not an option. Options 2 and 3 are equally unpalatable if you do not "own" the system since both change system wide behavior. I just want to change the behavior when my application is running.

An application scoped solution

So after more digging I found an obscure and seemingly undocumented environment variable that can impact the version of the .net runtime loaded: $env:COMPLUS_version. If you set this variable to "v4.0.30319" and then spawn a new process, that process will use the specified version of the .net runtime.

PS C:\Users\Administrator> $PSVersionTable

Name                           Value
----                           -----
CLRVersion                     2.0.50727.5420
BuildVersion                   6.1.7601.17514
PSVersion                      2.0
WSManStackVersion              2.0
PSCompatibleVersions           {1.0, 2.0}
SerializationVersion           1.1.0.1
PSRemotingProtocolVersion      2.1


PS C:\Users\Administrator> $env:COMPLUS_version="v4.0.30319"
PS C:\Users\Administrator> & powershell { $psVersionTable }

Name                           Value
----                           -----
PSVersion                      2.0
PSCompatibleVersions           {1.0, 2.0}
BuildVersion                   6.1.7601.17514
CLRVersion                     4.0.30319.17929
WSManStackVersion              2.0
PSRemotingProtocolVersion      2.1
SerializationVersion           1.1.0.1

A script that runs commands in .net 4

So given that this works, I created an Enter-DotNet4 command that allows one to run ad hoc scripts inside .net 4. Here it is:

function Enter-Dotnet4 {<#
.SYNOPSIS
Runs a script from a process hosting the .net 4 runtime

.DESCRIPTION
This function will ensure that the .net 4 runtime is installed on the
machine. If it is not, it will be downloaded and installed. If running
remotely, the .net 4 installation will run from a scheduled task.

If the CLRVersion of the hosting powershell process is less than 4,
such as is the case in powershell 2, the given script will be run
from a new powershell process that will be configured to host the
CLRVersion 4.0.30319.

.Parameter ScriptBlock
The script to be executed in the .net 4 CLR

.Parameter ArgumentList
Arguments to be passed to the ScriptBlock

.LINK
http://boxstarter.org

#>
    param(
        [ScriptBlock]$ScriptBlock,
        [object[]]$ArgumentList
    )
    Enable-Net40
    if($PSVersionTable.CLRVersion.Major -lt 4) {
        Write-BoxstarterMessage "Relaunching powershell under .net fx v4" -verbose
        $env:COMPLUS_version="v4.0.30319"
        & powershell -OutputFormat Text -ExecutionPolicy bypass -command $ScriptBlock -args $ArgumentList
    }
    else {
        Write-BoxstarterMessage "Using current powershell..." -verbose
        Invoke-Command -ScriptBlock $ScriptBlock -argumentlist $ArgumentList
    }
}

function Enable-Net40 {
    if(!(test-path "hklm:\SOFTWARE\Microsoft\.NETFramework\v4.0.30319")) {
        if((Test-PendingReboot) -and $Boxstarter.RebootOk) {return Invoke-Reboot}
        Write-BoxstarterMessage "Downloading .net 4.5..."
        Get-HttpResource "http://download.microsoft.com/download/b/a/4/ba4a7e71-2906-4b2d-a0e1-80cf16844f5f/dotnetfx45_full_x86_x64.exe" "$env:temp\net45.exe"
        Write-BoxstarterMessage "Installing .net 4.5..."
        if(Get-IsRemote) {
            Invoke-FromTask @"
Start-Process "$env:temp\net45.exe" -verb runas -wait -argumentList "/quiet /norestart /log $env:temp\net45.log"
"@
        }
        else {
            $proc = Start-Process "$env:temp\net45.exe" -verb runas -argumentList "/quiet /norestart /log $env:temp\net45.log" -PassThru 
            while(!$proc.HasExited){ sleep -Seconds 1 }
        }
    }
}

This will install .net 4.5 if not already installed and then spawn a new powershell process to run the given commands with the .net 4 runtime hosted.
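For example, calling it from a vanilla powershell v2 console should report a 4.x CLR, since the script block runs in the relaunched process:

Enter-Dotnet4 -ScriptBlock { $PSVersionTable.CLRVersion }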

Does not work in a remote shell

One scenario where this does not work is if you are remoted on a Powershell v2 machine. The .net 4 CLR will almost immediately crash. My guess is that this is related to the fact that remote shells have an inherently different hosting model and run under wsmprovhost.exe or winrshost.exe.

The workaround for this in Boxstarter is to call the chocolatey.dll in a Scheduled Task instead of using Enter-DotNet4 when running remote.
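A rough sketch of that idea, reusing the Invoke-FromTask helper from the listing above (the dll path here is purely hypothetical), might look like:

Invoke-FromTask @'
# the scheduled task runs in a local, non-remote session where the
# COMPLUS_version trick behaves as expected
$env:COMPLUS_version = "v4.0.30319"
powershell -OutputFormat Text -command {
    Add-Type -Path "C:\ProgramData\boxstarter\chocolatey.dll" # hypothetical path
}
'@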

Vagrant Powershell - The closest thing to vagrant ssh yet for windows

Using "vagrant powershell" to remote to my windows nano server

A couple years ago when I started playing with vagrant to setup both windows and linux VMs, I wished there was a powershell equivalent of Vagrant's ssh command that would drop me into a powershell session on the windows vagrant box. Sure, it's simple enough to find the IP and winrm port of the box and run Enter-PSSession with the credentials given to vagrant, but a simple "vagrant powershell" just makes it so much more seamless.

After gaining some familiarity with ruby, I submitted a PR to the vagrant github repo implementing this command. The command essentially shells out to powershell.exe and runs Enter-PSSession pointing to the vagrant guest using the same credentials that vagrant's winrm communicator uses. Additionally, it temporarily adds the vagrant winrm endpoint to the host's Trusted Host entries, restoring the original entries once the command exits.

It's been a good while since that submission, but I'm excited to see it released in this week's first minor version update since that time.

Small bug/annoyance

Note that the up and down arrow keys do not cycle through command history. While I thought that was working a ways back, it's possible that other changes around vagrant's handling of subprocesses caused this to pop up. At any rate, I just submitted a fix for that today and hope it is merged by the next bug fix release.

Only works on windows

The vagrant powershell command is only available on Windows hosts. Currently there is no cross platform implementation of the Powershell Remoting Protocol. There is a ruby based winrm repl, but it is extremely limited compared to a true powershell session. However, the vagrant powershell command does provide a --command argument for running ad hoc, non-interactive commands on the vagrant box, as shown below. It would be easy to at least support that on non-windows boxes.
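For example, something like this runs a single command on the guest and prints its output without entering an interactive session:

vagrant powershell --command "ipconfig"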

What's new in the ruby WinRM gem 1.5


It's been well over a year since Salim Afiune started the effort of making Test-Kitchen (a popular infrastructure testing tool) compatible with Windows. I got involved in late 2014 to improve the performance of copying files to Windows test instances from Windows or Linux hosts. There was a lot of dust flying in the air as this work was nearing completion in early 2015. Shawn Neal nicely refactored some of the work I had submitted to the winrm gem into a spin off gem called winrm-fs. At the same time, Fletcher Nichol was feverishly working to polish off the Test-Kitchen work by Chef Conf 2015 and was making some optimizations on Shawn's refactorings, and those found their way into a new gem, winrm-transport.

Today we are releasing a new version of the winrm gem that pulls a lot of this work into the core winrm classes and will make it possible to pull the rest into winrm-fs, centralizing ruby winrm client implementations and soon deprecating the winrm-transport gem. If you use the winrm gem, there are now some changes available that improve the performance of cross platform remote execution calls to windows, along with a couple other new features. This post summarizes the changes and explains how to take advantage of them.

Running multiple commands in one shell

The most common pattern for invoking a command in a windows shell has been to use the run_cmd or run_powershell_script methods of the WinRMWebService class.

endpoint = 'https://other_machine:5986/wsman'
winrm = WinRM::WinRMWebService.new(endpoint, :ssl, user: 'user', pass: 'pass')
winrm.run_cmd('ipconfig /all') do |stdout, stderr|
  STDOUT.print stdout
  STDERR.print stderr
end

winrm.run_powershell_script('Get-Process') do |stdout, stderr|
  STDOUT.print stdout
  STDERR.print stderr
end

Under the hood, run_cmd and run_powershell_script each make about 5 round trips to execute the command and return its output:

  1. Open a "shell" (equivalent of launching a cmd.exe instance)
  2. Create a command
  3. Request command output (potentially several calls for long running streamed output or long blocking calls)
  4. Terminate the command
  5. Close the shell

The first call, opening the shell, can be very expensive. It has to spawn a new process and involves authenticating the credentials with windows. If you need to make several calls using these methods, a new shell is spawned for each one. And things are even worse with powershell since that spawns yet another process (powershell.exe) from the command shell incurring the cost of creating a new runspace on each call. Stay tuned for a future winrm release (likely 1.6) that implements the powershell remoting protocol (psrp) to avoid that extra overhead.

While there is currently a way to make several calls in the same shell, it's not very friendly or safe (you could easily end up with orphaned processes running on the remote windows machine). Typically this is not such a big deal because most will batch up several calls into one larger script. But here's the kicker: you are limited to an 8000 character command in a windows command shell (again, stay tuned for the psrp implementation to avoid that). So imagine you are copying a 100MB file over the command line. Well, you will have to break that up into 8k chunks. While this may provide an excellent opportunity to review the six previous Star Wars episodes as you wait to transfer your music library, it's far from ideal.

Using the CommandExecutor to stream commands

This 1.5 release exposes a new class, CommandExecutor, that you can use to run several commands from the same shell. The CommandExecutor provides run_cmd and run_powershell_script methods, but these simply run a command, collect the output and terminate the command. You get a CommandExecutor by calling create_executor on a WinRMWebService instance. There are two usage patterns:

endpoint = 'https://other_machine:5986/wsman'
winrm = WinRM::WinRMWebService.new(endpoint, :ssl, user: 'user', pass: 'pass')
winrm.create_executor do |executor|
  executor.run_cmd('ipconfig /all') do |stdout, stderr|
    STDOUT.print stdout
    STDERR.print stderr
  end
  executor.run_powershell_script('Get-Process') do |stdout, stderr|
    STDOUT.print stdout
    STDERR.print stderr
  end
end

This yields an executor to a block that uses the executor to make calls. When the block completes, the shell opened by the executor will be closed.

The other pattern:

endpoint = 'https://other_machine:5986/wsman'
winrm = WinRM::WinRMWebService.new(endpoint, :ssl, user: 'user', pass: 'pass')
executor = winrm.create_executor

executor.run_cmd('ipconfig /all') do |stdout, stderr|
  STDOUT.print stdout
  STDERR.print stderr
end
executor.run_powershell_script('Get-Process') do |stdout, stderr|
  STDOUT.print stdout
  STDERR.print stderr
end

executor.close

Here we are responsible for the executor and thus for closing the shell it owns. It's important to close the shell because if it remains open, that process will continue to live on the remote machine. Will it dream?...we may never know.

Self signed SSL certificates and ssl_peer_fingerprint

If you are using a self signed certificate, you can currently use the :no_ssl_peer_verification option to disable verification of the certificate on the client:

WinRM::WinRMWebService.new(endpoint, :ssl, :user => myuser, :pass => mypass, :basic_auth_only => true, :no_ssl_peer_verification => true)

This is not ideal since it still risks "Man in the Middle" attacks. A better option, though still not completely ideal, is to use a known fingerprint and pass it to the :ssl_peer_fingerprint option:

WinRM::WinRMWebService.new(endpoint, :ssl, :user => myuser, :pass => mypass, :ssl_peer_fingerprint => '6C04B1A997BA19454B0CD31C65D7020A6FC2669D')

This ensures that all messages are encrypted with a certificate bearing the given fingerprint. Thanks here are due to Chris McClimans (HippieHacker) for submitting this feature.

Retry logic added to opening a shell

Especially if provisioning a new machine, it's possible the winrm service is not yet running when first attempting to connect. The WinRMWebService now accepts new options :retry_limit and :retry_delay to specify the maximum number of attempts to make and how long to wait in between. These default to 3 attempts and a 10 second delay.

WinRM::WinRMWebService.new(endpoint, :ssl, :user => myuser, :pass => mypass, :retry_limit => 30, :retry_delay => 10)

Logging

The WinRMWebService now exposes a logger attribute and uses the logging gem to manage logging behavior. By default this appends to STDOUT and has a level of :warn, but one can adjust the level or add additional appenders.

winrm = WinRM::WinRMWebService.new(endpoint, :ssl, :user => myuser, :pass => mypass)

# suppress warnings by raising the log level
winrm.logger.level = :error

# Log to a file
winrm.logger.add_appenders(Logging.appenders.file('error.log'))

If a consuming application uses its own logger that complies with the logging API, you can simply swap it in:

winrm.logger = my_logger

Up next: PSRP

That's it for WinRM 1.5, but I'm really looking forward to productionizing a spike I recently completed that gets the powershell remoting protocol working, which promises to improve performance and opens up other exciting cross platform scenarios.

Sane authentication/encryption arrives to ruby based cross platform WinRM remote execution

Dan Wanek's notes when first developing the ruby implementation of NTLM.

This week the WinRM ruby gem version 1.6.0 was released, and there is really just one new feature it delivers, but it is significant: NTLM/Negotiate authentication. Simply stated, this provides a safe, low friction authentication and encryption mechanism long available in native Windows remote management tooling but absent in cross platform tools that implement the WinRM protocol.

In this post I'll highlight why I think this is a big deal, share a little history behind the feature release and discuss why authentication/encryption over WinRM has historically been a problem in the cross platform ecosystem.

Tell me again? Why is this awesome?

This is awesome because it means that assuming you are connecting to a Windows Server OS of Windows 2012 R2 or later, you can now securely connect from either a Windows or Linux application leveraging the ruby WinRM gem without any preconfiguration of the target machine. Client OS SKUs (like Windows 7, 8 and 10) and server versions 2008 R2 and prior still need to have WinRM explicitly enabled.

No SSL setup and no WinRM configuration that compromises its security. Just to be clear, SSL is still a good idea and encouraged for production nodes, but if you are just trying to get your feet wet with Windows remote execution using tools like Chef or Vagrant, it is now much easier to accomplish while keeping the security bar at a sane level.

How did this come to be?

I am a developer at Chef and we use the WinRM gem inside of Knife-Windows in order to execute remote commands on a Windows node from either Windows or Linux. We use a monkey patch gem called winrm-s to provide Negotiate authentication and encryption, BUT it only works from windows workstations because it leverages the native Win32 APIs available only on Windows. Also, it's just monkey patches on top of the winrm and httpclient gems and therefore quite fragile.

My initial objective was to simply port the patches in winrm-s to winrm and possibly the httpclient gem so that we could provide the same functionality but the implementation would live downstream where it belongs.

Well, I remembered I had seen a PR in the WinRM repo that mentioned providing a Negotiate auth implementation. It was submitted by Dan Wanek, the original author of the WinRM gem. The PR was from late 2014 and I never really looked at it but filed it away in my mind as something worth looking at when the time came to drop winrm-s. So the time came and I quickly noticed that it did not seem to use any native Win32 APIs but leveraged another gem, rubyntlm (also authored by Dan Wanek), which appeared to be a pure ruby implementation of NTLM/Negotiate. That means it would work on Linux in addition to Windows, which would be really, really great.

It was pretty straightforward to get working and needed just a small amount of tweaking to be production ready - impressive since it had been dormant for over a year. Once it was up and running I verified that it indeed worked both from Windows and Linux. Nice work Dan!

The cross platform Windows remote execution landscape

I've written several posts covering different aspects of WinRM. It's a protocol that many native windows users likely take for granted and perhaps even forget they are using it when they are using its more full featured cousin: Powershell Remoting. However, those who don't have access to Powershell directly, because they either run on other operating systems that need to talk to Windows machines or are on Windows but use tools that are portable to non-windows platforms, are likely using a library that implements the WinRM protocol.

There are many such libraries - the ruby winrm gem discussed here is just one of the more popular ones, and there are many more. These are used by well known "DevOps" automation tools such as Chef, Ansible, Packer, Vagrant and others.

WinRM is a simple SOAP based client/server protocol. So the above libraries merely implement a web service client that issues requests to a WinRM Windows service and interprets the responses. Basically these exchanges result in:

  • Creating a Shell
  • Creating a Command
  • Requesting Command Output and Exit Code

The output can include both Standard Output and Standard Error streams.

Kind of like SSH but not.

What about Powershell Remoting

I'll save the details for another post, but be aware there is a more modern protocol called the Powershell Remoting Protocol (PSRP). There is no cross platform implementation that I am aware of. It uses the same SOAP based wire protocol: MS-WSMV (Web Services Management Protocol Extensions for Windows Vista). I just love the "shout out" to Vista here.

PSRP is more feature rich but more difficult to implement. I've been playing with a ruby based partial implementation and will blog more details soon. You can run powershell commands with WinRM, but you are shelling out to Powershell.exe from the traditional "fisher-price" Windows command shell.

Securing the transport

If you are using native WinRM on Windows (likely via Powershell Remoting) the most popular methods of authentication and encryption are (but are not limited to):

Negotiate

To quote the Windows Remote Management Glossary, Negotiate authentication is defined as:

A negotiated, single sign on type of authentication that is the Windows implementation of Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO). SPNEGO negotiation determines whether authentication is handled by Kerberos or NTLM. Kerberos is the preferred mechanism. Negotiate authentication on Windows-based systems is also called Windows Integrated Authentication.

Mmmmmmm. Let's all take a minute or two and think about just what this means and how it might guide our relationships and shape our perspectives on global social injustice.

Yeah, it's complex right? Anyway, it's also secure. It's much better than emailing your password to the contacts in your address book.

Kerberos

You know, this Windows Remote Management Glossary is pretty darn smart. Let's just see what it has to say about Kerberos:

A method of mutual authentication between the client and server that uses encrypted keys. For computers running on a Windows-based operating system, the client account must be a domain account in the same domain as the server. When a client uses default credentials, Kerberos is the authentication method if the connection string is not one of the following: localhost, 127.0.0.1, or [::1].

Takeaways here are that it's "mutual," "encrypted" and, importantly, "for computers." Again, better than emailing your password to the contacts in your address book.

Basic Authentication

Secrets are hard. Not only technically but emotionally as well. There is no small effort involved in hiding the truth. Well, thank goodness Basic Authentication makes it easy to share our secrets. Ah, sweet freedom. Here credentials are transmitted in plain text. I like to think of it as a "giving" protocol. Sure, we all think our secrets are worth being kept, but are they? Really? Just use Basic Authentication and you might find out.

SSL

This really isn't so much an authentication mechanism but it is a familiar means of securely transporting data from one point to another where the contents are only to be accessible to the sender and receiver. WinRM communication can use either HTTP or HTTPS (SSL). HTTP is the default but HTTPS provides an added layer of encryption.

One of the critical keys to securely using SSL is having a valid certificate issued by a reputable certification authority that serves to ensure that those on either side of the communication are who they say they are. Without this, for example using a non validated self signed certificate, you run the risk of a "Man in the Middle Attack". Not a nice man who offers to fix your tire, provide financial assistance or offer grievance counseling after the loss of a loved family member or pet. Rather a mean man who wants to take what you have and not give it back. He can do that because the authenticity of your certificate cannot be validated and therefore he can stand in the middle between you and the remote windows machine pretending to be that machine.

The dilemma of sane encryption using cross platform libraries

Some of the dilemmas I'll mention are shared in both native windows and cross platform libraries. For example, the friction of getting SSL up and running is pretty much the same on both sides. The key difference is that cross platform libraries may not (usually don't in fact) have access to all the authentication mechanisms listed above or getting them installed and configured is far less than clear.

Good news: SSL is Everywhere, Bad News: SSL is a pain to setup everywhere

Of course this statement is somewhat relative. There are those that are familiar with the basic rules of computer security and work with the knobs and levers of these mechanisms frequently enough that they are straightforward to set up. Even among the technically savvy, this is NOT the majority, and it's especially true in the world of WinRM vs. SSH. Not because SSH users are smarter, but because in Linux land SSH is so ubiquitous and pervasive that you deal with it regularly enough for it to be familiar.

Here are some points to highlight the friction and pitfalls of SSL over WinRM:

  • It's not on by default and there are several steps to set it up.
  • Without a "valid" certificate it's still not secure (but better than Basic Authentication over HTTP), and valid certificates take effort to obtain.
  • Bootstrapping problem: how do I get the valid certificate onto the remote machine before establishing a secure connection? There are several ways to do this depending on your cloud provider or image prep system, but that assumes you have a cloud provider or an image prep system.

Kerberos: I can't tell you in a paragraph how to set it up and get it working

First, just getting the right library can be the worst part - one that's compatible with your OS, architecture and the language runtime of your WinRM library. Most cross platform libraries support it, but it's less than trivial to get working.

Negotiate is likely not implemented

Well, on ruby it is now. However, it's the only non-native Windows implementation I am aware of (which does not mean that there are not others). If it is implemented - and again, it is on Ruby - it's secure and it "Just Works."

Basic auth is available everywhere, horrible everywhere and slightly painful to setup but easier than SSL

If you dig into almost all of the Readme files of the cross platform WinRM libraries, they will all tell you how to run horrible commands on your computer that put out the Welcome Mat for the bad guys. I've listed these in a few posts and for once I will not do so here.

Not only is running these commands a bad idea in general (certainly on production nodes), but again they represent a bootstrapping problem. Before you can successfully talk to the machine, they have to be run. There are ways to accomplish this, but they again rely on cloud APIs or pre-baked images.

Next steps, caveats and how can we make this even better

It's totally awesome that NTLM/Negotiate authentication is now available as a cross platform option, but it only lays down the foundation.

It's not the default and consuming applications need to "turn it on"

When using the WinRM gem, consumers must specify which authentication transport they want to use. There is no default. So today applications that specify ":plaintext, basic_auth: true" will continue to use basic authentication.

Chef and Test-Kitchen support coming very soon, and Vagrant support on the way

I am a developer at Chef and my PR for porting this into Knife-Windows is imminent. I will also be porting to Test-Kitchen's winrm transport and I plan to submit a PR to do the same for Vagrant.

Update: PRs to knife-windows, test-kitchen and vagrant are all submitted.

Still no support outside of Ruby

For instance, Ansible (Python) and Packer (Go) - both very popular tools that help manage a data center - have yet to have working Negotiate auth implementations available.

The fact is, it's hard and tedious to implement something like an encryption protocol solely from a specification PDF. Further, there are other ways to get secure, so it's not a "show stopper." However, as more become aware of the rubyntlm library, having such a library to reference makes implementation much easier.


Need an SSH client on Windows? Don't use Putty or CygWin...use Git

SSH in a native powershell window. I prefer to use console2 and enjoy judging others who don't (conemu is good too).

Every once in a while I hear of windows users trying to find a good SSH client for Windows to connect to their Linux boxes. For the longest time, a couple of the more popular choices have been Cygwin and Putty.

These still work today but I personally find the experience of both to be sub-optimal. There are lots of annoyances I find in each but the main thing they both lack is an integrated SSH experience in the shell console I use for everything else (mainly powershell) day in/day out. Cygwin and Putty run in separate console experiences. I just want to type 'ssh mwrock@blahblah' in my console of choice and have it work.

You already have the software

Ok, maybe not...but it's very likely that if you are reading this and find yourself needing to SSH here and there, you also use Git. Well, many are unaware that git for windows bundles several familiar Linux tools. Many might use these in the git bash shell.

Friends don't let friends use the git bash shell on windows

Don't get me wrong here - I'm not anti bash when I am on Linux. It's great. But I find tools like bash and cygwin offer a "worst of both worlds" experience on Windows. You don't need to run in the bash window to access SSH. You just need to make a small modification to your path. Assuming git was installed to C:/Program Files/Git (the default location), just add C:/Program Files/Git/usr/bin to your path:

$new_path = "$env:PATH;C:/Program Files/Git/usr/bin"
$env:PATH=$new_path
[Environment]::SetEnvironmentVariable("path", $new_path, "Machine")

Bam! Now type 'ssh':

C:\dev\WinRM [psrp +1 ~1 -0 !]> ssh
usage: ssh [-1246AaCfgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
       [-D [bind_address:]port] [-E log_file] [-e escape_char]
       [-F configfile] [-I pkcs11] [-i identity_file]
       [-L [bind_address:]port:host:hostport] [-l login_name] [-m mac_spec]
       [-O ctl_cmd] [-o option] [-p port]
       [-Q cipher | cipher-auth | mac | kex | key]
       [-R [bind_address:]port:host:hostport] [-S ctl_path] [-W host:port]
       [-w local_tun[:remote_tun]] [user@]hostname [command]
C:\dev\WinRM [psrp +1 ~1 -0 !]>

Don't have git?

Well that's a problem easily solved. Just grab chocolatey if you don't have it already and install git:

iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))
choco install git -params "/GitAndUnixToolsOnPath"

Note that the "GitAndUnixToolsOnPath" param sets the environment variable for you. You will need to open a new shell for git and ssh to be available in your console, and then you are ready to SSH to your heart's content!

Run Kitchen tests in Travis and Appveyor using the kitchen-machine driver


There are a few cookbook repositories I help to maintain and services like Travis and Appveyor provide a great enhancement to the pull request workflow both for contributors and maintainers. As a contributor I get fast feedback without bothering any human on whether my PR meets a minimum bar of quality and complies with the basic code style standards of the maintainers. As a maintainer, I can more easily triage incoming contributions and simply not waste time reviewing failed commits.

It's common to use travis and appveyor for unit tests and even longer running functional tests as long as they take a reasonable amount of time to run. However, in the Chef cookbook world, you usually do not see them include Test-Kitchen tests, which are the cookbook equivalent of end to end functional or integration tests. I have found myself wishing for an easy way to kick off cloud based Test-Kitchen runs triggered by pull request pushes. Often there is just no way to know for sure if a commit breaks a cookbook or if it "works" unless I manually pull down the PR and invoke the kitchen test (usually via vagrant). This takes time and I wish I could just see a green check mark or red 'X' automatically without doing anything.

This week while reviewing PRs for the Chocolatey cookbook, I set out to run the kitchen tests in appveyor and did not see any readily available way to do so. Thus the Kitchen-Machine driver was born.

In this post I'll talk about some patterns folks have used to run kitchen tests in travis and why I went looking for an alternative, what was involved in creating it (about a two hour endeavor) and how to go about using it in your cookbook repository.

Using Docker

One approach is to install and start docker inside the travis VM and then use kitchen-docker or kitchen-dokken to run the kitchen tests in the travis run. This is a very viable approach. You can fire up a docker container very quickly and most cookbooks will run fine containerized. There are a few downsides:

  • Installing and configuring docker, while fairly straightforward, may require several lines of setup script.
  • Some cookbooks may not run well in containers.
  • Not a viable approach for windows based cookbooks. Looking forward to the day it is.

Leveraging AWS or other cloud alternatives

Another popular approach is to run the kitchen tests from travis but reach out to another cloud provider to host the actual test instances. This also has its drawbacks:

  • Unless you are using the aws free tier, it's not free, and if you are, it's not fast.
  • You have to stash keys in your travis.yml.

I have an ephemeral machine, why can't I use it?

Neither of the above solutions sits well with me because it just seems suboptimal to have to bring up another "instance," whether container or cloud based, when travis or appveyor just did that for me. The beauty of these services is that they bring up an isolated and full fledged test VM, so why do I need to bring up another one?

All kitchen drivers are built to go "somewhere else"

This usually makes perfect sense. I don't want to run kitchen tests and have the test instance BE my machine because the state of that machine will be changed and I want to remain isolated from those changes and easily undo them. However in travis or appveyor, I AM somewhere else.

Yet having looked, I found no kitchen driver that would converge and test locally. The closest thing was kitchen-ssh, which simply communicates with a test instance over SSH and does not try to spin up a VM. Surely one can easily just SSH to localhost. However, it's just SSH and does not also leverage WinRM when talking to windows.

kitchen-machine, not necessarily virtual or cloud...just an available machine

I wanted a driver that could run commands using whatever kitchen transport was configured (SSH or WinRM) but would not try to interact with a hypervisor or cloud to start a separate instance. I would just point the transport to localhost and run locally.

This proved to be trivial to write. Honestly I just copied the kitchen-vagrant driver and then proceeded to delete almost all of it. The next best thing would be to create a new transport. A "local" transport that would run native shell commands locally and use the native filesystem for file operations. However, pointing SSH and WinRM to localhost seems to work just fine and requires a lot less work.

Using kitchen-machine

The kitchen-machine repo includes both travis and appveyor examples. Also, the chocolatey-cookbook repository includes a "real world" example that uses appveyor. I'll briefly walk through both appveyor and travis samples here.

The .kitchen.yml or .kitchen.{travis|appveyor}.yml file

First we'll look at the .kitchen.yml file that may be the same for either travis or appveyor. In the kitchen-machine repo, this is .kitchen.yml but most actual cookbook repositories will likely use .kitchen.travis.yml or .kitchen.appveyor.yml since they will want to use .kitchen.yml for locally run tests.

---
driver:
  name: machine
  username: <%= ENV["machine_user"] %>
  password: <%= ENV["machine_pass"] %>

provisioner:
  name: chef_zero

platforms:
  - name: ubuntu-14.04
  - name: windows-2012R2

verifier:
  name: inspec

suites:
  - name: default
    run_list:
      - recipe[machine_test]

The driver section uses the machine driver and configures a username and password for the transport. The transport being used is determined by the usual kitchen logic: either one explicitly configures the transport to use, or kitchen will use SSH unless the platform name starts with "win." This example configures the credentials from environment variables. That will become more clear when we look at the travis and appveyor yml files.

The machine driver configuration can also take a port or hostname. Hostname defaults to "localhost" and port defaults to 22 or 5985 depending on the transport used. One drawback for now is that the SSH configuration only works with passwords and not keys. This is because I could not get the travis user to cooperate using keys. I'd be delighted to accept a PR that gets that right if possible. It just was not a big deal for me.
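If the defaults don't fit, they can be spelled out explicitly in the driver section (the values shown here are simply those documented defaults for a winrm platform):

driver:
  name: machine
  hostname: localhost
  port: 5985
  username: <%= ENV["machine_user"] %>
  password: <%= ENV["machine_pass"] %>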

.travis.yml

language: ruby

env:
  global:
    - machine_user=travis
    - machine_pass=travis

rvm:
  - 2.1.7

sudo: required
dist: trusty

before_install:
  - sudo usermod -p "`openssl passwd -1 'travis'`" travis

script:
  - bundle exec rake
  - bundle exec kitchen verify ubuntu

branches:
  only:
  - master

Here we set the user name and password environment variables. We also give the travis user (what travis runs under) a password. Our test "script" runs kitchen verify against the ubuntu platform.

appveyor.yml

version: "master-{build}"

os: Windows Server 2012 R2
platform:
  - x64

environment:
  machine_user: test_user
  machine_pass: Pass@word1
  SSL_CERT_FILE: c:\projects\kitchen-machine\certs.pem

  matrix:
    - ruby_version: "21"

clone_folder: c:\projects\kitchen-machine
clone_depth: 1
branches:
  only:
    - master

install:
  - ps: net user /add $env:machine_user $env:machine_pass
  - ps: net localgroup administrators $env:machine_user /add
  - ps: $env:PATH="C:\Ruby$env:ruby_version\bin;$env:PATH"
  - ps: gem install bundler --quiet --no-ri --no-rdoc
  - ps: Invoke-WebRequest -Uri http://curl.haxx.se/ca/cacert.pem -OutFile c:\projects\kitchen-machine\certs.pem

build_script:
  - bundle install || bundle install || bundle install

test_script:
  - bundle exec kitchen verify windows

This one is a bit more involved but still not too bad. Like we did in the travis.yml, we assign our username and password environment variables. Here we actually go ahead and create a new user and make it an administrator. This is because it's tough to get at the appveyor user's password and there may be issues changing its password in the middle of an active session.

Another thing to note for windows is that we need to do some special SSL cert setup. Ruby and openssl do not use the native windows certificate store to manage certificate authorities, so we go ahead and download a CA bundle and point openssl at it with the SSL_CERT_FILE variable. Those who use the Chef-dk can be thankful that it hides this detail from you, but there is no chef-dk in this environment.

Finally our test_script invokes all tests in the windows platform.

The Results

Now I can see a complete Test-Kitchen log of converged resources and verifier test results right in my travis and appveyor build output.

Here is the end of the travis log:

And here is appveyor:


Installing and running a Chef client on Windows Nano Server


I've been really excited about Nano Server ever since it was first talked about last year. It's a major shift in how we "do windows" and has the promise of addressing many pain points that exist today for those who run Windows Server farms at scale. As a Chef enthusiast and employee, I am thrilled to be a part of the journey to making Chef on Nano a delight. Now note the word "journey" here. We are on the road to awesome, but we've just laid down the foundation and scaffolding. Heck, even Nano itself is still evolving as I write this and is on its own journey to RTM. So while today I would not characterize the experience of Chef on Nano as "delightful," I can say that it is "possible." For those who enjoy living on the bleeding edge, "possible" is a significant milestone on the way to "delight."

In this post I will share what Nano is, how to get it and run it on your laptop, install a chef client and run a basic recipe. I'll also point out some unfinished walls and let you know some of the rooms that are entirely missing. Also bear in mind that I have not toured the entire house, so you will certainly find other gaps between here and delight, and hopefully you will share. Keep in mind that I am a software developer and not a product owner at Chef, so I won't be talking about road maps or showing any Gantt charts here. I'm really speaking more as an enthusiast and as someone who can't wait to use it more and make the experience of running Chef on Nano awesome like a possum, because possums are awesome. (They really are. We have one living under our front porch and it's just so gosh darn cute!)

What is Nano and why should I care?

Some of you may be thinking, "finally, a command line text editor for windows!" Yeah, that would be great, but no, I'm not talking about the GNU Nano text editor familiar to many linux users; I am talking about Windows Nano Server. You have likely heard of Windows Server Core. Well, this takes core to the next level (or two). While Server Core does not support GUI apps (except the ones it does) but still supports all applications and APIs available on GUI based Windows Server, Nano throws backwards compatibility to the wind and slashes vast swaths of what sits inside a Windows server today, whittling it down to a few hundred megabytes.

There is no 32 bit subsystem; only 64 bit is supported. While traditional Windows server has multiple APIs for doing the same thing, Nano pares much of that down, throwing out many APIs where you can accomplish the same task using another API. Nope, no VBScript engine here. All those windows roles and features that lie dormant on a traditional Windows server simply don't exist on Nano. You start in a bare core and then add what you need.

Naturally this is not all roses and sunshine. The road to Nano will have moments of pain and frustration. We have grown accustomed to our bloated Windows and will need to adjust our workflows, and in some cases our application architecture, to exist in this new environment. Here is one gotcha many may hit pretty quickly: maybe you pull down the Couchbase MSI and try to run MSIEXEC on that thing. Well guess what...on Nano "The term 'msiexec' is not recognized as the name of a cmdlet, function, script file, or operable program." Nano will introduce new ways to install applications.

But before we get all cranky about the loss of MSIs (yeah, the "packaging" system that wasn't really one), let's think about what this smaller surface area gets us. First, an initial image file that is often less than 300MB, compared to what would be 4GB on 2012R2. With less junk in the trunk, there is less to update, so no more initial multi hour install of hundreds of updates. Fewer updates will mean fewer reboots. All of this is good stuff. Nano also boots up super duper quick. The first time you provision a Nano server, you may think you have done something wrong - I did - because it all happens so much more quickly than what you are used to.

Info graphic provided by Microsoft

Sounds great! How do I go to there?

There are several ways to get a nano server up and running, and I'll outline a few options here:

  • Download the latest Windows Server 2016 Technical Preview and follow these instructions to extract and install a Nano image.
  • Download just a VHD and import that into Hyper-V (user: administrator/pass: Pass@word1). While it successfully installs on VirtualBox, you will not be able to remote to the box.
  • Are you a Vagrant user? Feel free to "vagrant up" my mwrock/WindowsNano eval box (user: vagrant/pass: vagrant). It has a Hyper-V and a VirtualBox provider. The Hyper-V provider may "blue screen" on first boot but should operate fine after rebooting a couple times.
  • Don't want to use some random dude's box but want to make your own with Packer? Check out my post where I walk through how to do that.

Depending on whether you are a Vagrant or VirtualBox user, options 2 and 3 are by far the easiest and should get you up and running in minutes. The longest part is just downloading the image, but even that is MUCH faster than a standard windows ISO - about 6 minutes compared to 45 minutes on my FIOS home connection.

Preparing your environment

Basically we just need to make sure we can remote to the nano server and also copy files to it.

Setup a Host-Only network on VirtualBox

You will need to be able to access the box with a non localhost IP in order to transfer files over SMB. This only applies if you are using a NAT VirtualBox network (the default). Hyper-V Switches should require no additional configuration.

First create a Host-Only network if your VirtualBox host does not already have one:

VBoxManage hostonlyif create

Now add a NIC on your Nano guest that connects over the Host-Only network:

VBoxManage modifyvm "VM name" --nic2 hostonly

Now log on to the box to discover its IP address. Immediately after login, you should see its two IPs. One starts with 10. and the other starts with 172. You will want to use the 172 address.

Make sure you can remote to the Nano box

We'll establish a remote powershell connection to the box. If you can't do this now, we won't get far with Chef. So run the following:

$ip = "<ip address of Nano Server>"
Set-Item WSMan:\localhost\Client\TrustedHosts $ip
Enter-PSSession -ComputerName $ip -Credential $ip\Administrator

This should result in dropping you into an interactive remote powershell session on the nano box:

[172.28.128.3]: PS C:\Users\vagrant\Documents>

Open File and Print sharing Firewall Rule

If you are using my vagrant box or packer templates, this should already be setup. Otherwise, you will need to open the File and Print sharing rule so we can mount a drive to the Nano box. Don't you worry, we won't be doing any printing.

PS C:\dev\nano> Enter-PSSession -ComputerName 172.28.128.3 -Credential vagrant
[172.28.128.3]: PS C:\Users\vagrant\Documents> netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=yes

Updated 16 rule(s).
Ok.

Mount a network drive from the host to the Nano server

We'll just create a drive that connects to the root of the C: drive on the Nano box:

PS C:\dev\nano> net use z: \\172.28.128.3\c$
Enter the user name for '172.28.128.3': vagrant
Enter the password for 172.28.128.3:
The command completed successfully.

PS C:\dev\nano> dir z:\


    Directory: z:\


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----       11/21/2015   2:47 AM                Program Files
d-----       11/21/2015   2:47 AM                Program Files (x86)
d-----       11/21/2015   2:47 AM                ProgramData
d-----       11/21/2015   2:47 AM                Users
d-----       11/21/2015  10:50 AM                Windows

Ok. If you have gotten this far, you are ready to install Chef.

Installing Chef

Remember earlier in this post when I mentioned that there is no msiexec on Nano? We can't just download the MSI on Nano and install it like we normally would. There are some preliminary steps we will need to follow, and we also need to make sure to use just the right Chef version.

Downloading the right version of Chef

If you remember, not only does Nano lack the ability to install MSIs, it also lacks a 32 bit subsystem. Lucky for us, Chef now ships a 64 bit Windows chef client. On downloads.chef.io, the last few versions of the chef client provide a 64 bit Windows install. However, don't get the latest one. The last couple versions have a specially compiled version of ruby that is not compatible with Nano server. This will be remedied in time, but for now, download the 64 bit install of Chef 12.7.2.

PS C:\dev\nano> Invoke-WebRequest https://packages.chef.io/stable/windows/2008r2/chef-client-12.7.2-1-x64.msi -OutFile Chef-Client.msi

Extract the Chef client files from the MSI

There are a couple ways to do this. The most straightforward is to simply install the MSI locally on your host. That will install the 64 bit Chef client and, in the process, extract its files locally.

Another option is to use a tool like LessMSI, which can extract the contents without actually running the installer. This can be beneficial because it won't run installation code that adjusts your registry, changes environment variables or changes any other state on your local host that you would rather leave intact.

You can grab lessmsi from chocolatey:

choco install lessmsi

Now extract the Chef MSI contents:

lessmsi x .\chef-client.msi c:\chef-client\

This extracts the contents of the MSI to c:\chef-client. What we are interested in now is the extracted C:\chef-client\SourceDir\opscode\chef.zip. That has the actual files we will copy to the Nano server. Let's expand that zip:

Expand-Archive -Path C:\chef-client\SourceDir\opscode\chef.zip -DestinationPath c:\chef-client\chef

Note I am on Windows 10. The Expand-Archive command is only available in powershell version 5. However, if you are on a previous version of powershell, you can expand the zip using other methods.
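For example, on powershell 3 or 4 with .net 4.5 installed, the framework's ZipFile class can do the same thing:

Add-Type -AssemblyName System.IO.Compression.FileSystem
[System.IO.Compression.ZipFile]::ExtractToDirectory("C:\chef-client\SourceDir\opscode\chef.zip", "C:\chef-client\chef")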

Once this zip file is extracted by whatever means, you should now have a full Chef install layout at c:\chef-client\chef.

Editing win32-process gem before copying to Nano

Here's another area of "undelight." The latest version of the win32-process gem introduced coverage of some native kernel API functions that were not ported to Nano. So attempts to attach to these C functions result in an error when the gem is loaded. Since win32-process is a dependency of the chef client, loading the chef client will blow up.

Not to worry, we can perform some evasive maneuvers to work around this. It's gonna get dirty here, but I promise no one is looking. There are basically four lines in the win32-process source that we simply need to remove:

PS C:\dev\nano> (Get-Content -Path C:\chef-client\chef\embedded\lib\ruby\gems\2.0.0\gems\win32-process-0.8.3\lib\win32\process\functions.rb) | % { if(!$_.Contains(":Heap32")) { $_ } } | Set-Content C:\chef-client\chef\embedded\lib\ruby\gems\2.0.0\gems\win32-process-0.8.3\lib\win32\process\functions.rb

Copy the Chef files to the Nano server

Now we are ready to copy over all the chef files to the Nano Server:

 Copy-Item C:\chef-client\chef\ z:\chef -Recurse

Validate the chef install

Let's set the path to include the chef ruby and bin files and then call chef-client -v:

PS C:\dev\nano> Enter-PSSession -ComputerName 172.28.128.3 -Credential vagrant
[172.28.128.3]: PS C:\Users\vagrant\Documents> $env:path += ";c:\chef\bin;c:\chef\embedded\bin"
[172.28.128.3]: PS C:\Users\vagrant\Documents> chef-client -v
Chef: 12.7.2

Hopefully you see similar output including the version of the chef-client.

Converging on Nano

Now we are actually ready to "do stuff" with Chef on Nano. Let's start simple. We will run chef-apply on a trivial recipe:

file "c:/blah.txt" do
  content 'blah'
end

Let's copy the recipe over and converge:

[172.28.128.3]: PS C:\Users\vagrant\Documents> chef-apply c:/chef/blah.rb
[2016-03-20T07:49:38+00:00] WARN: unable to detect ipaddress
[2016-03-20T07:49:45+00:00] INFO: Run List is []
[2016-03-20T07:49:45+00:00] INFO: Run List expands to []
[2016-03-20T07:49:45+00:00] INFO: Processing file[c:/blah.txt] action create ((chef-apply cookbook)::(chef-apply recipe) line
 1)
[2016-03-20T07:49:45+00:00] INFO: file[c:/blah.txt] created file c:/blah.txt
[2016-03-20T07:49:45+00:00] INFO: file[c:/blah.txt] updated file contents c:/blah.txt
[172.28.128.3]: PS C:\Users\vagrant\Documents> cat C:\blah.txt
blah

That worked! Our file is created and we can see it has the intended content.

Bootstrapping a Nano node on a Chef server

Maybe you want to join a fleet of nano servers to a chef server. Well, at the moment, knife bootstrap windows throws encoding errors. This could be related to the fact that the codepage used by a Nano shell is UTF-8, unlike the "MS-DOS" (437) codepage used on all previous versions of windows.

Let's just manually bootstrap for now:

knife node create -d nano
# We use ascii encoding to avoid a UTF-8 BOM
knife client create -d nano | Out-File -FilePath z:\chef\nano.pem -Encoding ascii
knife acl add client nano nodes nano update

Note that the knife acl command requires the knife-acl gem, which is not shipped with chef-dk, but you can simply chef gem install it:
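chef gem install knife-acl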

Now let's create a basic client.rb file at z:\chef\client.rb:

log_level        :info
log_location     STDOUT
chef_server_url  'https://api.opscode.com/organizations/ment'
client_key 'c:/chef/nano.pem'
node_name  'nano'

Now let's converge against our server:

[172.28.128.3]: PS C:\Users\vagrant\Documents> chef-client
[2016-03-20T18:28:07+00:00] INFO: *** Chef 12.7.2 ***
[2016-03-20T18:28:07+00:00] INFO: Chef-client pid: 488
[2016-03-20T18:28:08+00:00] WARN: unable to detect ipaddress
[2016-03-20T18:28:15+00:00] INFO: Run List is []
[2016-03-20T18:28:15+00:00] INFO: Run List expands to []
[2016-03-20T18:28:15+00:00] INFO: Starting Chef Run for nano
[2016-03-20T18:28:15+00:00] INFO: Running start handlers
[2016-03-20T18:28:15+00:00] INFO: Start handlers complete.
[2016-03-20T18:28:16+00:00] INFO: Loading cookbooks []
[2016-03-20T18:28:16+00:00] WARN: Node nano has an empty run list.
[2016-03-20T18:28:16+00:00] INFO: Chef Run complete in 1.671495 seconds
[2016-03-20T18:28:16+00:00] INFO: Running report handlers
[2016-03-20T18:28:16+00:00] INFO: Report handlers complete
[2016-03-20T18:28:16+00:00] INFO: Sending resource update report (run-id: 1fb8ae2a-8455-4b94-b956-99c5a34863ea)
[172.28.128.3]: PS C:\Users\vagrant\Documents>

If you have the windows cookbook on your chef server, you can converge its default recipe by passing a run list directly:
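chef-client -r 'recipe[windows]'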

What else is missing?

The other gaping hole here is that Test-Kitchen cannot converge a Nano test instance. Two main reasons prevent this:

  1. There is no provisioner that can get around the MSI install issue
  2. Powershell calls from the winrm gem will fail because Nano's Powershell.exe does not accept the -EncodedCommand argument.

One might think that the first issue could be worked around by creating a custom provisioner to extract the chef files on the host and copy them to the instance. However, winrm-fs, used by Test-Kitchen to copy files to a windows test instance, makes ample use of the -EncodedCommand argument, so that approach runs into the second issue. Shawn Neal and I are currently working on a v2 of the winrm gem that will use a more modern protocol for powershell calls to get around the -EncodedCommand limitation and provide many other benefits as well.

Try it out!

Regardless of the missing pieces, as you can see, one can at least play with Chef and Nano today and get a glimpse of what life might be like in the future. It's no flying car, but it beats our multi gigabyte Windows images of today.

Certificate (password-less) based authentication in WinRM


This week the WinRM ruby gem version 1.8.0 released, adding support for certificate authentication. Many thanks to the contributions of @jfhutchi and @elpetak that make this possible. As I set out to test this feature, I explored how certificate authentication works in winrm using native windows tools like powershell remoting. My primary takeaway was that it was not at all straightforward to set up. If you have worked with similar authentication setups on linux using SSH commands, be prepared for more friction. Most of this is simply due to the lack of documentation and google results (well, now there is one more). Regardless, I still think that once set up, authentication via certificates is a very good thing, and many are not aware that it is available in WinRM.

This post will walk through how to configure certificate authentication, enumerate some of the "gotchas" and pitfalls one may encounter along the way and then explain how to use certificate authentication using Powershell Remoting as well as via the WinRM ruby gem which opens up the possibility of authenticating from a linux client to a Windows WinRM endpoint.

Why should I care about certificate based authentication?

First let's examine why certificate authentication has value. What's wrong with usernames and passwords? In short, certificates are more secure. I'm not going to go too deep here, but here are a few points to consider:

  • Passwords can be obtained via brute force. You can protect against this by having longer, more complex passwords and changing them frequently. Very few actually do that, and even if you do, a complex password is way easier to break than a certificate. It's nearly impossible to brute force a private key of sufficient strength.
  • Sensitive data is not being transferred over the wire. You are sending a public key and if that falls into the wrong hands, no harm done.
  • There is a stronger trail of trust establishing that the person who is seeking authentication is in fact who they say they are given the multi layered process of having a generated certificate signed by a trusted certificate authority.

It's still important to remember that nothing may be able to protect us from sophisticated aliens or time traveling humans from the future. No means of security is impenetrable.

Not as convenient as SSH keys

So one reason some like to use certificates over passwords in SSH scenarios is ease of use. There is a one-time setup "cost" of sending your public key to the remote server, but:

  1. SSH provides command line tools that make this fairly straight forward to setup from your local environment.
  2. Once setup, you just need to initiate an ssh session to the remote server and you don't have to hassle with entering a password.

Now the underlying cryptographic and authentication technology is no different in winrm, but both the initial setup and the day-to-day use of the certificate to log in are more burdensome. The details of why will become apparent throughout this post.

One important thing to consider, though, is that while winrm certificate authentication may be more burdensome, I don't think the primary use case is interactive login sessions (although that's too bad). In the case of automated services that need to interact with remote machines, these "burdens" simply need to be part of the connection automation, and it's just a non-sentient cpu that does the suffering. Let's just hope they won't hold a grudge once sentience is obtained.

High level certificate authentication configuration overview

Here is a run down of what is involved to get everything setup for certificate authentication:

  1. Configure SSL connectivity to winrm on the endpoint
  2. Generate a user certificate used for authentication
  3. Enable Certificate authentication on the endpoint. It's disabled by default on the server side and enabled on the client side.
  4. Add the user certificate and its issuing CA certificate to the certificate store of the endpoint
  5. Create a user mapping in winrm with the thumbprint of the issuing certificate on the endpoint.

I'll walk through each of these steps here. When the above five steps are complete, you should be able to connect via certificate authentication using powershell remoting or using the ruby or python open source winrm libraries.

Setting up the SSL winrm listener

If you are using certificate authentication, you must use an https winrm endpoint. Attempts to authenticate with a certificate over http will fail. You can set up SSL on the endpoint with:

$ip="192.168.137.169" # your ip might be different
$c = New-SelfSignedCertificate -DnsName $ip `
                               -CertStoreLocation cert:\LocalMachine\My
winrm create winrm/config/Listener?Address=*+Transport=HTTPS "@{Hostname=`"$ip`";CertificateThumbprint=`"$($c.ThumbPrint)`"}"
netsh advfirewall firewall add rule name="WinRM-HTTPS" dir=in localport=5986 protocol=TCP action=allow
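
Before moving on, it can be worth confirming that the new HTTPS listener actually answers. Here is a minimal ruby sanity check using the WinRM gem with plain password authentication; the IP and vagrant credentials are assumptions from my test setup, and we skip peer verification because the listener certificate is self-signed:

require 'winrm'

# Sanity check the HTTPS listener with password auth before layering
# certificate auth on top. Assumes Basic auth is enabled on the service.
winrm = WinRM::WinRMWebService.new(
  'https://192.168.137.169:5986/wsman',
  :ssl,
  :user => 'vagrant',
  :pass => 'vagrant',
  :basic_auth_only => true,
  :no_ssl_peer_verification => true
)
puts winrm.create_executor.run_cmd('hostname').stdout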

Generating a client certificate

Client certificates have two key requirements (a quick ruby check for both is sketched after this list):

  1. An Extended Key Usage of Client Authentication
  2. A Subject Alternative Name with the UPN of the user.
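
If you want to sanity check a certificate against these two requirements before fighting WinRM with it, ruby's standard OpenSSL library can dump the relevant extensions. This is just a diagnostic sketch; user.pem is assumed to be the client certificate generated later in this post:

require 'openssl'

# Inspect the two extensions certificate authentication cares about.
# Some ruby versions render the UPN otherName as "othername:<unsupported>",
# but its presence is still visible.
cert = OpenSSL::X509::Certificate.new(File.read('user.pem'))

eku = cert.extensions.find { |ext| ext.oid == 'extendedKeyUsage' }
san = cert.extensions.find { |ext| ext.oid == 'subjectAltName' }

puts "EKU: #{eku ? eku.value : 'MISSING'}" # expect TLS Web Client Authentication
puts "SAN: #{san ? san.value : 'MISSING'}" # expect the otherName/UPN entry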

Only ADCS certificates work from Windows 10/2012 R2 clients via powershell remoting

This was the step that I ended up spending the most time on. I continued to receive errors saying my certificate was malformed:

new-PSSession : The WinRM client cannot process the request. If you are using a machine certificate, it must contain a DNS name in the Subject Alternative Name extension or in the Subject Name field, and no UPN name. If you are using a user certificate, the Subject Alternative Name extension must contain a UPN name and must not contain a DNS name. Change the certificate structure and try the request again.

I was trying to authenticate from a windows 10 client using powershell remoting. I don't typically work or test in a domain environment and don't run an Active Directory Certificate Services authority. So I wanted to generate a certificate using either New-SelfSignedCertificate or OpenSSL.

In short, here is the bump I hit: powershell remoting from a windows 10 or windows 2012 R2 client failed to authenticate with certificates generated from OpenSSL or New-SelfSignedCertificate. However, these same certificates authenticated successfully from windows 7 or windows 2008 R2, and they worked on Windows 10 and 2012 R2 if I used the ruby WinRM gem instead of powershell remoting. Note that while I tested on windows 10 and 2012 R2, I'm sure that windows 8.1 suffers the same issue. The only certificates I got to work on windows 10 and 2012 R2 via powershell remoting were created via an Active Directory Certificate Services Enterprise CA.

So unless I can find out otherwise, it seems that you must have access to an Enterprise root CA in Active Directory Certificate Services and have client certificates issued in order to use certificate authentication from powershell remoting on these later windows versions. If you are using ADCS, the stock client template should work.

Generating client certificates via OpenSSL

As stated above, certificates generated using OpenSSL or New-SelfSignedCertificate did not work using powershell remoting from windows 10 or 2012 R2. However, if you are using a previous version of windows or if you are using another client (like the ruby or python libraries), then these other non-ADCS methods will work fine and do not require the creation of a domain controller and certificate authority servers.

If you do not already have OpenSSL tools installed, you can get them via chocolatey:

cinst openssl.light -y

Then you can run the following powershell, which was adapted from this bash script, to generate a correctly formatted user certificate:

function New-ClientCertificate {
  param([String]$username, [String]$basePath = ((Resolve-Path .).Path))

  $env:OPENSSL_CONF = [System.IO.Path]::GetTempFileName()

  Set-Content -Path $env:OPENSSL_CONF -Value @"
  distinguished_name = req_distinguished_name
  [req_distinguished_name]
  [v3_req_client]
  extendedKeyUsage = clientAuth
  subjectAltName = otherName:1.3.6.1.4.1.311.20.2.3;UTF8:$username@localhost
"@

  $user_path = Join-Path $basePath user.pem
  $key_path = Join-Path $basePath key.pem
  $pfx_path = Join-Path $basePath user.pfx

  openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -out $user_path -outform PEM -keyout $key_path -subj "/CN=$username" -extensions v3_req_client 2>&1

  openssl pkcs12 -export -in $user_path -inkey $key_path -out $pfx_path -passout pass: 2>&1

  del $env:OPENSSL_CONF
}

This will output a certificate and private key file both in base64 .pem format and additionally a .pfx formatted file.

Generating client certificates via New-SelfSignedCertificate

If you are on windows 10 or server 2016, then you have a more advanced version of the New-SelfSignedCertificate cmdlet than what shipped with windows 2012 R2 and 8.1. Here is the command to generate the certificate:

New-SelfSignedCertificate -Type Custom `
                          -Container test* -Subject "CN=vagrant" `
                          -TextExtension @("2.5.29.37={text}1.3.6.1.5.5.7.3.2","2.5.29.17={text}upn=vagrant@localhost") `
                          -KeyUsage DigitalSignature,KeyEncipherment `
                          -KeyAlgorithm RSA `
                          -KeyLength 2048

This will add a certificate for a vagrant user to the personal LocalComputer folder in the certificate store.

Enable certificate authentication

This is perhaps the simplest step. By default, certificate authentication is enabled for clients and disabled on the server, so you will need to enable it on the endpoint:

Set-Item -Path WSMan:\localhost\Service\Auth\Certificate -Value $true

Import the certificate to the appropriate certificate store locations

If you are using powershell remoting, the user certificate and its private key should be in the My directory of either the LocalMachine or the CurrentUser store on the client. If you are using a cross platform library like the ruby or python implementations, the cert does not need to be in the client's store at all. However, regardless of client implementation, it must be added to the server's certificate store.

Importing on the client

As stated above, this is necessary for powershell remoting clients. If you used ADCS or New-SelfSignedCertificate, then the generated certificate is added automatically. However if you used OpenSSL, you need to import the .pfx yourself:

Import-pfxCertificate -FilePath user.pfx `
                      -CertStoreLocation Cert:\LocalMachine\my

Importing on the server

There are two steps to importing the certificate on the endpoint:

  1. The issuing certificate must be present in the Trusted Root Certification Authorities of the LocalMachine store
  2. The client certificate public key must be present in the Trusted People folder of the LocalMachine store

Depending on your setup, the issuing certificate may already be in the Trusted Root location. This is the certificate used to issue the client cert. If you are using your own enterprise certificate authority or a publicly valid CA cert, it's likely you already have this in the trusted roots. If you used OpenSSL or New-SelfSignedCertificate, then the user certificate was issued by itself and needs to be imported.

If you used OpenSSL, you already have the .pem public key. Otherwise you can export it:

Get-ChildItem cert:\LocalMachine\my\7C8DCBD5427AFEE6560F4AF524E325915F51172C |
  Export-Certificate -FilePath myexport.cer -Type Cert

This assumes that 7C8DCBD5427AFEE6560F4AF524E325915F51172C is the thumbprint of your issuing certificate. I guarantee that is an incorrect assumption.

Now import these on the endpoint:

Import-Certificate -FilePath .\myexport.cer `
                   -CertStoreLocation cert:\LocalMachine\root
Import-Certificate -FilePath .\myexport.cer `
                   -CertStoreLocation cert:\LocalMachine\TrustedPeople

Create the winrm user mapping

This declares on the endpoint which certificates from a given issuing CA are allowed access. You can potentially add multiple entries for different users or use a wildcard. We'll just map our one user:

New-Item -Path WSMan:\localhost\ClientCertificate `
         -Subject 'vagrant@localhost' `
         -URI * `
         -Issuer 7C8DCBD5427AFEE6560F4AF524E325915F51172C `
         -Credential (Get-Credential) `
         -Force

Again, this assumes the thumbprint of the certificate that issued our user certificate is 7C8DCBD5427AFEE6560F4AF524E325915F51172C and that we are allowing access to a local account called vagrant. Note that if your user certificate is self-signed, you would use the thumbprint of the user certificate itself.

Using certificate authentication

This completes the setup. Now we should actually be able to log in remotely to the endpoint. I'll demonstrate this first using powershell remoting and then ruby.

Powershell remoting

C:\dev\WinRM [master]> Enter-PSSession -ComputerName 192.168.137.79 `
>> -CertificateThumbprint 7C8DCBD5427AFEE6560F4AF524E325915F51172C
[192.168.137.79]: PS C:\Users\vagrant\Documents>

Ruby WinRM gem

C:\dev\WinRM [master +3 ~0 -0 !]> gem install winrm
WARNING:  You don't have c:\users\matt\appdata\local\chefdk\gem\ruby\2.1.0\bin in your PATH,
          gem executables will not run.
Successfully installed winrm-1.8.0
Parsing documentation for winrm-1.8.0
Done installing documentation for winrm after 0 seconds
1 gem installed
C:\dev\WinRM [master +3 ~0 -0 !]> irb
irb(main):001:0> require 'winrm'
=> true
irb(main):002:0> endpoint = 'https://192.168.137.169:5986/wsman'
=> "https://192.168.137.169:5986/wsman"
irb(main):003:0> puts WinRM::WinRMWebService.new(
irb(main):004:1*   endpoint,
irb(main):005:1*   :ssl,
irb(main):006:1*   :client_cert => 'user.pem',
irb(main):007:1*   :client_key => 'key.pem',
irb(main):008:1*   :no_ssl_peer_verification => true
irb(main):009:1> ).create_executor.run_cmd('ipconfig').stdout

Windows IP Configuration


Ethernet adapter Ethernet:

   Connection-specific DNS Suffix  . : mshome.net
   Link-local IPv6 Address . . . . . : fe80::6c3f:586a:bdc0:5b4c%12
   IPv4 Address. . . . . . . . . . . : 192.168.137.169
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 192.168.137.1

Tunnel adapter Local Area Connection* 12:

   Connection-specific DNS Suffix  . :
   IPv6 Address. . . . . . . . . . . : 2001:0:5ef5:79fd:24bc:3d4c:3f57:7656
   Link-local IPv6 Address . . . . . : fe80::24bc:3d4c:3f57:7656%14
   Default Gateway . . . . . . . . . : ::

Tunnel adapter isatap.mshome.net:

   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : mshome.net
=> nil
irb(main):010:0>

Interested?

This functionality just became available in winrm gem 1.8.0 this week. This gem is used by Vagrant, Chef and Test-Kitchen to connect to remote machines. However, none of these applications yet provide configuration options to make use of certificate authentication via winrm. My personal observation has been that nearly no one uses certificate authentication with winrm, but that may be a false observation or a result of the fact that few know about this possibility.

If you are interested in using this in Chef, Vagrant or Test-Kitchen, please file an issue against their respective github repositories and make sure to @ mention me (@mwrock) and I'll see what I can do to plug this in or you can submit a PR yourself if so inclined.

A look under the hood at Powershell Remoting through a cross platform lens


Many Powershell enthusiasts don't realize that when they are using commands like New-PsSession and streaming pipelines to a powershell runspace on a remote machine, they are actually writing a binary message wrapped in a SOAP envelope that leverages a protocol with the namesake of Windows Vista. Not much over a year ago, I certainly didn't. This set of knowledge all began with needing to transfer files from a linux machine to a windows machine. In a pure linux world there is a well known tool for this called SCP. In Windows we map drives or stream bytes to a remote powershell session. How do we get a file (or a command for that matter) from one of these platforms to the other?

I was about to take a plunge deeper than I really wanted into a pool where I did not really care to swim. And today I emerge with a cross platform "partial" implementation of Powershell Remoting in Ruby. No, not just WinRM, but a working PSRP client.

In this post I will cover how PSRP differs from its more familiar cross platform cousin WinRM, why it's of value, and how one can give it a try. Hopefully this will provide an interesting perspective into what Powershell Remoting looks like from an implementor's point of view.

In the beginning there was WinRM

While PSRP is a different protocol from WinRM (Windows Remote Management) with its own spec, it cannot exist or be explained without WinRM. WinRM is a SOAP based web service defined by a protocol called Web Services Management Protocol Extensions for Windows Vista (WSMV). I love that name. This protocol defines several different message types for performing different tasks and gathering different kinds of information on a remote instance. I'm going to focus here on the messages involved in invoking commands and collecting their output.

A typical WinRM based conversation for invoking commands goes something like this (a ruby sketch of the same exchange follows the list):

  1. Send a Create Shell message and get the shell id from the response
  2. Create a command in the shell sending the command and any arguments and grab the command id from the response
  3. Send a request for output on the command id which may return streams (stdout and/or stderr) containing base64 encoded text.
  4. Keep requesting output until the command state is done and examine the exit code.
  5. Send a command termination signal
  6. Send a delete shell message
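
For the curious, here is roughly what that exchange looks like through the lower level methods of the current (v1) ruby WinRM gem, which map more or less one-to-one to the WSMV messages above. Treat it as an illustrative sketch; the endpoint and vagrant credentials are assumptions:

require 'winrm'

winrm = WinRM::WinRMWebService.new(
  'http://192.168.137.10:5985/wsman',
  :plaintext,
  :user => 'vagrant',
  :pass => 'vagrant',
  :basic_auth_only => true
)

shell_id = winrm.open_shell                              # 1. create the shell
command_id = winrm.run_command(shell_id, 'ipconfig')     # 2. create the command
output = winrm.get_command_output(shell_id, command_id)  # 3-4. poll output until done
output[:data].each { |stream| print stream[:stdout] }    # base64 decoding is handled for us
puts "exit code: #{output[:exitcode]}"
winrm.cleanup_command(shell_id, command_id)              # 5. send the termination signal
winrm.close_shell(shell_id)                              # 6. delete the shell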

The native windows tool (which nobody uses anymore) to speak pure winrm is winrs.exe.

C:\dev\winrm [winrm-v2]> winrs -r:http://192.168.137.10:5985 -u:vagrant -p:vagrant ipconfig

Windows IP Configuration

Ethernet adapter Ethernet:

   Connection-specific DNS Suffix  . : mshome.net
   Link-local IPv6 Address . . . . . : fe80::c11b:f734:5bd4:ab03%3
   IPv4 Address. . . . . . . . . . . : 192.168.137.10
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 192.168.137.1

You can turn on analytical event log messages or watch a wireshark transcript of the communication. One thing is for sure: you will see a lot of XML and a lot of namespace definitions. It's not fun to debug, but you'll learn to appreciate it after examining PSRP transcripts.

Here's an example create command message:

<s:Envelope
 xmlns:s="http://www.w3.org/2003/05/soap-envelope"
 xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
 xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:To>
 http://localhost:80/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">
 http://schemas.microsoft.com/wbem/wsman/1/windows/shell/cmd</wsman:ResourceURI><wsa:ReplyTo><wsa:Address s:mustUnderstand="true">
 http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">
 http://schemas.microsoft.com/wbem/wsman/1/windows/shell/Command</wsa:Action><wsman:MaxEnvelopeSize s:mustUnderstand="true">153600</wsman:MaxEnvelopeSize><wsa:MessageID>
 uuid:F8671978-E928-49DA-ADB8-5BF97EDD9535</wsa:MessageID><wsman:Locale xml:lang="en-US" s:mustUnderstand="false" /><wsman:SelectorSet><wsman:Selector Name="ShellId">
 uuid:0A442A7F-4627-43AE-8751-900B509F0A1F</wsman:Selector></wsman:SelectorSet><wsman:OptionSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><wsman:Option Name="WINRS_CONSOLEMODE_STDIN">TRUE</wsman:Option><wsman:Option Name="WINRS_SKIP_CMD_SHELL">FALSE</wsman:Option></wsman:OptionSet><wsman:OperationTimeout>PT60.000S</wsman:OperationTimeout></s:Header><s:Body><rsp:CommandLine
 xmlns:rsp="http://schemas.microsoft.com/wbem/wsman/1/windows/shell"><rsp:Command>del</rsp:Command><rsp:Arguments>/p</rsp:Arguments><rsp:Arguments>
 d:\temp\out.txt</rsp:Arguments></rsp:CommandLine></s:Body></s:Envelope>

Oh yeah. That's good stuff. This runs del /p d:\temp\out.txt.

Powershell over WinRM

When you invoke a command over WinRM, you are running inside of a cmd.exe style shell. Just as you would inside a local cmd.exe, you can always run powershell.exe and pass it commands. Why would anyone ever do this? Usually it's because they are using a cross platform WinRM library and it's just the only way to do it.

There are popular libraries written for ruby, python, java, Go and others. Some of these abstract away the extra powershell.exe call and make it feel like a true native powershell repl experience. The fact is that this works quite well, so why bother implementing a separate protocol? As I'll cover in a bit, PSRP is much more complicated than vanilla WSMV, so if you can get away with the simpler protocol, great.

The limitations of WinRM

There are a few key limitations with WinRM. Many of these limitations are the same limitations involved with cmd.exe:

Multiple shells

You have to open two shells (processes): first the command shell, and then a powershell instance started inside it. This can be a performance suck, especially if you need to run several commands.

Maximum command length

The command line length is limited to 8k inside cmd.exe. Now you may ask, why in the world would you want to issue a command greater than 8192 characters? There are a couple common use cases here:

  1. You may have a long script (not just a single command) you want to run. However, this script is typically fed to the -command or -EncodedCommand argument of powershell.exe so this entire script needs to stay within the 8k threshold. Why not just run the script as a file? Ha!...Glad you asked.
  2. WinRM has no native means of copying files like SCP. So the common method of copying files via WinRM is to base64 encode a file's contents and create a command that appends 8k chunks to a file.

#2 is what sparked my interest in all of this. I just wanted to copy a damn file. So Shawn Neal, Fletcher Nichol and I wrote a ruby gem that leverages WinRM to do just that. It basically does this a lot:

"a whole bunch of base64 text" >> c:\some\file.txt

It turns out that 8k is not a whole lot of data, and if you want to copy hundreds of megabytes or more, grab a book. We added some algorithms to make this as fast as possible, like compressing multiple files before transferring and extracting them on the other end. However, you just can't get around the 8k transfer size, and no performance trick is gonna make that fast.
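
To make that pain concrete, here is a stripped down sketch of the chunking approach. This is not the actual winrm-fs code and the helper is hypothetical, but it shows why copies are slow: every ~8k of base64 costs a full WSMV command round trip:

require 'base64'

# Hypothetical sketch: append a file to a remote staging file in chunks
# that fit under the 8k command length limit. A final remote powershell
# command (not shown) would decode the staged base64 back to binary.
def copy_file(winrm, shell_id, local_path, remote_path)
  b64 = Base64.strict_encode64(File.binread(local_path))
  b64.scan(/.{1,8000}/).each do |chunk|
    command_id = winrm.run_command(
      shell_id, "echo #{chunk} >> \"#{remote_path}.b64\""
    )
    winrm.get_command_output(shell_id, command_id)
    winrm.cleanup_command(shell_id, command_id)
  end
end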

More Powershell streams than command streams

Powershell supports much more than just stdout and stderr. It's got progress, verbose, etc, etc. The WSMV protocol has no rules for transmitting these other streams, so all streams other than the output stream are sent on stderr.

This can confuse some WinRM libraries and cause commands that are indeed successful to "appear" to fail. The trick is to "silence" these streams. For example, the ruby WinRM gem prepends all powershell scripts with:

$ProgressPreference = "SilentlyContinue"

Talking with Windows Nano

The ruby WinRM gem uses -EncodedCommand to send powershell command text to powershell.exe. This is a convenient way of avoiding quote hell by base64ing text that will be transferred inside XML markup. Well, Nano's powershell.exe has no -EncodedCommand argument, so the current ruby WinRM v1 gem cannot talk powershell with Windows Nano Server. Well, that simply can't be. We have to be able to talk to Nano.

Introducing PSRP

So without further ado, let me introduce PSRP. PSRP supports many message types for extracting all sorts of metadata about runspaces and commands. A full implementation of PSRP could create a rich REPL experience on non-windows platforms. However, in this post I'm gonna limit the discussion to the messages involved in running commands and receiving their output.

As I mentioned before, PSRP cannot exist without WinRM. I did not just mean that in a philosophical sense; it literally sits on top of the WSMV protocol. It's sort of a protocol inside a protocol. Running commands and receiving their response involves the same exchange illustrated above, issuing the same WSMV messages. The key difference is that instead of sending commands as plain text and receiving simple base64 encoded raw text output, the powershell commands are packaged as a binary PSRP message (or sequence of message fragments), and the response includes one or more binary fragments that are then "defragmented" into a single binary message.

PSRP Message Fragment

A complete WSMV SOAP envelope can only be so big. This size limit is specified on the server via the MaxEnvelopeSizeKB setting, which defaults to 512 on 2012 R2 and Nano Server. So a very large pipeline script or a very large pipeline output must be split into fragments.

The PSRP spec lays a fragment out as a simple binary structure:

All fragments have an object id representing the message being fragmented, and each fragment of that object has an incrementing fragment id starting at 0. E and S are each a single bit flag indicating whether the fragment is an End fragment and whether it is a Start fragment. So if an entire message fits into one fragment, both E and S will be 1.

The blob (the interesting stuff) is the actual PSRP message, and the blob length field gives the length of the blob in bytes. So the idea here is that you chain the blobs of all fragments with the same object id, in fragment id order, and that aggregated blob is the PSRP message.
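
To make the layout concrete, here is a minimal ruby sketch of packing a single fragment. The field sizes and the bit positions of E (0b10) and S (0b1) reflect my reading of the MS-PSRP spec, so treat them as assumptions:

# Pack one PSRP fragment: 8 byte object id and 8 byte fragment id
# (big-endian), one flags byte carrying the E and S bits, then a
# 4 byte blob length followed by the blob itself.
def fragment_bytes(object_id, fragment_id, blob,
                   start_fragment: true, end_fragment: true)
  flags = 0
  flags |= 0b10 if end_fragment   # E bit
  flags |= 0b01 if start_fragment # S bit
  [object_id, fragment_id].pack('Q>Q>') +
    [flags].pack('C') +
    [blob.bytesize].pack('N') +
    blob
end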

Here is an implementation of a message fragment written in ruby and here is how we unwrap several fragments into a message.

PSRP messages

There are 41 different types of PSRP messages. Here is the basic structure as described in the PSRP spec:

Destination signifies who the message is for: client or server. Message type is an integer representing which of the 41 possible message types this is, and RPID and PID represent the runspace_id and pipeline_id respectively. The data carries the "meat" of the message, and its structure is determined by the message type. The data is XML. Many powershellers are familiar with CLIXML; that's the basic format of the message data. So in the case of a create_pipeline message, the data includes the CLIXML representation of the powershell cmdlets and arguments to run. It can be quite verbose, but always beautiful. The symmetric nature of XML really shines here.
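
Here is a similarly hedged ruby sketch of wrapping a PSRP message around its CLIXML data. The little-endian integer fields, the Windows GUID byte ordering and the UTF-8 byte order mark prefix all reflect my reading of the spec, and the 0x00021006 create_pipeline message type value is an assumption worth double checking:

# Pack a PSRP message: 4 byte destination (1 = client, 2 = server) and
# 4 byte message type, both little-endian, then the 16 byte RPID and PID
# in Windows GUID byte order, then a UTF-8 BOM and the CLIXML payload.
CREATE_PIPELINE = 0x00021006 # assumed message type value

def message_bytes(runspace_pool_id, pipeline_id, message_type, clixml)
  [2, message_type].pack('VV') +
    guid_bytes(runspace_pool_id) +
    guid_bytes(pipeline_id) +
    "\xEF\xBB\xBF".b + clixml.b
end

# Windows GUIDs store the first three groups in reversed (little-endian) order
def guid_bytes(uuid)
  hex = uuid.delete('-')
  [hex[0, 8], hex[8, 4], hex[12, 4]].map { |g| [g].pack('H*').reverse }.join +
    [hex[16, 16]].pack('H*')
end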

Here is an implementation of PSRP message in ruby.

A "partial" implementation in Ruby

So as far as I am aware, the WinRM ruby gem has the first open source implementation of PSRP. It's not officially released yet, but the source is fully available and works (at least the integration tests are passing). Why am I labeling it a "partial" implementation?

As I mentioned earlier, PSRP provides many message structures for listing runspaces and commands and for gathering lots of metadata. The interests of the WinRM gem are simpler: it aims to adhere to the same basic interface it uses to issue WSMV messages (though we have rewritten the classes and methods for v2). Essentially we want to provide an SSH-like experience where a user issues a command string and gets back string based standard output and error streams as well as an exit code. This really is a "dumbed down" rendering of what PSRP is capable of providing.

The possibilities are very exciting, and perhaps we will add more true powershell REPL features in the future, but today when one issues a powershell script, we are basically constructing CLIXML that emits the following command:

Invoke-Expression -Command "your command here" | Out-String -Stream

This means we do not have to write a CLIXML serializer/deserializer, but we reap most of the benefits of running commands directly in powershell: no more multi shell lag, no more command length limitations, and hello Nano Server. In fact, our repo provides a Vagrantfile that provisions Windows Nano Server for running integration tests.

Give it a try!

I have complete confidence that there are major flaws in the implementation as it is now. I've been testing all along the way, but I'm just about to start really putting it through the wringer. I can guarantee that Write-Host "Hello World!" works flawlessly. The Hello juxtaposed starkly against the double pointed 'W' and ending in the minimalistic line on top of a dot (!) is pretty amazing. The readme in the winrm-v2 branch has been updated to document the code as it stands now and, assuming you have git, ruby and bundler installed, here is a quick rundown of how to run some powershell using the new PSRP implementation:

git clone https://github.com/WinRb/WinRM
git fetch
git checkout winrm-v2
bundle install
bundle exec irb

require 'winrm'
opts = { 
  endpoint: "http://myhost:5985/wsman",
  user: 'administrator',
  password: 'Pass@word1'
}
conn = WinRM::Connection.new(opts)
conn.shell(:powershell) do |shell|
  shell.run('$PSVersionTable') do |stdout, stderr|
    STDOUT.print stdout
    STDERR.print stderr
  end
end

The interfaces are not entirely finalized, so things may still change. The next steps are to refactor the winrm-fs and winrm-elevated gems to use this new winrm gem and to make sure it works with vagrant and test-kitchen. I can't wait to start collecting benchmark data comparing file copy speeds using this new version against the one in use today!
