Chef Quick Tip: Test Your Guards

I recently had an issue after adding the reboot Chef resource to one of my cookbooks.  I kept receiving the following errors during the Kitchen converge step and couldn’t figure out what was going on:

>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: 1 actions failed.
>>>>>>      Failed to complete #converge action: [No connection could be made because the target machine actively refused it. - No connection could be made because the target machine actively refused it. - connect(2) for "10.40.20.127" port 5985 (10.40.20.127:5985)] on default-w2008r2
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: 1 actions failed.
>>>>>>      Failed to complete #converge action: [[WSMAN ERROR CODE: 2150858843]: <f:WSManFault Code='2150858843' Machine='10.40.20.127' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message>The WS-Management service cannot process the request because the request contained invalid selectors for the resource. </f:Message></f:WSManFault>] on default-w2008r2
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

The Recipe

My recipe was similar to the example implementation on the reboot documentation page:

reboot 'now' do
  action :nothing
  reason 'Cannot continue Chef run without a reboot.'
  delay_mins 2
end

execute 'foo' do
  command '…'
  not_if '…'
  notifies :reboot_now, 'reboot[now]', :immediately
end

Notice I also added a not_if guard to ensure I only rebooted if it was truly needed.  In my case, I only expected it to reboot once.

Troubleshooting

At first I thought the converge run was timing out while waiting for the reboot to complete.  The reboot functionality added a few settings to the provisioner config to control this.

There are three new settings on the provisioner:

  • retry_on_exit_code – which takes an array of exit codes that can indicate that kitchen should retry the converge command. Defaults to an empty array.
  • max_retries – number of times to retry the converge before passing along the failed status. Defaults to 1.
  • wait_for_retry – number of seconds to wait between converge attempts. Defaults to 30.

https://discourse.chef.io/t/test-kitchen-1-10-0-released/8721

The Solution

After several hours I finally questioned whether my guard was working as intended.  After some further testing, I found that the guard was always returning false, so the execute resource ran on every converge and notified the reboot each time.  This produced a reboot loop.

Chef unit tests don’t execute guard scripts; you can only mock them.  So always verify that your guards behave as expected by running them on the target server.
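
One rough way to check a string guard outside of a Chef run is to execute the same command by hand on the target server and look at its exit status, since a not_if command guard skips the resource only when the command exits with 0. The sketch below is plain Ruby, not Chef's own implementation; not_if_skips? is a hypothetical helper, and the 'exit 0' and 'exit 1' commands are placeholders for a real guard command.

```ruby
# Hypothetical helper (not part of Chef): mimics how a string not_if guard
# is interpreted. A not_if guard skips the resource when its command exits 0.
def not_if_skips?(command)
  # Kernel#system returns true for exit status 0, false for a non-zero
  # status, and nil when the command could not be started at all.
  system(command, out: File::NULL, err: File::NULL) == true
end

# Placeholder commands standing in for the real guard command.
puts not_if_skips?('exit 0') # exit status 0: the resource would be skipped
puts not_if_skips?('exit 1') # non-zero status: the resource (and its reboot notification) would run
```

If the helper reports false on a server that is already in the desired state, the guard is not doing its job, and the execute resource will keep notifying the reboot on every converge.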

If this quick tip was helpful to you, please share how in the comments below.
