ESXi Config Backup & Restore

The state.tgz file

One nice thing that only a few people know about is ESXi’s configuration file, “state.tgz”. If you’re running ESXi from an SD card or a USB stick, the boot image is stored on that card or stick.

The boot process

During boot, ESXi loads the configuration from /bootbank/state.tgz and extracts its contents to /etc. While the host is running, changes to the configuration files are regularly backed up into state.tgz and written to /bootbank again.

Advantage

Imagine the day ESXi tells you there’s something wrong with /bootbank. On closer inspection you see that the SD card has suddenly died. First of all (and yes, do realize this) you can be happy: ESXi runs from memory once it has booted, so the host and all VMs are still up and running. You’ll be able to vMotion everything off that host and put it into maintenance mode.

Now is the right time, and the last chance, to create a config backup, if you haven’t already 🙂

Create a config Backup

To back up your ESXi configuration from the USB stick / SD card, follow these steps:

  1. Log on to your ESXi host using SSH
  2. Run the auto-backup.sh script (/sbin/auto-backup.sh) to make sure an up-to-date host configuration is saved in /bootbank/state.tgz
  3. scp the file /bootbank/state.tgz to a safe location
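
Before relying on the copy, it can be worth checking on your workstation that the archive is readable. The following is a minimal Python sketch, assuming the layout implied by the restore steps in this post (state.tgz contains a member named local.tgz, which in turn holds the etc tree). The function name list_backup_contents is made up for illustration.

```python
import io
import tarfile


def list_backup_contents(state_tgz_path):
    """Return the member names of the local.tgz nested inside state.tgz.

    Hypothetical helper: assumes state.tgz holds a member called local.tgz,
    as the restore steps in this post suggest.
    """
    with tarfile.open(state_tgz_path, "r:gz") as outer:
        inner_file = outer.extractfile("local.tgz")  # KeyError if missing
        if inner_file is None:
            raise ValueError("local.tgz is not a regular file in the archive")
        # Read the nested archive into memory; config backups are small.
        inner_bytes = inner_file.read()
    with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:gz") as inner:
        return inner.getnames()
```

Calling list_backup_contents("state.tgz") on the copied file should return a list of paths; if it raises an error instead, the backup is probably not usable.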

Reinstall ESXi

If you’re in the same situation as I was, it’s now time to shut down the host and reboot. As you’ll see, the host will not come up, because of the defective SD card. 😛 Now insert a new SD card, boot from the ESXi installation media and start from scratch.

After ESXi is installed again, just give it an IP address and a root password so you can connect using SSH, and continue with the restore process.

Restore Configuration

To restore the configuration:

  • scp the backup state.tgz to /tmp
  • Log in using SSH
  • cd /tmp
  • tar -xvzf /tmp/state.tgz
  • cd /
  • mv local.tgz local.tgz.old
  • cp /tmp/local.tgz .
  • tar -xzvf local.tgz
  • Reboot the ESXi Host

The Host now starts with the restored configuration from your state.tgz file.

Source:
http://kb.vmware.com/kb/2043048


Debugging Bluescreens using WinDbg

When Windows stops with a “Bluescreen of Death” (BSOD for short), there’s a chance that just a single driver is causing the issue, for example if you’ve just installed an update or something new.

If a BSOD occurs, Windows writes either a minidump file to the c:\windows\Minidump folder or a full memory dump to c:\windows\memory.dmp (replace c:\windows\ with your %systemroot%). The dump file can be read using Microsoft’s debugging tools, included in the Windows SDK here:

Debugging Tools
http://msdn.microsoft.com/en-us/windows/hardware/hh852365.aspx

The SDK contains a set of tools, but you only need to select the Debugging Tools during setup. After setup, you’ll find “Debugging Tools x64” in your Start menu, hidden under “Windows Kits”. When you start WinDbg, you may think you’ve started a 16-bit application, but it only looks like one.

Configure Symbol Path

Before opening a crash dump, the symbol path has to be set. Instead of downloading several gigabytes of symbol data, you can enter the HTTP address of Microsoft’s online symbol server.

  • File -> Symbol File Path
  • Enter the following:

SRV*http://msdl.microsoft.com/download/symbols

Open a Crash Dump

Now, open the crash dump file:

  • File -> Open Crash Dump

A new window opens. If you skim over the first 50 lines of text, you’ll see that you have to enter a command to start an analysis. At the bottom of the new window, there’s a “kd>” prompt. Enter:

!analyze -v

The first output after the command is the STOP error; some pages further down you get an “IMAGE_NAME” and other details such as the driver name.
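
If you save the analysis output to a text file (for example by copying it out of the command window), the interesting fields can be pulled out with a few lines of Python. This is only a sketch: it assumes the output contains “NAME: value” lines such as IMAGE_NAME, the MODULE_NAME field is an assumption on my side, and the sample text is made up, not real debugger output.

```python
import re

# Fields to look for. IMAGE_NAME is the one mentioned above;
# MODULE_NAME is assumed to appear in the same "NAME: value" form.
FIELDS = ("IMAGE_NAME", "MODULE_NAME")


def extract_fields(analysis_text, fields=FIELDS):
    """Return a dict with the first value found for each requested field."""
    found = {}
    for line in analysis_text.splitlines():
        match = re.match(r"^\s*([A-Z_]+):\s*(\S.*)$", line)
        if match and match.group(1) in fields and match.group(1) not in found:
            found[match.group(1)] = match.group(2).strip()
    return found


# Made-up sample, only to show the expected shape of the input.
SAMPLE = """\
MODULE_NAME: exampledrv
IMAGE_NAME:  exampledrv.sys
"""

print(extract_fields(SAMPLE))
```

The driver file named in IMAGE_NAME is usually the first thing to search for.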

ESXi Pink Screen of Death (PsOD)

Have you ever seen one? 🙂

This was my first:

psod

How it started

But let me start from the beginning. We’re using Veeam backup software installed as a VM, backing up all other VMs on the same and another host in a shared vCenter environment. As soon as Veeam tried to back up VMs from its own host, ESXi ran into this pink screen.

Analysis

The screenshot tells us something about Exception 14 in world 34461 (line 2) and, even more informative, there’s another hint: “…E1000PollRxRing@vmkernel…”. You’ll find it on the first line beginning with 0x41239a…

Solution

VMware recommends using vmxnet3 network adapters wherever possible. We didn’t, and actually ran into a problem here. Since we changed just the network card of the Veeam VM to vmxnet3, we haven’t seen a pink screen again. Unfortunately. 😛

VMDirectPath owns the local SCSI Controller

Yesterday I witnessed an exciting side of VMware’s VMDirectPath feature on ESXi. For those who don’t know: DirectPath allows you to attach PCI devices directly to a VM.

In our case we installed a second SCSI adapter to attach a tape drive to a VM. On the ESX host, this is configured using VMDirectPath. Unfortunately, fast hands chose the wrong adapter. With only two adapters in the host, that means the SCSI adapter of the disks ESX is installed on 😛 From then on it isn’t possible to write any configuration changes to disk, because ESX no longer has access to the SCSI controller.

ESX boots up normally. Once the OS is up, the SCSI adapter is passed through to the DirectPath module. From then on, you see error messages in the logs (ALT+F12 on the console).

So, how to get back?

  • Boot a Linux live CD and change the configuration? Didn’t work in our case, for whatever reason.
  • Reinstall ESX: no.
  • Change the configuration while ESX is running. But how, if the filesystem is read-only?

Here’s my How-To

Run the following on the ESX console (DCUI) to get a list of which HBA is owned by whom:

pic1

Now let’s assign vmhba1 back to vmkernel:

pic2

Now the kernel owns the HBA, but before changes can be written to disk, a rescan is required.

pic3

Now let’s modify esx.conf to permanently assign vmhba1 to vmkernel.

pic4

To search for a string in vi, use “/”: just type “/vmhba1” and hit Enter, and vi will jump to the line where you can see that vmhba1 is assigned to passthrough instead of vmkernel. Some other vi hints for editing:

  • delete text using the DEL key
  • before typing, enter insert mode by pressing key “i”
  • after typing exit insert mode using ESC key
  • to save and exit: “:wq” and ENTER

Now reboot to test the new setting. After that, we’re finally able to assign the right HBA to passthrough 😉

pic5

Windows Update error 800B0001

I’ve done some quick research with Google and found the following.

http://windows.microsoft.com/en-GB/windows7/Windows-Update-error-800B0001

If you receive Windows Update error 800b0001, it means that Windows Update or Microsoft Update cannot determine the cryptographic service provider, or a file Windows Update requires (named catalog store) is corrupted. The System Update Readiness Tool can correct some conditions that cause this error.

Article KB947821 explains how to use DISM in Server 2012 and Win8 to scan the image health. For “older” operating systems, there’s a tool (the System Update Readiness Tool) that can help repair Windows Update.

http://support.microsoft.com/?kbid=947821

So in Server 2012 and Win8, just run the following commands as elevated admin:

DISM.exe /Online /Cleanup-image /Scanhealth
DISM.exe /Online /Cleanup-image /Restorehealth

Run Windows Update again; the error should hopefully be gone.

Windows temporary profile

If you log in to a computer and get a message telling you “you’ve been logged on with a temporary profile”, you can often solve the problem by just rebooting the computer. But sometimes there’s a bigger fault in the background. For that case, a colleague gave me this hint:

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList

Under the ProfileList subkey, delete the subkey that is named <<SID>>.bak
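
To see whether such a subkey exists before opening regedit, the ProfileList subkeys can be listed with Python’s standard winreg module (Windows only). A minimal sketch: the helper names are mine, and the code only reads the registry, it does not delete anything.

```python
import sys

PROFILE_LIST = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList"


def find_bak_subkeys(subkey_names):
    """Return the subkey names that end in .bak (case-insensitive)."""
    return [name for name in subkey_names if name.lower().endswith(".bak")]


def list_profile_subkeys():
    """Read the ProfileList subkey names from HKLM (Windows only)."""
    import winreg  # standard library, but only available on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PROFILE_LIST) as key:
        subkey_count = winreg.QueryInfoKey(key)[0]
        return [winreg.EnumKey(key, i) for i in range(subkey_count)]


if sys.platform == "win32":
    # Print every ProfileList subkey that carries the .bak suffix.
    for name in find_bak_subkeys(list_profile_subkeys()):
        print(name)
```

If nothing is printed, there is no .bak subkey and the hint above does not apply.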

TCP/UDP Checksum Offload on RealTek NIC

I just wanted to do my cousin a favor and take a look at the new computer he bought at a local IT store. He told me it’s kinda slow. Unfortunately, it wasn’t just taking a look…

Characteristics of the problem

Freshly installed, and also after reinstalling from a recovery DVD, the computer kept hanging while surfing the internet. Speed was slow and some websites did not load, mostly HTTPS sites. In his case it was the eBanking software that didn’t work.

Troubleshooting

My first thought was anti-virus software or firewalls: no success. The anti-virus is not scanning traffic, and Windows Firewall has rules that allow all outgoing and the right incoming traffic.

Second thought: the computer is slow because it’s downloading over 100 Windows updates in the background. I took the time to download and install all updates. Maybe one of the updates would solve the problem. No success.

Third thought: there must be some tool blocking the traffic. I uninstalled almost everything I didn’t recognize and disabled every pointless service. No success.

Fourth thought: network issues. BANG! Success. Here’s how I analyzed that.

Analyze the unsuccessful network connections

Because Teamviewer didn’t work either, I decided to use it to produce the example traffic to analyze. But this should work with an HTTPS site as well, I’m sure.

Network Traffic logging:

  • download Wireshark and install it directly on the computer
  • start Wireshark with no filters and without promiscuous mode
  • start Teamviewer and wait until the connection is established
  • stop the Wireshark capture
  • set and apply a filter “ip.addr == my.computers.ip.address”

Teamviewer normally connects quickly to its servers and shows a green light in the bottom left pane to tell you it’s ready. On the computer with the issue, Teamviewer started with a red light, went to orange and tried to connect. Some seconds later it went back to red, then orange and finally green.

The captured traffic in Wireshark had a lot of black lines from the local IP to an Internet IP of Teamviewer. When I selected such a packet and opened the TCP part in the middle pane, it looked like this:

Nice of Wireshark: it tells me directly what’s wrong here. But what’s checksum offload?! After a search on Wikipedia:

TCP offload engine or TOE is a technology used in network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant.

Source: http://en.wikipedia.org/wiki/TCP_offload_engine

Nice, but my NIC is a standard 1 Gbit/s one connected to my DSL line (5 Mbit/s). I don’t need that stuff here. How come a manufacturer thinks it’s necessary to implement such server / datacenter features on a normal workstation? Yes, for IT guys it’s nice to have, but should it be enabled by default?
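
For reference, the checksum being offloaded is simple to compute in software. The sketch below implements the 16-bit one’s-complement Internet checksum described in RFC 1071, which TCP and UDP use (in the real protocols it also covers a pseudo-header, which is left out here). The function name is made up for illustration. With offloading enabled, the operating system leaves this field for the NIC to fill in, which is why a capture taken on the sending machine can show outgoing packets with wrong checksums.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the RFC 1071 Internet checksum of data as a 16-bit integer."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF  # one's complement of the folded sum


payload = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
checksum = internet_checksum(payload)
print(hex(checksum))
# A receiver recomputes over payload plus checksum and expects zero.
print(internet_checksum(payload + checksum.to_bytes(2, "big")))
```

If the NIC gets this calculation wrong, the receiver drops the packet, which would fit the hangs and retries described above.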

Disable TCP Checksum Offload

To disable offloading, I opened the network card’s advanced settings.

Step 1, open Network Properties:

and then press “Configure” (“Konfigurieren” in the German screenshot).

Step 2, in the next dialog go to advanced (“Erweitert”) and search for TCP offloading. There’s a lot about offloading, but what we need is TCP and UDP checksum offloading on IPv4.

Left side “Eigenschaft” means “Property” and right side “Wert” means “Value”. The value of “TCP Prüfsummenabladung” (means TCP checksum offloading) is set to “Rx & Tx aktiviert” (Rx & Tx activated).

After setting this to disabled for both TCP and UDP, everything went back to normal. Teamviewer works, eBanking works, everything. Wireshark also just logs valid successful connections from now on.

Weird experience.