Citrix Provisioning Services High Availability

On 24 June 2013 by Pete Petersen

Image sources:
http://s3images.coroflot.com/
http://www.squareicon.com/

While working with a recent customer, we ran through endless failover testing scenarios, actually failing components in the implemented environment, and discovered several things. One of them remains outstanding, and there is no easy answer for it.

Citrix Provisioning Services (PVS) has several considerations to ensure High Availability (HA) for each component, explained below. For PVS to work correctly, every step of the boot process has to succeed, so a detailed chain of events for that process follows. I will point out the potential single points of failure (SPOF) and then a solution for each.

Boot Process Details: DHCP Options 66 & 67

This first process is how I’ve seen PVS set up in nearly every implementation (TFTP is the SPOF):

  • IP Acquisition (SPOF: DHCP Server)
    • Target device (e.g., XA server) broadcasts DHCP Discover
    • DHCP Server sends DHCP Offer packet
    • Target device broadcasts a DHCP Request for the offered IP address
      (The broadcast also tells any other DHCP servers that the offer has been accepted)
    • DHCP Server sends DHCPACK to target device
  • Bootstrap Download (SPOF: TFTP Server)
    • DHCP Server sends TFTP Server Name (option 66) and Bootfile Name (ardbp32.bin, option 67) with DHCPACK
    • Bootstrap file is downloaded from TFTP server
  • PVS Logon (SPOF: PVS Server)
    • Get Login Port: Target Device contacts the PVS Server specified in the bootstrap file using default UDP port 6910
    • Login Start: Target Device identifies itself by its MAC and the type of login requested
    • Transferred to I/O: PVS Server moves Target Device from the login thread to the I/O thread
    • I/O Response: PVS Server replies to Target Device with all disk, client, and policy info needed
    • Get I/O Port: Target Device requests the IP address and port used for single read mode
    • Get I/O Service: Target Device requests that the PVS Server start the I/O thread and requests information on which vDisk to use
    • Login Complete: PVS Server grants access for I/O operation to Target Device and sends config specifying the boot device
    • Get vDisk Info: Target Device requests the specific vDisk
    • vDisk Response: PVS Server replies with vDisk info, including the write cache location (if Target is in standard mode)
    • [All PVS Servers are capable of acting as both a login server and an I/O server. This is important for redundancy for running Target devices with a live-failed PVS server.]
  • Single Read Mode (SPOF: vDisk Image, PVS Server)
    • Bootstrap file now intercepts any requests made to interrupt 13 (e.g., hard disk requests)
    • Windows OS starts loading drivers
    • BNISTACK successfully loaded
  • BNISTACK / MIO (SPOF: vDisk Image, PVS Server)
    • Target Device handshakes with PVS Server once the BNISTACK driver is up
    • BNISTACK is loaded into memory and takes over from the bootstrap, managing MIO communication
    • Information is exchanged
      • vDisk name
      • Image Mode
      • Active Directory Password Management Option
      • Write Cache Types and Size
      • Client Name
      • Licensing
    • Target Device is operational with read/writes
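
The SPOF annotations in the list above can be read as a small dependency model. The following is a minimal sketch, not part of PVS itself: the stage and component names are copied from the list, while the `Stage` type, the function names, and the inventory counts are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    depends_on: tuple  # component names this stage cannot run without


# Stages and dependencies as annotated in the boot process list above.
BOOT_CHAIN = (
    Stage("IP Acquisition", ("DHCP Server",)),
    Stage("Bootstrap Download", ("TFTP Server",)),
    Stage("PVS Logon", ("PVS Server",)),
    Stage("Single Read Mode", ("vDisk Image", "PVS Server")),
    Stage("BNISTACK / MIO", ("vDisk Image", "PVS Server")),
)


def single_points_of_failure(instances):
    """Return {component: [stages]} for components with fewer than two instances."""
    spofs = {}
    for stage in BOOT_CHAIN:
        for component in stage.depends_on:
            if instances.get(component, 0) < 2:
                spofs.setdefault(component, []).append(stage.name)
    return spofs


def first_blocked_stage(failed):
    """Return the first stage that cannot complete when the given components are down."""
    for stage in BOOT_CHAIN:
        if any(component in failed for component in stage.depends_on):
            return stage.name
    return None


if __name__ == "__main__":
    # Hypothetical inventory: the common setup, where only TFTP has no second instance.
    inventory = {"DHCP Server": 2, "TFTP Server": 1, "PVS Server": 2, "vDisk Image": 2}
    print(single_points_of_failure(inventory))
    print(first_blocked_stage({"TFTP Server"}))
```

Counting instances is deliberately crude. As the notes on each component show, two PVS servers only count as redundant if their bootstrap settings and vDisk copies are also correct.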

SPOF Notes for Boot Process w/Options 66 & 67:

  • DHCP Server
    DHCP services normally follow Active Directory from an HA perspective, but redundancy here is critical if true HA is required for an environment. No Target Device (XA/XD boxes) will boot without a DHCP service (serving options 66 & 67) on its subnet.
  • TFTP Server (in the most common setup, this piece is missing from HA efforts)
    This is one of the trickiest pieces to make redundant, and it’s the one missed in nearly every environment I’ve seen implemented to date. The reason is that most DHCP clients (including the PVS TFTP service request and the Windows OS) can only use one TFTP record. If option 66 contains two entries (separated by a semicolon), the client can only use the first one. I’ll talk about options for this under TFTP SPOF Solution Options below. As stated, in most scenarios, this is a true single point of failure.
  • PVS Server
    PVS Servers that connect to the same farm database effectively “know” about each other. However, that’s not enough: the Target Devices also need to “know” about the PVS servers. The trick here is to ensure that the bootstrap settings (under Configure Bootstrap) for each server are correct. Each should list itself first and then the other PVS server(s) in order. When a PVS server fails, the target devices have to go through the PVS Logon process (above) again. If one of the servers is missing the bootstrap settings, the target device will not be able to go through the login process and will not be able to find its boot image again (basically dropping the hard drive out from under the OS). This is the symptom we saw during live failover testing at FBT.
  • vDisk Image
    This is important as well. Even if all of the above components are fully redundant, the connection will fail if the secondary PVS server does not contain the correct version of the boot image. Likewise, during a PVS outage, a target device logging into a secondary PVS server will fail to find the right image file, and the image will drop out from under the OS. In our scenarios, acceptable ways to get the boot image around to the various PVS servers are:

    • Manual copy
    • Scripted manual copy
    • DFS-R (this is tricky)

The final step in this process is to import the .xml file into the receiving PVS servers so they know about the new version(s) of the boot image. A matrix will follow regarding the pros and cons of each of these that we can distribute to MBG.
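
Whichever copy method is used, it helps to verify that the secondary store really holds the same image files before relying on it for failover. The following is a minimal sketch of such a check, assuming both stores are reachable as directories (local or UNC paths). The extension list and all function names are assumptions for illustration, not part of any Citrix tooling.

```python
import hashlib
from pathlib import Path

# Assumption: these extensions cover the vDisk files of interest in a store.
VDISK_EXTENSIONS = {".vhd", ".avhd", ".pvp"}


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large images do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_manifest(store_dir):
    """Map each vDisk file name in a store to its (size, digest)."""
    manifest = {}
    for path in sorted(Path(store_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in VDISK_EXTENSIONS:
            manifest[path.name] = (path.stat().st_size, file_digest(path))
    return manifest


def compare_stores(primary_dir, secondary_dir):
    """Return files that are missing from, or differ on, the secondary store."""
    primary = store_manifest(primary_dir)
    secondary = store_manifest(secondary_dir)
    missing = sorted(name for name in primary if name not in secondary)
    different = sorted(
        name for name in primary
        if name in secondary and primary[name] != secondary[name]
    )
    return {"missing": missing, "different": different}
```

Hashing full images is slow on large vDisks, so a scripted copy might compare sizes and timestamps first and hash only when those match. An empty result from `compare_stores` says the files match. It does not replace the import step, which is what tells the receiving PVS server about the new version.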

 

TFTP SPOF Solution Options

  1. DHCP options 66 & 67 (most common implementation)
    The trouble with this option is that most DHCP clients can only process one entry for option 66, even PVS TFTP and the Windows OS. This means that the TFTP service is still a SPOF (see the SPOF notes above).
  2. DHCP options 66 & 67 w/DNS Round Robin
    This helps, but does not eliminate SPOF. DNS Round Robin is not intelligent. A target device (XenApp server) can still be sent to a dead server.
  3. DHCP options 66 & 67 w/multiple entries
    This helps as well, but a target device (XenApp server) can still be sent to a dead server.
  4. Proxy DHCP (PVS PXE)
    This is a good option in many scenarios, but the trick is that the PXE server (PVS server) has to be in the same subnet as the target device.
  5. NetScaler w/USIP Address Mode
    This seems easy enough: Set up a VIP to use in DHCP option 66. But also needed is USIP (Use Source IP) instead of Mapped IP or Subnet IP (MIP or SNIP), which means we need to change the default gateway of the TFTP server to an IP on the NetScaler (the MIP or SNIP). This means that all PVS traffic is now routing through the NetScaler, including the image streams. This is certainly less than ideal.
  6. NetScaler w/DSR
    This option is a bit complicated, as explained here: http://support.citrix.com/article/CTX110501 and here: http://blogs.citrix.com/2010/11/11/redundancy-and-scalability-for-tftp-using-netscaler-direct-server-return/. One drawback (other than the setup and continued support of this complicated solution) is that the NetScaler needs to be on the same subnet as the TFTP servers.
  7. NetScaler w/SolarWinds
    SolarWinds is freeware, and it allows segregation of TFTP from PVS services. However, all PVS streaming traffic still has to pass through this device. Still not ideal.
  8. Boot Device Manager (BDM)
    This option basically eliminates TFTP from the equation altogether. The XenApp server boots from an ISO, which already has the bootstrap information. No need to go to the network for it. In the case of FBT (and any foreseeable implementation from here on out), this is the recommended option. As will be pointed out below, you need to ensure that both (all) PVS servers are included in the bootstrap in the created ISO.
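
To make the "sent to a dead server" problem in options 2 and 3 concrete, here is a minimal sketch that contrasts blind rotation with a health-checked choice. A target device's PXE client cannot run anything like this, so it only illustrates what a monitoring job that updates DNS or option 66 might do. The read request follows RFC 1350. The function names and IP addresses are hypothetical.

```python
import itertools
import socket
import struct


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request (opcode 1) as laid out in RFC 1350."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def tftp_alive(host, filename="ardbp32.bin", port=69, timeout=2.0):
    """Return True if the server answers a read request with DATA (3) or ERROR (5)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (host, port))
            reply, _ = sock.recvfrom(1024)
        except OSError:  # timeout, unreachable host, or name resolution failure
            return False
    return len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] in (3, 5)


def round_robin(servers):
    """Plain rotation, as DNS round robin does: no knowledge of server health."""
    return itertools.cycle(servers)


def first_alive(servers, probe=tftp_alive):
    """Return the first server that passes the probe, or None if none does."""
    for server in servers:
        if probe(server):
            return server
    return None


if __name__ == "__main__":
    servers = ["10.0.0.11", "10.0.0.12"]  # hypothetical TFTP server addresses
    dead = {"10.0.0.11"}
    print(next(round_robin(servers)))  # rotation can still hand out the dead server
    print(first_alive(servers, probe=lambda s: s not in dead))
```

The probe abandons the transfer after the first reply and leaves the server to time it out, which is acceptable for an occasional check. An ERROR reply counts as alive because it shows the TFTP service is answering, even if the requested file name is wrong.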

The BDM Option

BDM seems like a much simpler, less fallible, and better-performing option, with no single point of failure. However, there are some gotchas. For example, as pointed out by PLA’s William Daniels:

  • Target Device Performance
    There is a penalty in performance, surprisingly noticeable, when browsing files on a virtual server that has a mounted CD ROM. …When…all the servers are referring to the same BDM ISO, and Windows does its annoying thing of “spinning up” the ISO every time Windows Explorer is opened in Windows, it gets noticeable.
    “Additionally I have also seen in real world scenarios on ESXi where VMs will intermittently fail to read the ISO properly, probably due to contention or delay, on the store the ISO is on.”
  • SPOF
    The ISO is only as good as the storage it’s coming from. If the datastore where the ISO lives becomes unavailable, then XenApp servers cannot boot.
  • Manageability
    A BDM ISO is specific to a PVS farm, and normally per site. If implementing multiple sites at scale, without some forethought, this scenario could become unwieldy.
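
Because each farm (and usually each site) has its own ISO, it is easy for a bootstrap server list to drift out of date. The following is a minimal sketch of two checks. The first applies the rule stated earlier for Configure Bootstrap (each server lists itself first, then the others). The second confirms that a BDM ISO's list covers every PVS server in the farm. The server names and function names are hypothetical, and the lists are assumed to have been collected by hand or by some other means.

```python
def expected_bootstrap(server, farm_servers):
    """The server's own entry first, then the remaining farm servers in farm order."""
    return [server] + [s for s in farm_servers if s != server]


def check_server_bootstraps(farm_servers, configured):
    """Return {server: problem} for per-server bootstrap lists that break the rule."""
    problems = {}
    for server in farm_servers:
        actual = configured.get(server, [])
        missing = [s for s in farm_servers if s not in actual]
        if missing:
            problems[server] = "missing " + ", ".join(missing)
        elif actual[0] != server:
            problems[server] = "does not list itself first"
    return problems


def check_bdm_iso(farm_servers, iso_servers):
    """Return the farm servers that a BDM ISO's bootstrap list leaves out."""
    return [s for s in farm_servers if s not in iso_servers]


if __name__ == "__main__":
    farm = ["pvs01", "pvs02"]  # hypothetical farm
    configured = {"pvs01": ["pvs01", "pvs02"], "pvs02": ["pvs02"]}
    print(check_server_bootstraps(farm, configured))
    print(check_bdm_iso(farm, ["pvs01"]))
```

With several sites, running checks like these once per farm against each site's ISO would keep the forethought mentioned above from turning into manual bookkeeping.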

Bottom Line

BDM is a great solution to alleviate the TFTP HA issue in some scenarios, such as small or medium implementations.

However, it should be looked at from a price/penalty perspective as well. For the trouble, what are you buying? How much risk is it to not have XenApp boot capability during the time it takes to restore or rebuild a PVS server? Or, if you choose the DHCP method, the time it takes to update the option with a new TFTP server IP?

Given that, Proxy DHCP (PVS PXE) is a good option, as it provides redundancy for the TFTP service (via whichever server responds to the PXE request) without a lot of complexity, other than ensuring that your target devices are on the same subnet as your PVS servers.

Other than that, a lot of complicated options exist that are questionably worth the price of the solution. As stated above, for the trouble, how much risk is it to not have XenApp boot capability during the time it takes to restore or rebuild a PVS server, alter a DHCP scope, or fix a helper relay?

 
