After spending about a week working on some client issues, I was finally able to get back to configuring a Lab environment to use multiple Isolated Networks for CI Builds (more on that in a future post, once I am satisfied I have it working properly). When I restarted the Lab Environment (which was working last week), I received errors indicating that the TFS service account was now somehow invalid.
Much to my dismay, I spent several hours trying to figure out what had happened. After searching around in all the usual places, I came up empty. I could tell it was going to be a long night...
Before I explain how I corrected this issue, a little additional background will probably be helpful. The Hyper-V hosts in this particular environment are in one domain, let's call it 'XPServerFarm', and the "normal" servers all run in a separate domain, let's call it 'business.local'. There is a two-way full trust between the two domains. I suspect this is a pretty normal configuration, although it does present some challenges when it comes to permissions.
Reading the error message details, it looked to me like the TFS Service account did not have permissions on the host. I signed onto the host that has the VMs on it and checked the local administrators group. The account was in it. Lots of things went through my mind at this point, but I'll refrain from sharing some of those thoughts and instead focus on some of the things I tried.
Testing each change involved shutting down the lab environment and starting it again, multiple times, each time waiting for the environment to "settle" into a final state. This can often take as long as 10 minutes, depending on what is configured on the various lab VMs.
I removed the account from the administrators group and re-added it; I don't know what I expected, but there was no change in the behavior. I looked at the server and it had been running for about 70 days; I decided that I would reboot it (this server is only for testing so why not?); again no change. Honestly, I really didn't expect the behavior to change... Then I opened up Active Directory FSMO for XPServerFarm, which also happened to be the server that these VMs were on, and verified that the Trust itself was still intact.
At this point I was starting to get concerned... perhaps the problems were because it was on the AD FSMO, but I continued down the path of verifying the account issue itself. Finally, I decided to log into the server as the TFS service account... there had to be something I was missing, configuration-wise, but why would it show up now? I Remote Desktop-ed directly to the Hyper-V host computer and...
I reconnected as a local administrator on the Hyper-V host. Sure enough, the clocks in the two domains had drifted about 12 minutes apart. Interestingly, I had previously set up time synchronization, but apparently there was a problem with it. I checked the configuration and found that the Time Remaining was 0 when I used the command:
w32tm /query /peers
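For context on why a 12-minute drift matters: Kerberos authentication, by default, tolerates only 5 minutes of clock skew between machines, which would explain a service account suddenly looking invalid. Nothing in the error messages confirmed that Kerberos was the mechanism here, so treat that as an assumption. A minimal Python sketch of the check, using hypothetical clock readings:

```python
from datetime import datetime, timedelta

# Assumption: the Kerberos default maximum clock skew (5 minutes) applies here.
MAX_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(clock_a, clock_b, tolerance=MAX_SKEW):
    """Return True when two clock readings differ by more than the tolerance."""
    return abs(clock_a - clock_b) > tolerance


# Hypothetical readings taken at the same instant in each domain.
business_local = datetime(2012, 1, 1, 12, 0, 0)
xpserverfarm = business_local + timedelta(minutes=12)

print(skew_exceeds_tolerance(business_local, xpserverfarm))  # True
```

With a 12-minute gap the check returns True, while a gap of 4 minutes would stay inside the default tolerance.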
I waited a bit and ran w32tm /query /peers again... no change. So I reset the Time Service on the Hyper-V host as follows. I decided to include the details because I ran into an issue here as well (nothing is ever easy).
I unregistered the time service using this command:
Then I rebooted. When the host came back up, I re-registered the time service using this command
Followed by attempting to start the time service
net start w32time
I could not get the service to start. I was getting System Error 1290, which indicated some sort of SID issue: the processes running in the svchost were using incompatible SIDs. I hunted around the internet and found some nonsense posts about changing the tapiserver's registry entry... instead I looked at all of the registry entries and discovered that they were all running with the same type of account (LocalSystem, LocalService or NetworkService), so that obviously wasn't it. I rebooted because I remembered reading somewhere that the way svchost initializes itself includes some sort of registration for the processes running inside svchost.exe. After rebooting, the time service started, and the problems were gone.
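The exact unregister and re-register commands did not survive in the text above. The w32tm switches documented for this are /unregister and /register, so the sequence was presumably along the lines of the sketch below. This is an assumption, not a record of what was typed; the helper name run_steps is hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed commands: the originals are missing from the post. /unregister and
# /register are w32tm's switches for removing and re-creating the service.
BEFORE_REBOOT = [["w32tm", "/unregister"]]
AFTER_REBOOT = [["w32tm", "/register"], ["net", "start", "w32time"]]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in steps:
        line = " ".join(cmd)
        shown.append(line)
        print(line)
        if not dry_run:
            # Must be run from an elevated prompt on the Windows host.
            subprocess.run(cmd, check=True)
    return shown


run_steps(BEFORE_REBOOT)  # reboot the host after this step
run_steps(AFTER_REBOOT)
```

If the start step fails with System Error 1290 as it did here, a second reboot before retrying net start w32time is what cleared it in this case.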
Next, I configured the time service to use the same time servers as the Business.Local domain using these commands:
w32tm /config /syncfromflags:manual /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org"
w32tm /config /reliable:yes
net stop w32time
net start w32time
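One detail worth calling out: w32tm expects the manual peer list to be space-delimited and wrapped in straight double quotes. Commas or the curly quotes a word processor likes to substitute are an easy way to end up with peers that never resolve. A small Python sketch that builds the argument from a list of host names (the function names are hypothetical, for illustration only):

```python
def manual_peer_list_arg(peers):
    """Build the /manualpeerlist switch: space-delimited peers in straight quotes."""
    cleaned = [p.strip() for p in peers if p.strip()]
    if not cleaned:
        raise ValueError("at least one peer is required")
    return '/manualpeerlist:"' + " ".join(cleaned) + '"'


def config_command(peers):
    """Return the full w32tm command line for a manual peer list."""
    return "w32tm /config /syncfromflags:manual " + manual_peer_list_arg(peers)


print(config_command(["0.pool.ntp.org", "1.pool.ntp.org", "2.pool.ntp.org"]))
```

Generating the line this way keeps the quoting consistent when the same peers have to be applied on more than one server.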
Once completed, I issued these commands on both servers to make sure everything was working correctly:
w32tm /config /update
net stop w32time
net start w32time
After waiting a few minutes, the time on the servers synced up and I was ready to continue. I went back to Lab Management Center and tried starting the environment. After a period of time, the environment settled with this status:
After clicking Repair Testing Capability on the pull-down menu associated with the Testing Capabilities, I waited and was eventually greeted with a happy environment once again.