So, the symptoms of today’s problem are a weird mess of COM+, DCOM, certificate and domain-related errors including, but not limited to:
- Kerberos problems (EventID: 7)
- Logon problems (EventID: 1054)
- COM/DCOM problems (EventID: 10016)
- DNS/Socket problems (EventID: 11004)
- Occasional LSASS/Socket problems (EventID: 10107)
- A collection of certificate autoenrollment errors (EventID: 15)
And, I’m sure, a bunch of others too tedious to list. Among the other bizarre symptoms are Kerberos errors on machines that have exactly the right time, errors saying “unable to contact the domain controller” even though it’s one hop away and (the reason we discovered it) machines you can’t see from any of Microsoft’s management tools (SCMS, MOMS, etc). It doesn’t happen on all the computers, which are largely identical; just some, some of the time.
The googling you can do for these seemingly-unrelated, completely obscure errors is epic; there’s something about the combination of opaque error messages and time pressure that brings out the tribal-shamanism impulse in even the most soi-disant technically-inclined, and with a vengeance. And I’m sure that closed-source software, wildly inadequate instrumentation and the fact that nobody else in the herd knows what the fuck is going on either don’t help. Restart the server! Rejoin the domain! Rebuild the domain completely, and change NICs! Edit registry keys! Burn incense! Turn the AC down five degrees! Swing cats about by their tails! But not black cats, use only white cats, that’s the trick!
Most of my time isn’t spent actually solving problems anymore, it’s spent filtering out the garbage between me and the solution. Warren Ellis once wrote that “the Singularity is the last trench of the religious impulse in the technocratic community”, but the more problems I have to fix on these black-box machines, the more I think that for most people any excuse to get superstitious will do, technocrat or not. All this technology, and you don’t even have to wait ’til sundown anymore to find out who’s afraid of the dark.
Here’s the trick: read your logs, and read them in order. Build up some sort of mental model where cause leads to effect, make a theory and test the theory. If it’s wrong, roll back your changes, make another theory and test that. When one of them works, that is the one you write down. Don’t clutter up Google with the rest!
Science works, bitches; get your pseudoshamanist cargo-cult bullshit off my internet.
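For what “read them in order” can look like when the events are scattered across several logs, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Event` record, the `merge_in_order` helper, the timestamps and the placeholder “network-up” entry are all made up, and only the three event IDs come from the list above.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical record: one line from one event log."""
    when: datetime
    source: str
    event_id: int


def merge_in_order(*logs):
    """Merge several per-log event lists into one timeline, oldest first."""
    return sorted((event for log in logs for event in log),
                  key=lambda event: event.when)


# Invented sample data: the timestamps and the "network-up" entry are
# placeholders, not taken from a real machine.
system_log = [
    Event(datetime(2009, 1, 5, 8, 0, 12), "network-up (placeholder)", 0),
    Event(datetime(2009, 1, 5, 8, 0, 3), "Kerberos", 7),
]
application_log = [
    Event(datetime(2009, 1, 5, 8, 0, 4), "Logon", 1054),
    Event(datetime(2009, 1, 5, 8, 0, 6), "Autoenrollment", 15),
]

timeline = merge_in_order(system_log, application_log)
for event in timeline:
    print(event.when.time(), event.source, event.event_id)
```

In this invented timeline the placeholder network event lands after all three errors, which is exactly the kind of ordering that only shows up once the logs are read as a single sequence.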
The real problem turns out to be a race condition between networking and all the things that rely on it. In this modern age of peppy computers running crappy software, it can take just a hair longer to heat up that card, autonegotiate your duplexing and be assigned an IP address than it takes to start all the other fancy things that rely on the network, and many of those (like “where’s the domain controller?”) just assume that it’s there.
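The shape of that race can be shown with a toy Python sketch. This is not how Windows does any of it; `FakeNic` and `wait_for` are hypothetical names invented here, and the "network" is just a counter that reports ready after a few checks. The point is only the difference between asking once and assuming, versus waiting with a deadline.

```python
import time


class FakeNic:
    """Hypothetical stand-in for a NIC that needs a few polls to get an address."""

    def __init__(self, ready_after):
        self.checks_left = ready_after

    def has_address(self):
        if self.checks_left > 0:
            self.checks_left -= 1
            return False
        return True


def wait_for(check, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


nic = FakeNic(ready_after=3)

# The racy way: ask once, too early, and carry on as if the answer were final.
racy_result = nic.has_address()

# The patient way: keep asking until the network is there or we give up.
patient_result = wait_for(nic.has_address, timeout=5.0, interval=0.0,
                          sleep=lambda seconds: None)

print("racy:", racy_result, "patient:", patient_result)
```

The dependent services in the story behave like the first call: one look, no retry, and a pile of misleading errors afterwards.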
In situations where you need for users to receive software, implement folder redirection, or run new scripts in a single logon, apply a GPO with the setting “Always wait for the network at computer startup and logon” to the computer. This setting is located under Computer Configuration\Administrative Templates\System\Logon in the Group Policy Object Editor. For this setting to take effect, Group Policy must be refreshed or the computer restarted.
So make the change and reboot. And just like that, poof, it’s gone.