27 August 2012

How to screw up a release

As I write this, the QA group is beta testing Firestorm 4.2.2.29837. This has been one hell of a long day.

Yesterday, we were all hopeful and happy that we'd gotten 4.2.1 the hell out the door. We'd been beating on it to get the pathfinding tools in, along with a bunch of crash fixes, translation updates, and a few very highly requested features (Ctrl-Shift-E for Edit Linked Parts being my primary original contribution, though I also ported the Flickr snapshot upload from Exodus). We QA'd it, we beat on it, everything looked good. We pushed all the needed changes to the various repositories, told the users to grab it and have fun, and went to bed.

I got up about 5:40 AM my time (US Central, GMT-5/SLT+2). When I sat down in front of the computer, I saw in the support chat that there was a problem. It turned out that 4.2.1 has a bug that makes most swinging doors look like they're misbehaving, even though they actually work correctly. They appear to swing normally, then jump ahead, swinging farther all at once.

You can guess how many swinging doors there are in SL. You'll probably guess low. No, I don't have a number, but it's gotta be astronomical.

The thing that annoyed me especially: The front doors on my own house showed the bug! Worse, I'd noticed the issue a week or so ago, and blown it off!

Fortunately, our ace support person and walking JIRA encyclopedia, Whirly Fizzle, had already found the changesets that caused the issue. I whipped up a quick build and saw that yes, backing them out did fix the problem. A bit more jockeying, and I had a recommended course of action: back out three changesets directly in the release branch of the repository, bump the version number to 4.2.2, and ship it.

Oh, if it were that simple.

First, the build servers were behind a fiber cut as a result of an automobile accident in Boston. That delayed spinning the new release builds.

Then, while we were waiting on that, we discovered another problem, with another patch: some spinning objects stuttered and didn't show correctly if they updated while spinning. This is the problem that the patches we backed out were supposed to fix. We found the changeset that caused that and backed it out, and it seemed to fix the problem, with no side effects.

But we couldn't be sure. The LL JIRA that that changeset was reported to fix, PATH-542, was (and still is) secret. So how the hell do we decide? Have we reached the end of the string, or are there nasty side effects of not fixing that one? Without knowing what the problem is, we can't make an intelligent decision on what to do with it.

We spent a large chunk of the afternoon trying to figure out what to do next. This time was completely wasted because of the JIRA being kept secret. Finally, about dinnertime, we got enough of a hint as to what the problem was that we were able to exercise it - and decide that not only was the original behavior not a bug, at least at the level of the Firestorm codebase (LL 3.3.3), it was actually the way things should behave.

So we declared it fixed and built release binaries. That's what QA's poking at now.

Jessica Lyon is not at all happy that we had to back out a release. I'm not either. Worse, I feel some responsibility for not saying anything.

Where did we screw up? To examine this, we need to detour for a moment into the world of fail that's been the LL pathfinding release. The pathfinding code has been rather epically broken at just about every step of the way. The problems ranged from broken physics to sitting on the ground failing in rather entertaining ways to the world and minimaps being mis-scaled to the toolset in the viewer being very, very unstable. (This is the reason that LL 3.4.0 is taking so long. It's really, really not pretty.)

We fought a lot of this while putting the pathfinding tools into Firestorm. We saw the effects of a whole bunch of these problems, to the point where we got to thinking "Oh, something else broke? Must be pathfinding fail." That is exactly what I thought when I saw my front doors broken a week or so ago... and it cost us.

I'm not the only one. More than a few of the support folks and beta testers report the same thinking.

The lesson is obvious: Even - no, especially - when dealing with known LL fail, we need to investigate every problem we see. No matter how much it seems that it's just another LL screwup. Every problem. Period.

There's another lesson, and that's that LL's entirely too secretive when it comes to many bugs. Yes, I can see keeping details of LL's infrastructure secret, and it goes without saying that SECurity JIRAs need to be secret. There's simply no good reason to keep the others secret, though, especially once they've been fixed. The only thing that does is keep TPV developers in the dark and make us reinvent wheels.

I hate reinventing wheels. If you're lucky, you end up with a pentagon.

So here we are; before I go to bed, 4.2.2 will be released, full of goodness. But a lot of us wasted a lot of time because nobody said anything about a bug many of us saw. That's gotta stop. It will stop.