Have you ever played Battleship? It's a game where 2 players put their ships on a grid and start guessing coordinates. If you guess a coordinate where your opponent's ship is located, that counts as a hit. Each ship takes up 3, 4, or 5 coordinates and once you've guessed all the coordinates of a ship, it's sunk. The commercial when I was a kid had the kid on the losing side exclaim, "You sunk my battleship!" Good times, good times.
This is part 2 in a series. If you have no idea why I'm talking about board games from the 1980s, start with Part 1.
Battleship taught me most of what I need for troubleshooting. I have a bounded set of possibilities, I only get a certain number of attempts, and there's an uncomfortable period during each attempt when something bad may happen to me. When playing Battleship, you can simply guess at random, or you can employ a directed strategy. The same is true in troubleshooting. If you're all over the place, sometimes you'll get lucky and sometimes you'll miss for what seems to be an eternity.
As you're implementing your search strategy, it's possible to become too focused. In the game, for example, it makes little sense to test a coordinate immediately adjacent to a miss, since the ships are at least 3 units long. Better to space out the testing and then "walk it in" by focusing on the surrounding area once you get a hit. The same is true in troubleshooting. If each test run is simply a close variation on the previous hypothesis, you're in for a long night. If, however, you take a wider approach and test variations that are reasonably separated, you have a much better chance of getting a hit, and then you can hyper-focus on the surrounding areas.
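For the code-minded, here's a rough sketch of that "space it out, then walk it in" idea in Python. The 10x10 grid, the diagonal spacing rule, and the function names `spaced_guesses` and `walk_it_in` are all my own assumptions for illustration, not anything from the game's rulebook. The trick is that any straight run of 3 cells has to cross one of the spaced-out diagonals, so the coarse pass can't miss a ship.

```python
from itertools import product


def spaced_guesses(size=10, min_ship_len=3):
    """Coarse pass: cells on diagonals spaced min_ship_len apart.

    Any horizontal or vertical run of min_ship_len cells contains
    exactly one of these cells, so no ship can hide from this pass.
    (Grid size and spacing rule are illustrative assumptions.)
    """
    return [
        (row, col)
        for row, col in product(range(size), repeat=2)
        if (row + col) % min_ship_len == 0
    ]


def walk_it_in(hit, size=10):
    """Fine pass: after a hit, test the in-bounds neighbors of that cell."""
    row, col = hit
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [
        (r, c)
        for r, c in candidates
        if 0 <= r < size and 0 <= c < size
    ]


coarse = spaced_guesses()
print(len(coarse))         # 34 guesses cover all 100 cells' worth of ships
print(walk_it_in((0, 4)))  # [(1, 4), (0, 3), (0, 5)]
```

The troubleshooting equivalent: pick test cases far enough apart that a problem of any plausible size has to land on one of them, and only start testing close variations once something actually breaks.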
I seem to recall that Battleship was never as much fun as I thought it should be. When you think about it, there isn't all that much excitement and it doesn't have much to do with the Navy. Troubleshooting: same story. It's one of those things that you do that has its ups and downs, but overall you kind of wish you had done something else instead.
The point here is that when you're confronted by a troubleshooting problem, don't just fire at random and hope that you get a hit. Approach it with a plan, execute, and once you get a hit, focus on that area for all it's worth.
I'm a big fan of analogies. Analogies are the Swiss Army knife of the brain (note: I'm also a fan of recursion). A good analogy gives you a clear mental model of something complicated. This is the first in a series of posts that was inspired by a walk through a toy store where I saw some of my favorite childhood games and for each one thought, "Wow, that game is just like . . . " about some facet of my job.
Jenga
Jenga is a game where you start with a bunch of rectangular blocks stacked in an orderly manner into a tower. The game progresses by each player pulling a block out of the structure and stacking it on top. In this way the tower becomes taller and less stable over time. Sound like any systems you know?
By day I specialize in brownfield development which is like walking into a Jenga game near the end. Actually, it's like walking into a game where most, if not all, of the original players have left because no one wanted to touch the unstable mess in front of them. In Jenga, the one who knocks the tower over loses. There are 2 ways to knock the tower over: by pulling out the wrong piece or by putting the piece in the wrong place. In other words you can lose by removing the wrong thing or adding the wrong thing.
So how do you play well?
To be a good Jenga player you need 2 things: An understanding of the tower's current state and a steady hand. If you ever have the chance to play against a mechanical engineer, civil engineer, or architect (the real kind) then realize that you will most likely be playing on the basis of your steady hand. These folks will look at the structure, see some crazy Beautiful Mind type equations floating around, and pull a piece that you'd never touch. In the same vein, the more you understand about how your current system is built, the better off you'll be when it comes time to remove sections of it.
The steady hand in Jenga is analogous to the work habits of a solid developer. If you're focused, consistent, and just a bit fearful, then you'll probably do better than someone who just adds stuff in willy-nilly.
To take the analogy beyond game play itself, if you do knock the tower over then what you do next is very important. If you quickly reset the game to play again, chances are you get to play again. If you look at the mess, say, "Sucks to be you!", and walk away then it's likely that no one will want to play with you ever again. Also bad: losing a piece.
In short: You can win by taking the time to understand things well, being careful, and cleaning up quickly when things fall apart.
This post will probably strike you as either common sense or absolute crazy talk. It is especially written for those in the latter group.
I write a lot about working safely. After lots of posts on branching, test environments, kitchen analogies, etc., I'm here to recommend some behaviors for those times when you totally screw up. After all, you may very likely find yourself in an environment without all of the safety nets you want because you were specifically brought in to build the safety nets. I'm going to assume that you messed up while doing the right thing in the wrong way rather than something criminally stupid like, say, encoding your DVD collection to DivX on the production database server because "it has those really fast drives and all that RAM".
First, and foremost, as soon as you realize that you've screwed up, let someone know. Do not be tempted to keep things quiet and fix it before anyone notices. I have yet to see a production issue that didn't get worse with time (and quickly). Keeping things quiet is outright selfish because you're putting your own comfort ahead of the good of the group.
Secondly, fixing your mistake needs to become your top priority. Fixing means not only getting things working again, but getting them back to the way they would have been. Does data need to be re-keyed? It's now your job to re-key it. Do numbers need to be verified? If you're not the one who can do it, be prepared to generate special reports or data dumps to make the job as easy as possible.
Next, take responsibility for your mistakes. Full responsibility. You don't get to say, "I deleted the production website, but the slow restore process is what caused the outage to be so long." Being up the creek without a paddle means that you own the upstream and downstream problems as well.
After things are back to normal, do your own private After Action Review (note: there's a good chance you'll be asked for either a public one or one with your manager). Take this opportunity to learn from what just happened while it's still fresh. For a big enough mistake, you'll probably also reflect on it for a day or two. Having said that, hear me now and believe me later: do not utter the words, "Well it's kind of lucky it happened because…". Even if there's some fantastically beneficial outcome, you don't get to celebrate the effect, you are still responsible for the action.
Lastly, get over it. If you've made the kind of mistake that I'm writing about, it will almost certainly affect you emotionally, mentally, and physically. That's to be understood and will actually help with internalizing the "Don't do that again" lesson. But don't let it affect you too much for too long or you'll kill your productivity. If making a huge mistake makes you skittish to the point that you are no longer a high-performing contributor, then things aren't back to normal, are they?
As a final thought, while things are at their worst you may start wondering, "Are they going to fire me for this?" I can't answer for certain, but I can tell you this: When I had headcount, I never fired someone for making a mistake . . . and they pulled some doozies. If you do get fired for a blunder that you feel comfortable defending (i.e., doing the right thing the wrong way), then chances are it wasn't the place for you anyway. The only way you can do truly incredible work is by being willing to take some risks, and if your employer squashes any chance of that happening by firing people for mistakes, then you're better off elsewhere. Just don't make a habit of it: it's easy to explain a one-off, but the second time you get fired for f'ing up big time it starts to look like a trend.
In the spring of 2008, I decided to become a brownfield development specialist. Greenfield development is when you're starting a project from scratch and get to design everything with only minimal constraints. Brownfield development is the opposite of that: it's when a project is n months into development and most of the constraints have been cast in stone. My guess is that if you were to ask every developer you know whether they'd prefer to work on a greenfield project or a brownfield project, they'd all say greenfield. Heck, I'd bet that a quarter of them would laugh so hard that Red Bull shot out of their noses just at being asked the question. Therein lies one of the secrets of becoming a Big Swinging Developer:
You can make money being good at things that other people hate to do. A lot of money.
When people hate to do something, they don't do it very much. Since they don't do much of it, they never get particularly good at it. This makes it harder for them to do it and the cycle starts all over again. Some common values of "it" for development include: writing tests, documentation, and creating installers. You can safely add brownfield development to the list since a job description of "Support the big ball of mess that runs our business while adding features and fixing bugs" doesn't usually have folks banging down your door. Throw in the fact that you'll have no relevant documentation and few, if any, tests to guide your way and you can picture weeks of pestering your co-workers with questions just to get to the point where you can fix a simple defect.
But what if brownfield was easy for you? What if, rather than pestering your co-workers with basic questions, you could understand what the code was doing and then ask why it was doing things that way rather than asking what it actually does? That's what my secret weapon, Visustin, gives you. You paste in your source code and it generates a flowchart for you.
It supports 31 languages including the popular .NET, open source, and SQL variants. I spent last week throwing hundreds of lines of Python into the tool and tracing through an incredibly complex financial trading system to learn how portfolio valuation is calculated. I was able to correctly describe the process to my team lead, including a couple of gotchas buried in the code, even though I don't know Python. Since I know how the system works, though, I can find the path the code will take and identify where problems are likely to occur while coming up to speed on the language at night.
If you're a developer, I'd highly recommend Visustin. It's great for code reviews, documentation, and for diving into existing systems. Developer or not, look around your industry and find the important things that no one wants to do because there's a real opportunity there. You can become better (or more tolerant) than anyone else simply by identifying the key aspects of the unpleasantness and solving that problem first.