Oftentimes I run into designs at work (not just at Amazon, in general) that work — mostly. The problem is the “mostly” part.

We would all like to write code that is perfect and runs right every time. Even if it is perfect the moment you press “ship it,” the world will conspire to make it imperfect soon enough.

Permit me to take a step sideways first: hardware design. Almost every piece of modern hardware has a plethora of test points littering the board, so you can probe one part of the circuit while the rest keeps running. The more complex the board is, the more test points you typically see on it.

My thought is that software ought to be written the same way.

This gets to a pet peeve of mine. Most programmers think of testing as writing a bunch of unit tests or even integration tests. That works as long as you’re in controlled conditions. Once the software gets out in the wild, you get all sorts of wild and wonderful failure modes.

What I like to design into the code I write is a way of testing individual submodules in the wild, in as much isolation as I can practically get. The advantage is that when the code breaks, which it inevitably does, I can diagnose it, because I designed it to be debugged.
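Here is a minimal sketch of what I mean by a software test point. Every name in it (`ProbeRegistry`, `parse_price`, the two probes) is hypothetical and made up for illustration. The idea is that each submodule registers a small probe that exercises it alone, and the running service can fire any probe on demand and report what happened without falling over.

```python
from typing import Callable, Dict


class ProbeRegistry:
    """Named probes, each exercising one submodule in isolation (hypothetical)."""

    def __init__(self) -> None:
        self._probes: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, probe: Callable[[], None]) -> None:
        self._probes[name] = probe

    def run(self, name: str) -> str:
        """Run one probe and return "ok" or a short failure description."""
        try:
            self._probes[name]()
            return "ok"
        except Exception as exc:  # a failing probe must not take the service down
            return f"FAIL: {type(exc).__name__}: {exc}"

    def run_all(self) -> Dict[str, str]:
        return {name: self.run(name) for name in self._probes}


def parse_price(text: str) -> int:
    """Hypothetical submodule: turn "12.34" into 1234 cents."""
    dollars, cents = text.split(".")
    return int(dollars) * 100 + int(cents)


def probe_parse_price() -> None:
    # Exercise the parser alone with a known input.
    if parse_price("12.34") != 1234:
        raise ValueError("parse_price returned the wrong amount")


def probe_cache() -> None:
    # Stand-in for a dependency that is down in production.
    raise ConnectionError("cache unreachable")


probes = ProbeRegistry()
probes.register("parse_price", probe_parse_price)
probes.register("cache", probe_cache)

report = probes.run_all()
print(report)
```

In a real service I would hang `run_all` off something like an admin endpoint or a command-line flag, so the same probes I used during development are still there when things break in production.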

A counterexample is what I was looking at today: SWF, the Simple Workflow Service. I normally brag about how awesome most of the AWS services are. And they are. SWF, on the other hand, I would name Complex and Fragile WorkFlow Service. CAFWFS. Doesn’t roll off the tongue as nicely, though.

Basically, you have this thing that you can inject things into and have it spit things back out. It’s based on the notion that you can call a dependent service, have it save state, and replay the calling code over and over while replaying the responses when it gets them until the workflow is finished. If you think that sentence is confusing, you’ve entered the world of SWF. It’s awesome that it can be implemented… it really is. But it’s also hard to understand and debug when it fails.
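To make the replay idea a little more concrete, here is a toy sketch. It is not the SWF API and not how SWF is implemented; every name in it (`Context`, `NotReadyYet`, `run_to_completion`, the two fake services) is an assumption of mine, invented for illustration. It only shows the shape of the trick: the workflow code is rerun from the top each time, recorded responses are replayed from history, and the first call with no recorded response stops the pass so the call can actually be made.

```python
class NotReadyYet(Exception):
    """Raised when the workflow needs a response that is not in the history."""

    def __init__(self, call_id: str) -> None:
        super().__init__(call_id)
        self.call_id = call_id


class Context:
    """Hands recorded responses back to the workflow during a replay."""

    def __init__(self, history: dict) -> None:
        self.history = history

    def call(self, call_id: str):
        # Replay: answer from saved history if the response is already there.
        if call_id in self.history:
            return self.history[call_id]
        # Otherwise abandon this pass and ask for the call to be made.
        raise NotReadyYet(call_id)


def workflow(ctx: Context) -> str:
    # Reads like straight-line code, but it runs from the top on every pass.
    order = ctx.call("load_order")
    receipt = ctx.call("charge_card")
    return f"{order} -> {receipt}"


# Hypothetical dependent services, keyed by call id.
SERVICES = {
    "load_order": lambda: "order-7",
    "charge_card": lambda: "receipt-99",
}


def run_to_completion(workflow_fn, services):
    history = {}  # the saved state: call id -> recorded response
    passes = 0
    while True:
        passes += 1
        try:
            return workflow_fn(Context(history)), passes
        except NotReadyYet as pending:
            # Make the real call once, record it, then replay from the top.
            history[pending.call_id] = services[pending.call_id]()


result, passes = run_to_completion(workflow, SERVICES)
print(result, passes)
```

Even in this toy, a two-step workflow runs its own code three times before it finishes. Now imagine stepping through that in a debugger, with the real calls happening on other machines.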

Not just that, but it’s built with one of the most academically pure (which I mean in the most derogatory way possible) frameworks: AspectJ. AspectJ might work great for injecting logging. But in something like this, it just adds to the undebuggability of the whole.

Sure, you could get this working once. But if it breaks — in production when all eyes are on you — I challenge you to figure out how you’re going to fix things in short order.

You might call me a Luddite for saying that simple is better. But I do think that. I do say that.

When the simple thing breaks I can fix it. You can fix it. You’re not just stuck there looking at it wondering how it even worked in the first place.