When I was at Insurance.com we had built a good system.

From the inside we had our moments of panic and things seemingly blowing up, but a major firefight only landed on us once every year or two. Sure, systems failed, but for the most part no one noticed.

It is an interesting thing building a system that can sustain failures.

It is interesting for a number of reasons:

  1. Technology – building in fault tolerance
  2. Hardware – keeping costs down
  3. Networking – the combination of the two
  4. Shared resources – things that, by themselves, are a single point of failure
  5. Scalability – making sure you can grow

Basically, it’s a balancing act between throwing hardware at the problem and throwing money at programming. Between the two is the networking layer — the more hardware you throw at things the more complicated the networking becomes as well.

Many times the easy thing is to add a bunch of front-end servers with a shared database. This is, essentially, what we did at ICOM. This approach has a problem, though, that we didn't really run into at ICOM: scaling. We only needed to scale so far.
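The front-end-farm idea can be sketched in a few lines: stateless web servers behind a dispatcher, where a server that fails its health check is simply skipped. The server names here are made up for illustration, and a real setup would use a hardware or software load balancer rather than anything like this, but the failure-handling idea is the same.

```python
class RoundRobinBalancer:
    """Round-robin over the healthy front-ends. A failed server is
    skipped, which is what lets the farm sustain failures: with a
    shared database behind them, any front-end can serve any request."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()   # servers that failed a health check
        self._i = 0         # rotating index

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def next_server(self):
        healthy = [s for s in self.servers if s not in self.down]
        if not healthy:
            raise RuntimeError("no healthy front-ends left")
        server = healthy[self._i % len(healthy)]
        self._i += 1
        return server


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")  # pretend a health check failed
print([lb.next_server() for _ in range(4)])  # ['web1', 'web3', 'web1', 'web3']
```

Losing one front-end just thins the rotation; nobody notices. The catch, as below, is that all of these boxes still lean on the one database.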

The database is the single point that needs to scale, and the only way to scale most databases is to get a bigger server. There is also a lot of extra hardware for the SAN, which needs to be shared between at least two servers. This all gets expensive.

I’m playing around with some stuff on the side just for fun. I’ve been working with Amazon’s cloud services. It’s cool to think about how to make stuff work in a shared-nothing scenario using the services that come with AWS. Amazon does it, so it’s obviously possible.
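Amazon doesn’t publish its internals, but one standard building block in shared-nothing designs is consistent hashing: spread keys across nodes so that adding or removing a node only moves a small slice of the data, instead of reshuffling everything the way `hash(key) % n` would. A toy sketch (node names invented, nothing AWS-specific):

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring. Each node gets several virtual points
    on the ring; a key belongs to the first node point at or after the
    key's own hash (wrapping around)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):  # wrap around the ring
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))  # one of the three nodes
```

The useful property: when a node dies, only the keys that lived on it get reassigned; every other key stays put, so the surviving nodes don’t have to shuffle data among themselves.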

Another thing to think about is the complexity — the more complex a system is the harder it is to stabilize it. Another balancing act is knowing when and where to split systems to scale independently, and when the added complexity will detract from the reliability.

More on all this as it progresses.