Wednesday, May 27, 2009

What is Web-Scale post #2 - Available and Reliable - Designing the future - Mike Teets VP OCLC

Click here for original post : Available and Reliable - Designing the future

By Mike Teets, VP Enterprise Architecture, OCLC

This is the second post in the series on "What is Web-Scale". It has been a while since my first post, so I had better get on with it. I took an informal poll on Twitter to select which area of the web-scale / cloud concepts to expand. Transparency was popular, but the most popular answer was "just do them in order". That is what I will do. Transparency will be next.
Reiterating the bullet from the summary post:
Available and Reliable: 99.9% or 99.99% availability (24x7x365, not against an advertised availability)
Always On: No downtime, planned or otherwise. The site must always be available.
A common internet forum statement is "If there aren't pictures, it didn't happen"... so here is a picture. The first thing about availability at scale is that you cannot depend on opinions or feelings about whether it is good enough. You must measure it. It must be measured every second of every day. The data must be logged over long periods to determine individual service frailty. For massively scalable systems, this data must be reviewed daily with alarms going off anytime a system falls out of specification. The following is a high level dashboard of one of our system monitors at OCLC. Failures must be evaluated for corrective action. It's just not optional.
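As a rough illustration of "you must measure it", here is a minimal Python sketch that turns per-second probe results into an availability figure and flags a service that falls out of specification. The `Probe` record, the function names, and the 99.99% default target are assumptions made for this example; they are not a description of the OCLC monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One health-check result (hypothetical record layout)."""
    timestamp: int  # seconds since the start of the measurement window
    ok: bool        # True if the service answered correctly


def availability(probes):
    """Fraction of probes that succeeded, between 0.0 and 1.0."""
    if not probes:
        raise ValueError("no probes recorded")
    up = sum(1 for p in probes if p.ok)
    return up / len(probes)


def out_of_spec(probes, target=0.9999):
    """True when measured availability is below the target."""
    return availability(probes) < target


# One day of per-second probes with a 30-second outage.
day = [Probe(t, ok=not (1000 <= t < 1030)) for t in range(86_400)]
print(f"measured availability: {availability(day):.5%}")
print("raise alarm:", out_of_spec(day))
```

Thirty bad seconds in a single day is already enough to miss a "four nines" target for that day, which is why the data has to be logged and reviewed continuously rather than judged by feel.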

The numbers: What does it mean? System managers' slang is "two nines" or "four nines": 99% available is "two nines", 99.99% is "four nines". Simple enough, right? While it seems mathematically simple, this area is often misunderstood. We all tend to relate statements on reliability to personal devices and machines that are very local and singular in nature. A single machine at 99.99% is down for about 52 minutes a year.
99% - 3.65 days outage per year
99.9% - 8.76 hours outage per year
99.99% - 52.56 minutes outage per year
99.999% - 5.256 minutes outage per year
99.9999% - 31.536 seconds outage per year
99.99999% - 3.1536 seconds outage per year
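The figures above follow from multiplying the unavailable fraction by the length of a year. A small Python sketch, assuming a 365-day year as the list does (the function name is made up for this example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected outage per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


for availability in (0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability:.5%}: {minutes:.4f} minutes per year")
```

Dividing the result by 60 or by 1,440 gives the hours and days shown for the lower availability levels, and multiplying by 60 gives the seconds shown for the higher ones.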
Now the bad news: A service actually drops to 99.98% available when it is dependent on just two 99.99% lower level services (105 minutes of outage per year). This can be called series availability: the availabilities multiply, so the more services you chain together, the worse your reliability gets.
2 services in series: 365 * 24 * 60 * (1 - .9999 * .9999) = about 105 minutes of outage annually.
3 services in series: 365 * 24 * 60 * (1 - .9999 * .9999 * .9999) = about 158 minutes of outage annually.
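The series arithmetic can be checked with a short Python sketch. It assumes the services fail independently, so that their availabilities simply multiply; the function names are illustrative only.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def series_availability(availabilities):
    """Availability of a chain in which every service must be up."""
    return prod(availabilities)


def series_downtime_minutes(availabilities):
    """Expected outage per year, in minutes, for the whole chain."""
    return MINUTES_PER_YEAR * (1.0 - series_availability(availabilities))


for count in (1, 2, 3, 5):
    chain = [0.9999] * count
    print(f"{count} service(s) in series: "
          f"{series_downtime_minutes(chain):.1f} minutes per year")
```

Each added 99.99% link costs roughly another 52 minutes a year, so a five-stage chain like the metasearch example is already well past four hours of expected outage.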
As you might guess, our current Web 2.0 mashup world is generating reliability issues as services are very typically series based... metasearch -> webui -> SRU -> database -> data just as a common example.

Don't despair, there is good news! This good news actually supports a service architecture environment rather than detracting from it. There are ways to improve availability with a SOA model. The first, and most expensive, way is to buy and manage very highly reliable individual systems. This is the path the big iron of the 80's took... and it got very, very expensive. In modern highly available environments, this issue is addressed by parallelizing the workload. Simply double up each service so that both copies must be down for that service to be down, and the chained system is back under the 52 minutes of a single 99.99% machine (well under, if the two copies fail independently).
Availability = 1 - (1-MachineAvail)**2
Given that, we have even better news: if you double 99% available machines you get 99.99% availability; triple them and you get 99.9999%! This is why massively scalable architectures can now use commodity hardware instead of paying for reliability in the individual machines.
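Generalizing the exponent in the formula above from 2 to any number of copies gives a quick way to check these figures. This Python sketch assumes the copies fail independently, which real deployments only approximate; the function name is made up for the example.

```python
from math import prod


def parallel_availability(machine_avail, copies):
    """Availability of a service that is up if at least one copy is up."""
    return 1.0 - (1.0 - machine_avail) ** copies


for copies in (1, 2, 3):
    print(f"{copies} x 99% machine(s): "
          f"{parallel_availability(0.99, copies):.6%}")

# A chain of two services, each doubled on 99.99% machines:
# the stages are in series, the copies within a stage are in parallel.
doubled_chain = prod(parallel_availability(0.9999, 2) for _ in range(2))
print(f"two doubled 99.99% stages in series: {doubled_chain:.8%}")
```

The unavailable fraction shrinks geometrically with each extra copy, which is what makes cheap machines in parallel competitive with one very expensive machine.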
In real-life examples, however, it is never "simply double..." There are issues in software design, data integrity, transaction routing, load balancing, fail-over, etc. These all contribute a significant cost to obtaining highly reliable systems. In other words, we moved some expense from hardware to software. This is good news again, since software copies scale less expensively than hardware.
Planned versus Unplanned: How many of our services have an outage notification page, or a warning page for a pending or current outage? Historically we have struggled over the words to use, as I am sure everyone has. We carefully craft messages and explanations. But realize this... NOBODY READS THEM! We might feel a little better when we find the notice after we see a service has failed, but the vast majority of users of our systems just see a failure and move on to an alternative.

Another false comfort is that somehow planned and unplanned outages are different. Outages for upgrades are really not tolerated by users. Major internet services figured this out from the beginning. The service must be on at all times. Software installs must be done on a rolling basis while user transactions are serviced. Hardware additions or replacements must be the same. Always on is now the default end-user expectation.
A positive byproduct of the scaling across commodity hardware for reliability is that there are now many options for rolling installs across an environment. It can be done in parallel data centers, farms within data centers, individual machines or even virtual machines on a single host. Again, it takes software design and configuration management design, but it is quite practical in today's environments.
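To make the rolling idea concrete, here is a minimal Python sketch of planning an install in batches so that enough servers always stay in service. The server names, the batch size, and the `min_in_service` threshold are all hypothetical; a real rollout also needs the load balancer draining, health checks, and rollback that the software and configuration management design would provide.

```python
def rolling_plan(servers, batch_size, min_in_service):
    """Split servers into upgrade batches, one batch out of service at a time.

    Raises ValueError if taking a full batch out of rotation would leave
    fewer than min_in_service servers handling user transactions.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(servers) - batch_size < min_in_service:
        raise ValueError("batch too large to keep the service available")
    return [servers[i:i + batch_size]
            for i in range(0, len(servers), batch_size)]


# Hypothetical farm of ten application servers.
farm = [f"app{n:02d}" for n in range(1, 11)]

for step, batch in enumerate(rolling_plan(farm, batch_size=3, min_in_service=7), 1):
    in_service = len(farm) - len(batch)
    print(f"step {step}: upgrade {batch} while {in_service} servers keep serving")
```

The same planning function applies whether the units are virtual machines on a host, machines in a farm, farms in a data center, or whole data centers; only the list of names changes.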
OCLC: Focusing on just one service platform, it comprises 150 servers. These servers are divided into farms by function... 65 database servers, 75 application servers, and 10 servers supporting harvesting and bots. We continually add hardware and rebalance the environment with demand. We have two data centers today and will likely have more in the future as we grow and balance load geographically.