Scalability

Scalability is the ability to keep solving a problem as the size of the problem increases.

Scale is measured relative to your requirements. If you can scale enough to solve your problem, then you have scale. If you can handle the number of objects and events your application requires, then you can scale. It doesn't really matter what the numbers are.

Scaling often creates a difference in kind for potential solutions. The solution you need to handle a small problem is not the same as the one you need to handle a large problem. If you incrementally try to evolve one into the other you can be in for a rude surprise, because what worked before breaks down as you pass through points of discontinuity.

Scale is not language or framework specific. It is a matter of approach and design.

The Two Classes of How to Handle Scalability

I've come to think there are two classes of scalability problems:

  • Scalability under fixed resources.
  • Scalability under expandable resources.

The two different classes lead to solutions using completely different approaches.

Scalability Under Fixed Resources

In this class of scalability problem you have a fixed set of resources, yet you have to deal with ever increasing loads.

For example, if you are working on an embedded system like a router or switch, you are not likely ever to get more CPU, more RAM, more disk, or a faster network. Yet you will be asked to handle:

  • more and more functionality in new upgrade images
  • more and more load from clients

The techniques for dealing with load in this scenario are far different from those available when you can expand your resources.
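One common pattern on fixed hardware (not spelled out in the original, but typical of this class) is to put a hard bound on memory up front and shed excess load once the bound is hit, rather than letting usage grow with demand. A minimal sketch in Python; the queue size and workload are purely illustrative:

 import collections

 class BoundedQueue:
     """A fixed-capacity work queue: memory use stays constant no matter
     how much load arrives; excess work is refused (shed) instead."""

     def __init__(self, capacity):
         self.capacity = capacity
         self.items = collections.deque()
         self.dropped = 0

     def offer(self, item):
         if len(self.items) >= self.capacity:
             self.dropped += 1          # shed load rather than grow memory
             return False
         self.items.append(item)
         return True

     def take(self):
         return self.items.popleft() if self.items else None

 # A router-style buffer with a fixed 1024-entry budget.
 q = BoundedQueue(1024)
 for packet in range(5000):             # load keeps increasing...
     q.offer(packet)                    # ...but memory use does not
 print(q.dropped, "packets shed")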

Scalability Under Expandable Resources

In this class of scalability problems you have the ability to add more resources to handle more work; in general this is called horizontal scaling.

The new era of cheap yet powerful computers has made horizontal scaling possible for virtually anyone. Many companies can afford to keep a grid of hundreds of machines to solve problems.

This is the approach Google has taken to handle their search systems, for example, and it is very different from a fixed resource approach. With fixed resources we would be squeezing every cycle of performance out of the resources, spending a lot of time developing new approaches and tuning existing code to fit the exact problem.

When resources are available, and your approach is right, you can just add more machines. You start to figure out ways to solve your problem assuming horizontal scaling.

In general this area is called data parallel algorithms.
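As a toy illustration of the data parallel style, the sketch below splits the input into independent chunks and fans them out to worker processes; add more workers (or machines) and the same code handles more data. The log-counting task is made up for the example:

 from multiprocessing import Pool

 def count_errors(chunk):
     # Each worker processes its own slice of the data independently.
     return sum(1 for line in chunk if "error" in line)

 def parallel_count(lines, workers=8):
     # Split the input into roughly equal, independent chunks...
     size = max(1, len(lines) // workers)
     chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
     # ...and map them across the available workers.
     with Pool(workers) as pool:
         return sum(pool.map(count_errors, chunks))

 if __name__ == "__main__":
     lines = ["error: disk failed", "ok", "error: timeout"] * 100_000
     print(parallel_count(lines))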

For example, Terrascale (http://terrascale.net/) has an amazing storage grid called Terragrid that allows you to scale up by incrementally adding commodity machines. With the availability of 10Gb Ethernet interfaces these approaches become quite powerful.

Of course, your approach has to be right. If you select an architecture with single points of serialization then you won't be able to scale by adding more machines.
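One way to quantify that warning is Amdahl's law (not named in the original, but it describes exactly this effect): whatever fraction of the work is serialized puts a hard ceiling on the speedup you can buy with more machines.

 def amdahl_speedup(serial_fraction, machines):
     # The serialized part never gets faster, no matter how many
     # machines work on the parallel part.
     return 1.0 / (serial_fraction + (1.0 - serial_fraction) / machines)

 for n in (1, 10, 100, 1000):
     print(n, "machines:", round(amdahl_speedup(0.05, n), 1), "x")
 # With just 5% of the work serialized, speedup tops out near 20x
 # even with 1000 machines.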

Examples of Dealing with Scale

Here are a few examples of how different people have dealt with scale.

Real Life Web Site Architectures


Databases


Networks


File System

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means the average disk is expected to fail about every 57 years. Sounds good, right? Now suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
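A quick numeric check of that argument:

 mtbf_hours = 500_000                        # per-disk mean time between failures
 disks = 1000

 print(mtbf_hours / (24 * 365))              # ~57 years for a single disk
 time_to_first_failure = mtbf_hours / disks  # failures spread evenly across disks
 print(time_to_first_failure)                # 500 hours for the whole array
 print(time_to_first_failure / (24 * 7))     # ~3 weeks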

Other things to consider when planning such a system:

  • WHAT is going to be done (database, file storage?)
  • HOW will it be accessed? (One large file, many smaller files)
  • WHEN will it be accessed? (During business hours, distributed over the day?)
  • AVERAGE TRANSFERS - will the whole schmear come over, or just selected parts?
  • SECURITY a concern? (Sensitive data, protected network)
  • BACKUP - a petabyte of tape storage is expensive, and takes quite a while to do.
  • POWER - do you have enough?
  • COOLING - ditto
  • SPACE - ditto

Storage

Petabox claims 40 watts per terabyte. That is pretty low; if you are going to try to come up with your own solution from off-the-shelf parts it will be hard to match. If they can't pay for 40 watts per terabyte across a petabyte, maybe they should reconsider whether they need the petabyte for now.

  • Let's say $0.07 per kWh.
  • Then the 50 kW mentioned above would be (see the sketch below):
  • 50 * 24 * 31 * $0.07 = $2,604/month
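The same calculation written out, so a different rate or load can be plugged in; the $0.07/kWh rate and 50 kW load are the figures quoted above:

 rate_per_kwh = 0.07          # dollars, the assumed rate above
 load_kw = 50                 # the 50 kW figure quoted above
 hours_per_month = 24 * 31

 print(load_kw * hours_per_month * rate_per_kwh)   # 2604.0 dollars/month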

Misc


Monitoring

  • Munin - a monitoring tool that surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface.

Grid


Platforms

  • Apache, Perl
  • TurboGears, Python
  • Java, *
  • Windows, .NET