Auto Scaling

The idea is really cool and cost efficient. However, the actual implementation is not as easy as it should be. There are vendors trying to bridge the gap, and I believe it will be much easier in the future.

Problem at 2AM

For many services, usage fluctuates over the course of a day (and a week). For example, in our own traffic pattern, usage bottoms out from 2AM to 8AM (PST). Servers sit idle, which wastes money and electricity. The solution is to scale down during this period: maintain a core capacity and add or terminate servers on demand. That’s the promise the cloud computing marketing hype makes, but I guess few companies take full advantage of it because the level of automation is still very low.
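To make "core capacity plus on-demand servers" concrete, here is a minimal sketch of the sizing rule. The function name, the per-server capacity of 1000 requests per second, and the load figures are all made up for illustration; they are not measurements from our cluster.

```python
import math

def desired_servers(requests_per_sec, per_server_capacity, core_servers, max_servers):
    """Return how many servers to run for the current load.

    Never drops below the core capacity and never exceeds the cluster maximum.
    All parameters are hypothetical knobs for this sketch.
    """
    needed = math.ceil(requests_per_sec / per_server_capacity)
    return max(core_servers, min(needed, max_servers))

# Hypothetical load samples (hour of day, requests per second).
for hour, load in [(2, 900), (9, 6200), (20, 9800)]:
    count = desired_servers(load, per_server_capacity=1000, core_servers=5, max_servers=10)
    print(f"{hour:02d}:00 -> run {count} servers")
```

A real autoscaler would also need some damping (for example, waiting a few minutes before scaling down) so it does not flap between sizes, which this sketch leaves out.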

Problem with existing data

Say you have a cluster of 10 servers, and at 2AM you only need 5. What do you do with the rest? It’s tempting to simply shut them down. Not so fast! What about the data on those servers? If your app simply serves static/dynamic pages and does its logging centrally (a scaling problem of its own), then this is possible. But if your application generates data and needs to process it in some way, you have to deal with this data before termination. Here are a few possible solutions. Please feel free to add your comments/suggestions; I’m sure there are better ways.

Decouple data storage and application layer

Isolating the layers is good practice. However, it comes with a performance trade-off. If your app writes a lot (logging) to a central storage/database, many app servers can overload the master DB with writes, and then the DB itself needs to scale out, which makes the problem more complicated. Relying on central storage can also create a single point of failure.

Process before destroy

It depends on how fast the data processing can take place. If a server needs 4 hours to process its data, most of the off-peak hours have already passed by the time it can be shut down.
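The arithmetic is simple, but it is worth writing down. The sketch below assumes the 6-hour off-peak window from our own pattern (2AM to 8AM) and a hypothetical helper name; it just reports how many idle hours are actually saved once processing time is subtracted.

```python
def idle_hours_saved(window_hours, processing_hours):
    """Hours a server can stay off after finishing its processing.

    Returns 0 when processing takes as long as (or longer than) the window.
    """
    return max(0, window_hours - processing_hours)

# 2AM to 8AM is a 6-hour window; a 4-hour processing job leaves only 2 hours off.
print(idle_hours_saved(window_hours=6, processing_hours=4))
# A 7-hour job means shutting down saves nothing.
print(idle_hours_saved(window_hours=6, processing_hours=7))
```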

Move data to another peer before destroy

Peers help other peers. The dying instance sends all its data to another instance and then dies (hey, just like people). The problem here is merging the data (e.g., colliding auto-increment IDs). I think this is the best way for our particular situation (many, many small writes per second), as any single instance only holds a small portion of the data (vs. a central database), and it still follows KISS (keep it simple, stupid).
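One possible way around the auto-increment collision, sketched here as an assumption rather than what we actually run: key every row by (instance id, local id) instead of the local id alone, so rows from two instances can never clash when merged. The instance names, function names, and row payloads below are hypothetical.

```python
def tag_rows(instance_id, rows):
    """Re-key rows from a local auto-increment id to (instance_id, local_id)."""
    return {(instance_id, local_id): payload for local_id, payload in rows.items()}

def merge_into(survivor, incoming):
    """Return a new dict holding the survivor's rows plus the dying peer's rows."""
    overlap = survivor.keys() & incoming.keys()
    if overlap:
        # Should not happen when instance ids are unique; fail loudly if it does.
        raise ValueError(f"key collision on {sorted(overlap)}")
    merged = dict(survivor)
    merged.update(incoming)
    return merged

# Both instances used local ids 1 and 2, which would collide if merged as-is.
survivor = tag_rows("app-1", {1: "login", 2: "click"})
dying = tag_rows("app-2", {1: "view", 2: "logout"})

merged = merge_into(survivor, dying)
print(len(merged))           # 4 rows, none overwritten
print(merged[("app-2", 1)])  # view
```

The trade-off is that every later query has to deal with a composite key, but it keeps the merge step itself trivial.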

Any thoughts on improvements or other alternatives?