Hacker News
Using Amazon Auto Scaling with Stateful Applications (adrienmagnus.com)
58 points by amagnus on June 4, 2016 | 11 comments


Glad to see writeups and lessons learned about AutoScaling. As a system builder, I think it's a capability that's under-appreciated. My rule of thumb is that essentially every machine fleet benefits from being managed as an AutoScaling Group (whether stateless or not - there's value even without dynamic scaling).

> If you’re working with a queue based model then scaling will be done based on the SQS queue size, otherwise we’ll use the custom metric “number of running jobs”.

There is another strategy to consider when auto-scaling queue-based applications: if you have a fixed number of threads per machine instance that process work from a queue, then you can scale based on the percentage of threads occupied.

You can treat this as a utilization percentage, similar to CPU utilization; the observation is that it lets you begin scaling up before a (long) queue ever actually develops. For example, consider an application where the average queue size is zero and the average thread utilization is zero. Consider another application where the average queue size is also zero, because messages are processed almost immediately, but 19 out of 20 threads are occupied on average. You can conclude that the first application is nearly idle while the second is nearly maxed out, even though both have empty queues. By considering the second application to be 95% utilized (19/20), you can establish a scaling policy that scales up before a backlog develops, if you wish -- this assumes you want to avoid developing a long queue, which is desirable in some cases. It depends on how quickly you'd like to process messages, i.e., your SLA.
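To sketch what emitting such a metric might look like (a minimal example assuming boto3 and a fixed-size worker pool; the "MyApp/Workers" namespace and metric name are made up for illustration):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def report_thread_utilization(busy_threads, total_threads):
        # Busy threads as a percentage of the pool, analogous to CPUUtilization.
        utilization = 100.0 * busy_threads / total_threads
        cloudwatch.put_metric_data(
            Namespace="MyApp/Workers",  # hypothetical namespace
            MetricData=[{
                "MetricName": "ThreadUtilization",
                "Value": utilization,
                "Unit": "Percent",
            }],
        )

    # e.g. 19 of 20 threads occupied -> 95% utilized, even with an empty queue
    report_thread_utilization(busy_threads=19, total_threads=20)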

(The article touches on this as well, talking about number of running jobs.)

Queue size can be useful as well, but I think it can be more difficult to tune. Percent of thread capacity works well regardless of how large your fleet is and how expensive messages are to process. By comparison, a large-scale system that processes thousands of messages per second could develop a large queue -- in terms of number of items -- from a brief blip, and then burn through it moments later. A 20,000-item queue might be nothing to such a system, whereas for another system 100 items could be significant, if each one is a 10GB video to download and transcode.

The ideal auto-scaling solution typically involves a mix of multiple measurement techniques, since it's rare that any single performance characteristic captures an application's load perfectly. I would definitely agree with the author's point that instrumentation is highly valuable.


I agree that queue size should not be the only measure, but I'd like to propose an alternative: an M/M/c queue model (https://en.wikipedia.org/wiki/M/M/c_queue), which takes into account both arrival rate and service time.
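For the curious, here's roughly how M/M/c sizing could look in practice: given an arrival rate and a per-server service rate, pick the smallest server count c that keeps the Erlang C probability of queueing below a target (all the numbers here are illustrative):

    from math import factorial

    def erlang_c(c, offered_load):
        # Probability that an arriving job must wait in an M/M/c queue,
        # where offered_load = arrival_rate / service_rate.
        rho = offered_load / c
        if rho >= 1:
            return 1.0  # unstable: the queue grows without bound
        top = offered_load ** c / (factorial(c) * (1 - rho))
        bottom = sum(offered_load ** k / factorial(k) for k in range(c)) + top
        return top / bottom

    def servers_needed(arrival_rate, service_rate, max_wait_prob=0.05):
        # Smallest c such that the chance of queueing stays under the target.
        offered_load = arrival_rate / service_rate
        c = max(1, int(offered_load) + 1)
        while erlang_c(c, offered_load) > max_wait_prob:
            c += 1
        return c

    # e.g. 90 jobs/min arriving, each server handling 10 jobs/min
    print(servers_needed(arrival_rate=90, service_rate=10))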


Thanks for your input. There are indeed a few drawbacks to sticking with the queue size metric alone.


Queues should be measured in wallclock time, not in terms of items. That removes the problem you mention.


The only other consideration is the stress jobs put on peripheral resources like the database. If you have multiple jobs intense enough to require sequential execution, they could inflate a wall-clock number even where scaling wouldn't help.

You essentially need a metric for wallclock time exclusive to jobs that benefit from scaling.


That's a good point. One of the properties I typically instrument in queue-processing applications is the delay from when a message was inserted into the queue until it has been fully processed. The application emits this message-processing latency into the instrumentation system for monitoring and examination. I haven't ever tried using it for auto-scaling, though. I'll have to give it a try sometime.
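For SQS specifically, a sketch of that instrumentation might use the SentTimestamp attribute to compute how long each message sat in the queue (queue URL and namespace are placeholders):

    import time
    import boto3

    sqs = boto3.client("sqs")
    cloudwatch = boto3.client("cloudwatch")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        AttributeNames=["SentTimestamp"],
        MaxNumberOfMessages=10,
    )
    for msg in resp.get("Messages", []):
        sent_ms = int(msg["Attributes"]["SentTimestamp"])
        # ... process the message here ...
        latency_ms = time.time() * 1000 - sent_ms
        cloudwatch.put_metric_data(
            Namespace="MyApp/Workers",  # hypothetical namespace
            MetricData=[{
                "MetricName": "MessageProcessingLatency",
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }],
        )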

I've had the most success auto-scaling systems by measuring properties of system processing capacity: percentage of threads in use, percentage of maximum connections, CPU; things that are properties of the system rather than of the work it's processing. These all have two desirable properties: (i) they're relatively work-type agnostic, and (ii) you can begin scaling up when a utilization threshold is crossed, which is before there's any observable impact (such as messages being delayed an undesirable amount). Scaling on percent of thread capacity means you can decide when to scale up even while your queue size is zero and your message latency is approximately zero (i.e., before there has been any impact on the work you're handling, such as a delay).
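As a sketch of wiring such a utilization metric to an Auto Scaling Group (the group name, thresholds, and the custom metric are all assumptions, not anything from the article):

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # A simple-scaling policy that adds one instance when triggered.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-worker-asg",  # hypothetical group
        PolicyName="scale-up-on-thread-utilization",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )

    # Fire the policy when thread utilization crosses 70% -- well before
    # the pool saturates and a backlog starts to form.
    cloudwatch.put_metric_alarm(
        AlarmName="high-thread-utilization",
        Namespace="MyApp/Workers",  # hypothetical custom metric
        MetricName="ThreadUtilization",
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )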

That being said, there's also a flip side: once utilization is at 100%, you can't tell how far behind you are, or how much to scale up. To combat this, one establishes scaling rules aimed at preventing utilization from ever reaching 100%, which is feasible if your mean time to bring new capacity online is low enough to follow the demand curve. This is also where you could bring in wall-clock time as a measurement of how far behind the system is, and how much it needs to scale up to return to meeting its SLA.

(I should clarify that most of my automatic scaling experience is in the context of soft real-time systems with reasonably tight SLAs, not large batch systems that can afford to run large backlogs and that prefer to be at 100% utilization for efficiency.)


You'll want to make sure you don't autoscale just because a shared resource like a database is not working correctly. There's no point in having more instances up if the new ones don't work either.

Elastic Beanstalk can be very useful, especially for deploying new versions of your app. It has a CLI that makes it easy to SSH into your instances, application environments can be configured to save their logs to an S3 bucket, and it supports time-based scaling events, so you can run more hosts during the day if that's what you need.
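The time-based scaling maps to scheduled actions on the underlying Auto Scaling Group; a rough sketch via boto3 (group name and sizes are made up):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Run more hosts during business hours (cron recurrences are in UTC).
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="my-app-asg",  # hypothetical group
        ScheduledActionName="daytime-capacity",
        Recurrence="0 8 * * *",
        MinSize=4, MaxSize=10, DesiredCapacity=6,
    )
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="my-app-asg",
        ScheduledActionName="nighttime-capacity",
        Recurrence="0 20 * * *",
        MinSize=2, MaxSize=10, DesiredCapacity=2,
    )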


Good point, and that's when good monitoring comes into play.


I would be curious what variance in load would be required for this type of auto-scaling to be worth the time, versus just putting your own bare metal in a colocation facility or leasing dedicated boxes.

The price-performance is so much higher on dedicated. Or, for that matter, using reserved instances instead of scaling with spot instances.


It comes down to the business decision of going all-in with cloud or bare metal. Day-to-day management of each is pretty different.


Stateful applications should use vertical autoscaling to hot-add CPUs. While this has a lower ceiling than horizontal autoscaling, the two can be combined with sticky persistence to maintain state at scale.



