Developer Series: Serverless Scaling in AWS


Welcome to the first post in our technical blog series. These articles showcase some of the more compelling problems and deep tech our world-class development team wrestles with as they build a fast, powerful and scalable actuarial platform that is used by life insurers globally.

This series is by developers, for developers, although anyone who’s interested in what makes our models tick might get a kick out of it too.

The Problem

Montoux applications run on custom machines in the cloud, ensuring that our model simulations can run on powerful hardware without requiring any work on the part of our customers. Due to the nature of how our customers work, applications tend to be run in high-intensity batches but with low frequency, often for a few days of a month, after which customers spend the majority of their time working with other parts of our platform to process the results of the simulation. Users also occasionally want to execute individual simulations at unpredictable times. We need to be sure that the modelling service is constantly available, and we want to scale it up or down depending on how intensely it is being used, but we can’t always predict when our customers will be using this particular part of our platform.

Our Approach

Amazon's EC2 auto scaling groups seemed like an appropriate solution to this problem. A launch configuration describes the application runner, specifying an initial machine image with the runner software installed, and is configured to use a sufficiently large EC2 type for the instances. An auto scaling group with this launch configuration can then handle creating and destroying the instances automatically in response to demand.

The first interesting problem was how to describe the demand on the runner service to the auto scaling group. Auto scaling groups can be configured with scaling policies, which specify how to respond to CloudWatch events by scaling the number of instances in the group. Scaling policies can attempt to keep the average CPU utilisation of the instances below a certain threshold, or explicitly change the number of instances in the group when a CloudWatch alarm fires.

Neither of these policies is suitable for our use case. We can be more precise about demand than how hard existing machines are working, and bringing up new instances based solely on an analysis of instances that are already in the group means that we can never completely empty the group when the service is idle. Responding to CloudWatch alarms also requires us to select a metric to base the alarm on, and it's not clear which of Amazon's metrics about the platform would be suitable.

Scaling policies don't actually change the number of instances in a group, but instead set the desired capacity of an auto scaling group, and the group automatically responds to a change in this value by adding new instances or terminating existing ones appropriately. Since the desired capacity of a group can also be set by hand, we could still use auto scaling groups to manage the instances for us, while manually managing the desired number of runner instances available at any given time. Amazon's wide variety of web services have allowed us to keep the resulting management infrastructure in the cloud, rather than having to couple instance management into our existing software platform. Our platform communicates simple usage metrics, and the web services take care of the rest.

The resulting infrastructure looks like this:


The entry points into the infrastructure are two Simple Notification Service (SNS) topics, used by our software to communicate events about user interactions with the platform. One topic receives events about user session activity, and the other receives events about queued and running applications. Both kinds of event include only basic numbers about the current state of the platform: the number of currently active sessions, or the number of application job runs that have been requested. These events are forwarded from the topics to corresponding Lambda functions.
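As a rough illustration of the receiving end, here is a minimal sketch of how a Lambda function might unpack such an event. It assumes the functions are written in Python and that the platform publishes a small JSON document as the SNS message; the `active_sessions` field is a hypothetical name, not one taken from the platform. The `Records` / `Sns` / `Message` nesting is the standard shape in which SNS delivers notifications to Lambda.

```python
import json


def extract_payloads(event):
    """Return the decoded JSON payload of every SNS record in a Lambda event."""
    payloads = []
    for record in event.get("Records", []):
        # SNS passes the published message through as a string, so a JSON
        # body (an assumption about the publisher) has to be decoded here.
        payloads.append(json.loads(record["Sns"]["Message"]))
    return payloads


# Example event with a hypothetical "active_sessions" field.
example_event = {
    "Records": [{"Sns": {"Message": json.dumps({"active_sessions": 3})}}]
}
payloads = extract_payloads(example_event)
```

A handler for the session topic would then read the count from each payload and pass it on to the scaling decision.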

Custom Capacity with Lambda Functions

The scaling decisions that the Lambda functions make depend on the particular kind of event that they are configured to receive. For instance, the Lambda function responding to application events can add as many instances as there are jobs running, up to the group's maximum, to avoid having any jobs waiting when there is sufficient room. Responding to session events is more nuanced: we want to ensure that the application service is available as quickly as possible, but an active session on the platform does not imply that the service is actually about to be used.

Lambda functions are stateless, and the script does not attempt to retrieve any other information about the platform that is not already conveyed in its event, so each function can only consider the information in the event when deciding how to adjust the desired capacity of the auto scaling group. The functions do have to look at the current capacity of the group, as their minimal view of the platform state means that they are only qualified to increase the desired capacity, never decrease it. Neither of these two ‘scale-out’ functions can see the data driving the other, so while a session event might imply that only one instance is necessary, an application event might show that a much larger number of instances is needed to handle all of the current model runs, and it would not be appropriate for the session function to remove those instances.
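The increase-only rule can be sketched as follows. This is an illustration under assumptions rather than the actual functions: `autoscaling` stands for a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`) passed in by the caller, and `wanted` is whatever instance count the event implies. The two client calls used, `describe_auto_scaling_groups` and `set_desired_capacity`, are part of the boto3 API.

```python
def target_capacity(current, minimum, maximum, wanted):
    """Clamp the wanted size to the group's bounds, but never shrink the group."""
    clamped = max(minimum, min(maximum, wanted))
    return max(current, clamped)


def scale_out(autoscaling, group_name, wanted):
    """Raise the desired capacity of a group if the event calls for more instances."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]
    current = group["DesiredCapacity"]
    target = target_capacity(current, group["MinSize"], group["MaxSize"], wanted)
    if target > current:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=target,
            HonorCooldown=False,  # respond immediately when scaling out
        )
    return target
```

Keeping the arithmetic in `target_capacity` separate from the API calls makes the rule easy to test without touching AWS. Clamping to the group's minimum and maximum also avoids the error that the API returns for an out-of-range desired capacity.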

Each Lambda function is also configured to point to a custom CloudWatch metric, and the functions write the event data to their metric every time an event arrives. These metrics provide us with a quick and easy overview of the activity of the platform from within AWS. The resulting metric statistics are aggregated, so we are not viewing the precise data that was written by the Lambda functions, but for an overview of the use of the system we are more interested in the general shape of the resulting graph instead of exact numbers.
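Writing the event data to a custom metric could look something like the sketch below. It assumes a boto3 CloudWatch client is passed in as `cloudwatch`; the namespace and metric name in the example are hypothetical placeholders. `put_metric_data` is the boto3 call for publishing custom metric values.

```python
def metric_datum(metric_name, value):
    """Build a single CloudWatch datum for a plain count."""
    return {"MetricName": metric_name, "Value": float(value), "Unit": "Count"}


def record_metric(cloudwatch, namespace, metric_name, value):
    """Publish one value to a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[metric_datum(metric_name, value)],
    )


# Hypothetical metric name, for illustration only.
datum = metric_datum("ActiveSessions", 3)
```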

The session and application metrics are useful for two purposes. The first is that they give us an easy way to view the usage patterns of the platform from the perspective of the Lambda functions, potentially providing insights that can feed back into improving the scaling decisions that they make. The second is to act as a form of state for a third Lambda function, triggered to execute by a timed event.

The final Lambda function's purpose is to decrease the capacity of the auto scaling group to its minimum when it is no longer being used. By gathering the maximum value of both metrics over a short period of time in the past, this ‘scale-in’ function can see if there is a low number of sessions and no current use of the application service, and reduce capacity if so. It is crucial that there are no runner instances currently working at the time, because decreasing the desired capacity arbitrarily terminates instances in the group: AWS does not know which of the runners are inactive and could easily destroy an instance that is currently working.
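A minimal sketch of that scale-in check follows, assuming boto3 clients are passed in. The metric names `ActiveSessions` and `RequestedJobs`, the 15-minute window, the session threshold, and the choice to treat an empty window as zero are all assumptions made for illustration. `get_metric_statistics` is the boto3 call for reading aggregated metric values.

```python
from datetime import datetime, timedelta, timezone


def recent_maximum(cloudwatch, namespace, metric_name, window_minutes, now=None):
    """Largest value reported for a metric during the trailing window."""
    now = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=now - timedelta(minutes=window_minutes),
        EndTime=now,
        Period=60,  # one aggregated datapoint per minute
        Statistics=["Maximum"],
    )
    # Assumption: an empty window means nothing was reported, so treat it as 0.
    return max((point["Maximum"] for point in response["Datapoints"]), default=0)


def should_scale_in(max_sessions, max_jobs, session_threshold):
    """Idle means no application runs at all and few enough sessions."""
    return max_jobs == 0 and max_sessions <= session_threshold


def scale_in(autoscaling, cloudwatch, group_name, namespace,
             session_threshold=1, window_minutes=15):
    """Drop the group to its minimum size if the platform looks idle."""
    sessions = recent_maximum(cloudwatch, namespace, "ActiveSessions", window_minutes)
    jobs = recent_maximum(cloudwatch, namespace, "RequestedJobs", window_minutes)
    if not should_scale_in(sessions, jobs, session_threshold):
        return False
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]
    if group["DesiredCapacity"] > group["MinSize"]:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=group["MinSize"],
            HonorCooldown=True,  # honour the cooldown when decreasing capacity
        )
    return True
```

Using the maximum rather than the average over the window is what makes the check conservative: a single non-zero job count anywhere in the window is enough to block the scale-in.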


Using Lambda functions over scaling policies clearly gives us much more flexibility, and in particular means that we can react to events from our platform instead of just data that is already available to AWS, but there are some negative aspects of interacting with the AWS auto scaling API.

There is no way to perform an atomic update to the desired capacity of an auto scaling group, so there is no guarantee that the capacity has not changed in between getting its value and setting it. This creates small windows of time where updates from two concurrent events might interfere with one another, and an update to the capacity from one scale-out function gets overwritten by a smaller value from the other. These windows of time are small, and the sequence of events that could cause this behaviour to be observed is also unlikely, but the possibility is still there.

We honour the cooldown period of the auto scaling group when decreasing capacity, so interference between the scale-in and scale-out functions is not possible, but we do not honour it when increasing capacity because we always want to immediately respond to requests for more instances. One potential solution to the problem of interference between scale-out functions could be to set a small cooldown period on the group (sufficient to cover the window of time that the problem could occur in), honour it when increasing capacity, and if the change is rejected because the group is still in its cooldown period then retry the whole event response again.
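The retry idea could be sketched as below. This is a hypothetical helper, not something described as implemented in the post: `attempt` is a zero-argument callable that performs the whole event response (read the group, decide, set the capacity with the cooldown honoured), and `rejected` is the exception type raised when the change is refused. With boto3 that would presumably be the client's `ScalingActivityInProgressFault`, but treat that name as an assumption to verify.

```python
import time


def retry_on_cooldown(attempt, rejected, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Run attempt() again whenever it is rejected, up to `retries` extra times."""
    for remaining in range(retries, -1, -1):
        try:
            # Re-running the whole response means the capacity is re-read,
            # so the decision is based on the latest value each time.
            return attempt()
        except rejected:
            if remaining == 0:
                raise  # give up and surface the rejection to the caller
            sleep(delay_seconds)
```

The delay would need to be at least as long as the group's cooldown period for a retry to have a chance of succeeding.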

A minor nitpick is that the minimum and maximum sizes on an auto scaling group are the min and max values of the desired capacity, not the actual capacity of the group, so attempting to set the desired capacity outside of these bounds returns an error. The Lambda function fetches all of this information when it requests the current capacity, so it is trivial to avoid the error, but it would be nice to be able to set the desired value outside of the bounds but keep the number of instances within them. The values of these group properties over time are written to CloudWatch metrics, and it would be useful to easily graph how often our actual desired capacity is outside of the bounds.

As mentioned above, the scale-in Lambda function only reduces the capacity of the auto scaling group when there is no activity on any of the existing instances, to avoid terminating running applications. This can be a problem if a burst of heavy activity is followed by a small amount of activity over a long period of time, which keeps a large number of unnecessary runner instances alive. Termination policies on auto scaling groups are not sufficiently expressive to describe active instances, but we could set scale-in instance protection on an instance when it begins an application run. Runners are already communicating application events to AWS, so adding information about which instance the event originated from would probably be sufficient to solve this problem and allow the scale-in function to decrease the desired capacity of the group without needing to wait until all activity has stopped.
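One way this proposal might look in code is sketched below. The event fields `instance_id` and `status` are hypothetical additions to the application events, and the sketch assumes each instance handles one run at a time. `set_instance_protection` is the boto3 call that toggles scale-in protection on instances in a group.

```python
def handle_run_event(autoscaling, group_name, event):
    """Protect an instance when a run starts and release it when the run ends."""
    # Assumption: "status" is "started" at the beginning of a run and
    # anything else (for example "finished") once the run is over.
    protected = event["status"] == "started"
    autoscaling.set_instance_protection(
        InstanceIds=[event["instance_id"]],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=protected,
    )
    return protected
```

With protection in place, lowering the desired capacity would only terminate unprotected instances, so the scale-in function would no longer have to wait for the whole group to go quiet.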

Overall we're very happy with the infrastructure that AWS makes available. The auto scaling group has removed any need to manually manage the number of instances available to the application service, and if any particular automated system in AWS isn't customisable enough for our purposes then we can just outsource the decision-making to Lambda. Producing custom metrics about our platform was an added bonus that will help us make decisions about how to adjust the scaling infrastructure in the future.

About the Author

Timothy Jones, PhD is a Senior Software Engineer at Montoux, and earned his doctorate in Programming Language Design from Victoria University. Tim’s main interests and areas of expertise include Type Theory, Functional Programming, and Metaprogramming.