I turn on the computer, open the browser and try to buy a concert ticket. The page takes longer than usual to load. I try again, this time on my phone, with the same result. Annoyed, I go to Twitter or Facebook, find the page of the site that is causing me problems, and commiserate with the other frustrated fans, a group I now feel part of.
I think we have all faced this frustration when trying to get the concert tickets we have been waiting for, or to watch the new episode of a trendy series or a live broadcast. All services offered on the internet at massive scale share the same problem and, for users, it is highly irritating when they are unavailable. From the viewpoint of a company that provides services, these issues can mean the end, a great loss of audience confidence, and terrible publicity in the media.
To start analyzing how many painkillers the system admins will need, we can try to classify our service into one of these three groups:
- Case 1: My services will occasionally see a sharp growth in concurrent users for brief periods of time.
- Case 2: My services cannot fail; I cannot take the chance that one or several become unavailable.
- Case 3: My services do not allow for downtime and may suffer from occasional peaks of users.
For the first case, keeping a large idle infrastructure just in case is not worthwhile. Cloud providers offer ways to deploy virtual machines or containers quickly. In AWS this is called “Auto Scaling Groups” for regular EC2 instances or “Service Auto Scaling” for containers. Simply put, when certain alarms go off, the cloud automatically deploys containers or virtual machines up to a limit set by the admin, and removes them again when other thresholds are reached. This way, the capacity of the infrastructure increases rapidly, nothing sits idle, and the new elements are provided only for as long as they are used.
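The scaling behaviour described above can be sketched in a few lines. This is a minimal simulation, not the AWS API: the `ScalingPolicy` class, the `desired_instances` function and every threshold are hypothetical names and values chosen for illustration, and average CPU is assumed to be the alarm metric.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; names and thresholds are illustrative only."""
    min_instances: int = 2
    max_instances: int = 10      # upper limit set by the admin
    scale_out_cpu: float = 70.0  # alarm threshold (%) that adds capacity
    scale_in_cpu: float = 30.0   # threshold (%) below which capacity is removed


def desired_instances(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the instance count the group should move to after one evaluation."""
    if avg_cpu > policy.scale_out_cpu:
        target = current + 1   # alarm fired: add one instance
    elif avg_cpu < policy.scale_in_cpu:
        target = current - 1   # load dropped: remove one instance
    else:
        target = current       # within the comfortable band: no change
    # Never leave the range configured by the admin.
    return max(policy.min_instances, min(policy.max_instances, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    instances = policy.min_instances
    # Assumed CPU readings for a short traffic peak.
    for cpu in [45.0, 85.0, 92.0, 88.0, 40.0, 20.0, 15.0]:
        instances = desired_instances(instances, cpu, policy)
        print(f"avg CPU {cpu:5.1f}% -> {instances} instance(s)")
```

The group grows one step at a time while the alarm is active and shrinks again once the load drops, so capacity follows demand instead of sitting idle.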
For the second case, two solutions can be proposed, to be applied individually or together. Note that both also require orchestration, planning, large-scale testing and measurement beyond the specific items described here. For both solutions, using multiple regions around the world is suggested: this gives you distributed, ready-to-go infrastructure in different parts of the world (much like the advice not to put all your eggs in one basket).
The first is a multi-region failover, which is explained in this presentation.
The second is an application distributed across multiple regions where, thanks to the magic of DNS, users get better performance depending on their region, which also lets us focus on a particular region that may demand more attention. Its implementation is explained in the next presentation.
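To make the idea concrete, here is a minimal sketch of region-aware routing with failover. It is a simulation under stated assumptions, not how any particular DNS service is implemented: the `Region` class, the `resolve` function, the `example.com` hostnames and the latency figures are all hypothetical, and the health flag is assumed to come from some external health check.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    endpoint: str         # hypothetical hostname, for illustration only
    healthy: bool = True  # assumed result of an external health check


def resolve(latency_ms: dict[str, float], regions: list[Region]) -> str:
    """Return the endpoint of the healthy region with the lowest latency for this user."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    best = min(candidates, key=lambda r: latency_ms[r.name])
    return best.endpoint


if __name__ == "__main__":
    regions = [
        Region("us-east-1", "us.example.com"),
        Region("eu-west-1", "eu.example.com"),
    ]
    # Assumed latencies measured from a user located in Europe.
    latency = {"us-east-1": 120.0, "eu-west-1": 25.0}

    print(resolve(latency, regions))  # closest healthy region
    regions[1].healthy = False        # simulate an outage in the closest region
    print(resolve(latency, regions))  # traffic fails over to the remaining region
```

The same selection step covers both solutions: under normal conditions each user is sent to the nearest region, and when a region fails its traffic moves to the next best one.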
For the third case, we can think of an architecture built around multi-region failover, in addition to replication systems and recovery plans. This case is far more complex and there is no easy prescription: it calls for more customized planning that covers all the real needs of the business, with a view to prioritizing and optimizing the features that add value for the client.
A good practice many companies recommend is to use a CDN (Content Delivery Network) for several types of content, specifically static files such as video, images, HTML, CSS, or JS. In AWS, along with S3, Edge Locations can be used through CloudFront to serve files. How this works is explained in this video. This comes in handy for improving the performance and availability of our services.
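Why a CDN helps can be illustrated with a toy edge cache. This is a simplified sketch, not how CloudFront works internally: the `EdgeCache` class, the `origin` function and the TTL value are hypothetical, and the origin stands in for something like an S3 bucket.

```python
import time
from typing import Callable


class EdgeCache:
    """Toy edge location: serves cached files and contacts the origin only on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes], ttl_seconds: float = 60.0):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._store: dict[str, tuple[float, bytes]] = {}  # path -> (time stored, body)

    def get(self, path: str) -> bytes:
        now = time.monotonic()
        cached = self._store.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # cache hit: the origin is not contacted
        body = self._fetch(path)   # cache miss or expired entry: go to the origin
        self._store[path] = (now, body)
        return body


if __name__ == "__main__":
    origin_requests = []

    def origin(path: str) -> bytes:
        # Hypothetical origin; records each request so the saved load is visible.
        origin_requests.append(path)
        return f"contents of {path}".encode()

    edge = EdgeCache(origin)
    for _ in range(1000):
        edge.get("/static/logo.png")
    print(f"requests served: 1000, requests reaching the origin: {len(origin_requests)}")
```

During a traffic peak most requests for static files are answered at the edge, so the origin handles only a small fraction of the load.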
At this point many may be wondering: what if all of this goes wrong? What if demand exceeds what I can pay for, or what I expected? What should I do? First, let us remember the teachings of El Chapulín Colorado (the famous Mexican TV character): “be calm, be calm, do not panic”.
The most important thing is to acknowledge that problems exist. Users have to be told when the service is unstable and that you are working on the issue. Users often turn to social networks for answers, so using them is key in times of crisis. Communication has to be effective, and even a sense of humor can ease tension with users, as many of us saw on YouTube's error page, which displayed the message “A team of highly trained monkeys has been dispatched to deal with this situation”.
GitLab is another example of a company that manages communication well in a crisis. While the service was down for many hours because of a technical problem, the team decided to show the world how they were resolving the incident. You can see the stream here.
When the storm is over
Once the issue is resolved, the next step is to tell users what happened, showing that the situation was under control and that the service is actively working to improve and prevent future critical events. Two good examples are AWS and, again, GitLab, each with a post-mortem as a record of the incident.
Beyond all this, never forget that the most important thing is to learn from our mistakes, improve the process, be proactive in preventing future incidents and, of course, keep the emergency coffee ready.