Holiday Readiness, Part Two — What you Should be Thinking About Three Months Out: Capacity Planning
This is the second post in a blog series about Akamai solutions that can help you manage the surge of traffic (both good and bad) that will be hitting the retail industry during the holiday season. Read part one and part three.
Welcome back to the Holiday Readiness blog series. We hope part one has kept you busy over the past month as you continue to improve your security posture. If you haven’t finished all of the security checklist items, don’t worry — there is still time before Black Friday and Cyber Monday.
For this blog, we will be talking about managing flash crowds, disaster recovery (DR), and communication strategies. When we have these conversations with our customers, there are many directions they can go in, primarily because each customer’s infrastructure and strategy are different. Some customers may have a single data center serving all of their content, some might use two, and others may use at least three scattered all over the world. No matter the type of setup, there are four main questions that you should be asking yourself as you prepare for the rush of holiday traffic:
Can my current infrastructure sustain two to six times the traffic of a normal day?
If the Product or Marketing team promotes a major deal/discount, will my websites and applications be ready to handle the burst of short-term traffic?
Are my websites and applications set up for failover in the event there are issues?
Do I have sufficient alerts and monitoring in place to catch issues quickly?
As you read through this blog, each of these questions will be addressed so you can start having conversations with your teams as you prepare for the busiest time of the year.
Stress/load testing
Year over year, more people are coming online and staying continuously connected across thousands of devices. The internet is more accessible than ever as modern ISPs expand rural coverage via satellite connections at speeds that rival their terrestrial competitors. On top of that, the COVID-19 pandemic has forced millions to work from home over the past year and a half, making online deals and flash sales even more popular than before. For instance, in April 2020, Akamai observed a 30% increase in global internet traffic in just a few short weeks, which is the equivalent of an entire year’s worth of growth in internet traffic. If you haven’t started already, now is the time to begin setting up stress and/or load tests with a solution such as Akamai CloudTest to identify any weak points.
Start to think about the critical functions of your website or application that need to remain functional during the increased traffic we’re going to see in November. If you are reading this blog, there is a good chance you are already using a CDN like Akamai to reduce your origin server load by caching static objects and pages at the edge. But what about truly non-cacheable content? Things like personalization, certain database calls, and the checkout flow are all critical components that are extremely difficult, if not impossible, to cache due to the unique nature of the content being delivered to the end user. And since they are not cacheable, that means those requests need to make their way back to your infrastructure every single time.
Now is also a good time to review any major site changes from the previous year. If there has been new functionality added to the website, does it need to be tested to see how well it will handle a sustained high server load?
Review traffic throughout the year
Understanding your traffic levels over the past year will help determine what your baseline should be for testing this year. If you are utilizing Akamai mPulse for real user monitoring, you can look back at up to 13 months of your historical traffic, which will give you more than enough data to help with that determination.
Once you have a view of last year’s traffic, identify the weeks or months that sustained the highest traffic levels to develop an idea of where your baseline should be. Also, review any reports or graphs you may have from last year to determine whether there were issues at those previous load levels, so adjustments can be made.
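As a rough illustration of turning historical traffic into a baseline, the Python sketch below averages the busiest weeks of the year. The function name, the choice of three weeks, and the sample numbers are all assumptions made for this example; they do not come from mPulse or any Akamai report.

```python
from statistics import mean


def peak_baseline(weekly_requests, top_n=3):
    """Return the average of the top_n busiest weeks."""
    if not weekly_requests:
        raise ValueError("weekly_requests must not be empty")
    busiest = sorted(weekly_requests, reverse=True)[:top_n]
    return mean(busiest)


# Hypothetical weekly request counts in millions (illustrative, not real data).
weekly_requests = [40, 42, 41, 50, 60, 43, 39, 70, 44]
baseline = peak_baseline(weekly_requests)
print(baseline)
```

Averaging a few peak weeks, rather than taking the single highest week, keeps one unusual spike from setting the baseline on its own. Which approach fits best depends on how your traffic behaves.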
Time to test
Given how uniquely customer websites and applications can be built, there is no definitive number of concurrent users that you should be working toward handling. Akamai has been supporting our retail customers throughout the holiday season for years, and we typically observe retail traffic increase to between two and six times the baseline.
We’ve covered load testing and sustained traffic levels, which infrastructure with the right setup can ultimately scale to support. But what about flash crowds?
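To make the two-to-six-times range concrete, here is a small Python sketch that converts a baseline request rate into load test targets. The baseline of 1,500 requests per second and the intermediate 4x step are assumptions for illustration only.

```python
def load_test_targets(baseline_rps, multipliers=(2, 4, 6)):
    """Map each multiplier to the request rate a test run should sustain."""
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    return {m: baseline_rps * m for m in multipliers}


# Hypothetical baseline of 1,500 requests per second.
targets = load_test_targets(1500)
for multiplier, rps in targets.items():
    print(f"{multiplier}x baseline: {rps} requests/second")
```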
Managing flash crowds
Let’s start with an example to set the tone: At the last minute, the Marketing/Product teams decide they want to generate a lot of interest for a new product they are promoting with a special 50% off discount. They partner with a celebrity, who has millions of followers across all of their social media channels, to create a post and a video promoting the product. In addition to that, the Marketing team will be deploying millions of emails and text messages to their loyal customers at the same time. They have never run this type of deal before, and it will only last for one hour, so the team expects demand to be very high. At 1 PM, the team who manages the website receives an email from Marketing that this huge event is taking place the same day … an hour later at 2 PM.
That example can end in a few different ways, depending on how prepared the team was to handle a short, volumetric spike in traffic to its infrastructure:
They have a solution in place to throttle incoming traffic in a controlled manner.
They were unprepared and the website experienced an outage.
Their infrastructure is already built to handle a large influx of traffic.
Below we will talk about the first two outcomes in a bit more detail. If your infrastructure is already built in a way to scale for flash crowds, then you’re already doing a great job! However, the cost of spinning up and maintaining the additional infrastructure is something to think about.
In the “Stress/load testing” section above, we briefly touched on caching and how a CDN like Akamai can help reduce server load by storing cacheable content at the edge. But a critical workflow such as checkout can be quite difficult, if not impossible, to cache. Therefore, you may want to explore implementing a traffic throttling solution such as Visitor Prioritization, which will allow you to selectively control the amount of incoming traffic to your infrastructure.
By rerouting your end users to a customer-branded virtual “waiting room” that is hosted on Akamai NetStorage, you’ll have peace of mind knowing that your infrastructure will remain healthy while still allowing users to purchase their items in a controlled fashion. During the event, keep an eye on your infrastructure health to determine if you need to throttle up or down. Keep in mind, in addition to the Akamai Control Center user interface, you can also use our Cloudlets API or Visitor Prioritization CLI to adjust threshold levels programmatically.
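The general idea of letting only a percentage of visitors through can be sketched in a few lines of Python. This is a generic illustration and not how Visitor Prioritization is implemented; the function name and the hashing scheme are assumptions made for the example.

```python
import hashlib


def is_admitted(visitor_id: str, allow_percent: int) -> bool:
    """Decide whether a visitor goes to the origin or to the waiting room."""
    if not 0 <= allow_percent <= 100:
        raise ValueError("allow_percent must be between 0 and 100")
    # Hash the visitor ID into a stable bucket from 0 to 99 so the same
    # visitor gets the same answer for a given threshold.
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < allow_percent


print(is_admitted("visitor-123", 25))
```

Because each visitor maps to a fixed bucket, raising the threshold only ever admits more people; nobody who was already let in gets sent back to the waiting room.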
Traffic control strategies
As you think about traffic throttling, a common question from customers is “How much traffic should I initially let in?” This can be a difficult question to answer unless you’re aware of all of the event details. If you’re unsure of the amount of traffic, then your best bet is to take the conservative approach and allow 0% to 25% of traffic in. Once the event begins and you have a better idea of traffic demands, you can increase the threshold incrementally while monitoring your server health.
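The incremental approach described above might look something like the following Python sketch. The CPU thresholds, the 10-point step, and the function name are assumptions chosen for illustration; use whichever health signal and step size suit your own infrastructure.

```python
def next_allow_percent(current, cpu_utilization, step=10,
                       high_water=0.80, low_water=0.60):
    """Suggest the next admission threshold based on origin CPU load."""
    if cpu_utilization >= high_water:
        # Servers are struggling: let fewer visitors through.
        return max(0, current - step)
    if cpu_utilization <= low_water:
        # Servers have headroom: let more visitors through.
        return min(100, current + step)
    # In between, hold steady and keep watching.
    return current


# Start conservatively at 25% and react to a hypothetical 50% CPU reading.
print(next_allow_percent(25, 0.50))
```

Leaving a gap between the two thresholds keeps the setting from flapping up and down when load hovers around a single cutoff.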
Visitor Prioritization is also useful outside of the flash crowd use case, in a disaster recovery scenario (read more about DR strategies below). If your infrastructure has experienced issues and you’re in the middle of an outage, you can repurpose Visitor Prioritization to ease traffic back onto your website as your servers recover. This still allows you to provide a branded experience to your end users while you incrementally allow traffic back onto the website.
Pro Tip: Some of our customers keep Visitor Prioritization enabled in an “always on” state, but allow 100% of the traffic through. This allows them to use the solution as a “panic button,” so to speak, to enable the virtual waiting room quickly.
Site failover and origin health
Unfortunately, there are occurrences where things can go wrong on the origin. A component can fail, and there is no backup database or throttling solution in place to reroute traffic and keep sales flowing. If that is the case, the next best thing you can do is serve an error or maintenance page from Akamai NetStorage. Even though the website may be down, users will still have a branded experience informing them that there are site issues while teams work to correct them.
Site Failover is a feature included with Ion, the solution that enables routing your website/application through the Akamai CDN. The image below shows a basic example of how to configure it in the event there are issues with your infrastructure. In this example, the Akamai edge server will react to either an origin connection timeout or when it receives an HTTP response status code matching 500 through 504. When this happens, users will be routed to Akamai NetStorage, and the file “page.html” within the maintenance directory will be displayed to the end user. Page.html is a basic HTML page that you have total creative control over. For more advanced implementations or questions, reach out to your Akamai account team.
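Expressed as code, the decision described in this example comes down to a single check. The Python sketch below is only an illustration of that logic, not Akamai configuration syntax; the function name and the exact form of the maintenance page path are assumptions.

```python
# HTTP status codes 500 through 504 trigger the failover in this example.
FAILOVER_STATUSES = range(500, 505)

# Assumed path form for "page.html" within the maintenance directory.
MAINTENANCE_PAGE = "/maintenance/page.html"


def should_fail_over(status_code, timed_out):
    """Return True when the maintenance page should be served."""
    if timed_out:
        return True
    return status_code is not None and status_code in FAILOVER_STATUSES


print(should_fail_over(503, timed_out=False))
```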
If you already have Site Failover configured for your website, now is a good time to review and make any necessary changes to the content on your maintenance page. For example, some of our customers will add holiday flair to their maintenance pages to maintain a consistent theme or look if they update their website to use a holiday theme.
Another feature that you should consider implementing is Origin Health Detect (OHD). The purpose of OHD is to track unsuccessful connection attempts from the Akamai edge servers to your origin. By controlling the number of hits to your origin when it is faulty or unavailable, you reduce the number of times Akamai attempts to establish a connection, which allows a faster failover to your maintenance page. See below for a basic example of how to configure this behavior using best practices. Be advised that misusing this behavior can actually increase hits to your origin. Consult with your Akamai account team for more information.
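To illustrate the general idea of tracking failed connection attempts and backing off, here is a generic Python sketch in the style of a circuit breaker. It is not how OHD is implemented or configured; the class name, the threshold of three failures, and the 30-second retry window are assumptions made for the example.

```python
import time


class OriginHealth:
    """Skip the origin for a while after repeated connection failures."""

    def __init__(self, failure_threshold=3, retry_after_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._retry_after = retry_after_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._unhealthy_until = float("-inf")

    def should_try_origin(self):
        """True once any back-off window has passed."""
        return self._clock() >= self._unhealthy_until

    def record_success(self):
        self._consecutive_failures = 0

    def record_failure(self):
        self._consecutive_failures += 1
        if self._consecutive_failures >= self._threshold:
            # Stop hitting the origin until the retry window expires.
            self._unhealthy_until = self._clock() + self._retry_after
            self._consecutive_failures = 0


health = OriginHealth()
print(health.should_try_origin())
```

While the origin is marked unhealthy, requests can go straight to the maintenance page instead of waiting on another connection timeout.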
Next we will talk about disaster recovery (DR). But first, let’s talk about what that really means and how it differs from the “Site Failover” topic covered above. While having failover ability is certainly a large part of a disaster recovery plan, there are additional areas that ultimately come together to form the full “strategy.” For example, data backups, failover, high availability, data avoidance, data redundancy, alerting, and monitoring all contribute to the overall DR plan in the event of an outage or server failure. Akamai can help in a couple of those areas, so let’s go through them below:
Earlier, we talked about the “Site Failover” feature and how it can help provide a branded experience in the event of an outage. But what we didn’t talk about in that scenario were multi–data center configurations. If you are running multiple data centers to provide high availability and redundancy for your users, a solution like Akamai Global Traffic Management (DNS-based routing) or Application Load Balancer Cloudlet (DNS-based plus application layer–based routing) can help reroute your traffic to a second or third data center if there is an issue.
Using Liveness Test agents deployed around the world, Akamai is able to detect origin issues quickly and reroute traffic to your secondary data centers so your users can continue browsing your website. See the “Monitoring and alerting” section below to learn how to configure an alert in case your data center becomes unavailable.
If you are using either of the solutions mentioned above today, start to review your Global Traffic Management or Application Load Balancer configurations to make sure they are set up correctly to fail over to your secondary data center in the event there are issues. That includes checking that your Liveness Tests are configured correctly and running successfully. See below for an example of a Liveness Test within Global Traffic Management.
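When reviewing your failover order, it can help to think of it as a simple preference list. The Python sketch below is a conceptual illustration only, not Global Traffic Management configuration; the data center names and the function are hypothetical.

```python
def pick_data_center(preference_order, liveness):
    """Return the first data center whose liveness test is passing."""
    for data_center in preference_order:
        # A data center with no test result is treated as unavailable.
        if liveness.get(data_center, False):
            return data_center
    return None


# Hypothetical liveness results: the primary is failing its test.
liveness = {"primary": False, "secondary": True, "tertiary": True}
print(pick_data_center(["primary", "secondary", "tertiary"], liveness))
```

Treating a missing result as a failure is a deliberately cautious choice here: a misconfigured Liveness Test then shows up as a failover rather than as traffic sent to a data center nobody is checking.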
While the term “high availability” is normally attributed to availability across data centers, Akamai also provides highly available routing through SureRoute, a feature that is part of Ion and Dynamic Site Accelerator. SureRoute is designed to identify the fastest path across our network between an end user and your infrastructure. It also continuously monitors congestion and outages across the internet. When high congestion or an outage is identified, SureRoute will intelligently route traffic around the degraded areas. For example, if an underwater internet cable is suddenly cut, Akamai will automatically find the next best available route to your infrastructure. It may not be the fastest route at that time, but the user will still be able to browse the website or application.
Our edge servers do this by running “races” to a single, small, uncacheable file deployed on your infrastructure across multiple potential Akamai routes. (Typically, if you have a load balancer in front of your infrastructure, the file can be deployed there. Otherwise, that file should be placed on each origin server that you define in your Property Manager configuration.) Once the race is complete, the edge server measures which route is the fastest and will use that route for the next 30 minutes (this is configurable). Users who connect to the same edge server as others will take advantage of that same route since it was determined to be the fastest path from that location.
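The race-then-reuse pattern described above can be sketched in Python as follows. This is a simplified illustration of the concept rather than SureRoute itself; the class, the route names, and the timing values are all hypothetical.

```python
import time


class RouteCache:
    """Remember the fastest route from the most recent race."""

    def __init__(self, ttl_seconds=30 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._route = None
        self._expires_at = 0.0

    def best_route(self, race):
        """race is a callable returning {route_name: seconds_taken}."""
        now = self._clock()
        if self._route is None or now >= self._expires_at:
            timings = race()
            if not timings:
                raise ValueError("race returned no routes")
            # The route with the lowest time wins the race.
            self._route = min(timings, key=timings.get)
            self._expires_at = now + self._ttl
        return self._route


# Hypothetical race results, in seconds.
cache = RouteCache()
print(cache.best_route(lambda: {"route-a": 0.120, "route-b": 0.080}))
```

Reusing the winner for a fixed period means most requests skip the race entirely, at the cost of reacting to a route change only when the next race runs.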
Monitoring and alerting
Monitoring your traffic and being notified when something unplanned happens are critical components that not only contribute to DR strategy, but also allow you to understand the overall health of your website or application over time. Within the Akamai Control Center, we have a number of alerts that you can configure to monitor the health of your website on client connections to the edge and also to help you monitor issues occurring on your origin. When an alert is triggered, you will be notified via email or SMS.
Below is an example of the various alerts that can be configured to monitor the health of your origin. This should by no means replace your existing Application Performance Monitoring tooling. However, implementing the alerts below to notify you of connection timeouts or a high rate of 5xx errors will help you confirm that there is indeed an issue so you can react quickly.
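If you also want a quick check of your own on origin responses, a 5xx rate threshold is easy to express. The Python sketch below is a generic illustration, not an Akamai Control Center alert definition; the function names and the 5% threshold are assumptions.

```python
def error_rate(status_codes):
    """Return the fraction of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes, threshold=0.05):
    """True when the 5xx rate in this window exceeds the threshold."""
    return error_rate(status_codes) > threshold


# Hypothetical window of 100 responses, 10 of them 503s.
window = [200] * 90 + [503] * 10
print(should_alert(window))
```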
If you are utilizing Akamai Global Traffic Management for DNS-based load balancing and failover, an additional alert that you should consider configuring is the “Customer Datacenter Down” alert. If the Global Traffic Management Liveness Test agents detect errors at one of your data centers, this alert will notify you. By the time you receive this notification, Global Traffic Management should have already initiated the automatic failover to your other data center(s).
Learn how to configure these alerts for your Akamai websites and applications.
For customers who have our Premium 3.0 Service and Support package, 24/7/365 proactive monitoring is an included feature that will enable an Akamai technical support engineer to notify you that an alert has triggered. When they reach out to you, they will provide the logs that caused the alert to trigger, as well as corrective measures you can take to resolve the issue. For more information, work with your aligned Akamai account team to configure the relevant alerts along with their thresholds.
The final topic we will discuss, which is just as important as (if not more important than) everything discussed above, is communication. Knowing who to contact and when, internally at your own company or externally with your vendors, is key to identifying and solving issues quickly. Does the DNS or Networking team need to be involved? Does Marketing need to be in the room? How about the infrastructure team? Having representation from the teams who are critical to the success of the holiday event all together in the same room (or chat room!) will set you up for success.
This is also a great time to start breaking down silos that may exist amongst teams to make sure everyone is on the same page. If there will be a “war room,” identify who needs to be there, physically or virtually, to make sure questions are answered and status updates are provided quickly. One tool that we’ve found extremely useful at Akamai during the holiday season is chat rooms in our internal messaging solution. Given the number of customers we are supporting during the busiest time of the year, it has been an efficient means to keep all teams informed and provide updates as needed. You may want to explore this at your organization if it is not something you do already.
Also, as you prepare for the rush of traffic with your teams, identify and agree on the key metrics that will help you distinguish success from failure: website performance, revenue, user engagement, specific product quota reached, etc. You should identify the values within those metrics that you consider in-range and out-of-range so they can be acted upon (see the “Monitoring and alerting” section above to help with that).
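One lightweight way to record the agreed ranges is a simple lookup that every team can read. The Python sketch below is only an illustration; the metric names, ranges, and readings are hypothetical.

```python
def out_of_range(metrics, ranges):
    """Return the names of metrics that fall outside their agreed range."""
    flagged = []
    for name, value in metrics.items():
        if name not in ranges:
            continue  # No agreed range for this metric yet.
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical agreed ranges and current readings.
ranges = {"page_load_seconds": (0.0, 3.0), "orders_per_minute": (50, 10_000)}
metrics = {"page_load_seconds": 4.2, "orders_per_minute": 120}
print(out_of_range(metrics, ranges))
```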
Start to reach out to your partners and vendors to understand the various metrics and dashboards that they can provide to help you gain an overall view into your holiday traffic. For example, you may want to display Akamai mPulse on one of your monitors to track your real-time performance and business metrics. Another option is the Akamai Event Center dashboard, which will provide breakdowns of overall bandwidth consumption, hits, and HTTP statuses throughout the duration of the event. And finally, you can use the Akamai Traffic Reports to understand your overall traffic and offload statistics.
As we make our way closer to the rush of holiday traffic, making sure that you understand how much stress your website or application can handle, knowing the adjustments that are needed to handle that traffic, and confirming what your plan is in the event of an outage are all things that are going to set you up for success. Knowing these pieces of vital information will allow you to react quickly if any issues arise. And remember, communication is at the root of all of these topics.
Tune in next month, when we’ll be discussing general performance recommendations and optimizations that you can add to your Akamai configurations to ensure you are in tip-top shape for the holidays. Wishing you a successful holiday season!