Incident Management at Akamai
Our first line of defense is a resilient system design that allows our software to compensate for many changing conditions and possible points of failure. We maintain an array of sensors, logs, and measurements that allow us to address many problems through normal operational procedures before a customer can see their effects.
When a customer issue can't be solved by technical support within Customer Care, or when our sensors detect a problem outside of normal operations, we declare an incident. Incidents are regularly handled through a cooperative effort among engineering/systems development, network operations, and Customer Care personnel. We grade incidents from 4 (mild) through 1 (severe); in general, the more severe the incident, the more people work on it.
In all incidents, the goals are resolving the problem quickly, keeping customers informed and happy, ensuring the network is safe, and focusing the work of those on the incident while minimizing the impact on the rest of the company.
We regard our incident process as one of the security measures on the Akamai system. So do our auditors.
Incidents normally start in phase one, which lasts until the immediate problem is controlled. In phase two, we work to return the system to normal operation. Often, customer communication is a focus in phase two. Phase three is when we learn from the incident and take longer-term measures for future safety.
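As a rough illustration only, the severity grades and three phases described here can be modeled as a small state machine. This is a minimal sketch, not Akamai's actual tooling: the names `Incident`, `Phase`, `CONTROL`, `RESTORE`, `LEARN`, and `advance` are hypothetical labels chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    # Member names are hypothetical; the text only numbers the phases.
    CONTROL = 1  # phase one: lasts until the immediate problem is controlled
    RESTORE = 2  # phase two: return the system to normal operation
    LEARN = 3    # phase three: learn and take longer-term measures


@dataclass
class Incident:
    severity: int                 # graded 4 (mild) through 1 (severe)
    phase: Phase = Phase.CONTROL  # incidents normally start in phase one
    closed: bool = False

    def __post_init__(self):
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be between 1 (severe) and 4 (mild)")

    def advance(self):
        """Move to the next phase, or close the incident after phase three."""
        if self.closed:
            raise RuntimeError("incident is already closed")
        if self.phase is Phase.LEARN:
            self.closed = True
        else:
            self.phase = Phase(self.phase.value + 1)
        return self.phase
```

In this sketch, `Incident(severity=2)` starts in `Phase.CONTROL`, and each call to `advance()` moves it one phase forward until it is closed after phase three.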
For all severity levels, we have an incident manager on hand to evaluate the severity of a situation and coordinate with others working on the problem. Many employees can receive incident management training and volunteer as an incident manager when an issue arises.
In fact, most technical departments in the company have people who are trained to step in and manage the incident with other departments. This cross-disciplinary incident manager coordinates a short-lived project team that forms when needed and then disbands. Participants temporarily put aside their primary duties to focus on the incident at hand.
The following is a breakdown of the roles employees take on to deal with a typical incident.
- The response manager leads the temporary team working on incident resolution. This individual is primarily a focal point for communication, and is expected to get help from others as needed.
- In a severe incident requiring involvement of members of the executive team, another individual keeps the executive team involved and helps guide decisions that widely affect the business.
- The Network Operations Command Center (NOCC) monitors the deployed network and provides technical and communications support to the response manager.
- An individual from Akamai's customer service department receives escalations of technical incidents that have customer impact. This individual is responsible for customer communications in most incidents.
- Subject matter experts and customer service technical leads provide detailed technical information and debugging support. Some more severe or complex incidents require other specialists to join the team.