We are pleased to share the following article, jointly written by Pierre-Malo Bovet, Network Engineer, and Akéla Bendjeddou, Marketing & Communication Manager at France-IX. Happy reading!
No technical operation, whether initiated by a human or a machine, is infallible. Breakdowns happen; we all know that.
When an incident occurs, what should be communicated? How often? To whom? How? These questions are a headache for those of us for whom quality of service is a top priority, because there is no single set of instructions, only best practices.
To the question "Should we communicate during an incident?", the answer is definitely yes, as soon as the service is impacted. But who is impacted when it comes to the Internet? In the case of France-IX, an outage affecting a single high-capacity member can have as much impact as an incident affecting ten smaller members combined. It is therefore necessary to evaluate, or at least estimate, very quickly which members are affected in order to decide on the communication approach to follow.
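As a minimal sketch of that idea, impact can be weighted by member port capacity rather than counted per member. The member names, capacity figures and notification threshold below are invented for illustration; they are not France-IX's actual method.

```python
def estimate_impact(affected_ports, threshold_gbps=100):
    """Summarise an incident's estimated scope from a list of
    (member, capacity_gbps) pairs for the affected ports.

    Returns the member count, total affected capacity, and a flag
    suggesting whether a broad notification is warranted. The
    threshold is a hypothetical example value.
    """
    total_gbps = sum(capacity for _, capacity in affected_ports)
    return {
        "members_affected": len({member for member, _ in affected_ports}),
        "total_capacity_gbps": total_gbps,
        "broad_notification": total_gbps >= threshold_gbps,
    }

# One 100G member can weigh as much as ten 10G members combined:
big = estimate_impact([("member-A", 100)])
small = estimate_impact([(f"member-{i}", 10) for i in range(10)])
assert big["total_capacity_gbps"] == small["total_capacity_gbps"] == 100
```

The point of the sketch is only that a capacity-weighted estimate can be computed in seconds from port data already on hand, which is what makes a fast communication decision possible.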
Then comes the question of timing. When should you start communicating once the problem is detected? Great caution is needed here, and a balance must be struck between communicating too early and not communicating early enough. Communicating too early, with preliminary and therefore incomplete elements, means risking having to go back on your first analysis and losing credibility with your audience, for example by revealing a false positive if the impact was badly estimated at first, which is humanly possible. Not communicating immediately, on the other hand, can be seen by some customers as a lack of transparency, or even as withholding information, and can trigger bad buzz, which is extremely damaging to a company's image and to the perception of its quality of service.
The criterion that seems to prevail is therefore precaution. The France-IX technical team, for example, communicates in several stages: an initial alert informs members of the incident and specifies the estimated scope of the impact, then regular updates follow as the situation evolves, gradually enriching the information with the results of our investigations. We use this approach when the incident impacts several members. As a rule of thumb, an update at least once an hour at the beginning of the incident is generally appreciated.
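The staged cadence above can be sketched as a trivial timer check; the one-hour interval comes from the text, while the function and field names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed cadence from the approach described above: at least one
# status update per hour while the incident is open.
UPDATE_INTERVAL = timedelta(hours=1)

def next_update_due(last_sent, now):
    """Return True when the next incident update should go out."""
    return now - last_sent >= UPDATE_INTERVAL

# Example: alert sent at 09:00; at 09:30 no update is due yet,
# at 10:00 one is.
alert_time = datetime(2023, 1, 1, 9, 0)
assert not next_update_due(alert_time, datetime(2023, 1, 1, 9, 30))
assert next_update_due(alert_time, datetime(2023, 1, 1, 10, 0))
```

In practice the content of each update matters far more than the timer, but encoding the cadence keeps the "communicate regularly" commitment from depending on memory during a stressful incident.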
This intermediate solution offers as much transparency as possible to our members and avoids the long delays that would result from waiting until enough concrete elements were in our possession.
Another important issue is the level of information to provide. When the audience is technical, should we go into the details of the architecture? Providing updates every 15 minutes, and more detailed ones at that, requires enough resources both to resolve the incident and to communicate at the same time.
Another thorny question arises when the company has a communication department: should it be the one relaying the information? Can it be reactive enough, and technical enough, to deliver the information without mistakes that could in turn have perverse effects?
For that option to work, the transfer of information between teams must be very smooth, with optimal responsiveness on all sides. But once again, this is a question of resources, and each company does as it wishes and, above all, as it can.
At France-IX, announcing an incident without giving its nature means facing massive de-peering, i.e. members interrupting their peering sessions en masse. These interruptions could then be interpreted as consequences of the incident when they are not directly linked to it. Such chain reactions inevitably lengthen the analysis, because the incident must be analysed as a whole and, as with any experiment, measurement and interpretation biases skew the results.
On the other hand, if the technical department is in charge of communication, is that time that could have been spent resolving the incident instead, especially when resources are limited? The answer still comes down to arbitration, as no company has unlimited technical or human resources.
The RFO (Reason For Outage) is therefore of paramount importance for our members, who must be able to understand the impact of the incident and justify it to their hierarchy or their own customers.
The temptation can then arise to say as little as possible in the RFO in order to avoid debates. Everyone will agree that the more detail you add, the more you expose yourself to criticism. The right mix of information is therefore essential, because some people are tempted to open discussions, on the network architecture for example, that do not necessarily need to be held, when they lack the history of the infrastructure or sufficient context to judge the situation, especially during an incident where urgency prevails.
At France-IX, we provide a certain amount of detail on the infrastructure, but when we publish an RFO we do not necessarily have all the elements available right away. The smooth running of the platform also depends on trusted partners, and communicating too quickly, or at least before having sufficiently analysed the situation, can also be detrimental to them. Since the RFO nevertheless needs to be published very quickly, we have decided to be reactive and to complete it later if necessary.
In any case, communication is always useful, relevant and important, regardless of the content. We all learn from incident communication: providing public information leads to relevant feedback and recommendations that move the community forward.
Admitting that you had an incident can sometimes feel shameful. The causes are often human because, yes, humans make mistakes, and sometimes silly ones. We establish processes, build resilient architectures and test them in labs to see how the equipment reacts, but even an infrastructure of exceptional quality is not infallible.
It is important to dispel the climate of shame surrounding communication about technical incidents, because blaming, or even shaming, the players, big or small, who opt for transparency is counterproductive for everyone and encourages opacity. We must encourage transparency for the collective benefit, because it is an incredible lever for moving the whole community forward. Think of this the next time you receive an incident notification, even if the explanation seems silly, incongruous or incomplete. See the transparency effort behind that communication as a proof of humility, an opportunity for learning and a vector for improvement.
We would be very curious to hear your opinion in the comments: in your view, what is the best way to communicate?