In this article I want to talk a little more about the post-mortem and how it directly affects security and compliance, specifically the availability and integrity of data and systems. But first, what is a post-mortem?
The post-mortem is a meeting held at the end of a project or procedure. It is a kind of retrospective in which the team analyzes what was done, what went well, and what can be improved.
Here at Convenia, in addition to holding this ritual at the end of a project or procedure, we also hold it after a technology-related incident, such as the failure of a service. The meeting helps us understand why the failure occurred and improve our services by changing organizational processes to incorporate the lessons learned. A post-mortem is more than just fact-finding.
The most important functions of a post-mortem are to promote process improvements and to establish best practices for repeating successes. It is usually a quick meeting, lasting at most one hour, in which we gather the team responsible for the service affected by the incident and detail what happened, why it happened, how long it lasted, and what we can do to prevent it from happening again.
In mid-2022, we felt the lack of a metric for measuring the frequency and impact of these incidents. That is when I decided to implement the RIT, our Technology Incident Report.
What is done in case of an incident or unavailability
When we are affected by an adverse event, we follow a workflow consisting of 3 steps:
Identify the problem (the team responsible for the service in question is mobilized to discover the root cause);
Resolve the problem quickly (either definitively, or provisionally with a more structured task scheduled for the future);
Define preventive measures and lessons learned for the problem in question. Afterward, we fill out an RIT (Technology Incident Report).
What is the RIT?
This document is completed during the post-mortem meeting: we fill in each point as we discuss it, and at the end we review what we wrote. We start with a brief description of what happened, followed by the severity of the incident, the type of impact, the origin of the alert, and which sector the incident was reported to.
After this initial analysis, we go on to detail the incident, stating its category, such as unplanned change, non-compliance with our security policy, backup failure, or some other category not yet cataloged. With the category defined, we describe what happened, the extent and impacts of the incident, its causes, and the areas involved.
To conclude the detailing, we describe in as much detail as possible how the incident was handled, including the actions taken to treat or work around it. Finally, we do a final analysis and close the incident: we record whether other actions and resources will be needed to prevent another incident of this nature and, if possible, the deadline and the people responsible for carrying out these improvement actions. We also describe the lessons learned: what we learned from the incident and what we should do to prevent it from happening again.
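To make the structure of the report easier to picture, here is a minimal sketch of the RIT as a Python data structure. It only mirrors the sections described above. The class names, field names, and the free-text severity value are assumptions made for illustration, not the actual Convenia template.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class Category(Enum):
    """Incident categories mentioned in the article."""
    UNPLANNED_CHANGE = "unplanned change"
    SECURITY_POLICY_NON_COMPLIANCE = "non-compliance with security policy"
    BACKUP_FAILURE = "backup failure"
    OTHER = "other (not yet cataloged)"


@dataclass
class ImprovementAction:
    """A follow-up action recorded in the final analysis (hypothetical shape)."""
    description: str
    owner: Optional[str] = None       # person responsible, if known
    deadline: Optional[date] = None   # filled in "if possible"
    done: bool = False


@dataclass
class IncidentReport:
    # Initial analysis
    summary: str
    severity: str        # the scale is an assumption, e.g. "low", "medium", "high"
    impact_type: str
    alert_origin: str
    reported_to: str
    # Detailing of the incident
    category: Category
    description: str
    extent_and_impacts: str
    causes: str
    areas_involved: list[str]
    # How the incident was handled
    actions_taken: list[str]
    # Final analysis and closure
    improvement_actions: list[ImprovementAction] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)

    def pending_actions(self) -> list[ImprovementAction]:
        """Improvement actions that have not been carried out yet."""
        return [a for a in self.improvement_actions if not a.done]
```

Keeping the improvement actions as structured items, rather than free text, is what later makes it possible to check which ones are still pending.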
If you want to download the template for this document, we are making it available here.
Importance of adopting these processes
Adopting the post-mortem and, from it, completing an RIT is essential for periodically reviewing the action plans and checking whether they were implemented. With this set of tools we can understand failures, record what we have learned, and mitigate risks, while also gaining metrics and a better understanding of past incidents. Another important reason to defend this practice is that it strengthens a culture of learning from mistakes. Mistakes will happen, which is normal in product development, but we must learn from them and also draw out the positive points, whether to implement improvements or to fix a broken process that we had no visibility into.
Real use case at Convenia
In addition to the points already mentioned in the article, I want to highlight an important lesson that came out of this post-mortem + RIT process.
We always made backups of all our databases and instances, but based on a lesson learned while filling out these documents, we rethought our policy and, after some analysis, decided to improve our process. We started making continuous backups, so that if there is any data loss, it covers at most 5 minutes.
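The 5-minute limit is, in practice, a recovery point objective (RPO). The sketch below shows one way to reason about it: given the last point we can restore to and the moment of the failure, it computes how much data would be lost and compares it with the target. The function names, the timestamps, and the nightly backup used for comparison are all hypothetical; the article does not describe the previous backup schedule.

```python
from datetime import datetime, timedelta

# The 5-minute maximum loss mentioned in the article, treated as the RPO target.
RPO_TARGET = timedelta(minutes=5)


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Span of writes that would be lost when restoring to the last recovery point."""
    if failure_time < last_recovery_point:
        raise ValueError("failure_time precedes the last recovery point")
    return failure_time - last_recovery_point


def meets_rpo(last_recovery_point: datetime, failure_time: datetime,
              target: timedelta = RPO_TARGET) -> bool:
    """True when the potential data loss stays within the RPO target."""
    return data_loss_window(last_recovery_point, failure_time) <= target


# Hypothetical failure in the middle of the afternoon.
failure = datetime(2023, 1, 1, 14, 30)

# Hypothetical nightly backup: many hours of data would be lost.
print(meets_rpo(datetime(2023, 1, 1, 2, 0), failure))     # False

# Continuous backup with a recovery point three minutes before the failure.
print(meets_rpo(datetime(2023, 1, 1, 14, 27), failure))   # True
```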
Not only were we able to improve our internal processes, but we were also able to deliver more security and a better product to our customers.
Another point to highlight is that, from time to time, we meet to summarize all the completed reports, which lets us see whether we have implemented the identified improvements or still have any pending. In the beginning, only one squad followed this process, but over time the other teams saw the advantages, and currently all squads adopt this practice.
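As an illustration of what such a periodic summary could compute, here is a small sketch that counts incidents per month and per severity and lists the improvement actions still pending. The `Report` and `Action` shapes, the severity labels, and the sample data are made up for the example and are not taken from our real reports.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    done: bool


@dataclass
class Report:
    month: str            # e.g. "2023-03"
    severity: str         # hypothetical scale: "low", "medium", "high"
    actions: list[Action]


def summarize(reports: list[Report]) -> dict:
    """Frequency metrics plus the list of improvement actions still open."""
    return {
        "incidents_per_month": dict(Counter(r.month for r in reports)),
        "incidents_per_severity": dict(Counter(r.severity for r in reports)),
        "pending_actions": [
            a.description for r in reports for a in r.actions if not a.done
        ],
    }


# Made-up sample data, only to show the shape of the result.
reports = [
    Report("2023-03", "high", [Action("enable continuous backup", done=True)]),
    Report("2023-03", "low", [Action("add disk usage alert", done=False)]),
    Report("2023-04", "medium", []),
]
summary = summarize(reports)
print(summary["pending_actions"])
```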
Conclusion
The post-mortem meeting is a fundamental tool for promoting continuous improvement and organizational learning. An example of this is our case with continuous backup, which brought us even more security and improved our RTO and RPO.
By allowing teams to revisit past processes, the post-mortem meeting makes it possible to identify strengths and weaknesses and provides valuable insights for avoiding future mistakes. For the meeting to be effective, it is necessary to establish a safe and confidential environment in which team members can speak openly and express their ideas. It is also necessary to ensure that the discussions result in concrete actions and measures for improvement. The post-mortem can help teams evolve and improve their performance, becoming more efficient and effective in their activities and, with that, delivering greater satisfaction to the end customer.
*The content of this article is the author’s responsibility and does not necessarily reflect the opinion of iMasters.