IT operations: spend n iterations to see success on the (n+1)th attempt
I often see teams spend 40 hours on a single outage incident, so I thought I would explain how such an incident progresses, so that a newcomer has a runbook to follow.
Incident detected by monitoring system:
This step is critical. The company should have invested enough, and matured enough, to monitor its critical systems and to run synthetic transactions that detect the problem.
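As a rough illustration of a synthetic transaction check, here is a minimal Python sketch. The names `run_synthetic_check`, `should_alert`, and `ProbeResult`, the 2-second latency budget, and the "three failures in a row" rule are all assumptions made for this example, not part of any particular monitoring product. The `transaction` argument stands in for whatever scripted user action you would really run (a login, a test order, and so on).

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str


def run_synthetic_check(
    transaction: Callable[[], bool],
    max_latency_s: float = 2.0,  # assumed latency budget
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Run one scripted transaction and classify the outcome."""
    start = clock()
    try:
        succeeded = transaction()
    except Exception as exc:
        # Any exception counts as a failed probe.
        return ProbeResult(False, clock() - start, f"exception: {exc!r}")
    latency = clock() - start
    if not succeeded:
        return ProbeResult(False, latency, "transaction returned failure")
    if latency > max_latency_s:
        return ProbeResult(False, latency, "transaction too slow")
    return ProbeResult(True, latency, "ok")


def should_alert(results: List[ProbeResult], consecutive_failures: int = 3) -> bool:
    """Alert only when the last N probes all failed, to avoid paging on one blip."""
    recent = results[-consecutive_failures:]
    return len(recent) == consecutive_failures and all(not r.ok for r in recent)
```

Requiring several consecutive failures is one possible design choice: it trades a slightly slower detection for fewer false pages.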
Paging System:
This phase pages the engineer on shift, who then gathers the resources needed to debug the system.
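The paging step could be sketched roughly as follows. This is only an illustration under assumed rules: the `Shift` class, the `on_call_engineer` and `page_targets` helpers, the 15-minute acknowledgement timeout, and the example names are hypothetical, and real paging tools have their own scheduling and escalation models.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shift:
    engineer: str
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive; may wrap past midnight (e.g. 22 -> 6)


def on_call_engineer(shifts: List[Shift], hour: int) -> Optional[str]:
    """Return the engineer whose shift covers the given hour, if any."""
    for s in shifts:
        if s.start_hour <= s.end_hour:
            covered = s.start_hour <= hour < s.end_hour
        else:
            # Shift wraps around midnight.
            covered = hour >= s.start_hour or hour < s.end_hour
        if covered:
            return s.engineer
    return None


def page_targets(
    shifts: List[Shift],
    hour: int,
    minutes_unacknowledged: int,
    escalation: List[str],
    ack_timeout_min: int = 15,  # assumed timeout per escalation level
) -> List[str]:
    """Everyone who should have been paged by now for an unacknowledged incident."""
    primary = on_call_engineer(shifts, hour)
    targets = [primary] if primary else []
    # Add one escalation level for each full timeout that passed without an ack.
    levels = minutes_unacknowledged // ack_timeout_min
    targets.extend(escalation[:levels])
    return targets


shifts = [Shift("alice", 6, 14), Shift("bob", 14, 22), Shift("carol", 22, 6)]
print(page_targets(shifts, hour=3, minutes_unacknowledged=20,
                   escalation=["team-lead", "manager"]))
```

In this sketch the shift engineer is always paged first, and each unacknowledged timeout pulls in the next person on the escalation list.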
Involve Vendor:
After basic analysis, if the problem appears to lie with a vendor, involve them.
Get dirty:
Multiple Vendors Involved
Scope of the issue is larger
Notifying the teams impacted
Keep the triage call going
Tools & People:
You will see the team running low on energy,
but you are the overall owner and you cannot fail.
Timelines:
Depending on the complexity, you will end up spending sleepless days trying different trial-and-error methods to resolve the issue.
Final step:
You will either resolve the issue or close it out with a conclusion, and you will finally get some sleep. This cycle often repeats after a hectic week of work.