IT operations: spend n iterations to see success on the (n+1)th attempt

I often see teams spend 40 hours on a single outage incident, so I thought I would explain how such an incident progresses, as a runbook for anyone new to running one.

Incident detected by monitoring system:

Detection is critical. The company should have invested and matured enough to monitor its critical systems and to run synthetic transactions that detect problems.
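As a rough illustration of what a synthetic transaction check might look like, here is a minimal sketch in Python. The names `CheckResult` and `run_synthetic_check`, and the 2-second latency threshold, are assumptions made for this example and do not come from any particular monitoring product. The transaction itself is passed in as a callable, so it could be a login, a test order, or any other scripted user action.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool          # True only if the transaction succeeded within the latency budget
    latency_s: float  # how long the transaction took, in seconds
    detail: str       # short human-readable reason


def run_synthetic_check(transaction: Callable[[], bool],
                        max_latency_s: float = 2.0) -> CheckResult:
    """Run one scripted transaction and classify the outcome.

    `transaction` is a hypothetical callable that performs the user action
    and returns True on success. The 2.0 s default is an assumed threshold.
    """
    start = time.monotonic()
    try:
        succeeded = transaction()
    except Exception as exc:
        # Any exception counts as a failed check; keep the message for the alert.
        return CheckResult(False, time.monotonic() - start, f"error: {exc}")

    latency = time.monotonic() - start
    if not succeeded:
        return CheckResult(False, latency, "transaction returned failure")
    if latency > max_latency_s:
        return CheckResult(False, latency, "transaction too slow")
    return CheckResult(True, latency, "ok")
```

A scheduler would call `run_synthetic_check` at a fixed interval and raise an incident when `ok` is false for several runs in a row; how many runs to wait for is a judgment call for each team.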

Paging System:

This phase pages the engineer on shift, who then gathers the resources needed to debug the system.
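To make the shift lookup concrete, here is a small sketch of how a paging system might decide whom to call. The `Shift` type, the `on_call_engineer` function, and the engineer names are all hypothetical; real paging tools keep their own schedules and expose them differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shift:
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive; smaller than start_hour for an overnight shift
    engineer: str


def on_call_engineer(hour: int, shifts: List[Shift]) -> str:
    """Return the engineer whose shift covers the given hour of the day."""
    for shift in shifts:
        if shift.start_hour < shift.end_hour:
            # Ordinary shift within a single day, e.g. 06:00 to 14:00.
            covered = shift.start_hour <= hour < shift.end_hour
        else:
            # Shift that wraps past midnight, e.g. 22:00 to 06:00.
            covered = hour >= shift.start_hour or hour < shift.end_hour
        if covered:
            return shift.engineer
    raise LookupError(f"no shift covers hour {hour}")


# Example schedule with made-up names.
SHIFTS = [
    Shift(6, 14, "alice"),
    Shift(14, 22, "bob"),
    Shift(22, 6, "carol"),
]
```

With this schedule, an incident detected at 03:00 would page `carol`, because her shift wraps past midnight. A gap in the schedule raises `LookupError` rather than silently paging nobody.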

Involve Vendor:

After a basic analysis, if the problem appears to lie with a vendor, involve them.

Get dirty:

Multiple vendors involved

Scope of the issue is larger

Notifying the teams impacted

Keep the triage call going

Tools & People:

You will see the team running low on energy,

but you are the overall owner and you cannot fail.

Timelines:

Depending on the complexity, you will end up spending sleepless days trying different trial-and-error methods to resolve the issue.

Final step:

You will either resolve the issue or conclude it, and then you will get some sleep. This often repeats after a hectic week of work.

