Why an MSP Needs 1 FTE for every 64 Managed Service Customers
The Best Monitoring Alert Triage SOP for MSPs
An MSP needs 1 FTE Technician for every 64 Managed Service Customers**
The other day I was on a call with several MSPs and was asked about the difference in triage process between “normal” tickets and “monitoring alert” tickets. The discussion that followed was lively, so I thought I would share.
When triaging monitoring alert tickets, the process is different from the one for “normal” tickets. With normal tickets, a person is letting you know something is not working correctly. With monitoring alerts, you have automatically generated tickets that fall into a couple of different types.
The first kind of tickets you’re going to see in the Monitoring Alert queue are the offline tickets. These are for a server, firewall, or special computer which is offline. They should go to the triage queue and appear in the triage widget for the Service Coordinator/Dispatcher to get the right technician working on it. It helps if the Service Coordinator/Dispatcher has another view of the device to check for false positives, such as the firewall cloud management tool or backup remote control.
The other type of alert ticket we commonly see is the one that triggers and is fixed two minutes later. There are two ways to handle this type of alert. One is to change the alert so it only creates a ticket when the condition has occurred a set number of times within a set period. The other is to let the tickets be created and hold them in a queue for monitoring alerts. Once a ticket has been open for five minutes with no resolution, have a Workflow Rule move it to the Triage queue so the Service Coordinator can get it into the flow.
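Both approaches come down to a small amount of logic. The sketch below is only an illustration in Python, not the rule syntax of any particular RMM or PSA tool; the names `OccurrenceThreshold` and `should_escalate`, and the example limits, are assumptions made for the example.

```python
from collections import deque
from datetime import datetime, timedelta


class OccurrenceThreshold:
    """Approach 1 (hypothetical helper): only raise a ticket after
    `limit` alerts have arrived within `window`."""

    def __init__(self, limit: int, window: timedelta):
        self.limit = limit
        self.window = window
        self._seen = deque()  # timestamps of recent alerts, oldest first

    def record(self, at: datetime) -> bool:
        """Record one alert; return True when a ticket should be created."""
        self._seen.append(at)
        # Drop alerts that have aged out of the window.
        while self._seen and at - self._seen[0] > self.window:
            self._seen.popleft()
        return len(self._seen) >= self.limit


def should_escalate(created: datetime, resolved: bool, now: datetime,
                    hold: timedelta = timedelta(minutes=5)) -> bool:
    """Approach 2 (hypothetical helper): a ticket held in the monitoring
    alert queue moves to Triage once it is `hold` old and still unresolved."""
    return not resolved and now - created >= hold
```

With a limit of 3 alerts in 10 minutes, the first two alerts stay quiet and the third creates a ticket. With the five-minute hold, a ticket that resolves itself after two minutes never reaches the Service Coordinator.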
One great way to minimize your triage work is to use components that take action for you before a ticket is created. Two quick and easy places to start are disk space and CPU usage. When setting up monitors for disk space or CPU usage, attach a component to the monitor that runs before the ticket is created. When the disk space alert triggers, you can have a component run a cleaner and, if the alert is not cleared, generate a ticket for a technician to take the next step. For CPU usage, have a component run that identifies what is using all the CPU. When that is added to the ticket, it gives the technician a better place to start.
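To make the idea concrete, here is a minimal Python sketch of what those two components do. It is not the component format of any specific RMM. The function names, the 10% free-space threshold, and the way readings are passed in (`free_percent`, `run_cleanup`, `process_usage`) are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional


def disk_alert_component(
    free_percent: Callable[[], float],   # hypothetical: reads current free space
    run_cleanup: Callable[[], None],     # hypothetical: runs the disk cleaner
    threshold: float = 10.0,             # assumed alert threshold, in percent free
) -> Optional[str]:
    """Run the cleaner first; return ticket text only if the alert persists."""
    before = free_percent()
    if before >= threshold:
        return None  # alert already cleared, no ticket needed
    run_cleanup()
    after = free_percent()
    if after >= threshold:
        return None  # the cleaner cleared the alert, no ticket needed
    return f"Disk space low: {before:.1f}% free before cleanup, {after:.1f}% after."


def cpu_alert_component(process_usage: Dict[str, float], top_n: int = 3) -> str:
    """Build ticket text listing the processes using the most CPU."""
    top = sorted(process_usage.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    lines = [f"{name}: {pct:.0f}% CPU" for name, pct in top]
    return "High CPU usage. Top processes:\n" + "\n".join(lines)
```

The disk component returns nothing when the cleaner solves the problem, so no ticket is created. The CPU component always returns text, because its job is to enrich the ticket rather than prevent it.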
As you can see, it doesn’t take a lot of time or effort to optimize your triage process to clear out the noise tickets and get to the important ones faster. With a few Workflow Rules and some components your triage process will be rocking in no time.
Once the noise has been addressed, the remaining tickets take a technical deep dive to determine:
1) Is it a false positive?
2) Is this a recurring event that needs Root Cause Analysis (RCA)?
3) Does it need remediation engagement?
**Here is why you need 1 FTE Technician for every 64 Customers:
1) Triaging Monitoring alerts is more of a technical evaluation than triaging other “New” Customer requests
2) If we spend 5-6 minutes per day per Customer looking at their alerts, it adds up to about half an hour per week per Customer
a. This is not to say that all Customers have an alert every day
b. It is to say that the time it takes to review all alerts and make an engagement/no engagement decision averages about 5-6 minutes per day per Managed Service Customer
c. To spend less than this time means you are either maintaining a very solid/stable network, or you are not providing the service promised to the Managed Service Customer
3) At 30 minutes per Customer per week, a Technician at 80% utilization (32 hours per week) can support 64 Managed Service Customers (32 hours divided by half an hour per Customer), just for alert management
4) Once an engagement has been determined, best practice is to move the ticket over to the Triage queue for further triaging and assignment.
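The arithmetic behind the 64 figure can be checked with a few lines of Python. This is a sketch under stated assumptions: a 40-hour week and 5 working days are implied by the numbers above rather than stated outright, and the function name `customers_per_fte` is made up for the example.

```python
def customers_per_fte(
    minutes_per_customer_per_day: float = 6,  # upper end of the 5-6 minute range
    workdays_per_week: int = 5,               # assumption: 5-day work week
    hours_per_week: int = 40,                 # assumption: 40-hour work week
    utilization_percent: int = 80,            # 80% utilization = 32 hours
) -> float:
    """How many Managed Service Customers one FTE can cover for alert review."""
    productive_minutes = hours_per_week * 60 * utilization_percent / 100
    minutes_per_customer_per_week = minutes_per_customer_per_day * workdays_per_week
    return productive_minutes / minutes_per_customer_per_week
```

Calling `customers_per_fte()` with the defaults reproduces the 64 Customers per FTE. Dropping to 5 minutes per day raises the result to roughly 77, so 64 is the conservative end of the range.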
Have any questions or need more information? Email us at info@agmspcoaching.com