The Harvard University Information Technology (HUIT) Shared Tools Team manages the LogicMonitor (LM) and PagerDuty (PD) SaaS services for HUIT and Harvard. LogicMonitor is a cloud-based infrastructure monitoring platform. PagerDuty is a cloud-based incident response and on-call management platform. Together, they provide monitoring and incident response services for thousands of cloud and on-prem technologies.

Overview of the monitoring & alerting services provided by the Shared Tools Team

Initial onboarding into the monitoring and alerting systems.
Assistance building and maintaining your team's monitoring and alerting processes and lifecycles.
Assistance building subject-matter experts in every team.
LM and PD system architecture and roadmap.
Escalation point for complex or organization-specific problems.
Weekly office hours and on-demand working sessions.

What's your team responsible for?

All LM and PD users must have Duo Multi-factor Authentication enabled on their Harvard Key accounts.
Teams should not leave "Critical" alerts unacknowledged.
Teams should not use LM and PD to retrieve or store Level 3 and Level 4 data.

The role of the Service Owner

Service / Application owners drive the monitoring and alerting lifecycle for their services and apps. This includes making sure their services are monitored, confirming that alerts go where they are expected, tuning the monitoring, maintaining incident response documentation, and making sure app-team members are trained to use LM and PD.

Link to the Product Owner's Monitoring Checklist

The role of the Architect

Architects should ensure that their services are monitored and will alert before and during service disruptions. They work with the application owner to identify, prioritize, deploy and improve monitoring and alerting services for the service.

The role of the Engineer / DevOps Engineer / Administrator

These team members implement the monitoring and alerting processes identified by the Architect and Service Owner. They also respond to any alerts in accordance with the processes defined by the Architect and Service Owner, and they escalate service improvement ideas and monitoring improvement ideas.

Your team needs a monitoring and alerting point-person to take the lead on team monitoring and alerting challenges. This person is the escalation point for team monitoring and alerting challenges, coordinates postmortems, and escalates unresolved challenges to the Product Owner of Monitoring and Alerting. Ideally this person has taken the LogicMonitor self-paced training and is LogicMonitor Certified and PagerDuty Certified.

Detailed breakdown of the monitoring service roles and responsibilities

Vendor and Tool Specific Support and Documentation

Tool-specific documentation, including configuration, implementation and troubleshooting.