ML Infrastructure Monitoring Checklist
Complete this checklist to assess the current status and health of your machine learning infrastructure.
Date of Monitoring *
Month - Day - Year
Name of Reviewer *
First Name
Last Name
Infrastructure/Cluster Name *
Component/Node Being Checked *
Master Node
Worker Node
GPU Server
Storage Server
Other
System Resource Status *
CPU Usage: OK / Warning / Critical / Not Applicable
Memory Usage: OK / Warning / Critical / Not Applicable
Disk Usage: OK / Warning / Critical / Not Applicable
GPU Utilization: OK / Warning / Critical / Not Applicable
Service Health Check *
Model Serving API: Running / Stopped / Degraded / Unknown
Database: Running / Stopped / Degraded / Unknown
Monitoring Agent: Running / Stopped / Degraded / Unknown
Scheduler: Running / Stopped / Degraded / Unknown
Other Service: Running / Stopped / Degraded / Unknown
Network Connectivity *
Stable
Intermittent
Down
Are there any critical alerts or incidents? *
No
Yes
If yes, please describe the alert/incident.
Additional Observations or Comments
Attach logs or screenshots (optional)
Submit Checklist