The Challenge
Ineffective monitoring carries real risks. According to Dun & Bradstreet, approximately half of all Fortune 500 companies experience at least 1.6 hours of downtime per week. That works out to approximately 83 hours of downtime per year. With this in mind, if a site averages around $6,000 in revenue per hour, the downtime will cost the company roughly $500,000 per year.
Monitoring plays a major role in avoiding this downtime, and in keeping it as short as possible when an issue does occur.
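The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the numbers quoted above (1.6 hours per week, $6,000 per hour) and assumes a 52-week year; the function name is illustrative. With these inputs the total comes to just under $500,000.

```python
def annual_downtime_cost(hours_down_per_week, revenue_per_hour, weeks_per_year=52):
    """Return (hours of downtime per year, lost revenue per year)."""
    hours_per_year = hours_down_per_week * weeks_per_year
    return hours_per_year, hours_per_year * revenue_per_hour


# Figures quoted in the text: 1.6 hours/week of downtime, $6,000/hour in revenue.
hours, cost = annual_downtime_cost(1.6, 6000)
print(f"{hours:.1f} hours/year, about ${cost:,.0f} in lost revenue")
```

Plugging in your own hourly revenue gives a rough estimate of what downtime costs your business.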
The Benefits
Using a monitoring service can help your business immediately identify issues and the servers on which they arose. Your IT team won’t have to spend time locating the cause, sifting through data, or running manual tests to figure out what went wrong. They can stay focused on what’s important: your app. In addition, monitoring can:
Identify gaps in the environment that could result in crashes, such as databases running out of disk space.
Identify potential risks of outages in an environment, such as nodes running out of CPU or memory, or applications unable to handle the number of requests they receive.
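As a rough illustration of what catching a gap such as low disk space looks like, here is a minimal sketch using only Python's standard library. The 80% and 90% thresholds and the function names are assumptions chosen for the example, not recommended values; a real monitoring system would collect many such metrics continuously.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_threshold(name, value, warn=80.0, crit=90.0):
    """Return an alert message if `value` crosses a threshold, else None.

    The warn/crit defaults are illustrative assumptions.
    """
    if value >= crit:
        return f"CRITICAL: {name} at {value:.1f}%"
    if value >= warn:
        return f"WARNING: {name} at {value:.1f}%"
    return None


alert = check_threshold("disk", disk_usage_percent("."))
if alert:
    print(alert)
```

The same threshold check can be reused for CPU or memory readings once those values are collected.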
At stack.io we can:
Set up alert notifications for key infrastructure including:
Basic system-level metrics (CPU, disk, memory, load)
Database metrics
Proxy / load balancer / certificate expiration metrics
Design and implement a monitoring system based on best practices
Push alerts to a preferred company Slack channel / PagerDuty / email / etc.
Monitor your application execution using application performance monitoring tools
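To make the certificate expiration item above concrete, the following is a minimal sketch of such a check. It assumes you already have the certificate's `notAfter` string; the 30-day warning window, the function names, and the `example.com` hostname are illustrative assumptions.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining until a certificate's 'notAfter' time.

    `not_after` uses the format found in certificates,
    e.g. 'Jan  5 09:34:43 2030 GMT'.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def expiry_alert(hostname, not_after, warn_days=30, now=None):
    """Return an alert message if the certificate has expired or expires soon."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return f"{hostname}: certificate expired {abs(remaining):.0f} days ago"
    if remaining <= warn_days:
        return f"{hostname}: certificate expires in {remaining:.0f} days"
    return None


# Hypothetical host and date, for illustration only.
print(expiry_alert("example.com", "Jan  5 09:34:43 2030 GMT"))
```

In Python, the `notAfter` value for a live host is available from the dictionary returned by `getpeercert()` on a connected `ssl.SSLSocket`.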
DevOps Maturity
Where does your setup fit on our DevOps maturity scale?
+ Poor
We have no monitoring set up.
We only know about things failing when the website goes down.
Our TLS certificates expire on us all the time, causing downtime.
+ Fair
We monitor one or two endpoints.
We have monitoring that lets us know when there is an outage.
When an alert fires, we can solve the issue after a detailed investigation.
My email inbox has 150k unread emails from our monitoring system.
There are so many alerts that we’ve started ignoring them, and not all of the alerts are actually useful or meaningful.
Checking alerts is painful and we actively avoid doing it if there’s another task that needs to be done.
+ Good
We have alerts for most metrics and systems, though occasionally our alerts miss something.
We have dashboards for all of our systems, so we can view their status at a glance, but some metrics or applications are missing (or we have no application performance monitoring at all).
+ Great
Our alerts catch issues before they happen.
We have alerts set up for monitoring our costs.
We have alerts set up for broken pipelines.
We have alerts for elevated privileges so that we can monitor security access.
We have great dashboards of both our application and system metrics for all of our systems, and it’s easy to make new ones.