Back in 2011, a few people started complaining about how monitoring sucked with the #monitoringsucks hashtag appearing on Twitter, and since then a lot of DevOps people have joined the rally. However, more recently, #monitoringlove has appeared after a few open-source efforts to improve the state of monitoring, most notably Sensu.
Now, as we head into 2014, 2.5 years later, the question we asked when starting Dataloop.IO was: does monitoring still suck enough to justify the enormous effort of launching a new company in this space?
As anyone who's done a start-up before knows, the last thing you want to do is spend 3 or more years of your life solving a problem that's not really a problem, so we spent the last 10 months talking to as many DevOps & Operations teams as we could to see if things have improved since 2011.
In the process we've spoken to 60 companies (and we're still counting), all of which run Cloud services, and some of which are among the largest Cloud companies in London (UK). I can't name names right now as we promised to keep all our discussions confidential, and this isn't a quantitative survey, but it's pretty clear what the answer is to the main question we had:
Yes, Monitoring Still Sucks & It's Going to Get a Lot Worse.
What Everyone's Currently Using
Although a lot of new tools have arrived since 2011, it's clear that older open-source tools like Nagios, and Nagios alternatives like Zabbix and Icinga, still dominate the market, with 70% of the companies we spoke to still using these tools for their core monitoring & alerting.
Around 70% of the companies used more than one monitoring tool, with companies using 2 tools on average. Nagios/Graphite configurations were most common, with many also using New Relic. However, only 2 of the companies we spoke to actually paid for New Relic; most used the free version because they found the paid version too expensive.
In the "Other" category, there were a lot of different tools with no particular one standing out. This category included several smaller start-ups using some of the newer SaaS monitoring tools such as Librato & Datadog, companies using older open-source tools like Cacti or Munin, some AWS users relying on CloudWatch, and even a few custom-built solutions.
If we look at tool usage versus the number of servers the companies manage (< 20 being new start-up services all the way to > 1000 servers for the large online services) you can see the proportion of older open-source tools like Nagios, or paid on-premise tools, goes up as the service gets larger, whereas the smaller, newer services are more likely to use developer-focused tools like Graphite, LogStash and New Relic.
This makes sense, as many of the larger services are older (> 5 years old) so have legacy monitoring infrastructure, and also have the resources to hire a dedicated operations team who tend to bring in the tools they're most familiar with, namely Nagios or Nagios alternatives. They also have more money to pay for monitoring tools like Splunk (which everyone would love to have if they could afford it!) or AppDynamics.
The newer, smaller services tend not to have any DevOps/Operations people in their company, so developers tend to use SaaS monitoring tools that are simpler to install, or tools that help them directly such as Graphite or LogStash. There seems to be a tipping point between 50 and 100 servers, when the company has the resources to bring in a DevOps/Operations person or team and they start bringing in infrastructure monitoring tools like Nagios to provide the coverage they need.
While a lot of issues and trends came out in our open-ended discussions with companies, it was clear monitoring was a hot topic for many of the people we spoke to, and there were a few key issues/trends we saw across all of them.
Everyone's still building kit cars
In the data above, companies use an average of 2 monitoring tools to ensure they get the coverage required across different parts of the stack (infrastructure, application, logs etc.). In some of the larger services it wasn't uncommon to find 4-5 monitoring solutions in place. For example Nagios for alerting, Graphite for dashboards, StatsD for developer metrics, collectD for service metrics and LogStash/Kibana for logs.
Every company with more than 50 servers shared the same experience we had had at Alfresco: building a "kit car" of open-source monitoring infrastructure over several weeks/months. In addition to setting up, integrating and fine-tuning all the tools together, many had invested significant effort into writing their own custom infrastructure to address gaps in the tools themselves. For example, one company had spent 4 months writing Puppet modules to automatically register new servers and plugins with Zabbix, and fixing scalability issues with the tool.
In addition, many companies had developed their own custom dashboards to display metrics on the large TV screens they had around the office. While most of these were pretty standard dashboards, one particularly notable (and very cool!) example was a "Star Trek" status dashboard that showed Nagios alerts in a Star Trek style, with alert sounds from the films.
Most of the people we spoke to understood that investing significant time into building their custom "kit cars" (which then had to be maintained after) created a bottleneck when monitoring requirements changed, especially in high growth services, and diverted a lot of their time and attention from working on higher value tasks such as getting faster deployments or moving the service to new scalable architectures.
When we asked them why they didn't use some of the new out-of-the-box SaaS monitoring services out there instead of building the "kit car" themselves, many people said the newer services lacked the flexibility of open-source solutions, which they could customize to their requirements, and that they didn't like the idea of learning a proprietary system with its own plugin design and features, leading to lock-in. Also, companies in regulated sectors like financial services or the online gaming industry (of which there are many in London!) had strict requirements on what data could be sent outside their service, so they couldn't use SaaS solutions like New Relic, which send almost everything to the vendor's service by default.
Decentralizing with Micro-Services
A key trend we saw, especially as the services became larger, was the move towards micro-services, with different cross-functional development teams building, deploying and supporting their own parts of the service, as Spotify describes for their company in this paper. While many of the companies we spoke to were some way off achieving this model in their own organization, they were making strategic moves towards it, and they all agreed that the current monitoring tools available, both new and old, would not adapt to this model.
In this model, the DevOps/Operations team no longer controls all the deployments and infrastructure, but instead becomes a support team to the rest of the organisation, providing the tools and processes each team needs to roll out and support their own micro-services. A micro-service in Spotify could be the playlist service, or the music streaming service: essentially a discrete part of the overall service that provides functionality to end users or other micro-services. This architecture and organization enables a large, complex service to become highly scalable as it grows. However, it dramatically increases the number of servers and services the DevOps/Operations team needs to support, so the only way to scale it is to move support to the development teams themselves, who become the first line of support when things go wrong.
This means that the development teams need monitoring for the services they own, with the ability to add and remove checks as the service changes, and customize what alerts they receive and what dashboards they view.
As far as everyone we spoke to knew, none of the monitoring tools available today cater for this model, as they've all been designed for the traditional centralized model, where the development team throws the service over the wall to the DevOps/Operations team that deploys and supports the entire service in production. Current tools aren't easy to use and don't provide the nice GUI that developers want for configuring their monitoring. They don't make it easy for different teams to create and roll out their own monitoring checks, or to configure their own alerting and dashboards, so everyone currently receives everyone's alerts, leading to the next major problem we saw.
In addition, some of the more developer-friendly tools like New Relic have a business model that is too expensive for services moving to micro-services. When a service that is currently only 50 servers evolves into hundreds of smaller instances running micro-services, with each instance costing only $60/month to run on AWS, $200/agent/month really doesn't make sense.
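To make the arithmetic above concrete, here is a rough sketch in Python. The two prices are the ones quoted above; the fleet size of 300 is purely an assumption standing in for "hundreds" of instances, and the function name `monthly_costs` is made up for illustration.

```python
# Prices quoted in the post.
INSTANCE_COST_PER_MONTH = 60   # USD, one small micro-service instance on AWS
AGENT_COST_PER_MONTH = 200     # USD, one per-agent monitoring licence


def monthly_costs(instances: int) -> dict:
    """Return monthly infrastructure and monitoring spend for a fleet.

    Assumes one monitoring agent per instance (an assumption for this sketch).
    """
    infrastructure = instances * INSTANCE_COST_PER_MONTH
    monitoring = instances * AGENT_COST_PER_MONTH
    return {
        "instances": instances,
        "infrastructure": infrastructure,
        "monitoring": monitoring,
        # How many times more the monitoring costs than the servers it watches.
        "monitoring_to_infrastructure": monitoring / infrastructure,
    }


if __name__ == "__main__":
    # 300 is a hypothetical fleet size, not a figure from our conversations.
    print(monthly_costs(300))
```

Whatever fleet size you plug in, per-agent pricing at these rates means paying more than three times as much to monitor an instance as to run it.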
And finally, if there was one consistent complaint from all the companies we spoke to, it was alerting. Whether you call it Alert Fatigue or Spammy Alerts, it's clear that none of the tools, even the monitoring tools that claim to have advanced machine learning algorithms, have solved this problem, and it's getting worse as companies scale out even more servers to run micro-services on continuously changing Cloud environments. In one company's case, they were receiving around 5000 alert emails a day. With that volume, alerting had just become noise, and most of the team simply filtered the emails into a folder or automatically deleted them altogether.
A key reason we saw for spammy alerts was the lack of granularity in how current tools send alerts. With many more services and servers, ultimately there are many more alerts being created by the monitoring tools. However, many of the alerts were sent to everyone via a group email address, as the tools didn't provide an easy way for different developers to receive, or subscribe to, only the alerts they cared about. While everyone we spoke to was interested in the potential of predictive & more intelligent alerting using machine learning (particularly anomaly detection as the number of metrics they monitored increased), we believe there are some simpler things that can be done to reduce spammy alerts in many organizations, most notably only sending alerts to the people who need them!
Many of the companies we spoke to initially told us their monitoring was "good enough". They knew the current tools "sucked", that they weren't designed for the type of services they're running today and wouldn't scale, but many had already made the significant investment to address these issues with custom infrastructure in Puppet/Chef to make it work for their needs. However, as we dug deeper and became more familiar with the problems, we saw that many of the companies we spoke to later recognized the issues above and were very aware that current monitoring solutions were not designed for where they wanted to go - decentralized micro-services running on continuously changing Cloud infrastructure.
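As a minimal sketch of what "only sending alerts to the people who need them" could look like, the Python below routes each alert by per-service subscription instead of a group email address. The `Alert` and `AlertRouter` names, the service names and the people are all hypothetical; this is not how any particular monitoring tool works.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # micro-service that raised the alert (hypothetical names)
    message: str


class AlertRouter:
    """Routes each alert only to the people subscribed to its service."""

    def __init__(self):
        # Maps a service name to the set of people subscribed to it.
        self._subscribers = defaultdict(set)

    def subscribe(self, person, service):
        self._subscribers[service].add(person)

    def recipients(self, alert):
        # Sorted so the result is deterministic; nobody subscribed means
        # an empty list rather than a fallback to "everyone".
        return sorted(self._subscribers.get(alert.service, set()))


if __name__ == "__main__":
    router = AlertRouter()
    router.subscribe("alice", "playlist")
    router.subscribe("bob", "streaming")
    router.subscribe("carol", "playlist")
    print(router.recipients(Alert("playlist", "latency above threshold")))
```

In this sketch an alert for a service nobody has subscribed to goes to no one. A real system would probably want a fallback owner for that case, but that is a design choice outside what we discussed with these companies.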
We believe this is the big trend that will change the way monitoring tools are designed for companies running Cloud services, and will require a fundamental rethink around how monitoring is managed and deployed across organizations. It will require a way to bridge the needs of the DevOps/Operations teams with the development teams they support and new solutions that allow decentralized teams to monitor and manage their own micro-services.
So overall, with 70% of companies still running the same tools available back in 2011, it's clear that for many companies monitoring still sucks. But things have improved overall: a lot of new ideas and tools have appeared trying to solve the problems raised in 2011. The issue we see, however, is that they're all solving the problems from 2011, and none of them seem to be solving the issues many larger services (or smaller ones that are growing fast) are facing as we move into a world of DevOps and decentralized micro-services. Without new ideas and thinking about these newer problems, we risk being in the same place in 2015, where monitoring still sucks, and even worse than before!
If you're interested in seeing what we're working on or giving feedback, then sign up on our home page and we'll let you know when it's ready!