collectd in dynamic environments
Over the last few years, the number of services running "in the cloud", on a dynamic pool of virtual machines, has increased dramatically. While "moving to the cloud" presents an opportunity to cut costs and offers new possibilities with regard to scalability, it also creates new problems for system administrators. For example, many existing open-source monitoring solutions require a centrally configured list of hosts, which cannot be specified in a cloud environment where hosts come and go. Configuring checks for individual hosts also makes less sense than it used to, and the focus shifts from fixing a single host to detecting a misbehaving site.
collectd is a lightweight daemon that collects, processes and stores system and application performance metrics. It comes with over one hundred plugins for all sorts of systems, applications and devices. It is in widespread use on small devices (e.g. those running OpenWrt) as well as in huge, ever-changing cloud setups.
This talk will focus on the "cloud" end of the spectrum. First, it will discuss what makes collectd a good choice for performance data collection in such a dynamic environment. It will then cover querying guest metrics from the hypervisor to collect some basic information about virtual machines without instrumenting them. Next, it will look at different ways to set up networking using collectd's "network" plugin, and at per-site and global data aggregation using either collectd's "aggregation" plugin or Riemann. Finally, the talk will cover some common storage systems, from the old-but-proven RRDtool to newer alternatives such as Graphite.
Florian started his first free software project in 2001 and has been active in the open source community ever since. In 2005 he started the collectd project and is still one of the project maintainers. His interests lie mainly with low-level backends and infrastructure services, though he has contributed to various window managers over the years. In his day job, he is a Site Reliability Engineer at Google.