Max Euston

5 rules for Advanced Infrastructure Monitoring (AIM)

Technology continues to advance on many fronts. Agile and DevOps have enhanced development, deployment, and operations. Cloud/containerization is optimizing infrastructure utilization and scaling. Machine learning has improved usability for both consumers and providers. What can be done to advance monitoring?

Focus on the tenets of IT-Ops

The goal is always to maintain an excellent customer experience - beyond simply meeting SLAs. Likewise, if an outage occurs, the top priority is rapid restoration of service - ideally, with the ability to include only the most appropriate SME teams. In the interest of efficiency, we should also automate well-defined processes. Finally, false alerts must be curtailed in order to preserve our operations teams’ effectiveness.

Information only has value when consumed

Metric and descriptive data each feed active (alerting), passive (investigatory), and explanatory (CMDB) objectives. Capture tooling can and should include a combination of custom, OSS, and COTS components, as each has its own strengths and no single category addresses all use cases.

Threshold alerting is archaic

“The old ways” are not sufficient, and never have been. Yes, the extremes are clearly good or bad, but in the dynamic “grey zone” a static threshold guarantees both false positives, which wear down staff, and false negatives, which miss real issues. As system complexity has grown, so too must monitoring sophistication. Operational analytics techniques are more useful: they look at higher-dimensional (multivariate) relationships and at patterns that are learned rather than hand-coded (ANN, ML, AI).
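To make the “grey zone” problem concrete, here is a minimal sketch that compares a fixed threshold with a rolling baseline. It is a deliberately simple univariate z-score check, not the multivariate or ML techniques mentioned above; the sample values, window size, and limits are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Indices of samples that exceed a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def baseline_alerts(values, window=5, z_limit=3.0):
    """Indices of samples that deviate from the mean of the preceding
    `window` samples by more than `z_limit` standard deviations."""
    history = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(v - mu) / sigma > z_limit:
                alerts.append(i)
        history.append(v)
    return alerts


# Hypothetical CPU utilisation samples (percent).
quiet_host = [20, 22, 21, 19, 20, 21, 60, 21, 20, 22]  # one abnormal spike
busy_host = [85, 86, 84, 85, 86, 85, 84]               # steady, normal load

print(static_alerts(quiet_host, 80), baseline_alerts(quiet_host))
print(static_alerts(busy_host, 80), baseline_alerts(busy_host))
```

With a threshold of 80, the static check misses the spike on the quiet host (a false negative) and flags every sample on the busy host (false positives), while the baseline check flags only the spike.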

Vary data capture and storage granularity for multiple use cases

Classical coarse-grained (host-level) metrics are good for capacity planning and detecting hardware failures. Fine-grained (APM) metrics provide extremely high value for application support, but they require a larger maintenance effort and are very instance-specific. Between these two points is a wide gap that process-level metrics address, and these work identically on both MS-Windows and Unix/Linux. The last data type is verbatim (log/audit) data. While cloud providers (e.g. AWS or Azure) are changing the landscape for applications, and come with their own advanced monitoring tools, we can still integrate them into this model.
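One way to serve several granularities from a single capture stream is to keep fine-grained samples for a short period and roll them up into coarser buckets for long-term storage. The sketch below assumes samples arrive as (epoch seconds, value) pairs and keeps min, mean, and max per bucket; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Roll (timestamp, value) samples up into fixed-width time buckets.

    Returns {bucket_start: (min, mean, max)}, keeping the extremes so that
    short spikes remain visible after the roll-up.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: (min(vals), sum(vals) / len(vals), max(vals))
        for start, vals in sorted(buckets.items())
    }


# Hypothetical 10-second samples rolled up into 60-second buckets.
fine = [(0, 10.0), (10, 20.0), (20, 90.0), (60, 30.0), (70, 50.0)]
coarse = downsample(fine, 60)
print(coarse)
```

Keeping the maximum alongside the mean matters here: the first bucket averages 40.0, which would hide the 90.0 spike if only the mean were stored.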

Present the data

Following the sprint notion of iterative development, migration from existing legacy IT-Ops systems can be initiated even without a significant budget or explicit sponsorship. Running new alerts in parallel with legacy systems shows the first signs of value. Capacity and trending spreadsheets are easy to replace with reports produced on demand. Asset management systems are verified and enhanced with additional data. New functions generate new mindsets: with process-level data in time series, we are able to recreate the output of common admin tools (e.g. ps, df, netstat) for any past moment, so we less often hear “I don’t see anything now; let me know if it happens again”. Process and socket data (graph nodes and edges) enable unbounded traversal of “related components” and their own time-series data. Soft alerts prioritize the direction of investigation when the root cause is unclear. Machine learning (e.g. classification) identifies when processes exhibit new and unexpected behavior (i.e. misbehave).
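The traversal of “related components” can be sketched as a breadth-first search, treating processes as nodes and socket connections as undirected edges. The node names and the edge list below are invented for illustration; a real deployment would build the edges from captured socket data.

```python
from collections import deque


def related_components(edges, start, max_hops=None):
    """Breadth-first traversal from `start` over undirected edges.

    Returns {node: hop_count} for every reachable node except `start`.
    `max_hops=None` means the traversal is unbounded.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_hops is not None and hops[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    del hops[start]
    return hops


# Hypothetical process-to-process connections derived from socket data.
edges = [
    ("web:nginx", "app:gunicorn"),
    ("app:gunicorn", "db:postgres"),
    ("app:gunicorn", "cache:redis"),
    ("batch:etl", "db:postgres"),
]
print(related_components(edges, "web:nginx"))
print(related_components(edges, "web:nginx", max_hops=1))
```

Starting from the web server, the unbounded traversal also reaches the batch job three hops away through the shared database, which is the kind of indirect relationship that is easy to miss when investigating by hand.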

Summary

There is value in all four classes of tooling: APM (e.g. New Relic or AppDynamics), log scraping (e.g. Splunk or ELK), metrics (e.g. collectd or a custom daemon), and domain-specific (e.g. ExtraHop). Data transport, storage, and access patterns across the enterprise estate map well to distributed filesystems (e.g. Hadoop-based). Visualization is decoupled when the data store exposes a query API that front ends (e.g. Tableau, Kibana, or BI tooling) can consume. IT-Ops consoles (e.g. ServiceNow or PagerDuty) likewise all accept API feeds. Finally, there are myriad additional use cases, such as compliance verification, software license management, SIEM enrichment, and more.

What else can you create?