Observability quality metrics for IT

Intro

Over the years, the IT landscape has evolved tremendously, with organisations adopting modern and complex systems, applications, and infrastructure to maintain competitiveness and efficiency. In this dynamic environment, observability has emerged as a crucial aspect of IT operations, enabling organisations to gain comprehensive insights into their systems' performance, reliability, and security. By implementing observability, businesses can proactively identify and resolve issues, optimise resource usage, and ensure a seamless user experience.

Key Performance Indicators (KPIs) and metrics play a vital role in observability, as they help measure various aspects of IT operations and provide valuable data for informed decision-making. KPIs and metrics can span multiple categories: availability, performance, security, and customer satisfaction. By monitoring these metrics, organisations can evaluate their IT systems' effectiveness, identify improvement areas, and track progress towards their strategic objectives.

Some of the most widely used KPIs and metrics in observability include error rates, uptime percentage, mean time between incidents (MTBI), mean time to detection (MTTD), and mean time to resolution (MTTR). By analysing these metrics over time, IT teams can uncover patterns and trends that indicate emerging issues or opportunities for optimisation. Furthermore, statistical forecasting and machine learning techniques can project these metrics forward, so that teams can plan capacity, staffing, and maintenance before a problem occurs.

In summary, observability, backed by well-chosen KPIs and metrics, is an invaluable tool for modern IT operations. By acting on these insights, organisations can improve the reliability, performance, and security of their IT systems, ultimately driving business growth and customer satisfaction.

The table below categorises common metrics and KPIs, with a description and an estimation formula for each:

| Metric | Category | Description | Estimation Formula |
| --- | --- | --- | --- |
| Cycle time | Process | The time it takes to complete a single unit of work or a cycle of a process. | Total time spent / Number of cycles |
| Endpoint security incidents | Security | The number of security incidents or breaches that occur on endpoints. | Number of incidents |
| Error rates | Quality | The frequency or percentage of errors or bugs in software or system components. | Errors / Total transactions |
| Lead time | Process | The time it takes from the initial idea to the deployment or release of a new feature. | Time from idea to deployment |
| Speed of software performance | Performance | The rate or efficiency of software in executing tasks or processing data. | Tasks completed / Time taken |
| Uptime percentage | Availability | The percentage of time that a system or service is available and operational. | (Uptime / Total time) × 100 |
| Availability | Availability | The percentage of time that a system or service is available and accessible to users. | (Available time / Total time) × 100 |
| Deploy speed and frequency | Deployment | The rate or frequency of deploying new features, updates, or patches. | Deployments / Time period |
| Error budgets | Quality | The acceptable or allowable number of errors or failures in a system within a specific period. | Acceptable errors / Time period |
| Mean Time Between Failures (MTBF) | Reliability | The average time between system or application failures. A higher MTBF indicates greater system reliability. | Sum(time between failures) / Number of failures |
| Mean Time Between Incidents (MTBI) | Reliability | The average time between two consecutive incidents or failures. | Total time / Number of incidents |
| Mean Time To Detection (MTTD) | Detection | The average time it takes to detect an incident or failure. | Total detection time / Number of incidents |
| Mean Time To Resolution (MTTR) | Resolution | The average time it takes to resolve an incident or failure. | Total resolution time / Number of incidents |
| Service-Level Agreements (SLAs) | Agreement | The formal agreements that define the expected level of service, availability, and performance. | Defined in the SLA document |
| Service-Level Indicators (SLIs) | Indicator | The measurable metrics that indicate the performance, availability, and quality of a system. | Defined in the SLA document |
| Service-Level Objectives (SLOs) | Objective | The specific targets for the performance, availability, and quality of a system. | Defined in the SLA document |
| Customer satisfaction | Satisfaction | The degree of satisfaction that customers have with a product, service, or experience. | CSAT (Customer Satisfaction) score, NPS (Net Promoter Score), or other metric |
| On-time project completion | Timeliness | The ability to complete a project within the expected or promised timeline. | Projects completed on time / Total projects |
| Software development efficiency | Development | The effectiveness of the software development process in delivering quality software. | Value delivered / Development cost |
| Conversion rates | Business | The percentage of users who take a desired action, such as making a purchase. | (Conversions / Total visitors) × 100 |
| Cost-effectiveness | Business | The ratio of the benefits or value of a product or service to its cost. | Benefits / Costs |
| Return-On-Investment (ROI) | Business | The measure of the profitability or financial return of an investment. | (Gain - Investment) / Investment |
| Speed of innovation | Innovation | The rate or efficiency of developing and introducing new ideas, products, or services. | Innovations / Time period |
| Total Cost of Ownership (TCO) | Cost Analysis | The total cost of owning and operating a product, service, or system over its entire lifecycle. | Acquisition costs + Operating costs + Disposal costs |
| Incident response time | Security | The average time it takes to respond to a security incident or breach. | Total response time / Number of incidents |
| System resource utilisation | Performance | The percentage of system resources being used, such as CPU, memory, and storage. | (Resource usage / Total capacity) × 100 |
| Churn rate | Business | The percentage of customers who stop using a product or service within a given time period. | (Churned customers / Total customers) × 100 |
| User engagement | Satisfaction | The level of user interaction or engagement with a product, service, or system. | Engagement metric (e.g., session duration, page views) |
| Developer productivity | Development | The efficiency of developers in delivering quality code and features. | Feature output / Developer hours |
| Time to market | Innovation | The time it takes for a product, service, or feature to go from conception to market. | Time from conception to market release |
| Network latency | Performance | The time it takes for data to travel from one point to another within a network. | Round-trip time or one-way latency |
| First response time | Support | The average time it takes to respond to a customer inquiry or support request. | Total first response time / Number of requests |
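
As a rough illustration, several of the formulas above can be computed directly from incident records. The sketch below is one possible approach, not a reference implementation: the `Incident` record and its field names are hypothetical, MTTR is measured from the start of the fault (some teams measure from detection), and the error budget helper assumes the common convention of budget = (1 - SLO target) × total requests, which is more specific than the table's formula.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_timedelta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents, window_start, window_end):
    """Apply the table's formulas to the incidents in one reporting window."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = window_end - window_start
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return {
        # MTTD: total detection time / number of incidents
        "mttd": mean_timedelta([i.detected - i.started for i in incidents]),
        # MTTR: total resolution time / number of incidents
        "mttr": mean_timedelta([i.resolved - i.started for i in incidents]),
        # MTBI: total time / number of incidents
        "mtbi": total / len(incidents),
        # Availability: (available time / total time) x 100
        "availability_pct": 100 * ((total - downtime) / total),
    }


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (assumed convention)."""
    allowed = (1 - slo_target) * total_requests
    return (allowed - failed_requests) / allowed
```

In practice the incident records would come from an incident management or monitoring tool; overlapping incidents would need to be merged before summing downtime, which this sketch does not attempt.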

Forecasting and Predictions

The following metrics support forecasting and prediction. For each one, the table lists typical estimation methods and possible data sources.

| Metric | Description | Estimation Method | Possible Data Sources |
| --- | --- | --- | --- |
| Projected Sales Revenue | Predict future sales revenue based on historical sales data, seasonal trends, and other factors. | Time series forecasting methods, such as ARIMA, Exponential Smoothing, or Machine Learning algorithms (e.g., LSTM, Prophet) | Sales databases, Customer Relationship Management (CRM) systems, Enterprise Resource Planning (ERP) systems |
| Predicted Customer Demand | Estimate future customer demand for products or services to ensure appropriate inventory levels. | Demand forecasting methods, such as moving average, regression analysis, or machine learning techniques (e.g., XGBoost, Random Forest) | Sales databases, CRM systems, ERP systems, web analytics, customer feedback |
| Required Capacity | Predict the capacity needed to meet future demand and optimise resource utilisation. | Capacity utilisation ratio, queuing theory, or simulation modelling | Infrastructure monitoring tools, cloud provider monitoring services, resource usage databases |
| Predicted Workload | Forecast the expected workload for a team or department to optimise staffing levels and improve efficiency. | Time series analysis, regression models, or machine learning algorithms | Workforce management systems, project management tools, HR databases |
| Churn Probability | Estimate the likelihood of customers leaving a product or service, allowing for targeted retention efforts. | Logistic regression, decision trees, or machine learning techniques (e.g., SVM, Neural Networks) | CRM systems, customer feedback, web analytics, transaction records |
| Projected Financial Metrics | Predict future financial performance to support strategic planning and decision-making. | Time series forecasting, financial modelling, or machine learning techniques | Financial databases, ERP systems, accounting software |
| Predicted Market Share | Estimate a company's future market share based on market trends, competitor analysis, and other factors. | Market growth models, regression analysis, or machine learning models | Market research data, competitor analysis, industry reports, web analytics |
| Recommendation Score | Predict the likelihood that a customer will be interested in a specific product or service. | Collaborative filtering, content-based filtering, or hybrid recommendation systems (e.g., Matrix Factorization, Deep Learning) | Transaction records, customer feedback, web analytics, CRM systems |
| Anomaly Score | Identify unusual patterns or behaviour in data to detect potential issues or opportunities. | Statistical methods (e.g., Z-score, IQR), machine learning techniques (e.g., Isolation Forest, Autoencoders) | Log management tools, monitoring and alerting tools, SIEM tools |
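
The two statistical methods named for the anomaly score, Z-score and IQR, are simple enough to sketch with the Python standard library. The function names and the `k=1.5` fence multiplier below are illustrative choices rather than anything prescribed here, and the sketch assumes a small, single-metric sample such as request latencies.

```python
import statistics


def z_scores(values):
    """Z-score of each point: its distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return [0.0 for _ in values]   # all points identical, nothing stands out
    return [(v - mean) / stdev for v in values]


def iqr_outliers(values, k=1.5):
    """Indices of points outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]
```

One caveat worth knowing: in a small sample a single extreme point inflates the standard deviation, so its Z-score cannot grow very large (with ten points it cannot exceed 3). The IQR fences are less sensitive to this, which is one reason the two methods are often used side by side.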

More

| Metric | Description | Estimation Method | Possible Data Sources |
| --- | --- | --- | --- |
| Incident Probability | Predict the likelihood of system incidents or failures based on historical data and trends, enabling proactive maintenance and resource allocation. | Logistic regression, decision trees, or machine learning techniques (e.g., SVM, Neural Networks) | Monitoring and alerting tools (e.g., Datadog, Nagios, Prometheus), incident management tools (e.g., PagerDuty, Opsgenie), log management tools (e.g., ELK Stack, Splunk) |
| Predicted Resource Utilisation | Estimate future resource utilisation (e.g., CPU, memory, storage) for IT systems to optimise capacity planning and avoid performance bottlenecks. | Time series forecasting methods, such as ARIMA, Exponential Smoothing, or Machine Learning algorithms (e.g., LSTM, Prophet) | Infrastructure monitoring tools (e.g., Datadog, Nagios, Zabbix), cloud provider monitoring services (e.g., Amazon CloudWatch, Google Cloud Monitoring (formerly Stackdriver), Microsoft Azure Monitor) |
| Predicted Network Traffic | Forecast network traffic patterns to optimise bandwidth allocation and prevent congestion. | Time series analysis, regression models, or machine learning algorithms | Network monitoring tools (e.g., SolarWinds, PRTG, Wireshark), traffic analysis tools (e.g., ntopng, Cacti) |
| Predicted Application Performance | Predict the performance of applications based on historical data and trends, enabling proactive optimisation and resource allocation. | Time series forecasting, regression analysis, or machine learning techniques | Application Performance Monitoring (APM) tools (e.g., New Relic, Dynatrace, AppDynamics), log management tools (e.g., ELK Stack, Splunk) |
| Threat Probability | Estimate the likelihood of potential cybersecurity threats, allowing for targeted prevention and mitigation efforts. | Logistic regression, decision trees, or machine learning techniques (e.g., SVM, Neural Networks) | Security Information and Event Management (SIEM) tools (e.g., Splunk Enterprise Security, IBM QRadar, LogRhythm), threat intelligence platforms (e.g., Recorded Future, ThreatConnect, Anomali) |
| Predicted Ticket Volume | Forecast the number of help desk tickets to be received, enabling better resource allocation and staffing decisions. | Time series forecasting methods, such as ARIMA, Exponential Smoothing, or Machine Learning algorithms (e.g., LSTM, Prophet) | IT service management (ITSM) tools (e.g., ServiceNow, Jira Service Management, Freshservice), ticketing systems (e.g., Zendesk, Salesforce Service Cloud) |
| Predicted Infrastructure Growth | Estimate the growth of IT infrastructure requirements based on historical data and trends, supporting strategic planning and budgeting. | Time series analysis, regression models, or machine learning algorithms | Infrastructure monitoring tools (e.g., Datadog, Nagios, Zabbix), cloud provider monitoring services (e.g., Amazon CloudWatch, Google Cloud Monitoring (formerly Stackdriver), Microsoft Azure Monitor), asset management systems |
| Patch Success Probability | Predict the success rate of deploying patches or updates based on historical data, allowing for targeted troubleshooting and improved patch management. | Logistic regression, decision trees, or machine learning techniques (e.g., SVM, Neural Networks) | Patch management tools (e.g., WSUS, SCCM, BigFix), CI/CD pipelines (e.g., Jenkins, GitLab CI/CD, Azure DevOps), configuration management tools (e.g., Ansible, Puppet, Chef) |
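
To make one of the listed methods concrete, here is a minimal sketch of simple exponential smoothing, the most basic member of the Exponential Smoothing family mentioned for metrics such as ticket volume. It is an illustration only: it handles neither trend nor seasonality, the function names and the default `alpha=0.3` are assumptions, and a production forecast would more likely use an established library.

```python
def exponential_smoothing_forecast(series, alpha=0.3):
    """One-step-ahead forecast using simple exponential smoothing.

    The smoothed level is updated as:
        level_t = alpha * y_t + (1 - alpha) * level_{t-1}
    and the final level is the forecast for the next period.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def backtest_mae(series, alpha=0.3):
    """Mean absolute error of one-step-ahead forecasts over the history.

    Requires at least two observations.
    """
    errors = []
    for t in range(1, len(series)):
        forecast = exponential_smoothing_forecast(series[:t], alpha)
        errors.append(abs(series[t] - forecast))
    return sum(errors) / len(errors)
```

A higher `alpha` reacts faster to recent changes but is noisier; comparing `backtest_mae` across a few candidate values on historical data is a simple way to choose one.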

These data sources can provide valuable information for generating insights using prediction and forecasting methods. Depending on your specific environment and toolset, you may need to adjust or combine data from multiple sources to generate the most accurate predictions.