What is anomaly detection? A definition

Learn how anomaly detection can help identify problems and deliver insights to improve business outcomes.

Anomaly detection, or outlier analysis, is a technique used in data analysis and machine learning to identify data points, events, or behaviors that deviate significantly from what is expected within a given dataset or context. By highlighting atypical findings that may indicate potential problems, errors, or interesting insights, anomaly detection can lead to specific actions to address the issues. With the assistance of AIOps (artificial intelligence for IT operations) automation, much larger volumes of data can be analyzed and anomalies identified automatically, freeing IT teams for other projects, saving time, and improving productivity.

Anomaly detection business use cases

Anomaly detection can be used across a wide variety of industries and applications to derive tangible benefits in efficiency, security, and decision-making. Common examples include detecting fraudulent transactions in financial services, spotting cyberthreats in network traffic, monitoring sensor data from industrial equipment, and flagging abnormal patient readings in healthcare.

Anomaly detection algorithms

Anomaly detection algorithms find and isolate data outliers so that a known or unknown problem can be addressed, or an improvement or enhancement made. There are many such algorithms, each with its own strengths and suitable applications. The choice of algorithm depends on the characteristics of the data, the nature of the anomalies, and the specific requirements of the detection task; in many cases, a combination of algorithms delivers the best results. Commonly used ML-based approaches include statistical methods, clustering- and density-based techniques such as the local outlier factor, tree-based ensembles such as the isolation forest, one-class support vector machines, and neural-network autoencoders.
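
One of these, the isolation forest, separates outliers with an ensemble of random trees. The sketch below is illustrative only; the library (scikit-learn), parameters, and synthetic data are assumptions rather than recommendations from this article.

```python
# Illustrative sketch: applying an isolation forest to synthetic 2-D data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Mostly "normal" points clustered around the origin, plus a few far-off outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=-8.0, high=8.0, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination encodes the assumed fraction of anomalies in the data.
model = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

print(f"Flagged {int((labels == -1).sum())} of {len(X)} points as anomalies")
```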

Supervised vs. unsupervised anomaly detection

Supervised and unsupervised anomaly detection are two different methods for identifying anomalies in a dataset. The choice of which method to use will depend on the nature of the anomalies and the characteristics of the data. While a hybrid approach may be appropriate, in general, supervised anomaly detection is most effective when labeled examples are accessible and training the model on known patterns is a priority. Unsupervised anomaly detection is more appropriate when labeled data isn’t available or when anomalies aren’t well-defined.

Supervised anomaly detection

In this method, the algorithm is trained on a pre-labeled dataset that includes both normal and anomalous instances, and the model learns to differentiate between the two classes during training. Because supervised anomaly detection requires large sets of labeled data that include examples of anomalies, it is most often used where such data already exists, such as historical transaction data for fraud detection or labeled sensor data from industrial equipment.
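
A minimal sketch of the idea, assuming scikit-learn and entirely synthetic "transaction" features and fraud labels (all names and values are hypothetical), might look like this:

```python
# Sketch of supervised anomaly detection: a classifier trained on labeled transactions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical transactions: amount and hour of day; ~2% carry a known fraud label.
X = np.column_stack([rng.exponential(scale=50.0, size=n),
                     rng.integers(0, 24, size=n)])
y = (rng.random(n) < 0.02).astype(int)   # 1 = labeled fraud, 0 = labeled normal
X[y == 1, 0] *= 10                        # make the labeled fraud look different

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The classifier learns the boundary between the two labeled classes.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
print("Transactions flagged as fraud in the test set:", int(clf.predict(X_test).sum()))
```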

Unsupervised anomaly detection

In this method, the algorithm is not provided with labeled data or examples of anomalies. Instead, it learns the natural patterns present in the majority of the data and flags instances that deviate significantly from those learned patterns as anomalies. Although this method can be time- and resource-intensive, it is most useful when labeled anomaly data is limited or when types of anomalies are difficult to categorize in advance, such as analyzing network traffic for patterns that may indicate cyberthreats, or monitoring healthcare data for abnormal readings that may point to previously unknown patient conditions.
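
As a rough sketch, assuming scikit-learn and hypothetical, unlabeled "network traffic" features, an unsupervised detector such as the local outlier factor could be applied like this:

```python
# Sketch of unsupervised anomaly detection on unlabeled records.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Unlabeled traffic records: bytes transferred and connection duration (hypothetical).
traffic = rng.normal(loc=[500.0, 2.0], scale=[100.0, 0.5], size=(1000, 2))
traffic[:5] = [5000.0, 30.0]   # a handful of unusually large, long-lived flows

# LOF flags records whose local density differs sharply from their neighbors'.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(traffic)   # -1 = anomaly, 1 = normal
print("Indices of suspicious flows:", np.where(labels == -1)[0])
```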

Anomaly detection metrics

Anomaly detection metrics can help evaluate the specificity and performance of anomaly detection models and algorithms. While no single metric captures anomaly detection performance on its own, here are three of the most common ML-based anomaly detection metrics:

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

AUC-ROC measures the model's ability to distinguish between normal and anomalous instances across various threshold values. A higher AUC-ROC indicates better model performance. This metric is particularly helpful when there is an imbalance between anomalies and normal instances in the dataset.
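
A minimal sketch of computing AUC-ROC with scikit-learn, using hypothetical ground-truth labels and anomaly scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]                        # 1 = true anomaly
scores = [0.1, 0.2, 0.15, 0.3, 0.9, 0.25, 0.4, 0.8, 0.2, 0.7]  # model anomaly scores

# Values closer to 1.0 indicate better separation of anomalies from normal points.
print("AUC-ROC:", roc_auc_score(y_true, scores))
```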

PR-AUC (Area Under the Precision-Recall Curve)

PR-AUC measures the trade-off between precision and recall across different thresholds. A high PR-AUC suggests a good balance between precision and recall for anomaly detection. This metric is particularly relevant when the dataset is imbalanced and contains relatively few anomalies.
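
As a sketch, average precision is one common way to summarize the area under the precision-recall curve in scikit-learn (the labels and scores below are hypothetical):

```python
from sklearn.metrics import average_precision_score

y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.9, 0.25, 0.4, 0.8, 0.2, 0.7]

# Average precision condenses the precision-recall trade-off into one number.
print("PR-AUC (average precision):", average_precision_score(y_true, scores))
```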

F1 score

The F1 score is the harmonic mean of precision and recall, providing a single metric to balance false positives and false negatives. The F1 score is especially useful when both false positives and false negatives are important considerations.
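
A brief sketch of the F1 score with scikit-learn, using hypothetical labels and thresholded predictions:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]   # 1 = true anomaly
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]   # predictions after applying a score threshold

# The harmonic mean of precision and recall balances false positives and false negatives.
print("F1 score:", f1_score(y_true, y_pred))
```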

The power of pairing AIOps with anomaly detection

AIOps leverages advanced analytics, ML, and automation to enhance IT operations (ITOps), improving the management and maintenance of IT infrastructure and services. When combined with anomaly detection, AIOps becomes a powerful tool for improving system reliability, performance, and staff efficiency by providing proactive monitoring, reducing alert fatigue, automating routine tasks, facilitating root cause analysis, and improving incident response.

The pairing also offers major benefits for predictive analytics and capacity optimization, and it is particularly valuable in complex, dynamic IT environments where traditional monitoring approaches may not deliver the desired results. Here are some additional advantages of AIOps-driven anomaly detection:

Pattern recognition

Because AIOps is capable of recognizing patterns and trends within large and complex datasets, it is particularly helpful when there is simply too much data for humans to analyze. AIOps can identify patterns in system behavior, performance metrics, and user interactions, helping distinguish normal activity from unusual activity.
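
As a simplified stand-in for this capability (real AIOps platforms operate at far greater scale; the metric, window, and threshold here are hypothetical), a rolling statistic can flag points that break from recent behavior in a performance-metric series:

```python
# Sketch: flag points that deviate sharply from a metric's recent rolling behavior.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical CPU utilization sampled every 5 minutes, with one injected spike.
cpu = pd.Series(50 + rng.normal(0.0, 3.0, size=288))
cpu.iloc[200] = 95

# Compare each point to the rolling mean and standard deviation of recent samples.
rolling_mean = cpu.rolling(window=24, min_periods=24).mean()
rolling_std = cpu.rolling(window=24, min_periods=24).std()
z = (cpu - rolling_mean) / rolling_std

anomalies = cpu[z.abs() > 4]   # points far outside recent patterns
print(anomalies)
```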

Data correlation

AIOps correlates events and data across different sources to help establish context for a specific anomaly. For example, it can link performance degradation with IT environment changes, helping IT teams determine which outliers matter and which ones do not.
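
A minimal sketch of this kind of correlation, assuming pandas and hypothetical anomaly and change-event records, joins each anomaly to the most recent preceding change:

```python
import pandas as pd

# Hypothetical detected anomalies and change events from separate sources.
anomalies = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 10:17", "2024-05-01 14:02"]),
    "metric": ["api_latency_ms", "error_rate"],
})
changes = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 10:05", "2024-05-01 13:55"]),
    "change": ["deploy web v2.3.1", "config update: cache TTL"],
})

# Attach the most recent change that preceded each anomaly to provide context.
context = pd.merge_asof(anomalies.sort_values("timestamp"),
                        changes.sort_values("timestamp"),
                        on="timestamp", direction="backward")
print(context)
```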

Automated remediation

Automated remediation is the process of automatically addressing and resolving issues without human intervention, measurably improving mean time to repair (MTTR) and minimizing the impact on operations. AIOps can drastically improve incident response time and accuracy by identifying and categorizing incidents; alerting on anomalies and recommending the next action; determining root cause; complying with defined policies and playbook responses; and dynamically adapting remediation actions to the nature and context of each anomaly, supplemented by expert human recommendations (human-in-the-loop, or HITL, integration). Automated remediation is particularly valuable in dynamic and complex IT environments where optimal performance relies on fast responses to anomalies.
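
The toy sketch below illustrates the general shape of such a flow; the playbook entries, function name, and HITL gate are hypothetical and do not represent a real AIOps API:

```python
# Hypothetical playbook mapping anomaly types to approved remediation actions.
PLAYBOOK = {
    "high_memory": "restart_service",
    "disk_full": "expand_volume",
}

def remediate(anomaly_type: str, require_approval: bool = False) -> str:
    """Return the next step for an identified anomaly."""
    action = PLAYBOOK.get(anomaly_type)
    if action is None:
        return "escalate_to_engineer"       # no approved fix: a human takes over
    if require_approval:
        return f"await_approval:{action}"   # human-in-the-loop (HITL) gate
    return f"execute:{action}"              # fully automated response

print(remediate("high_memory"))             # execute:restart_service
print(remediate("unknown_spike"))           # escalate_to_engineer
```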

Adaptive learning and continuous improvement

Adaptive learning refers to the combined ability of anomaly detection and AIOps to adjust systems and training models dynamically and in real time, improving performance based on ongoing experience, changing conditions, and the expanding log, metric, and event data generated by automated remediation actions. This evolution enhances the efficiency and effectiveness of ITOps: the system autonomously learns from new patterns and feedback loops, reducing false positives and enabling better resource allocation. It can also analyze the effectiveness of past remediation actions, apply context-aware analysis, and update playbooks to improve future responses and predictive analytics in a cycle of continuous improvement.
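
As a rough sketch of the feedback-loop idea (the model choice, data, and workflow are assumptions for illustration), analyst-confirmed labels on past alerts can be folded back into a detector incrementally:

```python
# Sketch: incrementally update a detector with human-reviewed alert labels.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)

# A classifier that supports incremental updates via partial_fit.
model = SGDClassifier(random_state=0)

# Initial training on historical alerts already labeled by operators (hypothetical data).
X_hist = rng.normal(size=(200, 4))
y_hist = rng.integers(0, 2, size=200)
model.partial_fit(X_hist, y_hist, classes=[0, 1])

# Later: engineer-reviewed alerts (human-in-the-loop feedback) become new training data,
# nudging the model to repeat fewer false positives without a full retrain.
X_feedback = rng.normal(size=(20, 4))
y_feedback = rng.integers(0, 2, size=20)
model.partial_fit(X_feedback, y_feedback)
print("Model updated with", len(X_feedback), "reviewed alerts")
```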

By combining adaptive learning principles with AIOps-driven anomaly detection, organizations can ensure their systems are continuously learning from operational events, remain able to adjust to dynamic changes in their environments, and provide fast and effective responses to anomalies for optimal operational resiliency.

How your company can leverage anomaly detection