What is anomaly detection? A definition

Learn how anomaly detection can help identify problems and deliver insights to improve business outcomes.

Anomaly detection, or outlier analysis, is a technique used in data analysis and machine learning to identify patterns, behaviors, or events that deviate significantly from expected behavior within a given dataset or circumstance. By highlighting atypical findings that may indicate potential problems, errors, or interesting insights, anomaly detection can lead to specific actions to address the issues. With the assistance of AIOps (artificial intelligence for IT operations) automation, much larger volumes of data can be identified and analyzed, freeing up IT teams for other projects, saving time, and improving productivity.

Anomaly detection business use cases

Anomaly detection can be used across a wide variety of industries and applications to derive tangible benefits in efficiency, security, and decision-making. Some examples of specific uses and benefits are below:

Anomaly detection in machine learning

Anomaly detection in machine learning (ML) involves using algorithms and techniques to identify unusual patterns or observations within a dataset that deviate significantly from the norm. The primary objective is to pinpoint instances that are different from the majority of the data, which may indicate errors, fraud, or otherwise interesting information.

There are several possible approaches, including:

  • Supervised learning: an algorithm is trained on a labeled dataset of both normal and anomalous instances
  • Unsupervised learning: an algorithm is trained on a dataset without explicit labels for anomalies
  • Semi-supervised learning: an algorithm is trained mostly on normal instances, supplemented by a small number of labeled anomalies where they are available
  • Ensemble methods: a combination of multiple anomaly detection models is employed for improved overall performance

Anomaly detection in ML may be used in various domains, including fraud detection, network security, and equipment monitoring, among others.

Fraud detection—finance/banking

With algorithms and models to identify potentially fraudulent activities or transactions, data analysts can distinguish between legitimate and fraudulent behavior. Anomaly detection for fraud benefits many industries, including finance, e-commerce, and insurance, helping to prevent financial losses and ensure resilient systems. Continuous model improvements are crucial to stay ahead of evolving fraud strategies. Applications for fraud detection range from credit card misuse, identity theft, and insurance fraud to malicious behavior in online platforms.

Cancer/Tumor detection—healthcare

Anomaly detection plays a crucial role in healthcare, helping to detect irregularities that may indicate potential issues, anomalies in patient data, or patterns that could indicate diseases or adverse events. For cancer or tumor detection, identifying deviations from normal tissue characteristics in medical imaging is crucial to early cancer detection, enabling rapid intervention and improved patient outcomes. In addition, anomaly detection in healthcare can play a critical role in other domains:

  • Patient monitoring of vital signs and patient data to identify abnormal patterns that may suggest deteriorating health or the onset of a medical condition
  • Early disease diagnosis by identifying unusual patterns in medical imaging, laboratory test results, or patient histories
  • Detection of fraudulent activities, billing anomalies, or abuse of healthcare resources to help prevent financial losses
  • Medication monitoring to better ensure patient adherence to prescribed medications
  • Population health analysis to identify trends, clusters of diseases, or unusual health outcomes

Cybersecurity—IT

Identifying unusual or abnormal patterns of behavior in computer networks, systems, or user activities can help detect potential security threats, attacks, or malicious activities, sending early warnings to allow faster responses to potential security breaches. With the help of AI/ML, algorithms can learn and adapt to the evolving nature of cyberthreats automatically, helping to enhance an organization’s overall security posture.

Anomaly detection in cybersecurity can help organizations detect both known and unknown threats, reduce false positives, and respond more efficiently and effectively to security incidents. Continuous monitoring and updating of anomaly detection models is crucial to adapt to evolving cybersecurity threats and strategies.

Examples of anomaly detection for cybersecurity include network traffic monitoring, user behavior analysis, endpoint and application anomalies, insider threat detection, and SIEM (security information and event management) systems that collect and analyze log data for abnormalities.

Sales spikes—retail

In retail, anomaly detection involves looking for changes within sales transactions, customer behaviors, and inventory management. By identifying unusual and unexpected surges (spikes) or drops in sales data, retailers can use anomaly detection to pinpoint specific time periods or events where consumer behavior and sales behavior significantly differ from historical or expected patterns. Observing these changes can help businesses understand the unusual behavior and act by assessing and adjusting inventory, rethinking marketing strategies, or addressing operational issues in order to improve efficiency, prevent financial losses, and boost overall business intelligence.

Anomaly detection also supports transaction monitoring for unauthorized credit card use and other types of payment fraud, as well as inventory management, price anomaly detection, customer behavior analysis, employee fraud detection, and data security, among other retail applications.

By implementing effective anomaly detection systems, retailers can mitigate risks, improve security, and maintain a high level of trust with customers while maximizing operational efficiency.

Anomaly detection algorithms

Well-defined anomaly detection algorithms are used to find and isolate data outliers to address a known or unknown problem or drive an improvement or enhancement. There are several anomaly detection algorithms, each with its own strengths and suitable applications. Both the choice of algorithm and its effectiveness depend on the characteristics of the data, the nature of the anomalies, and the specific requirements of the detection task. In practice, a combination of different algorithms often works best. Here are some commonly used ML-based anomaly detection algorithms:

Isolation Forest

Isolation Forests are decision tree-based algorithms that isolate anomalies by building many isolation trees, each of which repeatedly picks a random feature and a random split value. Anomalies are identified as instances that require fewer splits to isolate in the trees. Isolation Forests are particularly useful for high-dimensional data (data in which there are many features or variables relative to observations) and are relatively efficient, making them a good choice for real-time anomaly detection.
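To make the "fewer steps to isolate" idea concrete, here is a minimal pure-Python sketch of an isolation forest. It is an illustration under simplifying assumptions (small in-memory data, numeric tuples, at least two points), not a production implementation. The function names, the tree count, and the sample data are all hypothetical choices made for this example.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, rng, depth, max_depth):
    """Split on a random feature at a random value until points are isolated."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }

def path_length(point, node, depth=0):
    """Number of splits needed to reach a leaf, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    side = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Scores near 1 suggest anomalies; scores around 0.5 or below suggest normal points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), rng, 0, max_depth)
        for _ in range(n_trees)
    ]
    norm = average_path_length(sample_size)
    scores = []
    for p in points:
        mean_path = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2 ** (-mean_path / norm))
    return scores

# Illustrative data: 32 points in the unit square plus one far-away point.
gen = random.Random(1)
data = [(gen.random(), gen.random()) for _ in range(32)] + [(10.0, 10.0)]
scores = anomaly_scores(data)
most_anomalous = max(range(len(data)), key=scores.__getitem__)
print(most_anomalous, round(scores[most_anomalous], 2))
```

The far-away point tends to be separated by the first or second random split, so its average path is short and its score is high.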

Local Outlier Factor (LOF)

LOF calculates the local density deviation of a data point relative to its neighbors. Anomalies have significantly lower local density than their neighbors. LOF algorithms can be effective for datasets with complex structures.
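The density comparison can be sketched in a few lines of plain Python. This is a simplified illustration: it assumes distinct points, ignores ties in the k-th neighbor distance, and uses a brute-force distance matrix that only suits small datasets. The function name and the grid data are hypothetical.

```python
import math

def local_outlier_factors(points, k=3):
    """Return one LOF value per point; values well above 1 suggest outliers."""
    n = len(points)
    if k >= n:
        raise ValueError("k must be smaller than the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbors of each point, excluding the point itself.
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i from neighbor j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Illustrative data: a 3x3 grid of points plus one distant point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
lof = local_outlier_factors(grid + [(10.0, 10.0)], k=3)
print([round(v, 2) for v in lof])
```

Grid points end up with values close to 1 because their density matches that of their neighbors, while the distant point has a much lower density than the grid points it is compared against.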

One-Class Support Vector Machine (SVM)

One-class SVMs adapt the support vector machine, which is normally a binary classifier, to training data that contains only one class (the normal instances). A one-class SVM learns a decision boundary that encapsulates the normal instances, and anomalies are identified as instances that fall outside that boundary. One-class SVMs are useful when there is insufficient labeled data for anomalies.

Autoencoders

Autoencoders are neural-network architectures that learn to encode and decode input data, enabling them to learn compact data representations. Anomalies are identified by measuring the reconstruction error, with higher errors indicating unusual instances. Autoencoders are useful for detecting anomalies in images or sequences, since such data includes many features and variables.

Time-Series anomaly detection

Time-series anomaly detection involves identifying outliers from expected behavior in a sequence of data points collected over and correlated with time. Detecting anomalies in time-series data can help build predictive models and identify potential issues, such as equipment failures, fraud, or abnormal system behavior, allowing for preventive actions.

Methods used for time-series anomaly detection include statistical methods, moving averages, ML models such as autoencoders and long short-term memory (LSTM) networks, the spectral residual method, Isolation Forests adapted for time series, and many more. Choosing the right method depends on the data and anomaly characteristics, and a combination of methods is often most effective.
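As a minimal sketch of the statistical, moving-window family of methods, the following flags a point when it sits more than a chosen number of standard deviations away from the mean of the preceding window. The window size, threshold, and sample series are assumptions picked for illustration.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # a flat window gives no scale to compare against
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Illustrative series with a single spike at index 7.
series = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10, 12, 11]
flagged = rolling_zscore_anomalies(series)
print(flagged)
```

One known limitation of this simple form is that a spike inflates the standard deviation of the windows that follow it, which can mask further anomalies for a few points.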

Outlier anomaly detection

Outliers in time-series anomaly detection are data points that deviate significantly from the expected pattern within a sequence of observations collected over time, and can represent unusual events or errors. Here are some ways outliers in time-series data can present:

  • Global outliers are data points that deviate significantly from the overall pattern or distribution of the represented dataset. These outliers are unusual when considered in the context of the entire dataset, rather than just within a nearby subset.
  • Contextual outliers, also known as conditional outliers, are data points that stand out compared to the rest only within a specific context or subset of the data (e.g., spikes/peaks, dips/troughs). Unlike global outliers, which deviate significantly from the overall pattern of the dataset, contextual outliers seem unusual only when accounting for certain conditions or contextual factors. These outliers may be expected within one context but abnormal in another, for example, a website traffic spike at night when there is usually no traffic.
  • Collective outliers, also known as group outliers or anomalies, refer to a group of data points that may represent anomalous behavior when considered together. Instead of individual data points being outliers, the collective behavior of a group is considered abnormal compared to the overall pattern of the entire dataset.
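The difference between a global and a contextual outlier can be shown with a small sketch. It assumes a hypothetical baseline of requests per minute grouped by time of day. The same reading is unremarkable against the pooled data but stands out against its own context.

```python
from statistics import mean, stdev

def is_outlier(value, reference, threshold=3.0):
    """True if value is more than `threshold` standard deviations from the reference mean."""
    mu, sigma = mean(reference), stdev(reference)
    return abs(value - mu) > threshold * sigma

# Hypothetical baseline traffic (requests per minute), grouped by context.
baseline = {
    "night": [4, 6, 5, 7, 5, 6],
    "day": [480, 510, 495, 520, 505, 490],
}
pooled = baseline["night"] + baseline["day"]

reading = 480
global_outlier = is_outlier(reading, pooled)                # compared with all data
night_outlier = is_outlier(reading, baseline["night"])      # compared with night traffic
day_outlier = is_outlier(reading, baseline["day"])          # compared with day traffic
print(global_outlier, night_outlier, day_outlier)
```

A reading of 480 is ordinary during the day and within the overall range of the data, so only the night-time comparison flags it.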

Supervised vs. Unsupervised anomaly detection

Supervised and unsupervised anomaly detection are two different methods for identifying anomalies in a dataset. The choice of which method to use will depend on the nature of the anomalies and the characteristics of the data. While a hybrid approach may be appropriate, in general, supervised anomaly detection is most effective when labeled examples are accessible and training the model on known patterns is a priority. Unsupervised anomaly detection is more appropriate when labeled data isn’t available or when anomalies aren’t well-defined.

Supervised anomaly detection

In this method, the algorithm is trained on a pre-labeled dataset that includes both normal and anomalous instances. The model learns to differentiate between the two classes during training. Because supervised anomaly detection is most suitable when there are large sets of labeled data and examples of anomalies, it is most often used in situations with vast amounts of pre-existing data, such as in historical data for fraud detection or labeled sensor data in industrial equipment.

Unsupervised anomaly detection

In this method, the algorithm is not provided with labeled data or anomalies. The algorithm learns the natural patterns present in the majority of the data and flags instances that deviate significantly from those learned patterns as anomalies. Although this method can be time- and resource-intensive, it is most useful when labeled anomaly data is limited or when types of anomalies are difficult to categorize, such as in analyzing network traffic data for patterns that may indicate cyberthreats, or in monitoring healthcare data for abnormal patient conditions that may indicate previously unknown conditions.

Anomaly detection metrics

Anomaly detection metrics can help evaluate the specificity and performance of anomaly detection models and algorithms. While there is no one overall metric for measuring anomaly detection, here are three of the most common ML-based anomaly detection metrics:

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

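AUC-ROC measures the model's ability to distinguish between normal and anomalous instances across various threshold values. A higher AUC-ROC indicates better model performance, with 0.5 corresponding to random guessing and 1.0 to perfect separation. Because it is threshold-independent, it can be used when anomalies and normal instances are imbalanced, although with very rare anomalies it can look optimistic and is best read alongside a precision-based metric.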
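One way to read AUC-ROC is as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal instance, with ties counted as half. The sketch below computes it directly from that definition. The labels and scores are made up for illustration, and the pairwise loop is only practical for small datasets.

```python
def roc_auc(labels, scores):
    """AUC-ROC via pairwise comparison of anomaly (1) and normal (0) scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Illustrative data: 1 marks an anomaly, scores are model anomaly scores.
labels = [0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.10, 0.20, 0.15, 0.30, 0.90, 0.35, 0.25, 0.05]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here one anomaly outranks all six normal instances and the other outranks four of them, so 10 of the 12 pairs are ordered correctly.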

Precision-Recall Curve and Area Under the Curve (PR-AUC)

PR-AUC measures the trade-off between precision and recall across different thresholds. A high PR-AUC suggests a good balance between precision and recall for anomaly detection. This metric is particularly relevant when the dataset is imbalanced and contains relatively few anomalies.
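A common step-wise estimate of the area under the precision-recall curve is average precision: rank instances by score and average the precision observed at each true anomaly. The sketch below assumes no tied scores, and the data is illustrative.

```python
def average_precision(labels, scores):
    """Average of the precision values measured at the rank of each true anomaly."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one anomaly")
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall level
    return precision_sum / total_positives

# Illustrative data: 1 marks an anomaly, scores are model anomaly scores.
labels = [0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.10, 0.20, 0.15, 0.30, 0.90, 0.35, 0.25, 0.05]
ap = average_precision(labels, scores)
print(ap)
```

The two anomalies appear at ranks 1 and 4, giving precisions of 1/1 and 2/4. Unlike ROC-based measures, every normal instance ranked above an anomaly lowers the result directly, which is why this metric is sensitive to false positives when anomalies are rare.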

F1 score

The F1 score is the harmonic mean of precision and recall, providing a single metric to balance false positives and false negatives. The F1 score is especially useful when both false positives and false negatives are important considerations.
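A short sketch of the calculation, using made-up true labels and thresholded predictions (1 for anomaly, 0 for normal):

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the anomaly class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no anomaly was correctly detected
    precision = tp / (tp + fp)  # share of flagged instances that are real anomalies
    recall = tp / (tp + fn)     # share of real anomalies that were flagged
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1 = f1_score(y_true, y_pred)
print(round(f1, 3))
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, and so is their harmonic mean.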

The power of pairing AIOps with anomaly detection

AIOps leverages advanced analytics, ML, and automation to enhance IT operations (ITOps), improving the management and maintenance of IT infrastructure and services. When combined with anomaly detection, AIOps becomes a powerful tool for improving system reliability, performance, and staff efficiency by providing proactive monitoring, reducing alert fatigue, automating routine tasks, facilitating root cause analysis, and improving incident response.

It also offers major benefits for predictive analytics and capacity optimization. The combination of AIOps and anomaly detection is particularly valuable in complex and dynamic IT environments where traditional monitoring approaches may not deliver desired results. Here are some additional advantages of AIOps-driven anomaly detection:

Pattern recognition

Because AIOps is capable of recognizing patterns and trends within large and complex datasets, it is particularly helpful when there is simply too much data for humans to analyze. AIOps can identify patterns in system behavior, performance metrics, and user interactions, helping distinguish normal from unusual.

Data correlation

AIOps correlates events and data across different sources to help establish context for a specific anomaly. For example, it can link performance degradation with IT environment changes, helping IT teams determine which outliers matter and which ones do not.
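A simplified sketch of this kind of correlation: each anomaly is linked to change events on the same service that happened shortly before it. The record fields, service names, and the 30-minute lookback window are hypothetical choices for illustration, and real AIOps platforms use far richer topology and event data.

```python
from datetime import datetime, timedelta

def correlate_with_changes(anomalies, changes, lookback=timedelta(minutes=30)):
    """Map each anomaly id to the ids of recent changes on the same service."""
    linked = {}
    for anomaly in anomalies:
        linked[anomaly["id"]] = [
            change["id"]
            for change in changes
            if change["service"] == anomaly["service"]
            and timedelta(0) <= anomaly["time"] - change["time"] <= lookback
        ]
    return linked

anomalies = [
    {"id": "A1", "service": "checkout", "time": datetime(2024, 1, 1, 10, 20)},
    {"id": "A2", "service": "search", "time": datetime(2024, 1, 1, 11, 0)},
]
changes = [
    {"id": "C1", "service": "checkout", "time": datetime(2024, 1, 1, 10, 5)},   # 15 minutes before A1
    {"id": "C2", "service": "checkout", "time": datetime(2024, 1, 1, 8, 0)},    # outside the lookback window
    {"id": "C3", "service": "payments", "time": datetime(2024, 1, 1, 10, 55)},  # different service
]
links = correlate_with_changes(anomalies, changes)
print(links)
```

An anomaly with a linked change gives the team an immediate lead to investigate, while one with no linked change may point to an external cause.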

Automated remediation

Automated remediation is the process of automatically addressing and resolving issues without human intervention, which measurably improves mean time to repair (MTTR) and minimizes the impact on operations. The process typically involves identifying and categorizing incidents; alerting automatically on anomalies and the recommended next action; determining root cause; integrating and complying with defined policies and playbook responses; and dynamically adapting remediation actions based on the nature and context of the identified anomalies as well as expert human recommendations (human-in-the-loop, or HITL, integration). By following these steps, AIOps can drastically improve incident response time and accuracy. Automated remediation is particularly valuable in dynamic and complex IT environments where optimal performance relies on fast responses to anomalies.
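The playbook and human-in-the-loop ideas can be sketched as a simple lookup. Everything here is hypothetical: the anomaly types, the action names, and the rule that some actions need approval are assumptions for illustration, not a description of any particular product.

```python
# Hypothetical playbook: anomaly type -> (action name, requires human approval).
PLAYBOOK = {
    "disk_usage_high": ("purge_temp_files", False),
    "service_unresponsive": ("restart_service", False),
    "primary_db_degraded": ("failover_to_replica", True),
}

def plan_remediation(anomaly_type, approved_by=None):
    """Decide the next step for an anomaly: execute, wait for approval, or escalate."""
    if anomaly_type not in PLAYBOOK:
        # No defined response: hand the incident to a person.
        return {"decision": "escalate", "action": None}
    action, needs_approval = PLAYBOOK[anomaly_type]
    if needs_approval and approved_by is None:
        # Higher-risk actions wait for a human in the loop.
        return {"decision": "await_approval", "action": action}
    return {"decision": "execute", "action": action}

print(plan_remediation("service_unresponsive"))
print(plan_remediation("primary_db_degraded"))
print(plan_remediation("primary_db_degraded", approved_by="on-call engineer"))
print(plan_remediation("unknown_pattern"))
```

Keeping low-risk actions fully automated while gating disruptive ones behind approval is one way to balance response speed against the cost of a wrong automated action.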

Adaptive learning and continuous improvement

Adaptive learning refers to the combined capabilities of anomaly detection and AIOps to enable systems and training models to adjust dynamically and in real time, improving their performance based on ongoing experiences, changing conditions, and expanding log, metric, and event data driven by automated remediation actions. This evolution can enhance the efficiency and effectiveness of ITOps by allowing the system to autonomously learn from and adapt to new patterns and changes in feedback loops, reducing false positives and allowing better resource allocation. It can analyze the effectiveness of remediation actions, apply context-aware analysis, and update playbooks to improve future responses and predictive analytics in a cycle of continuous improvement.

By combining adaptive learning principles with AIOps-driven anomaly detection, organizations can ensure their systems are continuously learning from operational events, remain able to adjust to dynamic changes in their environments, and provide fast and effective responses to anomalies for optimal operational resiliency.
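As a very small illustration of a baseline that keeps learning, the sketch below maintains an exponentially weighted mean and variance that are updated with every observation judged normal. The class name, smoothing factor, threshold, and warm-up length are assumptions. Real adaptive systems draw on many more signals, including feedback from remediation outcomes.

```python
class AdaptiveBaseline:
    """Exponentially weighted baseline that adapts as new normal data arrives."""

    def __init__(self, alpha=0.2, threshold=3.0, warmup=5):
        self.alpha = alpha          # weight given to the newest observation
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # observations to absorb before flagging
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Return True if x looks anomalous; otherwise fold x into the baseline."""
        if self.mean is None:
            self.mean = float(x)
            self.count = 1
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        anomalous = (
            self.count >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        if not anomalous:
            # Incremental exponentially weighted mean and variance update.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
            self.count += 1
        return anomalous

detector = AdaptiveBaseline()
stream = [100, 102, 100, 102, 100, 102, 100, 150]
flags = [detector.update(x) for x in stream]
print(flags)
```

Excluding flagged values from the update keeps a single spike from distorting the baseline. The trade-off is that a genuine, lasting shift in behavior would keep being flagged until someone confirms it as the new normal, which is where feedback from operators comes in.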

How your company can leverage anomaly detection

Improved product quality

By identifying deviations from expected patterns in the production process, product characteristics, or quality metrics, anomaly detection can lead to measurable improvements in product quality. With early and accurate response to those deviations, organizations can proactively address issues, reduce product defects, and enhance overall product quality, leading to better customer experiences and organizational outcomes.

The process for driving improved product quality with AIOps-enabled anomaly detection may include:

  1. Defining quality metrics, including dimensional accuracy (data parameters), material composition, and performance specifications.
  2. Collecting data from the production process, testing, and quality control stages, including sensor data, inspection records, testing results, and any other product quality data points.
  3. Establishing baseline behavior from historical data by identifying typical ranges, variations, and trends.
  4. Implementing anomaly detection models for the collected data with appropriate algorithms.
  5. Monitoring continuously to evaluate quality metrics in real time for immediate anomaly detection.
  6. Designing early warning systems that trigger alerts for anomalies, enabling rapid response to quality issues.
  7. Conducting root cause analysis to determine the reason for deviations from normal, whether in the production process, equipment, materials, or elsewhere.
  8. Integrating with manufacturing systems to ensure collaboration between anomaly detection tools and production infrastructure.
  9. Establishing a feedback loop for continuous improvement in anomaly detection models and systems.
  10. Using predictive analytics for forecasting potential quality trends and improving quality management.
  11. Training teams involved in the production and quality control processes so they can respond effectively to anomalies.
  12. Regularly calibrating and validating anomaly detection models so they stay current with production process technology updates and other changes.

System availability

Advanced analytics and ML techniques can be a critical part of efforts to monitor, identify, and address anomalies that could impact IT system availability. By detecting deviations from normal behavior in real time or near real time, proactive measures can be taken to minimize system downtime and optimize performance.

The process for using AIOps-enabled anomaly detection to help ensure system availability may include:

  1. Defining key availability metrics such as uptime, response time, error rates, and other indicators of performance.
  2. Collecting relevant data that provides a complete view of system behavior and performance, such as system logs and performance metrics.
  3. Establishing baseline behavior from historical data for normal operating conditions.
  4. Choosing appropriate models based on the data and system architecture.
  5. Implementing real-time monitoring to ensure deviations from normal are acted on in a timely way.
  6. Configuring automated alerts for IT administrators, operations teams, or others responsible for system availability.
  7. Developing incident response playbooks to guide IT teams for structured and efficient responses to incidents.
  8. Integrating automated remediation for automated responses to certain types of anomalies such as restarts, reallocating resources, or triggering failovers to redundant systems.
  9. Conducting root cause analysis to determine underlying issues related to systems availability.
  10. Monitoring and predicting capacity-related issues to ensure system scalability in response to increased loads.
  11. Implementing continuous learning models to ensure evolutions in system behavior are accommodated.
  12. Ensuring integration with existing monitoring tools for an efficient, unified approach to system availability.
  13. Collaborating with ITOps teams to ensure proper training for interpretation of anomaly alerts, issue severity, and recommended responses.

Cost savings

By proactively identifying deviations from normal behavior in systems, processes, or equipment that could lead to failures or downtime, anomaly detection can lead to significant cost savings and operational reliability. When issues are found early, organizations can take preventative measures to mitigate the risk of system or equipment failures, minimize downtime, extend the lifespan of equipment, and ultimately save costs attributed to repairs, replacements, and operational disruptions.

For example, in a manufacturing facility, the production line consists of various machines and equipment. Sudden failures of these machines can lead to unplanned downtime, increased maintenance costs, and a reduction in overall productivity. Here are possible steps for implementing anomaly detection for equipment health monitoring:

  1. Monitoring sensors on critical machinery to collect real-time data on performance measures like temperature, vibration, and pressure.
  2. Establishing a normal operating condition baseline when equipment is in good health.
  3. Continuously monitoring sensor data with anomaly detection algorithms to identify deviations.
  4. Generating alerts for maintenance teams to address potential issues before they escalate to failures.
  5. Following a regular maintenance schedule during downtime to minimize impact on production schedules and avoid costly emergency repairs.
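Steps 2 through 4 above can be sketched as follows, under the assumption that a simple mean plus or minus three standard deviations from healthy operation is an adequate baseline. The sensor names and readings are hypothetical.

```python
from statistics import mean, stdev

def build_baseline(healthy_readings, width=3.0):
    """Compute per-sensor (low, high) limits from readings taken in good health."""
    limits = {}
    for sensor, values in healthy_readings.items():
        mu, sigma = mean(values), stdev(values)
        limits[sensor] = (mu - width * sigma, mu + width * sigma)
    return limits

def sensors_out_of_range(reading, limits):
    """Return the sensors whose current value falls outside the baseline limits."""
    return [
        sensor
        for sensor, value in reading.items()
        if not limits[sensor][0] <= value <= limits[sensor][1]
    ]

# Hypothetical readings collected while the machine was known to be healthy.
healthy = {
    "temperature_c": [70, 71, 69, 70, 72, 68],
    "vibration_mm_s": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
limits = build_baseline(healthy)

# A new reading: temperature is normal, vibration is well above baseline.
alerts = sensors_out_of_range({"temperature_c": 71, "vibration_mm_s": 3.5}, limits)
print(alerts)
```

In practice the returned sensor names would feed an alerting channel for the maintenance team, and the baseline would be rebuilt after servicing or process changes.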

Risk management

Compliance violations and their associated risks are a major concern for many organizations. By using advanced analytics and ML techniques to identify and detect anomalies in compliance-related data in real time, IT teams can help enable timely interventions for compliance issues to mitigate risks associated with fraud, data breaches, and other threats and help ensure adherence to regulatory requirements.

Here are some considerations for getting started with a data-driven approach to anomaly detection for compliance and risk management:

  • Defining compliance and risk metrics such as those associated with financial transactions, employee activities, or data access.
  • Collecting relevant data from transaction logs, employee activities, financial records, etc., to gain insight into compliance-related patterns.
  • Gaining familiarity with regulatory requirements (e.g., HIPAA, GDPR, PCI-DSS, WEEE, FSMA, OSHA, etc.) and industry standards required to achieve and maintain compliance.
  • Establishing a historical baseline of behavior related to compliance and risk metrics under normal operating conditions.
  • Choosing anomaly detection models and algorithms appropriate for compliance and risk scenarios.
  • Implementing real-time monitoring and alerting to continuously analyze compliance and risk-related data, trigger notifications for compliance officers, risk managers, and other relevant stakeholders, and enable timely intervention for potential violations.
  • Investigating anomalies to understand root cause and implications that need attention.
  • Detecting and preventing fraud by identifying unusual patterns in financial transactions, user access, or other areas.
  • Monitoring data access and usage patterns that may indicate unusual or unauthorized access or data breaches, crucial for data protection regulations and safeguarding protected or sensitive information.
  • Maintaining documentation of results, investigations, and actions taken and generating reports for regulatory authorities, auditors, and other stakeholders.
  • Implementing mechanisms for periodic audits and ongoing model adaptation to ensure the anomaly detection system is relevant and effective.
  • Integrating with governance, risk, and compliance (GRC) systems to ensure that insights gained are incorporated in overall compliance and risk management workflows.
