Invented by Rahul Kumar K Sevakula, Parag Sanjay Mhatre, International Business Machines Corp

The market for machine learning for failure event detection and prediction is rapidly growing as industries recognize the importance of preventing and mitigating failures in their operations. Machine learning algorithms have proven to be highly effective in analyzing large amounts of data and identifying patterns that can indicate potential failures or predict future events.

One of the key industries that can benefit from machine learning for failure event detection and prediction is manufacturing. In manufacturing plants, failures can lead to costly downtime, production delays, and even safety hazards. By implementing machine learning algorithms, manufacturers can monitor various parameters and variables in real time to detect anomalies that may indicate an impending failure. This allows them to take proactive measures to prevent the failure from occurring, such as scheduling maintenance or replacing faulty components.

Another industry that can greatly benefit from machine learning for failure event detection and prediction is the energy sector. Power plants, for example, rely on complex systems that can be prone to failures. By analyzing historical data and continuously monitoring various factors, machine learning algorithms can identify patterns that may lead to failures, such as equipment degradation or abnormal operating conditions. This enables energy companies to take corrective actions before failures occur, ensuring a reliable and uninterrupted power supply.

The transportation industry is also embracing machine learning for failure event detection and prediction. In the aviation sector, for instance, aircraft maintenance is crucial to ensure passenger safety. Machine learning algorithms can analyze data from various sensors and systems on the aircraft to identify potential failures or deviations from normal operating conditions.
This allows airlines to proactively address maintenance issues, reducing the risk of in-flight failures and improving overall safety.

The healthcare sector is another area where machine learning for failure event detection and prediction is gaining traction. Hospitals and healthcare facilities rely on numerous medical devices and equipment to provide quality care to patients. Machine learning algorithms can analyze data from these devices to detect anomalies or patterns that may indicate potential failures. By identifying these issues early on, healthcare providers can take immediate action to prevent equipment failures that could compromise patient safety or disrupt critical medical procedures.

The market for machine learning for failure event detection and prediction is expected to grow significantly in the coming years. According to a report by MarketsandMarkets, the global market for machine learning in manufacturing is projected to reach $4.8 billion by 2026, with a compound annual growth rate of 49.2%. Similarly, the market for machine learning in the energy sector is expected to reach $2.2 billion by 2026. As the demand for more efficient and reliable operations continues to increase, industries are increasingly turning to machine learning for failure event detection and prediction. By leveraging the power of data and advanced algorithms, businesses can minimize downtime, reduce costs, and improve overall operational efficiency. As the technology continues to advance, we can expect to see even more innovative applications of machine learning in failure event detection and prediction across various industries.

The International Business Machines Corp. invention works as follows:

Techniques are provided for failure prediction. A plurality of event indications is received, where each event indication corresponds to a failure of a computing system. Machine learning (ML) models are trained using combinations of the event indications, and each ML model is then evaluated to determine its quality score. An ensemble of ML models is created by identifying the ML models with the highest quality scores. The ensemble of ML models is used to process the data logs of the computing system, and if any of the ML models in the ensemble predicts a failure, an alert is generated.

Background for Machine learning for failure event detection and prediction

The present disclosure relates to failure events and, more specifically, to the use of machine learning to predict failures.

Companies across industries are concerned about failures, downtime, and latency in their computing systems. Monitoring teams, consisting of large numbers of human experts, are often maintained to watch alerts from the system and decide which are important enough to act on. The cost of maintaining such a team is significant, and identifying an event also takes time: the monitoring team usually waits for a period of time to confirm an event, and this delay in resolution allows operations to continue at sub-optimal levels. In most cases, the team responsible for resolution must restart one or more components, or take them offline for a short period, until the problem is resolved. This downtime can cause large losses and customer dissatisfaction, particularly during peak times. Moreover, the monitoring and operations teams often do not know the root cause of these events and are therefore unable to predict or resolve them in advance, while the logs produced by servers and other components are too large and complex to evaluate or analyze manually.

A method is described in one embodiment of this disclosure. The method includes receiving a plurality of event indications, where each event indication corresponds to a respective failure in a computing system, and training a plurality of machine learning (ML) models based on combinations of the event indications. The method also includes evaluating the ML models to generate a respective quality score for each model, and defining an ensemble of ML models from those whose quality scores exceed a predefined threshold. The ensemble of ML models is then used to process current data logs generated by the computing system, and the method generates an alert upon determining that any ML model of the ensemble has predicted a failure based on the current data logs.

A computer-readable medium is described in a second embodiment. The computer-readable medium includes computer-readable code that, when executed by one or more computer processors, performs an operation. The operation includes receiving a plurality of event indications, where each event indication corresponds to a respective failure in a computing system, and training a plurality of machine learning (ML) models based on combinations of the event indications. The operation also includes evaluating the ML models to generate a respective quality score for each model, and defining an ensemble of ML models from those whose quality scores exceed a predefined threshold. The ensemble of ML models is then used to process current data logs, and the operation generates an alert upon determining that any ML model of the ensemble has predicted a failure based on the current data logs.

According to a third embodiment of the present disclosure, a system is provided. The system includes one or more computer processors, and a memory containing a program which when executed by the one or more computer processors performs an operation. The operation includes receiving a plurality of event indications, wherein each respective event indication of the plurality of event indications corresponds to a respective failure in a computing system, and training a plurality of machine learning (ML) models based on combinations of event indications in the plurality of event indications. The operation further includes evaluating the plurality of ML models to generate a respective quality score for each respective ML model of the plurality of ML models, and defining an ensemble of ML models from the plurality of ML models, based on identifying ML models of the plurality of ML models with corresponding quality scores exceeding a predefined threshold. Additionally, the operation includes processing current data logs from the computing system using the ensemble of ML models. Upon determining that any ML model of the ensemble of ML models predicted a failure based on the current data logs, the operation includes generating an alert.
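The claimed flow (train per-combination models, score them, keep the ones above a threshold, and alert if any ensemble member fires) can be sketched in outline. The stand-in "models", their scores, the threshold, and the log tokens below are illustrative placeholders, not the patent's actual implementation:

```python
def build_ensemble(models_with_scores, threshold):
    """Keep only models whose quality score exceeds the threshold."""
    return [m for m, score in models_with_scores if score > threshold]

def ensemble_predicts_failure(ensemble, current_logs):
    """Generate an alert if ANY model in the ensemble predicts a failure."""
    return any(model(current_logs) for model in ensemble)

# Hypothetical stand-ins for trained per-combination classifiers.
model_a = lambda logs: "disk_errors" in logs        # trained on event A
model_b = lambda logs: "timeout_spike" in logs      # trained on event B
model_ab = lambda logs: "oom_kill" in logs          # trained on events A and B

scored = [(model_a, 0.91), (model_b, 0.55), (model_ab, 0.87)]
ensemble = build_ensemble(scored, threshold=0.8)    # model_b is dropped
alert = ensemble_predicts_failure(ensemble, ["disk_errors", "cpu_ok"])
```

The "any model fires" rule matters: because each classifier covers a different combination of failure events, a single positive vote is treated as a prediction rather than requiring consensus.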



FIG. 2 depicts a block diagram of a monitoring device that uses machine learning to forecast failure events within a computing system, according to one embodiment.






Embodiments of the present disclosure use machine learning to identify and predict failure events more accurately and in a timely manner. In embodiments, a machine learning model identifies useful patterns in time-series data for event identification and forecasting. In many cases, machine learning is difficult because failure events are rare (e.g., occurring a few times per month). Additionally, failure events of the same type (or even similar types) can be caused by different root causes (e.g., different patterns in the data logs). In some embodiments, therefore, highly constrained models are trained and deployed to identify failure events.

In one embodiment, each combination of failure events is classified by a different long short-term memory (LSTM) classifier. To train the classifiers, in one embodiment, the timestamps of the events in the combination under consideration are labeled "1" and the remaining timestamps are labeled "0". The constrained classifier should then learn a pattern that is shared by all of those events. After the models are trained, in one embodiment, the most accurate classifiers are retained and the less accurate ones are removed. In some embodiments, the best models can be identified by, for instance, their training performance measured as the geometric mean of sensitivity and specificity, and/or by how comprehensively they cover events that other classifiers miss.
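The geometric mean of sensitivity and specificity mentioned above can be computed directly from a classifier's predictions. This is a minimal sketch; the toy label vectors are invented for illustration:

```python
import math

def quality_score(y_true, y_pred):
    """Geometric mean of sensitivity (recall on failure timestamps,
    label 1) and specificity (recall on non-failure timestamps, label 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# A classifier that catches both failures but raises one false alarm:
# sensitivity = 1.0, specificity = 0.75, score = sqrt(0.75) ~= 0.866.
score = quality_score([1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0])
```

Unlike plain accuracy, this metric does not reward a degenerate classifier that always predicts "no failure" on rare-event data, which is presumably why a balanced measure is used here.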

In one embodiment, a group of LSTM classifiers is used to predict failures in advance and/or identify failures at their onset. In one embodiment, a failure event is said to have been predicted or identified when any of the classifiers predicts or identifies its occurrence. In one embodiment, each combination of failure events is assigned a different classifier; the number of trained models is therefore equal to the size of the power set (i.e., O(2^K), where K is the number of events). The computational cost of this procedure can be high as the number of failure events increases, so various techniques are employed to train and identify the more useful models while controlling complexity. In one embodiment, top-down pruning is used, as discussed in greater detail below. In another embodiment, a greedy approach is used to identify similar events, also discussed in greater detail below.
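The power-set growth described above is easy to make concrete: for K events there are 2^K - 1 non-empty combinations, each of which would get its own classifier before any pruning. A small enumeration sketch (event names are placeholders):

```python
from itertools import combinations

def event_combinations(events):
    """Enumerate every non-empty combination of failure events:
    the power set minus the empty set, i.e. 2**K - 1 combinations."""
    combos = []
    for r in range(1, len(events) + 1):
        combos.extend(combinations(events, r))
    return combos

# For K = 3 events there are 2**3 - 1 = 7 combinations, hence 7 models
# before pruning; at K = 10 this is already 1023, motivating the
# top-down pruning and greedy strategies mentioned above.
combos = event_combinations(["A", "B", "C"])
```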

When failures are predicted ahead of time, corrective actions can be carried out at a time and in a manner that has the least impact on the business and its customers. Customer satisfaction improves, and support staffing costs can be reduced. Costs can also be reduced through process efficiency, since staff can be better prepared for failure events by predicting them.

FIG. 1 shows a workflow 100, according to one embodiment, for training machine learning models and using them to predict failures from log data. In the illustrated embodiment, Data Logs 105 are used in conjunction with Event Data 110 to train a collection of ML Models 115A-N. The Evaluator 120 selects which models to deploy and creates an Ensemble 122 of ML Models 125. This Ensemble 122, as shown, receives Real-time Logs 130 and generates Predictions 135. In one embodiment, the Event Data 110 indicates failures in a computing system and includes their timestamps. In one embodiment, the Data Logs 105 include time-series data collected over a period of time, each entry having d dimensions. The Data Logs 105 may have any number of dimensions (e.g., any amount or type of data logged and taken into consideration).

Embodiments of the current disclosure can identify useful patterns within the Event Data 110 and Data Logs 105 that are helpful for predicting future events. Two events of different failure types may share the same root cause; likewise, two failures of the same type may be caused by different root causes. In one embodiment, a heavily constrained LSTM is therefore trained for every possible combination of events. As an example, suppose the Event Data 110 contains indications of events A, B, and C. The system could train a first ML Model 115 based on the combination of all three events, a second based only on events A and B, a third based only on B and C, a fourth based only on A and C, and so on for the individual events. The details are discussed in the following paragraphs.

In one embodiment, the d-dimensional Data Logs 105 are used to train all of the ML Models 115, but the training labels change depending on which combination of events is being used. Data logs can be labeled according to whether an event in the combination was occurring (or ongoing) at the time of the log. For instance, all logs between the start time and end time of an event can be given a label of "1", and all other data logs a label of "0". If an ML Model 115 is trained solely on event A, then the Data Logs 105 that correspond to the time of event A are labeled "1" and all other Data Logs 105 are labeled "0". The ML Model 115 is then trained with the data logs as input and the corresponding labels as output. If an ML Model 115 is trained on the combination of events A and B, then all data logs corresponding to the time when event A occurred are labeled "1", all data logs for event B are also labeled "1", and the data logs recorded when neither A nor B occurred are labeled "0".
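The per-combination labeling rule above can be sketched as a small function. The event windows and timestamps here are invented toy values, not data from the patent:

```python
def label_logs(log_timestamps, event_windows, combo):
    """Label each log timestamp 1 if it falls inside the start/end
    window of any event in the chosen combination, else 0."""
    labels = []
    for ts in log_timestamps:
        in_event = any(start <= ts <= end
                       for name, (start, end) in event_windows.items()
                       if name in combo)
        labels.append(1 if in_event else 0)
    return labels

# Hypothetical windows: event A spans t=2..3, event B spans t=6..6.
windows = {"A": (2, 3), "B": (6, 6)}
labels_a = label_logs(range(8), windows, combo={"A"})        # A only
labels_ab = label_logs(range(8), windows, combo={"A", "B"})  # A and B
```

Note how the same d-dimensional logs yield different label vectors per combination: the model for {A} sees only A's window as positive, while the model for {A, B} sees both windows as positive.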

In some embodiments, the ML Models 115 are trained using sequences. The input data for each ML Model 115 is a three-dimensional matrix, with the dimensions corresponding to the number of samples, the length of each sample, and the number of features at each timestamp. To train an ML Model 115, a sequence of feature values observed at consecutive timestamps is input as one sample of training data, while the class label of the next (i.e., the following) timestamp is used as the target class label. That is, the ML Models 115 are trained using the Data Logs 105 as input, with the labels of the Data Logs 105 that follow as output. In this way, the ML Models 115 are trained to predict future events, instead of only identifying them as they happen.
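The sliding-window construction described above can be sketched as follows; the toy feature rows and labels are invented for illustration, and a real pipeline would feed the resulting (samples, sequence length, features) matrix to an LSTM:

```python
def make_sequences(logs, labels, seq_len):
    """Build training pairs: each sample is seq_len consecutive feature
    rows, and the target is the label of the NEXT timestamp, so the
    model learns to predict failures rather than just detect them."""
    X, y = [], []
    for i in range(len(logs) - seq_len):
        X.append(logs[i:i + seq_len])   # seq_len consecutive d-dim rows
        y.append(labels[i + seq_len])   # label of the following timestamp
    return X, y

# Toy d=2 feature rows for 6 timestamps, with a failure labeled at t=4.
logs = [[0.1, 5], [0.2, 5], [0.9, 7], [1.4, 9], [2.0, 12], [0.3, 6]]
labels = [0, 0, 0, 0, 1, 0]
X, y = make_sequences(logs, labels, seq_len=3)
```

Stacked together, X has shape (number of samples, seq_len, d), matching the three-dimensional input matrix described above; the sample covering timestamps 1-3 is paired with the failure label at timestamp 4.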
