#### Analyzing thermal cycles for Remaining Useful Lifetime predictions

Image by Author using this tool under CreativeML Open RAIL-M license.

#### Introduction

In today’s data-driven world, businesses are increasingly turning to technology to optimize their operations and reduce downtime. Be it power supplies, wind turbines, transistors, engines — sensors collect data from various components during all stages of a product’s life cycle: from development via manufacturing to operation, companies monitor their products digitally.

Hence, predictive maintenance, condition-based maintenance and condition monitoring are techniques that have gained widespread popularity in recent years. By analyzing sensor data such as temperature, vibration or pressure, businesses aim at predicting likely failures of equipment and machines in order to schedule maintenance accordingly.

Condition monitoring is crucial for keeping track of the wear and tear of your system’s components, thus enabling you to minimize unplanned downtime, maximize availability and operating hours, reduce maintenance costs, plan for better maintenance schedules, effectively manage your spare parts, keep your customers satisfied and happy — just to name a few of the advantages of condition-based and predictive maintenance.

One can differentiate between three general types of maintenance:

- Reactive Maintenance: maintenance happens only when needed after a failure, meaning it involves unscheduled downtime and repair costs.
- Preventive Maintenance: maintenance happens at regular intervals. This comes at the risk of performing too many maintenance operations on fully functioning equipment.
- Predictive Maintenance: solves these issues because it relies on data and condition monitoring to reliably predict when a failure will occur for a given component. In this way, one can effectively schedule downtimes for inspection or maintenance and prepare the resources in a smart manner.

In this post, I want to explore techniques for condition monitoring in semiconductor scenarios, inspired by Ref. 1¹. As the method of Rainflow counting can be extended beyond semiconductor applications, the results presented here can be adapted to a plethora of business cases.

So, whether you are a maintenance manager looking to improve your organization’s maintenance program or a business owner interested in reducing downtime and increasing efficiency, this blog post is for you.

*Semiconductors at work — stress profiles*

Semiconductors run our modern world: you can find them in power modules for wind turbines, photovoltaic systems, electric vehicles and many more. Therefore, it is critical to monitor the wear and the general condition of the components in real time. In semiconductors, thermomechanical fatigue is one of the root causes of transistor failures in power modules.

Since the different materials in a semiconductor have different thermal expansion coefficients, temperature swings lead to mechanical stress. When transistors undergo cyclic loading, the associated thermomechanical stress causes fatigue of the materials within the transistor, leading to degradation and, eventually, failure. This can ultimately result in the collapse of the entire system you want to operate. It is therefore of utmost importance to have a reliable estimate of the Remaining Useful Lifetime of the components given the stress loads the systems and parts endure during operation.

Assuming an ideal cyclic loading, the time series of the semiconductor temperature would be sinusoidal, cf. Fig. 1.

Fig. 1: Sinusoidal temperature time series. Plot generated by Pia Baronetzky and the Author.

In this scenario, all stress cycles would have identical temperature amplitudes and identical cycle durations. Counting stress cycles and quantifying the damage they cause to the material would be trivial.

In reality, a plausible temperature time series looks like the following, cf. Fig. 2:

Fig. 2: More realistic temperature time series. Plot generated by Pia Baronetzky and the Author.

User behavior can hardly be fully simulated and may deviate from lab settings. Furthermore, environmental factors cannot easily be modeled or predicted. Hence, individual cycles vary in duration and amplitude. A single large-amplitude cycle can last for minutes, hours or even days, while several small-amplitude loading cycles start and stop before the large-amplitude cycle ends.

In order to correctly count all cycles and quantify the damage they caused in a realistic scenario, one has to use Rainflow counting.

#### Rainflow counting

Rainflow counting is a standard procedure in fatigue analysis. It was developed by T. Endo and M. Matsuishi in 1968 and is included, among other cycle-counting methods, in *Standard Practices for Cycle Counting in Fatigue Analysis²*.

When performing a Rainflow analysis, you evaluate not only the current state of the system but also the entire time series history of a given observable. This is what makes Rainflow counting powerful and reliable for condition monitoring.

In our scenario, we analyze a temperature time series.

Initially, one extracts the extrema of the time series. Two consecutive extrema (minimum following maximum or vice-versa) constitute a half-cycle and the temperature difference between the two extrema is called the cycle amplitude or stress range.
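As a minimal sketch of this first step, the snippet below extracts the turning points of a short, made-up temperature series with NumPy and computes the temperature difference of each half-cycle. The function name `turning_points` and the sample values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def turning_points(series):
    """Return the local extrema of a 1-D series, including both endpoints."""
    x = np.asarray(series, dtype=float)
    if x.size == 0:
        return x
    # Drop consecutive duplicates so that plateaus do not hide a reversal.
    keep = np.concatenate(([True], np.diff(x) != 0))
    x = x[keep]
    if x.size < 3:
        return x
    slopes = np.diff(x)
    # An interior point is an extremum when the slope changes sign there.
    interior = np.where(slopes[:-1] * slopes[1:] < 0)[0] + 1
    idx = np.concatenate(([0], interior, [x.size - 1]))
    return x[idx]


# Hypothetical temperature readings in degrees Celsius.
temps = [40, 55, 70, 65, 60, 80, 80, 45, 50, 30]
tp = turning_points(temps)

# Each pair of consecutive extrema forms one half-cycle.
half_cycle_ranges = np.abs(np.diff(tp))

print(tp.tolist())                 # [40.0, 70.0, 60.0, 80.0, 45.0, 50.0, 30.0]
print(half_cycle_ranges.tolist())  # [30.0, 10.0, 20.0, 35.0, 5.0, 20.0]
```

Real sensor data is usually noisy, so in practice one would likely smooth the signal or ignore very small reversals before this step; the sketch leaves that out.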

Consider an upward half-cycle starting with a minimum and ending with a maximum (the opposite case can be treated analogously). The minimum of the half-cycle is called the start value and the maximum is called the stop value.

The upward half-cycle is closed to a full cycle the next time a downward half-cycle falls to or below the start value of the initial upward cycle. Analogously, a downward half-cycle is closed to a full cycle the next time an upward half-cycle rises to or above the start value of the initial downward cycle.

Once a cycle is closed, it is removed from the time series and contributes 1 Rainflow cycle count with the corresponding cycle amplitude. After performing a full Rainflow counting on a time series, there might be overhanging half-cycles that could not be closed which will contribute 0.5 Rainflow cycle counts with their corresponding cycle amplitudes.
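The counting rules can be sketched in a few lines of Python. Note that the snippet below does not transcribe the wording above literally: it uses the three-point formulation that is commonly used to implement Rainflow counting along the lines of the ASTM standard², where a range is counted as soon as the following range is at least as large. The function names `rainflow_cycles` and `count_by_range` are hypothetical, and the input is a short list of reversals (extrema) that is often used to illustrate the procedure.

```python
from collections import defaultdict


def rainflow_cycles(reversals):
    """Yield (start, stop, count) using the three-point Rainflow rule."""
    stack = []
    for point in reversals:
        stack.append(point)
        while len(stack) >= 3:
            x_range = abs(stack[-1] - stack[-2])  # most recent range
            y_range = abs(stack[-2] - stack[-3])  # previous range
            if x_range < y_range:
                break  # previous range is not closed yet
            if len(stack) == 3:
                # The previous range contains the first reversal of the
                # history, so it only counts as a half-cycle.
                yield stack[0], stack[1], 0.5
                stack.pop(0)
            else:
                # The previous range is closed: count one full cycle and
                # remove its two reversals from the history.
                yield stack[-3], stack[-2], 1.0
                last = stack.pop()
                stack.pop()
                stack.pop()
                stack.append(last)
    # Whatever remains could not be closed: overhanging half-cycles.
    for start, stop in zip(stack, stack[1:]):
        yield start, stop, 0.5


def count_by_range(reversals):
    """Aggregate Rainflow counts per cycle range |stop - start|."""
    counts = defaultdict(float)
    for start, stop, count in rainflow_cycles(reversals):
        counts[abs(stop - start)] += count
    return dict(sorted(counts.items()))


reversals = [-2, 1, -3, 5, -1, 3, -4, 4, -2]
print(count_by_range(reversals))
# {3: 0.5, 4: 1.5, 6: 0.5, 8: 1.0, 9: 0.5}
```

For production use, an established and tested implementation is probably preferable to a hand-written one; the sketch is only meant to make the bookkeeping of full and half cycles explicit.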

In this way, one arrives at a Rainflow cycle distribution (a_i, n_i) with a_i the cycle amplitudes and n_i the Rainflow cycle counts.

It is common practice to perform a binning of the cycle amplitudes such that one reduces complexity and enables comparison of Rainflow analyses of different machines but with the same observables (temperature in our case).

For simplicity, we denote the resulting Rainflow cycle distribution as (a_i, n_i) with a_i the binned cycle amplitudes and n_i the corresponding Rainflow cycle counts.
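A minimal sketch of such a binning, assuming NumPy and a made-up Rainflow cycle distribution, could look as follows. The amplitudes, counts and bin edges are illustrative values only; a weighted histogram sums the cycle counts n_i that fall into each amplitude bin.

```python
import numpy as np

# Hypothetical unbinned Rainflow result: cycle amplitudes (in kelvin)
# and their Rainflow cycle counts (half-cycles contribute 0.5).
amplitudes = np.array([5.0, 12.0, 18.0, 33.0, 47.0, 52.0])
counts = np.array([120.0, 40.5, 10.0, 3.0, 1.0, 0.5])

# Assumed bin edges: three amplitude bins of 20 K width.
bin_edges = np.array([0.0, 20.0, 40.0, 60.0])

# Summing the counts per bin is a histogram with the counts as weights.
binned_counts, _ = np.histogram(amplitudes, bins=bin_edges, weights=counts)

# Represent each bin by its center amplitude.
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

for a_i, n_i in zip(bin_centers, binned_counts):
    print(f"a_i = {a_i:5.1f} K, n_i = {n_i:6.1f}")
```

Using the same bin edges for every machine is what makes the resulting distributions directly comparable.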

A convenient way of visualizing the Rainflow analysis is by plotting a Rainflow matrix of the underlying time series, cf. Fig. 3:

Fig. 3: Rainflow matrix for temperature time series depicted in Fig. 2. Rainflow cycles on the diagonal are omitted, see text below. Plot generated by Pia Baronetzky and the Author.

The vertical axis shows the start value of a Rainflow cycle while one can see the stop value on the horizontal axis.

If a Rainflow cycle starts and stops in the same temperature bin or an adjacent one, it will end up on the diagonal of the Rainflow matrix or on the first sub-diagonal, respectively.

Since Rainflow cycles on the diagonal contribute to the thermomechanical fatigue only negligibly, we omit them in Fig. 3.
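To illustrate how such a matrix can be assembled, here is a small sketch based on NumPy's `histogram2d`. The list of cycles, the temperature bins and the variable names are made up for illustration and do not reproduce the data behind Fig. 3. Rows correspond to the bin of the start value and columns to the bin of the stop value.

```python
import numpy as np

# Hypothetical Rainflow cycles as (start, stop, count), temperatures in °C.
cycles = [
    (42, 48, 1.0),    # starts and stops in the same bin -> diagonal
    (44, 52, 1.0),    # same bin -> diagonal
    (58, 75, 1.0),    # adjacent bins -> next to the diagonal
    (55, 71, 1.0),    # adjacent bins -> next to the diagonal
    (40, 118, 1.0),   # full load cycle -> far from the diagonal
    (115, 45, 0.5),   # overhanging downward half-cycle
    (82, 84, 1.0),    # same bin -> diagonal
]
starts, stops, counts = map(np.array, zip(*cycles))

# Assumed temperature bins: 40-60, 60-80, 80-100, 100-120 °C.
edges = np.arange(40, 121, 20)

# First argument -> rows (start value), second -> columns (stop value).
matrix, _, _ = np.histogram2d(starts, stops, bins=[edges, edges], weights=counts)

# Omit cycles that start and stop in the same temperature bin.
np.fill_diagonal(matrix, 0.0)

print(matrix)
```

The resulting array can be passed to a heatmap plotting routine of your choice to obtain a color-coded view of the cycle counts.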

A full load cycle starting at a low temperature and stopping at a high one results in cycles that populate the matrix far from the diagonal as can be seen in the upper right corner of the Rainflow matrix.

These full load cycles contribute the most to the deterioration of the material and the reduction of the Remaining Useful Lifetime while cycles from (close to) the diagonal of the Rainflow matrix only cause minor to no harm to the material.

The color coding tells us that most cycles lie on the first sub-diagonal of the Rainflow matrix and only a few corresponding to full load cycles populate the corners far from the diagonal which is a common scenario and reflects the temperature time series shown in Fig. 2.

#### Remaining Useful Lifetime

Now that we can correctly quantify and monitor the stress loads seen by the components of our system, we want to convert this information into a measure of material damage and/or fatigue.

For any given temperature or stress bin, i.e. cycle amplitude a_i, there is a maximum fatigue life N_i denoting the maximum number of Rainflow cycles a material can endure at the given stress level until failure occurs. This information is encoded in the Wöhler curve which needs to be determined either by experiment or by simulation. With limited resources, a Wöhler curve is hard to produce. Hence, we avail ourselves of Machine Learning approaches such as the SGDRegressor from *scikit-learn* as you will see below.

Palmgren-Miner’s rule³ tells us that the total damage a component has suffered is the sum of the relative damages per stress level. The relative damage of a stress level is n_i / N_i, i.e. the Rainflow cycle count n_i at that level divided by the corresponding fatigue life N_i:

$$D = \sum_i \frac{n_i}{N_i}$$

Eqn. 1: Total damage D as described by Palmgren-Miner’s rule.

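When D = 1, the component has accumulated enough damage to fail. Hence, the inverse of the total damage, 1/D, serves as a measure for the Remaining Useful Lifetime: under the linear rule, it estimates how many times the observed load history could be repeated before failure.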
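As a small worked example, the following sketch evaluates the damage sum for a made-up binned Rainflow distribution. The function name `miner_damage`, the cycle counts and the fatigue lives N_i are hypothetical; in a real application, the N_i would come from the Wöhler curve of the component.

```python
import numpy as np


def miner_damage(cycle_counts, fatigue_lives):
    """Total damage D = sum_i n_i / N_i according to Palmgren-Miner's rule."""
    n = np.asarray(cycle_counts, dtype=float)
    big_n = np.asarray(fatigue_lives, dtype=float)
    if n.shape != big_n.shape:
        raise ValueError("cycle_counts and fatigue_lives must have the same shape")
    if np.any(big_n <= 0):
        raise ValueError("fatigue lives must be positive")
    return float(np.sum(n / big_n))


# Hypothetical values for three amplitude bins (small, medium, large).
cycle_counts = [2000, 300, 50]      # n_i: observed Rainflow cycles
fatigue_lives = [1e6, 1e4, 1e3]     # N_i: assumed cycles to failure

damage = miner_damage(cycle_counts, fatigue_lives)
print(f"D   = {damage:.3f}")        # D   = 0.082
print(f"1/D = {1 / damage:.1f}")    # 1/D = 12.2
```

In this made-up case, the 50 large-amplitude cycles cause more damage (0.05) than the 2000 small ones (0.002), which mirrors the observation that full load cycles dominate the deterioration.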

Condition-based Maintenance is already possible with a reliable real-time monitoring of the accumulated stress and a measure for when the next maintenance program needs to be scheduled.

For a prediction based on Machine Learning models fed by real-time monitoring data, we need to go one step further.

*Predictive Maintenance — research collaboration*

Since Palmgren-Miner’s rule assumes linear damage accumulation, it has shortcomings: The model ignores the temporal order and cross-correlations of all occurring stress load cycles. It also assumes that cycles of different stress levels contribute to the total damage with the same weights.

In a research collaboration⁴ between Munich Data Science Institute at Technische Universität München and PROCON IT GmbH, we tackle these shortcomings by making use of Machine Learning techniques such as *scikit-learn*’s SGDRegressor.

Instead of assuming equal weights for all stress levels, we let the model learn the weights of the relative damage contributions to the total damage D (*Eqn. 1*) to properly predict the probability of failure. The results are promising and the procedure can be extended to a plethora of different use cases, systems and input observables.
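To give a rough idea of what learning such weights can look like, here is a toy sketch with *scikit-learn*’s `SGDRegressor`. It is not the setup used in the research collaboration: the data is synthetic, the "true" weights are invented, and the target is a damage-like number rather than a failure probability. Plain Palmgren-Miner corresponds to fixing every weight to 1.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_devices, n_bins = 200, 4

# Hypothetical fatigue lives N_i for four amplitude bins.
fatigue_lives = np.array([1e6, 1e5, 1e4, 1e3])

# Synthetic Rainflow counts n_i per device and bin.
cycle_counts = rng.uniform(0.0, 0.2, size=(n_devices, n_bins)) * fatigue_lives

# Features: relative damage n_i / N_i per bin.
relative_damage = cycle_counts / fatigue_lives

# Invented non-uniform weights used to generate a synthetic target.
true_weights = np.array([0.5, 0.8, 1.2, 2.0])
target = relative_damage @ true_weights + rng.normal(0.0, 0.01, n_devices)

# Standardizing the features helps stochastic gradient descent converge.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=5000, tol=1e-6, random_state=0),
)
model.fit(relative_damage, target)

predicted = model.predict(relative_damage[:5])
learned = model.named_steps["sgdregressor"].coef_  # weights in scaled units
print(predicted)
print(learned)
```

Because of the scaling step, the learned coefficients refer to standardized features and are not directly comparable to the invented weights; the point of the sketch is only that the contribution of each stress level is fitted from data instead of being fixed in advance.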

For further information, have a look at the report⁵.

In this way, we extend a Condition-based Maintenance program to a Predictive Maintenance one.

#### Conclusion

Businesses aim at leveraging their data to move their maintenance strategy forward from reactive or preventive maintenance towards condition-based or even predictive maintenance.

Sensor data provide insights about the wear and tear of your system’s components. In order to reliably quantify the damage your components have accumulated and to perform condition-monitoring, one can use Rainflow counting.

The inverse of the total damage given by Miner’s rule can serve as a Remaining Useful Lifetime estimation.

To move from condition-based towards predictive maintenance, one can make use of Machine Learning techniques as demonstrated in a research collaboration between Munich Data Science Institute at TU München and PROCON IT GmbH.

The author expresses his sincere gratitude towards all people involved in this work.

#### Bibliography

[1] M. Andresen, G. Buticchi, M. Liserre, Study of reliability-efficiency tradeoff of active thermal control for power electronic systems, Microelectronics Reliability, Volume 58, March 2016, Pages 119–125

[2] Standard Practices for Cycle Counting in Fatigue Analysis, ASTM E1049-85

[3] M. A. Miner, Cumulative Damage in Fatigue, J. Appl. Mech., Sep 1945, 12(3), A159–A164

[4] *Remaining Lifetime Estimation in Semiconductor Scenarios*, TUM-DI-LAB 2023

[5] S. Bayer, O. Neumann, D. Raj, Y. Savva, Remaining Lifetime Estimation in Semiconductor Scenarios, TUM-DI-LAB 2023

Condition-based Maintenance: Rainflow Counting was originally published in Towards Data Science on Medium.