### Change Point Detection — A Bayesian Approach

#### Identifying a change point in time series analysis could provide you much more information about the series than you previously though

Change point analysis has become a field of interest in many areas of study. This kind of analysis refers to the problem of finding abrupt or sudden changes in a given time series. According to the definition made by Iwata *et al.* (2018), change point analysis is “the method for identifying change points, which are the moments when the probability distribution of a time series changes.” According to Van den Burg and Williams (2020), “moments of abrupt change in the behavior of a time series are often the cause of alarm, as they can suggest a significant change in the data-generating process.”

As demonstrated by Aminikhanghahi and Cook (2017) and Iwata et al. (2018), the increasing attention paid to this kind of analysis is due to recent technological developments. These developments generate large amounts of data that often need to be closely monitored, such as robotics, medicine, meteorology, voice and image recognition, etc. There is a wide variety of models and methodologies to deal with such a problems. But since it is not the goal of this article to make a descriptive analysis of these models, I would like to suggest the work of Van den Burg and Williams (2020) to you to get a better overview of those methodologies. There you can find the differences between online and offline change point detection; univariate and multivariate approaches, which could be parametric or nonparametric; and supervised or unsupervised models.

Photo by Tech Daily — Unsplash

From the beginning, a time series, according to Ehlers (2007), “is a collection of observations made sequentially over time,” whose main characteristic is the dependence that a given observation has on the neighboring observations. A time series can be continuous or discrete, and in the first case, according to Ehlers (2007), having the set *T={t∶ t1< t < t2},* the series is denoted by *{X(t):t ∈ T}*. Therefore, taking an observation window of the time series T with n observations, we have a time series denoted by *{Xm, X(m+1),…, Xn }*. According to Aminikhanghahi and Cook (2017, p. 3), “the detection of the change point can be defined as a problem of hypothesis testing, between two alternatives”, namely “the null hypothesis *H0*: ‘no change occurs’ ‘ and the alternative hypothesis* H1*: ‘a change occurs’”.

### From the Beginning

So if you like to code, it’s time to run Jupyter Notebook and start making some simulations and analysis as we take a “random walk” throughout this approach. Let’s import the following packages:

import numpy as np

from numpy.random import seed

from numpy.random import randn

import random

import datetime

import matplotlib.pyplot as plt

import seaborn as sns

import math

import decimal

from scipy import stats

np.seterr(divide=’raise’) #Make sure you set this

We will start with a time series *y(t)* consisting of two-time series, which are *y(t1)* with mean *μ=1* plus some noise and *y(t2)* with mean* μ=2 *plus some noise, both with 30 observations. As you might think, the proposed time series will have a considerable change point.

yt1 = u1 + 0.1*randn(30)

yt2 = u2 + 0.1*randn(30)

y = np.concatenate((yt1, yt2), axis=0)

plt.figure(figsize=(16, 10))

plt.plot(y)A time series with two differing averages — Image by author

If you know Bayesian statistics, you know that the construction of any model is fundamentally composed of 3 distributions. The prior distribution *h(*θ), reflects our prior knowledge about the problem. The likelihood *f(x|*θ) reflects the data obtained and has to be incorporated into the prior distribution. And that will result in a posterior distribution *h(*θ|x), which we are interested in. That’s how we use Bayes’ theorem.

### Fuzzy Clustering

At this point, the first (and most important?) problem we have, is to obtain a prior distribution from the time series we built — that’s the first piece of the model. **The problem is: We do not have one!** And if you are dealing with a time series, once we get a prior distribution, most of the tasks are already done.

D’Angelo *et al. *(2011) took an interesting approach to get around this problem. He used a Kohonen Network to clusterize the time series. Unlike hard clustering, the Kohonen Network is a fuzzy clustering algorithm, which means that any given point X is associated with group A with probability* p*. This association is given by the function *fA(X) *which associates with each point in A a real number in the interval *[0, 1]*, representing the grade of membership of X in A.

For a complete and better explanation of the Kohonen Network, you can refer to Kohonen (1990) and Haykin (2007). Using Python, I built such a network using two functions:

def Center_Kohonen(y, X=0, K=2, alfa=0.8, C=500):

# This kohonen network implementation finds two centers in a time series

#Params:

#Y = Time Series

#M = Number of input patterns

#N = Dimension of the input pattern (for our case it will be 1)

#K = number of neurons, for the proposed problem (number of centers)

M = y.shape[0]

N = 1

f = 0

#Initializin the neurons’ weights

I = y.argsort(axis=0) # Sorted Indexes

Y = np.sort(y) # Sortes Points in the Time Series

c1 = Y[0:7] # Beginning of the series

c2 = Y[M-7:M] # End of the series

#Adjusting the values

while np.std(c2) > 0.1: #As long as the standard deviation is greater than 0.1, replace the highest value with the mean

ma = c2.argmax()

c2[ma] = np.mean(c2);

y[I[ma+60-7]] = np.mean(c2);

while np.std(c1) > 0.1: #As long as the standard deviation is greater than 0.1, replace the lowest value with the mean

mi = c1.argmin()

c1[mi] = np.mean(c1)

y[I[mi]] = np.mean(c1);

#Definition of weight values

W = [np.mean(c1), np.mean(c2)]

#Finding centers from Kohonen’s network training

for l in range(1, C+1):

alfa=alfa*(1-(C-(C-l))/C)

for i in range(0, M): #For each value in the time series

a=999

for j in range(0, K): #Where K is the number of cluster

if np.linalg.norm([(y[i]-W[j])], 2) < a:

a = np.linalg.norm([(y[i]-W[j])], 2)

f = j

W[f] = W[f]+alfa*(y[i]-W[f])

return c1, c2, I, Y, W, a, alfadef Fuzzy_Set(y, W):

# This program finds membership values for a time series from previously specified centers.

# Where y is the time series and c a vector with the found centers

center_1 = []

center_2 = []

n = y.shape[0]

l = 2

# Finding the membership association for each point in the time series

for i in range(0, l):

for t in range(0, n):

sum=0;

for k in range(0, l):

sum = sum+(y[t]-W[k])*(y[t]-W[k])

if i == 0:

center_1.append(np.round(1-((y[t]-W[i])*(y[t]-W[i]))/sum, 3))

else:

center_2.append(np.round(1-((y[t]-W[i])*(y[t]-W[i]))/sum, 3))

return center_1, center_2

If you call both functions with our time series sequentially, you may plot the results and get something like this:

c1, c2, I, Y, W, a, alfa = Center_Kohonen(y, X=0, K=2, alfa=0.8, C=500)

center_1, center_2 = Fuzzy_Set(y, W)

plt.figure(figsize=(16, 10))

plt.plot(center_1, ‘b’) # plotting t, a separately

plt.plot(center_2, ‘r’) # plotting t, b separately

plt.show()The fuzzy clusterization for the time series y(t) using Kohonen Network — Image by author

And that’s pretty interesting! With the Kohonen Network, we are able to split the time series *y(t)*. This chart shows we have two clusters since we set *K=2. *In the *xlabel, *we have each point of the time series, while the *ylabel *shows the probability that a given point is associated with one of the two clusters. As you might see, the blue line shows us that until we reach point 30 all the points are more likely (around ~99%) to belong to the first group or the set *μ1(t)*. The red line shows the opposite since it represents the associations to the second group, namely the set *μ2(t)*. That’s reasonable since we built a time series with two different averages and illustratively this chart is related to the first.

Although interesting we did not really find a change point until now (we have some clues) and **there’s nothing Bayesian here.**

By the way, most of the time, the distinction between the points of a time series will not be that simple. For instance, if we built the time series *y(t)* with *y(t1)* with mean *μ=1* plus some noise and *y(t2)* with mean* μ=1.3 *(instead of 2)* *plus some noise, would that split be that good? I will let you try this exercise

### The Metropolis-Hastings Algorithm comes to the Game

If you tried the exercise above, you found yourself in trouble using just the Kohonen Network to find any indication of a change point in an alternative time series. That is because the Kohonen network does not provide us with a change point, but two sets of continuous variables, representing the association of each point to a given cluster.

But recall that the sets *μ1(t) *and* μ2(t) *have its values within the interval [0,1]. This implies saying that *μ1(t)* and *μ2(t) *are approximations of Beta distributions with different parameters (Have you ever heard about Kullback–Leibler?). According to D’Angelo *et al. *(2011), assuming that the change point is denoted by *m*, then for *t≤m*, we will have a *Beta(a,b) *distribution and a *Beta(c,d) *distribution for *t>m*. Given the properties of a Beta distribution, if there’s any change point in the time series, *parameter a *will be greater than *parameter b *in *Beta(a,b) *and *parameter c *will be smaller than *parameter d *in *Beta(c,d).*

**The problem is:** How do you build those two Beta Distributions? The Metropolis-Hastings Algorithm is a form of a Markov Chain Monte Carlo initially proposed by Metropolis *et al.* (1953) and then generalized by Hastings (1970). According to Gelman *et al.* (2003), the goal of any Markov chain simulation “is to create a Markov process whose stationary distribution is specified by *p(θ | x).*” Running the simulation enough allows us to obtain a distribution close enough to the stationary and posterior distribution. The posterior could be resumed in terms of expectations of particular functions of a parameter θ, that is, *∫g(θ)p(θ | x)dθ = E [g(θ) | x].* Such integral is not trivial and that’s why the MCMC methods are good to approximate the posterior distribution.

Metropolis-Hastings’ algorithm uses the idea of rejection, which means it generates a value from an auxiliary distribution and accepts it with a given probability. If you are unfamiliar with MCMC methods, you might be questioning how the algorithm may reject a drawn value. We use the transition rule given by Hastings (1970):

Putting it in order words, we can use the two sets of continuous variables, given by the fuzzy clustering to reject randomly drawn values from a given prior distribution for the change point detection. If you want to learn more about MCMC methods, I suggest Gamerman and Lopes (2018).

Let’s get back to the Jupyter Notebook. The function below is an implementation of the Metropolis-Hastings Algorithm for this problem. Although powerfoul, this algorithm requires some things. The first is to set a prior distribution for every parameter that needs to be found. For the parameter *m*, we are using a uniform distribution between 1 and 60, which means the algorithm randomly selects a change point candidate in the time series. For the parameters *a, b, c* and *d, *I selected weakly information gamma distributions. The function also needs the arguments, which are a set of random variables (*μ1(t) *or *μ2(t)) *and the number of simulations.

def Metropolis_Hastings(center_kohonen, n_sims=1000):

n = len(y)

m = 1 + round((n-1) * np.random.uniform(0, 1))

shape, scale, loc = 10, 0.1, 0

#Lists to save the date for each parameter

a_params = []

b_params = []

c_params = []

d_params = []

m_params = []

#Prior Distributions for the Parameters

a = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

b = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

c = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

d = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

for i in range(0, n_sims):

m1 = 1+round((n-1) * np.random.uniform(0, 1));

a1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

b1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

c1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

d1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

#PARAM A

aux1 = 1

for j in range(0, m):

try:

aux1 = aux1 * (center_kohonen[j] ** (a1-1))

except:

aux1 = aux1

aux2 = 1

for j in range(0, m):

try:

aux2 = aux2 * center_kohonen[j] ** (a-1)

except:

aux2 = aux2

try:

ra = ((math.gamma(a1+b)/math.gamma(a1))**m)*aux1*((((a/a1)**.9)*math.exp(-.1*(a1-a)))**2)/(((math.gamma(a+b)/math.gamma(a))**m)*aux2)

if (min(1, ra) > np.random.uniform(0, 1)):

a=a1

except:

pass

#PARAM B

aux1 = 1

for j in range(0, m):

try:

aux1 = aux1*(1-center_kohonen[j])**(b1-1)

except:

aux1 = aux1

aux2 = 1

for j in range(0, m):

try:

aux2 = aux2*(1-center_kohonen[j])**(b-1)

except:

aux2 = aux2

try:

rb = ((math.gamma(a+b1)/math.gamma(b1))**m)*aux1*((((b/b1)**.9)*math.exp(-.1*(b1-b)))**2)/(((math.gamma(a+b)/math.gamma(b))**m)*aux2)

if (min(1, rb) > np.random.uniform(0, 1)):

b = b1

except:

pass

#PARAM C

aux1 = 1

for j in range(m, n):

try:

aux1=aux1*center_kohonen[j]**(c1-1)

except:

aux1 = aux1

aux2 = 1

for j in range(m, n):

try:

aux2=aux2*center_kohonen[j]**(c-1)

except:

aux2 = aux2

try:

rc = ((math.gamma(c1+d)/math.gamma(c1))**(n-m))*aux1*((((c/c1)**.9)*math.exp(-.1*(c1-c)))**2)/(((math.gamma(c+d)/math.gamma(c))**(n-m))*aux2)

if (min(1, rc) > np.random.uniform(0, 1)):

c = c1

except:

pass

#PARAM D

aux1 = 1

for j in range(m, n):

try:

aux1=aux1*(1-center_kohonen[j])**(d1-1)

except:

aux1 = aux1

aux2 = 1

for j in range(m, n):

try:

aux2=aux2*(1-center_kohonen[j])**(d-1)

except:

aux2 = aux2

try:

rd = ((math.gamma(c+d1)/math.gamma(d1))**(n-m))*aux1*((((d/d1)**.9)*math.exp(-.1*(d1-d)))**2)/(((math.gamma(c+d)/math.gamma(d))**(n-m))*aux2)

if (min(1, rd) > np.random.uniform(0, 1)):

d = d1

except:

pass

#PARAM M

aux1 = 1

for j in range(0, m1):

try:

aux1 = aux1*(center_kohonen[j]**(a-1))*((1-center_kohonen[j])**(b-1))

except:

aux1 = aux1

aux2 = 1;

for j in range(m1, n):

try:

aux2 = aux2*(center_kohonen[j]**(c-1))*((1-center_kohonen[j])**(d-1))

except:

aux2 = aux2

aux3 = 1

for j in range(0, m):

try:

aux3 = aux3*(center_kohonen[j]**(a-1))*((1-center_kohonen[j])**(b-1))

except:

aux3 = aux3

aux4 = 1

for j in range(m, n):

try:

aux4 = aux4*(center_kohonen[j]**(c-1))*((1-center_kohonen[j])**(d-1))

except:

aux4 = aux4

try:

rm = (((math.gamma(a+b)/(math.gamma(a)*math.gamma(b)))**m1)*((math.gamma(c+d)/(math.gamma(c)*math.gamma(d)))**(n-m1))*aux1*aux2)/(((math.gamma(a+b)/(math.gamma(a)*math.gamma(b)))**m)*((math.gamma(c+d)/(math.gamma(c)*math.gamma(d)))**(n-m))*aux3*aux4)

if (min(1, rm) > np.random.uniform(0, 1)):

m = m1

except:

pass

a_params.append(a)

b_params.append(b)

c_params.append(c)

d_params.append(d)

m_params.append(m)

return a_params, b_params, c_params, d_params, m_params

Call that function with the two required parameters: here, I sent ** center_1 **given by the function

*Fuzzy_Set*and

*n_sims=1000*a_params, b_params, c_params, d_params, m_params = Metropolis_Hastings(center_1, n_sims=1000)

fig_dims = (16, 10)

fig, ax = plt.subplots(figsize=fig_dims)

plt.plot(m_params, ‘r’)

ax.set(xlabel=’# Simulations’, ylabel=’Change Point Candidates (m)’)The Metropolis-Hastings Algorithm Simulations — Image by author

You finally found the change point now. This chart is interesting because it shows us how the drawing process was made. The first value drawn given by the uniform distribution was *m=55*. The algorithm rejected it and then tried another until it get a satisfactory and stationary result. After ~150 additional runs the value of *m=30 *could not be rejected by the algorithm anymore.

Since the function returns the sampled values for every parameter, we may plot their values as well. Starting with the parameter *m*, which is all the change points drawn. To see the density plot, you may discard the first 200 simulations as “burn-in”:

fig_dims = (16, 10)

fig, ax = plt.subplots(figsize=fig_dims)

ax.set(xlabel=’Change Point Candidates’, ylabel=’Density’)

sns.kdeplot(m_params[200:])Density Plot for the Change Point Candidates — Image by author

We may also create Beta distributions using the mean of the four other parameters, namely the parameters *a, b, c, *and *d. *As we discussed previously these parameters are essential for the Metropolis-Hastings algorithm once the reject rule must assert that for *t≤m*, we will have a *Beta(a,b) *distribution and a *Beta(c,d) *distribution for *t>m*.

Let’s go ahead and build a representation of that using the variables *a_params, b_params, c_params* and *d_params*, which contains the sampled values for *a, b, c *and *d*

fig, ax = plt.subplots(1, 1, figsize=(16, 8))

ax.set_ylabel(‘Density’)

beta_1 = np.random.beta(np.mean(a_params[200:]), np.mean(b_params[200:]), size=1000)

beta_2 = np.random.beta(np.mean(c_params[200:]), np.mean(d_params[200:]), size=1000)

sns.kdeplot(beta_1, color=’b’, fill=True, common_norm=False, alpha=.5, linewidth=0)

sns.kdeplot(beta_2, color=’r’, fill=True, common_norm=False, alpha=.5, linewidth=0)

The first Beta Distribution built with the mean of the parameters *a* and *b *is in red and the second with the mean of the parameters *c* and *d *is the blue one. In the middle, where both are less dense, we found the change point. Undoubtedly one of the great advantages of using such an approach is the availability of such distributions, since we can use them to make Bayesian inferences, enrich a predictive model or even use other types of Monte Carlo simulations.

Representations of Two Beta Distribuitions for the A, B, and D parameters — Image by author

### Conclusion

Finding change points in time series could prevent systems from getting into serious trouble. Just imagine a machine whose temperature has to be controlled. Any abrupt changes must be recognized as soon as possible in order for an engineer to inspect them. Or the energy consumption in various production facilities. Any excessive consumption has to be analyzed because it may indicate some deviation in production or even energy leakage, significantly impacting production costs.

That said, the approach developed by D’Angelo (2011) and implemented here in Python proved to be of great value for detecting change points in a given time series. In addition, as previously pointed out, another great aspect of this approach is precisely the fact we obtain two beta distributions as outputs, which can be really useful.

### References

AMINIKHANGHAHI, Samaneh e COOK, J. Diane. **A Survey of Methods for Time Series Change Point Detection. **Knowledge and Information Systems, 51, 2017.

D’ANGELO, Marcos Flávio *et al. *A Fuzzy/Bayesian Approach for the Time Series Change Point Detection Problem. Pesquisa Operacional, 31(2), 2011.

EHLERS, Ricardo S. **Inferência Bayesiana**. 5ª Ed. Departamento de Estatística da Universidade Federal do Paraná, 2007.

FAMA, Eugene. **Random Walks in Stock Market Prices**. Financial Analysts, 21, 1965.

_____________ **Efficient Capital Markets:** A Review of Theory and Empirical Work. The Journal of Finance, 25, 1970.

GAMERMAN, Dani. LOPES, Hedibert Freitas. **Markov Chain Monte Carlo: **Stochastic Simulation for Bayesian Inference. 2º Ed. Florida : Chapman & Hall/CRC, 2006. 315p.

GELMAN, Andrew *et al.* **Bayesian Data Analysis. **2ª Ed. Florida : Chapman & Hall/CRC, 2003. 668p.

HASTINGS, W. K. **Monte Carlo Sampling Methods Using Markov Chains and their Applications. **Biometrika, 57, 1970.

HAYKIN, Simon. **Redes Neurais**: Princípios e Práticas. 2ª Ed. Porto Alegre : Bookman, 2007. 900p.

IWATA, Takuma *et al. ***Accelerating Online Change-Point Detection Algorithm using 10GbE FPGA NIC**. Trabalho apresentado na 24ª Conferência Internacional de Computação Paralela e Distribuída. Turim : 2018.

KOHONEN, Teuvo. **Self-Organizing Maps**. Nova Iorque : Springer, 1990, 501p.

METROPOLIS, Nicholas *et al. ***Equation of State Calculations by Fast Computing Machines**. Journal of Chemical Physics, 21, 1953.

OH KJ, *et al*. **Developing Time-Based Clustering Neural Networks To Use Change-Point Detection: Application To Financial Time Series.** Asia-Pacific Journal of Operational Research, 22(1), 2005.

PAULINO, Carlos Daniel *et al.* **Estatística Bayesiana**.** **2º Edição. Lisboa : Fundação Calouste Gulbenkian, 2018, 601p.

VAN DEN BURG, Gerrit J. J., WILLIAMS, Christopher K. I. **An Evaluation of Change Point Detection Algorithms**. stat.ML, arXiv:2003.06222, 2020

Change Point Detection — A Bayesian Approach was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.