Back to the Basics: Probit Regression

A crucial method in binary outcome analysis

Image by Issac Smith on Unsplash

Whenever we need to analyze a binary outcome, we usually think of logistic regression as the go-to method. That is why most articles about binary outcome regression focus exclusively on logistic regression. However, logistic regression is not the only option available. There are other methods, such as the Linear Probability Model (LPM), Probit regression, and Complementary Log-Log (Cloglog) regression. Unfortunately, few articles on these topics are available on the internet.

The Linear Probability Model is rarely used because it is not very effective in capturing the curvilinear relationship between a binary outcome and independent variables. I have discussed Cloglog regression in one of my previous articles. While there are some articles on Probit regression available on the internet, they tend to be technical and difficult for non-technical readers to understand. In this article, we will explain the basic principles of Probit regression, look at its applications, and compare it with logistic regression.

Background

This is how a relationship between a binary outcome variable and an independent variable typically looks:

Image by the author

The curve you see is called an S-shaped curve or sigmoid curve. If we look closely at this plot, we will notice that it resembles the cumulative distribution function (CDF) of a random variable. Therefore, it makes sense to use a CDF to model the relationship between a binary outcome variable and independent variables. The two most commonly used CDFs are those of the logistic and the normal distributions. Logistic regression uses the logistic CDF, given by the following equation:

Image by the author

In Probit regression, we use the cumulative distribution function (CDF) of the normal distribution. We can simply replace the logistic CDF with the normal CDF to get the equation of Probit regression:

Image by the author

Where Φ() represents the cumulative distribution function of the standard normal distribution.
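
For reference, the two models can be written side by side in standard notation. This is a sketch that assumes a single predictor X_i, with β1 as the intercept and β2 as the slope (the symbols used later in this article), and P_i as the probability that the outcome equals 1.

```latex
% Logistic regression: logistic CDF applied to the linear index
P_i = \Pr(Y_i = 1 \mid X_i) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}}

% Probit regression: standard normal CDF applied to the same index
P_i = \Pr(Y_i = 1 \mid X_i) = \Phi(\beta_1 + \beta_2 X_i)
    = \int_{-\infty}^{\beta_1 + \beta_2 X_i} \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \, dz
```

The only difference between the two lines is the function that maps the linear index to a probability.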

We could memorise this equation, but that would not help us understand how Probit regression works. Therefore, we will take a different approach to build that understanding.

The basic concept behind Probit regression

Let us say we have data on the weight and depression status of a sample of 1000 individuals. Our objective is to examine the relationship between weight and depression using Probit regression. (Download the data from this link.)

To provide some intuition, let us imagine that whether an individual (the i-th individual) experiences depression depends on an unobservable latent variable, denoted Ai. This latent variable is influenced by one or more independent variables. In our scenario, the weight of an individual determines the value of the latent variable. The probability of experiencing depression increases as the latent variable increases.

Image by the author

The question is, since Ai is an unobserved latent variable, how do we estimate the parameters of the above equation? Well, if we assume that Ai is normally distributed with the same mean and variance for every individual, we can obtain some information about the latent variable and estimate the model parameters. I will explain the equations in more detail later, but first, let us perform some practical calculations.

Coming back to our data: let us calculate the probability of depression for each weight and tabulate it. For example, there are 7 people with a weight of 40 kg, and 1 of them has depression, so the probability of depression for weight 40 is 1/7 = 0.14286. If we do this for all weights, we get this table:
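
The tabulation step can be sketched in a few lines of Python. The records below are hypothetical and only mimic the 40 kg example from the text (the 41 kg rows are invented for illustration); the helper name `depression_rate_by_weight` is also an assumption, not part of the original analysis.

```python
from collections import defaultdict

# Hypothetical individual-level records: (weight in kg, depressed: 1 = yes, 0 = no).
# Seven people weigh 40 kg and one of them has depression, as in the text.
records = [
    (40, 1), (40, 0), (40, 0), (40, 0), (40, 0), (40, 0), (40, 0),
    (41, 1), (41, 0), (41, 1), (41, 0),
]

def depression_rate_by_weight(rows):
    """Return {weight: (group size, cases, proportion with depression)}."""
    counts = defaultdict(lambda: [0, 0])  # weight -> [n, cases]
    for weight, depressed in rows:
        counts[weight][0] += 1
        counts[weight][1] += depressed
    return {w: (n, k, k / n) for w, (n, k) in sorted(counts.items())}

table = depression_rate_by_weight(records)
for weight, (n, cases, p) in table.items():
    print(f"weight={weight} kg  n={n}  cases={cases}  p={p:.5f}")
```

Each row of the resulting table is one weight group with its observed proportion of depression.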

Image by the author

Now, how do we get the values of the latent variable? The CDF of the normal distribution gives the probability that a variable takes a value less than or equal to a given X. The inverse CDF does the opposite: it gives the value of X for a given probability. In this case, we already have the probability values, so we can determine the corresponding values of the latent variable by using the inverse CDF of the normal distribution. [Note: The inverse normal CDF is available in almost every statistical package, and in Excel.]
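
As a minimal sketch of this step, Python's standard library exposes the inverse normal CDF as `NormalDist.inv_cdf` (in Excel, the corresponding function is `NORM.S.INV`). The wrapper name `normit` is a label chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def normit(p):
    """Inverse standard normal CDF: the z-score whose left-tail area is p."""
    # The inverse CDF is only defined for 0 < p < 1. A weight group in which
    # nobody (p = 0) or everybody (p = 1) has depression has no finite normit.
    if not 0.0 < p < 1.0:
        raise ValueError("normit is undefined for p <= 0 or p >= 1")
    return std_normal.inv_cdf(p)

p_40kg = 1 / 7            # observed proportion at 40 kg from the text
z_40kg = normit(p_40kg)   # negative, because p is below 0.5
print(round(z_40kg, 4))

# Round trip: applying the CDF to the normit recovers the probability.
print(round(std_normal.cdf(z_40kg), 5))
```

Note the edge case in the comment: with grouped data, groups whose observed proportion is exactly 0 or 1 need special handling before this transformation can be applied.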

Image by the author

The estimated latent variable Ai is known as the normal equivalent deviate (n.e.d.) or simply the normit. Looking closely, it is nothing but the z-score corresponding to each observed probability. Once we have the estimated Ai, estimating β1 and β2 is relatively simple: we can run a simple linear regression of Ai on our independent variable.
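
The regression of normits on the predictor can be sketched as follows. The grouped proportions below are hypothetical, not the article's data, and the fit is plain unweighted least squares; a fuller treatment of grouped data would typically weight each group by its size.

```python
from statistics import NormalDist

# Hypothetical grouped data: weight (kg) -> observed proportion with depression.
grouped = {40: 0.14, 50: 0.25, 60: 0.40, 70: 0.55, 80: 0.70, 90: 0.80}

inv_cdf = NormalDist().inv_cdf
weights = list(grouped)
normits = [inv_cdf(p) for p in grouped.values()]  # estimated A_i per group

def ols(x, y):
    """Ordinary least squares for y = b1 + b2 * x; returns (b1, b2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx              # slope
    b1 = y_bar - b2 * x_bar     # intercept
    return b1, b2

b1, b2 = ols(weights, normits)
print(f"intercept b1 = {b1:.4f}, slope b2 = {b2:.4f}")
```

The slope is read on the z-score scale: it is the change in the normit for a one-unit change in weight.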

Image by the author

The coefficient of weight, 0.0256, gives the change in the z-score of the outcome variable (depression) associated with a one-unit change in weight. Specifically, a one-unit increase in weight is associated with an increase of approximately 0.0256 in the z-score of the probability of depression. We can calculate the probability of depression for any weight using the standard normal distribution. For example, for a weight of 70,

Ai = -1.61279 + (0.02565)*70

Ai = 0.1828

The probability associated with a z-score of 0.1828, P(Z ≤ 0.1828), is 0.57; i.e., the predicted probability of depression for a weight of 70 is 0.57.
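
The same calculation can be reproduced with the standard library. This sketch plugs in the two coefficients quoted above; the function name `predicted_probability` is chosen here for illustration.

```python
from statistics import NormalDist

# Coefficients quoted in the worked example above.
B1 = -1.61279   # intercept
B2 = 0.02565    # coefficient of weight

def predicted_probability(weight_kg):
    """Probit prediction: P(depression) = Phi(b1 + b2 * weight)."""
    a_i = B1 + B2 * weight_kg        # latent index, on the z-score scale
    return NormalDist().cdf(a_i)     # area to the left of that z-score

p_70 = predicted_probability(70)
print(round(p_70, 2))  # 0.57
```

Because the normal CDF is increasing, any weight above 70 yields a predicted probability above 0.57 under these coefficients.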

Admittedly, the explanation above oversimplifies a moderately complex method. It only illustrates the basic principle behind the use of the cumulative normal distribution in Probit regression. Now, let us have a look at the mathematical equations.

Mathematical Structure

We discussed earlier that there exists a latent variable, Ai, that is determined by the predictor variables. It is reasonable to assume that there exists a critical or threshold value (Ai_c) of the latent variable such that if Ai exceeds Ai_c, the individual will have depression; otherwise, he/she will not. Given the assumption of normality, the probability that Ai_c is less than or equal to Ai can be calculated from the standardized normal CDF:

Image by the author

Where Zi is the standard normal variable, i.e., Z ∼ N(0, 1), and F is the standard normal CDF.

The information related to the latent variable and β1 and β2 can be obtained by taking the inverse of the above equation:

Image by the author

The inverse CDF of the standardized normal distribution gives the value of Z for a given probability value.
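
The chain of reasoning in this section can be written out in one place. This is a sketch in standard notation, assuming a single predictor X_i (weight) and writing the threshold Ai_c as A_i^c; the symbols β1, β2, Z_i and F are those used in the text.

```latex
% Latent index for individual i
A_i = \beta_1 + \beta_2 X_i

% Depression occurs when the index reaches the threshold A_i^c
P_i = \Pr(Y_i = 1 \mid X_i) = \Pr(A_i^c \le A_i)
    = \Pr(Z_i \le \beta_1 + \beta_2 X_i) = F(\beta_1 + \beta_2 X_i)

% Inverting the CDF recovers the index from the probability
A_i = F^{-1}(P_i) = \beta_1 + \beta_2 X_i
```

The last line is what the earlier worked example exploited: the inverse CDF of each observed probability is an estimate of the linear index.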

Now, the estimation process of β1, β2, and Ai depends on whether we have grouped data or individual-level ungrouped data.

When we have grouped data, it is easy to calculate the probabilities. In our depression example, the initial data is ungrouped, i.e. there is weight for each individual and his/her status of depression (1 and 0). Initially, the total sample size was 1000, but we grouped that data by weight, resulting in 71 groups, and calculated the probability of depression in each weight group.

However, when the data is ungrouped, the Maximum Likelihood Estimation (MLE) method is utilized to estimate the model parameters. The figure below shows the Probit regression on our ungrouped data (n = 1000):

Image by the author

It can be observed that the coefficient of weight is very close to what we estimated with the grouped data.
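
To make the maximum likelihood step concrete, here is a minimal sketch of a one-predictor Probit fit using Fisher scoring, written with the standard library only. The data are synthetic and built from assumed "true" coefficients, so the numbers are illustrative and not the article's results. In practice one would normally rely on a statistics package (for example, the `Probit` model in statsmodels) rather than hand-written code.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def fit_probit(x, y, iterations=25):
    """Fit P(y = 1) = Phi(b1 + b2 * x) by Fisher scoring; returns (b1, b2)."""
    b1, b2 = 0.0, 0.0
    for _ in range(iterations):
        g1 = g2 = 0.0        # score: gradient of the log-likelihood
        s0 = s1 = s2 = 0.0   # entries of the expected information matrix
        for xi, yi in zip(x, y):
            eta = b1 + b2 * xi
            p = min(max(PHI.cdf(eta), 1e-12), 1 - 1e-12)  # keep p inside (0, 1)
            dens = PHI.pdf(eta)
            resid = (yi - p) * dens / (p * (1 - p))
            w = dens * dens / (p * (1 - p))
            g1 += resid
            g2 += resid * xi
            s0 += w
            s1 += w * xi
            s2 += w * xi * xi
        # Solve the 2x2 system (information matrix) * step = score.
        det = s0 * s2 - s1 * s1
        step1 = (s2 * g1 - s1 * g2) / det
        step2 = (s0 * g2 - s1 * g1) / det
        b1 += step1
        b2 += step2
        if abs(step1) < 1e-10 and abs(step2) < 1e-10:
            break
    return b1, b2

# Synthetic individual-level data (assumed coefficients, for illustration only):
# 200 people at each weight, with the number of cases set by the Probit curve.
TRUE_B1, TRUE_B2 = -1.6, 0.0256
x, y = [], []
for weight in range(40, 111, 5):
    cases = round(200 * PHI.cdf(TRUE_B1 + TRUE_B2 * weight))
    x += [weight] * 200
    y += [1] * cases + [0] * (200 - cases)

b1_hat, b2_hat = fit_probit(x, y)
print(f"b1 = {b1_hat:.4f}, b2 = {b2_hat:.4f}")
```

The fit works directly on the 0/1 outcomes, so no grouping and no inverse CDF of observed proportions is needed.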

Probit vs Logit

Now that we have grasped the concept of Probit regression and are familiar (hopefully) with logistic regression, the question arises: which model is preferable? Which model performs better under different conditions? Well, both models are quite similar in their application and yield comparable results (in terms of predicted probabilities). The only minor distinction lies in their sensitivity to extreme values. Let’s take a closer look at both models:

Image by the author

From the plot, we can observe that the Probit and Logit models are quite similar. However, Probit is less sensitive to extreme values than Logit, because the logistic distribution has heavier tails than the normal distribution. This means that at extreme values, the change in the probability of the outcome for a unit change in the predictor variable is larger in the Logit model than in the Probit model. So, if you want your model to be sensitive at extreme values, you may prefer logistic regression. However, this choice will not significantly affect the estimates, as both models yield similar predicted probabilities.

It is important to note that the coefficients obtained from the two models represent different quantities and cannot be compared directly. Logit regression gives the change in the log odds of the outcome for a change in the predictor variable, while Probit regression gives the change in the z-score of the outcome.
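
The similarity in the middle and the difference in the tails can be checked numerically. In this sketch the logistic index is multiplied by 1.702, a commonly used scaling constant that makes the logistic CDF track the standard normal CDF closely; that constant and the helper names are assumptions made here for the comparison, not part of the original analysis.

```python
import math
from statistics import NormalDist

def probit_cdf(z):
    """Standard normal CDF."""
    return NormalDist().cdf(z)

def logit_cdf(z, scale=1.702):
    """Logistic CDF with the index rescaled so that it roughly tracks Phi(z)."""
    return 1.0 / (1.0 + math.exp(-scale * z))

def slope(cdf, z, h=1e-5):
    """Numerical derivative: change in probability per unit change in z."""
    return (cdf(z + h) - cdf(z - h)) / (2 * h)

# Across the whole range the two curves stay close to each other.
grid = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(probit_cdf(z) - logit_cdf(z)) for z in grid)
print(f"largest gap between the curves: {max_gap:.4f}")

# In the tail, the logistic curve has more probability and a steeper slope.
for z in (-3.0, -2.0, 0.0):
    print(z, round(probit_cdf(z), 5), round(logit_cdf(z), 5),
          round(slope(probit_cdf, z), 5), round(slope(logit_cdf, z), 5))
```

At z = -3 the logistic curve still assigns a noticeably larger probability and changes faster than the normal curve, which is the tail behaviour described above.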

In practice, logistic regression is preferred over Probit regression because of its mathematical simplicity and easy interpretation of the coefficients.

Back to the Basics: Probit Regression was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
