#### An alternative to logistic regression in special conditions

When it comes to statistical modelling and regression analysis, there are plenty of techniques to choose from. One method that is often overlooked but can be incredibly useful in certain scenarios is Complementary Log-Log (Cloglog) Regression. In this article, we'll take a closer look at what Cloglog regression is, when to use it, and how it works.

**Precursor of Cloglog regression**

Cloglog regression is a statistical modelling technique used to analyze binary response variables. When it comes to modelling binary outcomes, the first model that comes to mind is logistic regression; cloglog serves as an alternative to it in special scenarios. This article assumes a basic understanding of logistic regression. If you are unfamiliar with it, there is a wealth of online resources available that can help you build that foundation first.

Cloglog regression is closely related to the logistic regression model and is particularly useful when the probability of an event is very small or very large. It is most often used when dealing with rare events or situations where the outcome is extremely skewed.

**The Need for Cloglog Regression**

As we are aware, logistic regression follows the form of a sigmoid function. The sigmoid curve is depicted below:

Image by the author

From this graphical representation, it becomes apparent that for smaller values of 'x' the probability of the outcome remains low, while for larger values it becomes high. The curve is symmetric around the value 0.5 for 'Y': in logistic regression, the probability of success or event occurrence (Y = 1) is distributed symmetrically around 0.5. As a result, the largest change in probability occurs in the middle of the curve, while the probability is relatively insensitive at extreme values of 'x'. This assumption holds when the outcome variable has a substantial share of successes or events, as demonstrated by examples such as:

Prevalence of depression

Image by the author

Or students passing an exam

Image by the author

However, this assumption might not hold in the case of rare events or too frequent events, where the probability of success or event occurrence is either extremely low or very high. For instance, consider the scenario of people surviving a cardiac arrest, where the likelihood of success is significantly lower:

Image by the author

Or, success of glaucoma surgery in a hospital (chances of success are very high):

Image by the author

In such cases, the symmetrical distribution around 0.5 is not considered ideal, and a different modelling approach is suggested, which is where Complementary Log-Log Regression comes into the picture.

Unlike logit and probit, the Cloglog function is asymmetrical and skewed to one side.

**How Complementary Log-Log Regression Works**

Cloglog regression uses the complementary log-log link function, which also generates an S-shaped curve, but an asymmetrical one. The Cloglog model has the following form:

log(-log(1 - P(Y = 1))) = β0 + β1X1 + β2X2 + … + βkXk

The left side of the equation is called the Complementary Log-Log transformation. Like the logit and probit transformations, it takes a probability in (0, 1) and maps it onto (-∞, +∞). The model can also be written in terms of the probability itself:

P(Y = 1) = 1 - exp(-exp(β0 + β1X1 + β2X2 + … + βkXk))
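As a quick sanity check of these two equivalent forms, here is a small R sketch (illustrative, not from the original article) that applies the complementary log-log transformation and then inverts it back:

```r
# cloglog transformation: maps a probability p in (0, 1) onto (-Inf, +Inf)
cloglog <- function(p) log(-log(1 - p))

# inverse transformation: maps a linear predictor back to a probability
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

p <- 0.023            # e.g. a rare-event probability
eta <- cloglog(p)     # transform onto the linear-predictor scale
inv_cloglog(eta)      # round-trips back to the original probability
```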

In the graph below, we visualize the curves generated using the logit, probit, and cloglog transformations in R.

```r
# Load the ggplot2 package
library(ggplot2)

# Create a sequence of values for the x-axis
x <- seq(-5, 5, by = 0.1)

# Calculate the values for the logit and probit functions
logit_vals <- plogis(x)
probit_vals <- pnorm(x)

# Calculate the values for the cloglog function manually
cloglog_vals <- 1 - exp(-exp(x))

# Create a data frame to store the values
data <- data.frame(x, logit_vals, probit_vals, cloglog_vals)

# Create the plot using ggplot2
ggplot(data, aes(x = x)) +
  geom_line(aes(y = logit_vals, color = "Logit"), size = 1) +
  geom_line(aes(y = probit_vals, color = "Probit"), size = 1) +
  geom_line(aes(y = cloglog_vals, color = "CLogLog"), size = 1) +
  labs(title = "Logit, Probit, and CLogLog Functions",
       x = "x", y = "Probability") +
  scale_color_manual(values = c("Logit" = "red", "Probit" = "blue", "CLogLog" = "green")) +
  theme_minimal()
```

Image by the author

From the graph, we observe a distinct difference: while the logit and probit curves are symmetric around 0.5, the cloglog curve is asymmetric. With the logit and probit functions, the probability changes at a similar rate when approaching both 0 and 1. When the data are not symmetric within the [0, 1] interval, increasing slowly at small to moderate values but sharply near 1, the logit and probit models may not be suitable choices. In such situations, where asymmetry in the response variable is evident, the complementary log-log model (cloglog) emerges as a promising alternative, offering improved modelling capabilities. From the cloglog curve, we can see that P(Y = 1) approaches 0 relatively slowly and approaches 1 sharply.
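A quick numerical check (an illustrative sketch, not part of the original analysis) makes this asymmetry concrete: the logit curve passes through 0.5 at x = 0 and satisfies p(x) + p(-x) = 1, while the cloglog curve does neither:

```r
plogis(0)          # logit link at x = 0: exactly 0.5
1 - exp(-exp(0))   # cloglog link at x = 0: 1 - exp(-1), about 0.632

# symmetry check: logit probabilities at x and -x sum to 1; cloglog's do not
plogis(2) + plogis(-2)                    # exactly 1
(1 - exp(-exp(2))) + (1 - exp(-exp(-2)))  # about 1.13
```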

**Let us take an example: Examining Zinc deficiency**

I have simulated data on zinc deficiency within a specific group of individuals [note: the data were simulated by the author for personal use]. The dataset also contains age, sex, and BMI (Body Mass Index). Only 2.3% of the individuals in this dataset exhibit zinc deficiency, making it a relatively infrequent outcome in this population. Our outcome variable is zinc deficiency (binary: 0 = no, 1 = yes), and our predictor variables are age, sex and BMI. We fit logistic, probit and cloglog regressions in R and compare the three models using AIC:
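The author's dataset is not public, but a structurally similar dataset can be simulated along the following lines (the variable names mirror the article; the coefficients are illustrative assumptions, not the author's):

```r
set.seed(42)
n <- 10000
zinc <- data.frame(
  age = round(runif(n, 18, 80)),                 # age in years
  sex = factor(sample(1:2, n, replace = TRUE)),  # 1 = male, 2 = female
  bmi = rnorm(n, mean = 25, sd = 4)              # Body Mass Index
)
# rare binary outcome generated through a cloglog link (illustrative coefficients)
eta <- -2.1 - 0.034 * zinc$age - 1.26 * (zinc$sex == 2) + 0.01 * zinc$bmi
zinc$zinc_def <- rbinom(n, 1, 1 - exp(-exp(eta)))
mean(zinc$zinc_def)  # a small proportion, on the order of a few percent
```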

```r
> #tabulating zinc deficiency
> tab = table(zinc$zinc_def)
> rownames(tab) = c("No", "Yes")
> print(tab)

  No  Yes 
8993  209 

> #tabulating sex and zinc deficiency
> crosstab = table(zinc$sex, zinc$zinc_def)
> rownames(crosstab) = c("male", "female")
> colnames(crosstab) = c("No", "Yes")
> print(crosstab)

         No  Yes
male   4216  159
female 4777   50

> #defining sex as a factor variable
> zinc$sex = as.factor(zinc$sex)

> #logistic regression of zinc deficiency predicted by age, sex and bmi
> model1 = glm(zinc_def ~ age + sex + bmi, data = zinc, family = binomial(link = "logit"))
> summary(model1)

Call:
glm(formula = zinc_def ~ age + sex + bmi, family = binomial(link = "logit"),
    data = zinc)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.064053   0.415628  -4.966 6.83e-07 ***
age         -0.034369   0.004538  -7.574 3.62e-14 ***
sex2        -1.271344   0.164012  -7.752 9.08e-15 ***
bmi          0.010059   0.015843   0.635    0.525    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1995.3  on 9201  degrees of freedom
Residual deviance: 1858.8  on 9198  degrees of freedom
  (1149 observations deleted due to missingness)
AIC: 1866.8

Number of Fisher Scoring iterations: 7

> #probit model
> model2 = glm(zinc_def ~ age + sex + bmi, data = zinc, family = binomial(link = "probit"))
> summary(model2)

Call:
glm(formula = zinc_def ~ age + sex + bmi, family = binomial(link = "probit"),
    data = zinc)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.280983   0.176118  -7.273 3.50e-13 ***
age         -0.013956   0.001863  -7.493 6.75e-14 ***
sex2        -0.513252   0.064958  -7.901 2.76e-15 ***
bmi          0.003622   0.006642   0.545    0.586    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1995.3  on 9201  degrees of freedom
Residual deviance: 1861.7  on 9198  degrees of freedom
  (1149 observations deleted due to missingness)
AIC: 1869.7

Number of Fisher Scoring iterations: 7

> #cloglog model
> model3 = glm(zinc_def ~ age + sex + bmi, data = zinc, family = binomial(link = "cloglog"))
> summary(model3)

Call:
glm(formula = zinc_def ~ age + sex + bmi, family = binomial(link = "cloglog"),
    data = zinc)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.104644   0.407358  -5.167 2.38e-07 ***
age         -0.033924   0.004467  -7.594 3.09e-14 ***
sex2        -1.255728   0.162247  -7.740 9.97e-15 ***
bmi          0.010068   0.015545   0.648    0.517    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1995.3  on 9201  degrees of freedom
Residual deviance: 1858.6  on 9198  degrees of freedom
  (1149 observations deleted due to missingness)
AIC: 1866.6

Number of Fisher Scoring iterations: 7

> #extracting AIC value of each model for model comparison
> AIC_Val = AIC(model1, model2, model3)
> print(AIC_Val)
       df      AIC
model1  4 1866.832
model2  4 1869.724
model3  4 1866.587
```

The cloglog model achieves the lowest AIC (1866.6), narrowly ahead of the logistic model (1866.8) and the probit model (1869.7).

**Interpretation of the coefficients**

The interpretation of coefficients in Cloglog regression differs subtly from logistic regression. Each coefficient represents the change in the complementary log-log of the probability of the outcome associated with a one-unit change in the predictor variable. Exponentiating a coefficient therefore yields a hazard ratio rather than an odds ratio, although for rare events such as this one the two are numerically very close.

In our specific model, the coefficient for Age is -0.034. This implies that for every one-year increase in age, there is a 0.034-unit decrease in the complementary log-log of the probability of zinc deficiency. Exponentiating this coefficient, we obtain:

Hazard Ratio = exp(-0.034) = 0.97

This suggests that a one-year increase in age is associated with roughly a 3% decrease in the rate of zinc deficiency.

Similarly, for the variable 'sex':

Hazard Ratio = exp(-1.26) = 0.28

This indicates that, compared to males, females have about a 72% lower rate of zinc deficiency.

We can also interpret the BMI coefficient, although its p-value of 0.52 indicates that BMI is not significantly associated with zinc deficiency in this model.
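In practice, exponentiated coefficients and their confidence intervals can be pulled directly from a fitted model. The article's zinc data are not public, so as an illustration this snippet refits a cloglog model on R's built-in infert dataset (the formula and dataset here are stand-ins, not the author's):

```r
# fit a cloglog model on a built-in dataset for illustration
m <- glm(case ~ age + parity, data = infert,
         family = binomial(link = "cloglog"))

# exponentiated coefficients: hazard-ratio-style effect sizes
exp(coef(m))

# Wald confidence intervals, exponentiated onto the same scale
exp(confint.default(m))
```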

**Application and Uses**

Cloglog regression is utilized across various research fields, encompassing rare disease epidemiology, drug efficacy studies, credit risk assessment, defect detection, and survival analysis. In particular, the Cloglog model holds significant implications in survival analysis due to its close association with continuous-time models for event occurrences.

Complementary Log-Log Regression is a powerful and often overlooked statistical technique that can be invaluable in situations where traditional logistic regression might not be the right choice. By understanding its principles and applications, you can add this versatile tool to your data analysis arsenal.

A Gentle Introduction to Complementary Log-Log Regression was originally published in Towards Data Science on Medium.