Breaking down a Maximum Likelihood-based approach with example code
This is the 2nd article in a series about Power Laws and Fat Tails. In the previous post, I gave a beginner-friendly introduction to power laws and presented 3 problems with our standard statistical tools in analyzing them. While awareness can help us avoid these problems, it is not always clear what distribution some given data follow in practice. In this article, I will describe how to objectively detect Power Laws from real-world data and share a concrete example with social media data.
Note: If you are unfamiliar with terms like Power Law distribution or Fat Tail, review the first article of this series as a primer.
Power Laws Break STAT 101
In the previous article, we focused on two types of distributions: the Gaussian distribution and the Power Law distribution. We saw that these distributions had diametrically opposite statistical properties. Namely, Power Laws are driven by rare events, while Gaussians are not.
Example Gaussian and Power Law distributions, respectively. Image by author
This rare-event-driven property raised 3 problems with many of our favorite statistical tools (e.g. mean, standard deviation, regression, etc.) in analyzing Power Laws. The takeaway was that if data are Gaussian-like, one can use common approaches like regression and computing expectation values without worry. However, if data are more Power Law-like, these techniques can give incorrect and misleading results.
We also saw a third (more mischievous) distribution, the Log Normal, which can resemble both a Gaussian and a Power Law (despite their opposite properties).
The (mischievous) Log Normal distribution appears both Gaussian-like and Power Law-like. Image by author.
This ambiguity presents challenges for practitioners in deciding the best way to analyze a given dataset. To help overcome these challenges, it can be advantageous to determine whether data fit a Power Law, Log Normal, or some other type of distribution.
Log-Log Approach
A popular way of fitting a Power Law to real-world data is what I’ll call the “Log-Log approach” [1]. The idea comes from taking the logarithm of the Power Law’s probability density function (PDF), as derived below.
Taking the log of Power Law probability distribution function [2]. Image by author.
The above derivation translates the Power Law’s PDF definition into a linear equation, as shown in the figure below.
Highlight the linear form of the log(PDF). Image by author.
This implies that the histogram of data following a power law traces a straight line on a log-log plot. In practice, this looks like generating a histogram for some data and plotting it on log-log axes [1]. One might go even further and perform a linear regression to estimate the distribution’s α value from the slope m (for a PDF of the form p(x) ∝ x^(−α), we have α = −m).
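To make this concrete, here is a minimal sketch of the Log-Log approach (my own illustration, not code from this article’s repo). It histograms a synthetic Pareto sample with logarithmically spaced bins and regresses log-density against log-value:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.pareto(2, 10_000) + 1  # Pareto sample; PDF exponent is -(2 + 1) = -3

# histogram with logarithmically spaced bins, normalized to a density
bins = np.logspace(np.log10(sample.min()), np.log10(sample.max()), 30)
density, edges = np.histogram(sample, bins=bins, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

# linear regression on the log-log points (empty bins dropped)
mask = density > 0
slope, intercept = np.polyfit(np.log10(centers[mask]), np.log10(density[mask]), 1)
print("slope m =", round(float(slope), 2))  # roughly -3 for this sample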
However, there are significant limitations to this approach. These are described in reference [1] and summarized below.
Slope (and hence α) estimates are subject to systematic errors.
Regression errors can be hard to estimate.
The fit can look good even if the data do not follow a Power Law.
Fits may not obey basic conditions for probability distributions, e.g. normalization.
Maximum Likelihood Approach
While the Log-Log approach is simple to implement, its limitations make it less than optimal. Instead, we can turn to a more mathematically sound approach via Maximum Likelihood, a widely used statistical method for inferring the best parameters for a model given some data.
Maximum Likelihood consists of 2 key steps. Step 1: obtain a likelihood function. Step 2: maximize the likelihood with respect to your model parameters.
Step 1: Write Likelihood Function
Likelihood is a special type of probability. Put simply, it quantifies the probability of our data given a particular model. We can express it as the joint probability over all our observed data [3]. In the case of a Pareto distribution, we can write this as follows.
Likelihood function for Pareto distribution (i.e. a special type of Power Law) [4]. Note: when working with Likelihood functions, observations (i.e. x_i) are fixed while model parameters are what vary. Image by author.
To make maximizing the likelihood a little easier, it is customary to work with the log-likelihood (they are maximized by the same value of α).
Log-likelihood derivation [4]. Image by author.
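Before maximizing anything analytically, we can build intuition by evaluating this log-likelihood over a grid of α values and seeing where it peaks (my own sketch, assuming x_min = 1 for simplicity):

# evaluate l(alpha) = n*ln(alpha) + n*alpha*ln(x_min) - (alpha + 1)*sum(ln(x_i))
# on a grid; with x_min = 1 the middle term vanishes
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(2, 1_000) + 1  # Pareto sample with true alpha = 2, x_min = 1

alphas = np.linspace(0.5, 5, 500)
log_lik = len(x) * np.log(alphas) - (alphas + 1) * np.sum(np.log(x))
print("alpha at peak =", round(float(alphas[np.argmax(log_lik)]), 2))  # near 2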
Step 2: Maximize Likelihood
With a (log) likelihood function in hand, we can frame the task of determining the best parameters as an optimization problem. Finding the optimal α value for our data boils down to setting the derivative of l(α) with respect to α equal to zero and solving for α. A derivation of this is given below.
Derivation of max likelihood estimator for α [4]. Image by author.
This provides us with the so-called Maximum Likelihood estimator for α. With this, we can plug in observed values of x to estimate a Pareto distribution’s α value.
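Under the Pareto parameterization from [4] (with x_min known), the estimator has the closed form α̂ = n / Σ ln(xᵢ / x_min). Here is a minimal sketch of what that looks like in code:

import numpy as np

def pareto_alpha_mle(x, x_min):
    # closed-form MLE for the Pareto shape parameter alpha,
    # assuming the PDF p(x) = alpha * x_min**alpha / x**(alpha + 1)
    x = np.asarray(x)
    x = x[x >= x_min]
    return len(x) / np.sum(np.log(x / x_min))

rng = np.random.default_rng(0)
sample = rng.pareto(2, 10_000) + 1  # true alpha = 2, x_min = 1
print(pareto_alpha_mle(sample, x_min=1))  # prints a value close to 2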
With the theoretical foundation set, let’s see what this looks like when applied to real-world data (from my social media accounts).
Example code: Fitting Power Laws to Social Media Data
One domain in which fat-tailed data are prevalent is social media. For instance, a small percentage of creators get the bulk of the attention, a minority of Medium blogs get the majority of reads, and so on.
Here we will use the powerlaw Python library to determine whether data from my various social media channels (i.e. Medium, YouTube, LinkedIn) truly follow a Power Law distribution. The data and code for these examples are available in the GitHub repository linked below.
YouTube-Blog/power-laws/2-detecting-powerlaws at main · ShawhinT/YouTube-Blog
Artificial Data
Before applying the Maximum Likelihood-based approach to messy data from the real world, let’s see what happens when we apply this technique to artificial data (truly) generated from Pareto and Log Normal distributions, respectively. This will help ground our expectations before using the approach on data in which we do not know the “true” underlying distribution class.
First, we import some helpful libraries.
import numpy as np
import matplotlib.pyplot as plt
import powerlaw
import pandas as pd
np.random.seed(0)
Next, let’s generate data from Pareto and Log Normal distributions.
# power law (Pareto) data
a = 2        # shape parameter (numpy's convention)
x_min = 1    # minimum value
n = 1_000
s_pareto = (np.random.pareto(a, n + 1) + 1) * x_min  # n + 1 samples

# log normal data
m = 10  # mu
s = 1   # sigma
# note: the constant factor s * sqrt(2*pi) rescales the sample,
# shifting the effective mu by ln(s * sqrt(2*pi)) ≈ 0.92
s_lognormal = np.random.lognormal(m, s, n + 1) * s * np.sqrt(2*np.pi)
To get a sense of what these data look like, it’s helpful to plot histograms. Here, I plot a histogram of each sample’s raw values and of the log of the raw values. The latter view makes it easier to distinguish Power Law from Log Normal data visually.
Histograms of data from Power Law distribution. Image by author.
Histograms of data from Log Normal distribution. Image by author.
As we can see from the above histograms, the distributions of raw values look qualitatively similar for both distributions. However, we can see a stark difference in the log distributions. Namely, the log Power Law distribution is highly skewed and not mean-centered, while the log of the Log Normal distribution is reminiscent of a Gaussian distribution.
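For reference, here is a sketch of how such histograms can be generated (not necessarily the exact code behind the figures above):

# plot raw-value and log-value histograms for both samples
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for row, (name, sample) in enumerate([("Pareto", s_pareto), ("Log Normal", s_lognormal)]):
    axes[row, 0].hist(sample, bins=100)
    axes[row, 0].set_title(name + ": raw values")
    axes[row, 1].hist(np.log(sample), bins=100)
    axes[row, 1].set_title(name + ": log values")
plt.tight_layout()
plt.show()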
Now, we can use the powerlaw library to fit a Power Law to each sample and estimate values for α and x_min. Here’s what that looks like for our Power Law sample.
# fit power law to the power law data
results = powerlaw.Fit(s_pareto)

# printing results
# note: the powerlaw lib's alpha definition differs from the standard one,
# i.e. a_powerlawlib = a_standard + 1
print("alpha = " + str(results.power_law.alpha))
print("x_min = " + str(results.power_law.xmin))
# compute_power_law_p_val() is a helper function defined in the companion repo
print("p = " + str(compute_power_law_p_val(results)))
# Calculating best minimal value for power law fit
# alpha = 2.9331912195958676
# x_min = 1.2703447024073973
# p = 0.999
The fit does a decent job of estimating the true parameter values (x_min = 1, and α = 3 in the powerlaw library’s convention, since a_powerlawlib = a_standard + 1 = 2 + 1), as seen by the alpha and x_min values printed above. The value p above quantifies the quality of the fit. A higher p means a better fit (more on this value in section 4.1 of ref [1]).
We can do a similar thing for the Log Normal distribution.
# fit power law to the log normal data
results = powerlaw.Fit(s_lognormal)
print("alpha = " + str(results.power_law.alpha))  # recall: a_powerlawlib = a_standard + 1
print("x_min = " + str(results.power_law.xmin))
print("p = " + str(compute_power_law_p_val(results)))
# Calculating best minimal value for power law fit
# alpha = 2.5508694755027337
# x_min = 76574.4701482522
# p = 0.999
We can see that the Log Normal sample also fits a Power Law distribution well (p=0.999). Notice, however, that the x_min value is far in the tail. While this may be helpful for some use cases, it doesn’t tell us much about the distribution that best fits all the data in the sample.
To overcome this, we can manually set the x_min value to the sample minimum and redo the fit.
# fixing xmin so that fit must include all data
results = powerlaw.Fit(s_lognormal, xmin=np.min(s_lognormal))
print("alpha = " + str(results.power_law.alpha))
print("x_min = " + str(results.power_law.xmin))
# alpha = 1.3087955873576855
# x_min = 2201.318351239509
The powerlaw.Fit object also automatically generates parameter estimates for a Log Normal distribution.
print("mu = " + str(results.lognormal.mu))
print("sigma = " + str(results.lognormal.sigma))
# mu = 10.933481999687547
# sigma = 0.9834599169175509
The estimated Log Normal parameter values are close to the true ones once we account for the constant factor applied when generating the sample: multiplying a Log Normal by s·√(2π) ≈ 2.51 shifts mu by ln(2.51) ≈ 0.92, so the effective true values are mu ≈ 10.92 and sigma = 1. The fit did a good job once again!
However, by fixing x_min, we lost our quality metric p (for whatever reason, the method doesn’t generate values for it when x_min is provided). This raises the question: which distribution parameters should we go with, the Power Law or the Log Normal?
To answer this question, we can compare the Power Law fit to other candidate distributions via Log-likelihood ratios (R). A positive R implies the Power Law is a better fit, while a negative R implies the alternative distribution is better. Additionally, each comparison gives us a significance value (p). This is demonstrated in the code block below.
distribution_list = ['lognormal', 'exponential', 'truncated_power_law',
                     'stretched_exponential', 'lognormal_positive']

for distribution in distribution_list:
    R, p = results.distribution_compare('power_law', distribution)
    print("power law vs " + distribution +
          ": R = " + str(np.round(R, 3)) +
          ", p = " + str(np.round(p, 3)))
# power law vs lognormal: R = -776.987, p = 0.0
# power law vs exponential: R = -737.24, p = 0.0
# power law vs truncated_power_law: R = -419.958, p = 0.0
# power law vs stretched_exponential: R = -737.289, p = 0.0
# power law vs lognormal_positive: R = -776.987, p = 0.0
As shown above, every alternative distribution is preferred over the Power Law when including all the data in the Log Normal sample. Additionally, based on the likelihood ratios, the lognormal and lognormal_positive fits work best.
Real-world Data
Now that we’ve applied the powerlaw library to data where we know the ground truth, let’s try it on data for which the underlying distribution is unknown.
We will follow a similar procedure to the one above, but with data from the real world: monthly followers gained on my Medium profile, earnings across all my YouTube videos, and daily impressions on my LinkedIn posts over the past year.
We’ll start by plotting histograms.
Medium followers gained histograms. Image by author.
YouTube video earnings histograms. Image by author.
LinkedIn daily impressions histograms. Image by author.
Two things jump out to me from these plots. One, all three look more like the Log Normal histograms than the Power Law histograms we saw before. Two, the Medium and YouTube distributions are sparse, meaning they may have insufficient data for drawing strong conclusions.
Next, we’ll apply the Power Law fit to all three distributions while setting x_min as the smallest value in each sample, following the sketch below. The resulting parameter estimates are shown after the code.
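This is roughly what that looks like (the file and column names here are placeholders, not the actual ones from the repo):

datasets = {
    "Medium followers": pd.read_csv("medium_followers.csv")["followers"],
    "YouTube earnings": pd.read_csv("youtube_earnings.csv")["earnings"],
    "LinkedIn impressions": pd.read_csv("linkedin_impressions.csv")["impressions"],
}

# fit each sample, forcing the fit to include all the data
for name, sample in datasets.items():
    results = powerlaw.Fit(sample, xmin=np.min(sample))
    print(name)
    print("  alpha = " + str(results.power_law.alpha))
    print("  mu = " + str(results.lognormal.mu))
    print("  sigma = " + str(results.lognormal.sigma))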
Power Law and Log Normal parameter estimates for empirical data. Image by author.
To determine which distribution is best, we can again do head-to-head comparisons of the Power Law fit to some alternatives. These results are given below.
Fit comparisons of Power Law and alternative distributions. Image by author.
Using the rule-of-thumb significance cutoff of p < 0.1, we can draw the following conclusions: Medium followers and LinkedIn impressions best fit a Log Normal distribution, while a Power Law best represents YouTube earnings.
Of course, since the Medium followers and YouTube earnings data here are limited (N < 100), we should take any conclusions from those data with a grain of salt.
Conclusion
Many standard statistical tools break down when applied to data following a Power Law distribution. Accordingly, detecting Power Laws in empirical data can help practitioners avoid incorrect analyses and misleading conclusions.
However, Power Laws are an extreme case of the more general phenomenon of fat tails. In the next article of this series, we will take this work one step further and quantify fat-tailedness for any given dataset via 4 handy heuristics.
Pareto, Power Laws, and Fat Tails
Resources
[1] arXiv:0706.1062 [physics.data-an]
[2] arXiv:2001.10488 [stat.OT]
[3] https://en.wikipedia.org/wiki/Likelihood_function
[4] https://en.wikipedia.org/wiki/Pareto_distribution