Lagrange Multipliers, KKT Conditions, and Duality — Intuitively Explained
Your key to understanding SVMs, Regularization, PCA, and many other machine learning concepts
In this story, we will build a clear and intuitive understanding of three related concepts in mathematical optimization. These concepts took me substantial time and effort to fully grasp, so I’ve aimed to present them in a way that is accessible to all readers. Our journey will commence with a refresher on unconstrained optimization, followed by constrained optimization, where we’ll utilize Lagrange multipliers and KKT conditions. We then delve into the interplay between these ideas and their connection to the concept of duality.
Unconstrained Optimization
Plot of a Multivariable Function by Cdang on Wikimedia CC BY-SA 4.0.
In unconstrained optimization, we are given a multivariable function f(u) and we want to find the vector u* where the function value f(u*) is optimal (a maximum or minimum).
A function can in general have multiple maxima and minima, as shown above. In classical machine learning, and throughout this story, we will mostly be interested in convex functions (that are also sufficiently smooth). Convexity implies that the function has at most one optimum value (a single minimum when the function at hand is a loss function), as shown below:
Graph for a 3D Surface by Andrebis on Wikipedia CC BY-SA 3.0.
It’s much easier to deal with convex functions; otherwise, it can be really hard to tell whether a minimum we have found is the lowest of all (i.e., the global minimum) and not just some local minimum. Note that even when there is one minimum value, there can be many points that attain it (e.g., if the function is flat around the minimum); we will pretend that this case doesn’t happen to simplify the explanation, since allowing it wouldn’t change anything we derive whatsoever.
Performing unconstrained optimization on a given multivariable function f(u) amounts to solving ∇ᵤf(u) = 0. If f(u) is a function of n variables (u₁, u₂,…,uₙ), then this is a system of n equations:

∂f(u)/∂u₁ = 0, ∂f(u)/∂u₂ = 0, …, ∂f(u)/∂uₙ = 0
Solving it yields the optimal solution u* = (u*₁, u*₂,…,u*ₙ), which is where the optimal value (e.g., the minimum) occurs.
Tangent Plane on a Surface by Mike Run on Wikimedia CC BY-SA 4.0.
To see where this comes from, recall that:
- The normal vector to the tangent plane of the surface z = f(u) at any point u takes the form (∂f(u)/∂u₁, ∂f(u)/∂u₂, …, ∂f(u)/∂uₙ, −1)
- The tangent plane at any minimum or maximum is horizontal (visually obvious)
- Hence, whenever ∇ᵤf(u) = 0 holds, there is a horizontal tangent plane at that point, and thus it must be the optimum we are looking for.
Another way to rationalize this, which will be useful shortly, is to observe that the gradient points in the direction of greatest increase (and opposite to the direction of greatest decrease). Thus, when ∇ᵤf(u) = 0 holds, it must be that no direction increases the function from that point (i.e., we are at a maximum) or that no direction decreases it (i.e., we are at a minimum).
Unconstrained Optimization Summary
Given: f(u)
Wanted: u where f(u) is minimum
Approach: Solve ∇ᵤf(u) = 0 since that holds at the minimum
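To make this concrete, here is a minimal sketch of this recipe using SymPy, with a made-up convex quadratic objective (any convex smooth f would do):

```python
import sympy as sp

# Made-up convex objective: f(u1, u2) = (u1 - 1)^2 + (u2 + 2)^2
u1, u2 = sp.symbols('u1 u2', real=True)
f = (u1 - 1)**2 + (u2 + 2)**2

# Solve the system grad f = 0 (one equation per variable).
solution = sp.solve([sp.diff(f, u1), sp.diff(f, u2)], [u1, u2], dict=True)
print(solution)  # [{u1: 1, u2: -2}] -- the unique minimum of this convex f
```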
Constrained Optimization
In this type of optimization, we are given a constraint of the form g(u)=0 or g(u)≤0 (or one that we can put in this form by rearranging terms or multiplying by a negative), and we want to optimize over only the points satisfying the constraint.
We typically assume that equality constraints g(u)=0 are affine (generalization of linear) and that inequality constraints g(u)≤0 involve convex functions so that the whole optimization problem is convex (otherwise, the fact that f(u) is convex alone may not be sufficient to guarantee one optimal value).
Constrained Optimization with Equality Constraints
Here, we are given a multivariable function f(u) and a constraint g(u)=0, and we want to find the point u* where g(u*)=0 and f(u*) is minimal (i.e., the lowest possible point that satisfies the constraint).
Constrained Optimization by Jacobmelgrad on Wikipedia CC BY-SA 3.0.
For instance, in the example shown, the objective function is f(u₁, u₂) = u₁ + u₂ (3D plane) and the constraint is u₁² + u₂² = 1 (2D circle). The goal is to find the point (u₁, u₂) corresponding to the lowest point on the plane where u₁² + u₂² = 1 holds (i.e., the (u₁, u₂) where the lowest point on the circle projected onto the plane occurs).
One approach to solving this type of constrained optimization problem is the method of Lagrange multipliers. In simple terms, the Lagrange multiplier theorem states that any solution u* to an optimization problem of the form:
Minimize f(u) such that g(u)=0
must satisfy the equation ∇ᵤL(u*, λ) = 0 for some λ ∈ ℝ (and, trivially, g(u*) = 0), where L is the Lagrangian function given by L(u, λ) = f(u) + λg(u). The theorem assumes that ∇ᵤg(u*) ≠ 0.
It follows from this that we can solve constrained optimization problems with equality constraints as follows (assuming ∇ᵤg(u*)≠0):
1. Write the Lagrangian function L(u, λ) = f(u) + λg(u).
2. Solve the n+1 equations resulting from ∇ᵤL(u, λ) = 0 (n equations) together with g(u) = 0 to find the n+1 unknowns u*₁, u*₂,…,u*ₙ, λ.
3. The solution is u* = (u*₁, u*₂,…,u*ₙ).
λ is called a Lagrange multiplier. We only need to find it because it’s part of the system that yields the solution u*.
You can work out the example corresponding to the figure above here. In this example, the problem is not convex, so solving may yield any minimum or maximum present. Notice that steps (1) and (2) above are equivalent to performing unconstrained optimization on L(u, λ) = f(u) + λg(u). That is, setting:
∇L(u, λ) = (∂L(u, λ)/∂u₁, ∂L(u, λ)/∂u₂, …, ∂L(u, λ)/∂uₙ, ∂L(u, λ)/∂λ) = 0
In this sense, this method of Lagrange multipliers is powerful in that it casts a constrained optimization problem into an unconstrained optimization problem which we can solve by simply setting the gradient as zero.
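As a sketch of this recipe on the circle example above (assuming the objective f(u₁, u₂) = u₁ + u₂ and constraint u₁² + u₂² = 1 from the figure), SymPy can solve the resulting system directly:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y                    # objective from the figure's example
g = x**2 + y**2 - 1          # equality constraint g(x, y) = 0
L = f + lam * g              # the Lagrangian

# Stationarity in x and y, plus the constraint itself: n + 1 = 3 equations.
eqs = [sp.diff(L, x), sp.diff(L, y), g]
print(sp.solve(eqs, [x, y, lam], dict=True))
# Two critical points: (-1/sqrt(2), -1/sqrt(2)) is the minimum,
# (1/sqrt(2), 1/sqrt(2)) is the maximum.
```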
Constrained Optimization by Jacobmelgrad on Wikipedia CC BY-SA 3.0.
Rationale
It’s not hard to see intuitively why this works. The feasible region is the set of points that satisfy the problem’s constraints (e.g., those on the circle above); among such points, we want to find the one where the objective function is optimal.
We know that ∇f(u) points in the direction of greatest increase (and opposite to the direction of greatest decrease). However, in our case, we are only allowed to move within the feasible region (the points satisfying the constraint); thus, to minimize the function under the constraint, we want to move in the direction of greatest decrease along the constraint curve (so we never exit the feasible region and eventually reach the minimum).
Suppose the direction of the tangent at a point u on the constraint curve is given by r(u). Then, recalling the formula for vector projection, we want to move opposite to the following direction (the projection of ∇f(u) onto r(u)):

(∇f(u) ⋅ r(u) / (r(u) ⋅ r(u))) r(u)
Similar to the unconstrained case above, it should dawn on you that whenever this projection is 0, we can’t move in any direction along the constraint to further increase f(u) (if at a maximum) or decrease it (if at a minimum).
It’s obvious that for this to be zero, we need r(u) ≠ 0 (so the denominator isn’t zero) and ∇f(u) ⋅ r(u) = 0. For the latter, we know that the normal vector to the constraint curve, ∇g(u), is perpendicular to the tangent r(u). Hence, all we need is for ∇f(u) to be parallel to ∇g(u).
Thereby, it must hold at the optimal point u* that:
1. The normal to the constraint is nonzero: ∇g(u*) ≠ 0 (so that r(u*) ≠ 0)
2. The constraint is satisfied: g(u*) = 0 (trivial requirement)
3. ∇f(u*) ∥ ∇g(u*): there exists some real β where ∇f(u*) = β∇g(u*)
Notice that by rearranging terms and renaming −β as λ, (3) is equivalent to “there exists some real λ where ∇f(u) + λ∇g(u) = 0”.
In other words, ∇ᵤL(u,λ) = 0, and by that, we have intuitively derived the Lagrange multipliers theorem for one constraint (scroll up if needed).
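As a quick numerical sanity check (a sketch using the circle example, whose constrained minimum u* = (−1/√2, −1/√2) we found above), the two gradients are indeed parallel at the optimum:

```python
import numpy as np

# f = x + y, g = x^2 + y^2 - 1; known constrained minimum of the example.
u_star = np.array([-1.0, -1.0]) / np.sqrt(2)
grad_f = np.array([1.0, 1.0])        # gradient of f is constant
grad_g = 2 * u_star                  # gradient of g at u*

beta = grad_f[0] / grad_g[0]         # proportionality constant (negative here)
print(np.allclose(grad_f, beta * grad_g))  # True: grad f is parallel to grad g
```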
Notice that the first condition is called a constraint qualification. If it isn’t satisfied at a point where (2) and (3) hold, then there is no guarantee that the point is optimal, as the projection is not defined there.
Multiple Equality Constraints
When multiple constraints g₁(u), g₂(u),…,gₖ(u) are present, the method smoothly generalizes to the following:
1. Write the Lagrangian L(u, λ₁, λ₂,…,λₖ) = f(u) + λ₁g₁(u) + λ₂g₂(u) +…+ λₖgₖ(u)
2. Solve the n+k equations given by ∇ᵤL(u, λ₁, λ₂,…,λₖ) = 0 (n equations) together with g₁(u) = 0, g₂(u) = 0, …, gₖ(u) = 0 (k equations) to find the n+k unknowns u*₁, u*₂,…,u*ₙ, λ₁, λ₂,…,λₖ
3. The solution is u* = (u*₁, u*₂,…,u*ₙ)
The assumption ∇g(u*) ≠ 0 generalizes to the requirement that ∇g₁(u*), ∇g₂(u*),…,∇gₖ(u*) be linearly independent. This is called LICQ (the linear independence constraint qualification).
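Here is a small sketch of this generalization (the lagrange_solve helper and the three-variable example are made up for illustration):

```python
import sympy as sp

def lagrange_solve(f, constraints, variables):
    """Solve min f subject to g_i = 0 for all i, via the Lagrangian."""
    lams = sp.symbols(f'lambda0:{len(constraints)}')
    L = f + sum(l * g for l, g in zip(lams, constraints))
    # n stationarity equations plus k constraint equations.
    eqs = [sp.diff(L, v) for v in variables] + list(constraints)
    return sp.solve(eqs, list(variables) + list(lams), dict=True)

x, y, z = sp.symbols('x y z', real=True)
# Example: minimize x^2 + y^2 + z^2 subject to x + y + z = 1 and x - y = 0.
print(lagrange_solve(x**2 + y**2 + z**2, [x + y + z - 1, x - y], (x, y, z)))
# [{x: 1/3, y: 1/3, z: 1/3, lambda0: -2/3, lambda1: 0}]
```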
Constrained Optimization with Inequality Constraints
Matters don’t get much more complex when we are instead dealing with an inequality constraint of the form g(u) ≤ 0. In this case, we want the optimum point of f(u) that satisfies g(u) ≤ 0.
For the problem above, this means that the feasible region is not just the points on the circle but also the points inside it. It should be obvious that for this particular problem (though not in general), this does not change the solution.
Constrained Optimization by Jacobmelgrad on Wikipedia CC BY-SA 3.0. Modified by Shading.
Instead of solving the two conditions from the method of Lagrange multipliers ((2) and (3) above), we solve a set of four conditions, called the KKT conditions, that generalize the Lagrange multiplier case. We can derive them as follows:
Inequality constraint diagram for optimization problems by Onmyphd on Wikipedia CC BY-SA 3.0.
Observe that with an arbitrary surface f(u) and a constraint g(u) ≤ 0, there are exactly two possibilities (assuming a convex, smooth function with one optimum):
1. The optimum point uᴾ lies inside the feasible region.
- In this case, the solution to the optimization problem u* must be uᴾ, and g(u*) < 0 must hold (left image).
- It’s impossible to find a better point in the feasible region, because uᴾ is the optimal point (e.g., the minimum) over the entire domain of f(u).
2. The optimum point uᴾ lies outside the feasible region.
- In this case, f(u) must be only decreasing within the feasible region if that point is a maximum (it can’t increase again, else that would create another optimum)
- f(u) must be only increasing within the feasible region if that point is a minimum (it can’t decrease again, else that would create another optimum)
- Thus, the optimum point u* must be at the edge of the feasible region, as the objective never gets better inside (g(u*) = 0 must hold)
In the first case, it’s obvious that solving the optimization problem is equivalent to solving the unconstrained version of it.
∇ᵤf(u) = 0
We say that the constraint is “inactive” because it doesn’t make a difference in the optimization problem.
In the second case, it’s obvious that solving the optimization problem is equivalent to solving the equality constraint version of it (Lagrange multipliers).
The only catch in this case is that λ must be ≥ 0 for minimization and ≤ 0 for maximization. For minimization, this implies that ∇ᵤf(u) and ∇ᵤg(u) point in opposite directions (i.e., β in ∇ᵤf(u) = β∇ᵤg(u) is ≤ 0). This has to hold because ∇ᵤg(u) points toward the side where g(u) > 0 (a basic property of gradients), while ∇ᵤf(u) points toward the side where g(u) < 0, because that’s the direction in which f(u) is increasing. A similar argument can easily be constructed for the case of maximization.
But we don’t know beforehand which of these two cases applies. We can merge their methods as follows (assuming minimization):
1. Write the Lagrangian function L(u, λ) = f(u) + λg(u)
2. Set ∇ᵤL(u, λ) = 0 (n equations) and g(u) ≤ 0
3. Solve for the solution (u*₁, u*₂,…,u*ₙ, λ) where one of the cases above applies:
- λ = 0 and g(u*) < 0 (the first case, as λ = 0 means ∇ᵤL(u, λ) = ∇ᵤf(u) = 0, so steps 1 and 2 are equivalent to solving ∇ᵤf(u) = 0)
- g(u*) = 0 and λ ≥ 0 (the second case, as g(u*) = 0 means applying Lagrange multipliers is correct, and that’s what we did)
We can summarize these two bullets by requiring that g(u*) ≤ 0 and λ ≥ 0 hold, and that λg(u*) = 0 holds (one of λ or g(u*) must be zero). This implies that, given an optimization problem of the form:
Minimize f(u) such that g(u)≤0
We expect the following four conditions to be satisfied for the optimal point u*:
- Stationarity: ∇ᵤL(u*, λ) = 0
- Primal Feasibility: g(u*) ≤ 0
- Dual Feasibility: λ ≥ 0
- Complementary Slackness: λg(u*) = 0
and that solving these conditions together yields the optimal point u*. In reality, for convex problems these conditions are sufficient but not necessary for optimality. That is,
- If a point satisfies these conditions (e.g., it was found by solving them together), then that suffices to prove that the point is optimal (no need to look further for convex problems).
- Meanwhile, these conditions are not necessary for a point to be optimal. It is possible that solving the conditions gives no solution even though an optimal point exists that doesn’t satisfy them. For example, consider f(x) = x with the constraint x² ≤ 0 (both this and another KKT example are solved in this document).
Once we enforce a constraint qualification such as LICQ (stated earlier), we can guarantee that the KKT conditions are both sufficient and necessary. An alternative constraint qualification that is easier to check is Slater’s condition, which guarantees that KKT is necessary and sufficient for convex problems.
Slater’s condition simply states that the feasible region must have an interior point. That is, for a constraint g(u) ≤ 0, there must be point(s) in the domain of f(u) satisfying g(u) < 0. This is a basic condition that is almost always satisfied in real-life problems (though not in the counterexample above), which means that KKT will rarely miss an optimal solution.
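As a sketch of an inequality-constrained solve (reusing the circle example, now with g(u) ≤ 0, and SciPy’s SLSQP solver), we can find u* numerically and then verify the KKT conditions by hand:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(u) = u1 + u2 subject to g(u) = u1^2 + u2^2 - 1 <= 0.
f = lambda u: u[0] + u[1]
g = lambda u: u[0]**2 + u[1]**2 - 1

# SciPy expects inequality constraints in the form c(u) >= 0, hence -g.
res = minimize(f, x0=[0.5, 0.5], method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda u: -g(u)}])
u = res.x                             # approx [-0.7071, -0.7071]

# Recover the multiplier from stationarity: grad f + lam * grad g = 0.
grad_f, grad_g = np.array([1.0, 1.0]), 2 * u
lam = -grad_f[0] / grad_g[0]          # approx 0.7071 = 1/sqrt(2)
print(lam >= 0, abs(lam * g(u)) < 1e-6)  # dual feasibility, compl. slackness
```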
Multiple Constraints
When multiple equality constraints h₁(u), h₂(u),…,hₖ(u) are present along with multiple inequality constraints g₁(u), g₂(u),…,gₚ(u), the method smoothly generalizes: we write the full Lagrangian and apply the extra KKT conditions only to the inequality constraints and their multipliers (which we will call α):
0. Write the Lagrangian
L(u, λ₁,…,λₖ, α₁,…,αₚ) = f(u) + λ₁h₁(u) +…+ λₖhₖ(u) + α₁g₁(u) +…+ αₚgₚ(u)
1. Set ∇ᵤL(u, λ₁, λ₂,…,λₖ, α₁, α₂,…,αₚ) = 0 (n equations)
2. Set h₁(u) = 0, h₂(u) = 0, …, hₖ(u) = 0 (k equations) and set g₁(u) ≤ 0, g₂(u) ≤ 0, …, gₚ(u) ≤ 0 (p inequalities)
3. Set α₁ ≥ 0, α₂ ≥ 0, …, αₚ ≥ 0 (p inequalities)
4. Set α₁g₁(u) = α₂g₂(u) = … = αₚgₚ(u) = 0 (p equations)
In total, you have n+k+p equations and 2p inequalities, which you solve together to find the n+k+p variables (u*₁, u*₂,…,u*ₙ, λ₁, λ₂,…,λₖ, α₁, α₂,…,αₚ), yielding the solution u* = (u*₁, u*₂,…,u*ₙ) that minimizes the function while satisfying the k+p constraints.
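A small SymPy sketch of this procedure (with one made-up equality constraint and one made-up inequality constraint): solve stationarity, equality feasibility, and complementary slackness as equations, then filter the candidates by primal and dual feasibility:

```python
import sympy as sp

x, y, lam, alp = sp.symbols('x y lambda alpha', real=True)
f = x**2 + y**2
h = x + y - 1                  # equality constraint: h(u) = 0
g = x - sp.Rational(1, 4)      # inequality constraint: g(u) <= 0
L = f + lam * h + alp * g      # full Lagrangian

# Stationarity (2 eqs), equality feasibility, complementary slackness.
eqs = [sp.diff(L, x), sp.diff(L, y), h, alp * g]
for s in sp.solve(eqs, [x, y, lam, alp], dict=True):
    # Keep only candidates that also satisfy primal and dual feasibility.
    if s[alp] >= 0 and g.subs(s) <= 0:
        print(s)  # x = 1/4, y = 3/4 with alpha = 1 >= 0
```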
The Duality Principle
The duality principle simply states that for any optimization problem, we can write a dual optimization problem that, when solved, either tells us something about the original problem (called the primal) or outright solves it.
For any optimization problem of the form:

Minimize f(u) such that g(u) ≤ 0
The dual optimization problem takes the form:

Maximize over λ ≥ 0: Minᵤ [f(u) + λg(u)]
Yes, what’s minimized takes the same form as the Lagrangian
and vice versa if it is maximization.
Example
For instance, the constrained optimization problem that we discussed earlier:

Minimize x + y such that x² + y² − 1 ≤ 0
has the corresponding dual problem:

Maximize over λ ≥ 0: Minₓ,ᵧ [x + y + λ(x² + y² − 1)]
As basic calculus suggests, to carry out the inner minimization we first set

∂L/∂x = 1 + 2λx = 0 and ∂L/∂y = 1 + 2λy = 0
which implies

x = y = −1/(2λ)
and thereby, the optimization problem becomes

Maximize over λ ≥ 0: −1/(2λ) − λ
and all it takes now is to differentiate this and equate to zero to get λ = 1/√2, which implies that (x*, y*) = (−1/√2, −1/√2), the same solution that we obtained by solving the primal problem via KKT.
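A quick numerical check of this dual solve (a sketch using SciPy’s scalar minimizer on the negated dual function):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Dual function of the example: d(lam) = -1/(2*lam) - lam, for lam > 0.
d = lambda lam: -1 / (2 * lam) - lam

# Maximize d by minimizing -d over a positive interval.
res = minimize_scalar(lambda lam: -d(lam), bounds=(1e-6, 10), method='bounded')
lam_star = res.x
print(lam_star)              # ~0.7071 = 1/sqrt(2)
print(-1 / (2 * lam_star))   # x* = y* ~ -0.7071, matching the primal solution
print(d(lam_star))           # ~-1.4142 = -sqrt(2), the shared optimal value
```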
Deriving the Dual
The primal (original) problem is

Minimize f(u) such that g(u) ≤ 0
Suppose we define a function P(u) that returns infinity whenever u is not in the feasible region (i.e., doesn’t satisfy the constraint), and zero otherwise:

P(u) = 0 if g(u) ≤ 0, and P(u) = ∞ if g(u) > 0
In this case, the primal problem is equivalent to

Minᵤ [f(u) + P(u)]
This should make sense because f(u) + P(u) sets everything outside of the feasible region to infinity and leaves the feasible region as is. The minimum of this sum must occur in the feasible region, even though the constraint is no longer enforced explicitly, because infinity is greater than anything.
Observe that we can claim that:

P(u) = Max over λ ≥ 0 of λg(u)
Because with this, if:
- If g(u) < 0, then P(u) = 0 as defined earlier, since λ = 0 must hold to maximize the quantity λg(u) (anything else makes it negative due to g(u) < 0)
- If g(u) = 0, then P(u) = 0 as defined earlier, as λg(u) will be zero (λ can be anything)
- If g(u) > 0, then P(u) = ∞ as defined earlier, as λ → ∞ is what would maximize λg(u)
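A tiny numeric illustration of this claim (a sketch; the finite grid of multipliers stands in for the unbounded range λ ≥ 0):

```python
import numpy as np

# Approximate P(u) = max over lam >= 0 of lam * g(u) on a finite grid.
lams = np.linspace(0, 1e6, 1_000_001)
for g_val in (-0.5, 0.0, 0.5):
    print(g_val, (lams * g_val).max())
# -0.5 -> 0.0, 0.0 -> 0.0, 0.5 -> 5e5 (growing without bound, i.e., infinity)
```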
Thereby, the primal problem is equivalent to

Minᵤ Max over λ ≥ 0 of [f(u) + λg(u)]
It’s okay to bring f inside the Max since it isn’t a function of λ
The difference between this and the dual problem is that in the dual, the Max and Min are swapped.
Thus, because in general MinMax(…) ≥ MaxMin(…), a solution to the dual is a lower bound on the solution of the primal. This is called weak duality.
A more interesting case is when MinMax(…) = MaxMin(…), where a solution to the dual is exactly the solution to the primal as well (as in the example). This is called strong duality. One can prove, without much difficulty, that the equality holds (and hence strong duality) whenever KKT is both necessary and sufficient. In other words, strong duality holds for convex problems whenever Slater’s condition holds!
So What?
If you think about it, solving the dual problem is akin to applying only the stationarity and dual feasibility conditions of KKT to the primal problem. Instead of applying primal feasibility and complementary slackness, you deal with an extra optimization over the dual variables. In many cases, this is much easier than solving the KKT system on the primal problem; the optimization over dual variables can, for instance, be tackled with linear or quadratic programming.
Multiple Constraints?
In generalizing to multiple constraints, the Lagrangian changes just as you would expect (similar to what we have seen), and we only add α ≥ 0 conditions on the multipliers associated with inequality constraints in the outer maximization.
Hope this story has helped you truly understand unconstrained optimization, Lagrange multipliers, KKT conditions, and duality. Till next time, au revoir!