Motivation

Machine learning (ML) is a powerful field driven by mathematics. Whether you’re building models, optimizing algorithms, or simply trying to understand how ML works under the hood, mastering the core equations is essential. This blog post is designed to be your go-to resource, covering the most critical and “mind-breaking” ML equations—enough to grasp most of the core math behind ML. Each section includes theoretical insights, the equations themselves, and practical implementations in Python, so you can see the math in action.

This guide is for anyone with a basic background in math and programming who wants to deepen their understanding of ML; it was inspired by this tweet from @goyal__pramod. Let's dive into the equations that power this fascinating field!


Introduction

Mathematics is the language of machine learning. From probability to linear algebra, optimization to advanced generative models, equations define how ML algorithms learn from data and make predictions. This blog post compiles the most essential equations, explains their significance, and provides practical examples using Python libraries like NumPy, scikit-learn, TensorFlow, and PyTorch. Whether you’re a beginner or an experienced practitioner, this guide will equip you with the tools to understand and apply ML math effectively.


Probability and Information Theory

Probability and information theory provide the foundation for reasoning about uncertainty and measuring differences between distributions.

Bayes’ Theorem

Equation:

\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]

Explanation: Bayes’ Theorem describes how to update the probability of a hypothesis ($A$) given new evidence ($B$). It’s a cornerstone of probabilistic reasoning and is widely used in machine learning for tasks like classification and inference.

Practical Use: Applied in Naive Bayes classifiers, Bayesian networks, and Bayesian optimization.

Implementation:

def bayes_theorem(p_d, p_t_given_d, p_t_given_not_d):
    """
    Calculate P(D|T+) using Bayes' Theorem.
    
    Parameters:
    p_d: P(D), probability of having the disease
    p_t_given_d: P(T+|D), probability of testing positive given disease
    p_t_given_not_d: P(T+|D'), probability of testing positive given no disease
    
    Returns:
    P(D|T+), probability of having the disease given a positive test
    """
    p_not_d = 1 - p_d
    p_t = p_t_given_d * p_d + p_t_given_not_d * p_not_d
    p_d_given_t = (p_t_given_d * p_d) / p_t
    return p_d_given_t

# Example usage
p_d = 0.01  # 1% of population has the disease
p_t_given_d = 0.99  # Test is 99% sensitive
p_t_given_not_d = 0.02  # Test has 2% false positive rate
result = bayes_theorem(p_d, p_t_given_d, p_t_given_not_d) 
print(f"P(D|T+) = {result:.4f}")  # Output: P(D|T+) = 0.3333 

Entropy

Equation:

\[H(X) = -\sum_{x \in X} P(x) \log P(x)\]

Explanation: Entropy measures the uncertainty or randomness in a probability distribution. It quantifies the amount of information required to describe the distribution and is fundamental in understanding concepts like information gain and decision trees.

Practical Use: Used in decision trees, information gain calculations, and as a basis for other information-theoretic measures.

Implementation:

import numpy as np

def entropy(p):
    """
    Calculate entropy of a probability distribution.
    
    Parameters:
    p: Probability distribution array
    
    Returns:
    Entropy value in nats (natural log)
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes (0 log 0 := 0)
    return -np.sum(p * np.log(p))

# Example usage
fair_coin = np.array([0.5, 0.5])  # fair coin has the same probability of heads and tails
print(f"Entropy of fair coin: {entropy(fair_coin)}")  # Output: 0.6931471805599453 

biased_coin = np.array([0.9, 0.1])  # biased coin has a higher probability of heads
print(f"Entropy of biased coin: {entropy(biased_coin)}")  # Output: 0.4698716731013394 

Joint and Conditional Probability

Equations:

  • Joint Probability:

    \[P(A, B) = P(A|B) P(B) = P(B|A) P(A)\]
  • Conditional Probability:

    \[P(A|B) = \frac{P(A, B)}{P(B)}\]

Explanation: Joint probability describes the likelihood of two events occurring together, while conditional probability measures the probability of one event given another. These are the building blocks of Bayesian methods and probabilistic models.

Practical Use: Used in Naive Bayes classifiers and probabilistic graphical models.

Implementation:

from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([0, 0, 1, 1])
model = GaussianNB().fit(X, y)
print(model.predict([[3.5, 4.5]]))  # Output: [1]
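
The classifier above uses these rules implicitly. To see the equations directly, here is a minimal sketch with a made-up 2×2 joint distribution, recovering the marginal and the conditional by hand:

import numpy as np

# Hypothetical joint distribution P(A, B) over two binary variables.
# Rows index A, columns index B; the entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_b = joint.sum(axis=0)           # marginal P(B), summing over A
p_a_given_b = joint / p_b         # conditional P(A|B) = P(A, B) / P(B)
print(f"P(B): {p_b}")             # [0.4 0.6]
print(f"P(A|B):\n{p_a_given_b}")  # each column sums to 1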

Kullback-Leibler Divergence (KLD)

Equation:

\[D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)\]

Explanation: KLD measures how much one probability distribution $P$ diverges from another $Q$. It’s asymmetric and foundational in information theory and generative models.

Practical Use: Used in variational autoencoders (VAEs) and model evaluation.

Implementation:

import numpy as np

P = np.array([0.7, 0.3])
Q = np.array([0.5, 0.5])
kl_div = np.sum(P * np.log(P / Q))
print(f"KL Divergence: {kl_div}")  # Output: 0.08228287850505156

Cross-Entropy

Equation:

\[H(P, Q) = -\sum_{x \in \mathcal{X}} P(x) \log Q(x)\]

Explanation: Cross-entropy quantifies the difference between the true distribution $P$ and the predicted distribution $Q$. It’s a widely used loss function in classification.

Practical Use: Drives training in logistic regression and neural networks.

Implementation:

import numpy as np

# Binary cross-entropy over three examples (the general form reduces to this for two classes)
y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.1, 0.8])
cross_entropy = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(f"Cross-Entropy: {cross_entropy}")  # Output: ~0.1446

Linear Algebra

Linear algebra powers the transformations and structures in ML models.

Linear Transformation

Equation:

\[y = Ax + b \quad \text{where } A \in \mathbb{R}^{m \times n}, x \in \mathbb{R}^n, y \in \mathbb{R}^m, b \in \mathbb{R}^m\]

Explanation: This equation represents a linear mapping of input $x$ to output $y$ via matrix $A$ and bias $b$. It’s the core operation in neural network layers.

Practical Use: Foundational for linear regression and neural networks.

Implementation:

import numpy as np

A = np.array([[2, 1], [1, 3]])
x = np.array([1, 2])
b = np.array([0, 1])
y = A @ x + b
print(y)  # Output: [4 8]

Eigenvalues and Eigenvectors

Equation:

\[Av = \lambda v \quad \text{where } \lambda \in \mathbb{R}, v \in \mathbb{R}^n, v \neq 0\]

Explanation: Eigenvalues $\lambda$ and eigenvectors $v$ describe how a matrix $A$ scales and rotates space, crucial for understanding data variance.

Practical Use: Used in Principal Component Analysis (PCA).

Implementation:

import numpy as np

A = np.array([[4, 2], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

Singular Value Decomposition (SVD)

Equation:

\[A = U \Sigma V^T\]

Explanation: SVD breaks down a matrix $A$ into orthogonal matrices $U$ and $V$ and a diagonal matrix $\Sigma$ of singular values. It reveals the intrinsic structure of data.

Practical Use: Applied in dimensionality reduction and recommendation systems.

Implementation:

import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
U, S, Vt = np.linalg.svd(A)
print(f"U:\n{U}\nS: {S}\nVt:\n{Vt}")

Optimization

Optimization is how ML models learn from data.

Gradient Descent

Equation:

\[\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta)\]

Explanation: Gradient descent updates parameters $\theta$ by moving opposite to the gradient of the loss function $L$, scaled by learning rate $\eta$.

Practical Use: The backbone of training most ML models.

Implementation:

import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        theta -= lr * gradient
    return theta

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([1, 2, 3])
theta = gradient_descent(X, y, lr=0.1)  # a larger learning rate converges within 1000 epochs
print(theta)  # Output: ~[0. 1.]
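
As a sanity check, the same coefficients can be obtained in closed form with ordinary least squares (here via np.linalg.lstsq); gradient descent approaches this solution iteratively:

# Closed-form least-squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_exact)  # ~[0. 1.]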

Backpropagation

Equation:

\[\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}\]

Explanation: Backpropagation applies the chain rule to compute gradients of the loss $L$ with respect to weights $w_{ij}$ in neural networks.

Practical Use: Enables efficient training of deep networks.

Implementation:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.tensor([[0., 0.], [1., 1.]], dtype=torch.float32)
y = torch.tensor([[0.], [1.]], dtype=torch.float32)

optimizer.zero_grad()
output = model(X)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")

Loss Functions

Loss functions measure model performance and guide optimization.

Mean Squared Error (MSE)

Equation:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Explanation: MSE calculates the average squared difference between true $y_i$ and predicted $\hat{y}_i$ values, penalizing larger errors more heavily.

Practical Use: Common in regression tasks.

Implementation:

import numpy as np

y_true = np.array([1, 2, 3])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_true - y_pred)**2)
print(f"MSE: {mse}")  # Output: 0.01

Cross-Entropy Loss

(See Cross-Entropy above for details.)


Advanced ML Concepts

These equations power cutting-edge ML techniques.

Diffusion Process

Equation:

\[x_t = \sqrt{\alpha_t} x_0 + \sqrt{1 - \alpha_t} \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I)\]

Explanation: This describes the forward diffusion process, in which the data $x_0$ is gradually corrupted with Gaussian noise as $t$ increases; $\alpha_t$ (often written $\bar{\alpha}_t$) comes from the noise schedule and shrinks toward 0 over time. It is the key idea behind diffusion models.

Practical Use: Used in generative AI like image synthesis.

Implementation:

import torch

x_0 = torch.tensor([1.0])
alpha_t = 0.9
noise = torch.randn_like(x_0)
x_t = torch.sqrt(torch.tensor(alpha_t)) * x_0 + torch.sqrt(torch.tensor(1 - alpha_t)) * noise
print(f"x_t: {x_t}")

Convolution Operation

Equation:

\[(f * g)(t) = \int f(\tau) g(t - \tau) \, d\tau\]

Explanation: Convolution combines two functions by sliding one over the other, extracting features in data like images.

Practical Use: Core to convolutional neural networks (CNNs).

Implementation:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3)
image = torch.randn(1, 1, 28, 28)
output = conv(image)
print(output.shape)  # Output: torch.Size([1, 1, 26, 26])
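
The discrete analogue of the integral, $(f * g)[n] = \sum_m f[m]\, g[n - m]$, can be computed directly with NumPy, which makes the slide-and-sum explicit:

import numpy as np

f = np.array([1, 2, 3])
g = np.array([0, 1, 0.5])
print(np.convolve(f, g))  # [0.  1.  2.5 4.  1.5]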

Softmax Function

Equation:

\[\sigma(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

Explanation: Softmax converts raw scores $z_i$ into probabilities, summing to 1, ideal for multi-class classification.

Practical Use: Used in neural network outputs.

Implementation:

import numpy as np

z = np.array([1.0, 2.0, 3.0])
softmax = np.exp(z) / np.sum(np.exp(z))
print(f"Softmax: {softmax}")  # Output: [0.09003057 0.24472847 0.66524096]

Attention Mechanism

Equation:

\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V\]

Explanation: Attention computes a weighted sum of values $V$ based on the similarity between queries $Q$ and keys $K$, scaled by $\sqrt{d_k}$.

Practical Use: Powers transformers in NLP and beyond.

Implementation:

import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, V)

Q = torch.tensor([[1., 0.], [0., 1.]])
K = torch.tensor([[1., 1.], [1., 0.]])
V = torch.tensor([[0., 1.], [1., 0.]])
output = attention(Q, K, V)
print(output)
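
If you are on PyTorch 2.0 or newer, the same computation is available as a built-in, which serves as a quick cross-check of the manual version (batch dimensions added to match the expected input shape):

import torch.nn.functional as F

# Built-in scaled dot-product attention; unsqueeze adds a batch dimension
builtin = F.scaled_dot_product_attention(Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0))
print(builtin.squeeze(0))  # matches attention(Q, K, V)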

Conclusion

This blog post has explored the most critical equations in machine learning, from foundational probability and linear algebra to advanced concepts like diffusion and attention. With theoretical explanations and practical implementations, you now have a compact resource for understanding and applying the math behind ML. Point anyone asking about core ML math here: it covers most of what they need in one place!

