
A statistical perspective
June 13, 2026
Let \(f:\mathcal X \to \mathbb R\) be a real-valued function defined on an input set \(\mathcal X\).
Definition 1 (Differentiability) \(f\) is said to be differentiable at a point \(x \in \mathcal X\) if the limit
\[ L = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{1}\]
exists. If \(L\) exists, we denote it by \(f'(x)\) or \(\frac{df}{dx}(x)\), and call it the derivative of \(f\) at \(x\). Further, \(f\) is said to be differentiable on \(\mathcal X\) if it is differentiable at every point in \(\mathcal X\).
For now, we assume \(\mathcal X \subseteq \mathbb R\), and will extend to higher dimensions later.
| Function | Derivative |
|---|---|
| \(f(x) = x^2\) | \(f'(x) = 2x\) |
| \(f(x) = \sum_{n} a_n x^n\) | \(f'(x) = \sum_{n} n a_n x^{n-1}\) |
| \(f(x) = \sin(x)\) | \(f'(x) = \cos(x)\) |
| \(f(x) = \cos(x)\) | \(f'(x) = -\sin(x)\) |
| \(f(x) = e^x\) | \(f'(x) = e^x\) |
| \(f(x) = \ln(x)\) | \(f'(x) = \frac{1}{x}\) |
Each entry in the table can be derived “by hand” from the definition. For example, let \(f(x) = x^2\). Then,
\[ \begin{align} \lim_{h \to 0} & \frac{f(x + h) - f(x)}{h} \\ &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\ &= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\[0.5em] &= \lim_{h \to 0} \frac{2xh + h^2}{h} \\[0.5em] &= \lim_{h \to 0} (2x + h) \\[0.5em] &= 2x. \end{align} \]
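The same limit can be watched numerically. The sketch below is only an illustration: the evaluation point `x0 = 3.0` and the helper name `difference_quotient` are assumptions made here, not part of the slides. It evaluates the quotient from (1) for shrinking \(h\) and shows it approaching \(2x\).

```python
def difference_quotient(f, x, h):
    """Return (f(x + h) - f(x)) / h, the quotient inside the limit in (1)."""
    return (f(x + h) - f(x)) / h


def square(x):
    """The example function f(x) = x^2."""
    return x ** 2


# Assumed illustration point; the exact derivative there is 2 * x0 = 6.
x0 = 3.0

# Shrink h from 1e-1 down to 1e-6 and record the quotient each time.
quotients = [difference_quotient(square, x0, 10.0 ** -k) for k in range(1, 7)]

for k, q in enumerate(quotients, start=1):
    print(f"h = 1e-{k}: quotient = {q:.6f}")
```

For \(f(x) = x^2\) the quotient equals \(2x + h\), so each printed value should sit about \(h\) above 6, up to floating-point rounding.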
The derivative of a function tells you its rate of change: the slope of \(f\) at the point \(x\).

Example 1 Find the maximum of \(f(x) = -3x^4 + 4x^3 + 12x^2\).
\[ \begin{align*} f'(x) = -12x^3 + 12x^2 + 24x &= 0 \\ \Leftrightarrow 12x(2 + x - x^2) &= 0 \\ \Leftrightarrow 12x(x+1)(x-2) &= 0 \\ \Leftrightarrow x &= 0, -1, 2. \end{align*} \]
Are all of these critical points maxima? 🤔

Second derivative test: Measure the change in slope around a critical point \(\hat x\), i.e. evaluate \(f''(x) = \frac{d}{dx}\left( \frac{df}{dx}(x) \right) = \frac{d^2f}{dx^2}(x)\) at \(x = \hat x\).
| Behaviour of \(f\) near \(\hat x\) | \(f''(\hat x)\) | Shape | Conclusion |
|---|---|---|---|
| Increasing → Decreasing | \(f''(\hat x) < 0\) | Concave (∩) | Local maximum |
| Decreasing → Increasing | \(f''(\hat x) > 0\) | Convex (∪) | Local minimum |
| No sign change / flat region | \(f''(\hat x) = 0\) | Unknown / flat | Inconclusive |
From Example 1, the second derivative is given by \[ \begin{align*} f''(x) &= \frac{d}{dx}\left(-12x^3 + 12x^2 +24x\right) \\ &= -36x^2 + 24x + 24 \end{align*} \]
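Plug in the critical points: \[ f''(-1) = -36 < 0, \qquad f''(0) = 24 > 0, \qquad f''(2) = -72 < 0. \] So \(x = -1\) and \(x = 2\) are local maxima, while \(x = 0\) is a local minimum. Since \(f(-1) = 5\) and \(f(2) = 32\), the maximum is attained at \(x = 2\).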
Tip
Often it is not enough to just differentiate once to find optima. Differentiate twice to classify critical points.
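As a quick check of Example 1, the following Python sketch evaluates \(f'\) and \(f''\) at the three critical points and applies the second derivative test. The function names (`classify`, `f_prime`, and so on) are chosen here for illustration only.

```python
def f(x):
    """Objective from Example 1."""
    return -3 * x**4 + 4 * x**3 + 12 * x**2


def f_prime(x):
    """First derivative of f."""
    return -12 * x**3 + 12 * x**2 + 24 * x


def f_double_prime(x):
    """Second derivative of f."""
    return -36 * x**2 + 24 * x + 24


def classify(x_hat):
    """Apply the second derivative test at a critical point x_hat."""
    curvature = f_double_prime(x_hat)
    if curvature < 0:
        return "local maximum"
    if curvature > 0:
        return "local minimum"
    return "inconclusive"


critical_points = [-1, 0, 2]

# Confirm the slope really is zero at each candidate.
all_critical = all(f_prime(x_hat) == 0 for x_hat in critical_points)

for x_hat in critical_points:
    print(x_hat, f(x_hat), f_double_prime(x_hat), classify(x_hat))

# Among the critical points, pick the one with the largest value of f.
best = max(critical_points, key=f)
```

Because \(f(x) \to -\infty\) as \(x \to \pm\infty\), the largest value among the critical points is also the overall maximum.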
Example 2 (Pricing a cup of fries 🍟) Your Potato Story outlet sells 80 cups a day at RM4 each. Market testing shows that every RM1 you add to the price loses 10 cups a day. What price brings in the most money?
At a price of \(x\) ringgit, the number of cups sold each day is \[ q(x) = 80 - 10(x - 4) = 120 - 10x. \] So the daily revenue to maximise is \[ R(x) = x \times q(x) = 120x - 10x^2. \]
Set the derivative to zero: \[ R'(x) = 120 - 20x = 0 \;\Rightarrow\; x = \text{RM}\,6. \] Since \(R''(x) = -20 < 0\), it is a maximum: sell 60 cups for RM360 a day (beating RM320 at the old price!).
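The result can be sanity-checked by brute force. This sketch assumes a hypothetical grid of candidate prices from RM0 to RM12 in 10 sen steps (the grid is not part of the example) and simply picks the price with the highest revenue.

```python
def quantity(price):
    """Cups sold per day at a given price in RM: q(x) = 120 - 10x."""
    return 120 - 10 * price


def revenue(price):
    """Daily revenue in RM: R(x) = x * q(x)."""
    return price * quantity(price)


# Assumed search grid: RM0.00 to RM12.00 in steps of 10 sen.
candidate_prices = [sen / 100 for sen in range(0, 1201, 10)]

# Pick the candidate price with the largest revenue.
best_price = max(candidate_prices, key=revenue)

print(best_price, quantity(best_price), revenue(best_price))
print(revenue(4))  # revenue at the old price, for comparison
```

A grid search only finds the best price on the grid; the calculus argument is what shows RM6 is the exact maximiser.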
Note
Every potato matters! 🥔 Same recipe everywhere: write the quantity → set the derivative to zero → check it’s a maximum.

The peak sits exactly where the slope is flat — where \(R'(x) = 0\). That is all “setting the derivative to zero” is really doing. 🎯
[Figure: neural network diagram with an input layer, a hidden layer, and an output labelled “cat 🐱”.]
Transformers, the tech behind LLMs, and Deep Learning: 3B1B https://youtu.be/wjZofJX0v4M?si=IzEE6fzSlDLLpLsc
We cracked the potato problem by hand. But what is an “AI”? Underneath, it is a network of simple units wired together:

An LLM is a giant version of this, with billions of arrows.
\[ h = \sigma(\, \overbrace{w_1 x_1 + w_2 x_2 + \cdots + b}^{\text{a linear equation}} \,) \]
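A single unit of this kind takes only a few lines of Python. The slides do not say which activation \(\sigma\) is; the sketch below assumes the logistic sigmoid, and the input, weight, and bias values are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Compute h = sigma(w1*x1 + w2*x2 + ... + b) for one unit."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical values, chosen so the linear part is exactly zero.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.0

h = neuron(inputs, weights, bias)
print(h)
```

With these values the linear part is \(0.5 \times 1 - 0.25 \times 2 + 0 = 0\), and the sigmoid maps 0 to 0.5.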

It is the potato’s \(R'(x)=0\) all over again, just applied to the error \(L\), for every weight at once: \[ \underbrace{\frac{\partial L}{\partial w} = 0}_{\textstyle\text{the goal}} \qquad\text{reached step by step via}\qquad \underbrace{w \;\leftarrow\; w - \alpha\,\frac{\partial L}{\partial w}}_{\textstyle\text{gradient descent}} \] Here \(\alpha > 0\) is the step size (learning rate).
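To see the update rule in action on a problem we already solved, the sketch below runs gradient descent on the potato example, treating the price as the single “weight” and taking the loss to be the negative revenue, \(L(x) = -R(x) = 10x^2 - 120x\). The starting price, step size, and number of steps are assumptions picked for illustration.

```python
def loss_gradient(x):
    """Derivative of the loss L(x) = -R(x) = 10x^2 - 120x."""
    return 20 * x - 120


def gradient_descent(start, learning_rate, steps):
    """Repeat the update x <- x - alpha * dL/dx a fixed number of times."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * loss_gradient(x)
    return x


# Assumed settings: start at the old price of RM4, alpha = 0.01, 100 steps.
price = gradient_descent(start=4.0, learning_rate=0.01, steps=100)
print(price)
```

Each step shrinks the distance to RM6 by a factor of \(1 - 20\alpha = 0.8\), so the iterate settles at the price where the derivative is zero. A much larger step size (here, above 0.1) would overshoot and diverge.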
📈 Derivatives represent rate of change (slope) of a function \(f:\mathcal X \to \mathbb R\).
🎯 Interested in optimising an objective function \(f(x)\) representing some kind of “reward” or “cost”.
🔍 Find critical points by solving \(f'(x) = 0\).
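Use the second derivative test to classify critical points: \(f''(\hat x) < 0\) gives a local maximum, \(f''(\hat x) > 0\) a local minimum, and \(f''(\hat x) = 0\) is inconclusive.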
Statistics is a scientific subject that deals with the collection, analysis, interpretation, and presentation of data.
Collection means designing experiments, questionnaires, and sampling schemes, as well as administering the data collection.
Analysis means mathematical modelling, estimation, testing, and forecasting.

Demand isn’t fixed—each passer-by is a ‘gamble’. So at today’s price, what’s the chance a customer buys?
At the Potato Story stall, \(n\) customers walk past at today’s price and each either buys a cup or doesn’t. I want \(p\), the probability that a customer buys. Let \(X_i=1\) if customer \(i\) buys, and \(X_i=0\) if they walk on.
🍟
I do not know the value of \(p\), so I want to estimate it somehow.
I have a “guess” at what it might be, e.g. \(p=0.5\) or \(p=0.7\).
How do I objectively decide which value is better?
Each \(X_i\) is a random variable with only two possible outcomes, i.e.
\[ X_i = \begin{cases} 1 &\text{w.p. } \ \ p \quad (\text{buys}) \\ 0 &\text{w.p. } \ \ 1-p \quad (\text{walks on}) \\ \end{cases} \]
This is known as a Bernoulli random variable.
Suppose that \(X=X_1 + \dots + X_n\), so that we are counting the number of cups sold to \(n\) customers. If the customers decide independently of one another, each with the same \(p\), then \(X\) is a binomial random variable. We write \(X \sim \operatorname{Bin}(n, p)\), and the probability mass function is given by
\[ f(x \mid p) = \Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n. \]
Often we might want to find quantities such as \(\operatorname{E}(X)=np\) and \(\operatorname{Var}(X)=np(1-p)\), but we will not go into details here.
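If \(p\) were known, these quantities would be easy to compute. The sketch below assumes \(n = 10\) and \(p = 0.7\) purely for illustration, evaluates the probability mass function at every \(x\), and checks that the mean and variance agree with \(np\) and \(np(1-p)\).

```python
from math import comb


def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Assumed values for illustration only.
n, p = 10, 0.7

# Probability of each possible count x = 0, 1, ..., n.
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)  # probabilities should add up to 1
mean = sum(x * prob for x, prob in enumerate(pmf))
variance = sum((x - mean) ** 2 * prob for x, prob in enumerate(pmf))

print(total, mean, variance)
print(n * p, n * p * (1 - p))  # the closed-form mean and variance
```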
\(p\) is unknown
If we do not know \(p\), then it is not possible to calculate probabilities, expectations, variances… ☹️
Naturally, you watch the next \(n=10\) customers walk past the stall:
🍟 🍟 🍟 🚶 🚶 🍟 🍟 🚶 🍟 🍟
A total of \(X=7\) buy a cup (🍟), and from this you surmise that customers are fairly likely to buy, because:
Definition 2 Given a probability function \(x \mapsto f(x\mid\theta)\) where \(x\) is a realisation of a random variable \(X\), the likelihood function is \(\theta \mapsto f(x\mid\theta)\), often written \(\mathcal L(\theta) = f(x \mid \theta)\).

The value \(\hat \theta\) which maximises \(\mathcal L(\theta)\) is called the maximum likelihood estimator (MLE) of \(\theta\).
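For the stall data (\(X = 7\) buyers out of \(n = 10\)), the likelihood lets us compare the two guesses \(p = 0.5\) and \(p = 0.7\) directly. The sketch below does that, then searches a grid of candidate values for the maximiser; the grid spacing of 0.01 is an assumption made here for illustration.

```python
from math import comb


def likelihood(p, x=7, n=10):
    """Binomial likelihood L(p) for x buyers out of n customers."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Compare the two guesses from earlier.
for guess in (0.5, 0.7):
    print(guess, likelihood(guess))

# Assumed grid of candidate values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# The grid point with the largest likelihood approximates the MLE.
p_hat = max(grid, key=likelihood)
print(p_hat)
```

The data are more than twice as probable under \(p = 0.7\) as under \(p = 0.5\), which is the sense in which 0.7 is the “better” guess.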
Assume that \(X_i \sim f(x \mid \theta)\) independently for \(i = 1, \ldots, n\). Here, the functional form of \(f\) is known, but the parameter \(\theta\) is unknown. Examples:
| Name | \(f(x \mid \theta)\) | \(\theta\) | Remarks |
|---|---|---|---|
| Binomial | \(\binom{n}{x} p^x (1 - p)^{n - x}\) | \(p \in (0, 1)\) | No. successes in \(n\) trials |
| Poisson | \(\frac{\lambda^x e^{-\lambda}}{x!}\) | \(\lambda > 0\) | Count data |
| Uniform | \(\frac{1}{b - a}\) for \(x \in [a, b]\) | \((a, b)\) with \(a < b\) | Equally likely outcomes |
| Exponential | \(\lambda e^{-\lambda x}\) for \(x \geq 0\) | \(\lambda > 0\) | Waiting time |
| Normal | \(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\) | \(\mu \in \mathbb{R},\ \sigma^2 > 0\) | Bell curve |
We have \(X \sim \operatorname{Bin}(n, p)\) — \(X\) cups sold out of \(n\) passers-by. Treat the pmf as a function of \(p\) and take logs:
\[ \begin{aligned} \ell(p) = \log \mathcal L(p) &= \log\!\left[ \binom{n}{X}\, p^{X} (1-p)^{n-X} \right] \\ &= \underbrace{\log \tbinom{n}{X}}_{\text{constant in } p} \; + \; X \log p \; + \; (n - X)\log(1 - p). \end{aligned} \]
Differentiate with respect to \(p\) and set it to zero:
\[ \begin{aligned} \frac{d}{dp}\,\ell(p) = \frac{X}{p} - \frac{n - X}{1 - p} = 0 \;\;\Leftrightarrow\;\; X(1-p) = (n-X)\,p \;\;\Leftrightarrow\;\; \hat p = \frac{X}{n} = \bar X. \end{aligned} \]
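The derivation can be checked numerically for the stall data (\(X = 7\), \(n = 10\)). In this sketch the function names are illustrative, and the log-likelihood omits the constant \(\log\binom{n}{X}\) since it does not affect where the maximum sits.

```python
from math import log


def log_likelihood(p, x, n):
    """l(p) without the additive constant log C(n, x)."""
    return x * log(p) + (n - x) * log(1 - p)


def log_likelihood_derivative(p, x, n):
    """dl/dp = x/p - (n - x)/(1 - p)."""
    return x / p - (n - x) / (1 - p)


x, n = 7, 10
p_hat = x / n  # the closed-form MLE

# The slope should be (numerically) zero at p_hat,
# positive to its left and negative to its right.
slope_at_mle = log_likelihood_derivative(p_hat, x, n)
slope_left = log_likelihood_derivative(0.6, x, n)
slope_right = log_likelihood_derivative(0.8, x, n)

print(slope_left, slope_at_mle, slope_right)
```

The sign change from positive to negative confirms that the log-likelihood rises up to \(\hat p\) and falls afterwards.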
Note
The MLE is just the sample proportion — the fraction of customers who bought a cup!
Remember the A/B test we promised? To learn the effect of a price hike, run both prices at once and estimate each \(p\) by its sample proportion:
Group A — keep RM4
\(n_A = 100\) pass by, \(X_A = 70\) buy
\(\hat p_A = 70/100 = \mathbf{0.70}\)
Group B — raise to RM6
\(n_B = 20\) pass by, \(X_B = 10\) buy
\(\hat p_B = 10/20 = \mathbf{0.50}\)
Fewer buyers at RM6 — but more money per head: \(\text{RM}4 \times 0.70 = \text{RM}2.80\) vs \(\text{RM}6 \times 0.50 = \text{RM}3.00\). RM6 still edges ahead, just as the calculus predicted. But how sure are we about that 0.50? 👉
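The arithmetic behind this comparison is short enough to script. The dictionary layout and function names below are assumptions for illustration; the numbers are the ones from the two groups.

```python
# A/B test data from the two groups.
groups = {
    "A": {"price": 4, "passers_by": 100, "buyers": 70},
    "B": {"price": 6, "passers_by": 20, "buyers": 10},
}


def buy_rate(group):
    """Sample proportion of passers-by who bought: the MLE of p."""
    return group["buyers"] / group["passers_by"]


def revenue_per_head(group):
    """Estimated revenue per passer-by in RM: price times buy rate."""
    return group["price"] * buy_rate(group)


for name, group in groups.items():
    print(name, buy_rate(group), revenue_per_head(group))
```

These are point estimates only; they say nothing yet about how much each buy rate might move with a different sample.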

Wider curve = more uncertainty. The buy-rate looks ~20 points lower at RM6, but Group B’s curve is broad (\(n=20\)): the drop could be real, or just noise. There is overlap in the 95% confidence intervals 🔍

More data = more confidence. With more data, both groups’ likelihood curves are sharp and trustworthy. The drop in buy-rate at RM6 is now clearly visible, and the 95% confidence intervals do not overlap. The price effect is (likely) real! 💰
The MLE solves \(\ell'(p) = 0\). That first derivative — the slope of the log-likelihood — is the score: \[ U(p) = \ell'(p) = \frac{X}{p} - \frac{n - X}{1 - p}. \]
In the A/B test both curves peaked — yet one was sharp, one flat. Sharpness is how fast that slope falls away from the peak: the second derivative. So differentiate the score again:
\[ \ell''(p) = -\frac{X}{p^{2}} - \frac{n - X}{(1 - p)^{2}} \;<\; 0. \]
It is negative, so the peak is a genuine maximum ✓ — the second-derivative test, doing statistical work.
Definition 3 (Fisher information) Under certain regularity conditions, the Fisher information is defined as \[ \mathcal I(p) = -\operatorname{E}\left[\frac{d^2}{dp^2} \ell(p)\right]. \]
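For the binomial log-likelihood, \(\operatorname{E}(X) = np\) gives \[ \mathcal I(p) = \frac{np}{p^{2}} + \frac{n - np}{(1 - p)^{2}} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}. \]
In principle, larger information \(\rightarrow\) sharper peak \(\rightarrow\) lower uncertainty: \[ \operatorname{Var}(\hat p) \;\approx\; \frac{1}{\mathcal I(p)} = \frac{p(1-p)}{n}. \]
The standard error of \(\hat p\) is \(\operatorname{SE} = 1/\sqrt{\mathcal I(p)}\).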
| Group | \(n\) | \(\mathcal I(\hat p) = \frac{n}{\hat p(1-\hat p)}\) | \(\text{SE} = 1/\sqrt{\mathcal I}\) |
|---|---|---|---|
| RM4 (sharp) | 100 | \(\approx 476\) | \(0.046\) |
| RM6 (wide) | 20 | \(= 80\) | \(0.112\) |
Definition 4 (Asymptotic 95% confidence interval) For large \(n\), a 95% confidence interval for \(p\) is \[ \hat p \;\pm\; 1.96 \big/ \sqrt{\mathcal I(\hat p)} \;=\; \hat p \;\pm\; 1.96\,\operatorname{SE}. \]
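Plugging in each group’s \(\operatorname{SE} = 1/\sqrt{\mathcal I(\hat p)} = \sqrt{\hat p(1-\hat p)/n}\): \[ \begin{aligned} \text{RM4:} \quad & 0.70 \pm 1.96 \times 0.046 \approx (0.61,\ 0.79), \\ \text{RM6:} \quad & 0.50 \pm 1.96 \times 0.112 \approx (0.28,\ 0.72). \end{aligned} \] The two intervals overlap on roughly \((0.61,\ 0.72)\).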
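This plug-in calculation can be sketched in Python as follows. The function names are illustrative, and the overlap check is just one simple way to compare the two intervals, not a formal test.

```python
from math import sqrt


def fisher_information(p_hat, n):
    """Binomial Fisher information evaluated at p_hat: n / (p(1 - p))."""
    return n / (p_hat * (1 - p_hat))


def confidence_interval(buyers, n, z=1.96):
    """Asymptotic 95% interval p_hat +/- z * SE, with SE = 1 / sqrt(I)."""
    p_hat = buyers / n
    se = 1 / sqrt(fisher_information(p_hat, n))
    return p_hat - z * se, p_hat + z * se


ci_a = confidence_interval(70, 100)  # Group A, RM4
ci_b = confidence_interval(10, 20)   # Group B, RM6

# Two intervals overlap if each one starts before the other ends.
intervals_overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

print(ci_a, ci_b, intervals_overlap)
```

The approximation relies on large \(n\), so the interval for Group B (\(n = 20\)) should be read with more caution than the one for Group A.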
Given a model, probability allows us to predict data. Statistics, on the other hand, allows us to learn from data.


🌐 Autodiff
Computers compute derivatives for us: nudge \(x\) a little and measure the change
Deep learning goes further with automatic differentiation — the chain rule applied exactly, even for functions with billions of inputs.
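To give a flavour of how the chain rule can be applied exactly, here is a toy forward-mode sketch using dual numbers. It is a minimal illustration written for these notes, not how any particular deep learning library is implemented, and it supports only addition and multiplication. The `Dual` class and `derivative` helper are hypothetical names.

```python
class Dual:
    """A value paired with its derivative; arithmetic carries both along."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _wrap(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._wrap(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


def revenue(x):
    """Potato revenue R(x) = 120x - 10x^2, written with + and * only."""
    return 120 * x + (-10) * x * x


print(derivative(revenue, 4.0))  # slope at the old price
print(derivative(revenue, 6.0))  # slope at the optimal price
```

Unlike a finite-difference nudge, there is no step size \(h\) here: the derivative comes out exact up to floating-point arithmetic, because every operation applies its own differentiation rule.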
🚀 Where these ideas go next