Calculus: Differentiation and Its Application

A statistical perspective

Haziq Jamil

Research Specialist
BAYESCOMP @ CEMSE-KAUST

https://haziqj.ml/uitm-calculus

June 13, 2026

(Almost) Everything you ought to know…

…about calculus in the first year

Let \(f:\mathcal X \to \mathbb R\) be a real-valued function defined on an input set \(\mathcal X\).

Definition 1 (Differentiability) \(f(x)\) is said to be differentiable at a point \(x \in \mathcal X\) if the limit

\[ L = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{1}\]

exists. If \(L\) exists, we denote it by \(f'(x)\) or \(\frac{df}{dx}(x)\), and call it the derivative of \(f\) at \(x\). Further, \(f\) is said to be differentiable on \(\mathcal X\) if it is differentiable at every point in \(\mathcal X\).

For now, we assume \(\mathcal X \subseteq \mathbb R\), and will extend to higher dimensions later.

Some examples


| Function | Derivative |
|---|---|
| \(f(x) = x^2\) | \(f'(x) = 2x\) |
| \(f(x) = \sum_{n} a_n x^n\) | \(f'(x) = \sum_{n} n a_n x^{n-1}\) |
| \(f(x) = \sin(x)\) | \(f'(x) = \cos(x)\) |
| \(f(x) = \cos(x)\) | \(f'(x) = -\sin(x)\) |
| \(f(x) = e^x\) | \(f'(x) = e^x\) |
| \(f(x) = \ln(x)\) | \(f'(x) = \frac{1}{x}\) |

We can derive these “by hand” using the definition. Let \(f(x) = x^2\). Then,

\[ \begin{align} \lim_{h \to 0} & \frac{f(x + h) - f(x)}{h} \\ &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\ &= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\[0.5em] &= \lim_{h \to 0} (2x + h) \\[0.5em] &= 2x. \end{align} \]
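The same limit can be watched numerically. The sketch below evaluates the difference quotient from (1) for \(f(x) = x^2\) at an arbitrarily chosen point \(x = 3\) (the point and the step sizes are assumptions made for illustration) and shows it approaching \(2x = 6\) as \(h\) shrinks.

```r
# Difference quotient (f(x + h) - f(x)) / h for f(x) = x^2 at x0 = 3.
f  <- function(x) x^2
x0 <- 3                              # illustrative point, not from the slides
h  <- 10^(-(1:6))                    # shrinking steps: 0.1, 0.01, ..., 1e-06

quotient <- (f(x0 + h) - f(x0)) / h  # algebraically equal to 2 * x0 + h
print(data.frame(h = h, quotient = quotient, error = quotient - 2 * x0))
```

The error column shrinks in step with \(h\), matching the \(2x + h\) line of the derivation.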

Graphically…

But what is a derivative?

The derivative of a function tells you:

  • 🚀 How fast the function is changing at any point
  • 📐 The slope of the tangent line at that point

The concept of optimisation

  • When \(f\) is some kind of a “reward” function, then the value of \(x\) that maximises \(f\) is of great interest. Some examples:
    • 💰 Profit maximisation: Find the price that maximises profit.
    • 🧬 Biological processes: Find the conditions that maximise growth or reproduction rates.
    • 👷‍♂️ Engineering: Find the design parameters that maximise strength or efficiency.
  • Derivatives help us find so-called critical values: Solve \(f'(x) = 0\).

Example 1 Find the maximum of \(f(x) = -3x^4 + 4x^3 + 12x^2\).

\[ \begin{align*} f'(x) = -12x^3 + 12x^2 + 24x &= 0 \\ \Leftrightarrow 12x(2 + x - x^2) &= 0 \\ \Leftrightarrow -12x(x+1)(x-2) &= 0 \\ \Leftrightarrow x &= 0, -1, 2. \end{align*} \]
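As a quick check on the factorisation, the roots of \(f'(x)\) can also be found numerically. This is a minimal sketch using base R's `polyroot()`, which expects the polynomial coefficients in increasing order of power; the variable names are illustrative.

```r
# f'(x) = 0 + 24x + 12x^2 - 12x^3, coefficients in increasing order of power.
coefs <- c(0, 24, 12, -12)

roots <- polyroot(coefs)          # complex roots of the cubic
crit  <- sort(round(Re(roots), 6)) # all three roots are real here
print(crit)
```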

Are all of these critical values maxima? 🤔

Graphically…

How do we know if it’s a maximum or a minimum?

Second derivative test: Measure the change in slope around a critical point \(\hat x\), i.e. evaluate \(f''(\hat x) = \frac{d}{dx}\left( \frac{df}{dx}(x) \right)\Big|_{x=\hat x} = \frac{d^2f}{dx^2}(\hat x)\).

| Behaviour of \(f\) near \(\hat x\) | \(f''(\hat x)\) | Shape | Conclusion |
|---|---|---|---|
| Increasing → Decreasing | \(f''(\hat x) < 0\) | Concave (∩) | Local maximum |
| Decreasing → Increasing | \(f''(\hat x) > 0\) | Convex (∪) | Local minimum |
| No sign change / flat region | \(f''(\hat x) = 0\) | Unknown / flat | Inconclusive |

Second derivative test

From Example 1, the second derivative is given by \[ \begin{align*} f''(x) &= \frac{d}{dx}\left(-12x^3 + 12x^2 +24x\right) \\ &= -36x^2 + 24x + 24 \end{align*} \]

Plug in the critical points:

  • \(x=-1\): \(f''(-1) = -36 - 24 + 24 = -36 < 0\), hence local maximum.
  • \(x=0\): \(f''(0) = 0 + 0 + 24 = 24 > 0\), hence local minimum.
  • \(x=2\): \(f''(2) = -144 + 48 + 24 = -72 < 0\), hence local maximum.
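The classification can be reproduced in a few lines. The sketch below uses base R's `D()` for symbolic differentiation; the variable names are illustrative and the critical points are taken from Example 1.

```r
# Second derivative test for f(x) = -3x^4 + 4x^3 + 12x^2.
f  <- function(x) -3 * x^4 + 4 * x^3 + 12 * x^2
d1 <- D(quote(-3 * x^4 + 4 * x^3 + 12 * x^2), "x")  # f'(x) as an expression
d2 <- D(d1, "x")                                    # f''(x) as an expression

crit      <- c(-1, 0, 2)                            # critical points from Example 1
curvature <- sapply(crit, function(x) eval(d2, list(x = x)))
type      <- ifelse(curvature < 0, "local maximum",
             ifelse(curvature > 0, "local minimum", "inconclusive"))

result <- data.frame(x = crit, f = f(crit), f2 = curvature, type = type)
print(result)
```

Both \(x=-1\) and \(x=2\) are local maxima. Comparing \(f(-1) = 5\) with \(f(2) = 32\), and noting that \(f(x) \to -\infty\) as \(|x| \to \infty\), the maximum asked for in Example 1 is at \(x = 2\).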

Tip

Often it is not enough to just differentiate once to find optima. Differentiate twice to classify critical points.

Optimisation in the real world

Example 2 (Pricing a cup of fries 🍟) Your Potato Story outlet sells 80 cups a day at RM4 each. Market testing shows that every RM1 you add to the price loses 10 cups a day. What price brings in the most money?

At a price of \(x\) ringgit, the number of cups sold each day is \[ q(x) = 80 - 10(x - 4) = 120 - 10x. \] So the daily revenue to maximise is \[ R(x) = x \times q(x) = 120x - 10x^2. \]

Set the derivative to zero: \[ R'(x) = 120 - 20x = 0 \;\Rightarrow\; x = \text{RM}\,6. \] Since \(R''(x) = -20 < 0\), it is a maximum: sell 60 cups for RM360 a day (beating RM320 at the old price!).
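The same answer can be checked numerically. This is a minimal sketch using base R's `optimize()`; the search interval of RM0 to RM12 is an assumption (it is the range over which \(q(x) = 120 - 10x\) stays non-negative).

```r
# Daily revenue R(x) = x * q(x) with q(x) = 120 - 10x.
revenue <- function(x) 120 * x - 10 * x^2

# Search prices between RM0 and RM12 (assumed range where demand is >= 0).
best <- optimize(revenue, interval = c(0, 12), maximum = TRUE)

print(best$maximum)    # price that maximises revenue, close to 6
print(best$objective)  # revenue at that price, close to 360
```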

Note

Every potato matters! 🥔 Same recipe everywhere: write the quantity → set the derivative to zero → check it’s a maximum.

Finding the sweet spot 🍟

The peak sits exactly where the slope is flat — where \(R'(x) = 0\). That is all “setting the derivative to zero” is really doing. 🎯

Optimisation in AI

[Diagram: a neural network with an input layer, a hidden layer, and an output layer that predicts “cat 🐱”]

Transformers, the tech behind LLMs, and Deep Learning: 3B1B https://youtu.be/wjZofJX0v4M?si=IzEE6fzSlDLLpLsc

Inside the machine 🧠

We cracked the potato problem by hand. But what is an “AI”? Underneath, it is a network of simple units wired together:

An LLM is a giant version of this, with billions of arrows.

\[ h = \sigma(\, \overbrace{w_1 x_1 + w_2 x_2 + \cdots + b}^{\text{a linear equation}} \,) \]

  • Stack thousands of these; a giant system of linear equations.
  • Every arrow is a weight \(w\); an unknown.
  • “Learning” = choosing the \(w\)’s that give the smallest error.
  • Billions of unknowns, and no tidy formula to set \(f'=0\). So how do we ever find them all? 🤔
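To make the formula for \(h\) concrete, here is a minimal sketch of a single unit with two inputs. The input values, weights, and bias are made-up numbers, and the logistic sigmoid is assumed for \(\sigma\) (the slides do not say which activation is used).

```r
# One unit: h = sigma(w1 * x1 + w2 * x2 + b).
sigmoid <- function(z) 1 / (1 + exp(-z))  # assumed choice of sigma

x <- c(0.5, -1.2)  # hypothetical inputs x1, x2
w <- c(0.8, 0.3)   # hypothetical weights w1, w2 (the unknowns to be learned)
b <- 0.1           # hypothetical bias

z <- sum(w * x) + b  # the linear part: w1 * x1 + w2 * x2 + b
h <- sigmoid(z)      # squashed into (0, 1)
print(c(z = z, h = h))
```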

How does AI learn? 🤖

Follow the slope downhill

  1. Start with a wild guess!
  2. Measure how wrong it is, via \(L(w)\), the loss function.
  3. The slope points uphill → step the opposite way.
  4. Repeat until you hit the bottom.

It is the potato’s \(R'(x)=0\) all over again, just on the error \(L\), for every weight at once: \[ \underbrace{\frac{\partial L}{\partial w} = 0}_{\textstyle\text{the goal}} \qquad\text{reached step by step via}\qquad \underbrace{w \;\leftarrow\; w - \alpha\,\frac{\partial L}{\partial w}}_{\textstyle\text{gradient descent}} \]
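The update rule can be tried on the fries problem itself. In this sketch the “loss” is taken to be the negative revenue \(L(w) = -(120w - 10w^2)\), so that going downhill on \(L\) means going uphill on revenue; the starting guess, the step size \(\alpha = 0.01\), and the number of steps are all illustrative assumptions.

```r
# Gradient descent on L(w) = -(120w - 10w^2), the negative daily revenue.
loss <- function(w) -(120 * w - 10 * w^2)
grad <- function(w) -(120 - 20 * w)   # dL/dw

w     <- 1      # 1. start with a wild guess (RM1)
alpha <- 0.01   # step size, chosen for illustration

for (i in 1:200) {
  w <- w - alpha * grad(w)  # 3. step against the slope; 4. repeat
}

print(w)        # close to 6, the price found by solving R'(x) = 0
print(grad(w))  # slope is approximately zero at the bottom
```

No equation was solved here: the loop only ever asks for the slope at the current guess, which is why the same idea scales to billions of weights.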

Summary so far

  • 📈 Derivatives represent rate of change (slope) of a function \(f:\mathcal X \to \mathbb R\).

  • 🎯 Interested in optimising an objective function \(f(x)\) representing some kind of “reward” or “cost”.

  • 🔍 Find critical points by solving \(f'(x) = 0\).

  • Use the second derivative test to classify critical points:

    • If \(f''(x) < 0\), then \(f\) is concave down at \(x\) and \(x\) is a local maximum. ⛰️
    • If \(f''(x) > 0\), then \(f\) is concave up at \(x\) and \(x\) is a local minimum. 🫧
    • If \(f''(x) = 0\), then the test is inconclusive. 🤷

A statistical perspective

But what is statistics?

Statistics is a scientific subject that deals with the collection, analysis, interpretation, and presentation of data.

  • Collection means designing experiments, questionnaires, sampling schemes, and also administration of data collection.

  • Analysis means mathematical modelling, estimation, testing, and forecasting.

Motivation

Demand isn’t fixed—each passer-by is a ‘gamble’. So at today’s price, what’s the chance a customer buys?

At the Potato Story stall, \(n\) customers walk past at today’s price and each either buys a cup or doesn’t. I want \(p\), the probability that a customer buys. Let \(X_i=1\) if customer \(i\) buys, and \(X_i=0\) if they walk on.

🍟

  • I do not know the value of \(p\), so I want to estimate it somehow.

  • I have a “guess” as to what it might be, e.g. \(p=0.5\) or \(p=0.7\).

  • How do I objectively decide which value is better?

A probabilistic model

Each \(X_i\) is a random variable taking only two possible outcomes, i.e.

\[ X_i = \begin{cases} 1 &\text{w.p. } \ \ p \quad (\text{buys}) \\ 0 &\text{w.p. } \ \ 1-p \quad (\text{walks on}) \\ \end{cases} \]

This is known as a Bernoulli random variable.

Suppose that \(X=X_1 + \dots + X_n\). So we are counting the number of cups sold to \(n\) customers. Then this becomes a binomial random variable. We write \(X \sim \operatorname{Bin}(n, p)\), and the probability mass function is given by

\[ f(x \mid p) = \Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n. \]

Often we might want to find quantities such as \(\operatorname{E}(X)=np\) and \(\operatorname{Var}(X)=np(1-p)\), but we will not go into details here.

Learning from data

\(p\) is unknown

If we do not know \(p\), then it is not possible to calculate probabilities, expectations, variances… ☹️

Naturally, you watch the next \(n=10\) customers walk past the stall:

🍟 🍟 🍟 🚶 🚶 🍟 🍟 🚶 🍟 🍟

A total of \(X=7\) buy a cup (🍟), and from this you surmise that customers are fairly likely to buy, because:

  • If \(p=0.5\), then \(\Pr(X=7 \mid p = 0.5) = \binom{10}{7} (0.5)^7 (0.5)^3 = 0.117\).
  • If \(p=0.7\), then \(\Pr(X=7 \mid p = 0.7) = \binom{10}{7} (0.7)^7 (0.3)^3 = 0.267\).
  • If \(p=0.9\), then \(\Pr(X=7 \mid p = 0.9) = \binom{10}{7} (0.9)^7 (0.1)^3 = 0.057\).
  • How to formalise this idea?
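The three probabilities above can be reproduced with base R's `dbinom()`, and checked against the formula written out by hand. This is a small sketch; the variable names are illustrative.

```r
# Pr(X = 7 | p) for n = 10 customers, at three candidate values of p.
p_grid <- c(0.5, 0.7, 0.9)

lik     <- dbinom(7, size = 10, prob = p_grid)          # built-in binomial pmf
by_hand <- choose(10, 7) * p_grid^7 * (1 - p_grid)^3    # same formula, spelled out

print(round(lik, 3))
print(p_grid[which.max(lik)])  # the candidate that makes the data most probable
```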

The likelihood function

Definition 2 Given a probability function \(x \mapsto f(x\mid\theta)\) where \(x\) is a realisation of a random variable \(X\), the likelihood function is \(\theta \mapsto f(x\mid\theta)\), often written \(\mathcal L(\theta) = f(x \mid \theta)\).


The value \(\hat \theta\) which maximises \(\mathcal L(\theta)\) is called the maximum likelihood estimator (MLE) of \(\theta\).

Parametric statistical models

Assume that \(X_i \sim f(x \mid \theta)\) independently for \(i = 1, \ldots, n\). Here, the functional form of \(f\) is known, but the parameter \(\theta\) is unknown. Examples:

| Name | \(f(x \mid \theta)\) | \(\theta\) | Remarks |
|---|---|---|---|
| Binomial | \(\binom{n}{x} p^x (1 - p)^{n - x}\) | \(p \in (0, 1)\) | No. successes in \(n\) trials |
| Poisson | \(\frac{\lambda^x e^{-\lambda}}{x!}\) | \(\lambda > 0\) | Count data |
| Uniform | \(\frac{1}{b - a}\) for \(x \in [a, b]\) | \(a < b\) | Equally likely outcomes |
| Exponential | \(\lambda e^{-\lambda x}\) for \(x \geq 0\) | \(\lambda > 0\) | Waiting time |
| Normal | \(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\) | \(\mu \in \mathbb{R},\ \sigma^2 > 0\) | Bell curve |

Example: the MLE for \(p\), by differentiation

We have \(X \sim \operatorname{Bin}(n, p)\): \(X\) cups sold out of \(n\) passers-by. Treat the pmf as a function of \(p\) and take logs:

\[ \begin{aligned} \ell(p) = \log \mathcal L(p) &= \log\!\left[ \binom{n}{X}\, p^{X} (1-p)^{n-X} \right] \\ &= \underbrace{\log \tbinom{n}{X}}_{\text{constant in } p} \; + \; X \log p \; + \; (n - X)\log(1 - p). \end{aligned} \]

Differentiate with respect to \(p\) and set it to zero:

\[ \begin{aligned} \frac{d}{dp}\,\ell(p) = \frac{X}{p} - \frac{n - X}{1 - p} = 0 \;\;\Leftrightarrow\;\; X(1-p) = (n-X)\,p \;\;\Leftrightarrow\;\; \hat p = \frac{X}{n} = \bar X. \end{aligned} \]
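The closed form \(\hat p = X/n\) can be checked by maximising the log-likelihood numerically. This sketch reuses the earlier data (\(X = 7\) buyers out of \(n = 10\)); the search interval is an assumption chosen to stay away from \(p = 0\) and \(p = 1\), where the log-likelihood is \(-\infty\).

```r
# Log-likelihood of p for X = 7 buyers out of n = 10 customers.
n <- 10
X <- 7
loglik <- function(p) dbinom(X, size = n, prob = p, log = TRUE)

# Maximise over p in (0, 1); the end points are trimmed slightly (assumption).
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)

print(fit$maximum)  # numerical MLE, close to X / n
print(X / n)        # closed-form MLE from the derivation
```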

Note

The MLE is just the sample proportion — the fraction of customers who bought a cup!

Now compare two prices: an A/B test 🍟

Remember the A/B test we promised? To learn the effect of a price hike, run both prices at once and estimate each \(p\) by its sample proportion:

Group A — keep RM4

\(n_A = 100\) pass by, \(X_A = 70\) buy

\(\hat p_A = 70/100 = \mathbf{0.70}\)

Group B — raise to RM6

\(n_B = 20\) pass by, \(X_B = 10\) buy

\(\hat p_B = 10/20 = \mathbf{0.50}\)

Fewer buyers at RM6 — but more money per head: \(\text{RM}4 \times 0.70 = \text{RM}2.80\) vs \(\text{RM}6 \times 0.50 = \text{RM}3.00\). RM6 still edges ahead, just as the calculus predicted. But how sure are we about that 0.50? 👉

Two prices, two likelihoods 🍟

Wider curve = more uncertainty. The buy-rate looks ~20 points lower at RM6, but Group B’s curve is broad (\(n=20\)): the drop could be real, or just noise. There is overlap in the 95% confidence intervals 🔍

Two prices, two likelihoods 🍟 (cont.)

More data = more confidence. With more data, both groups’ likelihood curves are sharp and trustworthy. The drop in buy-rate at RM6 is now clearly visible, and the 95% confidence intervals do not overlap. The price effect is (likely) real! 💰

How sharp is the peak? Differentiate again

The MLE solves \(\ell'(p) = 0\). That first derivative — the slope of the log-likelihood — is the score: \[ U(p) = \ell'(p) = \frac{X}{p} - \frac{n - X}{1 - p}. \]

In the A/B test both curves peaked — yet one was sharp, one flat. Sharpness is how fast that slope falls away from the peak: the second derivative. So differentiate the score again:

\[ \ell''(p) = -\frac{X}{p^{2}} - \frac{n - X}{(1 - p)^{2}} \;<\; 0. \]

It is negative, so the peak is a genuine maximum ✓ — the second-derivative test, doing statistical work.

Average the curvature: Fisher information

Definition 3 (Fisher information) Under certain regularity conditions, the Fisher information is defined as \[ \mathcal I(p) = -\operatorname{E}\left[\frac{d^2}{dp^2} \ell(p)\right]. \]

  • For the binomial model, we can calculate \(\mathcal I(p)\) explicitly: \[ \mathcal I(p) = \operatorname{E}\!\left[-\ell''(p)\right] = \frac{np}{p^{2}} + \frac{n - np}{(1-p)^{2}} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}. \]
  • In principle, larger information \(\rightarrow\) sharper peak \(\rightarrow\) lower uncertainty. \[ \operatorname{Var}(\hat p) \;\approx\; \frac{1}{\mathcal I(p)} = \frac{p(1-p)}{n}. \]

  • The Standard Error of \(\hat p\) is \(\operatorname{SE} = 1/\sqrt{\mathcal I(p)}\).
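The expectation in Definition 3 can be computed directly by averaging \(-\ell''(p)\) over every possible value of \(X\), weighted by the binomial pmf. The sketch below does this for \(n = 100\) and \(p = 0.7\) (values chosen to match Group A; treating 0.7 as the true \(p\) is an assumption made for illustration).

```r
# Fisher information as an exact expectation over X ~ Bin(n, p).
n <- 100
p <- 0.7                                    # treated as the true value here

x        <- 0:n                             # every possible count
neg_curv <- x / p^2 + (n - x) / (1 - p)^2   # -l''(p) for each possible x
info     <- sum(dbinom(x, n, p) * neg_curv) # E[-l''(p)]

print(c(by_expectation = info, closed_form = n / (p * (1 - p))))
print(1 / sqrt(info))                       # the corresponding standard error
```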

Information = confidence 🎯

| Group | \(n\) | \(\mathcal I(\hat p) = \frac{n}{\hat p(1-\hat p)}\) | \(\text{SE} = 1/\sqrt{\mathcal I}\) |
|---|---|---|---|
| RM4 (sharp) | 100 | \(\approx 476\) | \(0.046\) |
| RM6 (wide) | 20 | \(= 80\) | \(0.112\) |

Definition 4 (Asymptotic 95% confidence interval) For large \(n\), a 95% confidence interval for \(p\) is \[ \hat p \;\pm\; 1.96 \big/ \sqrt{\mathcal I(\hat p)} \;=\; \hat p \;\pm\; 1.96\,\operatorname{SE}. \]

Plugging in each group’s \(\operatorname{SE} = 1/\sqrt{\mathcal I(\hat p)} = \sqrt{\hat p(1-\hat p)/n}\):

  • RM4: \(0.70 \pm 1.96(0.046) = [0.61,\ 0.79]\)
  • RM6: \(0.50 \pm 1.96(0.112) = [0.28,\ 0.72]\)
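Both intervals can be reproduced with a small helper. `wald_ci()` below is a hypothetical function written for this sketch (it is not from any package); it implements Definition 4 with the plug-in standard error \(\sqrt{\hat p(1-\hat p)/n}\), an interval commonly known as the Wald interval.

```r
# Asymptotic 95% confidence interval for a binomial proportion.
# wald_ci() is a hypothetical helper defined here for illustration.
wald_ci <- function(x, n, level = 0.95) {
  p_hat <- x / n                           # MLE: the sample proportion
  se    <- sqrt(p_hat * (1 - p_hat) / n)   # 1 / sqrt(I(p_hat))
  z     <- qnorm(1 - (1 - level) / 2)      # about 1.96 for a 95% interval
  c(estimate = p_hat, se = se,
    lower = p_hat - z * se, upper = p_hat + z * se)
}

ci <- rbind(RM4 = wald_ci(70, 100),
            RM6 = wald_ci(10, 20))
print(round(ci, 3))
```

The two intervals overlap, which is the numerical version of the broad Group B curve: with only 20 customers, the drop in buy-rate cannot yet be separated from noise.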

Conclusions

Summary

Given a model, probability allows us to predict data. Statistics, on the other hand, allows us to learn from data.

  • Parameter estimation is a central task in statistics.
    • The MLE solves \(\ell'(\theta)=0\) — the first derivative (the score).
    • The Fisher information \(\mathcal I(\theta)=\operatorname{E}[-\ell''(\theta)]\) is the second derivative, averaged; it sets how confident we are (more data → sharper peak → tighter CI).
  • Calculus is not just background maths—it’s the engine driving statistical theory and modern AI.

Where to go from here

🌐 Autodiff

Computers compute derivatives for us: nudge \(x\) a little and measure the change

# Requires the numDeriv package (install.packages("numDeriv"))
fun <- function(x) x ^ 2
# f'(2) = 4
numDeriv::grad(fun, x = 2)
#> [1] 4


Deep learning goes further with automatic differentiation — the chain rule applied exactly, even for functions with billions of inputs.
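The difference between “nudging” and “applying the chain rule exactly” can be seen in base R. The sketch below uses `deriv()`, which differentiates a simple expression symbolically; this is not the same machinery as the automatic differentiation inside deep learning libraries, but it illustrates the same idea of an exact derivative built from the chain rule. The test function \(\sin(x^2)\) and the point \(x = 2\) are arbitrary choices.

```r
# Exact derivative of sin(x^2) via the chain rule, using base R's deriv().
g <- deriv(~ sin(x^2), "x", function.arg = TRUE)

out     <- g(2)                           # value, with the gradient attached
exact   <- attr(out, "gradient")[1, "x"]  # derivative evaluated at x = 2
by_hand <- 2 * 2 * cos(2^2)               # chain rule: 2x * cos(x^2) at x = 2

# For comparison: nudge x a little and measure the change (central difference).
h      <- 1e-6
nudged <- (sin((2 + h)^2) - sin((2 - h)^2)) / (2 * h)

print(c(exact = unname(exact), by_hand = by_hand, nudged = nudged))
```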

🚀 Where these ideas go next

  • From one variable to many — partial derivatives and the gradient \(\nabla f\); the very same \(f'(x)=0\) in higher dimensions (regression, neural nets, …)
  • The other half of calculus: integration, behind probabilities, expectations, and areas under the curve
  • Constrained optimisation — Lagrange multipliers, for when your choices have limits

Thank you very much

https://haziqj.ml/uitm-calculus