Composition and the Chain Rule
How does local change propagate through multiple transformations?
Why must the derivative of a composition be a composition of derivatives?
The derivative names the linear structure visible inside continuous change near one point. It takes a changing process and, at a chosen input, extracts the linear map that describes first-order behavior there.
That already gives calculus a local language.
But change rarely appears as one isolated process.
A position may depend on time. A temperature may depend on position. A measurement may depend on temperature. A cost may depend on a measurement. Once processes feed into other processes, local change has to travel through a chain of transformations.
Suppose one smooth function sends x to y, and a second smooth function sends y to z:
x ──f──▶ y ──g──▶ z
Their composite g . f sends x directly to z.
At the global level, this is ordinary function composition. First apply f, then apply g. The structural question in calculus asks what happens one level down, at the level of local linear behavior.
Near x, the function f has a derivative Df_x. This derivative sends a small input displacement near x to the first-order output displacement near y = f(x).
Near y, the function g has a derivative Dg_y. This derivative sends a small input displacement near y to the first-order output displacement near z = g(y).
The small displacement produced by f is exactly the kind of input displacement that g needs.
So the local change travels by composition:
x ─────f─────▶ y ─────g─────▶ z
│              │              │
▼              ▼              ▼
T_x ───Df_x──▶ T_y ───Dg_y──▶ T_z
Read T_x, T_y, and T_z as the local linear worlds of infinitesimal directions attached to the points x, y, and z. The function f moves the point x to y. Its derivative moves a small direction near x to a small direction near y. Then the derivative of g moves that direction onward to a small direction near z.
That is the chain rule in structural form.
D(g . f)_x = Dg_{f(x)} . Df_x
The derivative of the composite is the composite of the derivatives, evaluated at the right points.
The derivative of g must be taken at f(x), because that is where the output of f lands. Local linearization is attached to points, so composition has to track both levels at once:
global level: x ──f──▶ f(x) ──g──▶ g(f(x))
local level: h ──Df_x──▶ Df_x(h) ──Dg_{f(x)}──▶ Dg_{f(x)}(Df_x(h))
This is why the chain rule is more than a shortcut for differentiating formulas. It says that local linearization respects composition.
Start with a small input change h near x. The first function converts it, to first order, into Df_x(h). The second function then converts that result, to first order, into Dg_{f(x)}(Df_x(h)).
The composite process therefore has local linear part
h |-> Dg_{f(x)}(Df_x(h))
which is exactly the composite linear map
Dg_{f(x)} . Df_x
This is the minimal answer calculus can accept. If derivatives are supposed to capture first-order behavior, then chaining two smooth changes should chain their first-order behaviors. Otherwise local descriptions would fail to agree with global composition.
In one-variable calculus, the same structure appears in a compressed numerical form.
If f and g are functions from real numbers to real numbers, then each derivative is represented by multiplication by a number. The derivative Df_x is the linear map
h |-> f'(x) h
and the derivative Dg_{f(x)} is the linear map
k |-> g'(f(x)) k
Composing those two linear maps sends
h |-> f'(x) h |-> g'(f(x)) f'(x) h
So the derivative value of the composite is
(g . f)'(x) = g'(f(x)) f'(x)
The familiar formula is the coordinate shadow of the structural statement.
First derivatives compose as linear maps. In one dimension, linear maps are represented by scalar multiplication, so composition becomes multiplication of derivative values.
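As a small numerical check of the one-variable formula, the sketch below compares g'(f(x)) f'(x) with a central finite-difference estimate of (g . f)'(x). The choices f(x) = x^2 and g(y) = sin(y), the evaluation point, the step size, and all function names are assumptions made only for this illustration.

```python
import math

def f(x):
    return x * x            # illustrative choice: f(x) = x^2

def df(x):
    return 2.0 * x          # f'(x)

def g(y):
    return math.sin(y)      # illustrative choice: g(y) = sin(y)

def dg(y):
    return math.cos(y)      # g'(y)

def chain_rule_derivative(x):
    # g' is evaluated at f(x), the point where the output of f lands.
    return dg(f(x)) * df(x)

def finite_difference(func, x, step=1e-6):
    # Central difference: a numerical estimate, not an exact derivative.
    return (func(x + step) - func(x - step)) / (2.0 * step)

x0 = 1.3
by_chain_rule = chain_rule_derivative(x0)
by_difference = finite_difference(lambda x: g(f(x)), x0)
print(by_chain_rule, by_difference)
```

The two printed numbers should agree to several decimal places. Evaluating dg at x0 instead of at f(x0) would break the agreement, which is the numerical face of "evaluated at the right points".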
The same discipline becomes clearer in higher dimensions. Suppose
f : R^n -> R^m
and
g : R^m -> R^k
At a point x, the derivative of f is a linear map from input directions in R^n to output directions in R^m. At the point f(x), the derivative of g is a linear map from directions in R^m to directions in R^k.
Once bases are chosen, these linear maps are represented by matrices:
J_f(x) and J_g(f(x))
The derivative of the composite is represented by their matrix product:
J_{g . f}(x) = J_g(f(x)) J_f(x)
Again, the matrix formula is representational. The structural fact is composition of local linear maps. The order of the matrix product records the order of the underlying transformations: first Df_x, then Dg_{f(x)}.
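The matrix form can be checked the same way. The sketch below uses two hypothetical maps, f : R^2 -> R^3 and g : R^3 -> R^2, with hand-written Jacobians, and compares the product J_g(f(x)) J_f(x) against a finite-difference Jacobian of the composite. The maps, the helper names, and the step size are assumptions of the example, and plain Python lists stand in for matrices.

```python
def f(x):
    x1, x2 = x
    return [x1 * x2, x1 + x2, x2 * x2]

def jac_f(x):
    # 3x2 matrix of partial derivatives of f
    x1, x2 = x
    return [[x2, x1],
            [1.0, 1.0],
            [0.0, 2.0 * x2]]

def g(y):
    y1, y2, y3 = y
    return [y1 + y2 * y3, y1 * y3]

def jac_g(y):
    # 2x3 matrix of partial derivatives of g
    y1, y2, y3 = y
    return [[1.0, y3, y2],
            [y3, 0.0, y1]]

def matmul(a, b):
    # Ordinary matrix product of list-of-rows matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def numeric_jacobian(func, x, step=1e-6):
    # Central differences, one input coordinate at a time.
    columns = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += step
        minus[j] -= step
        fp, fm = func(plus), func(minus)
        columns.append([(p - m) / (2.0 * step) for p, m in zip(fp, fm)])
    # columns[j][i] holds d(out_i)/d(x_j); transpose into rows.
    return [[columns[j][i] for j in range(len(x))]
            for i in range(len(columns[0]))]

x0 = [0.5, -1.5]
product = matmul(jac_g(f(x0)), jac_f(x0))   # J_g at f(x0), times J_f at x0
numeric = numeric_jacobian(lambda x: g(f(x)), x0)
max_gap = max(abs(product[i][j] - numeric[i][j])
              for i in range(2) for j in range(2))
print(product)
print(max_gap)
```

The shapes already enforce the order: a 2x3 matrix times a 3x2 matrix gives the 2x2 Jacobian of the composite, while the reversed product would describe a different map.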
This connects calculus directly back to linear algebra.
Linear maps compose. Their composition is associative. Identity maps behave as neutral elements. Matrices represent those maps after a choice of basis. The chain rule says that differentiation sends smooth composition into linear composition without breaking that structure.
That is why the chain rule has the shape it does.
It is tempting to treat the rule as a clever computational pattern, especially because it is so useful for symbolic differentiation. But the rule is forced by the meaning of derivative itself. If the derivative is the local linear part of a smooth function, then the local linear part of two processes performed in succession must be obtained by performing their local linear parts in succession.
The error terms also compose in the right way.
For f, local linearization says that near x,
f(x + h) = f(x) + Df_x(h) + higher-order error
For g, local linearization says that near y = f(x),
g(y + k) = g(y) + Dg_y(k) + higher-order error
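When k is the small output displacement produced by f, that is, k = f(x + h) - f(x) = Df_x(h) + higher-order error, the first-order part passes through Dg_y. Two remainders are left over. The error of f, pushed through the linear map Dg_y, stays higher-order in h. The error of g is higher-order in k, and k itself is at most proportional to the size of h, so that piece is higher-order in h as well. Under differentiability, both pieces are negligible at first order.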
What survives is the linear chain:
h |-> Df_x(h) |-> Dg_{f(x)}(Df_x(h))
This is the precise sense in which the chain rule preserves locality. The derivative of the composite depends only on the derivative of f at x and the derivative of g at the point where f sends x. It does not require the whole global behavior of both functions to be recomputed from scratch.
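One way to see the error terms becoming negligible is to measure them. The sketch below, again assuming the illustrative pair f(x) = x^2 and g(y) = sin(y), computes what is left of the composite's change after the linear chain is subtracted, and divides by the size of h. The names and the sample values of h are assumptions of the example.

```python
import math

def f(x):
    return x * x            # illustrative choice

def df(x):
    return 2.0 * x

def g(y):
    return math.sin(y)      # illustrative choice

def dg(y):
    return math.cos(y)

def remainder_ratio(x, h):
    # Actual change of the composite minus the linear chain
    # h |-> Dg_{f(x)}(Df_x(h)), measured relative to |h|.
    actual_change = g(f(x + h)) - g(f(x))
    linear_part = dg(f(x)) * df(x) * h
    return abs(actual_change - linear_part) / abs(h)

x0 = 1.3
ratios = [remainder_ratio(x0, h) for h in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

If the chain rule holds, each ratio should be smaller than the one before it, shrinking roughly in proportion to h. That is what "higher-order" means here: the leftover vanishes faster than the input change itself.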
The rule also preserves the distinction between points and directions.
The function f acts on points. Its derivative acts on directions attached to points. The function g acts on the resulting points. Its derivative acts on the resulting directions. The chain rule keeps those two levels synchronized.
points: x ──f──▶ y ──g──▶ z
directions: h ──Df_x──▶ k ──Dg_y──▶ l
Once change is local, each point carries its own linear space of possible first-order directions. A smooth map sends points to points and, through its derivative, sends directions over each point to directions over the image point.
In this language, the derivative behaves like a structure-preserving translation from smooth geometry to linear algebra. It takes a smooth map and assigns to it a compatible family of linear maps. Identity maps become identity linear maps. Composites become composites.
That compatibility is often called functoriality.
The word points forward to category theory, but the idea is already present here in concrete form. A construction is functorial when it respects the way transformations compose. The chain rule says that taking derivatives is functorial with respect to composition of smooth maps.
identity smooth map ──derivative──▶ identity linear map
composite smooth map ──derivative──▶ composite linear map
This makes the chain rule one of the first places where the categorical lens becomes visible inside ordinary calculus. The theorem is not merely that a certain formula is true. The theorem says that the operation "take the local linearization" preserves the compositional structure of smooth change.
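The two functorial laws can be made concrete with a small sketch. The hypothetical class below bundles a one-variable function with a derivative supplied by hand, and its `then` method builds the derivative of a composite by the chain rule. The class name, the method name, and the sample functions are all assumptions made for illustration, not an established API.

```python
import math

class Smooth:
    """A one-variable function bundled with its derivative (both given by hand)."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def then(self, other):
        """Return 'other after self'; the derivative follows the chain rule."""
        return Smooth(
            lambda x: other.value(self.value(x)),
            lambda x: other.deriv(self.value(x)) * self.deriv(x),
        )

identity = Smooth(lambda x: x, lambda x: 1.0)
square = Smooth(lambda x: x * x, lambda x: 2.0 * x)
sine = Smooth(math.sin, math.cos)
expo = Smooth(math.exp, math.exp)

x0 = 0.7

# Identity law: composing with the identity leaves the derivative unchanged.
before = identity.then(square).deriv(x0)
after = square.then(identity).deriv(x0)
plain = square.deriv(x0)

# Composition law: grouping a three-step chain either way gives the same derivative.
grouped_left = square.then(sine).then(expo).deriv(x0)
grouped_right = square.then(sine.then(expo)).deriv(x0)

print(before, after, plain)
print(grouped_left, grouped_right)
```

Nothing in `then` inspects the formulas of the functions involved. It only needs each derivative and the point at which to evaluate it, which is the locality the chain rule promises.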
Several familiar consequences fall out of this view.
If a process is built from many smaller processes, its derivative can be built from the derivatives of those pieces. This is why complicated formulas can be differentiated systematically.
If the derivative at some stage sends the direction arriving there to zero, the whole chain loses first-order sensitivity in that direction. Local change has been blocked at that stage, and no later stage can restore it.
If every derivative in a chain is invertible, then local directions pass through the whole chain without collapsing. That observation leads toward inverse function ideas and the study of when a smooth map is locally reversible.
The invariant preserved by the chain rule is composable first-order behavior. The exact nonlinear shape of each function may be complex, and its global behavior may be hard to analyze. Locally, the first-order effect of the whole process is determined by the first-order effects of the parts and their order of composition.
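The blocking and invertibility consequences can be illustrated with a short sketch that works directly with Jacobian matrices. The 2x2 matrices below are hypothetical: each one stands for the derivative of some stage at the point the chain visits, and the numbers are chosen only to make the arithmetic easy to follow.

```python
def matmul(a, b):
    # Ordinary matrix product of list-of-rows matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def apply(m, v):
    # Apply a matrix to a direction vector.
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def det2(m):
    # Determinant of a 2x2 matrix; nonzero means invertible.
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical stage derivatives (assumed values, for illustration only).
stage1 = [[2.0, 1.0], [0.0, 1.0]]            # invertible, determinant 2
stage2_blocking = [[1.0, 0.0], [0.0, 0.0]]   # sends the second direction to zero
stage2_open = [[1.0, 0.0], [0.0, 3.0]]       # invertible, determinant 3
stage3 = [[1.0, 1.0], [1.0, -1.0]]           # invertible, determinant -2

# Compose in the order the chain runs: stage1 first, stage3 last.
blocked_chain = matmul(stage3, matmul(stage2_blocking, stage1))
open_chain = matmul(stage3, matmul(stage2_open, stage1))

# A direction that stage1 sends exactly onto the one stage2_blocking kills.
h = [-0.5, 1.0]

print(apply(blocked_chain, h))   # first-order effect is lost
print(apply(open_chain, h))      # first-order effect survives
print(det2(blocked_chain), det2(open_chain))
```

In the blocked chain the direction h disappears at the second stage and stays lost, even though the first and third stages are invertible. In the open chain the determinant of the composite is the product of the stage determinants, so it is nonzero and no direction collapses.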
This is the calculus analogue of a pattern we have seen before. Arithmetic became coherent when repeated steps composed. Algebra became structural when operations interacted consistently. Geometry became organized when rigid transformations composed. Linear algebra became powerful because linear maps composed while preserving linear structure.
Calculus now adds its version:
Smooth changes compose, and their local linearizations compose with them.
The objects are points in spaces where smooth variation can be inspected locally, together with smooth functions between those spaces. The local morphisms are derivatives, understood as linear maps between spaces of infinitesimal directions. What composes are both the global smooth functions and their local linear approximations. The invariant is first-order behavior under composition: rates, directions, tangent data, and local sensitivity propagate through a chain according to linear composition. The defining relation is
D(g . f)_x = Dg_{f(x)} . Df_x
Equality here means equality of local linear behavior at the relevant point; coordinate formulas and Jacobian matrices are representations after choices have been made. Global accumulation, total area, net change, and integration remain outside the frame for one more step.
The chain rule says that local linearization respects composition.
Differentiation gives calculus a way to pass from continuous change to local linear behavior. The chain rule explains how that local behavior moves through composed processes.
But local behavior is still local.
How can the small contributions along an interval or region be assembled into a global effect?
And what kind of structure lets infinitesimal change accumulate into total change?