Chapter 1: Preliminaries
The goal of this book is to develop a set of principles that enable us to understand deep learning theoretically. One of the most important of these principles is that deep and wide neural networks are governed by nearly-Gaussian distributions. Thus, to make it through this book, you will need to master Gaussian integrals and perturbation theory. This chapter gives a quick introduction to these tools, along with the statistical concepts we will need. The only prerequisites are a working knowledge of linear algebra, multivariable calculus, and elementary probability theory.
With that in mind, we begin in §1.1 with a discussion of Gaussian integrals, focusing on the computation of expectation values of monomials against Gaussian distributions and culminating in a derivation of Wick's theorem.
Next, in §1.2, we start with a general discussion of expectation values and observables. Viewing observables as a way of learning about a probability distribution through repeated experiments, we introduce the statistical concepts of moments and cumulants, along with the corresponding physics concepts of full $M$-point correlators and connected $M$-point correlators. Particular emphasis is placed on the connected correlators, since they directly characterize a distribution's deviation from Gaussianity.
In §1.3, we introduce the negative log probability, or action, representation of a probability distribution and explain how the action provides a compact way of representing non-Gaussian distributions as systematic deformations of the Gaussian distribution. In particular, we focus on nearly-Gaussian distributions, whose deviations from Gaussianity are implemented by small couplings in the action, and show how perturbation theory can be used to connect the non-Gaussian couplings to observables such as the connected correlators. By treating such couplings perturbatively, we can turn the correlators of any nearly-Gaussian distribution into sums of Gaussian integrals; each of those integrals can then be evaluated with the tools introduced in §1.1. This is one of the most important tricks in this book, since the neural networks we will study are essentially always governed by nearly-Gaussian distributions, and their non-Gaussian couplings become perturbatively small as the networks become wide.
Since all of these manipulations will be needed over and over, the text, formulas, and examples in this chapter are somewhat verbose, with the goal of making the material as transparent and accessible as possible.
1.1 Gaussian Integrals
The goal of this section is to introduce Gaussian integrals and Gaussian probability distributions, culminating in a derivation of Wick's theorem (1.45). This theorem provides a basic formula for computing any moment of a multivariable Gaussian distribution.
1.1.1 Single-variable Gaussian integrals
Let's start with the simplest single-variable Gaussian function,
$$
e^{-\frac{z^{2}}{2}} \tag{1.1}
$$
The graph of this function depicts the famous bell curve, symmetric around its peak at $z=0$ and quickly tapering off for large $|z| \gg 1$. By itself, (1.1) cannot serve as a probability distribution because it is not normalized. To find the correct normalization, we need to perform the Gaussian integral
$$
I_{1} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}} \tag{1.2}
$$
There's a neat trick for evaluating this integral. First, consider its square:
$$
I_{1}^{2}=\left(\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}}\right)^{2}=\int_{-\infty}^{\infty} d x e^{-\frac{x^{2}}{2}} \int_{-\infty}^{\infty} d y e^{-\frac{y^{2}}{2}}=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d x d y e^{-\frac{1}{2}\left(x^{2}+y^{2}\right)} \tag{1.3}
$$
where we have simply renamed the dummy integration variables. Next, we change variables to polar coordinates $(x, y)=(r \cos \phi, r \sin \phi)$, which transforms the integration measure as $d x\, d y=r\, d r\, d \phi$ and leaves us with two elementary integrals:
$$
\begin{align*}
I_{1}^{2}=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d x d y e^{-\frac{1}{2}\left(x^{2}+y^{2}\right)} &=\int_{0}^{\infty} r d r \int_{0}^{2 \pi} d \phi e^{-\frac{r^{2}}{2}} \\
&=2 \pi \int_{0}^{\infty} d r r e^{-\frac{r^{2}}{2}}=2 \pi\left|-e^{-\frac{r^{2}}{2}}\right|_{r=0}^{r=\infty}=2 \pi \tag{1.4}
\end{align*}
$$
Taking the square root, we can evaluate the Gaussian integral (1.2) as
$$
I_{1}=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}}=\sqrt{2 \pi} \tag{1.5}
$$
Dividing the Gaussian function by this normalization factor, we can define the Gaussian probability distribution with unit variance as
$$
p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^{2}}{2}} \tag{1.6}
$$
which is now correctly normalized, i.e., $\int_{-\infty}^{\infty} d z\, p(z)=1$. This distribution with zero mean and unit variance is also known as the standard normal distribution.
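If you'd like to see this normalization concretely, here is a minimal numerical sketch (assuming NumPy is available; the grid endpoints and spacing are arbitrary choices) that approximates the integral (1.2) on a finite grid and checks that (1.6) integrates to one:

```python
import numpy as np

# Approximate the Gaussian integral (1.2) by a Riemann sum on a wide grid;
# the tails beyond |z| = 10 contribute negligibly.
z, dz = np.linspace(-10.0, 10.0, 200_001, retstep=True)
I_1 = np.sum(np.exp(-z**2 / 2)) * dz

print(I_1, np.sqrt(2 * np.pi))                              # both ~ 2.5066, as in (1.5)
print(np.sum(np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)) * dz)  # ~ 1.0, so (1.6) is properly normalized
```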
Extending this result to a Gaussian distribution with variance $K>0$ is easy. The corresponding normalization factor is given by
$$
I_{K} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}}=\sqrt{K} \int_{-\infty}^{\infty} d u e^{-\frac{u^{2}}{2}}=\sqrt{2 \pi K} \tag{1.7}
$$
where we rescaled the integration variable as $u=z / \sqrt{K}$. We can then define the Gaussian distribution with variance $K$ as
$$
p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^{2}}{2 K}} \tag{1.8}
$$
The graph of this distribution again depicts a bell curve symmetric around $z=0$, but now equipped with a scale $K$ that characterizes its width, tapering off for $|z| \gg \sqrt{K}$. More generally, we can shift the center of the bell curve:
$$
p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{(z-s)^{2}}{2 K}} \tag{1.9}
$$
which is now symmetric around $z=s$. This central value $s$ is called the mean of the distribution, since it is indeed the expectation value:
$$
\begin{align*}
\mathbb{E}[z] \equiv \int_{-\infty}^{\infty} d z p(z) z &=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{(z-s)^{2}}{2 K}} z \\
&=\frac{1}{I_{K}} \int_{-\infty}^{\infty} d w e^{-\frac{w^{2}}{2 K}}(s+w) \\
&=\frac{s I_{K}}{I_{K}}+\frac{1}{I_{K}} \int_{-\infty}^{\infty} d w\left(e^{-\frac{w^{2}}{2 K}} w\right) \\
&=s \tag{1.10}
\end{align*}
$$
where we changed variables as $w=z-s$ and noted in the last step that the integrand of the second term is odd under $w \leftrightarrow-w$, so that integral vanishes.
Focusing on the zero-mean Gaussian distribution, let's consider the expectation values of more general functions $\mathcal{O}(z)$, namely
$$
\mathbb{E}[\mathcal{O}(z)] \equiv \int_{-\infty}^{\infty} d z p(z) \mathcal{O}(z)=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} \mathcal{O}(z) \tag{1.11}
$$
We often refer to such functions $\mathcal{O}(z)$ as observables, as they correspond to the measurable outcomes of experiments. A special class of expectation values are called moments, corresponding to inserting $z^{M}$, for any integer $M$, as the observable $\mathcal{O}(z)$ in the integrand:
$$
\mathbb{E}\left[z^{M}\right]=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{M} \tag{1.12}
$$
Note that the integral vanishes for any odd $M$, since the integrand is then odd under $z \leftrightarrow-z$. For even $M=2 m$, we need to compute integrals of the form
$$
I_{K, m} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m} \tag{1.13}
$$
These integrals are nearly as old as the Gaussian integral (1.2) itself, and there's a similar trick for computing them:
$$
\begin{align*}
I_{K, m} &=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m}=\left(2 K^{2} \frac{d}{d K}\right)^{m} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}}=\left(2 K^{2} \frac{d}{d K}\right)^{m} I_{K} \\
&=\left(2 K^{2} \frac{d}{d K}\right)^{m} \sqrt{2 \pi} K^{\frac{1}{2}}=\sqrt{2 \pi} K^{\frac{2 m+1}{2}}(2 m-1)(2 m-3) \cdots 1 \tag{1.14}
\end{align*}
$$
where to get to the second line we substituted in the expression (1.7) for $I_{K}$.
The case $2 m=2$ makes clear why we call $K$ the variance: for a zero-mean Gaussian distribution with variance $K$, we have $\operatorname{var}(z) \equiv \mathbb{E}\left[(z-\mathbb{E}[z])^{2}\right]=\mathbb{E}\left[z^{2}\right]-\mathbb{E}[z]^{2}=\mathbb{E}\left[z^{2}\right]=K$.
Therefore, we see that the even moments are given by the simple formula
$$
\mathbb{E}\left[z^{2 m}\right]=\frac{I_{K, m}}{\sqrt{2 \pi K}}=K^{m}(2 m-1) ! ! \tag{1.15}
$$
where we have introduced the double factorial:
$$
(2 m-1) ! ! \equiv(2 m-1)(2 m-3) \cdots 1=\frac{(2 m) !}{2^{m} m !} \tag{1.16}
$$
Equation (1.15) is Wick's theorem for the single-variable Gaussian distribution.
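As a quick sanity check of (1.15), here is a sketch (assuming NumPy is available; the variance $K=1.7$ is an arbitrary choice) that computes the first few even moments by direct numerical integration and compares them with $K^{m}(2 m-1) ! !$:

```python
import numpy as np

def double_factorial(n):
    """Computes n!! = n (n-2) (n-4) ... for positive odd n; returns 1 for n <= 0."""
    return int(np.prod(np.arange(n, 0, -2))) if n > 0 else 1

K = 1.7                                                   # an arbitrary positive variance
z, dz = np.linspace(-30.0, 30.0, 600_001, retstep=True)
p = np.exp(-z**2 / (2 * K)) / np.sqrt(2 * np.pi * K)      # the Gaussian distribution (1.8)

for m in range(1, 5):
    moment = np.sum(p * z**(2 * m)) * dz                  # E[z^{2m}] by direct integration (1.12)
    wick = K**m * double_factorial(2 * m - 1)             # K^m (2m-1)!! from Wick's theorem (1.15)
    print(2 * m, moment, wick)
```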
There's actually another nice way to derive (1.15), one that extends more naturally to multivariable Gaussian distributions. The derivation begins by considering a Gaussian integral with a source term $J$, which we define as:
$$
Z_{K, J} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}+J z} \tag{1.17}
$$
Note that setting the source to zero recovers the normalization of the Gaussian integral, $Z_{K, J=0}=I_{K}$. In the physics literature $Z_{K, J}$ is sometimes called the partition function with source, and this integral will serve as a generating function for the moments. We can evaluate $Z_{K, J}$ by completing the square in the exponent:
$$
-\frac{z^{2}}{2 K}+J z=-\frac{(z-J K)^{2}}{2 K}+\frac{K J^{2}}{2},
$$
which lets us rewrite the integral (1.17) as:
$$
Z_{K, J}=e^{\frac{K J^{2}}{2}} \int_{-\infty}^{\infty} d z e^{-\frac{(z-J K)^{2}}{2 K}}=e^{\frac{K J^{2}}{2}} I_{K}=e^{\frac{K J^{2}}{2}} \sqrt{2 \pi K},
$$
where for the middle equality we noticed that the integrand is just a shifted Gaussian function with variance $K$.
We can now relate the Gaussian integral with source, $Z_{K, J}$, to the Gaussian integrals with insertions, $I_{K, m}$. Differentiating $Z_{K, J}$ with respect to the source $J$ and then setting the source to zero, we get:
$$
I_{K, m}=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}+J z}\right]\right|_{J=0}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} Z_{K, J}\right]\right|_{J=0}
$$
In other words, the integrals $I_{K, m}$ are related to the even Taylor coefficients of the partition function $Z_{K, J}$ expanded around $J=0$. For example, for $2 m=2$ we have
$$
\mathbb{E}\left[z^{2}\right]=\frac{I_{K, 1}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{2} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left[e^{\frac{K J^{2}}{2}}\left(K+K^{2} J^{2}\right)\right]\right|_{J=0}=K,
$$
and for $2 m=4$ we have
$$
\mathbb{E}\left[z^{4}\right]=\frac{I_{K, 2}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{4} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left[e^{\frac{K J^{2}}{2}}\left(3 K^{2}+6 K^{3} J^{2}+K^{4} J^{4}\right)\right]\right|_{J=0}=3 K^{2} .
$$
Note that any term with a dangling source $J$ vanishes when we set $J=0$. This observation provides a simple way to evaluate such moments: Taylor-expand the exponential $Z_{K, J} / I_{K}=\exp \left(\frac{K J^{2}}{2}\right)$ and keep only the term with just the right number of sources so that the expression does not vanish. This gives
$$
\begin{align*}
\mathbb{E}\left[z^{2 m}\right] &=\frac{I_{K, m}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left\{\left(\frac{d}{d J}\right)^{2 m}\left[\sum_{k=0}^{\infty} \frac{1}{k !}\left(\frac{K}{2}\right)^{k} J^{2 k}\right]\right\}\right|_{J=0} \tag{1.23}\\
&=\left(\frac{d}{d J}\right)^{2 m}\left[\frac{1}{m !}\left(\frac{K}{2}\right)^{m} J^{2 m}\right]=K^{m} \frac{(2 m) !}{2^{m} m !}=K^{m}(2 m-1) ! !
\end{align*}
$$
This completes our second derivation of Wick's theorem (1.15) for the single-variable Gaussian distribution. This derivation is much longer than the first, simpler one, but it extends naturally to multivariable Gaussian distributions.
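The generating-function derivation is also easy to check symbolically. Here is a sketch assuming SymPy is available: we differentiate the closed form $Z_{K, J} / I_{K}=\exp \left(K J^{2} / 2\right)$ with respect to $J$, set $J=0$, and recover $K^{m}(2 m-1) ! !$ for the first few even moments:

```python
import sympy as sp

J, K = sp.symbols('J K', positive=True)
generating = sp.exp(K * J**2 / 2)        # Z_{K,J} / I_K, the moment generating function

for m in range(1, 5):
    # (d/dJ)^{2m} of the generating function, evaluated at J = 0, as in the derivation above
    moment = sp.diff(generating, J, 2 * m).subs(J, 0)
    print(2 * m, sp.simplify(moment), K**m * sp.factorial2(2 * m - 1))
```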
1.1.2 Multivariable Gaussian integrals
Picking up speed, we are now ready to handle multivariable Gaussian integrals for an $N$-dimensional variable $z_{\mu}$ with $\mu=1, \ldots, N .{ }^{2}$
Throughout this book, we will explicitly write out the component indices of vectors, matrices, and tensors as much as possible, except on some occasions when it is clear enough from context.
The multivariable Gaussian function is defined as
$$
\exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right]
$$
where the variance or covariance matrix $K_{\mu \nu}$ is an $N$-by-$N$ symmetric positive definite matrix, and its inverse $\left(K^{-1}\right)_{\mu \nu}$ is defined so that their matrix product gives the $N$-by-$N$ identity matrix
$$
\sum_{\rho=1}^{N}\left(K^{-1}\right)_{\mu \rho} K_{\rho \nu}=\delta_{\mu \nu}
$$
Here we have also introduced the Kronecker delta $\delta_{\mu \nu}$, which satisfies
$$
\delta_{\mu \nu} \equiv \begin{cases}1, & \mu=\nu \\ 0, & \mu \neq \nu\end{cases}
$$
The Kronecker delta is just a convenient representation of the identity matrix.
Now, to construct a probability distribution from the Gaussian function (1.24), we again need to evaluate the normalization factor
$$
\begin{aligned}
I_{K} & \equiv \int d^{N} z \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] \\
&=\int_{-\infty}^{\infty} d z_{1} \int_{-\infty}^{\infty} d z_{2} \cdots \int_{-\infty}^{\infty} d z_{N} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] .
\end{aligned}
$$
To compute this integral, first recall from linear algebra that, given an $N$-by-$N$ symmetric matrix $K_{\mu \nu}$, there is always an orthogonal matrix${ }^{3}$ $O_{\mu \nu}$ that diagonalizes $K_{\mu \nu}$ as $\left(O K O^{T}\right)_{\mu \nu}=\lambda_{\mu} \delta_{\mu \nu}$ with eigenvalues $\lambda_{\mu=1, \ldots, N}$ and diagonalizes its inverse as $\left(O K^{-1} O^{T}\right)_{\mu \nu}=\left(1 / \lambda_{\mu}\right) \delta_{\mu \nu}$. With this in mind, after twice inserting the identity matrix as $\delta_{\mu \nu}=\left(O^{T} O\right)_{\mu \nu}$, the sum in the exponent of the integral can be expressed in terms of the eigenvalues as
${ }^{3}$ An orthogonal matrix $O_{\mu \nu}$ is a matrix whose transpose $\left(O^{T}\right)_{\mu \nu}$ equals its inverse, i.e., $\left(O^{T} O\right)_{\mu \nu}=\delta_{\mu \nu}$.
$$
\begin{aligned}
\sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu} &=\sum_{\mu, \rho, \sigma, \nu=1}^{N} z_{\mu}\left(O^{T} O\right)_{\mu \rho}\left(K^{-1}\right)_{\rho \sigma}\left(O^{T} O\right)_{\sigma \nu} z_{\nu} \\
&=\sum_{\mu, \nu=1}^{N}(O z)_{\mu}\left(O K^{-1} O^{T}\right)_{\mu \nu}(O z)_{\nu} \\
&=\sum_{\mu=1}^{N} \frac{1}{\lambda_{\mu}}(O z)_{\mu}^{2}
\end{aligned}
$$
where to reach the final line we used the diagonalization property of the inverse covariance matrix. Remembering that for a positive definite matrix $K_{\mu \nu}$ the eigenvalues are all positive, $\lambda_{\mu}>0$, we see that each $\lambda_{\mu}$ sets the scale of the falloff of the Gaussian function in its eigendirection. Next, recall from multivariable calculus that a change of variables $u_{\mu} \equiv(O z)_{\mu}$ with an orthogonal matrix $O$ leaves the integration measure invariant, i.e., $d^{N} z=d^{N} u$. All together, this lets us factorize the multivariable Gaussian integral (1.27) into a product of single-variable Gaussian integrals (1.7), yielding
$$
\begin{aligned}
I_{K} &=\int_{-\infty}^{\infty} d u_{1} \int_{-\infty}^{\infty} d u_{2} \cdots \int_{-\infty}^{\infty} d u_{N} \exp \left(-\frac{u_{1}^{2}}{2 \lambda_{1}}-\frac{u_{2}^{2}}{2 \lambda_{2}}-\ldots-\frac{u_{N}^{2}}{2 \lambda_{N}}\right) \\
&=\prod_{\mu=1}^{N}\left[\int_{-\infty}^{\infty} d u_{\mu} \exp \left(-\frac{u_{\mu}^{2}}{2 \lambda_{\mu}}\right)\right]=\prod_{\mu=1}^{N} \sqrt{2 \pi \lambda_{\mu}}=\sqrt{\prod_{\mu=1}^{N}\left(2 \pi \lambda_{\mu}\right)} .
\end{aligned}
$$
Finally, recall one last fact from linear algebra that the product of the eigenvalues of a matrix is equal to the matrix determinant. Thus, compactly, we can express the value of the multivariable Gaussian integral as
$$
I_{K}=\int d^{N} z \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right]=\sqrt{|2 \pi K|}
$$
where $|A|$ denotes the determinant of a square matrix $A$.
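As a concrete check of this formula, here is a small numerical sketch (assuming NumPy; the particular $2 \times 2$ covariance is an arbitrary choice) that compares the product of eigenvalues with the determinant and integrates the two-dimensional Gaussian function on a grid:

```python
import numpy as np

K = np.array([[2.0, 0.8],
              [0.8, 1.5]])                        # an arbitrary symmetric positive definite covariance
K_inv = np.linalg.inv(K)

# The product of the eigenvalues of 2*pi*K equals the determinant |2 pi K|.
lam = np.linalg.eigvalsh(K)
print(np.prod(2 * np.pi * lam), np.linalg.det(2 * np.pi * K))

# Direct grid integration of the N = 2 Gaussian function, reproducing sqrt(|2 pi K|) as in (1.30).
grid, dz = np.linspace(-12.0, 12.0, 1201, retstep=True)
z1, z2 = np.meshgrid(grid, grid, indexing='ij')
quadratic_form = K_inv[0, 0] * z1**2 + 2 * K_inv[0, 1] * z1 * z2 + K_inv[1, 1] * z2**2
I_K = np.sum(np.exp(-quadratic_form / 2)) * dz**2
print(I_K, np.sqrt(np.linalg.det(2 * np.pi * K)))
```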
Having figured out the normalization factor, we can define the zero-mean multivariable Gaussian probability distribution with variance $K_{\mu \nu}$ as
$$
p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] .
$$
While we're at it, let us also introduce the convention of suppressing the superscript "$-1$" for the inverse covariance $\left(K^{-1}\right)_{\mu \nu}$, instead placing the component indices upstairs as
$$
K^{\mu \nu} \equiv\left(K^{-1}\right)_{\mu \nu} .
$$
This way, we distinguish the covariance $K_{\mu \nu}$ and the inverse covariance $K^{\mu \nu}$ by whether or not component indices are lowered or raised. With this notation, inherited from general relativity, the defining equation for the inverse covariance $(1.25)$ is written instead as
$$
\sum_{\rho=1}^{N} K^{\mu \rho} K_{\rho \nu}=\delta_{\nu}^{\mu}
$$
and the multivariable Gaussian distribution (1.31) is written as
$$
p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right)
$$
Although it might take some getting used to, this notation saves us some space and saves you some handwriting pain.
If you like, in your notes you can also go full general-relativistic mode and adopt the Einstein summation convention, suppressing the summation symbol any time indices are repeated in upstairs-downstairs pairs. For instance, if we adopted this convention we would write the defining equation for the inverse simply as $K^{\mu \rho} K_{\rho \nu}=\delta^{\mu}{ }_{\nu}$ and the Gaussian function as $\exp \left(-\frac{1}{2} z_{\mu} K^{\mu \nu} z_{\nu}\right)$.
Specifically for neural networks, you might find the Einstein summation convention helpful for sample indices, but sometimes confusing for neural indices. For extra clarity, we won’t adopt this convention in the text of the book, but we mention it now since we do often use such a convention to simplify our own calculations in private.
Regardless of how it's written, the zero-mean multivariable Gaussian probability distribution (1.34) peaks at $z=0$, and its falloff is direction-dependent, determined by the covariance matrix $K_{\mu \nu}$. More generally, we can shift the peak of the Gaussian distribution to $s_{\mu}$
$$
p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N}(z-s)_{\mu} K^{\mu \nu}(z-s)_{\nu}\right]
$$
which defines a general multivariable Gaussian distribution with mean $\mathbb{E}\left[z_{\mu}\right]=s_{\mu}$ and covariance $K_{\mu \nu}$. This is the most general version of the Gaussian distribution.
Next, let’s consider the moments of the mean-zero multivariable Gaussian distribution
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{M}}\right] & \equiv \int d^{N} z p(z) z_{\mu_{1}} \cdots z_{\mu_{M}} \\
&=\frac{1}{\sqrt{|2 \pi K|}} \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}}=\frac{I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}}{I_{K}},
\end{aligned}
$$
where we introduced multivariable Gaussian integrals with insertions
$$
I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)} \equiv \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}} .
$$
Following our approach in the single-variable case, let’s construct the generating function for the integrals $I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}$ by including a source term $J^{\mu}$ as
$$
Z_{K, J} \equiv \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}+\sum_{\mu=1}^{N} J^{\mu} z_{\mu}\right) \text {. }
$$
As the name suggests, differentiating the generating function $Z_{K, J}$ with respect to the source $J^{\mu}$ brings down a power of $z_{\mu}$ such that after $M$ such differentiations we have
$$
\begin{aligned}
& {\left.\left[\frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \cdots \frac{d}{d J^{\mu_{M}}} Z_{K, J}\right]\right|_{J=0} } \\
=& \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}}=I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)} .
\end{aligned}
$$
So, as in the single-variable case, the Taylor coefficients of the partition function $Z_{K, J}$ expanded around $J^{\mu}=0$ are simply related to the integrals with insertions $I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}$. Therefore, if we knew a closed-form expression for $Z_{K, J}$, we could easily compute the values of the integrals $I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}$.
To evaluate the generating function $Z_{K, J}$ in a closed form, again we follow the lead of the single-variable case and complete the square in the exponent of the integrand in (1.38) as
$$
\begin{aligned}
&-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}+\sum_{\mu=1}^{N} J^{\mu} z_{\mu} \\
=&-\frac{1}{2} \sum_{\mu, \nu=1}^{N}\left(z_{\mu}-\sum_{\rho=1}^{N} K_{\mu \rho} J^{\rho}\right) K^{\mu \nu}\left(z_{\nu}-\sum_{\lambda=1}^{N} K_{\nu \lambda} J^{\lambda}\right)+\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu} \\
=&-\frac{1}{2} \sum_{\mu, \nu=1}^{N} w_{\mu} K^{\mu \nu} w_{\nu}+\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}
\end{aligned}
$$
where we have introduced the shifted variable $w_{\mu} \equiv z_{\mu}-\sum_{\rho=1}^{N} K_{\mu \rho} J^{\rho}$. Using this substitution, the generating function can be evaluated explicitly
$$
\begin{aligned}
Z_{K, J} &=\exp \left(\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right) \int d^{N} w \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} w_{\mu} K^{\mu \nu} w_{\nu}\right] \\
&=\sqrt{|2 \pi K|} \exp \left(\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)
\end{aligned}
$$
where at the end we used our formula for the multivariable integral $I_{K}$, (1.30). With our closed-form expression (1.41) for the generating function $Z_{K, J}$, we can compute the Gaussian integrals with insertions $I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}$ by differentiating it, using (1.39). For an even number $M=2 m$ of insertions, we find a really nice formula
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right] &=\frac{I_{K,\left(\mu_{1}, \ldots, \mu_{2 m}\right)}}{I_{K}}=\left.\frac{1}{I_{K}}\left[\frac{d}{d J^{\mu_{1}}} \cdots \frac{d}{d J^{\mu_{2 m}}} Z_{K, J}\right]\right|_{J=0} \\
&=\frac{1}{2^{m} m !} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \cdots \frac{d}{d J^{\mu_{2 m}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)^{m}
\end{aligned}
$$
For an odd number $M=2 m+1$ of insertions, there is a dangling source upon setting $J=0$, and so those integrals vanish. You can also see this by looking at the integrand for any odd moment and noticing that it is odd with respect to the sign flip of the integration variables $z_{\mu} \leftrightarrow-z_{\mu}$.
Now, let’s take a few moments to evaluate a few moments using this formula. For $2 m=2$, we have
$$
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\frac{1}{2} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)=K_{\mu_{1} \mu_{2}}
$$
Here, there are $2 !=2$ ways to apply the product rule for derivatives and differentiate the two $J$'s, both of which evaluate to the same expression due to the symmetry of the covariance, $K_{\mu_{1} \mu_{2}}=K_{\mu_{2} \mu_{1}}$. This expression (1.43) validates in the multivariable setting why we have been calling $K_{\mu \nu}$ the covariance, because we see explicitly that it is the covariance.
Next, for $2 m=4$ we get a more complicated expression
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] &=\frac{1}{2^{2} 2 !} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \frac{d}{d J^{\mu_{3}}} \frac{d}{d J^{\mu_{4}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)\left(\sum_{\rho, \lambda=1}^{N} J^{\rho} K_{\rho \lambda} J^{\lambda}\right) \\
&=K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{4}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{3}}
\end{aligned}
$$
Here we note that there are now $4 !=24$ ways to differentiate the four $J$ ‘s, though only three distinct ways to pair the four auxiliary indices $1,2,3,4$ that sit under $\mu$. This gives $24 / 3=8=2^{2} 2$ ! equivalent terms for each of the three pairings, which cancels against the overall factor $1 /\left(2^{2} 2 !\right)$.
For general $2 m$, there are $(2 m) !$ ways to differentiate the sources, of which $2^{m} m !$ of those ways are equivalent. This gives $(2 m) ! /\left(2^{m} m !\right)=(2 m-1) ! !$ distinct terms, corresponding to the $(2 m-1) ! !$ distinct pairings of the $2 m$ auxiliary indices $1, \ldots, 2 m$ that sit under $\mu$. The factor of $1 /\left(2^{m} m !\right)$ in the denominator of (1.42) ensures that the coefficient of each of these terms is normalized to unity. Thus, most generally, we can express the moments of the multivariable Gaussian with the following formula
$$
\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}},
$$
where, to reiterate, the sum is over all the possible distinct pairings of the $2 m$ auxiliary indices under $\mu$ such that the result has the $(2 m-1) ! !$ terms that we described above. Each factor of the covariance $K_{\mu \nu}$ in a term in the sum is called a Wick contraction, corresponding to a particular pairing of auxiliary indices. Each term then is composed of $m$ different Wick contractions, representing a distinct way of pairing up all the auxiliary indices. To make sure you understand how this pairing works, look back at the $2 m=2$ case (1.43) - with a single Wick contraction - and the $2 m=4$ case (1.44) - with three distinct ways of making two Wick contractions - and try to work out the $2 m=6$ case, which yields $(6-1) ! !=15$ distinct ways of making three Wick contractions:
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right] &=K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}} K_{\mu_{5} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{4}} K_{\mu_{5} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{3}} K_{\mu_{5} \mu_{6}} \\
&+K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{5}} K_{\mu_{4} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{5}} K_{\mu_{4} \mu_{6}}+K_{\mu_{1} \mu_{5}} K_{\mu_{2} \mu_{3}} K_{\mu_{4} \mu_{6}} \\
&+K_{\mu_{1} \mu_{2}} K_{\mu_{5} \mu_{4}} K_{\mu_{3} \mu_{6}}+K_{\mu_{1} \mu_{5}} K_{\mu_{2} \mu_{4}} K_{\mu_{3} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{5}} K_{\mu_{3} \mu_{6}} \\
&+K_{\mu_{1} \mu_{5}} K_{\mu_{3} \mu_{4}} K_{\mu_{2} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{5} \mu_{4}} K_{\mu_{2} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{5} \mu_{3}} K_{\mu_{2} \mu_{6}} \\
&+K_{\mu_{5} \mu_{2}} K_{\mu_{3} \mu_{4}} K_{\mu_{1} \mu_{6}}+K_{\mu_{5} \mu_{3}} K_{\mu_{2} \mu_{4}} K_{\mu_{1} \mu_{6}}+K_{\mu_{5} \mu_{4}} K_{\mu_{2} \mu_{3}} K_{\mu_{1} \mu_{6}}
\end{aligned}
$$
The formula (1.45) is Wick’s theorem. Put a box around it. Take a few moments for reflection.
$\cdots$
$\cdots$
$\cdots$
Good. You are now a Gaussian sensei. Exhale, and then say as Neo would say, “I know Gaussian integrals.”
Now that the moments have passed, it is an appropriate time to transition to the next section where you will learn about more general probability distributions.
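Before moving on to more general distributions, here is a Monte Carlo sketch of Wick's theorem (assuming NumPy; the dimension, covariance, and component indices are arbitrary choices), comparing an empirically estimated four-point correlator against the three Wick contractions of (1.44):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)          # an arbitrary symmetric positive definite covariance

# Draw many samples from the zero-mean multivariable Gaussian distribution (1.31).
z = rng.multivariate_normal(np.zeros(N), K, size=2_000_000)

mu = (0, 1, 2, 2)                    # component indices mu_1, ..., mu_4 (repetition is allowed)
empirical = np.mean(z[:, mu[0]] * z[:, mu[1]] * z[:, mu[2]] * z[:, mu[3]])
wick = (K[mu[0], mu[1]] * K[mu[2], mu[3]]
        + K[mu[0], mu[2]] * K[mu[1], mu[3]]
        + K[mu[0], mu[3]] * K[mu[1], mu[2]])   # the three distinct pairings in (1.44)
print(empirical, wick)               # agree up to Monte Carlo error
```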
1.2 Probability, Correlation and Statistics, and All That
In introducing the Gaussian distribution in the last section we briefly touched upon the concepts of expectation and moments. These are defined for non-Gaussian probability distributions too, so now let us reintroduce these concepts and expand on their definitions, with an eye towards understanding the nearly-Gaussian distributions that describe wide neural networks.
Given a probability distribution $p(z)$ of an $N$-dimensional random variable $z_{\mu}$, we can learn about its statistics by measuring functions of $z_{\mu}$. We’ll refer to such measurable functions in a generic sense as observables and denote them as $\mathcal{O}(z)$. The expectation value of an observable
$$
\mathbb{E}[\mathcal{O}(z)] \equiv \int d^{N} z p(z) \mathcal{O}(z)
$$
characterizes the mean value of the random function $\mathcal{O}(z)$. Note that the observable $\mathcal{O}(z)$ need not be a scalar-valued function, e.g. the second moment of a distribution is a matrix-valued observable given by $\mathcal{O}(z)=z_{\mu} z_{\nu}$.
Operationally, an observable is a quantity that we measure by conducting experiments in order to connect to a theoretical model for the underlying probability distribution describing $z_{\mu}$. In particular, we repeatedly measure the observables that are naturally accessible to us as experimenters, collect their statistics, and then compare them with predictions for the expectation values of those observables computed from some theoretical model of $p(z)$.
With that in mind, it’s very natural to ask: what kind of information can we learn about an underlying distribution $p(z)$ by measuring an observable $\mathcal{O}(z)$ ? For an a priori unknown distribution, is there a set of observables that can serve as a sufficient probe of $p(z)$ such that we could use that information to predict the result of all future experiments involving $z_{\mu}$ ?
Consider a class of observables that we’ve already encountered, the moments or $M$-point correlators of $z_{\mu}$, given by the expectation
In the rest of this book, we’ll often use the physics term $M$-point correlator rather than the statistics term moment, though they mean the same thing and can be used interchangeably.
$$
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right]=\int d^{N} z p(z) z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}
$$
In principle, knowing the $M$-point correlators of a distribution lets us compute the expectation value of any analytic observable $\mathcal{O}(z)$ via Taylor expansion
$$
\begin{aligned}
\mathbb{E}[\mathcal{O}(z)] &=\mathbb{E}\left[\left.\sum_{M=0}^{\infty} \frac{1}{M !} \sum_{\mu_{1}, \ldots, \mu_{M}=1}^{N} \frac{\partial^{M} \mathcal{O}}{\partial z_{\mu_{1}} \cdots \partial z_{\mu_{M}}}\right|_{z=0} z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right] \\
&=\left.\sum_{M=0}^{\infty} \frac{1}{M !} \sum_{\mu_{1}, \ldots, \mu_{M}=1}^{N} \frac{\partial^{M} \mathcal{O}}{\partial z_{\mu_{1}} \cdots \partial z_{\mu_{M}}}\right|_{z=0} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right]
\end{aligned}
$$
where on the last line we took the Taylor coefficients out of the expectation by using the linearity property of the expectation, inherited from the linearity property of the integral in (1.46). As such, it’s clear that the collection of all the $M$-point correlators completely characterizes a probability distribution for all intents and purposes.
In fact, the moments offer a dual description of the probability distribution through either the Laplace transform or the Fourier transform. For instance, the Laplace transform of the probability distribution $p(z)$ is given by
$$
Z_{J} \equiv \mathbb{E}\left[\exp \left(\sum_{\mu} J^{\mu} z_{\mu}\right)\right]=\int\left[\prod_{\mu} d z_{\mu}\right] p(z) \exp \left(\sum_{\mu} J^{\mu} z_{\mu}\right)
$$
As in the Gaussian case, this integral gives a generating function for the $M$-point correlators of $p(z)$, which means that $Z_{J}$ can be reconstructed from these correlators. The probability distribution can then be obtained through the inverse Laplace transform.
However, this description in terms of all the correlators is somewhat cumbersome and operationally infeasible. To get a reliable estimate of the $M$-point correlator, we must simultaneously measure $M$ components of a random variable for each draw and repeat such measurements many times. As $M$ grows, this task quickly becomes impractical. In fact, if we could easily perform such measurements for all $M$, then our theoretical model of $p(z)$ would no longer be a useful abstraction; from (1.48) we would already know the outcome of all possible experiments that we could perform, leaving nothing for us to predict.
To that point, essentially all useful distributions can be effectively described in terms of a finite number of quantities, giving them a parsimonious representation. For instance, consider the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu \nu}$. The nonzero $2m$-point correlators are given by Wick's theorem (1.45) as
$$
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{2 m}}\right]=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}}
$$
and are determined entirely by the $N(N+1) / 2$ independent components of the variance $K_{\mu \nu}$. The variance itself can be estimated by measuring the two-point correlator
$$
\mathbb{E}\left[z_{\mu} z_{\nu}\right]=K_{\mu \nu}
$$
This is consistent with our description of the distribution itself as "the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu \nu}$," in which we only had to specify this same set of numbers, $K_{\mu \nu}$, to pick out the particular distribution we had in mind. For zero-mean Gaussian distributions, there's no reason to measure or keep track of any of the higher-point correlators as they are completely constrained by the variance through (1.50).
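Operationally, this is exactly the kind of measurement described above: repeatedly sample $z_{\mu}$, estimate the two-point correlator, and read off the variance. A minimal sketch (assuming NumPy; the covariance used to generate the "experimental" data is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
A = rng.normal(size=(N, N))
K_true = A @ A.T + N * np.eye(N)      # the "unknown" covariance that our experiments probe

# Repeated measurements: many draws of the N-component random variable z_mu.
z = rng.multivariate_normal(np.zeros(N), K_true, size=500_000)

# Estimate the two-point correlator E[z_mu z_nu] from the data.
K_measured = z.T @ z / len(z)
print(np.max(np.abs(K_measured - K_true)))   # small: these N(N+1)/2 numbers fix the whole distribution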
More generally, it would be nice if there were a systematic way for learning about non-Gaussian probability distributions without performing an infinite number of experiments. For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.
Outside of this chapter, just as we’ll often use the term $M$-point correlator rather than the term moment, we’ll use the term $M$-point connected correlator rather than the term cumulant. When we want to refer to the moment and not the cumulant, we might sometimes say full correlator to contrast with connected correlator.
As the formal definition of these quantities is somewhat cumbersome and unintuitive, let’s start with a few simple examples.
The first cumulant or the connected one-point correlator is the same as the full one-point correlator
$$
\left.\mathbb{E}\left[z_{\mu}\right]\right|_{\text {connected }} \equiv \mathbb{E}\left[z_{\mu}\right] .
$$
This is just the mean of the distribution. The second cumulant or the connected two-point correlator is given by
$$
\begin{aligned}
\left.\mathbb{E}\left[z_{\mu} z_{\nu}\right]\right|_{\text {connected }} & \equiv \mathbb{E}\left[z_{\mu} z_{\nu}\right]-\mathbb{E}\left[z_{\mu}\right] \mathbb{E}\left[z_{\nu}\right] \\
&=\mathbb{E}\left[\left(z_{\mu}-\mathbb{E}\left[z_{\mu}\right]\right)\left(z_{\nu}-\mathbb{E}\left[z_{\nu}\right]\right)\right] \equiv \operatorname{Cov}\left[z_{\mu}, z_{\nu}\right]
\end{aligned}
$$
which is also known as the covariance of the distribution. Note how the mean is subtracted from the random variable $z_{\mu}$ before taking the square in the connected version. The quantity $\widehat{\Delta z}_{\mu} \equiv z_{\mu}-\mathbb{E}\left[z_{\mu}\right]$ represents a fluctuation of the random variable around its mean. Intuitively, such fluctuations are equally likely to contribute positively as they are likely to contribute negatively, $\mathbb{E}\left[\widehat{\Delta z}_{\mu}\right]=\mathbb{E}\left[z_{\mu}\right]-\mathbb{E}\left[z_{\mu}\right]=0$, so it's necessary to take the square in order to get an estimate of the magnitude of such fluctuations.
At this point, let us restrict our focus to distributions that are invariant under a sign-flip symmetry $z_{\mu} \rightarrow-z_{\mu}$, which holds for the zero-mean Gaussian distribution (1.34). Importantly, this parity symmetry will also hold for the nearly-Gaussian distributions that we will study in order to describe neural networks. For all such even distributions with this symmetry, all odd moments and all odd-point connected correlators vanish.
With this restriction, the next simplest observable is the fourth cumulant or the connected four-point correlator, given by the formula
$$
\begin{aligned}
&\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\
=& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \\
&-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right]
\end{aligned}
$$
For the Gaussian distribution, recalling the Wick theorem (1.50), the last three terms precisely subtract off the three pairs of Wick contractions used to evaluate the first term, meaning
$$
\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=0 .
$$
Essentially by design, the connected four-point correlator vanishes for the Gaussian distribution, and a nonzero value signifies a deviation from Gaussian statistics.
In statistics, the connected four-point correlator for a single random variable $z$ is called the excess kurtosis when normalized by the square of the variance. It is a natural measure of the tails of the distribution, as compared to a Gaussian distribution, and also serves as a measure of the potential for outliers. In particular, a positive value indicates fatter tails while a negative value indicates thinner tails.
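To build intuition, here is a small sketch (assuming NumPy; the choice of comparison distributions is ours) that estimates the excess kurtosis of a Gaussian, a fatter-tailed Laplace distribution, and a thinner-tailed uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 4_000_000

def excess_kurtosis(z):
    """Connected four-point correlator of a single variable, normalized by the variance squared."""
    z = z - z.mean()
    var = np.mean(z**2)
    connected_4pt = np.mean(z**4) - 3 * var**2   # the single-variable version of (1.54)
    return connected_4pt / var**2

print(excess_kurtosis(rng.normal(size=n_samples)))          # ~  0    (Gaussian)
print(excess_kurtosis(rng.laplace(size=n_samples)))         # ~ +3    (fatter tails)
print(excess_kurtosis(rng.uniform(-1, 1, size=n_samples)))  # ~ -1.2  (thinner tails)
```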
In fact, the connected four-point correlator is perhaps the simplest measure of non-Gaussianity. Now that we have a little intuition, we are as ready as we’ll ever be to discuss the definition for the $M$-th cumulant or the $M$-point connected correlator. For completeness, we’ll give the general definition, before restricting again to distributions that are symmetric under parity $z_{\mu} \rightarrow-z_{\mu}$. The definition is inductive and somewhat counterintuitive, expressing the $M$-th moment in terms of connected correlators from degree 1 to $M$ :
$$
\begin{aligned}
& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right] \\
\equiv &\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right]\right|_{\text {connected }} \\
&+\sum_{\text {all subdivisions }}\left.\mathbb{E}\left[z_{\mu_{k_{1}^{[1]}}} \cdots z_{\mu_{k_{\nu_{1}}^{[1]}}}\right]\right|_{\text {connected }} \cdots\left.\mathbb{E}\left[z_{\mu_{k_{1}^{[s]}}} \cdots z_{\mu_{k_{\nu_{s}}^{[s]}}}\right]\right|_{\text {connected }},
\end{aligned}
$$
where the sum is over all the possible subdivisions of the $M$ variables into $s>1$ clusters of sizes $\left(\nu_{1}, \ldots, \nu_{s}\right)$ as $\left(k_{1}^{[1]}, \ldots, k_{\nu_{1}}^{[1]}\right), \ldots,\left(k_{1}^{[s]}, \ldots, k_{\nu_{s}}^{[s]}\right)$. By decomposing the $M$-th moment into a sum of products of connected correlators of degree $M$ and lower, we see that the connected $M$-point correlator corresponds to a new type of correlation that cannot be expressed by the connected correlators of a lower degree. We saw an example of this above when discussing the connected four-point correlator as a simple measure of non-Gaussianity.
To see how this abstract definition actually works, let’s revisit the examples. First, we trivially recover the relation between the mean and the one-point connected correlator
$$
\left.\mathbb{E}\left[z_{\mu}\right]\right|_{\text {connected }}=\mathbb{E}\left[z_{\mu}\right],
$$
as there is no subdivision of an $M=1$ variable into any smaller pieces. For $M=2$, the definition (1.56) gives
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] &=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}+\left.\mathbb{E}\left[z_{\mu_{1}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{2}}\right]\right|_{\text {connected }} \\
&=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}+\mathbb{E}\left[z_{\mu_{1}}\right] \mathbb{E}\left[z_{\mu_{2}}\right]
\end{aligned}
$$
Rearranging to solve for the connected two-point function in terms of the moments, we see that this is equivalent to our previous definition for the covariance (1.53).
At this point, let us again restrict to parity-symmetric distributions invariant under $z_{\mu} \rightarrow-z_{\mu}$, remembering that this means that all the odd-point connected correlators will vanish. For such distributions, evaluating the definition (1.56) for $M=4$ gives
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]=&\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\
&+\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\
&+\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\
&+\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right]\right|_{\text {connected }}
\end{aligned}
$$
Since $\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}$ when the mean vanishes, this is also just a rearrangement of our previous expression (1.54) for the connected four-point correlator for such zero-mean distributions.
In order to see something new, let us carry on for $M=6$ :
$$
\begin{aligned}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]=&\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
&+\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
&+[14 \text { other }(2,2,2) \text { subdivisions }] \\
&+\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
&+[14 \text { other }(4,2) \text { subdivisions }]
\end{aligned}
$$
in which we have expressed the full six-point correlator in terms of a sum of products of connected two-point, four-point, and six-point correlators. Rearranging the above expression and expressing the two-point and four-point connected correlators in terms of their definitions, (1.53) and (1.54), we obtain an expression for the connected six-point correlator:
$$
\begin{aligned}
&\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
=& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right] \\
&-\left\{\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]+[14 \text { other }(4,2) \text { subdivisions }]\right\} \\
&+2\left\{\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]+[14 \text { other }(2,2,2) \text { subdivisions }]\right\}
\end{aligned}
$$
The rearrangement is useful for computational purposes, in that it’s simple to first compute the moments of a distribution and then organize the resulting expressions in order to evaluate the connected correlators.
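To make this concrete, here is a sketch (assuming NumPy; the toy non-Gaussian distribution, a linear mixture of independent uniform variables, is our own choice) that estimates moments from samples and then organizes them into the connected four-point correlator (1.54):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_samples = 3, 2_000_000

def moment(z, *idx):
    """Estimates the full correlator E[z_{mu_1} ... z_{mu_M}] from samples of shape (n_samples, N)."""
    return np.mean(np.prod(z[:, list(idx)], axis=1))

def connected_4pt(z, a, b, c, d):
    """The connected four-point correlator (1.54) for a zero-mean, parity-symmetric distribution."""
    return (moment(z, a, b, c, d)
            - moment(z, a, b) * moment(z, c, d)
            - moment(z, a, c) * moment(z, b, d)
            - moment(z, a, d) * moment(z, b, c))

# A toy parity-symmetric, non-Gaussian distribution: a fixed linear mixture of independent uniforms.
W = rng.normal(size=(N, N))
z_nongauss = rng.uniform(-1, 1, size=(n_samples, N)) @ W.T
print(connected_4pt(z_nongauss, 0, 1, 2, 2))   # nonzero: a direct measure of non-Gaussianity

# A Gaussian with the matched covariance has a vanishing connected four-point correlator.
z_gauss = rng.multivariate_normal(np.zeros(N), np.cov(z_nongauss, rowvar=False), size=n_samples)
print(connected_4pt(z_gauss, 0, 1, 2, 2))      # ~ 0, up to Monte Carlo error
```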
Focusing back on (1.60), it's easy to see that the connected six-point correlator vanishes for Gaussian distributions. Remembering that the connected four-point correlator also vanishes for Gaussian distributions, we see that the fifteen $(2,2,2)$ subdivision terms are exactly equal to the fifteen terms generated by the Wick contractions resulting from evaluating the full correlator on the left-hand side of the equation. In fact, applying the general definition of connected correlators (1.56) to the zero-mean Gaussian distribution, we see inductively that all $M$-point connected correlators for $M>2$ will vanish.
To see this, note that if all the higher-point connected correlators vanish, then the definition (1.56) is equivalent to Wick's theorem (1.50), with the nonzero terms in (1.56) - the subdivisions into clusters of sizes $(2, \ldots, 2)$ - corresponding exactly to the different pairings in (1.50).
Thus, the connected correlators are a very natural measure of how a distribution deviates from Gaussianity.
With this in mind, we can finally define a nearly-Gaussian distribution as a distribution for which all the connected correlators for $M>2$ are small. ${ }^{10}$
As we discussed in §1.1, the variance sets the scale of the Gaussian distribution. For nearly-Gaussian distributions, we require that all $2 m$-point connected correlators be parametrically small when compared to an appropriate power of the variance, i.e., $\left|\left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]\right|_{\text {connected }}\right| \ll\left|K_{\mu \nu}\right|^{m}$, schematically.
In fact, the non-Gaussian distributions that describe neural networks generally have the property that, as the network becomes wide, the connected four-point correlator becomes small and the higher-point connected correlators become even smaller. For these nearly-Gaussian distributions, a few leading connected correlators give a concise and accurate description of the distribution, just as a few leading Taylor coefficients can give a good description of a function near the point of expansion.
1.3 Nearly-Gaussian Distributions
Now that we have defined nearly-Gaussian distributions in terms of measurable deviations from Gaussian statistics, i.e. via small but nonzero connected correlators, it’s natural to ask how we can link these observables to the actual functional form of the distribution, $p(z)$. We can make this connection through the action.
The action $S(z)$ is a function that defines a probability distribution $p(z)$ through the relation
$$
p(z) \propto e^{-S(z)} .
$$
In the statistics literature, the action $S(z)$ is sometimes called the negative log probability, but we will again follow the physics literature and call it the action. In order for (1.62) to make sense as a probability distribution, $p(z)$ needs to be normalizable so that we can satisfy
$$
\int d^{N} z p(z)=1
$$
That's where the normalization factor, or partition function,
$$
Z \equiv \int d^{N} z e^{-S(z)}
$$
comes in. After computing the partition function, we can define a probability distribution for a particular action $S(z)$ as
$$
p(z) \equiv \frac{e^{-S(z)}}{Z} .
$$
Conversely, given a probability distribution we can associate an action, $S(z)=-\log [p(z)]$, up to an additive ambiguity: the ambiguity arises because a constant shift in the action can be offset by the multiplicative factor in the partition function.
One convention is to pick the constant such that the action vanishes when evaluated at its global minimum.
The action is a very convenient way to approximate certain types of statistical processes, particularly those with nearly-Gaussian statistics. To demonstrate this, we’ll first start with the simplest action, which describes the Gaussian distribution, and then we’ll show how to systematically perturb it in order to include various non-Gaussianities.
1.3.1 Quadratic action and the Gaussian distribution
Since we already know the functional form of the Gaussian distribution, it’s simple to identify the action by reading it off from the exponent in (1.34)
$$
S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu},
$$
where, as a reminder, the matrix $K^{\mu \nu}$ is the inverse of the variance matrix $K_{\mu \nu}$. The partition function is given by the normalization integral $(1.30)$ that we computed in $\S 1.1$
$$
Z=\int d^{N} z e^{-S(z)}=I_{K}=\sqrt{|2 \pi K|} .
$$
This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.
As we will show next, integrals against the Gaussian distribution are a primitive for evaluating expectations against nearly-Gaussian distributions. Therefore, in order to differentiate between a general expectation and an integral against the Gaussian distribution, let us introduce a special bra-ket, or $\langle\cdot\rangle$ notation for computing Gaussian expectation values. For an observable $\mathcal{O}(z)$, define a Gaussian expectation as
$$
\langle\mathcal{O}(z)\rangle_{K} \equiv \frac{1}{\sqrt{|2 \pi K|}} \int\left[\prod_{\mu=1}^{N} d z_{\mu}\right] \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}\right) \mathcal{O}(z)
$$
In particular, with this notation we can write Wick’s theorem as
$$
\left\langle z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{2 m}}\right\rangle_{K}=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}} .
$$
If we're talking about a Gaussian distribution with variance $K_{\mu \nu}$, then we can use the notation $\mathbb{E}[\cdot]$ and $\langle\cdot\rangle_{K}$ interchangeably. If instead we're talking about a nearly-Gaussian distribution $p(z)$, then $\mathbb{E}[\cdot]$ indicates expectation with respect to $p(z)$, (1.46). However, in the evaluation of such an expectation, we'll often encounter Gaussian integrals, for which we'll use this bra-ket notation $\langle\cdot\rangle_{K}$ to simplify expressions.
1.3.2 Quartic action and perturbation theory
Now, let’s find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing
$$
\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=O(\epsilon) .
$$
Here we have introduced a small parameter $\epsilon \ll 1$ and indicated that the correlator should be of order $\epsilon$. For neural networks, we will later find that the role of the small parameter $\epsilon$ is played by 1 /width.
We should be able to generate a small connected four-point correlator by deforming the Gaussian distribution through the addition of a small quartic term to the quadratic action (1.66), giving us a quartic action
$$
S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\frac{\epsilon}{4 !} \sum_{\mu, \nu, \rho, \lambda=1}^{N} V^{\mu \nu \rho \lambda} z_{\mu} z_{\nu} z_{\rho} z_{\lambda}
$$
where the quartic coupling $\epsilon V^{\mu \nu \rho \lambda}$ is an $(N \times N \times N \times N)$-dimensional tensor that is completely symmetric in all of its four indices. The factor of $1 / 4 !$ is conventional in order to compensate for the overcounting in the sum due to the symmetry of the indices. While it’s not a proof of the connection, note that the coupling $\epsilon V^{\mu \nu \rho \lambda}$ has the right number of components to faithfully reproduce the four-point connected correlator (1.70), which is also an $(N \times N \times N \times N)$-dimensional symmetric tensor. At least from this perspective we’re off to a good start.
Let us now establish this correspondence between the quartic coupling and connected four-point correlator. Note that in general it is impossible to compute any expectation value in closed form with a non-Gaussian action - this includes even the partition function. Instead, in order to compute the connected four-point correlator we’ll need to employ perturbation theory to expand everything to first order in the small parameter $\epsilon$, each term of which can then be evaluated in a closed form. As this is easier done than said, let’s get to the computations.
To start, let’s evaluate the partition function:
$$
\begin{aligned}
Z &=\int\left[\prod_{\mu} d z_{\mu}\right] e^{-S(z)} \\
&=\int\left[\prod_{\mu} d z_{\mu}\right] \exp \left(-\frac{1}{2} \sum_{\mu, \nu} K^{\mu \nu} z_{\mu} z_{\nu}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right) \\
&=\sqrt{|2 \pi K|}\left\langle\exp \left(-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right)\right\rangle_{K}
\end{aligned}
$$
In the second line we inserted our expression for the quartic action (1.71), and in the last line we used our bra-ket notation (1.68) for a Gaussian expectation with variance $K_{\mu \nu}$. As advertised, the Gaussian expectation in the final line cannot be evaluated in closed form. However, since our parameter $\epsilon$ is small, we can Taylor-expand the exponential to express the partition function as a sum of simple Gaussian expectations that can be evaluated using Wick's theorem (1.69):
$$
\begin{aligned}
Z &=\sqrt{|2 \pi K|}\left\langle 1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}+O\left(\epsilon^{2}\right)\right\rangle_{K} \\
&=\sqrt{|2 \pi K|}\left[1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left\langle z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}+O\left(\epsilon^{2}\right)\right] \\
&=\sqrt{|2 \pi K|}\left[1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+K_{\rho_{1} \rho_{3}} K_{\rho_{2} \rho_{4}}+K_{\rho_{1} \rho_{4}} K_{\rho_{2} \rho_{3}}\right)+O\left(\epsilon^{2}\right)\right] \\
&=\sqrt{|2 \pi K|}\left[1-\frac{1}{8} \epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+O\left(\epsilon^{2}\right)\right]
\end{aligned}
$$
In the final line, we were able to combine the three $K^{2}$ terms together by using the total symmetry of the quartic coupling and then relabeling some of the summed-over dummy indices.
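For a single variable ($N=1$), this first-order formula is easy to test directly. Here is a sketch (assuming NumPy; the values of $K$ and $\epsilon V$ are arbitrary small-coupling choices) comparing a numerical evaluation of the partition function against $\sqrt{2 \pi K}\left(1-\frac{1}{8} \epsilon V K^{2}\right)$:

```python
import numpy as np

K, epsV = 1.3, 0.05                        # variance and a small quartic coupling, N = 1
z, dz = np.linspace(-20.0, 20.0, 400_001, retstep=True)

S = z**2 / (2 * K) + epsV * z**4 / 24      # the single-variable quartic action (1.71)
Z_numeric = np.sum(np.exp(-S)) * dz        # direct numerical integration of the partition function

Z_first_order = np.sqrt(2 * np.pi * K) * (1 - epsV * K**2 / 8)   # the O(epsilon) result (1.73)
print(Z_numeric, Z_first_order)            # agree up to O(epsilon^2) corrections
```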
Similarly, let’s evaluate the two-point correlator:
$$
\begin{aligned}
& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\frac{1}{Z} \int\left[\prod_{\mu} d z_{\mu}\right] e^{-S(z)} z_{\mu_{1}} z_{\mu_{2}} \\
=& \frac{\sqrt{|2 \pi K|}}{Z}\left\langle z_{\mu_{1}} z_{\mu_{2}} \exp \left(-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right)\right\rangle_{K} \\
=& \frac{\sqrt{|2 \pi K|}}{Z}\left[\left\langle z_{\mu_{1}} z_{\mu_{2}}\right\rangle_{K}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}+O\left(\epsilon^{2}\right)\right] \\
=& {\left[1+\frac{1}{8} \epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}\right] K_{\mu_{1} \mu_{2}} } \\
&-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(3 K_{\mu_{1} \mu_{2}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+12 K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\rho_{3} \rho_{4}}\right)+O\left(\epsilon^{2}\right) \\
=& K_{\mu_{1} \mu_{2}}-\frac{\epsilon}{2} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\rho_{3} \rho_{4}}+O\left(\epsilon^{2}\right) .
\end{aligned}
$$
Here, to go from the first line to the second line we inserted our expression for the quartic action (1.71) and rewrote the integral as a Gaussian expectation. Then, after expanding in $\epsilon$ to first order, in the next step we substituted (1.73) in for the partition function $Z$ in the denominator and expanded $1 / Z$ to first order in $\epsilon$ using the expansion $1 /(1-x)=1+x+O\left(x^{2}\right)$. In that same step, we also noted that, of the fifteen terms coming from the Gaussian expectation $\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}$, there are three ways in
which $z_{\mu_{1}}$ and $z_{\mu_{2}}$ contract with each other but twelve ways in which they don’t. Given again the symmetry of $V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$, this is the only distinction that matters.
At last, let’s compute the full four-point correlator:
To go from the first line to the second line we inserted our expression for the quartic action (1.71), expanded to first order in $\epsilon$, and rewrote in the bra-ket notation (1.68). On the third line, we again substituted in the expression (1.73) for the partition function $Z$, expanded $1 / Z$ to first order in $\epsilon$, and then used Wick's theorem (1.69) to evaluate the fourth and eighth Gaussian moments. (Yes, we know that the evaluation of $\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}$ is not fun. The breakdown of the terms depends again on whether or not the $\mu$-type indices are contracted with the $\rho$-type indices.) We can simplify this expression by noticing that some terms cancel due to $\frac{1}{8}-\frac{3}{24}=0$ and some other terms can be nicely regrouped once we notice through the expression for the two-point correlator (1.74) that
$$
\begin{aligned}
&K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(12 K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \mu_{4}} K_{\rho_{3} \rho_{4}}+12 K_{\mu_{3} \rho_{1}} K_{\mu_{4} \rho_{2}} K_{\mu_{1} \mu_{2}} K_{\rho_{3} \rho_{4}}\right) \\
&=\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]+O\left(\epsilon^{2}\right)
\end{aligned}
$$
yielding in the end
$$
\begin{aligned}
& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \\
=& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]+\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]+\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right] \\
&-\epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \rho_{3}} K_{\mu_{4} \rho_{4}}+O\left(\epsilon^{2}\right)
\end{aligned}
$$
Given the full four-point correlator (1.75) and the two-point correlator (1.74), we can finally evaluate the connected four-point correlator (1.54) as
$$
\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=-\epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \rho_{3}} K_{\mu_{4} \rho_{4}}+O\left(\epsilon^{2}\right) \text {. }
$$
This makes explicit the relationship between the connected four-point correlator and the quartic coupling in the action, when both are small. We see that for the nearly-Gaussian distribution realized by the quartic action (1.71), the distribution is - as promised nearly Gaussian: the strength of the coupling $\epsilon V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$ directly controls the distribution’s deviation from Gaussian statistics, as measured by the connected four-point correlator. This also shows that the four-index tensor $V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$ creates nontrivial correlations between the components $z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}$ that cannot otherwise be built up by the correlation $K_{\mu \nu}$ in any pair of random variables $z_{\mu} z_{\nu}$.
Finally, note that the connected two-point correlator (1.74) - i.e. the covariance of this nearly-Gaussian distribution - is also shifted from its Gaussian value of $K_{\mu_{1} \mu_{2}}$ by the quartic coupling $\epsilon V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$. Thus, the nearly-Gaussian deformation not only creates complicated patterns of four-point correlation as measured by the connected four-point correlator (1.78), it also can modify the details of the Gaussian two-point correlation.
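Both of these perturbative statements can be checked numerically for a single variable. Here is a sketch (assuming NumPy; the same arbitrary $K$ and $\epsilon V$ as in the earlier partition-function check) comparing the measured two-point and connected four-point correlators of the quartic-action distribution against $K-\frac{\epsilon}{2} V K^{3}$ and $-\epsilon V K^{4}$:

```python
import numpy as np

K, epsV = 1.3, 0.05
z, dz = np.linspace(-20.0, 20.0, 400_001, retstep=True)

weight = np.exp(-(z**2 / (2 * K) + epsV * z**4 / 24))   # unnormalized p(z) for the quartic action (1.71)
Z = np.sum(weight) * dz
E2 = np.sum(weight * z**2) * dz / Z                     # the two-point correlator E[z^2]
E4 = np.sum(weight * z**4) * dz / Z                     # the full four-point correlator E[z^4]

print(E2, K - epsV * K**3 / 2)            # (1.74): K - (epsilon/2) V K^3 + O(epsilon^2)
print(E4 - 3 * E2**2, -epsV * K**4)       # (1.78): connected four-point = -epsilon V K^4 + O(epsilon^2)
```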
Now that we see how to compute the statistics of a nearly-Gaussian distribution, let’s take a step back and think about what made this possible. We can perform these perturbative calculations any time there exists in the problem a dimensionless parameter $\epsilon$ that is small $\epsilon \ll 1$, but nonzero $\epsilon>0$. This makes perturbation theory an extremely powerful tool for theoretical analysis any time a problem has any extreme scales, small or large.
Importantly, this is directly relevant to theoretically understanding neural networks in practice. As we will explain in the following chapters, real networks have a parameter $n$ - the number of neurons in a layer - that is typically large $n \gg 1$, but certainly not infinite $n<\infty$. This means that we can expand the distributions that describe such networks in the inverse of the large parameter as $\epsilon=1 / n$. Indeed, when the parameter $n$ is large - as is typical in practice - the distributions that describe neural networks become nearly-Gaussian and thus theoretically tractable. This type of expansion is known as the $1 / n$ expansion or large- $n$ expansion and will be one of our main tools for learning the principles of deep learning theory.
Aside: statistical independence and interactions
The quartic action (1.71) is one of the simplest models of an interacting theory. We showed this explicitly by connecting the quartic coupling to the non-Gaussian statistics of the non-vanishing connected four-point correlator. Here, let us try to offer an intuitive meaning of interaction by appealing to the notion of statistical independence.
Recall from probability theory that two random variables $x$ and $y$ are statistically independent if their joint distribution factorizes as
$$
p(x, y)=p(x) p(y) .
$$
For the Gaussian distribution, if the variance matrix $K_{\mu \nu}$ is diagonal, there is no correlation at all between different components of $z_{\mu}$; they are manifestly statistically independent from each other.
Even if $K_{\mu \nu}$ is not diagonal, we can still unwind the correlation of a Gaussian distribution by rotating to the right basis. As discussed in $\S 1.1$, there always exists an orthogonal matrix $O$ that diagonalizes the covariance as $\left(O K O^{T}\right)_{\mu \nu}=\lambda_{\mu} \delta_{\mu \nu}$. In terms of the variables $u_{\mu} \equiv(O z)_{\mu}$, the distribution looks like
$$
p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left(-\sum_{\mu=1}^{N} \frac{u_{\mu}^{2}}{2 \lambda_{\mu}}\right)=\prod_{\mu=1}^{N}\left(\frac{e^{-\frac{u_{\mu}^{2}}{2 \lambda_{\mu}}}}{\sqrt{2 \pi \lambda_{\mu}}}\right)=p\left(u_{1}\right) \cdots p\left(u_{N}\right)
$$
Thus, we see that in the $u$-coordinate basis the original multivariable Gaussian distribution factorizes into $N$ single-variable Gaussians that are statistically independent.
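A short numerical check of this change of basis (an illustration with an arbitrarily chosen covariance): diagonalizing $K$ and rotating the samples by the corresponding orthogonal matrix $O$ produces components whose covariance is diagonal, i.e. statistically independent Gaussians.

```python
import numpy as np

# Numerical check of the change of basis above (illustrative covariance):
# rotating samples of a correlated Gaussian into the eigenbasis of K yields
# uncorrelated components u = O z.
rng = np.random.default_rng(1)

K = np.array([[2.0, 0.8],
              [0.8, 1.0]])              # a non-diagonal covariance K_{mu nu}
lam, eigvecs = np.linalg.eigh(K)        # K = eigvecs @ diag(lam) @ eigvecs.T
O = eigvecs.T                           # orthogonal matrix with O K O^T = diag(lam)

z = rng.multivariate_normal(mean=np.zeros(2), cov=K, size=500_000)
u = z @ O.T                             # u_mu = (O z)_mu, sample by sample

print(np.cov(u, rowvar=False).round(3)) # approximately diag(lambda_1, lambda_2)
print(np.diag(lam).round(3))
```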
We also see that in terms of the action, statistical independence is characterized by the action breaking into a sum over separate terms. This unwinding of interaction between variables is generically impossible when there are nonzero non-Gaussian couplings. For instance, there are $\sim N^{2}$ components of an orthogonal matrix $O_{\mu \nu}$ to change basis, while there are $\sim N^{4}$ components of the quartic coupling $\epsilon V^{\mu \nu \rho \lambda}$ that correlate random variables, so it is generically impossible to re-express the quartic action as a sum of functions of $N$ different variables. Since the action cannot be put into a sum over $N$ separate terms, the joint distribution cannot factorize, and the components will not be independent from each other. Thus, it is impossible to factor the nearly-Gaussian distribution into the product of $N$ statistically independent distributions. In this sense, what is meant by interaction is the breakdown of statistical independence. ${ }^{12}$
Nearly-Gaussian actions
Having given a concrete example in which we illustrated how to deform the quadratic action to realize the simplest nearly-Gaussian distribution, we now give a more general perspective on nearly-Gaussian distributions. In what follows, we will continue to require that our distributions are invariant under the parity symmetry that takes $z_{\mu} \rightarrow-z_{\mu}$. In the action representation, this corresponds to including only terms of even degree. ${ }^{13}$
${ }^{12}$ An astute reader might wonder if there is any interaction when we consider a single-variable distribution with $N=1$, since there are no other variables to interact with. For nearly-Gaussian distributions, even if $N=1$, we saw in (1.74) that the variance of the distribution is shifted from its Gaussian value, $K$, and depends on the quartic coupling $\epsilon V$. In physics, we say that this shift is due to the self-interaction induced by the quartic coupling $\epsilon V$, since it modifies the value of observables from the free Gaussian theory that we are comparing to, even though there's no notion of statistical independence to appeal to here.
Said another way, even though the action just involves one term, such a non-Gaussian distribution does not have a closed-form solution for its partition function or correlators; i.e. there's no trick that lets us compute integrals of the form $\int d z \, e^{-S(z)}$ exactly, when $S(z)=\frac{z^{2}}{2 K}+\frac{1}{4 !} \epsilon V z^{4}$. This means that we still have to make use of perturbation theory to analyze the self-interaction in such distributions.
${ }^{13}$ The imposition of such a parity symmetry, and thus the absence of odd-degree terms in the action, means that all of the odd moments and hence all of the odd-point connected correlators will vanish.
With that caveat in mind, though otherwise very generally, we can express a non-Gaussian distribution by deforming the Gaussian action as
$$
S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\sum_{m=2}^{k} \frac{1}{(2 m) !} \sum_{\mu_{1}, \ldots, \mu_{2 m}=1}^{N} s^{\mu_{1} \cdots \mu_{2 m}} z_{\mu_{1}} \cdots z_{\mu_{2 m}}, \tag{1.81}
$$
where the factor of $1/(2m)!$ is conventional, compensating for the overcounting in the sum due to the implied symmetry of the indices $\mu_{1}, \ldots, \mu_{2 m}$ in the coefficients $s^{\mu_{1} \cdots \mu_{2 m}}$, given the permutation symmetry of the product $z_{\mu_{1}} \cdots z_{\mu_{2 m}}$. The number of terms in the non-Gaussian part of the action is controlled by the integer $k$. If $k$ were unbounded, then $S(z)$ would be an arbitrary even function, and $p(z)$ could be any parity-symmetric distribution. The action is most useful when truncating the sum at a reasonably small $k$ - like $k=2$ for the quartic action - already yields a good representation of the statistical process of interest.
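To anchor the notation, truncating the sum at $k=2$ and identifying the single non-Gaussian coupling as $s^{\mu \nu \rho \lambda} \equiv \epsilon V^{\mu \nu \rho \lambda}$ recovers the quartic action (1.71) discussed above:
$$
S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\frac{\epsilon}{4 !} \sum_{\mu, \nu, \rho, \lambda=1}^{N} V^{\mu \nu \rho \lambda} z_{\mu} z_{\nu} z_{\rho} z_{\lambda} .
$$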
The coefficients $s^{\mu_{1} \cdots \mu_{2 m}}$ are generally known as non-Gaussian couplings, and they control the interactions of the $z_{\mu}$.
In a similar vein, the coefficient $K^{\mu \nu}$ in the action is sometimes called a quadratic coupling, since the coupling of the component $z_{\mu}$ with the component $z_{\nu}$ in the quadratic action leads to a nontrivial correlation, i.e. $\operatorname{Cov}\left[z_{\mu}, z_{\nu}\right]=K_{\mu \nu}$.
In particular, there is a direct correspondence between the product of the specific components $z_{\mu}$ that appear together in the action and the presence of connected correlation between those variables, with the degree of the term in (1.81) directly contributing to connected correlators of that degree. We saw an example of this in (1.78), which connected the quartic term to the connected four-point correlator. In this way, the couplings give a very direct way of controlling the degree and pattern of non-Gaussian correlation, and the overall degree of the action offers a way of systematically including more and more complicated patterns of such correlations.
If you recall from $\S 1.2$, we defined nearly-Gaussian distributions as ones for which all these connected correlators are small. Equivalently, from the action perspective, a nearly-Gaussian distribution is a non-Gaussian distribution with an action of the form (1.81) for which all the couplings $s^{\mu_{1} \cdots \mu_{2 m}}$ are parametrically small for all $2 \leq m \leq k$:
$$
\left|s^{\mu_{1} \cdots \mu_{2 m}}\right| \ll\left|K^{\mu \nu}\right|^{m}, \tag{1.82}
$$
where this equation is somewhat schematic given the mismatch of the indices.
This schematic equation is, nonetheless, dimensionally consistent. To support that remark, let us give a brief introduction to dimensional analysis: let the random variable $z_{\mu}$ have dimension $\zeta$, which we denote as $\left[z_{\mu}\right]=\zeta^{1}$. By dimension, you should have in mind something like a unit of length, so e.g. we read the expression $\left[z_{\mu}\right]=\zeta^{1}$ as “a component of $z$ is measured in units of $\zeta$.” The particular units are arbitrary: e.g. for length, we can choose between meters or inches or parsecs as long as we use a unit of length but not, say, meters ${ }^{2}$, which instead would be a unit of area. Importantly, we cannot add or equate quantities that have different units: it doesn’t make any logical sense to add a length to an area. This is similar to the concept of type safety in computer science, e.g. we should not add a type str variable to a type int variable.
Now, since the action $S(z)$ is the argument of an exponential $p(z) \propto e^{-S(z)}$, it must be dimensionless; otherwise, the exponential $e^{-S}=1-S+\frac{S^{2}}{2}+\ldots$ would violate the addition rule that we just described. From this dimensionless requirement for the action, we surmise that the inverse of the covariance matrix has dimension $\left[K^{\mu \nu}\right]=\zeta^{-2}$, and that the covariance itself has dimension $\left[K_{\mu \nu}\right]=\zeta^{2}$. Similarly, all the non-Gaussian couplings in (1.81) have dimensions $\left[s^{\mu_{1} \cdots \mu_{2 m}}\right]=\zeta^{-2 m}$. Thus, both sides of (1.82) have the same dimension, making this equation dimensionally consistent.
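To spell out the bookkeeping behind those assignments, note that with them each term in the action (1.81) is indeed dimensionless:
$$
\left[K^{\mu \nu} z_{\mu} z_{\nu}\right]=\zeta^{-2} \cdot \zeta \cdot \zeta=\zeta^{0}, \qquad\left[s^{\mu_{1} \cdots \mu_{2 m}} z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]=\zeta^{-2 m} \cdot \zeta^{2 m}=\zeta^{0},
$$
and both sides of (1.82) then carry the common dimension $\zeta^{-2 m}=\left(\zeta^{-2}\right)^{m}$.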
Even more concretely, consider the quartic action (1.71). If we let the tensorial part of the quartic coupling have dimensions $\left[V^{\mu \nu \rho \lambda}\right]=\zeta^{-4}$, then the parameter $\epsilon$ is dimensionless, as claimed. This means that we can consistently compare $\epsilon$ to unity, and its parametric smallness $\epsilon \ll 1$ means that the full quartic coupling $\epsilon V^{\mu \nu \rho \lambda}$ is much smaller than the square of the quadratic coupling, and that the connected four-point correlator (1.78) is much smaller than the square of the connected two-point correlator (1.74).
Importantly, the comparison is with an appropriate power of the inverse variance or quadratic coupling $K^{\mu \nu}$, since, as we already explained, the variance sets the scale of the Gaussian distribution against which we compare these nearly-Gaussian distributions.
As we will see in $\S 4$, wide neural networks are described by nearly-Gaussian distributions. In particular, we will find that such networks are described by a special type of nearly-Gaussian distribution where the connected correlators are hierarchically small, scaling as
$$
\left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]\right|_{\text {connected }}=O\left(\epsilon^{m-1}\right), \tag{1.83}
$$
with the same parameter $\epsilon$ controlling the different scalings for each of the $2 m$-point connected correlators. Importantly, the non-Gaussianities coming from higher-point connected correlators become parametrically less important as $\epsilon$ becomes smaller.
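Concretely, unpacking (1.83) for the first few values of $m$:
$$
\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}=O(1), \qquad \left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{4}}\right]\right|_{\text {connected }}=O(\epsilon), \qquad \left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{6}}\right]\right|_{\text {connected }}=O\left(\epsilon^{2}\right) .
$$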
This means that for a nearly-Gaussian distribution with hierarchical scalings (1.83), we can consistently approximate the distribution by truncating the action at some fixed order in $\epsilon$. To be concrete, we can use an action of the form (1.81) to faithfully represent all the correlations up to order $O\left(\epsilon^{k-1}\right)$, neglecting connected correlations of order $O\left(\epsilon^{k}\right)$ and higher. The resulting action offers a useful and effective description for the statistical process of interest, as long as $\epsilon$ is small enough and $k$ is high enough that $O\left(\epsilon^{k}\right)$ is negligible.
In practice, a quartic action (1.71) truncated to $k=2$ will let us model realistic finite-width neural networks. This quartic action captures the important qualitative difference between nearly-Gaussian distributions and the Gaussian distribution, incorporating nontrivial interactions between the different components of the random variable. In addition, the difference between the statistics (1.83) of a nearly-Gaussian distribution truncated to $O(\epsilon)$ versus one truncated to $O\left(\epsilon^{2}\right)$ is mostly quantitative: in both cases there are nontrivial non-Gaussian correlations, but the pattern of higher-order correlation differs only in a small way, with the difference suppressed as $O\left(\epsilon^{2}\right)$. In this way, the distribution represented by the quartic action is complex enough to capture the most salient non-Gaussian effects in neural networks while still being simple enough to be analytically tractable.