Chapter 1 Preliminaries

The goal of this book is to develop principles that allow us to understand deep learning theoretically. One of the most important of these principles is that deep and wide neural networks are governed by nearly-Gaussian distributions. Working through this book therefore requires familiarity with Gaussian integrals and perturbation theory, and this chapter gives a quick introduction to these tools, together with the necessary statistical concepts. The only prerequisites for the reader are a working knowledge of linear algebra, multivariable calculus, and elementary probability theory.

With that in mind, we begin in §1.1 with a discussion of Gaussian integrals, focusing on the computation of expectation values of monomials against Gaussian distributions and culminating in a derivation of Wick's theorem.

Next, in §1.2, we give a general discussion of expectation values and observables. Viewing observables as a way of learning about a probability distribution through repeated experiments, we introduce the statistical concept of cumulants and the corresponding physical concepts of full $M$-point correlators and connected $M$-point correlators. Particular emphasis is placed on the connected correlators, since they directly characterize a distribution's deviation from Gaussianity.

In §1.3, we introduce the negative log probability, or action, representation of a probability distribution and explain how the action gives a compact description of non-Gaussian distributions as systematic deformations of the Gaussian distribution. In particular, we focus on nearly-Gaussian distributions, whose deviations from Gaussianity are realized by small couplings in the action, and we show how perturbation theory connects the non-Gaussian couplings to observables such as the connected correlators. By treating such couplings perturbatively, the correlators of any nearly-Gaussian distribution can be turned into a sum of Gaussian integrals, each of which can then be evaluated with the tools introduced in §1.1. This is one of the most important techniques in this book, since the neural networks we study are essentially all governed by nearly-Gaussian distributions, and the non-Gaussian couplings become perturbatively small as the networks become wide.

Since we will need to perform all of these manipulations readily, the text, formulas, and examples in this chapter are kept somewhat verbose, with the aim of making the material as transparent and easy to follow as possible.

1.1 Gaussian Integrals

The goal of this section is to introduce Gaussian integrals and the Gaussian probability distribution, culminating in a derivation of Wick's theorem (1.45). This theorem gives a basic formula for computing any moment of a multivariable Gaussian distribution.

1.1.1 Single-variable Gaussian integrals

Let us start with the simplest single-variable Gaussian function,

$$
e^{-\frac{z^{2}}{2}} \tag{1.1}
$$

Its graph traces the famous bell curve: symmetric about a peak at $z=0$ and falling off rapidly for large $|z| \gg 1$. By itself, (1.1) cannot serve as a probability distribution, because it is not normalized. To find the correct normalization, we need to perform the Gaussian integral

$$
I_{1} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}} \tag{1.2}
$$

There is a neat trick for evaluating this integral. First, consider its square:

$$
I_{1}^{2}=\left(\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}}\right)^{2}=\int_{-\infty}^{\infty} d x e^{-\frac{x^{2}}{2}} \int_{-\infty}^{\infty} d y e^{-\frac{y^{2}}{2}}=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d x d y e^{-\frac{1}{2}\left(x^{2}+y^{2}\right)} \tag{1.3}
$$

where we have simply renamed the dummy integration variables. Next, changing variables to polar coordinates $(x, y)=(r \cos \phi, r \sin \phi)$ converts the integration measure to $d x\, d y=r\, d r\, d \phi$ and leaves us with two elementary integrals:

$$
\begin{align*} I_{1}^{2}=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d x d y e^{-\frac{1}{2}\left(x^{2}+y^{2}\right)} &=\int_{0}^{\infty} r d r \int_{0}^{2 \pi} d \phi e^{-\frac{r^{2}}{2}} \\ &=2 \pi \int_{0}^{\infty} d r r e^{-\frac{r^{2}}{2}}=2 \pi\left|-e^{-\frac{r^{2}}{2}}\right|_{r=0}^{r=\infty}=2 \pi \tag{1.4} \end{align*}
$$

Taking the square root, we find the value of the Gaussian integral (1.2):

$$
I_{1}=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2}}=\sqrt{2 \pi} \tag{1.5}
$$
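As a quick numerical sanity check of this value, $\sqrt{2\pi} \approx 2.5066$, here is a minimal sketch; it assumes NumPy and SciPy are available and is purely illustrative:

```python
# Sketch: numerically verify the Gaussian integral I_1 = sqrt(2*pi) from (1.2) and (1.5).
# Assumes NumPy and SciPy; illustrative only.
import numpy as np
from scipy import integrate

value, abs_error = integrate.quad(lambda z: np.exp(-z**2 / 2), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))  # both print approximately 2.5066282746
```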

Dividing the Gaussian function by this normalization factor, we can define the Gaussian probability distribution with unit variance as

$$
p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^{2}}{2}} \tag{1.6}
$$

which is now correctly normalized: $\int_{-\infty}^{\infty} d z\, p(z)=1$. This distribution with zero mean and unit variance is also known as the standard normal distribution.

Extending this result to a Gaussian distribution with general variance $K>0$ is easy. The corresponding normalization factor is given by

$$
I_{K} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}}=\sqrt{K} \int_{-\infty}^{\infty} d u e^{-\frac{u^{2}}{2}}=\sqrt{2 \pi K} \tag{1.7}
$$

where we rescaled the integration variable as $u=z / \sqrt{K}$. We can then define the Gaussian distribution with variance $K$ as

$$
p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^{2}}{2 K}} \tag{1.8}
$$

The graph of this distribution again traces a bell curve symmetric about $z=0$, but it is now equipped with a scale $K$ that characterizes its width, tapering off for $|z| \gg \sqrt{K}$. More generally, we can also shift the center of the bell curve:

$$
p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{(z-s)^{2}}{2 K}} \tag{1.9}
$$

which is now symmetric about $z=s$. This central value $s$ is called the mean of the distribution, since it equals the expectation

$$
\begin{align*} \mathbb{E}[z] \equiv \int_{-\infty}^{\infty} d z p(z) z &=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{(z-s)^{2}}{2 K}} z \\ &=\frac{1}{I_{K}} \int_{-\infty}^{\infty} d w e^{-\frac{w^{2}}{2 K}}(s+w) \\ &=\frac{s I_{K}}{I_{K}}+\frac{1}{I_{K}} \int_{-\infty}^{\infty} d w\left(e^{-\frac{w^{2}}{2 K}} w\right) \\ &=s \tag{1.10} \end{align*}
$$

where we changed variables to $w=z-s$ and noted that, in the final step, the second term vanishes because its integrand is odd under $w \leftrightarrow-w$.

Focusing on Gaussian distributions with zero mean, let us consider expectation values of more general functions $\mathcal{O}(z)$, namely

$$
\mathbb{E}[\mathcal{O}(z)] \equiv \int_{-\infty}^{\infty} d z p(z) \mathcal{O}(z)=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} \mathcal{O}(z) \tag{1.11}
$$

We will often refer to such functions $\mathcal{O}(z)$ as observables, since they correspond to the measurement outcomes of experiments. The expectation values of one special class of functions are called moments, obtained by inserting $z^{M}$, for any integer $M$, into the integrand as $\mathcal{O}(z)$:

$$
\mathbb{E}\left[z^{M}\right]=\frac{1}{\sqrt{2 \pi K}} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{M} \tag{1.12}
$$

Note that the integral vanishes for any odd $M$, since the integrand is then odd under $z \leftrightarrow-z$. For even $M=2 m$, we need to evaluate integrals of the form

$$
I_{K, m} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m} \tag{1.13}
$$

These are mathematical objects nearly as ancient as (1.2), and there is likewise a trick for evaluating them:

$$
\begin{align*} I_{K, m} &=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m}=\left(2 K^{2} \frac{d}{d K}\right)^{m} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}}=\left(2 K^{2} \frac{d}{d K}\right)^{m} I_{K} \\ &=\left(2 K^{2} \frac{d}{d K}\right)^{m} \sqrt{2 \pi} K^{\frac{1}{2}}=\sqrt{2 \pi} K^{\frac{2 m+1}{2}}(2 m-1)(2 m-3) \cdots 1 \tag{1.14} \end{align*}
$$

where on the second line we substituted in the expression (1.7) for $I_{K}$.

Evaluating this formula for $2 m=2$ makes clear why we call $K$ the variance: for a zero-mean Gaussian distribution with variance $K$, we have $\operatorname{var}(z) \equiv \mathbb{E}\left[(z-\mathbb{E}[z])^{2}\right]=\mathbb{E}\left[z^{2}\right]-\mathbb{E}[z]^{2}=\mathbb{E}\left[z^{2}\right]=K$.

Thus, we see that the even moments are given by the simple formula

$$
\mathbb{E}\left[z^{2 m}\right]=\frac{I_{K, m}}{\sqrt{2 \pi K}}=K^{m}(2 m-1) ! ! \tag{1.15}
$$

where we have introduced the double factorial

$$
(2 m-1) ! ! \equiv(2 m-1)(2 m-3) \cdots 1=\frac{(2 m) !}{2^{m} m !} \tag{1.16}
$$

Equation (1.15) is Wick's theorem for single-variable Gaussian distributions.
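Since we will lean on this formula constantly, here is a minimal Monte Carlo sketch (assuming NumPy; the variance value is arbitrary) that compares sample estimates of $\mathbb{E}[z^{2m}]$ against $K^{m}(2m-1)!!$:

```python
# Sketch: Monte Carlo check of the single-variable Wick theorem (1.15),
# E[z^{2m}] = K^m (2m-1)!!, for a zero-mean Gaussian with variance K. Assumes NumPy.
import numpy as np

rng = np.random.default_rng(0)
K = 1.7                                           # any positive variance
z = rng.normal(0.0, np.sqrt(K), size=10_000_000)  # samples from N(0, K)

def double_factorial(n):
    return 1 if n <= 0 else n * double_factorial(n - 2)

for m in range(1, 5):
    estimate = np.mean(z ** (2 * m))
    exact = K**m * double_factorial(2 * m - 1)
    print(f"2m={2*m}: Monte Carlo {estimate:.3f} vs K^m (2m-1)!! = {exact:.3f}")
```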

There is in fact another nice way to derive (1.15), one that extends more naturally to multivariable Gaussian distributions. The derivation starts by considering a Gaussian integral with a source term $J$, which we define as

$$
Z_{K, J} \equiv \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}+J z} \tag{1.17}
$$

Note that setting the source to zero recovers the normalization of the Gaussian integral, $Z_{K, J=0}=I_{K}$. In the physics literature, $Z_{K, J}$ is sometimes called a partition function with source, and this integral will serve as a generating function for the moments. We can evaluate $Z_{K, J}$ by completing the square in the exponent,

$$
-\frac{z^{2}}{2 K}+J z=-\frac{(z-J K)^{2}}{2 K}+\frac{K J^{2}}{2}, \tag{1.18}
$$

which lets us rewrite the integral (1.17) as

$$
Z_{K, J}=e^{\frac{K J^{2}}{2}} \int_{-\infty}^{\infty} d z e^{-\frac{(z-J K)^{2}}{2 K}}=e^{\frac{K J^{2}}{2}} I_{K}=e^{\frac{K J^{2}}{2}} \sqrt{2 \pi K}, \tag{1.19}
$$

where in the middle equality we noticed that the integrand is just a shifted Gaussian function with variance $K$.

We can now relate the Gaussian integral with a source, $Z_{K, J}$, to the Gaussian integrals with insertions, $I_{K, m}$. Differentiating $Z_{K, J}$ with respect to the source $J$ and then setting the source to zero, we get

$$
I_{K, m}=\int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}} z^{2 m}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} \int_{-\infty}^{\infty} d z e^{-\frac{z^{2}}{2 K}+J z}\right]\right|_{J=0}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} Z_{K, J}\right]\right|_{J=0} \tag{1.20}
$$

In other words, the integrals $I_{K, m}$ are related to the even Taylor coefficients of the partition function $Z_{K, J}$ expanded around $J=0$. For example, for $2 m=2$ we have

$$
\mathbb{E}\left[z^{2}\right]=\frac{I_{K, 1}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{2} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left[e^{\frac{K J^{2}}{2}}\left(K+K^{2} J^{2}\right)\right]\right|_{J=0}=K, \tag{1.21}
$$

and for $2 m=4$ we have

$$
\mathbb{E}\left[z^{4}\right]=\frac{I_{K, 2}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{4} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left[e^{\frac{K J^{2}}{2}}\left(3 K^{2}+6 K^{3} J^{2}+K^{4} J^{4}\right)\right]\right|_{J=0}=3 K^{2} . \tag{1.22}
$$

Note that any term with a dangling source $J$ vanishes when we set $J=0$. This observation gives a simple way to evaluate the moments: Taylor-expand the exponential $Z_{K, J} / I_{K}=\exp \left(\frac{K J^{2}}{2}\right)$ and keep the term with just the right number of sources so that the expression does not vanish. We thus find

$$
\begin{align*} \mathbb{E}\left[z^{2 m}\right] &=\frac{I_{K, m}}{\sqrt{2 \pi K}}=\left.\left[\left(\frac{d}{d J}\right)^{2 m} e^{\frac{K J^{2}}{2}}\right]\right|_{J=0}=\left.\left\{\left(\frac{d}{d J}\right)^{2 m}\left[\sum_{k=0}^{\infty} \frac{1}{k !}\left(\frac{K}{2}\right)^{k} J^{2 k}\right]\right\}\right|_{J=0} \\ &=\left(\frac{d}{d J}\right)^{2 m}\left[\frac{1}{m !}\left(\frac{K}{2}\right)^{m} J^{2 m}\right]=K^{m} \frac{(2 m) !}{2^{m} m !}=K^{m}(2 m-1) ! ! \tag{1.23} \end{align*}
$$

This completes our second derivation of Wick's theorem (1.15) for single-variable Gaussian distributions. This derivation is much longer than the first, simpler one, but it extends naturally to multivariable Gaussian distributions.
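The generating-function manipulation above is also easy to check symbolically; the following sketch (assuming SymPy is available) reproduces the $2m=2$ and $2m=4$ results as well as the general pattern $K^{m}(2m-1)!!$:

```python
# Sketch: differentiate the generating function Z_{K,J}/I_K = exp(K J^2 / 2)
# 2m times and set J = 0, reproducing E[z^{2m}] = K^m (2m-1)!!. Assumes SymPy.
import sympy as sp

K, J = sp.symbols("K J", positive=True)
generating_function = sp.exp(K * J**2 / 2)

for m in range(1, 4):
    moment = sp.diff(generating_function, J, 2 * m).subs(J, 0)
    print(2 * m, sp.simplify(moment))  # prints K, 3*K**2, 15*K**3
```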

1.1.2 Multivariable Gaussian integrals

Picking up speed, we are now ready to handle multivariable Gaussian integrals for an $N$-dimensional variable $z_{\mu}$ with $\mu=1, \ldots, N$.

Throughout this book, we will explicitly write out the component indices of vectors, matrices, and tensors as much as possible, except on some occasions when it is clear enough from context.

The multivariable Gaussian function is defined as

exp[12μ,ν=1Nzμ(K1)μνzν]\exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right]

where the variance or covariance matrix KμνK_{\mu \nu} is an NN-by- NN symmetric positive definite matrix, and its inverse (K1)μν\left(K^{-1}\right)_{\mu \nu} is defined so that their matrix product gives the NN-by- NN identity matrix

ρ=1N(K1)μρKρν=δμν\sum_{\rho=1}^{N}\left(K^{-1}\right)_{\mu \rho} K_{\rho \nu}=\delta_{\mu \nu}

Here we have also introduced the Kronecker delta δμν\delta_{\mu \nu}, which satisfies

δμν{1,μ=ν0,μν\delta_{\mu \nu} \equiv \begin{cases}1, & \mu=\nu \\ 0, & \mu \neq \nu\end{cases}

The Kronecker delta is just a convenient representation of the identity matrix.
Now, to construct a probability distribution from the Gaussian function (1.24), we again need to evaluate the normalization factor

IKdNzexp[12μ,ν=1Nzμ(K1)μνzν]=dz1dz2dzNexp[12μ,ν=1Nzμ(K1)μνzν].\begin{aligned} I_{K} & \equiv \int d^{N} z \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] \\ &=\int_{-\infty}^{\infty} d z_{1} \int_{-\infty}^{\infty} d z_{2} \cdots \int_{-\infty}^{\infty} d z_{N} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] . \end{aligned}

To compute this integral, first recall from linear algebra that, given an $N$-by-$N$ symmetric matrix $K_{\mu \nu}$, there is always an orthogonal matrix $O_{\mu \nu}$ (a matrix whose transpose $\left(O^{T}\right)_{\mu \nu}$ equals its inverse, so that $\left(O^{T} O\right)_{\mu \nu}=\delta_{\mu \nu}$) that diagonalizes $K_{\mu \nu}$ as $\left(O K O^{T}\right)_{\mu \nu}=\lambda_{\mu} \delta_{\mu \nu}$ with eigenvalues $\lambda_{\mu=1, \ldots, N}$ and diagonalizes its inverse as $\left(O K^{-1} O^{T}\right)_{\mu \nu}=\left(1 / \lambda_{\mu}\right) \delta_{\mu \nu}$. With this in mind, after twice inserting the identity matrix as $\delta_{\mu \nu}=\left(O^{T} O\right)_{\mu \nu}$, the sum in the exponent of the integral can be expressed in terms of the eigenvalues as

μ,ν=1Nzμ(K1)μνzν=μ,ρ,σ,ν=1Nzμ(OTO)μρ(K1)ρσ(OTO)σνzν=μ,ν=1N(Oz)μ(OK1OT)μν(Oz)ν=μ=1N1λμ(Oz)μ2\begin{aligned} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu} &=\sum_{\mu, \rho, \sigma, \nu=1}^{N} z_{\mu}\left(O^{T} O\right)_{\mu \rho}\left(K^{-1}\right)_{\rho \sigma}\left(O^{T} O\right)_{\sigma \nu} z_{\nu} \\ &=\sum_{\mu, \nu=1}^{N}(O z)_{\mu}\left(O K^{-1} O^{T}\right)_{\mu \nu}(O z)_{\nu} \\ &=\sum_{\mu=1}^{N} \frac{1}{\lambda_{\mu}}(O z)_{\mu}^{2} \end{aligned}

where to reach the final line we used the diagonalization property of the inverse covariance matrix. Remembering that for a positive definite matrix KμνK_{\mu \nu} the eigenvalues are all positive λμ>0\lambda_{\mu}>0, we see that the λμ\lambda_{\mu} sets the scale of the falloff of the Gaussian function in each of the eigendirections. Next, recall from multivariable calculus that a change of variables uμ(Oz)μu_{\mu} \equiv(O z)_{\mu} with an orthogonal matrix OO leaves the integration measure invariant, i.e., dNz=dNud^{N} z=d^{N} u. All together, this lets us factorize the multivariable Gaussian integral (1.27)(1.27) into a product of single-variable Gaussian integrals (1.7), yielding

IK=du1du2duNexp(u122λ1u222λ2uN22λN)=μ=1N[duμexp(uμ22λμ)]=μ=1N2πλμ=μ=1N(2πλμ).\begin{aligned} I_{K} &=\int_{-\infty}^{\infty} d u_{1} \int_{-\infty}^{\infty} d u_{2} \cdots \int_{-\infty}^{\infty} d u_{N} \exp \left(-\frac{u_{1}^{2}}{2 \lambda_{1}}-\frac{u_{2}^{2}}{2 \lambda_{2}}-\ldots-\frac{u_{N}^{2}}{2 \lambda_{N}}\right) \\ &=\prod_{\mu=1}^{N}\left[\int_{-\infty}^{\infty} d u_{\mu} \exp \left(-\frac{u_{\mu}^{2}}{2 \lambda_{\mu}}\right)\right]=\prod_{\mu=1}^{N} \sqrt{2 \pi \lambda_{\mu}}=\sqrt{\prod_{\mu=1}^{N}\left(2 \pi \lambda_{\mu}\right) .} \end{aligned}

Finally, recall one last fact from linear algebra that the product of the eigenvalues of a matrix is equal to the matrix determinant. Thus, compactly, we can express the value of the multivariable Gaussian integral as

IK=dNzexp[12μ,ν=1Nzμ(K1)μνzν]=2πKI_{K}=\int d^{N} z \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right]=\sqrt{|2 \pi K|}

where A|A| denotes the determinant of a square matrix AA.
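The determinant formula is easy to spot-check numerically in low dimensions; the sketch below (assuming NumPy and SciPy, with an arbitrary $2 \times 2$ covariance) compares brute-force integration against $\sqrt{|2\pi K|}$:

```python
# Sketch: check I_K = sqrt(|2 pi K|) for an N = 2 example by direct numerical
# integration of the Gaussian function. Assumes NumPy and SciPy; illustrative only.
import numpy as np
from scipy import integrate

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # symmetric, positive definite covariance
K_inv = np.linalg.inv(K)

def gaussian(z1, z2):
    z = np.array([z1, z2])
    return np.exp(-0.5 * z @ K_inv @ z)

numerical, _ = integrate.dblquad(gaussian, -10, 10, lambda x: -10, lambda x: 10)
closed_form = np.sqrt(np.linalg.det(2 * np.pi * K))
print(numerical, closed_form)     # both print approximately 8.31
```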
Having figured out the normalization factor, we can define the zero-mean multivariable Gaussian probability distribution with variance KμνK_{\mu \nu} as

p(z)=12πKexp[12μ,ν=1Nzμ(K1)μνzν].p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu}\left(K^{-1}\right)_{\mu \nu} z_{\nu}\right] .

While we’re at it, let us also introduce the convention of suppressing the superscript "$-1$" on the inverse covariance $\left(K^{-1}\right)_{\mu \nu}$, instead placing the component indices upstairs as

Kμν(K1)μν.K^{\mu \nu} \equiv\left(K^{-1}\right)_{\mu \nu} .

This way, we distinguish the covariance KμνK_{\mu \nu} and the inverse covariance KμνK^{\mu \nu} by whether or not component indices are lowered or raised. With this notation, inherited from general relativity, the defining equation for the inverse covariance (1.25)(1.25) is written instead as

ρ=1NKμρKρν=δνμ\sum_{\rho=1}^{N} K^{\mu \rho} K_{\rho \nu}=\delta_{\nu}^{\mu}

and the multivariable Gaussian distribution (1.31) is written as

p(z)=12πKexp(12μ,ν=1NzμKμνzν)p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right)

Although it might take some getting used to, this notation saves us some space and saves you some handwriting pain.

If you like, in your notes you can also go full general-relativistic mode and adopt Einstein summation convention, suppressing the summation symbol any time indices are repeated in upstair-downstair pairs. For instance, if we adopted this convention we would write the defining equation for inverse simply as KμρKρν=δμνK^{\mu \rho} K_{\rho \nu}=\delta^{\mu}{ }_{\nu} and the Gaussian function as exp(12zμKμνzν)\exp \left(-\frac{1}{2} z_{\mu} K^{\mu \nu} z_{\nu}\right).
Specifically for neural networks, you might find the Einstein summation convention helpful for sample indices, but sometimes confusing for neural indices. For extra clarity, we won’t adopt this convention in the text of the book, but we mention it now since we do often use such a convention to simplify our own calculations in private.

Regardless of how it’s written, the zero-mean multivariable Gaussian probability distribution (1.34) peaks at $z=0$, and its falloff is direction-dependent, determined by the covariance matrix $K_{\mu \nu}$. More generally, we can shift the peak of the Gaussian distribution to $s_{\mu}$

p(z)=12πKexp[12μ,ν=1N(zs)μKμν(zs)ν]p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N}(z-s)_{\mu} K^{\mu \nu}(z-s)_{\nu}\right]

which defines a general multivariable Gaussian distribution with mean E[zμ]=sμ\mathbb{E}\left[z_{\mu}\right]=s_{\mu} and covariance KμνK_{\mu \nu}. This is the most general version of the Gaussian distribution.
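As a quick empirical sanity check of this parameterization (a sketch assuming NumPy; the particular mean and covariance are arbitrary), one can draw samples and confirm that the sample mean and covariance approach $s_{\mu}$ and $K_{\mu \nu}$:

```python
# Sketch: sample from a general multivariable Gaussian with mean s and covariance K,
# then verify E[z_mu] ~ s_mu and Cov[z_mu, z_nu] ~ K_{mu nu}. Assumes NumPy.
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.0, -2.0, 0.5])               # mean vector s_mu
K = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, -0.2],
              [0.0, -0.2, 0.5]])             # covariance matrix K_{mu nu}

samples = rng.multivariate_normal(mean=s, cov=K, size=1_000_000)
print(samples.mean(axis=0))                  # approximately s
print(np.cov(samples, rowvar=False))         # approximately K
```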

Next, let’s consider the moments of the mean-zero multivariable Gaussian distribution

E[zμ1zμM]dNzp(z)zμ1zμM=12πKdNzexp(12μ,ν=1NzμKμνzν)zμ1zμM=IK,(μ1,,μM)IK,\begin{aligned} \mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{M}}\right] & \equiv \int d^{N} z p(z) z_{\mu_{1}} \cdots z_{\mu_{M}} \\ &=\frac{1}{\sqrt{|2 \pi K|}} \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}}=\frac{I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}}{I_{K}}, \end{aligned}

where we introduced multivariable Gaussian integrals with insertions

IK,(μ1,,μM)dNzexp(12μ,ν=1NzμKμνzν)zμ1zμM.I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)} \equiv \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}} .

Following our approach in the single-variable case, let’s construct the generating function for the integrals IK,(μ1,,μM)I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)} by including a source term JμJ^{\mu} as

ZK,JdNzexp(12μ,ν=1NzμKμνzν+μ=1NJμzμ)Z_{K, J} \equiv \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}+\sum_{\mu=1}^{N} J^{\mu} z_{\mu}\right) \text {. }

As the name suggests, differentiating the generating function ZK,JZ_{K, J} with respect to the source JμJ^{\mu} brings down a power of zμz_{\mu} such that after MM such differentiations we have

$$
\begin{aligned} & {\left.\left[\frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \cdots \frac{d}{d J^{\mu_{M}}} Z_{K, J}\right]\right|_{J=0} } \\ =& \int d^{N} z \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}\right) z_{\mu_{1}} \cdots z_{\mu_{M}}=I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)} . \end{aligned}
$$

So, as in the single-variable case, the Taylor coefficients of the partition function ZK,JZ_{K, J} expanded around Jμ=0J^{\mu}=0 are simply related to the integrals with insertions IK,(μ1,,μM)I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}. Therefore, if we knew a closed-form expression for ZK,JZ_{K, J}, we could easily compute the values of the integrals IK,(μ1,,μM)I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}.

To evaluate the generating function ZK,JZ_{K, J} in a closed form, again we follow the lead of the single-variable case and complete the square in the exponent of the integrand in (1.38) as

12μ,ν=1NzμKμνzν+μ=1NJμzμ=12μ,ν=1N(zμρ=1NKμρJρ)Kμν(zνλ=1NKνλJλ)+12μ,ν=1NJμKμνJν=12μ,ν=1NwμKμνwν+12μ,ν=1NJμKμνJν\begin{aligned} &-\frac{1}{2} \sum_{\mu, \nu=1}^{N} z_{\mu} K^{\mu \nu} z_{\nu}+\sum_{\mu=1}^{N} J^{\mu} z_{\mu} \\ =&-\frac{1}{2} \sum_{\mu, \nu=1}^{N}\left(z_{\mu}-\sum_{\rho=1}^{N} K_{\mu \rho} J^{\rho}\right) K^{\mu \nu}\left(z_{\nu}-\sum_{\lambda=1}^{N} K_{\nu \lambda} J^{\lambda}\right)+\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu} \\ =&-\frac{1}{2} \sum_{\mu, \nu=1}^{N} w_{\mu} K^{\mu \nu} w_{\nu}+\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu} \end{aligned}

where we have introduced the shifted variable wμzμρ=1NKμρJρw_{\mu} \equiv z_{\mu}-\sum_{\rho=1}^{N} K_{\mu \rho} J^{\rho}. Using this substitution, the generating function can be evaluated explicitly

ZK,J=exp(12μ,ν=1NJμKμνJν)dNwexp[12μ,ν=1NwμKμνwν]=2πKexp(12μ,ν=1NJμKμνJν)\begin{aligned} Z_{K, J} &=\exp \left(\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right) \int d^{N} w \exp \left[-\frac{1}{2} \sum_{\mu, \nu=1}^{N} w_{\mu} K^{\mu \nu} w_{\nu}\right] \\ &=\sqrt{|2 \pi K|} \exp \left(\frac{1}{2} \sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right) \end{aligned}

where at the end we used our formula for the multivariable integral $I_{K}$, (1.30). With our closed-form expression (1.41) for the generating function $Z_{K, J}$, we can compute the Gaussian integrals with insertions $I_{K,\left(\mu_{1}, \ldots, \mu_{M}\right)}$ by differentiating it, using (1.39). For an even number $M=2 m$ of insertions, we find a really nice formula

E[zμ1zμ2m]=IK,(μ1,,μ2m)IK=1IK[ddJμ1ddJμ2mZK,J]J=0=12mm!ddJμ1ddJμ2ddJμ2m(μ,ν=1NJμKμνJν)m\begin{aligned} \mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right] &=\frac{I_{K,\left(\mu_{1}, \ldots, \mu_{2 m}\right)}}{I_{K}}=\left.\frac{1}{I_{K}}\left[\frac{d}{d J^{\mu_{1}}} \cdots \frac{d}{d J^{\mu_{2 m}}} Z_{K, J}\right]\right|_{J=0} \\ &=\frac{1}{2^{m} m !} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \cdots \frac{d}{d J^{\mu_{2 m}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)^{m} \end{aligned}

For an odd number $M=2 m+1$ of insertions, there is a dangling source upon setting $J=0$, and so those integrals vanish. You can also see this by looking at the integrand for any odd moment and noticing that it is odd with respect to the sign flip of the integration variables $z_{\mu} \leftrightarrow-z_{\mu}$.

Now, let’s take a few moments to evaluate a few moments using this formula. For 2m=22 m=2, we have

E[zμ1zμ2]=12ddJμ1ddJμ2(μ,ν=1NJμKμνJν)=Kμ1μ2\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\frac{1}{2} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)=K_{\mu_{1} \mu_{2}}

Here, there are $2 !=2$ ways to apply the product rule for derivatives and differentiate the two $J$'s, both of which evaluate to the same expression due to the symmetry of the covariance, $K_{\mu_{1} \mu_{2}}=K_{\mu_{2} \mu_{1}}$. This expression (1.43) validates in the multivariable setting why we have been calling $K_{\mu \nu}$ the covariance, because we see explicitly that it is the covariance.
Next, for 2m=42 m=4 we get a more complicated expression

E[zμ1zμ2zμ3zμ4]=1222!ddJμ1ddJμ2ddJμ3ddJμ4(μ,ν=1NJμKμνJν)(ρ,λ=1NJρKρλJλ)=Kμ1μ2Kμ3μ4+Kμ1μ3Kμ2μ4+Kμ1μ4Kμ2μ3\begin{aligned} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] &=\frac{1}{2^{2} 2 !} \frac{d}{d J^{\mu_{1}}} \frac{d}{d J^{\mu_{2}}} \frac{d}{d J^{\mu_{3}}} \frac{d}{d J^{\mu_{4}}}\left(\sum_{\mu, \nu=1}^{N} J^{\mu} K_{\mu \nu} J^{\nu}\right)\left(\sum_{\rho, \lambda=1}^{N} J^{\rho} K_{\rho \lambda} J^{\lambda}\right) \\ &=K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{4}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{3}} \end{aligned}

Here we note that there are now $4 !=24$ ways to differentiate the four $J$'s, though only three distinct ways to pair the four auxiliary indices $1,2,3,4$ that sit under $\mu$. This gives $24 / 3=8=2^{2}\, 2 !$ equivalent terms for each of the three pairings, which cancels against the overall factor of $1 /\left(2^{2}\, 2 !\right)$.

For general $2 m$, there are $(2 m) !$ ways to differentiate the sources, of which $2^{m} m !$ are equivalent for each pairing. This gives $(2 m) ! /\left(2^{m} m !\right)=(2 m-1) ! !$ distinct terms, corresponding to the $(2 m-1) ! !$ distinct pairings of the $2 m$ auxiliary indices $1, \ldots, 2 m$ that sit under $\mu$. The factor of $1 /\left(2^{m} m !\right)$ in the denominator of (1.42) ensures that the coefficient of each of these terms is normalized to unity. Thus, most generally, we can express the moments of the multivariable Gaussian with the following formula

E[zμ1zμ2m]=all pairing Kμk1μk2Kμk2m1μk2m,\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}},

where, to reiterate, the sum is over all the possible distinct pairings of the $2 m$ auxiliary indices under $\mu$, such that the result has the $(2 m-1) ! !$ terms that we described above. Each factor of the covariance $K_{\mu \nu}$ in a term in the sum is called a Wick contraction, corresponding to a particular pairing of auxiliary indices. Each term is then composed of $m$ different Wick contractions, representing a distinct way of pairing up all the auxiliary indices. To make sure you understand how this pairing works, look back at the $2 m=2$ case (1.43) - with a single Wick contraction - and the $2 m=4$ case (1.44) - with three distinct ways of making two Wick contractions - and try to work out the $2 m=6$ case, which yields $(6-1) ! !=15$ distinct ways of making three Wick contractions:

E[zμ1zμ2zμ3zμ4zμ5zμ6]=Kμ1μ2Kμ3μ4Kμ5μ6+Kμ1μ3Kμ2μ4Kμ5μ6+Kμ1μ4Kμ2μ3Kμ5μ6+Kμ1μ2Kμ3μ5Kμ4μ6+Kμ1μ3Kμ2μ5Kμ4μ6+Kμ1μ5Kμ2μ3Kμ4μ6+Kμ1μ2Kμ5μ4Kμ3μ6+Kμ1μ5Kμ2μ4Kμ3μ6+Kμ1μ4Kμ2μ5Kμ3μ6+Kμ1μ5Kμ3μ4Kμ2μ6+Kμ1μ3Kμ5μ4Kμ2μ6+Kμ1μ4Kμ5μ3Kμ2μ6+Kμ5μ2Kμ3μ4Kμ1μ6+Kμ5μ3Kμ2μ4Kμ1μ6+Kμ5μ4Kμ2μ3Kμ1μ6\begin{aligned} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right] &=K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}} K_{\mu_{5} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{4}} K_{\mu_{5} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{3}} K_{\mu_{5} \mu_{6}} \\ &+K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{5}} K_{\mu_{4} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{2} \mu_{5}} K_{\mu_{4} \mu_{6}}+K_{\mu_{1} \mu_{5}} K_{\mu_{2} \mu_{3}} K_{\mu_{4} \mu_{6}} \\ &+K_{\mu_{1} \mu_{2}} K_{\mu_{5} \mu_{4}} K_{\mu_{3} \mu_{6}}+K_{\mu_{1} \mu_{5}} K_{\mu_{2} \mu_{4}} K_{\mu_{3} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{2} \mu_{5}} K_{\mu_{3} \mu_{6}} \\ &+K_{\mu_{1} \mu_{5}} K_{\mu_{3} \mu_{4}} K_{\mu_{2} \mu_{6}}+K_{\mu_{1} \mu_{3}} K_{\mu_{5} \mu_{4}} K_{\mu_{2} \mu_{6}}+K_{\mu_{1} \mu_{4}} K_{\mu_{5} \mu_{3}} K_{\mu_{2} \mu_{6}} \\ &+K_{\mu_{5} \mu_{2}} K_{\mu_{3} \mu_{4}} K_{\mu_{1} \mu_{6}}+K_{\mu_{5} \mu_{3}} K_{\mu_{2} \mu_{4}} K_{\mu_{1} \mu_{6}}+K_{\mu_{5} \mu_{4}} K_{\mu_{2} \mu_{3}} K_{\mu_{1} \mu_{6}} \end{aligned}
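A short computational sketch (assuming NumPy; the covariance matrix and index choices are arbitrary) makes the pairing sum concrete: it enumerates all distinct pairings recursively and compares the result against a Monte Carlo estimate of the corresponding moment.

```python
# Sketch: check the multivariable Wick theorem (1.45) by summing products of
# covariances over all distinct pairings and comparing with a sampled moment.
# Assumes NumPy; illustrative only.
import numpy as np

def wick_moment(K, indices):
    """Sum over all distinct pairings of the indices of the product of covariances."""
    if len(indices) == 0:
        return 1.0
    first, rest = indices[0], indices[1:]
    total = 0.0
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        total += K[first, partner] * wick_moment(K, remaining)
    return total

rng = np.random.default_rng(0)
K = np.array([[1.5, 0.4, 0.1],
              [0.4, 1.0, -0.3],
              [0.1, -0.3, 0.8]])
z = rng.multivariate_normal(np.zeros(3), K, size=2_000_000)

indices = [0, 1, 1, 2, 2, 0]                 # a 2m = 6 moment (components 0-indexed)
monte_carlo = np.mean(np.prod(z[:, indices], axis=1))
print(monte_carlo, wick_moment(K, indices))  # the two values should roughly agree
```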

The formula (1.45) is Wick’s theorem. Put a box around it. Take a few moments for reflection.
\cdots
\cdots
\cdots
Good. You are now a Gaussian sensei. Exhale, and then say as Neo would say, “I know Gaussian integrals.”

Now that the moments have passed, it is an appropriate time to transition to the next section where you will learn about more general probability distributions.

1.2 Probability, Correlation and Statistics, and All That

In introducing the Gaussian distribution in the last section we briefly touched upon the concepts of expectation and moments. These are defined for non-Gaussian probability distributions too, so now let us reintroduce these concepts and expand on their definitions, with an eye towards understanding the nearly-Gaussian distributions that describe wide neural networks.

Given a probability distribution p(z)p(z) of an NN-dimensional random variable zμz_{\mu}, we can learn about its statistics by measuring functions of zμz_{\mu}. We’ll refer to such measurable functions in a generic sense as observables and denote them as O(z)\mathcal{O}(z). The expectation value of an observable

E[O(z)]dNzp(z)O(z)\mathbb{E}[\mathcal{O}(z)] \equiv \int d^{N} z p(z) \mathcal{O}(z)

characterizes the mean value of the random function $\mathcal{O}(z)$. Note that the observable $\mathcal{O}(z)$ need not be a scalar-valued function; e.g., the second moment of a distribution is a matrix-valued observable given by $\mathcal{O}(z)=z_{\mu} z_{\nu}$.

Operationally, an observable is a quantity that we measure by conducting experiments in order to connect to a theoretical model for the underlying probability distribution describing zμz_{\mu}. In particular, we repeatedly measure the observables that are naturally accessible to us as experimenters, collect their statistics, and then compare them with predictions for the expectation values of those observables computed from some theoretical model of p(z)p(z).

With that in mind, it’s very natural to ask: what kind of information can we learn about an underlying distribution p(z)p(z) by measuring an observable O(z)\mathcal{O}(z) ? For an a priori unknown distribution, is there a set of observables that can serve as a sufficient probe of p(z)p(z) such that we could use that information to predict the result of all future experiments involving zμz_{\mu} ?

Consider a class of observables that we’ve already encountered, the moments or $M$-point correlators of $z_{\mu}$, given by the expectation

$$
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right]=\int d^{N} z p(z) z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}
$$

In the rest of this book, we’ll often use the physics term $M$-point correlator rather than the statistics term moment, though they mean the same thing and can be used interchangeably.

In principle, knowing the MM-point correlators of a distribution lets us compute the expectation value of any analytic observable O(z)\mathcal{O}(z) via Taylor expansion

E[O(z)]=E[M=01M!μ1,,μM=1NMOzμ1zμMz=0zμ1zμ2zμM]=M=01M!μ1,,μM=1NMOzμ1zμMz=0E[zμ1zμ2zμM]\begin{aligned} \mathbb{E}[\mathcal{O}(z)] &=\mathbb{E}\left[\left.\sum_{M=0}^{\infty} \frac{1}{M !} \sum_{\mu_{1}, \ldots, \mu_{M}=1}^{N} \frac{\partial^{M} \mathcal{O}}{\partial z_{\mu_{1}} \cdots \partial z_{\mu_{M}}}\right|_{z=0} z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right] \\ &=\left.\sum_{M=0}^{\infty} \frac{1}{M !} \sum_{\mu_{1}, \ldots, \mu_{M}=1}^{N} \frac{\partial^{M} \mathcal{O}}{\partial z_{\mu_{1}} \cdots \partial z_{\mu_{M}}}\right|_{z=0} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right] \end{aligned}

where on the last line we took the Taylor coefficients out of the expectation by using the linearity property of the expectation, inherited from the linearity property of the integral in (1.46). As such, it’s clear that the collection of all the MM-point correlators completely characterizes a probability distribution for all intents and purposes.
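To see this reconstruction in action, here is a tiny sketch (using only Python's standard math module; the observable $\cos(z)$ and the variance are arbitrary illustrative choices) that rebuilds a single-variable Gaussian expectation purely from the moments (1.15):

```python
# Sketch: reconstruct E[cos(z)] for z ~ N(0, K) from the moments alone, using the
# Taylor expansion of cos and E[z^{2m}] = K^m (2m-1)!!; the exact answer is exp(-K/2).
# Illustrative only.
import math

K = 0.8

def gaussian_moment(two_m):
    """E[z^{2m}] = K^m (2m-1)!! for z ~ N(0, K)."""
    return K ** (two_m // 2) * math.prod(range(two_m - 1, 0, -2))

series = sum((-1) ** m * gaussian_moment(2 * m) / math.factorial(2 * m)
             for m in range(20))
print(series, math.exp(-K / 2))   # both print approximately 0.6703
```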

In fact, the moments offer a dual description of the probability distribution through either the Laplace transform or the Fourier transform. For instance, the Laplace transform of the probability distribution p(z)p(z) is given by

ZJE[exp(μJμzμ)]=[μdzμ]p(z)exp(μJμzμ)Z_{J} \equiv \mathbb{E}\left[\exp \left(\sum_{\mu} J^{\mu} z_{\mu}\right)\right]=\int\left[\prod_{\mu} d z_{\mu}\right] p(z) \exp \left(\sum_{\mu} J^{\mu} z_{\mu}\right)

As in the Gaussian case, this integral gives a generating function for the MM-point correlators of p(z)p(z), which means that ZJZ_{J} can be reconstructed from these correlators. The probability distribution can then be obtained through the inverse Laplace transform.

However, this description in terms of all the correlators is somewhat cumbersome and operationally infeasible. To get a reliable estimate of the MM-point correlator, we must simultaneously measure MM components of a random variable for each draw and repeat such measurements many times. As MM grows, this task quickly becomes impractical. In fact, if we could easily perform such measurements for all MM, then our theoretical model of p(z)p(z) would no longer be a useful abstraction; from (1.48) we would already know the outcome of all possible experiments that we could perform, leaving nothing for us to predict.

To that point, essentially all useful distributions can be effectively described in terms of a finite number of quantities, giving them a parsimonious representation. For instance, consider the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu \nu}$. The nonzero $2m$-point correlators are given by Wick's theorem (1.45) as

E[zμ1zμ2zμ2m]=all pairing Kμk1μk2Kμk2m1μk2m\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{2 m}}\right]=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}}

and are determined entirely by the N(N+1)/2N(N+1) / 2 independent components of the variance KμνK_{\mu \nu}. The variance itself can be estimated by measuring the two-point correlator

E[zμzν]=Kμν\mathbb{E}\left[z_{\mu} z_{\nu}\right]=K_{\mu \nu}

This is consistent with our description of the distribution itself as "the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu \nu}$," in which we only had to specify this same set of numbers, $K_{\mu \nu}$, to pick out the particular distribution we had in mind. For zero-mean Gaussian distributions, there’s no reason to measure or keep track of any of the higher-point correlators, as they are completely constrained by the variance through (1.50).

More generally, it would be nice if there were a systematic way for learning about non-Gaussian probability distributions without performing an infinite number of experiments. For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.

Outside of this chapter, just as we’ll often use the term MM-point correlator rather than the term moment, we’ll use the term MM-point connected correlator rather than the term cumulant. When we want to refer to the moment and not the cumulant, we might sometimes say full correlator to contrast with connected correlator.

As the formal definition of these quantities is somewhat cumbersome and unintuitive, let’s start with a few simple examples.

The first cumulant or the connected one-point correlator is the same as the full one-point correlator

E[zμ]connected E[zμ].\left.\mathbb{E}\left[z_{\mu}\right]\right|_{\text {connected }} \equiv \mathbb{E}\left[z_{\mu}\right] .

This is just the mean of the distribution. The second cumulant or the connected two-point correlator is given by

E[zμzν]connected E[zμzν]E[zμ]E[zν]=E[(zμE[zμ])(zνE[zν])]Cov[zμ,zν]\begin{aligned} \left.\mathbb{E}\left[z_{\mu} z_{\nu}\right]\right|_{\text {connected }} & \equiv \mathbb{E}\left[z_{\mu} z_{\nu}\right]-\mathbb{E}\left[z_{\mu}\right] \mathbb{E}\left[z_{\nu}\right] \\ &=\mathbb{E}\left[\left(z_{\mu}-\mathbb{E}\left[z_{\mu}\right]\right)\left(z_{\nu}-\mathbb{E}\left[z_{\nu}\right]\right)\right] \equiv \operatorname{Cov}\left[z_{\mu}, z_{\nu}\right] \end{aligned}

which is also known as the covariance of the distribution. Note how the mean is subtracted from the random variable zμz_{\mu} before taking the square in the connected version. The quantity Δz^μzμE[zμ]\widehat{\Delta z}_{\mu} \equiv z_{\mu}-\mathbb{E}\left[z_{\mu}\right] represents a fluctuation of the random variable around its mean. Intuitively, such fluctuations are equally likely to contribute positively as they are likely to contribute negatively, E[Δz^μ]=E[zμ]E[zμ]=0\mathbb{E}\left[\widehat{\Delta z}_{\mu}\right]=\mathbb{E}\left[z_{\mu}\right]-\mathbb{E}\left[z_{\mu}\right]=0, so it’s necessary to take the square in order to get an estimate of the magnitude of such fluctuations.

At this point, let us restrict our focus to distributions that are invariant under a sign-flip symmetry $z_{\mu} \rightarrow-z_{\mu}$, which holds for the zero-mean Gaussian distribution (1.34). Importantly, this parity symmetry will also hold for the nearly-Gaussian distributions that we will study in order to describe neural networks. For all such even distributions with this symmetry, all odd moments and all odd-point connected correlators vanish.
With this restriction, the next simplest observable is the fourth cumulant or the connected four-point correlator, given by the formula

E[zμ1zμ2zμ3zμ4]connected =E[zμ1zμ2zμ3zμ4]E[zμ1zμ2]E[zμ3zμ4]E[zμ1zμ3]E[zμ2zμ4]E[zμ1zμ4]E[zμ2zμ3]\begin{aligned} &\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\ =& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \\ &-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]-\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right] \end{aligned}

For the Gaussian distribution, recalling the Wick theorem (1.50), the last three terms precisely subtract off the three pairs of Wick contractions used to evaluate the first term, meaning

E[zμ1zμ2zμ3zμ4]connected =0.\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=0 .

Essentially by design, the connected four-point correlator vanishes for the Gaussian distribution, and a nonzero value signifies a deviation from Gaussian statistics.

In statistics, the connected four-point correlator for a single random variable zz is called the excess kurtosis when normalized by the square of the variance. It is a natural measure of the tails of the distribution, as compared to a Gaussian distribution, and also serves as a measure of the potential for outliers. In particular, a positive value indicates fatter tails while a negative value indicates thinner tails.
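For a single random variable this is easy to play with numerically; the following sketch (assuming NumPy, with arbitrary comparison distributions) estimates the excess kurtosis for Gaussian, fat-tailed, and thin-tailed samples:

```python
# Sketch: the normalized connected four-point correlator (excess kurtosis) as a
# detector of non-Gaussianity for a single random variable. Assumes NumPy.
import numpy as np

def excess_kurtosis(samples):
    """Connected four-point correlator divided by the variance squared."""
    centered = samples - samples.mean()
    variance = np.mean(centered**2)
    connected_four_point = np.mean(centered**4) - 3 * variance**2
    return connected_four_point / variance**2

rng = np.random.default_rng(0)
n = 2_000_000
print(excess_kurtosis(rng.normal(size=n)))   # ~ 0:    Gaussian
print(excess_kurtosis(rng.laplace(size=n)))  # ~ +3:   fatter tails than Gaussian
print(excess_kurtosis(rng.uniform(size=n)))  # ~ -1.2: thinner tails than Gaussian
```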

In fact, the connected four-point correlator is perhaps the simplest measure of non-Gaussianity. Now that we have a little intuition, we are as ready as we’ll ever be to discuss the definition for the MM-th cumulant or the MM-point connected correlator. For completeness, we’ll give the general definition, before restricting again to distributions that are symmetric under parity zμzμz_{\mu} \rightarrow-z_{\mu}. The definition is inductive and somewhat counterintuitive, expressing the MM-th moment in terms of connected correlators from degree 1 to MM :

$$
\begin{align*}
& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right] \\
\equiv\, & \left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{M}}\right]\right|_{\text {connected }} \\
& +\sum_{\text {all subdivisions }} \left.\mathbb{E}\left[z_{\mu_{k_{1}^{[1]}}} \cdots z_{\mu_{k_{\nu_{1}}^{[1]}}}\right]\right|_{\text {connected }} \cdots\left.\mathbb{E}\left[z_{\mu_{k_{1}^{[s]}}} \cdots z_{\mu_{k_{\nu_{s}}^{[s]}}}\right]\right|_{\text {connected }}, \tag{1.56}
\end{align*}
$$

where the sum is over all the possible subdivisions of the $M$ variables into $s>1$ clusters of sizes $\left(\nu_{1}, \ldots, \nu_{s}\right)$, labeled as $\left(k_{1}^{[1]}, \ldots, k_{\nu_{1}}^{[1]}\right), \ldots,\left(k_{1}^{[s]}, \ldots, k_{\nu_{s}}^{[s]}\right)$. By decomposing the $M$-th moment into a sum of products of connected correlators of degree $M$ and lower, we see that the connected $M$-point correlator corresponds to a new type of correlation that cannot be expressed by the connected correlators of lower degree. We saw an example of this above when discussing the connected four-point correlator as a simple measure of non-Gaussianity.

To see how this abstract definition actually works, let’s revisit the examples. First, we trivially recover the relation between the mean and the one-point connected correlator

E[zμ]connected =E[zμ],\left.\mathbb{E}\left[z_{\mu}\right]\right|_{\text {connected }}=\mathbb{E}\left[z_{\mu}\right],

as there is no way to subdivide a single ($M=1$) variable into any smaller pieces. For $M=2$, the definition (1.56) gives

E[zμ1zμ2]=E[zμ1zμ2]connected +E[zμ1]connected E[zμ2]connected =E[zμ1zμ2]connected +E[zμ1]E[zμ2]\begin{aligned} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] &=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}+\left.\left.\mathbb{E}\left[z_{\mu_{1}}\right]\right|_{\text {connected }} \mathbb{E}\left[z_{\mu_{2}}\right]\right|_{\text {connected }} \\ &=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}+\mathbb{E}\left[z_{\mu_{1}}\right] \mathbb{E}\left[z_{\mu_{2}}\right] \end{aligned}

Rearranging to solve for the connected two-point function in terms of the moments, we see that this is equivalent to our previous definition of the covariance (1.53).

At this point, let us again restrict to parity-symmetric distributions invariant under zμzμz_{\mu} \rightarrow-z_{\mu}, remembering that this means that all the odd-point connected correlators will vanish. For such distributions, evaluating the definition (1.56) for M=4M=4 gives

E[zμ1zμ2zμ3zμ4]=E[zμ1zμ2zμ3zμ4]connected +E[zμ1zμ2]connected E[zμ3zμ4]connected +E[zμ1zμ3]connected E[zμ2zμ4]connected +E[zμ1zμ4]connected E[zμ2zμ3]connected \begin{aligned} \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]=&\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\ &+\left.\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }} \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\ &+\left.\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right]\right|_{\text {connected }} \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]\right|_{\text {connected }} \\ &+\left.\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right]\right|_{\text {connected }} \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right]\right|_{\text {connected }} \end{aligned}

Since $\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}$ when the mean vanishes, this is also just a rearrangement of our previous expression (1.54) for the connected four-point correlator for such zero-mean distributions.
In order to see something new, let us carry on for M=6M=6 :

$$
\begin{align*}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]= & \left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
& +\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
& +[14 \text { other }(2,2,2) \text { subdivisions }] \\
& +\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}\left.\mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\
& +[14 \text { other }(4,2) \text { subdivisions }] \tag{1.60}
\end{align*}
$$

in which we have expressed the full six-point correlator in terms of a sum of products of connected two-point, four-point, and six-point correlators. Rearranging the above expression and expressing the two-point and four-point connected correlators in terms of their definitions, (1.53) and (1.54), we obtain an expression for the connected six-point correlator:

E[zμ1zμ2zμ3zμ4zμ5zμ6]connected =E[zμ1zμ2zμ3zμ4zμ5zμ6]{E[zμ1zμ2zμ3zμ4]E[zμ5zμ6]+[14 other (4,2) subdivisions ]}+2{E[zμ1zμ2]E[zμ3zμ4]E[zμ5zμ6]+[14 other (2,2,2) subdivisions ]}\begin{aligned} &\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right]\right|_{\text {connected }} \\ =& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\mu_{5}} z_{\mu_{6}}\right] \\ &-\left\{\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]+[14 \text { other }(4,2) \text { subdivisions }]\right\} \\ &+2\left\{\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{5}} z_{\mu_{6}}\right]+[14 \text { other }(2,2,2) \text { subdivisions }]\right\} \end{aligned}

The rearrangement is useful for computational purposes, in that it’s simple to first compute the moments of a distribution and then organize the resulting expressions in order to evaluate the connected correlators.
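For a single zero-mean, parity-symmetric random variable, the subdivisions collapse into identical terms (3 for the four-point case; 15 and 15 for the six-point case), so the rearranged formulas reduce to simple combinations of moments. Here is a sketch of that computation, assuming NumPy, with arbitrary test distributions:

```python
# Sketch: estimate connected correlators from sample moments of a single zero-mean,
# parity-symmetric variable, using the single-variable specialization of (1.54) and
# of the six-point rearrangement above. Assumes NumPy; illustrative only.
import numpy as np

def connected_correlators(z):
    m2, m4, m6 = (np.mean(z**k) for k in (2, 4, 6))
    c4 = m4 - 3 * m2**2                      # connected four-point correlator
    c6 = m6 - 15 * m4 * m2 + 30 * m2**3      # connected six-point correlator
    return c4, c6

rng = np.random.default_rng(0)
print(connected_correlators(rng.normal(size=4_000_000)))   # both ~ 0 for a Gaussian
print(connected_correlators(rng.laplace(size=4_000_000)))  # both clearly nonzero
```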

Focusing back on (1.60), it’s easy to see that the connected six-point correlator vanishes for Gaussian distributions. Remembering that the connected four-point correlator also vanishes for Gaussian distributions, we see that the fifteen $(2,2,2)$ subdivision terms are exactly equal to the fifteen terms generated by the Wick contractions resulting from evaluating the full correlator on the left-hand side of the equation. In fact, applying the general definition of connected correlators (1.56) to the zero-mean Gaussian distribution, we see inductively that all $M$-point connected correlators for $M>2$ will vanish.

To see this, note that if all the higher-point connected correlators vanish, then the definition (1.56) is equivalent to Wick’s theorem (1.50), with the nonzero terms in (1.56) - the subdivisions into clusters of sizes $(2, \ldots, 2)$ - corresponding exactly to the different pairings in (1.50).

Thus, the connected correlators are a very natural measure of how a distribution deviates from Gaussianity.

With this in mind, we can finally define a nearly-Gaussian distribution as a distribution for which all the connected correlators for $M>2$ are small.

As we discussed in §1.1, the variance sets the scale of the Gaussian distribution. For nearly-Gaussian distributions, we require that all $2m$-point connected correlators be parametrically small when compared to an appropriate power of the variance, i.e., $\big|\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]|_{\text {connected }}\big| \ll\left|K_{\mu \nu}\right|^{m}$, schematically.

In fact, the non-Gaussian distributions that describe neural networks generally have the property that, as the network becomes wide, the connected four-point correlator becomes small and the higher-point connected correlators become even smaller. For these nearly-Gaussian distributions, a few leading connected correlators give a concise and accurate description of the distribution, just as a few leading Taylor coefficients can give a good description of a function near the point of expansion.

1.3 Nearly-Gaussian Distributions

Now that we have defined nearly-Gaussian distributions in terms of measurable deviations from Gaussian statistics, i.e. via small but nonzero connected correlators, it’s natural to ask how we can link these observables to the actual functional form of the distribution, $p(z)$. We can make this connection through the action.

The action S(z)S(z) is a function that defines a probability distribution p(z)p(z) through the relation

p(z)eS(z).p(z) \propto e^{-S(z)} .

In the statistics literature, the action $S(z)$ is sometimes called the negative log probability, but we will again follow the physics literature and call it the action. In order for (1.62) to make sense as a probability distribution, $p(z)$ needs to be normalizable so that we can satisfy

dNzp(z)=1\int d^{N} z p(z)=1

That’s where the normalization factor, or partition function,

ZdNzeS(z)Z \equiv \int d^{N} z e^{-S(z)}

comes in. After computing the partition function, we can define a probability distribution for a particular action S(z)S(z) as

p(z)eS(z)Z.p(z) \equiv \frac{e^{-S(z)}}{Z} .

Conversely, given a probability distribution we can associate an action, S(z)=log[p(z)]S(z)=-\log [p(z)], up to an additive ambiguity: the ambiguity arises because a constant shift in the action can be offset by the multiplicative factor in the partition function.

One convention is to pick the constant such that the action vanishes when evaluated at its global minimum.

The action is a very convenient way to approximate certain types of statistical processes, particularly those with nearly-Gaussian statistics. To demonstrate this, we’ll first start with the simplest action, which describes the Gaussian distribution, and then we’ll show how to systematically perturb it in order to include various non-Gaussianities.

1.3.1 Quadratic action and the Gaussian distribution

Since we already know the functional form of the Gaussian distribution, it’s simple to identify the action by reading it off from the exponent in (1.34)

S(z)=12μ,ν=1NKμνzμzν,S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu},

where, as a reminder, the matrix $K^{\mu \nu}$ is the inverse of the variance matrix $K_{\mu \nu}$. The partition function is given by the normalization integral (1.30) that we computed in §1.1:

Z=dNzeS(z)=IK=2πK.Z=\int d^{N} z e^{-S(z)}=I_{K}=\sqrt{|2 \pi K|} .

This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.

As we will show next, integrals against the Gaussian distribution are a primitive for evaluating expectations against nearly-Gaussian distributions. Therefore, in order to differentiate between a general expectation and an integral against the Gaussian distribution, let us introduce a special bra-ket, or \langle\cdot\rangle notation for computing Gaussian expectation values. For an observable O(z)\mathcal{O}(z), define a Gaussian expectation as

O(z)K12πK[μ=1Ndzμ]exp(12μ,ν=1NKμνzμzν)O(z)\langle\mathcal{O}(z)\rangle_{K} \equiv \frac{1}{\sqrt{|2 \pi K|}} \int\left[\prod_{\mu=1}^{N} d z_{\mu}\right] \exp \left(-\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}\right) \mathcal{O}(z)

In particular, with this notation we can write Wick’s theorem as

zμ1zμ2zμ2mK=all pairing Kμk1μk2Kμk2m1μk2m.\left\langle z_{\mu_{1}} z_{\mu_{2}} \cdots z_{\mu_{2 m}}\right\rangle_{K}=\sum_{\text {all pairing }} K_{\mu_{k_{1}} \mu_{k_{2}}} \cdots K_{\mu_{k_{2 m-1}} \mu_{k_{2 m}}} .

If we’re talking about a Gaussian distribution with variance $K_{\mu \nu}$, then we can use the notations $\mathbb{E}[\cdot]$ and $\langle\cdot\rangle_{K}$ interchangeably. If instead we’re talking about a nearly-Gaussian distribution $p(z)$, then $\mathbb{E}[\cdot]$ indicates an expectation with respect to $p(z)$, (1.46). However, in the evaluation of such an expectation, we’ll often encounter Gaussian integrals, for which we’ll use this bra-ket notation $\langle\cdot\rangle_{K}$ to simplify expressions.

1.3.2 Quartic action and perturbation theory

Now, let’s find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing

E[zμ1zμ2zμ3zμ4]connected =O(ϵ).\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=O(\epsilon) .

Here we have introduced a small parameter $\epsilon \ll 1$ and indicated that the correlator should be of order $\epsilon$. For neural networks, we will later find that the role of the small parameter $\epsilon$ is played by $1/\text{width}$.

We should be able to generate a small connected four-point correlator by deforming the Gaussian distribution through the addition of a small quartic term to the quadratic action (1.66), giving us a quartic action

S(z)=12μ,ν=1NKμνzμzν+ϵ4!μ,ν,ρ,λ=1NVμνρλzμzνzρzλS(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\frac{\epsilon}{4 !} \sum_{\mu, \nu, \rho, \lambda=1}^{N} V^{\mu \nu \rho \lambda} z_{\mu} z_{\nu} z_{\rho} z_{\lambda}

where the quartic coupling ϵVμνρλ\epsilon V^{\mu \nu \rho \lambda} is an (N×N×N×N)(N \times N \times N \times N)-dimensional tensor that is completely symmetric in all of its four indices. The factor of 1/4!1 / 4 ! is conventional in order to compensate for the overcounting in the sum due to the symmetry of the indices. While it’s not a proof of the connection, note that the coupling ϵVμνρλ\epsilon V^{\mu \nu \rho \lambda} has the right number of components to faithfully reproduce the four-point connected correlator (1.70), which is also an (N×N×N×N)(N \times N \times N \times N)-dimensional symmetric tensor. At least from this perspective we’re off to a good start.

Let us now establish this correspondence between the quartic coupling and connected four-point correlator. Note that in general it is impossible to compute any expectation value in closed form with a non-Gaussian action - this includes even the partition function. Instead, in order to compute the connected four-point correlator we’ll need to employ perturbation theory to expand everything to first order in the small parameter ϵ\epsilon, each term of which can then be evaluated in a closed form. As this is easier done than said, let’s get to the computations.
To start, let’s evaluate the partition function:

Z=[μdzμ]eS(z)=[μdzμ]exp(12μ,νKμνzμzνϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zρ1zρ2zρ3zρ4)=2πKexp(ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zρ1zρ2zρ3zρ4)K\begin{aligned} Z &=\int\left[\prod_{\mu} d z_{\mu}\right] e^{-S(z)} \\ &=\int\left[\prod_{\mu} d z_{\mu}\right] \exp \left(-\frac{1}{2} \sum_{\mu, \nu} K^{\mu \nu} z_{\mu} z_{\nu}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right) \\ &=\sqrt{|2 \pi K|}\left\langle\exp \left(-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right)\right\rangle_{K} \end{aligned}

In the second line we inserted our expression for the quartic action (1.71), and in the last line we used our bra-ket notation (1.68) for a Gaussian expectation with variance $K_{\mu \nu}$. As advertised, the Gaussian expectation in the final line cannot be evaluated in closed form. However, since our parameter $\epsilon$ is small, we can Taylor-expand the exponential to express the partition function as a sum of simple Gaussian expectations that can be evaluated using Wick’s theorem (1.69):

Z=2πK1ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zρ1zρ2zρ3zρ4+O(ϵ2)K=2πK[1ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zρ1zρ2zρ3zρ4K+O(ϵ2)]=2πK[1ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4(Kρ1ρ2Kρ3ρ4+Kρ1ρ3Kρ2ρ4+Kρ1ρ4Kρ2ρ3)+O(ϵ2)]=2πK[118ϵρ1,,ρ4Vρ1ρ2ρ3ρ4Kρ1ρ2Kρ3ρ4+O(ϵ2)]\begin{aligned} Z &=\sqrt{|2 \pi K|}\left\langle 1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}+O\left(\epsilon^{2}\right)\right\rangle_{K} \\ &=\sqrt{|2 \pi K|}\left[1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left\langle z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}+O\left(\epsilon^{2}\right)\right] \\ &=\sqrt{|2 \pi K|}\left[1-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+K_{\rho_{1} \rho_{3}} K_{\rho_{2} \rho_{4}}+K_{\rho_{1} \rho_{4}} K_{\rho_{2} \rho_{3}}\right)+O\left(\epsilon^{2}\right)\right] \\ &=\sqrt{|2 \pi K|}\left[1-\frac{1}{8} \epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+O\left(\epsilon^{2}\right)\right] \end{aligned}

In the final line, we were able to combine the three K2K^{2} terms together by using the total symmetry of the quartic coupling and then relabeling some of the summed-over dummy indices.
Similarly, let’s evaluate the two-point correlator:

E[zμ1zμ2]=1Z[μdzμ]eS(z)zμ1zμ2=2πKZzμ1zμ2exp(ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zρ1zρ2zρ3zρ4)K=2πKZ[zμ1zμ2Kϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4zμ1zμ2zρ1zρ2zρ3zρ4K+O(ϵ2)]=[1+18ϵρ1,,ρ4Vρ1ρ2ρ3ρ4Kρ1ρ2Kρ3ρ4]Kμ1μ2ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4(3Kμ1μ2Kρ1ρ2Kρ3ρ4+12Kμ1ρ1Kμ2ρ2Kρ3ρ4)+O(ϵ2)=Kμ1μ2ϵ2ρ1,,ρ4Vρ1ρ2ρ3ρ4Kμ1ρ1Kμ2ρ2Kρ3ρ4+O(ϵ2).\begin{aligned} & \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]=\frac{1}{Z} \int\left[\prod_{\mu} d z_{\mu}\right] e^{-S(z)} z_{\mu_{1}} z_{\mu_{2}} \\ =& \frac{\sqrt{|2 \pi K|}}{Z}\left\langle z_{\mu_{1}} z_{\mu_{2}} \exp \left(-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right)\right\rangle_{K} \\ =& \frac{\sqrt{|2 \pi K|}}{Z}\left[\left\langle z_{\mu_{1}} z_{\mu_{2}}\right\rangle_{K}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}+O\left(\epsilon^{2}\right)\right] \\ =& {\left[1+\frac{1}{8} \epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}\right] K_{\mu_{1} \mu_{2}} } \\ &-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(3 K_{\mu_{1} \mu_{2}} K_{\rho_{1} \rho_{2}} K_{\rho_{3} \rho_{4}}+12 K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\rho_{3} \rho_{4}}\right)+O\left(\epsilon^{2}\right) \\ =& K_{\mu_{1} \mu_{2}}-\frac{\epsilon}{2} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\rho_{3} \rho_{4}}+O\left(\epsilon^{2}\right) . \end{aligned}

Here, to go from the first line to the second line we inserted our expression for the quartic action (1.71) and rewrote the integral as a Gaussian expectation. Then, after expanding in $\epsilon$ to first order, in the next step we substituted (1.73) in for the partition function $Z$ in the denominator and expanded $1/Z$ to first order in $\epsilon$ using the expansion $1/(1-x)=1+x+O\left(x^{2}\right)$. In that same step, we also noted that, of the fifteen terms coming from the Gaussian expectation $\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}$, there are three ways in which $z_{\mu_{1}}$ and $z_{\mu_{2}}$ contract with each other but twelve ways in which they don’t. Given again the symmetry of $V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$, this is the only distinction that matters.
At last, let’s compute the full four-point correlator:

$$
\begin{align*}
\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] &=\frac{1}{Z} \int\left[\prod_{\mu} d z_{\mu}\right] e^{-S(z)} z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} \tag{1.75} \\
&=\frac{\sqrt{|2 \pi K|}}{Z}\left[\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right\rangle_{K}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}+O\left(\epsilon^{2}\right)\right]
\end{align*}
$$

To go from the first line to the second line we inserted our expression for the quartic action (1.71), expanded to first order in $\epsilon$, and rewrote the result in the bra-ket notation (1.68). From there, we again substitute in the expression (1.73) for the partition function $Z$, expand $1/Z$ to first order in $\epsilon$, and then use Wick’s theorem (1.69) to evaluate the fourth and eighth Gaussian moments. (Yes, we know that the evaluation of $\left\langle z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} z_{\rho_{1}} z_{\rho_{2}} z_{\rho_{3}} z_{\rho_{4}}\right\rangle_{K}$ is not fun. The breakdown of the terms depends again on whether or not the $\mu$-type indices are contracted with the $\rho$-type indices.) We can simplify the resulting expression by noticing that some terms cancel due to $\frac{1}{8}-\frac{3}{24}=0$ and that some other terms can be nicely regrouped once we notice, through the expression for the two-point correlator (1.74), that

Kμ1μ2Kμ3μ4ϵ24ρ1,,ρ4Vρ1ρ2ρ3ρ4(12Kμ1ρ1Kμ2ρ2Kμ3μ4Kρ3ρ4+12Kμ3ρ1Kμ4ρ2Kμ1μ2Kρ3ρ4)=E[zμ1zμ2]E[zμ3zμ4]+O(ϵ2)\begin{aligned} &K_{\mu_{1} \mu_{2}} K_{\mu_{3} \mu_{4}}-\frac{\epsilon}{24} \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}\left(12 K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \mu_{4}} K_{\rho_{3} \rho_{4}}+12 K_{\mu_{3} \rho_{1}} K_{\mu_{4} \rho_{2}} K_{\mu_{1} \mu_{2}} K_{\rho_{3} \rho_{4}}\right) \\ &=\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]+O\left(\epsilon^{2}\right) \end{aligned}

yielding in the end

E[zμ1zμ2zμ3zμ4]=E[zμ1zμ2]E[zμ3zμ4]+E[zμ1zμ3]E[zμ2zμ4]+E[zμ1zμ4]E[zμ2zμ3]ϵρ1,,ρ4Vρ1ρ2ρ3ρ4Kμ1ρ1Kμ2ρ2Kμ3ρ3Kμ4ρ4+O(ϵ2)\begin{aligned} & \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right] \\ =& \mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right] \mathbb{E}\left[z_{\mu_{3}} z_{\mu_{4}}\right]+\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{3}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{4}}\right]+\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{4}}\right] \mathbb{E}\left[z_{\mu_{2}} z_{\mu_{3}}\right] \\ &-\epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \rho_{3}} K_{\mu_{4} \rho_{4}}+O\left(\epsilon^{2}\right) \end{aligned}

Given the full four-point correlator (1.75) and the two-point correlator (1.74), we can finally evaluate the connected four-point correlator (1.54) as

E[zμ1zμ2zμ3zμ4]connected =ϵρ1,,ρ4Vρ1ρ2ρ3ρ4Kμ1ρ1Kμ2ρ2Kμ3ρ3Kμ4ρ4+O(ϵ2)\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}}\right]\right|_{\text {connected }}=-\epsilon \sum_{\rho_{1}, \ldots, \rho_{4}} V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}} K_{\mu_{1} \rho_{1}} K_{\mu_{2} \rho_{2}} K_{\mu_{3} \rho_{3}} K_{\mu_{4} \rho_{4}}+O\left(\epsilon^{2}\right) \text {. }

This makes explicit the relationship between the connected four-point correlator and the quartic coupling in the action, when both are small. We see that the distribution realized by the quartic action (1.71) is, as promised, nearly Gaussian: the strength of the coupling $\epsilon V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$ directly controls the distribution's deviation from Gaussian statistics, as measured by the connected four-point correlator. This also shows that the four-index tensor $V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$ creates nontrivial correlations among the four components $z_{\rho_{1}}, z_{\rho_{2}}, z_{\rho_{3}}, z_{\rho_{4}}$ that cannot be built up from the pairwise correlation $K_{\mu \nu}$ between any two random variables $z_{\mu}, z_{\nu}$.
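In the same illustrative spirit as the sketch above (again our own check, with assumed parameter values), for a single variable the statement reduces to $\mathbb{E}\left[z^{4}\right]-3\, \mathbb{E}\left[z^{2}\right]^{2}=-\epsilon V K^{4}+O\left(\epsilon^{2}\right)$, which is easy to verify numerically.

```python
import numpy as np

# Companion sketch: for S(z) = z^2/(2K) + (eps/24) V z^4, the connected
# four-point correlator E[z^4] - 3 E[z^2]^2 should equal -eps V K^4 + O(eps^2).
K, V, eps = 1.0, 1.0, 0.01

z, dz = np.linspace(-10.0, 10.0, 200_001, retstep=True)
boltzmann = np.exp(-(z**2 / (2.0 * K) + eps * V * z**4 / 24.0))

Z = np.sum(boltzmann) * dz
E_z2 = np.sum(z**2 * boltzmann) * dz / Z
E_z4 = np.sum(z**4 * boltzmann) * dz / Z

connected_4pt = E_z4 - 3.0 * E_z2**2    # fourth cumulant of the zero-mean variable
prediction = -eps * V * K**4            # leading-order connected correlator

print(connected_4pt, prediction)        # agree up to O(eps^2)
```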

Finally, note that the connected two-point correlator (1.74), i.e. the covariance of this nearly-Gaussian distribution, is also shifted from its Gaussian value of $K_{\mu_{1} \mu_{2}}$ by the quartic coupling $\epsilon V^{\rho_{1} \rho_{2} \rho_{3} \rho_{4}}$. Thus, the nearly-Gaussian deformation not only creates complicated patterns of four-point correlation, as measured by the connected four-point correlator (1.78); it can also modify the details of the Gaussian two-point correlation.
Now that we have seen how to compute the statistics of a nearly-Gaussian distribution, let's take a step back and think about what made this possible. We can perform these perturbative calculations any time there exists in the problem a dimensionless parameter $\epsilon$ that is small, $\epsilon \ll 1$, but nonzero, $\epsilon>0$. This makes perturbation theory an extremely powerful tool for theoretical analysis any time a problem has any extreme scales, small or large.

Importantly, this is directly relevant to theoretically understanding neural networks in practice. As we will explain in the following chapters, real networks have a parameter $n$, the number of neurons in a layer, that is typically large, $n \gg 1$, but certainly not infinite, $n<\infty$. This means that we can expand the distributions that describe such networks in the inverse of this large parameter, $\epsilon=1 / n$. Indeed, when the parameter $n$ is large, as is typical in practice, the distributions that describe neural networks become nearly-Gaussian and thus theoretically tractable. This type of expansion is known as the $1 / n$ expansion or large-$n$ expansion and will be one of our main tools for learning the principles of deep learning theory.
Aside: statistical independence and interactions
The quartic action (1.71) is one of the simplest models of an interacting theory. We showed this explicitly by connecting the quartic coupling to the non-Gaussian statistics of the non-vanishing connected four-point correlator. Here, let us try to offer an intuitive meaning of interaction by appealing to the notion of statistical independence.

Recall from probability theory that two random variables $x$ and $y$ are statistically independent if their joint distribution factorizes as

p(x,y)=p(x)p(y).p(x, y)=p(x) p(y) .

For the Gaussian distribution, if the covariance matrix $K_{\mu \nu}$ is diagonal, there is no correlation at all between the different components of $z_{\mu}$; they are manifestly statistically independent of each other.

Even if KμνK_{\mu \nu} is not diagonal, we can still unwind the correlation of a Gaussian distribution by rotating to the right basis. As discussed in §1.1\S 1.1, there always exists an orthogonal matrix OO that diagonalizes the covariance as (OKOT)μν=λμδμν\left(O K O^{T}\right)_{\mu \nu}=\lambda_{\mu} \delta_{\mu \nu}. In terms of the variables uμ(Oz)μu_{\mu} \equiv(O z)_{\mu}, the distribution looks like

p(z)=12πKexp(μ=1Nuμ22λμ)=μ=1N(euμ22λμ2πλμ)=p(u1)p(uN)p(z)=\frac{1}{\sqrt{|2 \pi K|}} \exp \left(-\sum_{\mu=1}^{N} \frac{u_{\mu}^{2}}{2 \lambda_{\mu}}\right)=\prod_{\mu=1}^{N}\left(\frac{e^{-\frac{u_{\mu}^{2}}{2 \lambda_{\mu}}}}{\sqrt{2 \pi \lambda_{\mu}}}\right)=p\left(u_{1}\right) \cdots p\left(u_{N}\right)

Thus, we see that in the uu-coordinate basis the original multivariable Gaussian distribution factorizes into NN single-variable Gaussians that are statistically independent.
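The following short Python sketch, our own illustration with an arbitrarily chosen covariance, makes this factorization concrete: it numerically diagonalizes a covariance matrix $K$ and checks that the rotated components $u=O z$ are uncorrelated, with variances given by the eigenvalues $\lambda_{\mu}$.

```python
import numpy as np

# Illustrative check: rotating correlated Gaussian samples into the eigenbasis
# of their covariance decorrelates the components, realizing p(u_1)...p(u_N).
rng = np.random.default_rng(0)

N = 3
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)        # an arbitrary positive-definite covariance

lam, Q = np.linalg.eigh(K)         # K = Q diag(lam) Q^T, columns of Q are eigenvectors
O = Q.T                            # orthogonal matrix with O K O^T = diag(lam)

z = rng.multivariate_normal(mean=np.zeros(N), cov=K, size=500_000)
u = z @ O.T                        # apply u = O z to every sample (rows of z)

print(np.cov(u, rowvar=False).round(2))   # ~ diag(lam): off-diagonal entries vanish
print(lam.round(2))
```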
We also see that in terms of the action, statistical independence is characterized by the action breaking into a sum over separate terms. This unwinding of interaction between variables is generically impossible when there are nonzero non-Gaussian couplings. For instance, there are N2\sim N^{2} components of an orthogonal matrix OμνO_{\mu \nu} to change basis, while there are N4\sim N^{4} components of the quartic coupling ϵVμνρλ\epsilon V^{\mu \nu \rho \lambda} that correlate random variables, so it is generically impossible to re-express the quartic action as a sum of functions of NN different variables. Since the action cannot be put into a sum over NN separate terms, the joint distribution cannot factorize, and the components will not be independent from each other. Thus, it is impossible to factor the nearly-Gaussian distribution into the product of NN statistically independent distributions. In this sense, what is meant by interaction is the breakdown of statistical independence. 12{ }^{12}
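To make the $\sim N^{2}$ versus $\sim N^{4}$ counting concrete, here is a tiny sketch (our own illustration) comparing the number of free parameters in an orthogonal change of basis with the number of independent components of a fully symmetric quartic coupling.

```python
from math import comb

# An orthogonal matrix has N(N-1)/2 free parameters, while a fully symmetric
# four-index coupling has C(N+3, 4) ~ N^4/24 independent components, so a basis
# change generically cannot absorb the quartic interactions.
for N in (2, 10, 100):
    rotation_params = N * (N - 1) // 2
    quartic_components = comb(N + 3, 4)
    print(N, rotation_params, quartic_components)
```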
Nearly-Gaussian actions
Having given a concrete example in which we illustrated how to deform the quadratic action to realize the simplest nearly-Gaussian distribution, we now give a more general perspective on nearly-Gaussian distributions. In what follows, we will continue to require that our distributions are invariant under the parity symmetry that takes zμzμz_{\mu} \rightarrow-z_{\mu}. In the action representation, this corresponds to including only terms of even degree. 13{ }^{13}
${ }^{12}$ An astute reader might wonder if there is any interaction when we consider a single-variable distribution with $N=1$, since there are no other variables to interact with. For nearly-Gaussian distributions, even if $N=1$, we saw in (1.74) that the variance of the distribution is shifted from its Gaussian value, $K$, and depends on the quartic coupling $\epsilon V$. In physics, we say that this shift is due to the self-interaction induced by the quartic coupling $\epsilon V$, since it modifies the value of observables from the free Gaussian theory that we are comparing to, even though there's no notion of statistical independence to appeal to here.

Said another way, even though the action involves only a single variable, such a non-Gaussian distribution does not have a closed-form solution for its partition function or correlators; i.e. there is no trick that lets us compute integrals of the form $\int d z\, e^{-S(z)}$ exactly when $S(z)=\frac{z^{2}}{2 K}+\frac{1}{4 !} \epsilon V z^{4}$. This means that we still have to make use of perturbation theory to analyze the self-interaction in such distributions.
13{ }^{13} The imposition of such a parity symmetry, and thus the absence of odd-degree terms in the action, means that all of the odd moments and hence all of the odd-point connected correlators will vanish.

With that caveat in mind, though otherwise very generally, we can express a non-Gaussian distribution by deforming the Gaussian action as

<u>S(z)=12μ,ν=1NKμνzμzν+m=2k1(2m)!μ1,,μ2m=1Nsμ1μ2mzμ1zμ2m,</u><u>S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\sum_{m=2}^{k} \frac{1}{(2 m) !} \sum_{\mu_{1}, \ldots, \mu_{2 m}=1}^{N} s^{\mu_{1} \cdots \mu_{2 m}} z_{\mu_{1}} \cdots z_{\mu_{2 m}},</u>

where the factor of $1 /(2 m)!$ is conventional in order to compensate for the overcounting in the sum due to the implied symmetry of the indices $\mu_{1}, \ldots, \mu_{2 m}$ in the coefficients $s^{\mu_{1} \cdots \mu_{2 m}}$, given the permutation symmetry of the product of variables $z_{\mu_{1}} \cdots z_{\mu_{2 m}}$. The number of terms in the non-Gaussian part of the action is controlled by the integer $k$. If $k$ were unbounded, then $S(z)$ would be an arbitrary even function, and $p(z)$ could be any parity-symmetric distribution. The action is most useful when the polynomial $S(z)$, truncated at a reasonably small degree $k$ (like $k=2$ for the quartic action), yields a good representation for the statistical process of interest.
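As a concrete instance of this general form, truncating the sum at $k=2$ and identifying $s^{\mu_{1} \mu_{2} \mu_{3} \mu_{4}}=\epsilon V^{\mu_{1} \mu_{2} \mu_{3} \mu_{4}}$ recovers the quartic action (1.71):

$$S(z)=\frac{1}{2} \sum_{\mu, \nu=1}^{N} K^{\mu \nu} z_{\mu} z_{\nu}+\frac{\epsilon}{4 !} \sum_{\mu_{1}, \ldots, \mu_{4}=1}^{N} V^{\mu_{1} \mu_{2} \mu_{3} \mu_{4}} z_{\mu_{1}} z_{\mu_{2}} z_{\mu_{3}} z_{\mu_{4}} .$$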

The coefficients $s^{\mu_{1} \cdots \mu_{2 m}}$ are generally known as non-Gaussian couplings, and they control the interactions of the $z_{\mu}$.

In a similar vein, the coefficient $K^{\mu \nu}$ in the action is sometimes called a quadratic coupling, since the coupling of the component $z_{\mu}$ with the component $z_{\nu}$ in the quadratic action leads to a nontrivial correlation, i.e. $\operatorname{Cov}\left[z_{\mu}, z_{\nu}\right]=K_{\mu \nu}$.

In particular, there is a direct correspondence between the product of the specific components $z_{\mu}$ that appear together in the action and the presence of connected correlation between those variables, with the degree of the term in (1.81) directly contributing to connected correlators of that degree. We saw an example of this in (1.78), which connected the quartic term to the connected four-point correlator. In this way, the couplings give a very direct way of controlling the degree and pattern of non-Gaussian correlation, and the overall degree of the action offers a way of systematically including more and more complicated patterns of such correlations.

If you recall from $\S 1.2$, we defined nearly-Gaussian distributions as ones for which all these connected correlators are small. Equivalently, from the action perspective, a nearly-Gaussian distribution is a non-Gaussian distribution with an action of the form (1.81) for which all the couplings $s^{\mu_{1} \cdots \mu_{2 m}}$ are parametrically small for all $1 \leq m \leq k$:

sμ1μ2mKμνm,\left|s^{\mu_{1} \cdots \mu_{2 m}}\right| \ll\left|K^{\mu \nu}\right|^{m},

where this equation is somewhat schematic given the mismatch of the indices.

This schematic equation is, nonetheless, dimensionally consistent. To support that remark, let us give a brief introduction to dimensional analysis: let the random variable zμz_{\mu} have dimension ζ\zeta, which we denote as [zμ]=ζ1\left[z_{\mu}\right]=\zeta^{1}. By dimension, you should have in mind something like a unit of length, so e.g. we read the expression [zμ]=ζ1\left[z_{\mu}\right]=\zeta^{1} as “a component of zz is measured in units of ζ\zeta.” The particular units are arbitrary: e.g. for length, we can choose between meters or inches or parsecs as long as we use a unit of length but not, say, meters 2{ }^{2}, which instead would be a unit of area. Importantly, we cannot add or equate quantities that have different units: it doesn’t make any logical sense to add a length to an area. This is similar to the concept of type safety in computer science, e.g. we should not add a type str variable to a type int variable.

Now, since the action $S(z)$ is the argument of an exponential $p(z) \propto e^{-S(z)}$, it must be dimensionless; otherwise, the exponential $e^{-S}=1-S+\frac{S^{2}}{2}+\ldots$ would violate the addition rule that we just described. From this dimensionless requirement for the action, we surmise that the inverse of the covariance matrix has dimension $\left[K^{\mu \nu}\right]=\zeta^{-2}$, and that the covariance itself has dimension $\left[K_{\mu \nu}\right]=\zeta^{2}$. Similarly, all the non-Gaussian couplings in (1.81) have dimensions $\left[s^{\mu_{1} \cdots \mu_{2 m}}\right]=\zeta^{-2 m}$. Thus, both sides of (1.82) have the same dimension, making this equation dimensionally consistent.
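As a quick check of this bookkeeping, every term in the action then carries zero net dimension:

$$\left[K^{\mu \nu} z_{\mu} z_{\nu}\right]=\zeta^{-2} \cdot \zeta \cdot \zeta=\zeta^{0}, \qquad\left[s^{\mu_{1} \cdots \mu_{2 m}} z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]=\zeta^{-2 m} \cdot \zeta^{2 m}=\zeta^{0},$$

so the action is dimensionless term by term, as required.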

Even more concretely, consider the quartic action (1.71). If we let the tensorial part of the quartic coupling have dimensions $\left[V^{\mu \nu \rho \lambda}\right]=\zeta^{-4}$, then the parameter $\epsilon$ is dimensionless, as claimed. This means that we can consistently compare $\epsilon$ to unity, and its parametric smallness $\epsilon \ll 1$ means that the full quartic coupling $\epsilon V^{\mu \nu \rho \lambda}$ is much smaller than the square of the quadratic coupling, and that the connected four-point correlator (1.78) is much smaller than the square of the connected two-point correlator (1.74).

Importantly, the comparison is with an appropriate power of the inverse variance, i.e. the quadratic coupling $K^{\mu \nu}$, since, as we already explained, the variance sets the scale of the Gaussian distribution to which we are comparing these nearly-Gaussian distributions.

As we will see in $\S 4$, wide neural networks are described by nearly-Gaussian distributions. In particular, we will find that such networks are described by a special type of nearly-Gaussian distribution where the connected correlators are hierarchically small, scaling as

E[zμ1zμ2m]connected =O(ϵm1),\left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{2 m}}\right]\right|_{\text {connected }}=O\left(\epsilon^{m-1}\right),

with the same parameter ϵ\epsilon controlling the different scalings for each of the 2m2 m-point connected correlators. Importantly, the non-Gaussianities coming from higher-point connected correlators become parametrically less important as ϵ\epsilon becomes smaller.
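Concretely, the first few cases of (1.83) read

$$\left.\mathbb{E}\left[z_{\mu_{1}} z_{\mu_{2}}\right]\right|_{\text {connected }}=O(1), \qquad\left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{4}}\right]\right|_{\text {connected }}=O(\epsilon), \qquad\left.\mathbb{E}\left[z_{\mu_{1}} \cdots z_{\mu_{6}}\right]\right|_{\text {connected }}=O\left(\epsilon^{2}\right),$$

so each additional pair of variables in a connected correlator costs another factor of $\epsilon$.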

This means that for a nearly-Gaussian distribution with hierarchical scalings (1.83), we can consistently approximate the distribution by truncating the action at some fixed order in ϵ\epsilon. To be concrete, we can use an action of the form (1.81) to faithfully represent all the correlations up to order O(ϵk1)O\left(\epsilon^{k-1}\right), neglecting connected correlations of order O(ϵk)O\left(\epsilon^{k}\right) and higher. The resulting action offers a useful and effective description for the statistical process of interest, as long as ϵ\epsilon is small enough and kk is high enough that O(ϵk)O\left(\epsilon^{k}\right) is negligible.

In practice, the quartic action (1.71), i.e. the truncation of (1.81) to $k=2$, will let us model realistic finite-width neural networks. This quartic action captures the important qualitative difference between nearly-Gaussian distributions and the Gaussian distribution, incorporating nontrivial interactions between the different components of the random variable. In addition, the difference between the statistics (1.83) of a nearly-Gaussian distribution truncated to $O(\epsilon)$ versus one truncated to $O\left(\epsilon^{2}\right)$ is mostly quantitative: in both cases there are nontrivial non-Gaussian correlations, but the pattern of higher-order correlation differs only in a small way, with the difference suppressed as $O\left(\epsilon^{2}\right)$. In this way, the distribution represented by the quartic action is complex enough to capture the most salient non-Gaussian effects in neural networks while still being simple enough to be analytically tractable.