Picking up speed, we are now ready to handle multivariable Gaussian integrals for an $N$-dimensional variable $z_\mu$, with $\mu = 1, \ldots, N$.
Throughout this book, we will explicitly write out the component indices of vectors, matrices, and tensors as much as possible, except on some occasions when it is clear enough from context.
The multivariable Gaussian function is defined as
$$\exp\!\left[-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu\, (K^{-1})_{\mu\nu}\, z_\nu\right], \qquad (1.24)$$
where the variance or covariance matrix $K_{\mu\nu}$ is an $N$-by-$N$ symmetric positive definite matrix, and its inverse $(K^{-1})_{\mu\nu}$ is defined so that their matrix product gives the $N$-by-$N$ identity matrix
$$\sum_{\rho=1}^{N} (K^{-1})_{\mu\rho}\, K_{\rho\nu} = \delta_{\mu\nu}. \qquad (1.25)$$
Here we have also introduced the Kronecker delta $\delta_{\mu\nu}$, which satisfies
$$\delta_{\mu\nu} \equiv \begin{cases} 1, & \mu = \nu, \\ 0, & \mu \neq \nu. \end{cases}$$
The Kronecker delta is just a convenient representation of the identity matrix.
Now, to construct a probability distribution from the Gaussian function (1.24), we again need to evaluate the normalization factor
$$I_K \equiv \int d^N z\, \exp\!\left[-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu\, (K^{-1})_{\mu\nu}\, z_\nu\right]. \qquad (1.27)$$
To compute this integral, first recall from linear algebra that, given an $N$-by-$N$ symmetric matrix $K_{\mu\nu}$, there is always an orthogonal matrix $O_{\mu\nu}$ that diagonalizes $K_{\mu\nu}$ as $(OKO^T)_{\mu\nu} = \lambda_\mu \delta_{\mu\nu}$, with eigenvalues $\lambda_\mu$ for $\mu = 1, \ldots, N$, and diagonalizes its inverse as $(OK^{-1}O^T)_{\mu\nu} = (1/\lambda_\mu)\,\delta_{\mu\nu}$. (An orthogonal matrix $O_{\mu\nu}$ is a matrix whose transpose $(O^T)_{\mu\nu}$ equals its inverse, i.e., $(O^T O)_{\mu\nu} = \delta_{\mu\nu}$.) With this in mind, after twice inserting the identity matrix as $\delta_{\mu\nu} = (O^T O)_{\mu\nu}$, the sum in the exponent of the integral can be expressed in terms of the eigenvalues as
$$\sum_{\mu,\nu=1}^{N} z_\mu\, (K^{-1})_{\mu\nu}\, z_\nu = \sum_{\mu,\nu=1}^{N} (Oz)_\mu\, \big(OK^{-1}O^T\big)_{\mu\nu}\, (Oz)_\nu = \sum_{\mu=1}^{N} \frac{(Oz)_\mu^2}{\lambda_\mu},$$
where to reach the final expression we used the diagonalization property of the inverse covariance matrix. Remembering that for a positive definite matrix $K_{\mu\nu}$ the eigenvalues are all positive, $\lambda_\mu > 0$, we see that each $\lambda_\mu$ sets the scale of the falloff of the Gaussian function in the corresponding eigendirection. Next, recall from multivariable calculus that a change of variables $u_\mu \equiv (Oz)_\mu$ with an orthogonal matrix $O$ leaves the integration measure invariant, i.e., $d^N z = d^N u$. Altogether, this lets us factorize the multivariable Gaussian integral (1.27) into a product of single-variable Gaussian integrals (1.7), yielding
$$I_K = \int d^N u\, \exp\!\left(-\sum_{\mu=1}^{N}\frac{u_\mu^2}{2\lambda_\mu}\right) = \prod_{\mu=1}^{N}\left[\int du_\mu\, e^{-\frac{u_\mu^2}{2\lambda_\mu}}\right] = \prod_{\mu=1}^{N}\sqrt{2\pi\lambda_\mu}.$$
Finally, recall one last fact from linear algebra that the product of the eigenvalues of a matrix is equal to the matrix determinant. Thus, compactly, we can express the value of the multivariable Gaussian integral as
$$I_K = \int d^N z\, \exp\!\left[-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu\, (K^{-1})_{\mu\nu}\, z_\nu\right] = \sqrt{|2\pi K|}, \qquad (1.30)$$
where $|A|$ denotes the determinant of a square matrix $A$.
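If you like to check such formulas numerically, here is a minimal sketch (not part of the original text) that verifies $I_K = \sqrt{|2\pi K|}$ for $N = 2$ by brute-force grid integration; the covariance matrix, grid, and use of numpy are illustrative assumptions.

```python
# Numerical sanity check of I_K = sqrt(|2 pi K|) for N = 2 via grid integration.
# The covariance matrix below is an arbitrary symmetric positive definite choice.
import numpy as np

K = np.array([[2.0, 0.6],
              [0.6, 1.0]])
K_inv = np.linalg.inv(K)

# Integrate exp(-1/2 z^T K^{-1} z) over a grid wide enough to contain
# essentially all of the Gaussian's support.
grid = np.linspace(-10.0, 10.0, 801)
dz = grid[1] - grid[0]
z1, z2 = np.meshgrid(grid, grid, indexing="ij")
quad_form = K_inv[0, 0] * z1**2 + 2 * K_inv[0, 1] * z1 * z2 + K_inv[1, 1] * z2**2
I_numeric = np.sum(np.exp(-0.5 * quad_form)) * dz**2

I_exact = np.sqrt(np.linalg.det(2 * np.pi * K))
print(I_numeric, I_exact)   # the two values agree to several decimal places
```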
Having figured out the normalization factor, we can define the zero-mean multivariable Gaussian probability distribution with variance $K_{\mu\nu}$ as
$$p(z) = \frac{1}{\sqrt{|2\pi K|}}\, \exp\!\left[-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu\, (K^{-1})_{\mu\nu}\, z_\nu\right]. \qquad (1.31)$$
While we're at it, let us also introduce the convention of suppressing the superscript "$-1$" for the inverse covariance $(K^{-1})_{\mu\nu}$, instead placing the component indices upstairs as
$$K^{\mu\nu} \equiv (K^{-1})_{\mu\nu}.$$
This way, we distinguish the covariance $K_{\mu\nu}$ from the inverse covariance $K^{\mu\nu}$ by whether the component indices are lowered or raised. With this notation, inherited from general relativity, the defining equation for the inverse covariance (1.25) is written instead as
$$\sum_{\rho=1}^{N} K^{\mu\rho}\, K_{\rho\nu} = \delta^{\mu}_{\ \nu},$$
and the multivariable Gaussian distribution (1.31) is written as
$$p(z) = \frac{1}{\sqrt{|2\pi K|}}\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu\, K^{\mu\nu} z_\nu\right). \qquad (1.34)$$
Although it might take some getting used to, this notation saves us some space and saves you some handwriting pain.
If you like, in your notes you can also go full general-relativistic mode and adopt the Einstein summation convention, suppressing the summation symbol any time indices are repeated in upstairs-downstairs pairs. For instance, if we adopted this convention we would write the defining equation for the inverse simply as $K^{\mu\rho} K_{\rho\nu} = \delta^{\mu}_{\ \nu}$ and the Gaussian function as $\exp\!\left(-\frac{1}{2}\, z_\mu K^{\mu\nu} z_\nu\right)$.
Specifically for neural networks, you might find the Einstein summation convention helpful for sample indices, but sometimes confusing for neural indices. For extra clarity, we won’t adopt this convention in the text of the book, but we mention it now since we do often use such a convention to simplify our own calculations in private.
Regardless of how it's written, the zero-mean multivariable Gaussian probability distribution (1.34) peaks at $z = 0$, and its falloff is direction-dependent, determined by the covariance matrix $K_{\mu\nu}$. More generally, we can shift the peak of the Gaussian distribution to $s_\mu$,
$$p(z) = \frac{1}{\sqrt{|2\pi K|}}\, \exp\!\left[-\frac{1}{2}\sum_{\mu,\nu=1}^{N} (z - s)_\mu\, K^{\mu\nu}\, (z - s)_\nu\right],$$
which defines a general multivariable Gaussian distribution with mean $\mathbb{E}[z_\mu] = s_\mu$ and covariance $K_{\mu\nu}$. This is the most general version of the Gaussian distribution.
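As a concrete aside (not from the original text), here is a minimal sketch of what this distribution means operationally: draw many samples with a given mean and covariance and check the first two moments. The particular $s$, $K$, and numpy sampler are illustrative assumptions.

```python
# Draw samples from the general multivariable Gaussian with mean s and covariance K,
# then verify the empirical mean and covariance. The values of s and K are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.0, -2.0])
K = np.array([[2.0, 0.6],
              [0.6, 1.0]])

samples = rng.multivariate_normal(mean=s, cov=K, size=1_000_000)

print(samples.mean(axis=0))            # ~ s
print(np.cov(samples, rowvar=False))   # ~ K
```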
Next, let's consider the moments of the mean-zero multivariable Gaussian distribution,
$$\mathbb{E}\big[z_{\mu_1} z_{\mu_2}\cdots z_{\mu_M}\big] = \frac{1}{\sqrt{|2\pi K|}}\int d^N z\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu K^{\mu\nu} z_\nu\right) z_{\mu_1} z_{\mu_2}\cdots z_{\mu_M} \equiv \frac{I_{K,(\mu_1,\ldots,\mu_M)}}{I_K}.$$
Following our approach in the single-variable case, let's construct the generating function for the integrals $I_{K,(\mu_1,\ldots,\mu_M)}$ by including a source term $J^\mu$ as
$$Z_{K,J} \equiv \int d^N z\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu K^{\mu\nu} z_\nu + \sum_{\mu=1}^{N} J^\mu z_\mu\right). \qquad (1.38)$$
As the name suggests, differentiating the generating function $Z_{K,J}$ with respect to the source $J^\mu$ brings down a power of $z_\mu$, such that after $M$ such differentiations we have
$$\left.\frac{d}{dJ^{\mu_1}}\cdots \frac{d}{dJ^{\mu_M}}\, Z_{K,J}\right|_{J=0} = \int d^N z\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu K^{\mu\nu} z_\nu\right) z_{\mu_1}\cdots z_{\mu_M} = I_{K,(\mu_1,\ldots,\mu_M)}. \qquad (1.39)$$
So, as in the single-variable case, the Taylor coefficients of the partition function $Z_{K,J}$ expanded around $J^\mu = 0$ are simply related to the integrals with insertions $I_{K,(\mu_1,\ldots,\mu_M)}$. Therefore, if we knew a closed-form expression for $Z_{K,J}$, we could easily compute the values of the integrals $I_{K,(\mu_1,\ldots,\mu_M)}$.
To evaluate the generating function $Z_{K,J}$ in a closed form, again we follow the lead of the single-variable case and complete the square in the exponent of the integrand in (1.38) as
$$-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu K^{\mu\nu} z_\nu + \sum_{\mu=1}^{N} J^\mu z_\mu = \frac{1}{2}\sum_{\mu,\nu=1}^{N} J^\mu K_{\mu\nu} J^\nu - \frac{1}{2}\sum_{\mu,\nu=1}^{N} w_\mu K^{\mu\nu} w_\nu,$$
where we have introduced the shifted variable $w_\mu \equiv z_\mu - \sum_{\nu=1}^{N} K_{\mu\nu} J^\nu$. Since this shift leaves the integration measure invariant, $d^N z = d^N w$, the remaining integral over $w_\mu$ is just the multivariable Gaussian integral $I_K$, and we arrive at the closed-form expression
$$Z_{K,J} = \sqrt{|2\pi K|}\, \exp\!\left(\frac{1}{2}\sum_{\mu,\nu=1}^{N} J^\mu K_{\mu\nu} J^\nu\right), \qquad (1.41)$$
where at the end we used our formula for the multivariable integral $I_K$, (1.30). With our closed-form expression (1.41) for the generating function $Z_{K,J}$, we can compute the Gaussian integrals with insertions $I_{K,(\mu_1,\ldots,\mu_M)}$ by differentiating it, using (1.39). For an even number $M = 2m$ of insertions, we find a really nice formula,
$$I_{K,(\mu_1,\ldots,\mu_{2m})} = \sqrt{|2\pi K|}\; \frac{1}{2^m\, m!}\; \frac{d}{dJ^{\mu_1}}\cdots \frac{d}{dJ^{\mu_{2m}}} \left(\sum_{\mu,\nu=1}^{N} J^\mu K_{\mu\nu} J^\nu\right)^{\! m}, \qquad (1.42)$$
since the only term in the Taylor expansion of the exponential in (1.41) that survives both the $2m$ differentiations and the setting of $J = 0$ is the one with exactly $m$ powers of the quadratic form.
For an odd number $M = 2m+1$ of insertions, there is a dangling source upon setting $J = 0$, and so those integrals vanish. You can also see this by looking at the integrand for any odd moment and noticing that it is odd with respect to the sign flip of the integration variables, $z_\mu \leftrightarrow -z_\mu$.
Now, let's take a few moments to evaluate a few moments using this formula. For $2m = 2$, we have
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}] = \frac{I_{K,(\mu_1,\mu_2)}}{I_K} = \frac{1}{2}\,\frac{d}{dJ^{\mu_1}}\frac{d}{dJ^{\mu_2}}\left(\sum_{\mu,\nu=1}^{N} J^\mu K_{\mu\nu} J^\nu\right) = K_{\mu_1\mu_2}. \qquad (1.43)$$
Here, there are $2! = 2$ ways to apply the product rule for derivatives and differentiate the two $J$'s, both of which evaluate to the same expression due to the symmetry of the covariance, $K_{\mu_1\mu_2} = K_{\mu_2\mu_1}$. This expression (1.43) validates in the multivariable setting why we have been calling $K_{\mu\nu}$ the covariance, because we see explicitly that it is the covariance.
Next, for $2m = 4$ we get a more complicated expression,
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}] = \frac{I_{K,(\mu_1,\mu_2,\mu_3,\mu_4)}}{I_K} = K_{\mu_1\mu_2} K_{\mu_3\mu_4} + K_{\mu_1\mu_3} K_{\mu_2\mu_4} + K_{\mu_1\mu_4} K_{\mu_2\mu_3}. \qquad (1.44)$$
Here we note that there are now $4! = 24$ ways to differentiate the four $J$'s, though only three distinct ways to pair the four auxiliary indices $1, 2, 3, 4$ that sit under $\mu$. This gives $24/3 = 8 = 2^2\, 2!$ equivalent terms for each of the three pairings, which cancels against the overall factor of $1/(2^2\, 2!)$.
For general $2m$, there are $(2m)!$ ways to differentiate the sources, of which $2^m\, m!$ are equivalent. This gives $(2m)!/(2^m\, m!) = (2m-1)!!$ distinct terms, corresponding to the $(2m-1)!!$ distinct pairings of the $2m$ auxiliary indices $1, \ldots, 2m$ that sit under $\mu$. The factor of $1/(2^m\, m!)$ in the denominator of (1.42) ensures that the coefficient of each of these terms is normalized to unity. Thus, most generally, we can express the moments of the multivariable Gaussian with the following formula:
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}\cdots z_{\mu_{2m}}] = \sum_{\text{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}, \qquad (1.45)$$
where, to reiterate, the sum is over all the possible distinct pairings of the $2m$ auxiliary indices under $\mu$ such that the result has the $(2m-1)!!$ terms that we described above. Each factor of the covariance $K_{\mu\nu}$ in a term in the sum is called a Wick contraction, corresponding to a particular pairing of auxiliary indices. Each term then is composed of $m$ different Wick contractions, representing a distinct way of pairing up all the auxiliary indices. To make sure you understand how this pairing works, look back at the $2m = 2$ case (1.43), with a single Wick contraction, and the $2m = 4$ case (1.44), with three distinct ways of making two Wick contractions, and try to work out the $2m = 6$ case, which yields $(6-1)!! = 15$ distinct ways of making three Wick contractions:
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\mu_5} z_{\mu_6}] = K_{\mu_1\mu_2} K_{\mu_3\mu_4} K_{\mu_5\mu_6} + K_{\mu_1\mu_2} K_{\mu_3\mu_5} K_{\mu_4\mu_6} + K_{\mu_1\mu_2} K_{\mu_3\mu_6} K_{\mu_4\mu_5}$$
$$+\, K_{\mu_1\mu_3} K_{\mu_2\mu_4} K_{\mu_5\mu_6} + K_{\mu_1\mu_3} K_{\mu_2\mu_5} K_{\mu_4\mu_6} + K_{\mu_1\mu_3} K_{\mu_2\mu_6} K_{\mu_4\mu_5}$$
$$+\, K_{\mu_1\mu_4} K_{\mu_2\mu_3} K_{\mu_5\mu_6} + K_{\mu_1\mu_4} K_{\mu_2\mu_5} K_{\mu_3\mu_6} + K_{\mu_1\mu_4} K_{\mu_2\mu_6} K_{\mu_3\mu_5}$$
$$+\, K_{\mu_1\mu_5} K_{\mu_2\mu_3} K_{\mu_4\mu_6} + K_{\mu_1\mu_5} K_{\mu_2\mu_4} K_{\mu_3\mu_6} + K_{\mu_1\mu_5} K_{\mu_2\mu_6} K_{\mu_3\mu_4}$$
$$+\, K_{\mu_1\mu_6} K_{\mu_2\mu_3} K_{\mu_4\mu_5} + K_{\mu_1\mu_6} K_{\mu_2\mu_4} K_{\mu_3\mu_5} + K_{\mu_1\mu_6} K_{\mu_2\mu_5} K_{\mu_3\mu_4}.$$
The formula (1.45) is Wick’s theorem. Put a box around it. Take a few moments for reflection. ⋯ ⋯ ⋯
Good. You are now a Gaussian sensei. Exhale, and then say as Neo would say, “I know Gaussian integrals.”
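For readers who want an empirical confirmation, here is a minimal Monte Carlo sketch (not from the original text) checking Wick's theorem (1.45) for a four-point correlator; the specific covariance matrix, component indices, and numpy sampler are illustrative assumptions.

```python
# Monte Carlo check of Wick's theorem: the sampled four-point correlator
# E[z_mu1 z_mu2 z_mu3 z_mu4] should match the sum over the three pairings
# K12*K34 + K13*K24 + K14*K23. The covariance below is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[1.5, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 0.8]])
z = rng.multivariate_normal(mean=np.zeros(3), cov=K, size=2_000_000)

mu = (0, 1, 2, 1)   # the component indices mu1, mu2, mu3, mu4 (repeats allowed)
empirical = np.mean(z[:, mu[0]] * z[:, mu[1]] * z[:, mu[2]] * z[:, mu[3]])

wick = (K[mu[0], mu[1]] * K[mu[2], mu[3]]
        + K[mu[0], mu[2]] * K[mu[1], mu[3]]
        + K[mu[0], mu[3]] * K[mu[1], mu[2]])

print(empirical, wick)   # agree up to Monte Carlo error
```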
Now that the moments have passed, it is an appropriate time to transition to the next section where you will learn about more general probability distributions.
1.2 Probability, Correlation and Statistics, and All That
In introducing the Gaussian distribution in the last section we briefly touched upon the concepts of expectation and moments. These are defined for non-Gaussian probability distributions too, so now let us reintroduce these concepts and expand on their definitions, with an eye towards understanding the nearly-Gaussian distributions that describe wide neural networks.
Given a probability distribution p(z) of an N-dimensional random variable zμ, we can learn about its statistics by measuring functions of zμ. We’ll refer to such measurable functions in a generic sense as observables and denote them as O(z). The expectation value of an observable
$$\mathbb{E}[O(z)] \equiv \int d^N z\, p(z)\, O(z) \qquad (1.46)$$
characterizes the mean value of the random function $O(z)$. Note that the observable $O(z)$ need not be a scalar-valued function; e.g., the second moment of a distribution is a matrix-valued observable given by $O(z) = z_\mu z_\nu$.
Operationally, an observable is a quantity that we measure by conducting experiments in order to connect to a theoretical model for the underlying probability distribution describing zμ. In particular, we repeatedly measure the observables that are naturally accessible to us as experimenters, collect their statistics, and then compare them with predictions for the expectation values of those observables computed from some theoretical model of p(z).
With that in mind, it’s very natural to ask: what kind of information can we learn about an underlying distribution p(z) by measuring an observable O(z) ? For an a priori unknown distribution, is there a set of observables that can serve as a sufficient probe of p(z) such that we could use that information to predict the result of all future experiments involving zμ ?
Consider a class of observables that we've already encountered, the moments or M-point correlators of $z_\mu$, given by the expectation
$$\mathbb{E}\big[z_{\mu_1} z_{\mu_2}\cdots z_{\mu_M}\big] = \int d^N z\, p(z)\, z_{\mu_1} z_{\mu_2}\cdots z_{\mu_M}.$$
In the rest of this book, we'll often use the physics term M-point correlator rather than the statistics term moment, though they mean the same thing and can be used interchangeably.
In principle, knowing the M-point correlators of a distribution lets us compute the expectation value of any analytic observable $O(z)$ via Taylor expansion,
$$\mathbb{E}[O(z)] = \mathbb{E}\!\left[\sum_{M=0}^{\infty}\frac{1}{M!}\sum_{\mu_1,\ldots,\mu_M=1}^{N}\frac{\partial^M O}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}}\bigg|_{z=0}\, z_{\mu_1}\cdots z_{\mu_M}\right]$$
$$= \sum_{M=0}^{\infty}\frac{1}{M!}\sum_{\mu_1,\ldots,\mu_M=1}^{N}\frac{\partial^M O}{\partial z_{\mu_1}\cdots \partial z_{\mu_M}}\bigg|_{z=0}\; \mathbb{E}\big[z_{\mu_1}\cdots z_{\mu_M}\big], \qquad (1.48)$$
where on the last line we took the Taylor coefficients out of the expectation by using the linearity property of the expectation, inherited from the linearity property of the integral in (1.46). As such, it’s clear that the collection of all the M-point correlators completely characterizes a probability distribution for all intents and purposes.
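To see this moments-determine-observables logic in action, here is a minimal single-variable sketch (not from the original text): summing the Taylor series of $O(z) = \cos z$ against the Gaussian moments $\mathbb{E}[z^{2m}] = (2m-1)!!\, K^m$ reproduces the exact Gaussian expectation $e^{-K/2}$. The observable and the value of $K$ are arbitrary illustrative choices.

```python
# Evaluate E[cos z] for a single-variable zero-mean Gaussian with variance K
# by summing Taylor coefficients against the Gaussian moments (2m-1)!! K^m,
# and compare with the exact closed-form answer exp(-K/2).
import math

K = 0.7
series = 0.0
double_factorial = 1.0              # (2m - 1)!!, with the convention (-1)!! = 1
for m in range(30):
    if m > 0:
        double_factorial *= (2 * m - 1)
    taylor_coeff = (-1) ** m / math.factorial(2 * m)   # cos z = sum (-1)^m z^(2m)/(2m)!
    series += taylor_coeff * double_factorial * K**m

print(series, math.exp(-K / 2))     # the two values agree
```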
In fact, the moments offer a dual description of the probability distribution through either the Laplace transform or the Fourier transform. For instance, the Laplace transform of the probability distribution p(z) is given by
$$Z_J \equiv \mathbb{E}\!\left[\exp\!\left(\sum_{\mu} J^\mu z_\mu\right)\right] = \int \left[\prod_{\mu} dz_\mu\right] p(z)\, \exp\!\left(\sum_{\mu} J^\mu z_\mu\right).$$
As in the Gaussian case, this integral gives a generating function for the M-point correlators of p(z), which means that ZJ can be reconstructed from these correlators. The probability distribution can then be obtained through the inverse Laplace transform.
However, this description in terms of all the correlators is somewhat cumbersome and operationally infeasible. To get a reliable estimate of the M-point correlator, we must simultaneously measure M components of a random variable for each draw and repeat such measurements many times. As M grows, this task quickly becomes impractical. In fact, if we could easily perform such measurements for all M, then our theoretical model of p(z) would no longer be a useful abstraction; from (1.48) we would already know the outcome of all possible experiments that we could perform, leaving nothing for us to predict.
To that point, essentially all useful distributions can be effectively described in terms of a finite number of quantities, giving them a parsimonious representation. For instance, consider the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu\nu}$. The nonzero 2m-point correlators are given by Wick's theorem (1.45) as
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}\cdots z_{\mu_{2m}}] = \sum_{\text{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}} \qquad (1.50)$$
and are determined entirely by the $N(N+1)/2$ independent components of the variance $K_{\mu\nu}$. The variance itself can be estimated by measuring the two-point correlator
$$\mathbb{E}[z_\mu z_\nu] = K_{\mu\nu}.$$
This is consistent with our description of the distribution itself as "the zero-mean $N$-dimensional Gaussian distribution with the variance $K_{\mu\nu}$," in which we only had to specify this same set of numbers, $K_{\mu\nu}$, to pick out the particular distribution we had in mind. For zero-mean Gaussian distributions, there's no reason to measure or keep track of any of the higher-point correlators, as they are completely constrained by the variance through (1.50).
More generally, it would be nice if there were a systematic way for learning about non-Gaussian probability distributions without performing an infinite number of experiments. For nearly-Gaussian distributions, a useful set of observables is given by what statisticians call cumulants and physicists call connected correlators.
Outside of this chapter, just as we’ll often use the term M-point correlator rather than the term moment, we’ll use the term M-point connected correlator rather than the term cumulant. When we want to refer to the moment and not the cumulant, we might sometimes say full correlator to contrast with connected correlator.
As the formal definition of these quantities is somewhat cumbersome and unintuitive, let’s start with a few simple examples.
The first cumulant or the connected one-point correlator is the same as the full one-point correlator
$$\mathbb{E}[z_\mu]\big|_{\text{connected}} \equiv \mathbb{E}[z_\mu].$$
This is just the mean of the distribution. The second cumulant or the connected two-point correlator is given by
$$\mathbb{E}[z_\mu z_\nu]\big|_{\text{connected}} \equiv \mathbb{E}\big[(z_\mu - \mathbb{E}[z_\mu])\,(z_\nu - \mathbb{E}[z_\nu])\big] = \mathbb{E}[z_\mu z_\nu] - \mathbb{E}[z_\mu]\,\mathbb{E}[z_\nu], \qquad (1.53)$$
which is also known as the covariance of the distribution. Note how the mean is subtracted from the random variable $z_\mu$ before taking the square in the connected version. The quantity $\Delta z_\mu \equiv z_\mu - \mathbb{E}[z_\mu]$ represents a fluctuation of the random variable around its mean. Intuitively, such fluctuations are as likely to contribute positively as negatively, $\mathbb{E}[\Delta z_\mu] = \mathbb{E}[z_\mu] - \mathbb{E}[z_\mu] = 0$, so it's necessary to take the square in order to get an estimate of the magnitude of such fluctuations.
At this point, let us restrict our focus to distributions that are invariant under a sign-flip symmetry $z_\mu \to -z_\mu$, which holds for the zero-mean Gaussian distribution (1.34). Importantly, this parity symmetry will also hold for the nearly-Gaussian distributions that we will study in order to describe neural networks. For all such even distributions with this symmetry, all odd moments and all odd-point connected correlators vanish.
With this restriction, the next simplest observable is the fourth cumulant or the connected four-point correlator, given by the formula
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} \equiv \mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}] - \mathbb{E}[z_{\mu_1} z_{\mu_2}]\,\mathbb{E}[z_{\mu_3} z_{\mu_4}] - \mathbb{E}[z_{\mu_1} z_{\mu_3}]\,\mathbb{E}[z_{\mu_2} z_{\mu_4}] - \mathbb{E}[z_{\mu_1} z_{\mu_4}]\,\mathbb{E}[z_{\mu_2} z_{\mu_3}]. \qquad (1.54)$$
For the Gaussian distribution, recalling the Wick theorem (1.50), the last three terms precisely subtract off the three pairs of Wick contractions used to evaluate the first term, meaning
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} = 0.$$
Essentially by design, the connected four-point correlator vanishes for the Gaussian distribution, and a nonzero value signifies a deviation from Gaussian statistics.
In statistics, the connected four-point correlator for a single random variable z is called the excess kurtosis when normalized by the square of the variance. It is a natural measure of the tails of the distribution, as compared to a Gaussian distribution, and also serves as a measure of the potential for outliers. In particular, a positive value indicates fatter tails while a negative value indicates thinner tails.
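As an illustrative aside (not from the original text), here is a minimal sketch that estimates the excess kurtosis from samples of a fat-tailed and a thin-tailed single-variable distribution, assuming numpy; the Laplace and uniform distributions are arbitrary stand-ins for "fatter tails" and "thinner tails" respectively.

```python
# Estimate the excess kurtosis, i.e., the connected four-point correlator of a
# single variable normalized by the square of its variance. The Gaussian value
# is 0; the Laplace distribution should come out positive (fatter tails) and
# the uniform distribution negative (thinner tails).
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(z):
    dz = z - z.mean()                       # fluctuation around the mean
    var = np.mean(dz**2)
    connected_4pt = np.mean(dz**4) - 3 * var**2
    return connected_4pt / var**2

print(excess_kurtosis(rng.normal(size=1_000_000)))    # ~  0
print(excess_kurtosis(rng.laplace(size=1_000_000)))   # ~ +3
print(excess_kurtosis(rng.uniform(size=1_000_000)))   # ~ -1.2
```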
In fact, the connected four-point correlator is perhaps the simplest measure of non-Gaussianity. Now that we have a little intuition, we are as ready as we'll ever be to discuss the definition of the M-th cumulant or the M-point connected correlator. For completeness, we'll give the general definition, before restricting again to distributions that are symmetric under parity $z_\mu \to -z_\mu$. The definition is inductive and somewhat counterintuitive, expressing the M-th moment in terms of connected correlators from degree 1 to M:
$$\mathbb{E}[z_{\mu_1}\cdots z_{\mu_M}] = \mathbb{E}[z_{\mu_1}\cdots z_{\mu_M}]\big|_{\text{connected}} + \sum_{\text{all subdivisions}} \mathbb{E}\big[z_{\mu_{k_1^{[1]}}}\cdots z_{\mu_{k_{\nu_1}^{[1]}}}\big]\Big|_{\text{connected}} \cdots\; \mathbb{E}\big[z_{\mu_{k_1^{[s]}}}\cdots z_{\mu_{k_{\nu_s}^{[s]}}}\big]\Big|_{\text{connected}}, \qquad (1.56)$$
where the sum is over all the possible subdivisions of the $M$ variables into $s > 1$ clusters of sizes $(\nu_1, \ldots, \nu_s)$, labeled as $\big(k_1^{[1]}, \ldots, k_{\nu_1}^{[1]}\big), \ldots, \big(k_1^{[s]}, \ldots, k_{\nu_s}^{[s]}\big)$. By decomposing the M-th moment into a sum of products of connected correlators of degree M and lower, we see that the connected M-point correlator corresponds to a new type of correlation that cannot be expressed in terms of connected correlators of lower degree. We saw an example of this above when discussing the connected four-point correlator as a simple measure of non-Gaussianity.
To see how this abstract definition actually works, let’s revisit the examples. First, we trivially recover the relation between the mean and the one-point connected correlator
$$\mathbb{E}[z_\mu]\big|_{\text{connected}} = \mathbb{E}[z_\mu],$$
as there is no subdivision of a single $M = 1$ variable into any smaller pieces. For $M = 2$, the definition (1.56) gives
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}] = \mathbb{E}[z_{\mu_1} z_{\mu_2}]\big|_{\text{connected}} + \mathbb{E}[z_{\mu_1}]\big|_{\text{connected}}\; \mathbb{E}[z_{\mu_2}]\big|_{\text{connected}}.$$
Rearranging to solve for the connected two-point function in terms of the moments, we see that this is equivalent to our previous definition for the covariance (1.53).
At this point, let us again restrict to parity-symmetric distributions invariant under $z_\mu \to -z_\mu$, remembering that this means that all the odd-point connected correlators will vanish. For such distributions, evaluating the definition (1.56) for $M = 4$ gives
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}] = \mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} + \mathbb{E}[z_{\mu_1} z_{\mu_2}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} + \mathbb{E}[z_{\mu_1} z_{\mu_3}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_2} z_{\mu_4}]\big|_{\text{connected}} + \mathbb{E}[z_{\mu_1} z_{\mu_4}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_2} z_{\mu_3}]\big|_{\text{connected}}.$$
Since $\mathbb{E}[z_{\mu_1} z_{\mu_2}] = \mathbb{E}[z_{\mu_1} z_{\mu_2}]\big|_{\text{connected}}$ when the mean vanishes, this is also just a rearrangement of our previous expression (1.54) for the connected four-point correlator of such zero-mean distributions.
In order to see something new, let us carry on for $M = 6$:
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\mu_5} z_{\mu_6}] = \mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\mu_5} z_{\mu_6}]\big|_{\text{connected}}$$
$$+\, \mathbb{E}[z_{\mu_1} z_{\mu_2}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_5} z_{\mu_6}]\big|_{\text{connected}} + \big[\text{14 other } (2,2,2) \text{ subdivisions}\big]$$
$$+\, \mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}}\, \mathbb{E}[z_{\mu_5} z_{\mu_6}]\big|_{\text{connected}} + \big[\text{14 other } (4,2) \text{ subdivisions}\big], \qquad (1.60)$$
in which we have expressed the full six-point correlator in terms of a sum of products of connected two-point, four-point, and six-point correlators. Rearranging the above expression and expressing the two-point and four-point connected correlators in terms of their definitions, (1.53) and (1.54), we obtain an expression for the connected six-point correlator:
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\mu_5} z_{\mu_6}]\big|_{\text{connected}} = \mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\mu_5} z_{\mu_6}]$$
$$-\, \Big\{\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\, \mathbb{E}[z_{\mu_5} z_{\mu_6}] + \big[\text{14 other } (4,2) \text{ subdivisions}\big]\Big\}$$
$$+\, 2\,\Big\{\mathbb{E}[z_{\mu_1} z_{\mu_2}]\, \mathbb{E}[z_{\mu_3} z_{\mu_4}]\, \mathbb{E}[z_{\mu_5} z_{\mu_6}] + \big[\text{14 other } (2,2,2) \text{ subdivisions}\big]\Big\}.$$
The rearrangement is useful for computational purposes, in that it’s simple to first compute the moments of a distribution and then organize the resulting expressions in order to evaluate the connected correlators.
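Here is a minimal sketch of that recipe in code (not from the original text), specialized to a parity-symmetric single variable, for which the rearranged formulas reduce to $\mathbb{E}[z^4]_{\text{connected}} = \mathbb{E}[z^4] - 3\,\mathbb{E}[z^2]^2$ and $\mathbb{E}[z^6]_{\text{connected}} = \mathbb{E}[z^6] - 15\,\mathbb{E}[z^4]\,\mathbb{E}[z^2] + 30\,\mathbb{E}[z^2]^3$; the Gaussian test samples and numpy are illustrative assumptions.

```python
# First estimate the moments from samples, then combine them into connected
# correlators. For Gaussian samples, every connected correlator beyond the
# second should come out close to zero.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(scale=1.3, size=4_000_000)    # zero-mean Gaussian with K = 1.69

m2, m4, m6 = np.mean(z**2), np.mean(z**4), np.mean(z**6)

c2 = m2                                      # connected two-point correlator
c4 = m4 - 3 * m2**2                          # connected four-point correlator
c6 = m6 - 15 * m4 * m2 + 30 * m2**3          # connected six-point correlator

print(c2)   # ~ 1.69
print(c4)   # ~ 0
print(c6)   # ~ 0
```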
Focusing back on (1.60), it's easy to see that the connected six-point correlator vanishes for Gaussian distributions. Remembering that the connected four-point correlator also vanishes for Gaussian distributions, we see that the fifteen $(2,2,2)$ subdivision terms are exactly equal to the fifteen terms generated by the Wick contractions resulting from evaluating the full correlator on the left-hand side of the equation. In fact, applying the general definition of connected correlators (1.56) to the zero-mean Gaussian distribution, we see inductively that all M-point connected correlators for $M > 2$ will vanish.
To see this, note that if all the higher-point connected correlators vanish, then the definition (1.56) is equivalent to Wick's theorem (1.50), with the nonzero terms in (1.56), the subdivisions into clusters of sizes $(2, \ldots, 2)$, corresponding exactly to the different pairings in (1.50).
Thus, the connected correlators are a very natural measure of how a distribution deviates from Gaussianity.
With this in mind, we can finally define a nearly-Gaussian distribution as a distribution for which all the connected correlators for $M > 2$ are small.
As we discussed in §1.1, the variance sets the scale of the Gaussian distribution. For nearly-Gaussian distributions, we require that all 2m-point connected correlators be parametrically small when compared to an appropriate power of the variance, i.e., $\left|\,\mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]\big|_{\text{connected}}\right| \ll |K_{\mu\nu}|^{m}$, schematically.
In fact, the non-Gaussian distributions that describe neural networks generally have the property that, as the network becomes wide, the connected four-point correlator becomes small and the higher-point connected correlators become even smaller. For these nearly-Gaussian distributions, a few leading connected correlators give a concise and accurate description of the distribution, just as a few leading Taylor coefficients can give a good description of a function near the point of expansion.
1.3 Nearly-Gaussian Distributions
Now that we have defined nearly-Gaussian distributions in terms of measurable deviations from Gaussian statistics, i.e. via small but nonzero connected correlators, it’s natural to ask how we can link these observables to the actual functional form of the distribution, p(z). We can make this connection through the action.
The action S(z) is a function that defines a probability distribution p(z) through the relation
$$p(z) \propto e^{-S(z)}. \qquad (1.62)$$
In the statistics literature, the action $S(z)$ is sometimes called the negative log probability, but we will again follow the physics literature and call it the action. In order for (1.62) to make sense as a probability distribution, $p(z)$ needs to be normalizable so that we can satisfy
$$\int d^N z\, p(z) = 1.$$
That's where the normalization factor or partition function
$$Z \equiv \int d^N z\, e^{-S(z)}$$
comes in. After computing the partition function, we can define a probability distribution for a particular action S(z) as
$$p(z) \equiv \frac{e^{-S(z)}}{Z}.$$
Conversely, given a probability distribution we can associate an action, S(z)=−log[p(z)], up to an additive ambiguity: the ambiguity arises because a constant shift in the action can be offset by the multiplicative factor in the partition function.
One convention is to pick the constant such that the action vanishes when evaluated at its global minimum.
The action is a very convenient way to approximate certain types of statistical processes, particularly those with nearly-Gaussian statistics. To demonstrate this, we’ll first start with the simplest action, which describes the Gaussian distribution, and then we’ll show how to systematically perturb it in order to include various non-Gaussianities.
1.3.1 Quadratic action and the Gaussian distribution
Since we already know the functional form of the Gaussian distribution, it’s simple to identify the action by reading it off from the exponent in (1.34)
$$S(z) = \frac{1}{2}\sum_{\mu,\nu=1}^{N} K^{\mu\nu}\, z_\mu z_\nu, \qquad (1.66)$$
where, as a reminder, the matrix $K^{\mu\nu}$ is the inverse of the variance matrix $K_{\mu\nu}$. The partition function is given by the normalization integral (1.30) that we computed in §1.1:
$$Z = \int d^N z\, e^{-S(z)} = I_K = \sqrt{|2\pi K|}.$$
This quadratic action is the simplest normalizable action and serves as a starting point for defining other distributions.
As we will show next, integrals against the Gaussian distribution are a primitive for evaluating expectations against nearly-Gaussian distributions. Therefore, in order to differentiate between a general expectation and an integral against the Gaussian distribution, let us introduce a special bra-ket, or $\langle\cdot\rangle$, notation for computing Gaussian expectation values. For an observable $O(z)$, define a Gaussian expectation as
$$\big\langle O(z)\big\rangle_K \equiv \frac{1}{\sqrt{|2\pi K|}}\int d^N z\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} z_\mu K^{\mu\nu} z_\nu\right) O(z). \qquad (1.68)$$
In this notation, Wick's theorem (1.45) for the Gaussian expectation of $2m$ insertions reads
$$\big\langle z_{\mu_1} z_{\mu_2}\cdots z_{\mu_{2m}}\big\rangle_K = \sum_{\text{all pairings}} K_{\mu_{k_1}\mu_{k_2}}\cdots K_{\mu_{k_{2m-1}}\mu_{k_{2m}}}. \qquad (1.69)$$
If we're talking about a Gaussian distribution with variance $K_{\mu\nu}$, then we can use the notation $\mathbb{E}[\cdot]$ and $\langle\cdot\rangle_K$ interchangeably. If instead we're talking about a nearly-Gaussian distribution $p(z)$, then $\mathbb{E}[\cdot]$ indicates expectation with respect to $p(z)$, (1.46). However, in the evaluation of such an expectation, we'll often encounter Gaussian integrals, for which we'll use this bra-ket notation $\langle\cdot\rangle_K$ to simplify expressions.
Quartic action and perturbation theory
Now, let’s find an action that represents a nearly-Gaussian distribution with a connected four-point correlator that is small but non-vanishing
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} = O(\epsilon). \qquad (1.70)$$
Here we have introduced a small parameter $\epsilon \ll 1$ and indicated that the correlator should be of order $\epsilon$. For neural networks, we will later find that the role of the small parameter $\epsilon$ is played by 1/width.
We should be able to generate a small connected four-point correlator by deforming the Gaussian distribution through the addition of a small quartic term to the quadratic action (1.66), giving us a quartic action
$$S(z) = \frac{1}{2}\sum_{\mu,\nu=1}^{N} K^{\mu\nu}\, z_\mu z_\nu + \frac{\epsilon}{4!}\sum_{\mu,\nu,\rho,\lambda=1}^{N} V^{\mu\nu\rho\lambda}\, z_\mu z_\nu z_\rho z_\lambda, \qquad (1.71)$$
where the quartic coupling $\epsilon\, V^{\mu\nu\rho\lambda}$ is an $(N \times N \times N \times N)$-dimensional tensor that is completely symmetric in all of its four indices. The factor of $1/4!$ is conventional in order to compensate for the overcounting in the sum due to the symmetry of the indices. While it's not a proof of the connection, note that the coupling $\epsilon\, V^{\mu\nu\rho\lambda}$ has the right number of components to faithfully reproduce the four-point connected correlator (1.70), which is also an $(N \times N \times N \times N)$-dimensional symmetric tensor. At least from this perspective we're off to a good start.
Let us now establish this correspondence between the quartic coupling and the connected four-point correlator. Note that in general it is impossible to compute any expectation value in closed form with a non-Gaussian action; this includes even the partition function. Instead, in order to compute the connected four-point correlator we'll need to employ perturbation theory to expand everything to first order in the small parameter $\epsilon$, each term of which can then be evaluated in a closed form. As this is easier done than said, let's get to the computations.
To start, let's evaluate the partition function:
$$Z = \int d^N z\, e^{-S(z)}$$
$$= \int d^N z\, \exp\!\left(-\frac{1}{2}\sum_{\mu,\nu=1}^{N} K^{\mu\nu} z_\mu z_\nu - \frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)$$
$$= \sqrt{|2\pi K|}\;\left\langle \exp\!\left(-\frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)\right\rangle_K.$$
In the second line we inserted our expression for the quartic action (1.71), and in the last line we used our bra-ket notation (1.68) for a Gaussian expectation with variance $K_{\mu\nu}$. As advertised, the Gaussian expectation in the final line cannot be evaluated in closed form. However, since our parameter $\epsilon$ is small, we can Taylor-expand the exponential to express the partition function as a sum of simple Gaussian expectations that can be evaluated using Wick's theorem (1.69):
$$Z = \sqrt{|2\pi K|}\left[1 - \frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, \big\langle z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\big\rangle_K + O(\epsilon^2)\right]$$
$$= \sqrt{|2\pi K|}\left[1 - \frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, \big(K_{\rho_1\rho_2} K_{\rho_3\rho_4} + K_{\rho_1\rho_3} K_{\rho_2\rho_4} + K_{\rho_1\rho_4} K_{\rho_2\rho_3}\big) + O(\epsilon^2)\right]$$
$$= \sqrt{|2\pi K|}\left[1 - \frac{\epsilon}{8}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, K_{\rho_1\rho_2} K_{\rho_3\rho_4} + O(\epsilon^2)\right]. \qquad (1.73)$$
In the final line, we were able to combine the three $K^2$ terms together by using the total symmetry of the quartic coupling and then relabeling some of the summed-over dummy indices.
Similarly, let's evaluate the two-point correlator:
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}] = \frac{1}{Z}\int d^N z\, e^{-S(z)}\, z_{\mu_1} z_{\mu_2}$$
$$= \frac{\sqrt{|2\pi K|}}{Z}\left\langle z_{\mu_1} z_{\mu_2}\, \exp\!\left(-\frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\right)\right\rangle_K$$
$$= K_{\mu_1\mu_2} - \frac{\epsilon}{4!}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\Big[\big\langle z_{\mu_1} z_{\mu_2} z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\big\rangle_K - K_{\mu_1\mu_2}\big\langle z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\big\rangle_K\Big] + O(\epsilon^2)$$
$$= K_{\mu_1\mu_2} - \frac{\epsilon}{2}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, K_{\mu_1\rho_1} K_{\mu_2\rho_2} K_{\rho_3\rho_4} + O(\epsilon^2). \qquad (1.74)$$
Here, to go from the first line to the second line we inserted our expression for the quartic action (1.71) and rewrote the integral as a Gaussian expectation. Then, after expanding in $\epsilon$ to first order, in the next step we substituted (1.73) in for the partition function $Z$ in the denominator and expanded $1/Z$ to first order in $\epsilon$ using the expansion $1/(1-x) = 1 + x + O(x^2)$. In that same step, we also noted that, of the fifteen terms coming from the Gaussian expectation $\langle z_{\mu_1} z_{\mu_2} z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\rangle_K$, there are three ways in which $z_{\mu_1}$ and $z_{\mu_2}$ contract with each other but twelve ways in which they don't. Given again the symmetry of $V^{\rho_1\rho_2\rho_3\rho_4}$, this is the only distinction that matters.
At last, let’s compute the full four-point correlator:
To go from the first line to the second line we inserted our expression for the quartic action (1.71), expanded to first order in $\epsilon$, and rewrote in the bra-ket notation (1.68). On the third line, we again substituted in the expression (1.73) for the partition function $Z$, expanded $1/Z$ to first order in $\epsilon$, and then used Wick's theorem (1.69) to evaluate the fourth and eighth Gaussian moments. (Yes, we know that the evaluation of $\langle z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4} z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}\rangle_K$ is not fun. The breakdown of the terms depends again on whether or not the $\mu$-type indices are contracted with the $\rho$-type indices.) We can simplify this expression by noticing that some terms cancel due to $\frac{1}{8} - \frac{3}{24} = 0$ and some other terms can be nicely regrouped once we notice through the expression for the two-point correlator (1.74) that
$$\mathbb{E}[z_{\mu_1} z_{\mu_2}]\; \mathbb{E}[z_{\mu_3} z_{\mu_4}] = K_{\mu_1\mu_2} K_{\mu_3\mu_4} - \frac{\epsilon}{2}\sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\Big(K_{\mu_1\rho_1} K_{\mu_2\rho_2} K_{\rho_3\rho_4}\, K_{\mu_3\mu_4} + K_{\mu_1\mu_2}\, K_{\mu_3\rho_1} K_{\mu_4\rho_2} K_{\rho_3\rho_4}\Big) + O(\epsilon^2),$$
and similarly for the other two pairings of the $\mu$-type indices. Using these relations to regroup the remaining terms and then subtracting the products of two-point correlators according to the definition (1.54), we find for the connected four-point correlator
$$\mathbb{E}[z_{\mu_1} z_{\mu_2} z_{\mu_3} z_{\mu_4}]\big|_{\text{connected}} = -\,\epsilon \sum_{\rho_1,\ldots,\rho_4=1}^{N} V^{\rho_1\rho_2\rho_3\rho_4}\, K_{\mu_1\rho_1} K_{\mu_2\rho_2} K_{\mu_3\rho_3} K_{\mu_4\rho_4} + O(\epsilon^2). \qquad (1.78)$$
This makes explicit the relationship between the connected four-point correlator and the quartic coupling in the action, when both are small. We see that for the nearly-Gaussian distribution realized by the quartic action (1.71), the distribution is, as promised, nearly Gaussian: the strength of the coupling $\epsilon\, V^{\rho_1\rho_2\rho_3\rho_4}$ directly controls the distribution's deviation from Gaussian statistics, as measured by the connected four-point correlator. This also shows that the four-index tensor $V^{\rho_1\rho_2\rho_3\rho_4}$ creates nontrivial correlations between the components $z_{\rho_1} z_{\rho_2} z_{\rho_3} z_{\rho_4}$ that cannot otherwise be built up from the correlation $K_{\mu\nu}$ of any pair of random variables $z_\mu z_\nu$.
Finally, note that the connected two-point correlator (1.74), i.e., the covariance of this nearly-Gaussian distribution, is also shifted from its Gaussian value of $K_{\mu_1\mu_2}$ by the quartic coupling $\epsilon\, V^{\rho_1\rho_2\rho_3\rho_4}$. Thus, the nearly-Gaussian deformation not only creates complicated patterns of four-point correlation as measured by the connected four-point correlator (1.78), it also can modify the details of the Gaussian two-point correlation.
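If you want to see these leading-order formulas hold up numerically, here is a minimal single-variable ($N = 1$) sketch (not from the original text) that evaluates the quartic-action integrals by brute-force quadrature and compares them with the perturbative predictions; the values of $K$, $V$, and $\epsilon$ are arbitrary small illustrative choices.

```python
# For the single-variable quartic action S(z) = z^2/(2K) + (eps*V/4!) z^4,
# check the leading-order perturbative results:
#   E[z^2]            ~  K - (eps*V/2) K^3,
#   E[z^4]_connected  ~ -eps*V K^4,
# up to O(eps^2) corrections. Expectations are computed by direct quadrature.
import numpy as np

K, V, eps = 0.8, 1.0, 0.05

z = np.linspace(-12.0, 12.0, 200_001)
dz = z[1] - z[0]
weight = np.exp(-(z**2 / (2 * K) + (eps * V / 24.0) * z**4))

Z = np.sum(weight) * dz                       # partition function
ez2 = np.sum(weight * z**2) * dz / Z          # full two-point correlator
ez4 = np.sum(weight * z**4) * dz / Z          # full four-point correlator
connected_4pt = ez4 - 3 * ez2**2

print(ez2, K - 0.5 * eps * V * K**3)          # agree up to O(eps^2)
print(connected_4pt, -eps * V * K**4)         # agree up to O(eps^2)
```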
Now that we see how to compute the statistics of a nearly-Gaussian distribution, let's take a step back and think about what made this possible. We can perform these perturbative calculations any time there exists in the problem a dimensionless parameter $\epsilon$ that is small, $\epsilon \ll 1$, but nonzero, $\epsilon > 0$. This makes perturbation theory an extremely powerful tool for theoretical analysis any time a problem has any extreme scales, small or large.
Importantly, this is directly relevant to theoretically understanding neural networks in practice. As we will explain in the following chapters, real networks have a parameter $n$, the number of neurons in a layer, that is typically large, $n \gg 1$, but certainly not infinite, $n < \infty$. This means that we can expand the distributions that describe such networks in the inverse of the large parameter as $\epsilon = 1/n$. Indeed, when the parameter $n$ is large, as is typical in practice, the distributions that describe neural networks become nearly-Gaussian and thus theoretically tractable. This type of expansion is known as the $1/n$ expansion or large-$n$ expansion and will be one of our main tools for learning the principles of deep learning theory.
Aside: statistical independence and interactions
The quartic action (1.71) is one of the simplest models of an interacting theory. We showed this explicitly by connecting the quartic coupling to the non-Gaussian statistics of the non-vanishing connected four-point correlator. Here, let us try to offer an intuitive meaning of interaction by appealing to the notion of statistical independence.
Recall from probability theory that two random variables $x$ and $y$ are statistically independent if their joint distribution factorizes as
p(x,y)=p(x)p(y).
For the Gaussian distribution, if the variance matrix Kμν is diagonal, there is no correlation at all between different components of zμ; they are manifestly statistically independent from each other.
Even if $K_{\mu\nu}$ is not diagonal, we can still unwind the correlation of a Gaussian distribution by rotating to the right basis. As discussed in §1.1, there always exists an orthogonal matrix $O$ that diagonalizes the covariance as $(OKO^T)_{\mu\nu} = \lambda_\mu \delta_{\mu\nu}$. In terms of the variables $u_\mu \equiv (Oz)_\mu$, the distribution looks like
$$p(z) = \frac{1}{\sqrt{|2\pi K|}}\, \exp\!\left(-\sum_{\mu=1}^{N}\frac{u_\mu^2}{2\lambda_\mu}\right) = \prod_{\mu=1}^{N}\left[\frac{1}{\sqrt{2\pi\lambda_\mu}}\, e^{-\frac{u_\mu^2}{2\lambda_\mu}}\right].$$
Thus, we see that in the u-coordinate basis the original multivariable Gaussian distribution factorizes into N single-variable Gaussians that are statistically independent.
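Here is a minimal numerical sketch of this unwinding (not from the original text): rotating correlated Gaussian samples into the eigenbasis of $K$ produces components whose empirical covariance is diagonal. The covariance matrix and numpy routines are illustrative assumptions.

```python
# Rotate correlated Gaussian samples z into the eigenbasis of K, u = O z,
# and check that the components of u are uncorrelated (diagonal covariance).
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[2.0, 0.9],
              [0.9, 1.0]])

eigvals, eigvecs = np.linalg.eigh(K)   # K = eigvecs @ diag(eigvals) @ eigvecs.T
O = eigvecs.T                          # orthogonal matrix with (O K O^T) diagonal

z = rng.multivariate_normal(mean=np.zeros(2), cov=K, size=1_000_000)
u = z @ O.T                            # u_mu = sum_nu O_{mu nu} z_nu, sample by sample

print(np.cov(u, rowvar=False))         # ~ diag(eigvals); off-diagonal entries ~ 0
print(eigvals)
```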
We also see that in terms of the action, statistical independence is characterized by the action breaking into a sum over separate terms. This unwinding of interaction between variables is generically impossible when there are nonzero non-Gaussian couplings. For instance, there are $\sim N^2$ components of an orthogonal matrix $O_{\mu\nu}$ available to change basis, while there are $\sim N^4$ components of the quartic coupling $\epsilon\, V^{\mu\nu\rho\lambda}$ that correlate the random variables, so it is generically impossible to re-express the quartic action as a sum of functions of $N$ different variables. Since the action cannot be put into a sum over $N$ separate terms, the joint distribution cannot factorize, and the components will not be independent from each other. Thus, it is impossible to factor the nearly-Gaussian distribution into a product of $N$ statistically independent distributions. In this sense, what is meant by interaction is the breakdown of statistical independence.

An astute reader might wonder if there is any interaction at all when we consider a single-variable distribution with $N = 1$, since there are no other variables to interact with. For nearly-Gaussian distributions, even if $N = 1$, we saw in (1.74) that the variance of the distribution is shifted from its Gaussian value, $K$, and depends on the quartic coupling $\epsilon V$. In physics, we say that this shift is due to the self-interaction induced by the quartic coupling $\epsilon V$, since it modifies the value of observables from the free Gaussian theory that we are comparing to, even though there's no notion of statistical independence to appeal to here. Said another way, even though the action involves just a single variable, such a non-Gaussian distribution does not have a closed-form solution for its partition function or correlators; there's no trick that lets us compute integrals of $e^{-S(z)}$ exactly when $S(z) = \frac{z^2}{2K} + \frac{1}{4!}\,\epsilon V z^4$. This means that we still have to make use of perturbation theory to analyze the self-interaction in such distributions.
Nearly-Gaussian actions
Having given a concrete example in which we illustrated how to deform the quadratic action to realize the simplest nearly-Gaussian distribution, we now give a more general perspective on nearly-Gaussian distributions. In what follows, we will continue to require that our distributions are invariant under the parity symmetry that takes $z_\mu \to -z_\mu$. In the action representation, this corresponds to including only terms of even degree; the imposition of such a parity symmetry, and thus the absence of odd-degree terms in the action, means that all of the odd moments and hence all of the odd-point connected correlators will vanish.
With that caveat in mind, though otherwise very generally, we can express a non-Gaussian distribution by deforming the Gaussian action as
$$S(z) = \frac{1}{2}\sum_{\mu,\nu=1}^{N} K^{\mu\nu}\, z_\mu z_\nu + \sum_{m=2}^{k}\frac{1}{(2m)!}\sum_{\mu_1,\ldots,\mu_{2m}=1}^{N} s^{\mu_1\cdots\mu_{2m}}\, z_{\mu_1}\cdots z_{\mu_{2m}}, \qquad (1.81)$$
where the factor of $1/(2m)!$ is conventional in order to compensate for the overcounting in the sum due to the implied symmetry of the indices $\mu_1, \ldots, \mu_{2m}$ in the coefficients $s^{\mu_1\cdots\mu_{2m}}$, given the permutation symmetry of the product of variables $z_{\mu_1}\cdots z_{\mu_{2m}}$. The number of terms in the non-Gaussian part of the action is controlled by the integer $k$. If $k$ were unbounded, then $S(z)$ would be an arbitrary even function, and $p(z)$ could be any parity-symmetric distribution. The action is most useful when the polynomial $S(z)$, truncated to a reasonably small degree $k$ (like $k = 2$ for the quartic action), yields a good representation for the statistical process of interest.
The coefficients $s^{\mu_1\cdots\mu_{2m}}$ are generally known as non-Gaussian couplings, and they control the interactions of the $z_\mu$. In a similar vein, the coefficient $K^{\mu\nu}$ in the action is sometimes called a quadratic coupling, since the coupling of the component $z_\mu$ with the component $z_\nu$ in the quadratic action leads to a nontrivial correlation, $\mathrm{Cov}[z_\mu, z_\nu] = K_{\mu\nu}$.
In particular, there is a direct correspondence between the product of the specific components $z_\mu$ that appear together in the action and the presence of connected correlation between those variables, with the degree of the term in (1.81) directly contributing to connected correlators of that degree. We saw an example of this in (1.78), which connected the quartic term to the connected four-point correlator. In this way, the couplings give a very direct way of controlling the degree and pattern of non-Gaussian correlation, and the overall degree of the action offers a way of systematically including more and more complicated patterns of such correlations.
If you recall from §1.2, we defined nearly-Gaussian distributions as ones for which all these connected correlators are small. Equivalently, from the action perspective, a nearly-Gaussian distribution is a non-Gaussian distribution with an action of the form (1.81) for which all the couplings $s^{\mu_1\cdots\mu_{2m}}$ are parametrically small for all $2 \le m \le k$:
$$\big|s^{\mu_1\cdots\mu_{2m}}\big| \ll \big|K^{\mu\nu}\big|^{m}, \qquad (1.82)$$
where this equation is somewhat schematic given the mismatch of the indices.
This schematic equation is, nonetheless, dimensionally consistent. To support that remark, let us give a brief introduction to dimensional analysis: let the random variable $z_\mu$ have dimension $\zeta$, which we denote as $[z_\mu] = \zeta$. By dimension, you should have in mind something like a unit of length, so e.g. we read the expression $[z_\mu] = \zeta$ as "a component of $z$ is measured in units of $\zeta$." The particular units are arbitrary: e.g. for length, we can choose between meters or inches or parsecs, as long as we use a unit of length but not, say, meters$^2$, which instead would be a unit of area. Importantly, we cannot add or equate quantities that have different units: it doesn't make any logical sense to add a length to an area. This is similar to the concept of type safety in computer science, e.g. we should not add a type str variable to a type int variable.
Now, since the action $S(z)$ is the argument of an exponential, $p(z) \propto e^{-S(z)}$, it must be dimensionless; otherwise, the exponential $e^{-S} = 1 - S + \frac{S^2}{2} - \ldots$ would violate the addition rule that we just described. From this dimensionless requirement for the action, we surmise that the inverse of the covariance matrix has dimension $[K^{\mu\nu}] = \zeta^{-2}$, and that the covariance itself has dimension $[K_{\mu\nu}] = \zeta^{2}$. Similarly, all the non-Gaussian couplings in (1.81) have dimensions $[s^{\mu_1\cdots\mu_{2m}}] = \zeta^{-2m}$. Thus, both sides of (1.82) have the same dimension, making this equation dimensionally consistent.
Even more concretely, consider the quartic action (1.71). If we let the tensorial part of the quartic coupling have dimensions $[V^{\mu\nu\rho\lambda}] = \zeta^{-4}$, then the parameter $\epsilon$ is dimensionless, as claimed. This means that we can consistently compare $\epsilon$ to unity, and its parametric smallness $\epsilon \ll 1$ means that the full quartic coupling $\epsilon\, V^{\mu\nu\rho\lambda}$ is much smaller than the square of the quadratic coupling, and that the connected four-point correlator (1.78) is much smaller than the square of the connected two-point correlator (1.74).
Importantly, the comparison is with an appropriate power of the inverse variance or quadratic coupling $K^{\mu\nu}$ since, as we already explained, the variance sets the scale of the Gaussian distribution to which we are comparing these nearly-Gaussian distributions.
As we will see in §4, wide neural networks are described by nearly-Gaussian distributions. In particular, we will find that such networks are described by a special type of nearly-Gaussian distribution where the connected correlators are hierarchically small, scaling as
$$\mathbb{E}[z_{\mu_1}\cdots z_{\mu_{2m}}]\big|_{\text{connected}} = O\big(\epsilon^{m-1}\big), \qquad (1.83)$$
with the same parameter ϵ controlling the different scalings for each of the 2m-point connected correlators. Importantly, the non-Gaussianities coming from higher-point connected correlators become parametrically less important as ϵ becomes smaller.
This means that for a nearly-Gaussian distribution with hierarchical scalings (1.83), we can consistently approximate the distribution by truncating the action at some fixed order in $\epsilon$. To be concrete, we can use an action of the form (1.81) to faithfully represent all the correlations up to order $O(\epsilon^{k-1})$, neglecting connected correlations of order $O(\epsilon^{k})$ and higher. The resulting action offers a useful and effective description for the statistical process of interest, as long as $\epsilon$ is small enough and $k$ is high enough that $O(\epsilon^{k})$ is negligible.
In practice, a quartic action (1.71) truncated to $k = 2$ will let us model realistic finite-width neural networks. This quartic action captures the important qualitative difference between nearly-Gaussian distributions and the Gaussian distribution, incorporating nontrivial interactions between the different components of the random variable. In addition, the difference between the statistics (1.83) of a nearly-Gaussian distribution truncated to $O(\epsilon)$ versus one truncated to $O(\epsilon^2)$ is mostly quantitative: in both cases there are nontrivial non-Gaussian correlations, but the pattern of higher-order correlation differs only in a small way, with the difference suppressed as $O(\epsilon^2)$. In this way, the distribution represented by the quartic action is complex enough to capture the most salient non-Gaussian effects in neural networks while still being simple enough to be analytically tractable.