In this post, we give an intuitive introduction to the article (Li, 2022), which gives a first-principles derivation of deep neural networks.

The basic idea is that a Deep Neural Network (DNN) is a phenomenological model of biotic intelligence, and DNNs could be studied similarly to physics through symmetries, though the symmetries here are not the conservation symmetries of physics, but a symmetry that might form a different paradigm, referred to as adaptive symmetry. The learning process could be formalized as a symmetry-breaking process where these symmetries are broken to build a model of the world that could predict the future, such that the uncertainty of survival and reproduction is decreased; that is, to adapt to the world.

If we consider the conservation symmetry in physics as the mathematical principle of the physical (i.e., natural) world, then from the same philosophy of science we could argue that the adaptive symmetry is the mathematical principle of the organic world.

More technically, there exist concentration-of-measure phenomena in large DNNs that keep the learning of the network within a phase where the learning error could always be further reduced. That is, biotic or cognitive complexity arises from adaptive-symmetry breaking.

The “large” DNN here refers to a relative relationship between the dataset and the model capacity. This phenomenon thus lets us speculate on scaling up the model and the dataset, and leads logically to the current marvelous large language models. It further suggests that the next conceptual breakthrough might be in the field of embodied intelligence, which includes agents on the Internet.

In the following, we shall break down the ideas step by step.

The intelligibility of intelligence

Albert Einstein once said that “the most incomprehensible thing about the world is that it is comprehensible”. Human beings have come a long way to understand our world: to paraphrase Carl Sagan, we have come to know the machinery that generates the sunlight that makes life possible, the gravity that glues us to Earth and would otherwise send us spinning off into space, and the atoms of which we are made and on whose stability we fundamentally depend.

For anyone who has thought about it deeply, this is a miraculous achievement of mankind. Nowadays, people are not amazed by the intelligibility of the physical world. But before the Enlightenment, the physical world was considered the creation of gods, and not comprehensible to mortals.

Each era has its own epistemic struggles, and currently, humanity is facing another philosophical problem of a similar nature; that is, the intelligibility of intelligence (Salamon & Rayhawk, 2010). This post aims to explain a plausible epistemic foundation on which the intelligibility of intelligence is possible. To begin with, we need to revisit the intelligibility of physics.

The epistemic foundation of physics

The intelligibility of the natural world comes from a concept known as symmetry. P. Anderson, the Nobel laureate, wrote in the seminal essay More is Different that “it is only slightly overstating the case to say that physics is the study of symmetry”. Symmetries are repeated structures so regular that they enable synchronized microscopic phenomena to be observable across scales, at the macroscopic scale. This similarity across scales is referred to as self-similarity. To give an example, a snowflake repeats its six-fold symmetry indefinitely, and its microscopic structure (at the nanometer scale) manifests as the macroscopic crystal (at the centimeter scale) that we see.


Six-fold symmetry and snowflakes.

Symmetry is the structure that bridges scales; it induces stable macroscopic structure from the microscopic, and lets the microscopic world be measurable and thus knowable to us. The whole edifice of physics is built upon identifying existing symmetries and formalizing them into laws.

Conservation symmetry breaking: the mathematical principle of change in the natural world

These symmetries are not a static concept, but are equilibria of the dynamic motion of objects (e.g., molecular motions) that conserve a quantity called free energy. And when those symmetries are broken, we observe phenomena that we call change.

To give an example, when temperature rises, the thermal perturbations of the molecular oscillations could no longer be contained, and would induce a synchronized, cascading change across all molecules. As a result, the six-fold symmetry of water molecules breaks into a symmetry referred to as rotational symmetry, and this symmetry-breaking process is known as a phase change in physics. Macroscopically, the water molecules collectively are observed as a water sphere (under zero gravity).


Rotation symmetry and water.

And our universe is arguably an almost infinitely long symmetry-breaking process of the high-energy particles created at the Big Bang, the creation of the universe. This is why P. Anderson said that “it is only slightly overstating the case to say that physics is the study of symmetry”. Physically, our world might just be symmetries and their breaking.

Neural network training solves a neural differential equation

If we make the changes through symmetry breaking almost indefinitely complex and temporally extended, we reach something that might be quite close to the phenomena we call life. And further complexification of this diachronic symmetry-breaking process might create something called neural plasticity; more details could be found in the article (Li, 2022) introduced in the science of intelligence section on the research page. In the following sections, we introduce the symmetry breaking in DNNs intuitively.

A DNN is a phenomenological model of cognition, and the training of a DNN is a process of change in which relevant information (related to survival or reproduction) is encoded in the network. To explain the motivation to scale DNNs, Ilya Sutskever explains in an interview that training a DNN is like solving a neural equation. More concretely, gradient descent actually solves a highly sophisticated nonlinear differential equation,

\[\nabla \mathcal{L}(W\sigma \ldots W\sigma X, y) = 0,\]

where $W$ and $\sigma$ denote the weights and the activation function, respectively. The larger the network, the more degrees of freedom exist, such that a better solution could be reached by gradient descent.

This is also a formalization of the Bitter Lesson by Rich Sutton: learning and search are the two most important classes of techniques in AI research. By local search in the weight space, a network is found that minimizes the error $\mathcal{L}$ (i.e., the gradient is zero when a local minimum is found).
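To make this concrete, here is a minimal numerical sketch; the toy two-layer ReLU network, the synthetic regression data, and the hand-derived gradients are all assumptions made for illustration, not taken from the paper. As plain full-batch gradient descent lowers the loss, the gradient norm shrinks, i.e., the local search approaches a solution of the stationarity equation above.

```python
import numpy as np

# Toy setup: y_hat = W2 @ sigma(W1 @ X) with sigma = ReLU, trained with
# full-batch gradient descent on a synthetic regression task.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_samples = 4, 16, 1, 64
X = rng.normal(size=(n_in, n_samples))      # synthetic inputs
y = rng.normal(size=(n_out, n_samples))     # synthetic targets

W1 = rng.normal(size=(n_hid, n_in)) / np.sqrt(n_in)
W2 = rng.normal(size=(n_out, n_hid)) / np.sqrt(n_hid)

def loss_and_grads(W1, W2):
    Z = W1 @ X                       # pre-activations
    H = np.maximum(Z, 0.0)           # sigma = ReLU
    R = W2 @ H - y                   # residuals
    loss_val = 0.5 * np.mean(np.sum(R ** 2, axis=0))
    gY = R / n_samples
    gW2 = gY @ H.T
    gZ = (W2.T @ gY) * (Z > 0)
    gW1 = gZ @ X.T
    return loss_val, gW1, gW2

lr = 0.05
for step in range(3001):
    loss_val, gW1, gW2 = loss_and_grads(W1, W2)
    if step % 1000 == 0:
        gnorm = np.sqrt(np.sum(gW1 ** 2) + np.sum(gW2 ** 2))
        print(f"step {step:4d}  loss {loss_val:.4f}  |grad| {gnorm:.3e}")
    W1 -= lr * gW1
    W2 -= lr * gW2
```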

Neural circuits: the atom of intelligence

Further extrapolating the mathematical principle of change in the natural world, we could reason that the intelligibility of intelligence might require us to find stable characteristics in those changes that are measurable across scales. And in (Li, 2022), we find something referred to as circuit symmetries.

Though the characterization is given by a highly sophisticated mathematical formalism, the basic idea behind circuit symmetries is simple. First, we explain what a circuit is; in DNNs, it plays the role that the atom plays in physics.

As illustrated above, at the initialization of a DNN, each weight (the lines connecting the dots) is randomly initialized, and each neuron (the dots) is also activated randomly. As a result, if we connect an input neuron (on the left) to the output neuron (on the right), illustrated as the green line, we get an ordered set, and we refer to this set as a circuit.

More formally (though still simplified), let $\boldsymbol{I}_{l}$ denote $\{1, \ldots, n_l \}$, where $n_l$ denotes the number of neurons at layer $l$, and $l \in \{1, \ldots, L\}$ is the layer index up to the number of layers $L$ (cf. the above illustration). Let $\boldsymbol{I}$ denote $\boldsymbol{I}_1 \otimes \ldots \otimes \boldsymbol{I}_L$, where $\otimes$ denotes the Cartesian product. Then $\boldsymbol{I}$ denotes the indices of all circuits, and we could denote a circuit as

\[\Psi_{\boldsymbol{i}} = X_{i_1}W_{i_1 i_2}H_{i_2}\ldots W_{i_{L-1} i_{L}}H_{i_L},\]

where $\boldsymbol{i} \in \boldsymbol{I}$, $X_{i_1}$ is the input neuron $i_1$, $W_{i_1 i_2}$ is the weight connecting neurons $i_1$ and $i_2$, $H_{i_2}$ is the hidden neuron $i_2$, and so on.

To give a simplistic understanding of a circuit's behavior, we might view each neuron as a gate that decides whether the circuits it belongs to are activated (and thus a circuit is controlled by multiple gates), and the weights as scalars that are multiplied together when the circuit activates.

By looking at the network as an ensemble of neural circuits, we could see symmetries arise. Circuits play the role of atoms in physics, and a neuron ensemble, introduced later, is just the summation of those circuits, just as the magnetic force is, in a sense, the summation of the magnetic fields of individual atoms.
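The following is a minimal sketch of this ensemble view, assuming a tiny bias-free ReLU network (a simplifying assumption made here so that the decomposition is exact; the paper's formalism is more general). Each hidden neuron contributes a 0/1 gate, and the output neuron is recovered exactly as the sum, over all circuits, of the input value times the weights along the circuit times the gates.

```python
import numpy as np
from itertools import product

# Tiny bias-free ReLU network: x -> hidden 1 -> hidden 2 -> single output neuron.
rng = np.random.default_rng(0)
n0, n1, n2 = 3, 4, 4                       # widths of input and two hidden layers
x = rng.normal(size=n0)
W1 = rng.normal(size=(n1, n0))
W2 = rng.normal(size=(n2, n1))
w3 = rng.normal(size=n2)                   # weights into the output neuron

# Forward pass; the gates record whether each hidden neuron "opens" its circuits.
z1 = W1 @ x; g1 = (z1 > 0).astype(float); h1 = g1 * z1
z2 = W2 @ h1; g2 = (z2 > 0).astype(float); h2 = g2 * z2
output = w3 @ h2

# Sum the values of all n0 * n1 * n2 circuits feeding the output neuron.
circuit_sum = sum(x[i] * W1[j, i] * g1[j] * W2[k, j] * g2[k] * w3[k]
                  for i, j, k in product(range(n0), range(n1), range(n2)))

print(output, circuit_sum)                 # agree up to floating-point error
```

The output neuron, i.e., a neural assembly, is thus literally the sum of its circuits, which is the property used repeatedly below.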

Circuit symmetry: the symmetry of organisms

Recall that symmetries are repeated stable structures at the microscopic scale that manifest as macroscopic behaviors. The green line above is one circuit; however, each possible black line is also a circuit. As a result, the behavior of the output (blue) neuron, which is the summation of each circuit's value (the multiplication of all weights of the circuit when it activates), is the collective behavior of all the circuits. We refer to this collective (which mathematically is a summation) as a neural assembly. And there is a microscopic symmetry of circuits that manifests as a macroscopic symmetry of the neural assembly; this composite symmetry at the circuit level (instead of at the atom level) might be the key to understanding biotic systems (i.e., organisms). We introduce the symmetry as follows.

The snowflake is the macroscopic ensemble behavior of the six-fold symmetry of the ice molecules' quantum statistical wave (distribution) function. Marvelously, the probability distribution of possible values of the neural assembly is self-similar to the probability distribution of the circuits' values.

More specifically, at the beginning of network training, because the weights are initialized according to symmetric probability distributions (e.g., the Gaussian distribution), the circuit values would follow a symmetric distribution in the PAC (probably approximately correct) sense (mathematically proved and experimentally verified in the paper). We refer to this as circuit symmetry: that is to say, at the beginning of the training, the circuits contribute symmetrically to the output neuron.


Circuit symmetry: the value of the circuits distributed symmetrically.
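As an empirical illustration only (not the PAC-style theorem itself), the sketch below samples circuit values with Gaussian weights and, as a simplifying assumption, independent 0/1 activation gates; the sampled values turn out to be distributed approximately symmetrically around zero.

```python
import numpy as np

# Sample the values of many circuits at initialization: each value is an input
# times a product of Gaussian weights times a product of 0/1 activation gates.
# Independence of inputs, weights, and gates is a simplifying assumption here.
rng = np.random.default_rng(0)
n_circuits, depth = 100_000, 4

x = rng.normal(size=n_circuits)                            # input neuron values
weights = rng.normal(size=(n_circuits, depth))             # weights along each circuit
gates = rng.integers(0, 2, size=(n_circuits, depth - 1))   # 0/1 gates of hidden neurons
psi = x * weights.prod(axis=1) * gates.prod(axis=1)

active = psi[psi != 0]                                     # circuits with all gates open
print("fraction of positive values:", (active > 0).mean()) # ~0.5, i.e., symmetric
print("median value:", np.median(active))                  # ~0
```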

The circuit symmetry (including the broken circuit symmetry introduced later) is given as a theorem in the article (Li, 2022), which is proved by showing convergence, in terms of characteristic functions, between the circuit distribution and a symmetric distribution (theorem 1 in the paper), and it looks something like this

\[\forall g_t \in G_t, \left| \mathbb{E}\left[ e^{i(\frac{\sqrt{n}}{c})^{e-s}\Psi^{\overline{l}}_{g_t.\boldsymbol{i}}t}\right] - \mathbb{E}\left[e^{i(\frac{\sqrt{n}}{c})^{e-s}\hat{\Psi}^{\overline{l}}_{g_t.\boldsymbol{i}}t} \right]\right| \\ \leq \left| O\left( (\frac{1}{2\sqrt{n}})^{e-s-b_{\boldsymbol{i}}}\sin(t) \right) \right|,\]

where $\Psi$ is a formalism for circuits in the article (Li, 2022). The bound here is not intended to be readable, and interested readers may refer to the paper.

And we could mathematically prove and experimentally verify that the coarse-graining (or summation) of those symmetrically distributed circuits (of which the output neuron, for example, is literally a neural assembly) is also symmetrically distributed in the PAC sense, though we only do this for the types of neuron assemblies that are relevant in the paper. We show the distribution of neural assemblies in the following plots; the plots shown here are not distributions of neuron outputs, but of gradient and Hessian entries, which are also neuron assemblies and are more relevant in the paper.


Distribution of neural assemblies

More formally, a neuron assembly is just the summation of all circuits in the assembly,

\[\sum_{\boldsymbol{i} \in \boldsymbol{I}}\Psi_{\boldsymbol{i}} =\sum_{(i_1,\ldots,i_L) \in \boldsymbol{I}} X_{i_1}W_{i_1 i_2}H_{i_2}\ldots W_{i_{L-1} i_{L}}H_{i_L},\]

and the summation/coarse-graining over the $n^{L}$ circuits (we assume all layers have $n$ neurons for simplicity) maintains the symmetric distribution in the PAC sense. Here we use the output neuron as an example; a neuron assembly could also be an assembly other than the output neuron.

Therefore, the symmetry of the circuits is self-similar to, and manifests macroscopically as, the symmetry of the neural assemblies.

Lastly, to corroborate this, we complement the above with the distribution of the weights throughout training, whose symmetry induces the circuit symmetry.


The weight distribution throughout training

Adaptive symmetries breaking: the mathematical principle of the organic world

As symmetry breaking in the natural/inorganic world induces macroscopic changes, the circuit-symmetry breaking of neural networks (an organic system) induces macroscopic changes as well; however, such changes are not physical, but informational.

Imagine that we perturb the weights along a circuit (e.g., the green line); this perturbation would directly influence the output neuron (i.e., the blue neuron on the right), assuming the circuit activates. And we could freely increase or decrease the weights; as a result, we could freely increase or decrease the contribution of this circuit to the output neuron.

Further imagine that the weights are perturbed systematically by backpropagation. Then this system of degrees of freedom (i.e., the weights) would collectively be optimized to satisfy the constraints given by the labels through the objective function. As a result, the random/symmetric contributions of those circuits are modified collectively to minimize the objective function (which characterizes the training errors).

Therefore, a DNN is a system poised in a state of adaptive symmetry to minimize the errors and fulfill the goal characterized by the objective function: at the beginning of the training, the contributions of the circuits are symmetrically distributed to increase or decrease the errors; through the feedback from the objective, these symmetries are gradually broken to reduce the errors.
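The following toy illustration, on a synthetic regression task that is assumed here and is not the paper's experimental setup, trains a tiny bias-free ReLU network and reports the error together with a simple asymmetry statistic (skewness) of the circuit values before and after training. How far the statistic moves depends on the task; the point is only that the feedback from the objective reorganizes the initially symmetric circuit contributions while the error is driven down.

```python
import numpy as np
from itertools import product

# Toy illustration: train a tiny bias-free ReLU network with backpropagation and
# compare the error and the skewness of the circuit values before/after training.
rng = np.random.default_rng(0)
n0, n1, n_samples = 4, 8, 32
X = rng.normal(size=(n0, n_samples))
y = rng.normal(size=n_samples)

W1 = rng.normal(size=(n1, n0)) / np.sqrt(n0)
w2 = rng.normal(size=n1) / np.sqrt(n1)

def circuit_values():
    # Values of all circuits feeding the output neuron, over all training inputs.
    vals = []
    for s in range(n_samples):
        x = X[:, s]
        g1 = (W1 @ x > 0).astype(float)
        vals.extend(x[i] * W1[j, i] * g1[j] * w2[j]
                    for i, j in product(range(n0), range(n1)))
    return np.array(vals)

def report(tag):
    err = 0.5 * np.mean((w2 @ np.maximum(W1 @ X, 0.0) - y) ** 2)
    v = circuit_values()
    skew = np.mean((v - v.mean()) ** 3) / (v.std() ** 3)
    print(f"{tag}: error {err:.3f}, circuit-value skewness {skew:+.3f}")

report("at initialization")
lr = 0.05
for _ in range(1000):                          # backpropagation on the toy task
    Z = W1 @ X; H = np.maximum(Z, 0.0)
    R = (w2 @ H - y) / n_samples
    w2_grad = H @ R
    W1_grad = (np.outer(w2, R) * (Z > 0)) @ X.T
    W1 -= lr * W1_grad; w2 -= lr * w2_grad
report("after training")
```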

To give an example, the following training errors and prediction accuracy come from a network trained to classify images into object categories. As training progresses, the network comes to classify all objects in the training set correctly. And the weight distribution, shown in the earlier figure, slightly and gradually skews during training.


(Left) Training errors and prediction accuracy throughout training. (Right) CIFAR10 image classification task.

Therefore, the emergence of informational structure (which is often complex, and in the previous example recognizes objects) in an organic/neural system is a process of adaptive-symmetry breaking.

Adaptive symmetries and large (language) models

The symmetries in physics inform us about the control and order parameters of a system; for example, temperature controls the phase of the water-molecule system, and when it crosses a threshold, the order of the system changes from the liquid phase to the snow/solid phase. The symmetries of the biotic/neural system here likewise inform us about the phases of the system: they inform us about its adaptability, or plasticity. And the analysis in (Li, 2022) gives a formal, theoretical, and experimental characterization (which, combined, is what the philosophy of science calls scientific) that helps us understand the scaling of neural network models.

Previously, we described that circuit symmetries exist in a DNN, and that the breaking of those symmetries reduces the training errors. Notice that for an $L$-layer DNN with width $n$ at each layer, the number of circuits, $n^{L}$, is exponentially large relative to the number of weights; for example, a network with $L = 10$ layers of width $n = 1000$ has $10^{30}$ circuits but only on the order of $10^{7}$ weights. Thus, we might speculate that there is a very large number of ways in which the circuits could be perturbed to reduce the errors, subject to the interactions or constraints among the neurons (because the change of one weight could also influence an exponential number of circuits).

In (Li, 2022), we formally show that when there is a certain diversity among the neurons, although the circuit symmetries are broken during training, their macroscopic adaptive symmetry is not. As a result, there always exist directions in which the output of the final neuron could be modified to minimize the errors. More technically, the adaptive symmetries manifest in the Hessian of the objective function as the symmetry of its eigenvalue distribution, as shown in the following Hessian figure. Recall that if the Hessian at a stationary point has a negative eigenvalue, that point is a saddle point, and a gradient descent direction exists to further minimize the function.


Eigenvalue distribution throughout training.
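As a minimal numerical sketch (finite differences on a tiny tanh network, assumed here for illustration, and not the random-matrix analysis of the paper), one can build the Hessian of the training loss and inspect the signs of its eigenvalues; at a generic point of such a landscape, eigenvalues of both signs typically appear.

```python
import numpy as np

# Finite-difference Hessian of the training loss of a tiny tanh network,
# to inspect the sign pattern of its eigenvalues (an illustration only).
rng = np.random.default_rng(0)
n_in, n_hid, n_samples = 3, 5, 20
X = rng.normal(size=(n_in, n_samples))
y = rng.normal(size=n_samples)
n_params = n_hid * n_in + n_hid               # W1 is n_hid x n_in, w2 is n_hid
theta = rng.normal(size=n_params) * 0.5

def loss(t):
    W1 = t[:n_hid * n_in].reshape(n_hid, n_in)
    w2 = t[n_hid * n_in:]
    return 0.5 * np.mean((w2 @ np.tanh(W1 @ X) - y) ** 2)

def hessian(t, eps=1e-4):
    # Central second differences for every pair of parameters.
    n = t.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tpp, tpm, tmp, tmm = t.copy(), t.copy(), t.copy(), t.copy()
            tpp[i] += eps; tpp[j] += eps
            tpm[i] += eps; tpm[j] -= eps
            tmp[i] -= eps; tmp[j] += eps
            tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (loss(tpp) - loss(tpm) - loss(tmp) + loss(tmm)) / (4 * eps ** 2)
    return (H + H.T) / 2

eigs = np.linalg.eigvalsh(hessian(theta))
print("negative eigenvalues:", int((eigs < 0).sum()), "of", eigs.size)
```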

The symmetry of the eigenvalue distribution is given as theorem 2 in the article (Li, 2022), which is proved via the convergence of probability distributions on matrices, and it looks something like this,

\[P\left( \left| \text{tr }\left( \boldsymbol{B} \boldsymbol{G}(z) - \boldsymbol{M}(z) \right) \right| \leq ||\boldsymbol{B}|| \frac{N^{\epsilon}}{N\Im z} \right) \geq 1 - CN^{-D},\]

where $G$ is the resolvent of the Hessian matrix, and $M$ is the resolvent of a matrix with a symmetric eigenvalue distribution. The bound here is not intended to be readable, and interested readers may refer to the paper.

In this case, the order parameter of the DNN system is the distribution of the gradients, which is also symmetric throughout training (cf. the gradient distribution figure). Recall that a nonzero gradient implies directions to reduce the errors; this characterizes the order of the DNN, i.e., its adaptability or plasticity. As astute readers might notice, compared with the single scalar that characterizes the order of a simple physical system like water, the order parameter of the DNN system is high dimensional. The philosophy of science that defines order parameters is discussed in (Li, 2022), and we do not delve into it here.

Similarly, the control parameter is not a single scalar as in a simple physical system like the water-molecule system, but is high dimensional as well, and is formalized with rather complicated mathematics; interested readers could read the paper. Here, we demonstrate the control parameter visually. The control parameter characterizes the interaction among neuron assemblies, and as training progresses and circuits break their symmetries, the interaction among assemblies increases, as shown in the following figure, where each pixel represents the correlation between two neural assemblies (more technically, Hessian entries).


Correlations among neuron assemblies at the beginning (left) and the end of the training (right).

More technically, the control parameters are the coupling densities among neural assemblies. To give an over-simplified demonstration, let $\mathbb{I}$ denote an index set indexing neuron assemblies, and denote by $\mathcal{N}(\alpha)$ the set of assemblies that couple with the assembly $\alpha \in \mathbb{I}$. For theorem 2 to hold, each $\alpha \in \mathbb{I}$ needs to satisfy $|\mathcal{N}(\alpha)| < \sqrt{N}$, where $N$ is the number of parameters of the network; that is, $\alpha$ is statistically dependent on at most $\sqrt{N}$ other neuron assemblies. The conditions on the control parameters are given as assumptions in the paper, and this constraint on coupling density is a simplified version of assumption 2 in the article (Li, 2022). The neuron assemblies here are Hessian entries, and interested readers may refer to the article for details.
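The sketch below gives an over-simplified, purely illustrative check of this condition: the correlation matrix among assemblies is simulated, and the sizes and the threshold are arbitrary choices, not values from the paper. It thresholds the correlations to decide which assemblies are coupled, counts each assembly's neighbors $|\mathcal{N}(\alpha)|$, and compares the maximum against $\sqrt{N}$.

```python
import numpy as np

# Over-simplified sketch of the coupling-density condition: given a (simulated)
# correlation matrix among neuron assemblies, count how many assemblies each one
# couples with and compare against sqrt(N).
rng = np.random.default_rng(0)
n_assemblies, N_params = 200, 10_000       # illustrative sizes, not from the paper

# Simulate a sparse coupling pattern: most pairs are only weakly correlated.
corr = rng.normal(scale=0.02, size=(n_assemblies, n_assemblies))
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1.0)

threshold = 0.05                           # declare a pair "coupled" above this
coupled = (np.abs(corr) > threshold) & ~np.eye(n_assemblies, dtype=bool)
neighbor_counts = coupled.sum(axis=1)      # |N(alpha)| for each assembly alpha

print("max coupling density:", neighbor_counts.max())
print("sqrt(N):", np.sqrt(N_params))
print("condition satisfied:", neighbor_counts.max() < np.sqrt(N_params))
```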

We give a rather toy outline of the proof here; interested readers may refer to the paper. The basic logic is that the circuit symmetry leads to some stable characteristics among the neuron assemblies; recall that at the very beginning we discussed that stable characteristics induced by symmetries bridge the microscopic and the macroscopic. By characterizing these stable characteristics, we place some restrictions on the interaction among the neuron assemblies; this leads to the control parameters and is formulated as assumptions. Those restrictions then enable us to prove the probability bound on the eigenvalue distribution of the Hessian, which is symmetric in the PAC sense.

Recall that neural circuits start with input neurons, which are just data. Therefore, the interaction could be roughly understood as encoding the information in the dataset with hierarchical patterns. The quantitative characterization (of the control parameter) thus involves both the dataset and the model size: more specifically, it is an interplay between the complexity of the dataset and the potential complexity of the information the network is able to encode, in other words, the network capacity. In this case, the interactions are still rather sparse, given that the network capacity is well beyond the complexity of the dataset. As a result, there are sufficient circuit symmetries to enable the network to reach zero errors.

To summarize, we have given a formalism of DNN training and validated it experimentally. The formalism and experiments show that

  1. the training of a DNN solves a neural differential equation;
  2. the training process finds a solution that satisfies the constraints dictated by the dataset (and the objective function);
  3. the training process is a circuit-symmetry breaking process;
  4. however, for a sufficiently large network, whose capacity is well beyond the complexity of the dataset, even if some of the circuits break their symmetry, there remains a sufficient reservoir of circuits, and the macroscopic behavior still manifests as an adaptive symmetry that enables the network to reach zero training errors.

This straightforwardly lets us extrapolate the situation: what if we could increase the complexity of the dataset and the capacity of the network indefinitely? We know that DNNs have a remarkable generalization capacity (though this is also an interesting problem to study scientifically), and thus a lower training error accompanies a lower test error. This leads us to speculate on scaling up the model and the dataset, and logically leads to the current marvelous large language models. And it further suggests that the next conceptual breakthrough might be in the field of embodied intelligence.

Nonetheless, this could be argued to be merely a posterior insight. The point is that the formalism here could guide us to find the next conceptual breakthrough. Though this might be a topic for another day, we shall briefly discuss it as an epilogue.

Adaptive symmetries and embodied/language intelligence

The interplay between the complexity of the dataset and the model capacity described previously is not a static concept, but rather a feedback control loop between the environment (i.e., the dataset) and the agent (i.e., the model). Thus, the extrapolation moves beyond scaling language models (which has happened in the past few years).

What has happened in the past almost 100 years of intelligence research might be something like the following. Over evolutionary history, our intelligence was encoded into our genes, and that became our foundation model. Our lifetime is an adaptive/alignment/epigenetic developmental process with the embodied environment. And in the past 100 years, we have just accumulated enough digital environmental information (e.g., ImageNet and Common Crawl) to encode a foundation model into an artificial neural network.

Yet, true intelligence is not created by training entirely on such offline data; it requires online interaction with the environment. Therefore, the result here further suggests studying embodied intelligence, which is not narrowly restricted to settings such as robotics or autonomous vehicles, but also includes agents embodied on the Internet.

Therefore, though the fundamental principle would probably hold well into the future, the formalism in the article (Li, 2022) is probably a preliminary form, and the next step is to study large models interacting with sophisticated environments. Further research is lightly discussed on the research page.

  1. Li, S. W. M. (2022). Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks (Letter Version) (Version v1). Zenodo. https://doi.org/10.5281/zenodo.5814935
  2. Salamon, A., & Rayhawk, S. (2010). How Intelligible is Intelligence? ECAP10: VIII European Conference on Computing and Philosophy.