Synaptic growth and pruning could be more flexible

If you start out with most synapses ineffective and then slowly grow and prune them, it is possible that you could get around some of the above constraints, by starting with a relatively low-dimensional initial representational bias that then adds dimensions over time. However, from what we generally know about the developmental process, there is an initial overproduction of synapses, followed by pruning, not a more selective "start small" kind of growth process, so this is probably not what is happening. It is also not clear that the "start small" process would work, given the low probability of finding useful connections in a very large and noisy brain.
The differences in network structure and dynamics between `deep_move` vs. `deep_music` (re)surfaced some key points about the importance of topographic connectivity and temporal integration dynamics for solving certain kinds of problems. Here is the consolidated wisdom:

TL;DR: use topographic, narrower connectivity whenever multiple independent features need to be processed in parallel (e.g., lower sensory systems), and shorter time-integration constants when only shorter time scales need to be integrated. These are the "no duh" biases obviously appropriate for these cases, but they are critical for actually getting the axon-based biological neurons to work (probably more so than for ReLU units).
A neuron will be sensitive to, and attempt to learn about, everything that shows up in its receptive field (input synapses). Thus, constraining this connectivity can have major implications for learning and processing.
In biological neurons with positive-only weights and inhibitory competition, neurons only become active when they get enough excitation across their inputs to overcome the inhibitory threshold. This results in a conjunctive bias: each neuron tends to encode an AND-like conjunction across all of its inputs. This is different from ReLU neurons without any kind of normalization or inhibition, which can represent linear weightings (rotations) of inputs, subject only to the much weaker constraint of being above the 0 threshold point. In other words, biological neurons are significantly more nonlinear than ReLU units, and exhibit a strong sparse detector bias, vs. a linear transform bias in ReLU.
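To make the contrast concrete, here is a toy numerical sketch (plain NumPy, not the framework's actual unit equations): a positive-weight unit with inhibition that scales with its total excitatory drive behaves like an AND-detector, whereas a ReLU unit passes through any linear combination above zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def conjunctive_unit(x, w, inhib=0.75):
    """Positive-only weights plus a subtractive inhibitory threshold that scales
    with total excitatory drive: the unit only fires when most of its inputs are
    active together (AND-like detector behavior)."""
    net = np.dot(w, x)                        # w is positive-only
    return max(0.0, net - inhib * w.sum())    # strong inhibition relative to maximum drive

def relu_unit(x, w, b=0.0):
    """Signed weights, no competitive inhibition: any linear weighting of the
    inputs above zero is passed through (graded, rotation-like coding)."""
    return max(0.0, np.dot(w, x) + b)

w = rng.uniform(0.2, 1.0, size=4)                     # positive-only weights
print(conjunctive_unit(np.ones(4), w))                # all inputs active -> fires
print(conjunctive_unit(np.array([1., 0, 0, 0]), w))   # one input active -> silent
print(relu_unit(np.array([1., 0, 0, 0]), w))          # ReLU still responds to a single input
```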
The sparse detector bias is critical for constraining the behavior of the network under bidirectional connectivity. Without this constraint (e.g., in a BPTT ReLU network or similar with more linear behavior), there is a combinatorial explosion of network dynamics over time that exponentially increases the search space over which learning has to operate, resulting in generally poor generalization and poor sample efficiency. These problems escalate with increasing scale. Likewise, the unconstrained Boltzmann machine scales very poorly.
Thus, there is a fundamental tradeoff between having an effective sparse detector model with bidirectional connectivity, which can integrate information flexibly across many modalities, brain areas, etc., and something that can exhibit more unconstrained linear mappings in a feedforward-only mode. The cortex clearly comes down on the bidirectional sparse detector side of this tradeoff.
If the neural receptive field includes multiple simultaneously active yet independent features, which need to be processed separately (combinatorially, not conjunctively), the conjunctive bias will strongly interfere with the ability to process them effectively.
There are two ways to resolve this issue: 1. Only process one thing at a time. 2. Restrict neural connectivity to enable each neuron to process the independent features independently.
At a broad level, the cortex is clearly strongly biased toward processing one thing at a time. Despite our billions of neurons, human capacity constraints across many domains converge on the magic number 4 as the upper limit of parallel processing capacity, with a strong preference toward only 1. This constraint applies most strongly to the higher-level association cortex areas, which support our conscious awareness; it is reflected in the unitary nature of consciousness. The pervasive bidirectional connectivity and sparse detector behavior of cortex conspire to enforce this constraint.
However, in lower-level perceptual areas, it is essential to be able to process a large number of simultaneously active, independent features! Thus, connectivity for such areas must be topographic and restricted (i.e., "convolutional" in the DCNN nomenclature) to get around the conjunctive bias. This is reflected in the classical, successful hierarchical structure of object recognition processing in the ventral stream, where lower-level, topographically organized processing incrementally converges toward a high-level unitary representation in IT: massively parallel processing funnels into a singular high-level representation.
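Here is a minimal sketch of the connectivity contrast (plain NumPy, not the actual projection code from the examples): with full connectivity, every hidden unit mixes all of the independent features together, whereas topographic, pool-restricted connectivity lets each feature be processed separately and in parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, feat_dim, units_per_pool = 4, 8, 6
x = rng.random((n_feat, feat_dim))   # 4 independent, simultaneously active features

# Full connectivity: every hidden unit sees all features mixed together, so its
# response conflates changes in any one feature with the state of all the others.
w_full = rng.random((n_feat * feat_dim, n_feat * units_per_pool))
h_full = np.maximum(0, x.reshape(-1) @ w_full)

# Topographic ("convolutional"-style) connectivity: each pool of hidden units only
# sees one feature's inputs, so the features are processed independently, in parallel.
w_topo = rng.random((n_feat, feat_dim, units_per_pool))
h_topo = np.maximum(0, np.einsum('fd,fdh->fh', x, w_topo))

print(h_full.shape, h_topo.shape)   # (24,) vs (4, 6): same number of units, per-feature pools
```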
The same constraints operate over time: lower levels have more "Markovian" short-term temporal integration, while higher levels integrate over longer timescales. This is well documented empirically (e.g., papers by Chaudhuri, Kennedy, and Baldassano). This means that the CT temporal integration and predictive coding layers have faster dynamics in lower areas and slower dynamics in higher areas.
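A minimal sketch of what the time-constant difference means in practice, using a generic leaky integrator (not the actual CT-layer dynamics):

```python
import numpy as np

def integrate(inputs, tau):
    """Simple leaky integrator: activity moves toward each input with rate 1/tau, so
    a larger tau means slower dynamics and a longer memory of past inputs."""
    act, trace = 0.0, []
    for x in inputs:
        act += (x - act) / tau
        trace.append(act)
    return np.round(trace, 3)

event = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # one transient input

print(integrate(event, tau=1.3))  # "Markovian" lower level: tracks the present, decays fast
print(integrate(event, tau=8.0))  # higher level: small but persistent trace over many steps
```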
The practical implications of these principles are well illustrated in the differences between the `deep_music` and `deep_move` test cases in `examples`. The music model works on a singular sequential flow of notes over time, with temporal structure across multiple time scales; it benefits from long temporal integration and works fine with a fully connected hidden layer. The move model attempts to predict the visual effects of physical motion through space, which is entirely dependent on the last motor action, and the longer temporal integration that worked so well in music causes it to fail entirely! Furthermore, the depth map is composed of multiple depth points that must be updated systematically and combinatorially: treating the entire depth array as a conjunction does not work at all. Instead, a topographic encoding of different depth levels is required.