From weather forecasting to next-token prediction, modern machine learning models have shown striking success at predicting a complex, dynamical world. Across this variety, I argue that two structurally different mechanisms do most of the work: generalized synchronization and hopping on an evolving energy landscape. In the recurrent family (RNNs, LSTMs, reservoir computers, and state-space models), the network can be viewed as a dynamical system driven by its input. Under generalized synchronization, the hidden state becomes a high-dimensional nonlinear delay embedding of the input's history, and prediction reduces to a readout on this embedding. This perspective explains, within a single framework, how such systems learn one or many chaotic attractors, switch between them under forcing or autonomously, infer unmeasured variables of the input system, and perform dynamical source separation. It also predicts an intrinsic fragility: any perturbation to the input history displaces the state from the synchronization manifold and corrupts subsequent prediction.

For attention-based models, I will present a complementary geometric perspective. Building on the connection between attention and associative memory, I interpret next-token prediction as a discrete jump on an energy landscape that is reconstructed at every step as a weighted Gaussian mixture over the current key vectors, with mixture weights set by the exponential of each token's transformed norm. From this view, layer normalization is structurally necessary to prevent a single token's basin from dominating the mixture; the choice of activation in the key/query projections (tanh in particular) controls how densely the landscape is populated with degenerate basins; and the empirical robustness of transformers to perturbations in the distant past follows directly from per-step landscape reconstruction, since a perturbed past token reshapes a basin the current jump need not visit.
I will close by contrasting synchronization and landscape hopping as inductive biases for prediction in a dynamical world.
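
The Gaussian-mixture reading of attention mentioned in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the speaker's formulation: a single attention head, unit-variance isotropic mixture components, no 1/sqrt(d) scaling, and raw key vectors standing in for the "transformed" keys. It rests on the identity q·k = -|q - k|²/2 + |q|²/2 + |k|²/2, so softmax weights over q·k_i equal the responsibilities of a Gaussian mixture centered on the keys whose component weights are proportional to exp(|k_i|²/2). All function names are hypothetical.

```python
import numpy as np


def softmax_attention_weights(q, K):
    """Softmax over dot products q . k_i (assumption: no 1/sqrt(d) scaling)."""
    logits = K @ q
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def gaussian_mixture_weights(q, K):
    """Responsibilities of a unit-variance Gaussian mixture centered on the keys.

    Assumption: component i has unnormalized weight exp(|k_i|^2 / 2).
    """
    log_weight = 0.5 * np.sum(K * K, axis=1)         # log of exp(|k_i|^2 / 2)
    log_gauss = -0.5 * np.sum((q - K) ** 2, axis=1)  # log Gaussian, up to a shared constant
    log_post = log_weight + log_gauss
    log_post = log_post - log_post.max()
    w = np.exp(log_post)
    return w / w.sum()


def energy(q, K):
    """E(q) = -log sum_i exp(|k_i|^2 / 2) * exp(-|q - k_i|^2 / 2)."""
    log_terms = 0.5 * np.sum(K * K, axis=1) - 0.5 * np.sum((q - K) ** 2, axis=1)
    m = log_terms.max()
    return -(m + np.log(np.sum(np.exp(log_terms - m))))


rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))  # six hypothetical key vectors in four dimensions
q = rng.normal(size=4)       # one hypothetical query vector

attn = softmax_attention_weights(q, K)
mix = gaussian_mixture_weights(q, K)
print("max abs difference between the two weightings:", np.max(np.abs(attn - mix)))
print("energy at q:", energy(q, K))
```

Because the landscape is a function of the current keys only, appending or altering a key changes the mixture and therefore the energy at every query point; under the same assumptions the energy simplifies to |q|²/2 minus the log-sum-exp of the dot products.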
Speaker
Zhixin Lu, Scientist at the Allen Institute