# Predictability, Complexity, and Learning

arXiv:physics/0007070v3 [physics.data-an] 23 Jan 2001

**Predictability, complexity, and learning**

William Bialek,¹ Ilya Nemenman,¹,² and Naftali Tishby¹,³

¹ NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540
² Department of Physics, Princeton University, Princeton, New Jersey 08544
³ School of Computer Science and Engineering, and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

February 2, 2008

**Abstract.** We define predictive information $I_{\rm pred}(T)$ as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times $T$: $I_{\rm pred}(T)$ can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then $I_{\rm pred}(T)$ grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power–law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and in the analysis of physical systems through statistical mechanics and dynamical systems theory. Further, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of $I_{\rm pred}(T)$ provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in different problems in physics, statistics, and biology.

**Contents**

1. Introduction
2. A curious observation
3. Fundamentals
4. Learning and predictability
   - 4.1 A test case
   - 4.2 Learning a parameterized distribution
   - 4.3 Learning a parameterized process
   - 4.4 Taming the fluctuations
   - 4.5 Beyond finite parameterization: general considerations
   - 4.6 Beyond finite parameterization: example
5. $I_{\rm pred}$ as a measure of complexity
   - 5.1 Complexity of statistical models
   - 5.2 Complexity of dynamical systems
   - 5.3 A unique measure of complexity?
6. Discussion
7. References

## 1 Introduction

There is an obvious interest in having practical algorithms for predicting the future, and there is a correspondingly large literature on the problem of time series extrapolation.¹ But prediction is both more and less than extrapolation: we might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of predictability. Specifically, we can define a quantity, the predictive information, that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come.
More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions). But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect that this model will continue to be valid: in the language of learning theory [see, for example, Vapnik (1998)] an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing, and not "overfitting" the training data, is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Further, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities. Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory, but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent vs. laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses vs. ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own; for an overview see Bennett (1990). In this context several authors have considered complexity measures based on entropy or mutual information, although as far as we know no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A truly random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li and Vitányi 1993; Vitányi and Li 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory our intuitive notions of complexity correspond to statements about the complexity of the underlying process, and not directly to the description length or Kolmogorov complexity.

¹ The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved all the extrapolation problems that could be solved by linear methods. Our understanding of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation algorithms (Abarbanel et al. 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).
Our central result is that the predictive information provides a general measure of complexity which includes as special cases the relevant concepts from learning theory and from dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) itself, without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model, or more precisely the effective dimension of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power–law rather than a logarithmic divergence of the predictive information.

By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power–law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The power–law or nonparametric class of processes may be crucial in real world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.
Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.² Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

## 2 A curious observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the paper draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one dimensional chain of Ising spins with the Hamiltonian given by

$$
H = -\sum_{i,j} J_{ij}\,\sigma_i \sigma_j,
\tag{1}
$$

where the matrix of interactions $J_{ij}$ is not restricted to nearest neighbors; long range interactions are also allowed. One may identify spins pointing upwards with 1 and downwards with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of $N$ digits each, $W_k$, $k = 0, 1, \ldots, 2^N - 1$. There are $2^N$ such words total, and they appear with very different frequencies $n(W_k)$ in the spin chain [see Fig. (1) for details].
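As a concrete illustration (our own, not code from the paper), the simplest instance of Eq. (1) is a uniform nearest-neighbor coupling $J_{i,i+1} = J$ at $k_B T = 1$. In that case the Boltzmann weights factor over bonds, so an open chain can be sampled spin by spin as a two-state Markov chain in which each spin agrees with its neighbor with probability $e^{J}/(e^{J} + e^{-J})$. A minimal sketch, with function name and chain length chosen for illustration:

```python
import random
from math import exp

def sample_ising_chain(n_spins, J=1.0, seed=0):
    """Sample an open 1D nearest-neighbor Ising chain at k_B T = 1.

    The Boltzmann weights factor over bonds, so each spin agrees with
    its left neighbor with probability p_same = e^J / (e^J + e^-J).
    Spins are encoded as in the text: up -> "1", down -> "0".
    """
    rng = random.Random(seed)
    p_same = exp(J) / (exp(J) + exp(-J))
    spins = [rng.choice("01")]
    for _ in range(n_spins - 1):
        if rng.random() < p_same:
            spins.append(spins[-1])          # keep the neighbor's value
        else:
            spins.append("10"[int(spins[-1])])  # flip it
    return "".join(spins)

chain = sample_ising_chain(100_000, J=1.0)
agree = sum(a == b for a, b in zip(chain, chain[1:])) / (len(chain) - 1)
# for J = 1 the fraction of agreeing neighbor pairs should be close to
# e / (e + 1/e), roughly 0.88
```

The resulting binary string can then be analyzed through the word frequencies $n(W_k)$ described in the text.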
If the number of spins is large, then counting these frequencies provides a good empirical estimate of $P_N(W_k)$, the probability distribution of different words of length $N$. Then one can calculate the entropy $S(N)$ of this probability distribution by the usual formula

$$
S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \quad \text{(bits)}.
\tag{2}
$$

Figure 1: Calculating the entropy of words of length 4 in a chain of 17 spins. For this chain, $n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2$, $n(W_8) = n(W_9) = 1$, and all other frequencies are zero. Thus, $S(4) \approx 2.95$ bits.

Note that this is not the entropy of a finite chain with length $N$; instead it is the entropy of words or strings with length $N$ drawn from a much longer chain. Nonetheless, since entropy is an extensive property, $S(N)$ is asymptotically proportional to $N$ for any spin chain, that is, $S(N) \approx S_0 \cdot N$. The usual goal in statistical mechanics is to understand this "thermodynamic limit" $N \to \infty$, and hence to calculate the entropy density $S_0$. Different sets of interactions $J_{ij}$ result in different values of $S_0$, but the qualitative result $S(N) \propto N$ is true for all reasonable $\{J_{ij}\}$.

We investigated three different spin chains of one billion spins each. As usual in statistical mechanics, the probability of any configuration of spins $\{\sigma_i\}$ is given by the Boltzmann distribution,

$$
P[\{\sigma_i\}] \propto \exp(-H/k_B T),
\tag{3}
$$

where to normalize the scale of the $J_{ij}$ we set $k_B T = 1$. For the first chain, only $J_{i,i+1} = 1$ was nonzero, and its value was the same for all $i$.

² Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek 1995; Bialek and Tishby 1999). The central results of the present work, however, were at best conjectures in these preliminary accounts.
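Equation (2) is straightforward to evaluate from empirical word counts. The sketch below (our own illustration) reproduces the $S(4) \approx 2.95$ bits quoted in the caption of Fig. (1); since the exact 17-spin chain shown in the figure is not recoverable from the text, we use a chain reconstructed to match the word frequencies listed in the caption:

```python
from collections import Counter
from math import log2

def word_entropy(chain, N):
    """S(N): entropy in bits of overlapping length-N words in a spin
    chain, estimated from empirical frequencies as in Eq. (2)."""
    counts = Counter(chain[i:i + N] for i in range(len(chain) - N + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# A 17-spin chain consistent with the Fig. (1) caption: six words occur
# twice (0000, 0001, 0011, 0111, 1100, 1110), two occur once (1000, 1001).
chain = "00001110000111001"
print(round(word_entropy(chain, 4), 2))  # → 2.95
```

Any chain with the same word frequencies gives the same $S(4)$, since the entropy depends only on the empirical distribution $P_N(W_k)$.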
The second chain was also generated using nearest neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a Gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now the interactions were long–ranged; the variances of the coupling constants decreased with the distance between the spins as $\langle J_{ij}^2 \rangle = 1/(i - j)^2$. We plot $S(N)$ for all these cases in Fig. (2), and, of course, the asymptotically linear behavior is evident; the extensive entropy shows no qualitative distinction among the three cases we consider.

However, the situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write $S(N) = S_0 \cdot N + S_1(N)$, and plot only the sublinear component $S_1(N)$ of the entropy. As we see in Fig. (3), the three chains then exhibit qualitatively different features: for the first one, $S_1$ is constant; for the second one, it is logarithmic; and, for the third one, it clearly shows a power–law behavior.

What is the significance of these observations? Of course, the differences in the behavior of $S_1(N)$ must be related to the ways we chose the $J$'s for the simulations. In the first case, $J$ is fixed, and if we see $N$ spins and try to predict