Neural network training is usually formulated as a problem in function minimization. More precisely, if W are the weights defining a network's architecture and e(W) is the weight-dependent error function, its gradient ∇e(W) is usually employed to arrive at the optimal weight set W*. There are several ways of exploiting this information; the simplest is plain gradient descent, which assumes a Euclidean structure in the underlying space of the weights W. Although very natural, this may sometimes result in quite slow network learning, both in batch and, especially, in on-line error minimization, where the global error function e(W) is replaced by an individual, pattern-dependent error function e(Z, W) for each pattern Z. Several remedies, such as adaptive learning rates or the addition of momentum terms, have been proposed [6]. A different approach is suggested by the fact that in some instances there may be metrics other than the Euclidean one better suited to describe weight space. This has been shown to be the case for a related problem, likelihood estimation for parametric probability models [1], [4], for which a Riemannian structure can be defined in weight space. The same reasoning can be applied to a concrete network model, the Multilayer Perceptron (MLP). When used in regression problems, that is, when the MLP tries to establish a relationship between an input X and an output y for each pattern Z = (X, y), a probability model p(Z; W) = p(X, y; W) can be defined in pattern space so that the on-line MLP error function e(Z, W) = e(X, y; W) = (y − F(X, W))²/2 is seen as minus the log-likelihood of p(Z; W) (up to additive constants); here F(X, W) denotes the network's transfer function. This allows one to recast network learning as likelihood estimation for a certain semi-parametric probability density p(X, y; W). In this setting, there is [2] a natural Riemannian metric on the space {p(X, y; W) : W} of these densities, determined by a metric tensor given by the matrix
$$
G(W) = E\left[ (\nabla_W \log p)(\nabla_W \log p)^t \right]
      = \int\!\!\int \frac{\partial \log p}{\partial W}
        \left( \frac{\partial \log p}{\partial W} \right)^{t} p(X, y; W)\, dX\, dy .
$$
G(W) is also known as the Fisher information matrix, as its inverse gives the Cramér-Rao bound on the variance of the optimal parameter estimator. This suggests using the "natural" gradient of the Riemannian setting, that is, G(W)⁻¹ ∇_W e(X, y; W), instead of the ordinary Euclidean gradient ∇_W e(X, y; W).
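The natural-gradient update just described can be made concrete with a small numerical sketch. The following Python/NumPy fragment is illustrative only and not taken from the paper: it assumes unit-variance Gaussian output noise (so that e(X, y; W) = (y − F(X, W))²/2 is minus the log-likelihood up to constants) and a standard-normal input distribution, estimates G(W) by Monte Carlo averaging of outer products of per-pattern gradients, and compares an ordinary gradient step with the natural-gradient step G(W)⁻¹ ∇_W e(X, y; W). All names (F, grad_e, fisher, network sizes) are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP F(X, W) with one hidden layer of H tanh units; W packs all weights.
D, H = 3, 4                                # input dimension, hidden units
sizes = [(H, D), (H,), (1, H), (1,)]       # shapes of W1, b1, W2, b2

def unpack(W):
    parts, i = [], 0
    for s in sizes:
        n = int(np.prod(s))
        parts.append(W[i:i + n].reshape(s))
        i += n
    return parts

def F(X, W):
    W1, b1, W2, b2 = unpack(W)
    h = np.tanh(W1 @ X + b1)
    return (W2 @ h + b2).item()

def grad_e(X, y, W):
    # Gradient of the on-line error e(X, y; W) = (y - F(X, W))^2 / 2,
    # obtained by backpropagation through the two layers.
    W1, b1, W2, b2 = unpack(W)
    a = W1 @ X + b1
    h = np.tanh(a)
    out = (W2 @ h + b2).item()
    r = out - y                            # residual = -(y - F)
    dW2 = r * h
    db2 = np.array([r])
    da = (r * W2.ravel()) * (1.0 - h ** 2)
    dW1 = np.outer(da, X)
    db1 = da
    return np.concatenate([dW1.ravel(), db1, dW2, db2])

def fisher(W, n_samples=500):
    # Monte Carlo estimate of G(W) = E[(grad_W log p)(grad_W log p)^t],
    # sampling X from an assumed input distribution and y from p(y|X; W).
    n = sum(int(np.prod(s)) for s in sizes)
    G = np.zeros((n, n))
    for _ in range(n_samples):
        X = rng.standard_normal(D)
        y = F(X, W) + rng.standard_normal()   # unit-variance Gaussian noise
        g = grad_e(X, y, W)                   # = -grad_W log p(X, y; W)
        G += np.outer(g, g)
    return G / n_samples

# Natural-gradient step vs. ordinary gradient step for one pattern Z = (X, y).
n_params = sum(int(np.prod(s)) for s in sizes)
W = 0.1 * rng.standard_normal(n_params)
X, y = rng.standard_normal(D), 1.0
g = grad_e(X, y, W)
G = fisher(W) + 1e-3 * np.eye(n_params)       # small ridge keeps the estimate invertible
eta = 0.05
W_plain = W - eta * g                          # Euclidean gradient descent
W_natural = W - eta * np.linalg.solve(G, g)    # natural gradient: G(W)^{-1} grad e

The ridge term added to the estimated G(W) is only a numerical safeguard for this toy example; the discussion above does not depend on it.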