process, the training of an RBF can be linked to an "interpolation" process. Indeed, the
RBF centers cover the span of the input space, but are located in the highest data density
zones in this input space. Optimizing the locations of the RBF centers is the first step in
training the network. To that end, several approaches can be taken:
1. The RBF centers can be positioned at equal distances from each other in a
grid-like fashion. With a 10-delay input line, if we choose to divide the spread
of the data into 3 points per dimension, there will be 3^10 = 59,049 processing
elements in the hidden layer, which would render the training computationally
explosive, if not impossible.
2. Another idea for fixing the RBF centers is to use some input vectors randomly
drawn from the data set, since that would distribute the centers according to the
probability density function of the training data. Such an approach still does
not produce very good prediction performance, due to the non-stationarity of
the data. The centers of the RBF therefore need to be clustered adaptively.
3. The first adaptive approach is the k-means clustering algorithm [25]. For each
input vector, it finds the closest RBF center according to the Euclidean norm
and updates the location of that winning center proportionally to the distance
from the input point to that center. The learning rate is annealed during
training. This is performed online (the center is updated after each input), but
it can also be performed offline (run one epoch and apply the updates in
batch). The spread of each processing element is chosen to be half the mean of
the distances to its ten closest neighbors, in order to obtain good coverage of
the space between neighbors without too much overlap. A limitation of the
k-means algorithm is that it can get stuck in local optima, depending on the
initial center locations.
4. The second clustering algorithm is the orthogonal least squares (OLS)
algorithm [26]. At each presentation of the whole set of training vectors, it
defines a new RBF center by maximizing the variance of the data it explains. It
operates in the training vector space and defines an orthogonal basis using a
method analogous to the Gram-Schmidt orthogonalization algorithm. Unlike
the k-means algorithm, OLS provides an optimal clustering of the training
data, but it does so by favoring the directions of maximum variance; so if the
chosen hidden layer dimension is not high enough, and/or if the two regimes
are too much alike in those directions (i.e., if the differences lie in directions
of smaller variance), this method may lack specialization.
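The online k-means procedure of approach 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of centers, the learning-rate schedule (a linear anneal here), and the epoch count are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_centers(X, n_centers=20, n_epochs=10, eta0=0.5, eta_final=0.01):
    """Online k-means: move the winning center toward each input vector."""
    # Initialize centers on randomly chosen training vectors (approach 2).
    centers = X[rng.choice(len(X), size=n_centers, replace=False)].copy()
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for x in X:
            # Annealed learning rate (linear decay, chosen for illustration).
            eta = eta0 + (eta_final - eta0) * step / n_steps
            # Winning center: the closest one in Euclidean norm.
            k = np.argmin(np.linalg.norm(centers - x, axis=1))
            # Update proportionally to the distance from input to center.
            centers[k] += eta * (x - centers[k])
            step += 1
    return centers

def spreads(centers, n_neighbors=10):
    """Spread of each unit: half the mean distance to its ten closest neighbors."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d.sort(axis=1)  # column 0 is each center's zero self-distance
    return 0.5 * d[:, 1:n_neighbors + 1].mean(axis=1)
```

Note that the batch (offline) variant mentioned above would accumulate the updates over one epoch and apply them together instead of updating after every input.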
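The OLS forward selection of approach 4 can be sketched as below. Each training vector is a candidate center; candidate Gaussian-response columns are orthogonalized against the already selected ones (Gram-Schmidt), and the column explaining the largest share of the output variance is added at each step. The Gaussian basis function, the fixed spread `sigma`, and the stopping criterion (a fixed center count) are illustrative assumptions, not details taken from [26].

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Matrix of Gaussian RBF responses, one column per candidate center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def ols_select(X, y, n_centers, sigma=1.0):
    """Greedy OLS selection of RBF centers from the training vectors."""
    P = gaussian_design(X, X, sigma)  # candidate regressor columns
    selected, basis = [], []
    for _ in range(n_centers):
        best_j, best_err = None, -np.inf
        for j in range(P.shape[1]):
            if j in selected:
                continue
            w = P[:, j].copy()
            # Orthogonalize against the already selected basis vectors.
            for q in basis:
                w -= (q @ w) / (q @ q) * q
            if w @ w < 1e-12:
                continue  # numerically dependent on the chosen columns
            # Error-reduction ratio: output variance explained by this column.
            err = (w @ y) ** 2 / (w @ w)
            if err > best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        w = P[:, best_j].copy()
        for q in basis:
            w -= (q @ w) / (q @ q) * q
        basis.append(w)
    return X[selected]  # the chosen RBF centers
```

The greedy maximization of explained variance is what gives OLS its bias toward directions of maximum variance discussed above.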
Once the RBF layer is clustered, the linear layer has to be trained: we use the RLS
algorithm [23] for initial training, and the LMS algorithm for online adaptation. Again