Biogeography-Based Optimization for Weight Optimization in Elman Neural Network Compared with Meta-Heuristic Methods

1 College of Applied Computer Sciences (CACS), King Saud University, Riyadh, Saudi Arabia; Faculty of Sciences and Techniques of Sidi Bouzid, University of Kairouan, Tunisia.

Abstract: In this paper, we present a learning algorithm for the Elman Recurrent Neural Network (ERNN) based on Biogeography-Based Optimization (BBO). The proposed algorithm computes the weights, the initial inputs of the context units and the self-feedback coefficient of the Elman network. The method is applied to four benchmark problems: the Mackey-Glass and Lorenz equations, which produce chaotic time series, and two real-life classification tasks, the Iris and Breast Cancer datasets. Numerical experimental results show that the proposed algorithm improves on many heuristic algorithms in terms of accuracy and MSE error.


Introduction
Among the various types of neural networks, the Recurrent Neural Network (RNN) is able to produce the most accurate forecasts (Senjyu et al., 2006). In an RNN, fixed back-connections save a copy of the previous values of the hidden units in the context units. RNNs have been applied in various domains, such as pattern recognition (Hori et al., 2016), robotic control (Sharma et al., 2016) and genetic data prediction (Baldi & Pollastri, 2003). They have also been widely used as tools for data classification (Nawi et al., 2015) and time series prediction (Chandra, 2015; Koskela et al., 1996).
There are two types of RNN: the fully recurrent network (FRNN), used by Kechriotis, Zervas and Manolakos (1994), and the partially recurrent network (PRNN), used by Robinson and Fallside (1991). In an FRNN, each unit of the network is connected to every other unit; the BAM (Bidirectional Associative Memory) (Kosko, 1988) and Hopfield (1982) networks are examples. Fully recurrent networks remain complicated when dealing with complex problems, whereas partial training is faster than training globally recurrent networks. Recent research shows that PRNNs can be highly effective forecasting methods in fields such as electricity consumption and wind speed (Cao, Ewing, & Thompson, 2012; Marvuglia & Messineo, 2012). The PRNN topology is suited both to non-linear applications and to modelling time series data (Müller-Navarra, Lessmann & Voß, 2015). The Elman Neural Network (ENN) (Elman, 1990) is the most widely used PRNN architecture. Its structure is chosen over the Jordan network (Jordan, 1997) because its hidden layer is wider than the output layer; this wider layer allows more values to be fed back to the input, and consequently more information to be available to the network (Venayagamoorthy, Welch & Ruffing, 2009). Optimization can be performed by metaheuristic methods (Yao & Kim, 2014). This class of network can be trained with heuristic algorithms because of the drawbacks of gradient-based algorithms, such as getting trapped in local minima. In general, there are three tasks in RNN optimization: weight and bias optimization, architecture optimization and gradient parameter optimization. This work concerns the first task, with the aim of finding the minimum training error.
A metaheuristic is formally defined as an iterative generation process which guides a subordinate heuristic by intelligently combining different concepts for exploring and exploiting the search space, using learning strategies to organize information and efficiently find near-optimal solutions (Osman & Laporte, 1996). In general, nature-inspired algorithms are classified into three major groups: Evolutionary Algorithms, Ecology-Based Algorithms and Bio-Inspired Algorithms.
Evolutionary computation algorithms design solutions based on Darwinian biological evolution; this group includes Genetic Algorithms (Pham & Karaboga, 1999), Differential Evolution (Storn & Price, 1997) and Evolutionary Strategies (Kawada, Yamamoto, & Mada, 2004).
Ecology-Based Algorithms solve problems by modelling ecosystems; this group includes the Biogeography-Based Optimization (BBO) and Invasive Weed Optimization (IWO) algorithms (Mehrabian & Lucas, 2006).
Although generating an evolving solution is common to most of these approaches, each has its own distinctive way of exploring and exploiting the search space of the problem.
The BBO algorithm is considered one of the most powerful algorithms because its exploration and exploitation strategies depend on two operators: migration and mutation.
The main objective of the mutation operator is to enhance the diversity of the population. With this operator, solutions with low HSI can be improved, as can solutions with high HSI; consequently, this probabilistic operator can be applied to any candidate solution. Unlike in other evolutionary algorithms, at each generation the solutions are a combination of the parents' solutions and their offspring.
Emigration rates evolve from one generation to another: a habitat with a high emigration rate can share information with one that has a low emigration rate.
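These migration and mutation mechanics can be sketched as follows, using the linear rate model of (Simon, 2008); the population layout, rate bounds and mutation scale here are illustrative assumptions, not the settings used later in this paper.

```python
import random

def migration_rates(n, I=1.0, E=1.0):
    """Immigration (lambda) and emigration (mu) rates for habitats
    ranked 0 (best) .. n-1 (worst), linear model from Simon (2008)."""
    lam = [I * (k + 1) / n for k in range(n)]        # worse habitats immigrate more
    mu = [E * (1 - (k + 1) / n) for k in range(n)]   # better habitats emigrate more
    return lam, mu

def migrate(population, lam, mu):
    """Each habitant (SIV) of habitat k may be replaced, with probability
    lam[k], by the same SIV from an emigrating habitat chosen by
    roulette-wheel selection on the emigration rates mu."""
    n = len(population)
    for k in range(n):
        for j in range(len(population[k])):
            if random.random() < lam[k]:
                src = random.choices(range(n), weights=mu)[0]
                population[k][j] = population[src][j]
    return population

def mutate(population, p_mut=0.05, scale=1.0):
    """Gaussian perturbation of habitants to preserve diversity."""
    for habitat in population:
        for j in range(len(habitat)):
            if random.random() < p_mut:
                habitat[j] += random.gauss(0.0, scale)
    return population
```

Note that in this sketch the best-ranked habitat has the highest emigration rate and the lowest immigration rate, matching the sharing behaviour described above.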
Several techniques have been used to optimize Elman RNN performance, such as the Genetic Algorithm (GA) (Pham & Karaboga, 1999), Particle Swarm Optimization (PSO) (Xiao, Venayagamoorthy, & Corzine, 2007), Ant Colony Optimization (ACO) (Zhipeng et al., 2012), Evolutionary Strategies (ES) (Kawada, Yamamoto, & Mada, 2004) and Population Based Incremental Learning (PBIL) (Palafox & Iba, 2012). Both BBO and GA are evolutionary algorithms, but each has specific characteristics. In (Simon et al., 2011), the authors claim that BBO and GA have the same chance of finding the optimal solution, but that BBO is able to conserve this optimum once it is found, thanks to the immigration rate, which helps retain good solutions in the population and decreases with fitness. In addition, BBO applies mutation to each individual in the population, which enhances its exploitation capability compared to GA, where a single mutation rate applies to the entire population. In fact, (Simon, Ergezer & Du, 2009) showed that the advantage of BBO over GA is more marked on larger, higher-dimensional problems.
PSO is based on the behaviour of birds seeking food, while BBO uses the principle of migration between islands. Despite this difference, the two algorithms share characteristics such as the sharing of information within the population; the strength of BBO is that it retains solutions from one iteration to the next and improves them through the migration mechanism. BBO also uses a mutation mechanism, which is a strong point compared to the swarm-intelligence techniques (PSO, ACO).
In (Hordri, Yuhaniz, & Nasien, 2013), the authors compare the performance of BBO, PSO and GA on fourteen benchmark functions and find that BBO succeeds in convergence time and performs well at avoiding local minima.
In this work, the BBO algorithm is used to optimize the weights of the ENN. We also examine the advantages of this algorithm in training the ENN for the classification and prediction of benchmark problems. The performance of our algorithm is also compared with other well-known heuristic algorithms.
The results indicate that the BBO algorithm proves its effectiveness in training the Elman Neural Network.
The remainder of this paper is organized as follows: Section II presents a broad description of the Elman Neural Network (ENN); Section III explains the basic concept of the BBO algorithm and its use in designing the ENN; the experimental results are given in the fourth section; finally, the last section gives the conclusions.

Elman Neural Network
The Elman Neural Network (ENN), proposed in (Elman, 1990), is designed with an input layer, a hidden layer, a recurrent link known as the context layer, and an output layer. The context layer contains a copy of the hidden layer outputs, which are subsequently used as inputs. The main advantage of this layer is that it stores information from the hidden layer and preserves memory, so that more information is available as input. This simple recurrent network has well-known advantages, such as faster convergence, more accurate mapping and nonlinear prediction capability (Chandra, 2015).
Let x_i (i = 1, ..., m) denote the input vector, y_k the outputs of the ENN and z_j (j = 1, ..., n) the outputs of the hidden layer. b_j and b_k are the biases in the hidden layer and the output layer respectively, and u_j denotes the context layer neurons. w_ij is the weight that connects input node i to hidden node j, c_j is the weight that connects context node j to the hidden layer, and v_jk is the weight that connects hidden node j to output node k.
The output of the hidden layer is computed as:

z_j(t) = f( Σ_{i=1..m} w_ij x_i(t) + c_j u_j(t) + b_j )    (1)

u_j is the context node value, calculated by:

u_j(t) = α u_j(t-1) + z_j(t-1)    (2)

where α is the self-feedback coefficient. The activation function selected in the hidden layer is the sigmoid function, defined as follows:

f(x) = 1 / (1 + e^(-x))    (3)

The output of the ENN is given as follows:

y_k(t) = Σ_{j=1..n} v_jk z_j(t) + b_k    (4)

The architecture of the ENN is presented in Figure 1.
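As a minimal sketch, the ENN forward pass described above might be implemented as follows; the weight layout and the handling of the self-feedback coefficient alpha follow the notation in the text, but the function itself is our illustration, not the authors' code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def enn_forward(x, state, w, c, v, b_hid, b_out, alpha=0.0):
    """One time step of the Elman network.
    x: inputs (length m); state: dict with previous hidden outputs 'z'
    and context values 'u' (length n); w: m x n input->hidden weights;
    c: n context->hidden weights (one per context node); v: n x p
    hidden->output weights; b_hid, b_out: biases."""
    n = len(b_hid)
    # Context update: previous hidden output plus self-feedback.
    u = [alpha * state["u"][j] + state["z"][j] for j in range(n)]
    # Hidden layer with sigmoid activation.
    z = [sigmoid(sum(w[i][j] * x[i] for i in range(len(x)))
                 + c[j] * u[j] + b_hid[j]) for j in range(n)]
    # Linear output layer.
    y = [sum(v[j][k] * z[j] for j in range(n)) + b_out[k]
         for k in range(len(b_out))]
    state["z"], state["u"] = z, u
    return y
```

Calling `enn_forward` repeatedly over a sequence threads the context state from one time step to the next, which is what gives the network its memory.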

BBO Trained Elman RNN
Biogeography-Based Optimization (BBO), proposed in (Simon, 2008), is an Evolutionary Algorithm (EA) based on the emigration and immigration of species between islands. Recently, the BBO algorithm has proved its efficiency in supplying globally optimal solutions to different problems (Ma et al., 2015; Mirjalili, Mirjalili, & Lewis, 2014; Rodan, Faris & Alqatawna, 2016; Zhang et al., 2019).
The general idea of the algorithm is to model the relations between species through emigration, immigration and mutation. Similarly to GA, BBO employs habitats as chromosomes. Each habitat is assigned a vector of habitants (the genes in a GA), which are the variables to optimize. To measure quality, BBO uses the Habitat Suitability Index (HSI) as a performance index. A high HSI represents a good solution with a large number of habitants, which are more likely to emigrate to islands with low HSI; poor solutions have a low HSI and a higher immigration rate. The BBO algorithm is thus characterised by emigration, immigration and mutation rates.
The time complexity of BBO depends on the resources used. In O-notation, the time complexity is expressed as a function describing an asymptotic upper bound: f(n) = O(g(n)) if there exist positive constants c and n0 such that f(n) ≤ c·g(n) for all n ≥ n0. The computational complexity of the BBO algorithm depends on the number of species (habitats), the number of generations, the migration operator (selection of solutions), the mutation operator and finding the best solution.

At each iteration, the complexity is as follows. Initialization is O(nmd), where d is the dimension of the habitants, m is the number of habitants and n is the number of habitats; in our implementation d = 1. In the migration operation, roulette-wheel selection is used to select the candidate solution from which to immigrate, so the complexity of migration is O(mn²). The mutation operation is applied to each habitant, so its complexity is O(nm). Selecting the best solution is based on the fitness value of each habitat, with complexity O(n²). Therefore, the final computational complexity of the proposed method is O(BBO) = O(g(mn² + mn + n²)), where g is the number of generations. Variables of constant size are ignored in this expression because they contribute only constant factors.
Fitness evaluation: the HSI is calculated as a fitness function, defined by the error of the ENN, to evaluate habitat performance.

Encoding scheme of ENN trained by BBO
The optimization algorithm evolves the parameters of the ENN. In BBO, each habitat is therefore encoded as a vector, defined as follows:

ENN = [ W12 W32 W24 b1 b2 ]

Fig. 2. ENN with the structure 1-1-1.
In the Figure 2 example, each layer (input, hidden and output) is composed of only one node. W12 denotes the weight between the input node and the hidden node, W32 the weight between the context node and the hidden node, and W24 the weight between the hidden node and the output node; b1 and b2 are the bias values of the hidden node and the output node respectively. The encoding vector therefore contains the list of weights between the input and hidden layers, the list of weights between the context and hidden layers, the list of weights between the hidden and output layers, and the bias values. The fact remains that training RNNs is a challenging optimization problem, so each candidate network must be evaluated by a fitness measure. Thus, for each individual, an HSI function is assigned depending on the desired optimization. In this work, the Mean Square Error (MSE) is used as the HSI function to compute the output error:
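The encoding scheme can be sketched as a simple slicing of the habitat vector. The slicing order (input-to-hidden weights, context weights, hidden-to-output weights, then biases) follows the scheme above, while the function name and API are hypothetical.

```python
def split_habitat(vec, m, n, p):
    """Decode a flat habitat vector into ENN parameters for a network
    with m inputs, n hidden/context nodes and p outputs."""
    sizes = [m * n, n, n * p, n, p]   # w, c, v, b_hid, b_out
    assert len(vec) == sum(sizes), "habitat length mismatch"
    parts, start = [], 0
    for s in sizes:
        parts.append(vec[start:start + s])
        start += s
    w, c, v, b_hid, b_out = parts
    # Reshape the flat weight lists into row-major matrices.
    w = [w[i * n:(i + 1) * n] for i in range(m)]
    v = [v[j * p:(j + 1) * p] for j in range(n)]
    return w, c, v, b_hid, b_out
```

For the 1-1-1 network of Figure 2, the habitat has exactly five entries, matching [W12 W32 W24 b1 b2].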

Fitness function
MSE = (1/S) Σ_{s=1..S} Σ_{k=1..m} (y_k^s - d_k^s)²    (5)

where S is the number of training samples, m denotes the number of outputs, y_k^s is the obtained output for the s-th input pattern and d_k^s denotes the desired output. The proposed algorithm aims to minimize this network error. Its computational complexity can be written as follows:

O(BBO-ENN) = O(i(x(z + y) + hH² + Hh + H²))    (6)

where i is the number of iterations, x is the number of input training sets, z and y are the numbers of nodes in the hidden layer and the output layer respectively, h is the number of habitants (weights and biases) and H is the number of habitats (ENNs). Here H² represents the elitism complexity, Hh the mutation complexity, hH² the migration complexity and x(z + y) the ENN complexity.

The proposed model BBO-ENN is given in Figure 3. The first step of the proposed model is to generate a random set of ENNs as habitats and randomly initialize the weights and biases as habitants. The second step is to calculate the MSE of each ENN by Eq. (5) to distinguish between the best and the worst habitats. The third step is to update the emigration, immigration and mutation rates. Once the good and poor solutions are identified, information is exchanged between the islands, and some habitats are selected to mutate various habitants. The last step is to keep the best solutions as elites for future generations. These steps are repeated until a stop condition is satisfied, which can be a number of iterations or an error threshold.

Figure 4 presents a conceptual picture of BBO-ENN. As seen in this figure, there are three habitats (ENNs). Habitat 1 has the highest HSI, the highest emigration rate and the lowest immigration rate; it represents a good solution, so it is more likely to share its weights and biases with Habitat 2 and Habitat 3.
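As a sketch, the MSE-based HSI could be computed as follows; `forward` stands for any routine that decodes a habitat into an ENN and evaluates it on one input, and is an assumption of ours rather than part of the paper.

```python
def mse_hsi(habitat, samples, forward):
    """Mean squared error of the ENN encoded by `habitat` over the
    training set; lower values indicate better habitats."""
    total, count = 0.0, 0
    for x, desired in samples:
        y = forward(habitat, x)  # outputs of the decoded ENN
        total += sum((yk - dk) ** 2 for yk, dk in zip(y, desired))
        count += 1
    return total / count
```

In the BBO loop, this value would be computed once per habitat per generation, then used to rank habitats and derive their migration rates.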
Habitat 2, in contrast, has the lowest HSI, a lower emigration rate and the highest immigration rate; it represents a poor solution, so it is more likely to accept shared features (weights and biases) from Habitat 1 and Habitat 3.
Theoretically, the proposed BBO-ENN model can improve the training phase thanks to the emigration and immigration rates, which are evolutionary mechanisms applied to each habitat that encourage exploration; thus, BBO is less likely to fall into local optima. In addition, thanks to the migration of better weights/biases towards worse ENNs, the MSE (HSI) of each ENN (habitat) can be improved over the generations. The mutation mechanism further helps each habitat to exploit different regions of the search space, and finally the elitism phase keeps some of the best solutions so that they are never lost.
Having outlined the theoretical functionality of the proposed method, the following section presents the practical results, followed by a comparative study of the different algorithms.

Experiments
To verify the performance of the BBO algorithm for training the Elman NN, we compare it with PSO, GA, ACO, ES and PBIL on four benchmark problems: Breast Cancer (Wolberg & Mangasarian, 1990) and Iris (Fisher, 1936) for classification, and the Mackey-Glass equation (Mackey & Glass, 1977) and the Lorenz attractor (Lorenz, 1963) for time series prediction.
The classification datasets are evaluated on two performance criteria: (a) MSE value and (b) classification accuracy.
Increasing the population size and the number of iterations could improve the performance of the algorithms, but in this work we are interested in comparing the six algorithms over a fixed number of iterations. Thus, rather than searching for the best parameters, we simply use the same network parameters for all methods, such as the number of nodes, the weight initialization and the population size. In this architecture, the log-sigmoid is used as the activation function.
For all algorithms, we initialise the habitats randomly in the range [-10, 10]. The population size is 200 for each dataset. For all experiments, the performance was computed over 30 runs of 300 generations for each method.
According to (Shamsuddin, 2004), there is no standard rule for determining a suitable number of hidden nodes. We fixed it based on the rule "one hidden layer with 2N + 1 hidden neurons is sufficient for N inputs". Table 1 shows the number of input, hidden and output nodes for each dataset. The initial parameters of the meta-heuristic algorithms are fixed in Table 2, which shows the initialization settings of the optimization methods; all parameters were chosen based on values used in the literature.
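The 2N + 1 sizing rule is simple enough to state in code; the dataset-specific values below are our own computation from the attribute counts, not copied from Table 1.

```python
def hidden_nodes(n_inputs):
    """'One hidden layer with 2N + 1 hidden neurons for N inputs'."""
    return 2 * n_inputs + 1

# e.g. Iris has 4 attributes -> 9 hidden nodes,
#      Breast Cancer has 9 attributes -> 19 hidden nodes.
```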

A. Breast Cancer
This dataset was obtained from the UCI Machine Learning Repository. It contains 699 instances and 9 attributes, with 458 benign and 241 malignant instances. The first 599 patterns are used for the training phase and the remainder for testing. The convergence of the different algorithms is presented in Figure 5, and Table 3 presents the experimental results. From Table 3, it can be seen that the MSE value of BBO-ENN is lower than those of the PSO, GA, ACO, ES and PBIL algorithms, which demonstrates the efficacy of BBO-ENN for data classification. The proposed algorithm achieves the smallest MSE (0.0024175) and the highest accuracy, 99.99%. Meanwhile, the other methods (PSO, ACO, ES and PBIL) converge with larger MSE and lower accuracy, although the MSE value of GA is close to that of BBO. As shown in Figure 5, the BBO technique has the fastest and lowest convergence curve of all the methods on Breast Cancer. From these simulation results, the BBO algorithm proves its superiority in terms of MSE and accuracy.

B. Iris dataset
The Iris Plants dataset contains 150 samples and four attributes (sepal length, sepal width, petal length, petal width), divided into three classes: Setosa, Versicolour and Virginica. In this experiment, we used four inputs, nine hidden nodes and three outputs. The first 150 patterns are selected for the training phase, and the remaining 150 for testing. Figure 6 shows the convergence of each algorithm and illustrates the success of BBO compared to the other methods; from these results, the BBO algorithm achieves the highest performance.

C. Mackey-Glass time series prediction
The Mackey-Glass time series is defined by the following delay differential equation:

dx/dt = a x(t - τ) / (1 + x^10(t - τ)) - b x(t)    (7)

In our work, the input of the ENN consists of four data points: x(t), x(t-6), x(t-12) and x(t-18). The output is defined in equation (8):

x(t + 6) = f(x(t), x(t - 6), x(t - 12), x(t - 18))    (8)

The first 500 samples are selected for the training phase and the remaining 500 for testing. After 300 generations of training, the convergence of the different algorithms is presented in Figure 7, and Table 5 compares the MSE of BBO-ENN with those of the other meta-heuristic algorithms. In this experiment, the BBO and GA algorithms achieve the smallest MSE of 0.009702; in some runs the GA MSE equals the BBO MSE. However, BBO-ENN remains the more promising method, since it converges to the best solution faster than the other methods.
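For reproducibility, the Mackey-Glass series can be generated by simple Euler integration of the delay equation above; the parameters a = 0.2, b = 0.1, tau = 17 and the unit step size are the common benchmark choices, assumed here rather than taken from the paper.

```python
def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2, dt=1.0):
    """Generate n points of the Mackey-Glass series by Euler steps,
    using a flat history of x0 before t = 0."""
    x = [x0] * (tau + 1)
    for _ in range(n):
        x_tau = x[-tau - 1]                       # delayed value x(t - tau)
        dx = a * x_tau / (1.0 + x_tau ** 10) - b * x[-1]
        x.append(x[-1] + dt * dx)
    return x[tau + 1:]
```

Sliding windows over this series then give the (x(t), x(t-6), x(t-12), x(t-18)) inputs and x(t+6) targets described above.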

D. Lorenz attractor
The Lorenz system is given by the following differential equations:

dx/dt = σ(y - x),  dy/dt = x(ρ - z) - y,  dz/dt = xy - βz    (9)

where σ, ρ and β are positive real parameters. Of these three components, x is the time series used. In this work, the input of the ENN is defined by x(t), x(t-1) and x(t-2), and the output is given in equation (10):

x(t + 1) = f(x(t), x(t - 1), x(t - 2))    (10)

The first 500 samples of 1000 simulated data points are chosen for the training phase and the remaining 500 for testing. The convergence curve of each algorithm is summarized in Figure 8, which presents the MSE convergence for the Lorenz problem. The figure demonstrates that the proposed BBO-ENN obtains better results than the other algorithms; BBO-ENN again shows its efficiency for the prediction of the Lorenz time series.

Across all of these experiments, BBO shows good performance compared to the other algorithms. These results can be explained by the philosophy of the BBO technique relative to the other evolutionary algorithms. Over the generations, BBO solutions are maintained according to their emigration rate; at each iteration, BBO improves the habitats by changing some of their features, and poor solutions can be improved by receiving SIVs (attributes) from good solutions. In GA, ACO and PBIL, by contrast, the worst solutions are discarded from the population and only the best candidate solutions are maintained, so the population evolves using only the elite solutions. BBO is also clearly similar to PSO and DE in maintaining solutions: each solution learns from its neighbours and evolves based on the movements of the surrounding particles.
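Similarly, the Lorenz x-component can be generated by Euler integration; sigma = 10, rho = 28 and beta = 8/3 are the classical chaotic parameters, and the step size and initial condition are illustrative assumptions.

```python
def lorenz_x(n, sigma=10.0, rho=28.0, beta=8.0 / 3.0,
             x=1.0, y=1.0, z=1.0, dt=0.01):
    """Generate n points of the Lorenz x-component by Euler steps."""
    series = []
    for _ in range(n):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        series.append(x)
    return series
```

As with Mackey-Glass, windowing this series yields the (x(t), x(t-1), x(t-2)) inputs and x(t+1) targets for the prediction task.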

Conclusion
In this work, a Biogeography-Based Optimization (BBO) algorithm is proposed to train an Elman Neural Network (ENN) on four benchmark problems. The experimental results show that the BBO-ENN model can effectively classify data such as the Breast Cancer and Iris datasets. The method was also applied to the Mackey-Glass and Lorenz equations, which produce chaotic time series. Statistical results show that the proposed algorithm outperforms the GA, PSO, ES, ACO and PBIL algorithms. The performance of BBO-ENN is mainly due to the BBO algorithm, which successfully optimizes the weight parameters of the Elman Neural Network; BBO-ENN succeeds in convergence time and performs well at avoiding local minima. Although BBO shows good performance when applied to classification and time series prediction, it inherently lacks the exploration ability needed to increase the diversity of habitats, which can slow down convergence. The expansion of BBO algorithms to many types of problems opens several research areas: one suggestion for future work is to automate parameter tuning, and a further study is to apply the BBO algorithm to more complicated problems.
June, 2020 Artificial Intelligence and Neuroscience Volume 11, Issue 2