Google Summer of Code Proposal
Published:
Google Summer Of Code
Organizations: MLpack
Proposal ideas:
Implementing enhancement modules to improve the ability of original CMA-ES in terms of scalability, performance and finding global optimum.
Objectives:
- Implement 4 extension algorithms (restart-CMA-ES, active-CMA-ES, Sep-CMA-ES and Cholesky-CMA-ES) and get merged into the codebase ensmallen.
- Refactorize the folder ensmallen/include/ensmallen_bits/cmaes with proper APIs for different variations of CMA-ES.
- Write intensive test cases with different kinds of functions and sanity checks for each variation algorithm.
- Detail documentation on the code and create a real-world application of CMA-ES tutorial.
Table of contents:
- About Project
- Reason
- Summary of CMA-ES
- CMA-ES enhancement
- API Design
- Testing and Benchmarking
- Documentation
- Timeline
2 About Project
2.1 Reason
The reasons why I chose this project as GSoC project. I have been practicing C++ meta programming and multi-thread lately, mainly by reading “The C++ Programming Language” by the “father” of C++ Bjarne Stroustrup. I tried to implement the C++ matrix library from scratch recently, but everything seems overwhelming to me, so I decided to look for an open-sourced C++ project that aligns with my academic interests which are Statistics and Machine Learning. My mathematics is accumulated over the years, but yet I still have not tried to apply any related algorithm using my favorite programming language - C++
And surprisedly, mlpack GSoC idea on MCA-ES Enhancement is an intersection of my interests.
2.2 Summary of MCA-ES algorithm and changes to original algorithm recent years
I did some research into this field and found some new promising improvements which can easily be integrated into ensmallen library
So the MCA-ES falls into the category of Evolutionary Algorithm(EA) - Evolution strategy. EA algorithms find the candidate solutions in any type of optimization problem by constantly changing the original (randomly generated) population - an inspired biological evolution process.
A typical ES consists of 4 steps and repeats the steps until the desired population is generated - qualified by objective function(s)
- Evaluate the fitness of each individual in the population
- Select the fittest individuals for reproduction
- Breed new individuals through crossover and mutation operations to give birth new population (offspring)
- Replace the least-fit individuals of the population with new individuals.
Our interest problem here is to use an evolutionary algorithm for optimization of real-valued function $f : \mathbb{R}^{n} \to \mathbb{R}$.
This particular problem is appealing to the machine learning field since almost machine learning objective functions are $f : \mathbb{R}^{n} \to \mathbb{R}$. The usual algorithm for this problem is gradient descent which required the fitness landscape to be convex while most of the desired loss functions are non-convex. Stochastic and derivative-free methods are theoretically suitable for these types of functions. An ES is designed to tackle this situation.
CMA-ES using normal distribution to draw new population and select $\mu$ best candidates as parents from $\lambda$ for future generation based on fitness values while inferior is discarded (step 1)
This means that:
\[x_{k}^{(g+1)} \sim m^{(g)} + \sigma^{(g)}\mathcal{N}(\textbf{0}, \textbf{C}^{(g)})\]For all $k=1,…,\lambda$ While $\textbf{C}^{(g)}$ is covariance matrix and $\sigma^{(g)}$ is the step-size(recorded should the update be significant or not, depends on cumulative path length control of $g^{\text{th}}$ generation.
The recombination and mutation of CMA-ES are stimulated by moving the mean and an adaptive covariance matrix after every generation.
Considering two update quantities separately.
2.2.1 The mean of selection parents
In the original CMA-ES paper, the algorithm suggested that, the new mean $m^{(g+1)}$ of draw distribution(search distribution) is a weighted average of $\mu$ selected points from the sample $x_{1}^{(g+1)},…, x_{\lambda}^{(g+1)}$:
\[⁍\]with $x_{i:\lambda}^{(g+1)}$, $i=1,…,\mu$ are the $\mu$ best elements in terms of objective function (best $f(x_{i:\lambda}^{(g+1)})$ among $\lambda$ samples $x_i$. Here we usually set $c_m=1$ as the original paper suggests.
Setting parameters also is important, since there is a rigorous mathematics explanation in the original paper on how parameters affect performance in each function.
Original paper suggests:
$\lambda = 4 + \lfloor 3*\ln{n} \rfloor$ $\mu = \dfrac{\lambda}{2}$
$w^{‘}_{i} = \dfrac{\lambda + 1}{2} - \ln{i}$$= \mu + 0.5 - \ln{i} \quad i = 1, …, \mu$
$w_{i} = \dfrac{1}{ \sum w^{‘}_{i} }$ $i=1,..,\mu$
$\lambda$ can be increased if we are deal with more noisy function while the trade-off is the speed of convergence. The speed decreases at most linearly with $\lambda$
if (lambda == 0){
lambda = (4 + std::round(3 * std::log(iterate.n_elem))) * 10;
}
const size_t mu = std::round(lambda / 2);
BaseMatType w = std::log(mu + 0.5) - arma::log(
arma::linspace<BaseMatType>(0, mu - 1, mu) + 1.0);
w /= arma::accu(w);
In the implementation of ensmallen:
First Initialization Declare the first average as a random vector lies in $[a,b]^n$ with uniform distribution $\sigma = 0.3(b-a)$ with $a,b$ are the lower bound and upper bound respectively (problem-dependent) as the author suggested (in order to have a faster convergence solution CMA-ES inconsistencies · Issue #70 ·))
// Declare the first average as a random vector lies in [a,b]^n with uniform distribution (sigma) while a, b are the lowerbound and upperbound a
// ...
std::vector<BaseMatType> mPosition(2, BaseMatType(iterate.n_rows, iterate.n_cols))
// vector of 2 matrices recorded the weighted mean m^(g), one before, one after
BaseMatType sigma(2,1) // vector 2 elements, one before, one after
// ...
mPosition[0] = lowerBound + arma::randu<BaseMatType>(iterate.n_rows,
,iterate.n_cols) * (upperBound - lowerBound);
// ...
// Calculate the (x_i - m^g) by sampled from distribution C^-1/2 . N(0, I)
step = w(0) * pStep[idx(0)];
for (size_t j = 1; j < mu; ++j){
step += w(j) * pStep[idx(j)];
}
// The update new mean
mPosition[idx1] = mPosition[idx0] + sigma(idx0) * step;
But the update batch only considers $\mu$ candidates from $\lambda$ sampled vectors. This means that the weighted recombination refers to the idea of replacing the computation of the average Eq. (1) with a weighted sum, where the weight of a mutation vector depends on the respective offspring candidate solution’s rank in the set of all $\lambda$ offspring. So the new mean is updated preferably to the direction of successful offspring and discarded the appearance of the remaining $\lambda-\mu$unsuccessful offspring. This is mentioned in [1]. So in order to have a better performance, we should include the unsuccessful samples in the update of the mean by giving them negative weight. Also, Rudolph [2] has shown that by giving negative weights to unfavourable candidate solutions can drive a sooner convergence. In [3] Hansen also mentioned this so he decided to create new setting for the $w(i)$
$w^{‘}_{i} = \dfrac{\lambda + 1}{2} - \ln{i}$ $= \mu + 0.5 - \ln{i}$ for $i = 1, …, \lambda$
$w_{i} = \dfrac{1}{\sum w^{‘}_i ^{+}} w^{‘}_i$ if $w^{‘}_i \geq 0$ positive weights sum to one
$w_i = \dfrac{min(\alpha^{-}\mu, \alpha^{-}{\mu_{eff}}, \alpha^{-}_{pos def})}{\sum w^{‘}_i ^{-}}$ if $w^{‘}i < 0$ negative weights usually sum to $-\alpha^{-}{\mu}$
Where
$\alpha^{-}_{\mu} = 1 + \dfrac{c_1}{c\mu}$,
$\alpha^{-}{\mu{eff}} = 1 + \dfrac{2\mu^{-}{eff}}{\mu^{-}{eff}+2},$
$\alpha^{-}{pos def} = \dfrac{1-c_1-c\mu}{nc\mu}$.
In old implementation, we calculate the $w^{‘}{i} = \dfrac{\lambda + 1}{2} - \ln{i}$ *= $\mu + 0.5 - \ln{i}$ for $i = 1, …, \mu$ and $w{i} = \dfrac{1}{ \sum w^{‘}_{i} }$* for $i=1,..,\mu$ (1) Now with new suggested implementation:
BaseMatType weight_before = std::log((lambda+1)/2)-
arm::log(arm::linspace<BaseMathType>(0, lambda-1, lambda)) // The original only get mu effective weight
BaseMatType w(lambda, 1);
double mu_neg_eff = 0;
size_t idx_neg = 0;
for(int j = 0; j < lambda; ++j){
if(weight_before(j) < 0){
idx_neg = j;
break;
}
}
w.col(0, idx_neg) = weight_before.col(0, idx_neg)/arm::accu(weight_before.col(0, idx))
const double alpha_mu_negative =1+c1/cmu;
const double mueff_negative = arma:accu(weight_before(0,
idx_neg))/arma:accu(arm:pow(weight_before(0, idx_neg), 2));
const double alpha_mueff_negative = 1 + 2*mueff_negative/(mueff_negative+2);
const double alpha_posdef_negative = (1-c1-cmu)/(iterate.n_elem*cmu);
w.col(idx_neg+1, lambda) = (weight_before.col(idx_neg+1, lambda) * std::min(std::min(alpha_mu_negative, alpha_mueff_negative), alpha_posdef_negative))/(arma:accu(w.col(idx_neg+1, lambda)))
This is my suggested improvement on the CMA-ES algorithm ensmallen on updating the mean. This idea of using negative weights on covariance matrix was benchmarked by author of CMA-ES in 2010 [4] and it is shown significant results.
Since the weight is changed in order to have a better performance so same goes to $\text{step} = (\textbf{y})^{T}w$
But in the code, there is no need to change:
idx = arma::sort_index(pObjective);
step = w(0) * pStep[idx(0)];
for (size_t j = 1; j < mu; ++j)
step += w(j) * pStep[idx(j)];
2.2.2 The adaptive covariance matrix
Recall from the main equation of this method:
\[⁍\]With $\sigma^{(g)}$is the step-size control coefficient of covariance matrix in generation $(g)$.
The next quantity we would like to update after every generation is $\sigma^{(g)}\mathcal{N}(\textbf{0}, \textbf{C}^{(g)}) = \mathcal{N}(\textbf{0}, (\sigma^{(g)})^{2}\textbf{C}^{(g)})$
We will update both $\sigma$ and $C$ depends on situations.
New population $x_1, x_2,.., x_\lambda$ will be generated from the distribution $\mathcal{N}(m^{(g)}, (\sigma^{(g)})^{2}\textbf{C}^{(g)})$, so by transforming $x_i = m^{(g)} + \sigma^{(g)}y_i$ we will get $y_i \sim \mathcal{N}(\textbf{0}, \textbf{C}^{(g)})$
And $y_i = \dfrac{x_i-m^{(g)}}{\sigma^{(g)}}$. So by omitting the step-size control into $y_i$, instead of dealing with the giant matrix $C^{(g)}$ we pass the $\sigma^{(g)}$ effective into $y_i$, but these are generating with $y_i \sim \mathcal{N}(\textbf{0}, \textbf{C}^{(g)})$. So we can control the covariance matrix passively by just update $\sigma^{(g)}$ after every generation, the effect will sparks when we realized the sample $y_i$. By doing this way we can just update $\sigma^{(g)}$ separately and not to care much about the algebraic operations.
Generating $y_i$ first would be more efficient, since we can “reconstruct” $x_i$ with $m^{(g)}$ and $\sigma^{(g)}$ which are already defined in previous generation. This is captured in the ensmallen code:
for (size_t j = 0; j < lambda; ++j){
if (iterate.n_rows > iterate.n_cols){
pStep[idx(j)] = covLower *
arma::randn<BaseMatType>(iterate.n_rows, iterate.n_cols);
}
else {
pStep[idx(j)] = arma::randn<BaseMatType>(iterate.n_rows, iterate.n_cols)
* covLower;
}
pPosition[idx(j)] = mPosition[idx0] + sigma(idx0) * pStep[idx(j)];
//...
}
Update process: After realized the value of $y_i$
We want update the covariance matrix in the direction of the most successful vector i.e $y_{1:\lambda}$ or in the direction of $\mu$ most successful vectors i.e : $y_{1:\lambda}, y_{2:\lambda},…, y_{\mu:\lambda}$
The first choice is called Rank-One-Update:
\[C^{(g+1)} = (1-c_1)C^{(g)} + c_1y_{1:\lambda}y_{1:\lambda}^{T}\]With $c_1$ is smoothing coefficient, see more in Smoothing Technique. Denoted $y^{(g)} = y_{1:\lambda}$ for short. But the outer product $y^{(g)}(y^{(g)})^{T}$ has positive entries, so what if the coordinates of $y^{(g)}$ are negative? We will favor the opposite direction. To fix this, author utilized the previous “memory” to give a new update a sense which direction algorithm is updating and averaging out the positive or negative coefficients.
Evolution Path:
So instead of using only one successful $y^{(g)}$, we will use the overall tendency of the most $\mu$ successful steps and constantly update it after each generation.
\[⁍\]The term $\sqrt{c_c(2-c_c)\mu_{eff}}$ is to “averaging” out the $-1$ or $1$ coefficients of the changes, this were addressed further more in [6]Tutorial.
Now we already have an update procedure to new “effective” covariance matrix, but we still missing one quantity which not yet update - step-size control $\sigma^{(g)}$.
Update procedure for $\sigma^{(g)}$:
Based on following reasons, we should update $\sigma^{(g)}$ according to the changes of evolution path and from that, propose a separated quantity to capture these changes.
- Whenever the evolution path is short, single steps cancel each other out (this is due to the fact that: every vector is produced from a stochastic process) So if steps extinguish each other, the step-size should be decreased
- Whenever the evolution path is long, the single steps are pointing to similar direction. Because the steps are similar, the same distance can be covered fewer but longer steps into the same direction. So in this case, to control the steps growing in the same direction, the step-size should be increased
So what are the “short” and “long” mean ? The steps is short or long with respective to which quantity.
\[y_i \sim \mathcal{N}(0, C^{(g)}) \rightarrow (C^{(g)})^{\frac{-1}{2}}y_i \sim \mathcal{N}(0, \textbf{I})\]By just multiplying the steps with $(C^{(g)})^{\frac{-1}{2}}$we get an isotropic vector lies in $\mathcal{N}(0, \textbf{I})$. So the expected length of steps is $\mathbb{E}|\mathcal{N}(0,\textbf{I})|$. So the conjugate path $p_{\sigma}^{(g+1)}$ of $p_{c}^{g+1}$with step now lies in $\mathcal{N}(0, \textbf{I})$ would be:
\[p_{\sigma}^{(g+1)} = (1-c_\sigma)p_{\sigma}^{(g)} + \sqrt{c_\sigma(2-c_\sigma)\mu_{eff}}(C^{(g)})^{\frac{-1}{2}} \dfrac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}}\]Update on $\sigma^{(g+1)}$
\[\sigma^{(g+1)} = \sigma^{(g)}\exp \left( \dfrac{c_\sigma}{d_\sigma} \left(\dfrac{\|p_{\sigma}^{(g+1)}\|}{\mathbb{E}\|\mathcal{N}(0,\textbf{I})\|} - 1 \right) \right)\]By giving to $\sigma^{(g)}$ an update based on the length of previous sampled step, we can meet a case when $\sigma^{(g)}$ is too small, and suddenly new generated steps are big then the update to the new $p_c^{(g+1)}$ can be unreasonably big. To overcome this situation, author proposed another control factor $h_{\sigma} = \mathbb{I}{ {\dfrac{ | p_\sigma | }{1-(1-c_\sigma)^{2*(g+1)}} < (1.4+\dfrac{2}{n+1})\mathbb{E}|\mathcal{N}(0,\textbf{I})|} }$. And the new update on cumulative path should be:
\[p_c \leftarrow (1-c_c)p_c + h_\sigma\sqrt{c_c(2-c_c)\mu_{eff}}(\textbf{y}.w)\]This is captured in the ensmallen code:
const double enn = std::sqrt(iterate.n_elem) * (1.0 - 1.0 /
(4.0 * iterate.n_elem) + 1.0 / (21 * std::pow(iterate.n_elem, 2)));
const double h = (1.4+2.0/(iterate.n_elem+1.0))*enn
if ((psNorm / sqrt(1 - std::pow(1 - cs, 2 * i))) < h){
pc[idx1] = (1 - cc) * pc[idx0] + std::sqrt(cc * (2 - cc) *
muEffective) * step;
} else {
pc[idx1] = (1 - cc) * pc[idx0];
}
Since I have changed the weight and it does not affect on $\textbf{y}$ in the old code. So this part I do not need to change anything. But the next part, there is going to have big changes.
The second choice to the decision is called Rank-$\mu$-Update
In the original paper, author suggested to add $\mu$ direction into the covariance matrix and smoothing the old memory
\[C^{(g+1)} = (1-c_\mu)C^{(g)} + c_\mu \sum_{i=1}^{\mu}w_iy_{i:\lambda}^{(g+1)}(y_{i:\lambda}^{(g+1)})^{T}\]But only include $\mu$ is not the best strategy, as we discussed on above. So by updating new weight, the $\text{rank}-\mu-\text{Update}$ needs to be changed.
\[C^{(g+1)} = (1-c_{\mu})C^{(g)} + c_\mu\sum_{i=1}^{\lambda}w_{i}^{\circ}y_{i:\lambda}^{(g+1)}(y_{i:\lambda}^{(g+1)})^{T}\]$w_{i}^{\circ} = w_{i}$ x (1 if $w_i \geq 0$ else $n/|C^{\frac{-1}{2}}y_{i:\lambda}|$
Combine this with Rank-1-Update, we have following update to covariance matrix:
\[C \leftarrow (1+c_1\delta(h_\sigma)-c_1-c_\mu \sum w_j)C + c_1p_cp_c^{\textbf{T}} + c_\mu\sum_{i=1}^{\lambda}w_{i}^{\circ}y_{i:\lambda}^{(g+1)}(y_{i:\lambda}^{(g+1)})^{\textbf{T}}\]$\delta{(h_\sigma)} = (1-h_{\sigma})c_c(2-c_c) \leq 1$
In compare with the implemented formula in ensmallen:
\[C \leftarrow (1-c_1-c_{\mu}\sum{w_j})C + c_1(p_cp_c^{\textbf{T}}+(1-h_\sigma)c_c(2-c_c)C) + c_\mu\sum_{i=1}^{\mu}w_{i}y_{i:\lambda}^{(g+1)}(y_{i:\lambda}^{(g+1)})^{\textbf{T}}\]$w_{i}^{\circ}$ is a new term, with negative weight, we have to multiply $w_i$ with $(C^{\frac{-1}{2}})^{(g)}y_{i:\lambda}$, but this quantity is a vector which were generated by $\mathcal{N}(0,\textbf{I})$ in the code
pStep[idx(j)] = covLower * arma::randn<BaseMatType>(iterate.n_rows, iterate.n_cols);
So I just need to recorded these vectors($\lambda$ vectors) into a vector $\textbf{z}$.
std::vector<BaseMatType> z(lambda, BaseMatType>(iterate.n_rows, iterate.n_cols));
for (size_t j = 0; j < lambda; ++j){
if (iterate.n_rows > iterate.n_cols){
z(j) = arma::randn<BaseMatType>(iterate.n_rows, iterate.n_cols);
pStep[idx(j)] = covLower*z(j)
} else {
z(j) = arma::randn<BaseMatType>(iterate.n_rows, iterate.n_cols);
pStep[idx(j)] = z(j)*covLower;
}
}
I suggest new implementation of covariance update (on the left) versus the old implementation(on the right)
const double hs = (psNorm / sqrt(1 -
std::pow(1 - cs, 2 * i))) < h) ? 1 : 0;
const double deltahs = (1-hs)*cc*(2-cc)
//Update cumulative path
pc[idx1] = (1 - cc) * pc[idx0] +
hs * std::sqrt(cc * (2 - cc) * muEffective) * step;
// Intiate new C by old C term
C[idx1] = (1+c1*deltahs-c1-cmu*arma::accu{w})*C[idx0]
if (iterate.n_rows > iterate.n_cols){
// Add rank-one update to covariance
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 *
(pc[idx1] * pc[idx1].t());
}
else{
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 *
(pc[idx1].t() * pc[idx1]);
}
// Add rank-mu term with new weight idea
if (iterate.n_rows > iterate.n_cols)
{
for (size_t j = 0; j < lambda; ++j)
{
if(w(j) < 0) w(j) *= n/arma::norm(z(j))
C[idx1] = C[idx1] + cmu * w(j) *
pStep[idx(j)] * pStep[idx(j)].t();
}
}
else
{
for (size_t j = 0; j < lambda; ++j)
{
if(w(j) < 0) w(j) *= n/arma::norm(z(j))
C[idx1] = C[idx1] + cmu * w(j) *
pStep[idx(j)].t() * pStep[idx(j)];
}
}
if ((psNorm / sqrt(1 - std::pow(1 - cs, 2 * i))) < h)
{
pc[idx1] = (1 - cc) * pc[idx0] + std::sqrt(cc * (2 - cc) *
muEffective) * step;
if (iterate.n_rows > iterate.n_cols)
{
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 *
(pc[idx1] * pc[idx1].t());
}
else
{
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 *
(pc[idx1].t() * pc[idx1]);
}
}
else
{
pc[idx1] = (1 - cc) * pc[idx0];
if (iterate.n_rows > iterate.n_cols)
{
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 * (pc[idx1] *
pc[idx1].t() + (cc * (2 - cc)) * C[idx0]);
}
else
{
C[idx1] = (1 - c1 - cmu) * C[idx0] + c1 *
(pc[idx1].t() * pc[idx1] + (cc * (2 - cc)) * C[idx0]);
}
}
if (iterate.n_rows > iterate.n_cols)
{
for (size_t j = 0; j < mu; ++j)
{
C[idx1] = C[idx1] + cmu * w(j) *
pStep[idx(j)] * pStep[idx(j)].t();
}
}
else
{
for (size_t j = 0; j < mu; ++j)
{
C[idx1] = C[idx1] + cmu * w(j) *
pStep[idx(j)].t() * pStep[idx(j)];
}
}
The new implementation discarded the one if else
statement by introducing the deltahs
variable. Also add new changes to the weight w
by recording z(j)
2.3 CMA-ES enhancements
2.3.1 Active-CMA-ES
There’re minor modifications on the covariance update formula from [3] : Using negative weights on bad mutation offspring in formula 16th from:
\[\begin{aligned}\boldsymbol{C}^{(g+1)} &=\left(1-c_{\mu} \sum w_{i}\right) \boldsymbol{C}^{(g)}+c_{\mu} \sum_{i=1}^{\lambda} w_{i} \boldsymbol{y}_{i: \lambda}^{(g+1)} \boldsymbol{y}_{i: \lambda}^{(g+1)^{\top}} \\&=\boldsymbol{C}^{(g)^{1 / 2}}\left(\mathbf{I}+c_{\mu} \sum_{i=1}^{\lambda} w_{i}\left(\boldsymbol{z}_{i: \lambda}^{(g+1)} \boldsymbol{z}_{i: \lambda}^{(g+1)^{\top}}-\mathbf{I}\right)\right) \boldsymbol{C}^{(g)^{1 / 2}} \end{aligned}\]To this formula [1] where $Z$ is $\text{rank}-\mu$ covariance matrix:
\[⁍\]Furthermore, in order to maintain stationary of the expectation of the covariance matrix under random selection, Active-CMA-ES change the covariance matrix update formula 30 from:
\[\begin{aligned}&\boldsymbol{C}^{(g+1)}=(\underbrace{1-c_{1}-c_{\mu} \sum w_{j}}_{\text {can be close or equal to } 0}) \boldsymbol{C}^{(g)}\\&+c_{1} \underbrace{\boldsymbol{p}_{\mathrm{c}}^{(g+1)} \boldsymbol{p}_{\mathrm{c}}^{(g+1)^{\top}}}_{\text {rank-one update }}+\underbrace{\sum_{i=1}^{\lambda} w_{i} \boldsymbol{y}_{i: \lambda}^{(g+1)}\left(\boldsymbol{y}_{i: \lambda}^{(g+1)}\right)^{\top}}_{\text {rank- } \mu \text { update }}\end{aligned}\]To this formula where $\beta = \frac{4\mu-2}{(n+12)^{12}+4\mu}$, $c_{cov} = \frac{2}{(n+\sqrt{2})^{2}}$ (found by non-linear fitting to lots of test functions):
\[\mathbf{C} \leftarrow\left(1-c_{\text {cov }}\right) \mathbf{C}+c_{\operatorname{cov}} \mathbf{p}_{\mathbf{C}} \mathbf{p}_{\mathbf{C}}^{\mathrm{T}}+\beta \mathbf{Z}\]Implementation details from ensmallen to below codes:
const size_t mu = std::round(lambda / 2);
BaseMatType w = std::log(mu + 0.5) - arma::log(
arma::linspace<BaseMatType>(0, mu - 1, mu) + 1.0);
w /= arma::accu(w);
if (iterate.n_rows > iterate.n_cols){
for (size_t j = 0; j < mu; ++j){
C[idx1] = C[idx1] + cmu * w(j) * pStep[idx(j)] * pStep[idx(j)].t();
}
}
else{
for (size_t j = 0; j < mu; ++j){
C[idx1] = C[idx1] + cmu * w(j) * pStep[idx(j)].t() * pStep[idx(j)];
}
}
const size_t mu = std::round(lambda / 2);
const float beta = (4*mu-2)/((iterate.n_elements+12)**2+4*mu)
BaseMatType w = arma::vec(mu).fill(1/mu);
if (iterate.n_rows > iterate.n_cols){
for (size_t j = 0; j < lambda; ++j){
C[idx1] = C[idx1] +beta * w(j) * pStep[idx(j)] * pStep[idx(j)].t();
C[idx1] = C[idx1] - beta * w(j) * pStep[idx(lambda-j+1)] * pStep[idx(lambda-j+1)].t();
}
}
else{
for (size_t j = 0; j < lambda; ++j){
C[idx1] = C[idx1] + beta * w(j) * pStep[idx(j)].t() * pStep[idx(j)];
C[idx1] = C[idx1] - beta * w(j) * pStep[idx(lambda-j+1)].t() * pStep[idx(lambda-j+1)];
}
}
2.3.2 IPOP-CMA-ES
This can be called as a restart-CMA evolution strategy, where the population size $\lambda$ is increased by a factor of $\alpha$ (normally $\alpha$ is set to be between 2 and 3) and all other parameters will be reset for each restart (IPOP). We will have a loop with maximum number of restarts ($n_{\text{restart}}$), for each iteration, the original CMA-ES will be called and the returned solution will be stored if it’s the best one in the loop. The loop only breaks when the number of function evaluations exceeds $n\times 10^{4}$ or the objective function error value is below $10^{-8}$. Furthermore, in each iteration, the optimization process of CMA-ES can be ended “early” for the loop to skip to the following restart if the optimization process meet the stopping criteria:
- Stop if the range of the best objective function values of the last $10 + \lceil \frac{30}{n}\rceil$ generations is zero ($\text{equalfunvalhist}$), or the range of these function values and all function values of the recent generation is below $\text{Tolfun} = 10^{-12}$.
- Stop if the standard deviation of the normal distribution is smaller than $\text{TolX}$ in all coordinates and $p_c$ is smaller than $\text{TolX}$ in all components. We set $\text{TolX}= 10^{-12}$. where $p_c \leftarrow (1-c_c)p_c + h_\sigma\sqrt{c_c(2-c_c)\mu_{eff}}(\textbf{y}.w)$ for each update iteration of optimization process (not the restart iteration).
- Stop if adding a 0.1-standard deviation vector in a principal axis direction of $C^{(g)}$ does not change $m^{(g)}$ ($\text{noteffectaxis}$).
- Stop if adding a 0.2-standard deviation vector in each coordinate does not change $m^{(g)}$ ($\text{noteffectcoord}$).
- Stop if the $|C^{(g)}|\times |(C^{(g)})^{-1}|$ (condition number of the covariance matrix) exceeds $10^{14}$ ($\text{conditionconv}$).
template<typename SelectionPolicyType>
template<typename SeparableFunctionType,
typename MatType,
typename... CallbackTypes>
typename MatType::elem_type IPOPCMAES<SelectionPolicyType>::Optimize(
SeparableFunctionType& function,
MatType& iterateIn,
CallbackTypes&&... callbacks){
//... Initialization of all neccesary parameters <only the types except the lambda>
const int n_restart = 9;
BaseMatType& best_solution;
ElemType ipop_best_objective;
const float factor = 2;
//Iteratively called the same block of code of optimize method of CMAES
for (int i=0; i<n_restart; ++i){
//.. Initialize parameters with values to reset parameters purposefully
for (size_t j = 1; j < maxIterations){
//... Cholensky factorization, choosing mu parents, evaluate the objectives
if (currentObjective < overallObjective)
{
overallObjective = currentObjective;
iterate = mPosition[idx1];
}
//... (the update of parameters)
if (IPOPCMAES<SelectionPolicyType>::TolFun()){break;}
if (IPOPCMAES<SelectionPolicyType>::TolX(pc)){break;}
if (IPOPCMAES<SelectionPolicyType>::noteffectaxis(C, mPosition)){break;}
if (IPOPCMAES<SelectionPolicyType>::noteffectcoord(mPosition)){break;}
if (IPOPCMAES<SelectionPolicyType>::conditionconv(C)){break;}
}
//Get the best value among the restarts
if (overallObjective < ipop_best_objective){
ipop_best_objective = overallObjective;
best_solution = iterate;
}
lambda = lambda*factor;
if (std::isnan(ipop_best_objective) || std::isinf(ipop_best_objective))
{
Warn << "IPOP-CMA-ES: converged to " << ipop_best_objective << "; "
<< "terminating with failure. Try a smaller step size?" << std::endl;
Callback::EndOptimization(*this, function, best_solution, callbacks...);
return ipop_best_objective;
}
if (std::abs(lastObjective - ipop_best_objective) < tolerance)
{
Info << "IPOP-CMA-ES: minimized within tolerance " << tolerance << "; "
<< "terminating optimization." << std::endl;
Callback::EndOptimization(*this, function, best_solution, callbacks...);
return ipop_best_ojective;
}
}
}
2.3.3 Sep-CMA-ES
To enable the algorithm for large scale optimization, a linear time and space version called sep-CMA-ES was proposed by Ros and Hasen [6]. The algorithm does not learn dependencies but the scaling of variables by restraining the covariance matrix update to the diagonal elements. This means we will constraint $C = \sigma_{rescale}^{(g)}I$. Therefore the update rule for covariance matrix only cares about the diagonal elements which costs $O(n)$.
\[c_{j j}^{t+1}=\left(1-c_{c o v}\right) c_{j j}^{t}+\frac{1}{\mu_{c o v}}\left(\boldsymbol{p}_{c}^{t+1}\right)_{j}^{2}+c_{c c o v}\left(1-\frac{1}{\mu_{c c o v}}\right) \sum_{i=1}^{\mu} w_{i} c_{j j}^{t}\left(z_{i: \lambda}^{t+1}\right)_{j}^{2}, j=1, \ldots, n\]2.3.4 Cholesky-CMA-ES
The advantage of CMA-ES is learning the dependencies between $n$ decision variables, also forms its main practical limitations such as $O(n^{3})$ to do Cholesky factorization for rolling out mutation vectors from Multivariable Normal Distribution. This’s extremely expensive for large-scale optimization. Cholesky-CMA-ES [7] can prove that we can update the Cholesky factors iteratively in the certain way such that this equals with continuously update the covariance matrix then factorize it into Cholesky factors again. This update rule for Cholesky factors is inly true for rank-1 update.
\[\boldsymbol{C}^{(g+1)}=\alpha \boldsymbol{C}^{(g)}+\beta \boldsymbol{p}_{c}^{(g)} \boldsymbol{p}_{c}^{(g)^{\mathrm{T}}}\]Update rules for Cholesky factor and its inverse (this can be understand as alternative update rule for covariance matrix)
\[\boldsymbol{A}^{(g+1)}=\sqrt{\alpha} \boldsymbol{A}^{(g)}+\frac{\sqrt{\alpha}}{\left\|\boldsymbol{z}^{(g)}\right\|^{2}}\left(\sqrt{1+\frac{\beta}{\alpha}\left\|\boldsymbol{z}^{(g)}\right\|^{2}}-1\right)\left[\boldsymbol{A}^{(g)} \boldsymbol{z}^{(g)}\right] \boldsymbol{z}^{(g)^{T}}\] \[\boldsymbol{A}^{(g+1)^{-1}}=\frac{1}{\sqrt{\alpha}} \boldsymbol{A}^{g^{-1}}-\frac{1}{\sqrt{\alpha}\left\|\boldsymbol{z}^{(g)}\right\|^{2}}\left(1-\frac{1}{\sqrt{1+\frac{\beta}{\alpha}\left\|\boldsymbol{z}^{(g)}\right\|^{2}}}\right) \boldsymbol{z}^{(g)}\left[\boldsymbol{z}^{(g)^{T}} \boldsymbol{A}^{(g)^{-1}}\right]\]where $\boldsymbol{z}{t}$ is a vector $R^{n}$ such that $\boldsymbol{p}^{(g)}{c} = \boldsymbol{A}^{(g)}\boldsymbol{z}^{(g)}$.
2.4 API Design
Since I am not full aware of the architecture that ensmallen are using. So there will be some points I implement the API in the way that is not compatible with ensmallen. I will need some advices on this problem. If my idea of splitting the CMA-ES into small components(stop the code duplication in improved methods) become infeasible, then I simply back to the idea of implementing each of algorithm separately.
The idea is inspired when I look into moead folder where author implement MOEA/D-DE, written by Nanubala Gnana Sai. Also I modularize both improvements and original algorithm to similar structure of CMA-ES/libcmaes.
The folder uses object-oriented policy-based design which is compatible with current ensmallen and makes extensive use of templates.
All the code will be under “cmaes” in the include/ensmallen_bits folder.
.
+-- selection_policies
+-- full_selection.hpp
+-- random_selection.hpp
+-- stop_policies
+-- default_stop.hpp
+-- ipop_stop.hpp
+-- cmaparameters.hpp
+-- cmaes.hpp
+-- cmaes_impl.hpp
+-- sepcmaes.hpp
+-- sepcmaes_impl.hpp
+-- cholcmaes.hpp
+-- cholcmaes_impl.hpp
The cmaes_impl.hpp
, sepcmaes.hpp
, cholcmaes.hpp
and maybe more similar improvements will be included. Most of them are different in covariance matrix update part. So it is better to code the covariance matrix update separately in each of improvement. In future or during the program, if it is feasible then I can implement a separate function covariance_update
The significant change is introducing new class cmaparameters
. cmaparameters
object captures hyper-parametes like $\mu, h_{\sigma},…$ and inputs (iterate) to the algorithm, such as offspring per generation, initial step-size, lower bound, upper bound, … Also I will leave the $\lambda$ as a parameter to this object. By introducing this class, we can reduce the amount of code to write fixed parameters, most of them are constant in the beginning of instantiation of the class. And minor changes to this object if we use different improvements.
namespace ens{
class CMAparameters {
public:
CMAparameters() {}
template<typename MatType>
CMAparameters(MatType& x0,
const size_t &lambda = 0,
const double &sigma,
const size_t batch
const double lowerBound
const double upperBound,
const size_t maxIterations)
void set_algo(std::string){
// setup the params depends on which algo we are using
}
// get param functions
private:
// General params
}
}
So below is the cmaes.hpp
structure, every improvement should be like this, the only difference is in Optimize function.
#include "selection_policies/full_selection.hpp"
#include "selection_policies/random_selection.hpp"
#include "stop_policies/default_stop.hpp"
#include "stop_policies/ipop_policies.hpp"
#include "cmaparameters.hpp"
namespace ens{
template<typename SelectionPolicyType = FullSelection,
typename StopPolicyType = DefaultStop>
class CMAES{
public:
CMAES(){};
CMAES(
CMAparameters& params = CMAparameters();
const SelectionPolicyType& selectionPolicy = SelectionPolicyType(),
const StopPolicyType& stopPolicy = StopPolicyType());
template<typename SeparableFunctionType,
typename... CallbackTypes>
typename MatType::elem_type Optimize(SeparableFunctionType& function,
CallbackTypes&&... callbacks);
// Optimize function are different from each other (covariance update)
private:
CMAparametes params;
SelectionPolicyType selectionPolicy;
StopPolicyType stopPolicy;
}
template<typename SelectionPolicyType = RandomSelection, StopPolicyType = DefaultStop>
using ApproxCMAES = CMAES<SelectionPolicyType, StopPolicyType >;
}
In order to use optimize a function, first we have to instantiate the CMAparameters
including the first start vector.
CMAparameters params = CMAparameters(arma::uvec(1.0, 2.0, 3.0), 4+3*std::log(3), 4, -10.0, 10.0)
ApproxCMAES<> optimizer = ApproxCMAES<>(params);
optimizer.optimize(function)
2.5 Testing and Benchmark
In order to check the functionality of CMA-ES in ensmallen, 4 tests was implemented in ensmallen. However, these tests mostly are for sanity checks and only examine CMA-ES with logistic regression function under different selection policies and matrix types (float or double).
#include <ensmallen.hpp>
#include "catch.hpp"
#include "test_function_tools.hpp"
using namespace ens;
using namespace ens::test;
TEST_CASE("CMAESLogisticRegressionTest", "[CMAESTest]")
{
CMAES<> cmaes(0, -1, 1, 32, 200, 1e-3);
LogisticRegressionFunctionTest(cmaes, 0.003, 0.006, 5);
}
TEST_CASE("ApproxCMAESLogisticRegressionTest", "[CMAESTest]")
{
ApproxCMAES<> cmaes(0, -1, 1, 32, 200, 1e-3);
LogisticRegressionFunctionTest(cmaes, 0.003, 0.006, 5);
}
TEST_CASE("CMAESLogisticRegressionFMatTest", "[CMAESTest]")
{
CMAES<> cmaes(0, -1, 1, 32, 200, 1e-3);
LogisticRegressionFunctionTest<arma::fmat>(cmaes, 0.01, 0.02, 5);
}
TEST_CASE("ApproxCMAESLogisticRegressionFMatTest", "[CMAESTest]")
{
ApproxCMAES<> cmaes(0, -1, 1, 32, 200, 1e-3);
LogisticRegressionFunctionTest<arma::fmat>(cmaes, 0.01, 0.02, 5);
}
In my plan, multiple test cases including test function suits, a real data optimization test, will be added. (Notes: if the timeline is not allowed then the implementation order will be described as below)
Common function tests for CMA-ES
The most adopted test suit for CMA-ES family is series of unimodal functions which can be separable (Sphere, Ellipsoid) or low conditioning (Rosenbrock) or high conditioning (Discus, Cigar and Different Powers). In ensmallen, only sphere and Rosenbrock functions are implemented (in ensmallen/include/ensmallen_bits/problems), however, there is no test case for CMA-ES with them. Notes that in CMA-ES with Optimal Covariance Update and Storage Complexity (acm.org), the initial points get from $[0,1]$ excepts the sphere gets from $\mathcal{N}(0,1)$ therefore I will set the bound to be $[-1,1]$.
\[\begin{array}{ll}\hline \text { Name } & f(\boldsymbol{x}) \\\hline \text { Sphere } & \|\boldsymbol{x}\|^{2} \\\text { Rosenbrock } & \sum_{i=0}^{d-1}\left(100\left(x_{i+1}-x_{i}^{2}\right)^{2}+\left(1-x_{i}\right)^{2}\right) \\\text { Discus } & x_{0}^{2}+\sum_{i=1}^{d} 10^{-6} x_{i}^{2} \\\text { Cigar } & 10^{-6} x_{0}^{2}+\sum_{i=1}^{d} x_{i}^{2} \\\text { Ellipsoid } & \sum_{i=0}^{d} 10^{\frac{-6 i}{d-1}} x_{i}^{2} \\\text { Different Powers } & \sum_{i=0}^{d}\left|x_{i}\right|^{\frac{2+10 i}{d-1}} \\\hline\end{array}\]TEST_CASE("CMAESSphereFunctionTest", "[CMAESTest]") { CMAES<> cmaes(0, -1, 1, 32, 200, 1e-3); FunctionTest<SphereFunction, arma::sp_mat>(s, 0.03, 0.003); } TEST_CASE("CMAESRosenbrockTest", "[CMAESTest]") { for (size_t i = 10; i < 50; i += 5) { GeneralizedRosenbrockFunction f(i); CMAES<> cmaes(0, 0, 1, 32, 200, 1e-3); arma::mat coordinates = f.GetInitialPoint(); double objectives = s.Optimize(f, coordinates); REQUIRE(objectives == Approx(0.0).margin(1e-4)); for (size_t j = 0; j < i; ++j) REQUIRE(coordinates(j) == Approx(1.0).epsilon(1e-4)); } } TEST_CASE("CMAESSDiscusFunctionTest", "[CMAESTest]") { CMAES<> cmaes(0, 0, 1, 32, 200, 1e-3); //Original CMA-ES achieves approximately 10e-1 Discus FunctionTest<DiscusFunction, arma::sp_mat>(s, 0.01, 0.003); } TEST_CASE("CMAESSCigarFunctionTest", "[CMAESTest]") { CMAES<> cmaes(0, 0, 1, 32, 200, 1e-3); //Original CMA-ES achieves medianly approximately 10e-3 FunctionTest<DiscusFunction, arma::sp_mat>(s, 0.0001, 0.00001); } //... Ellipsoid and different powers functions test case.
Here is to-do list for common function tests:
- Implementing Discus, Cigar, Ellipsoid and Different Powers in “ensmallen/include/ensmallen_bits/problems”
- Implementing test cases for each function and tolerant threshold.
Multi-modal test suit
Real-world application functions are mostly multi-modal preventing simple optimization algorithm to work well. In a multi-modal benchmarking paper for CMA-ES by Nikolaus Hansen [8], a series of multi-modal tests have been proposed. However, due to the number and the complexity of these functions, I will only be implementing below functions. In ensmallen test functions, Ackley and Rastrigin functions are implemented.
\[\begin{array}{ll}\hline \text { Name } & f(\boldsymbol{x}) \\\hline \text { Ackley } & \begin{aligned} 20-20 & \exp \left(-0.2 \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_{i}^{2}}\right) \\ &+\mathrm{e}-\exp \left(\frac{1}{n} \sum_{i=1}^{n} \cos \left(2 \pi x_{i}\right)\right) \end{aligned} \\\text { Bohachevsky } & \sum_{i=1}^{n-1}(x_{i}^{2}+2 x_{i+1}^{2} \left.-0.3 \cos \left(3 \pi x_{i}\right)-0.4 \cos \left(4 \pi x_{i+1}\right)+0.7\right) \\\text { Griewank } & \frac{1}{4000} \sum_{i=1}^{n} x_{i}^{2}-\prod_{i=1}^{n} \cos \left(\frac{x_{i}}{\sqrt{i}}\right)+1 \\\text { Rastrigin } & 10 n+\sum_{i=1}^{n}\left(x_{i}^{2}-10 \cos \left(2 \pi x_{i}\right)\right)\\\hline\end{array}\]Here is to-do list for multi-modal tests
- Implementing Bohachevsky and Griewank functions
- Implementing test cases + initialization bound for them based on the paper.
A real-world data test
Titanic prediction using logistic regression function is quite well-known task. For this reason, I will do data formatting for the original titanic dataset to be numerical dataset. After that, I save the data into a csv file and use arma matrix load method (arma::mat load) with arma::csv_ascii features to create a logistic regression dataset, then do optimization as usual.
TEST_CASE("CMAESLogisticRegressionDataTest", "[CMAESTest]") { ApproxCMAES<> cmaes(0, -10, 10, 32, 1000, 1e-3); //Logistic regression only achieve approximately 70% on testset of titanic LogisticRegressionFunctionRealTest<arma::fmat>(cmaes, 0.2, 0.3, 5); }
void LogisticRegressionFunctionRealTest(OptimizerType& optimizer, const double trainAccuracyTolerance, const double testAccuracyTolerance, const size_t trials = 1) { MatType data, testData; arma::Row<size_t> Prediction, testPrediction; data.load("data/titanic_train.csv", arma::csv_ascii); testData.load("data/titanic_test.csv", arma::csv_ascii); for (size_t i = 0; i < trials; ++i) { ens::test::LogisticRegression<MatType> lr(data, prediction, 0.5); MatType coordinates = lr.GetInitialPoint(); optimizer.Optimize(lr, coordinates); const double acc = lr.ComputeAccuracy(data, Prediction, coordinates); const double testAcc = lr.ComputeAccuracy(testData, testPrediction, coordinates); if (i != (trials - 1)) { if (acc != Approx(100.0).epsilon(trainAccuracyTolerance)) continue; if (testAcc != Approx(100.0).epsilon(testAccuracyTolerance)) continue; } REQUIRE(acc == Approx(100.0).epsilon(trainAccuracyTolerance)); REQUIRE(testAcc == Approx(100.0).epsilon(testAccuracyTolerance)); break; } }
2.6 Documentation
Despite of the decent quality documentation of CMA-ES “https://ensmallen.org/docs.html#cmaes”, the current CMA-ES implementation (comment for important lines of code) is lack of comments on the size of parameters matrix, mathematical formula. The implementation should be equipped with theoretical source and minor modifications from the original source (if any). Furthermore, I plan to create a tutorial (the format and the structure might take reference from PyTorch tutorial) on how to use CMA-ES optimizer for Application to Feedback Control of Combustion [9].
3 Timeline
| | Start/End Time | Duration | Detailed Tasks | | — | — | — | — | | Community Bonding | May 20 - June 12 | 3 weeks | - Communicating with mentors.
- Find out similar improvement of CMA-ES, which can easily integrated with my proposed API.
- Redesign the API to fit more derived CMA-ES.
- Drafting out how algorithm can be implemented using only arma and std.
Contribute to mlpack source code. Pre-Coding-Period June 13-June 19 1 week - Checking with mentors again how API should be like. - Starting communicate with mentors more frequently.
Clarify on previous implemented CMA-ES. Coding June 20 - July 3 2 weeks - Decide how many and which improvements should be included into cmaes folder. - Design API to fit with all improvements.
Broke down the pass parameters, code the overall API and header files of each modules, algorithms. July 4 - July 17 2 weeks - Done with implementing all improvements. Checking with mentors is all implementation fine. July 18 - July 31 2 weeks - Starts to integrate whole folder into ensmallen library. - config the cmake, and build, see is there any error.
- Writing sanity tests and basic tests.
Benchmarking with other implemented CMA-ES performance on basic functions. August 1 - August 14 2 weeks - Writing documentation and try to first large merge. Hearing the feedbacks and check errors. August 15 - August 28 2 weeks - Write more sophisticated tests and visualization of performance August 29 - September 11 2 weeks - Reserved time for any delay time from above Get consults from mentors and finish everything.
References
[1] Improving evolution strategies through active covariance matrix adaptation. IEEE Xplore. (n.d.). Retrieved April 19, 2022, from https://ieeexplore.ieee.org/document/1688662
[2] Convergence of evolutionary algorithms in general search spaces. IEEE Xplore. (n.d.). Retrieved April 19, 2022, from https://ieeexplore.ieee.org/document/542332
[3] Hansen, N. (2016, April 4). The CMA evolution strategy: A tutorial. arXiv.org. Retrieved April 19, 2022, from https://arxiv.org/abs/1604.00772
[4] Benchmarking a weighted negative … - école polytechnique. (n.d.). Retrieved April 19, 2022, from http://www.cmap.polytechnique.fr/~nikolaus.hansen/ws1p33-hansen.pdf
[5] A restart CMA Evolution Strategy … - école polytechnique. (n.d.). Retrieved April 19, 2022, from http://www.cmap.polytechnique.fr/~nikolaus.hansen/cec2005ipopcmaes.pdf
[6] A simple modification in CMA-ES achieving linear … - inria. (n.d.). Retrieved April 19, 2022, from https://hal.inria.fr/inria-00287367/document
[7] CMA-ES with optimal covariance update and … - papers.nips.cc. (n.d.). Retrieved April 19, 2022, from https://papers.nips.cc/paper/2016/file/289dff07669d7a23de0ef88d2f7129e7-Paper.pdf
[8] Evaluating the CMA Evolution Strategy on Multimodal Test Functions Retrieved April 19, 2022, from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.2888&rep=rep1&type=pdf
[9] [A Method for Handling Uncertainty in Evolutionary Optimization With an Application to Feedback Control of Combustion | IEEE Journals & Magazine | IEEE Xplore](https://ieeexplore.ieee.org/document/4634579) |