Example Introduction
The ADAM algorithm for deep learning, shared here for study. The text below is from the Adam paper (published as a conference paper at ICLR 2015).

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, and is invariant to diagonal rescaling of the gradients.
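To make the method concrete before the excerpted sections below, here is a minimal NumPy sketch of the update rule the paper describes as Algorithm 1, using the paper's suggested defaults alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8. The toy quadratic objective and the function name are illustrative choices, not part of the paper.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (as in the paper's Algorithm 1), element-wise on NumPy arrays.

    theta: parameters; grad: gradient of the stochastic objective at theta;
    m, v: running first and second moment estimates (initialized to zeros);
    t: timestep starting at 1.
    """
    m = beta1 * m + (1 - beta1) * grad            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2 from a random starting point.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta                              # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                      # close to the optimum at zero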
The effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t/\sqrt{\hat{v}_t}$. It has two upper bounds: $|\Delta_t| \le \alpha \cdot (1-\beta_1)/\sqrt{1-\beta_2}$ in the case $(1-\beta_1) > \sqrt{1-\beta_2}$, and $|\Delta_t| \le \alpha$ otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When $(1-\beta_1) = \sqrt{1-\beta_2}$ we have that $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$, therefore $|\Delta_t| < \alpha$. In more common scenarios, we will have that $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$ since $|\mathbb{E}[g]/\sqrt{\mathbb{E}[g^2]}| \le 1$. The effective magnitude of the steps taken in parameter space at each timestep is thus approximately bounded by the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of $\alpha$ in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since $\alpha$ sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of $\alpha$ such that optima can be reached from $\theta_0$ within some number of iterations.

With a slight abuse of terminology, we will call the ratio $\hat{m}_t/\sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize $\Delta_t$ will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize $\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$ with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t)/\sqrt{c^2 \cdot \hat{v}_t} = \hat{m}_t/\sqrt{\hat{v}_t}$.

3 INITIALIZATION BIAS CORRECTION

As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let $g$ be the gradient of the stochastic objective $f$, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate $\beta_2$. Let $g_1, \dots, g_T$ be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution $g_t \sim p(g_t)$. Let us initialize the exponential moving average as $v_0 = 0$ (a vector of zeros). First note that the update at timestep $t$ of the exponential moving average $v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$) can be written as a function of the gradients at all previous timesteps:

$$v_t = (1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \quad (1)$$

We wish to know how $\mathbb{E}[v_t]$, the expected value of the exponential moving average at timestep $t$, relates to the true second moment $\mathbb{E}[g_t^2]$, so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):

$$\mathbb{E}[v_t] = \mathbb{E}\left[(1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\right] = \mathbb{E}[g_t^2] \cdot (1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta = \mathbb{E}[g_t^2] \cdot (1-\beta_2^t) + \zeta$$

where $\zeta = 0$ if the true second moment $\mathbb{E}[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small, since the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term $(1-\beta_2^t)$, which is caused by initializing the running average with zeros. In Algorithm 1 we therefore divide by this term to correct the initialization bias.
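A quick numerical check of this derivation (a sketch using synthetic stationary gradients, not data from the paper): averaging $v_t$ over many independent gradient sequences recovers $(1-\beta_2^t)\,\mathbb{E}[g^2]$, and dividing by $(1-\beta_2^t)$ removes the bias.

import numpy as np

rng = np.random.default_rng(0)
beta2, t_max, runs = 0.999, 50, 20000

# Stationary scalar gradients g ~ N(1, 0.5^2), so E[g^2] = 1^2 + 0.5^2 = 1.25.
g = rng.normal(loc=1.0, scale=0.5, size=(runs, t_max))
true_second_moment = 1.0 ** 2 + 0.5 ** 2

v = np.zeros(runs)
for t in range(1, t_max + 1):
    v = beta2 * v + (1 - beta2) * g[:, t - 1] ** 2

print("E[v_t]               ~", v.mean())                          # ~ (1 - beta2^t) * E[g^2]
print("(1 - beta2^t)*E[g^2]  =", (1 - beta2 ** t_max) * true_second_moment)
print("E[v_t]/(1 - beta2^t)  ~", v.mean() / (1 - beta2 ** t_max))  # ~ E[g^2] after correction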
In case of sparse gradients, for a reliable estimate of the second moment one needs to average over many gradients by choosing a small value of $(1-\beta_2)$, i.e. $\beta_2$ close to 1; however, it is exactly this case of slow decay where a lack of initialization bias correction would lead to initial steps that are much larger.

4 CONVERGENCE ANALYSIS

We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich, 2003). Given an arbitrary, unknown sequence of convex cost functions $f_1(\theta), f_2(\theta), \dots, f_T(\theta)$, at each time $t$ our goal is to predict the parameter $\theta_t$ and evaluate it on a previously unknown cost function $f_t$. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the difference between the online prediction $f_t(\theta_t)$ and the best fixed parameter $f_t(\theta^*)$ from a feasible set $\mathcal{X}$. Concretely, the regret is defined as

$$R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right]$$

where $\theta^* = \arg\min_{\theta \in \mathcal{X}} \sum_{t=1}^{T} f_t(\theta)$. We show Adam has an $O(\sqrt{T})$ regret bound, and a proof is given in the appendix. Our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$ is its $i$-th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as the vector that contains the $i$-th dimension of the gradients over all iterations till $t$: $g_{1:t,i} = [g_{1,i}, g_{2,i}, \dots, g_{t,i}]$. Also, we define $\gamma \triangleq \beta_1^2/\sqrt{\beta_2}$. Our following theorem holds when the learning rate $\alpha_t$ is decaying at a rate of $t^{-1/2}$ and the first moment running average coefficient $\beta_{1,t}$ decays exponentially with $\lambda$, which is typically close to 1, e.g. $1-10^{-8}$.

Theorem 4.1. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, and that the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \dots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$:

$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \;+\; \frac{\alpha(1+\beta_1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}\,(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 \;+\; \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}$$

Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the summation term can be much smaller than its upper bound $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \le dG_\infty\sqrt{T}$ and $\sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \le dG_\infty\sqrt{T}$, in particular if the class of functions and the data features are in the form of Section 1.2 in (Duchi et al., 2011). Their results for the expected value $\mathbb{E}[\sum_{i=1}^{d} \|g_{1:T,i}\|_2]$ also apply to Adam. In particular, adaptive methods such as Adam and AdaGrad can achieve $O(\log d \sqrt{T})$, an improvement over $O(\sqrt{dT})$ for the non-adaptive method. Decaying $\beta_{1,t}$ towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g., (Sutskever et al., 2013) suggests that reducing the momentum coefficient at the end of training can improve convergence.

Finally, we can show that the average regret of Adam converges.

Corollary 4.2. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, and that the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \dots, T\}$. Adam achieves the following guarantee, for all $T \ge 1$:

$$\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)$$

This result can be obtained by using Theorem 4.1 and $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \le dG_\infty\sqrt{T}$. Thus, $\lim_{T\to\infty} \frac{R(T)}{T} = 0$.
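As an illustration of this average-regret behaviour (not an experiment from the paper), the following sketch runs Adam with the decaying schedules from Theorem 4.1, $\alpha_t = \alpha/\sqrt{t}$ and $\beta_{1,t} = \beta_1\lambda^{t-1}$, on a stream of simple convex quadratic losses and reports $R(T)/T$ at a few horizons. The losses, constants, starting point, and the use of Algorithm 1's constant-$\beta_1$ bias-correction factor are all choices made for the sketch.

import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
alpha, beta1, beta2, eps = 0.5, 0.9, 0.999, 1e-8
lam = 1 - 1e-8                              # decay of beta_{1,t}, as in Theorem 4.1

# Online convex problem: f_t(theta) = 0.5 * ||theta - c_t||^2 with random targets c_t.
targets = rng.normal(size=(T, d))
theta_star = targets.mean(axis=0)           # best fixed parameter in hindsight

theta = np.full(d, 3.0)                     # start away from the optimum
m = np.zeros(d); v = np.zeros(d); regret = 0.0
for t in range(1, T + 1):
    c = targets[t - 1]
    regret += 0.5 * np.sum((theta - c) ** 2) - 0.5 * np.sum((theta_star - c) ** 2)
    g = theta - c                           # gradient of f_t at theta_t
    beta1_t = beta1 * lam ** (t - 1)        # beta_{1,t} = beta1 * lambda^(t-1)
    m = beta1_t * m + (1 - beta1_t) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction kept as in Algorithm 1 for simplicity
    v_hat = v / (1 - beta2 ** t)
    theta -= (alpha / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)   # alpha_t = alpha / sqrt(t)
    if t in (100, 1000, 10000, 20000):
        print(f"T = {t:6d}   average regret R(T)/T = {regret / t:.4f}")

The printed average regret shrinks as the horizon grows, consistent with the $O(1/\sqrt{T})$ rate of Corollary 4.2.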
5 RELATED WORK

Optimization methods bearing a direct relation to Adam are RMSProp (Tieleman & Hinton, 2012; Graves, 2013) and AdaGrad (Duchi et al., 2011); these relationships are discussed below. Other stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) (Sohl-Dickstein et al., 2014) is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD) (Amari, 1998), Adam employs a preconditioner that adapts to the geometry of the data, since $\hat{v}_t$ is an approximation to the diagonal of the Fisher information matrix (Pascanu & Bengio, 2013); however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaptation than vanilla NGD by preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.

RMSProp: An optimization method closely related to Adam is RMSProp (Tieleman & Hinton, 2012). A version with momentum has sometimes been used (Graves, 2013). There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of the first and second moment of the gradient. RMSProp also lacks a bias-correction term; this matters most in case of a value of $\beta_2$ close to 1 (required in case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in Section 6.4.

AdaGrad: An algorithm that works well for sparse gradients is AdaGrad (Duchi et al., 2011). Its basic version updates parameters as $\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$. Note that if we choose $\beta_2$ to be infinitesimally close to 1 from below, then $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \cdot \sum_{i=1}^{t} g_i^2$. AdaGrad corresponds to a version of Adam with $\beta_1 = 0$, infinitesimal $(1-\beta_2)$ and a replacement of $\alpha$ by an annealed version $\alpha_t = \alpha \cdot t^{-1/2}$, namely $\theta_t - \alpha \cdot t^{-1/2} \cdot \hat{m}_t / \sqrt{\lim_{\beta_2\to 1}\hat{v}_t} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t / \sqrt{t^{-1} \cdot \sum_{i=1}^{t} g_i^2} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$. Note that this direct correspondence between Adam and AdaGrad does not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a $\beta_2$ infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.

6 EXPERIMENTS

To empirically evaluate the proposed method, we investigated different popular machine learning models, including logistic regression, multilayer fully connected neural networks and deep convolutional neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems. We use the same parameter initialization when comparing different optimization algorithms. The hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the results are reported using the best hyper-parameter setting.

6.1 EXPERIMENT: LOGISTIC REGRESSION

We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST dataset. Logistic regression has a well-studied convex objective, making it suitable for comparison of different optimizers without worrying about local minimum issues. The stepsize $\alpha$ in our logistic regression experiments is adjusted by $1/\sqrt{t}$ decay, namely $\alpha_t = \alpha/\sqrt{t}$, which matches our theoretical prediction from Section 4.
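To make this setup concrete, here is a small self-contained sketch of L2-regularized multi-class logistic regression trained with Adam and the $\alpha_t = \alpha/\sqrt{t}$ decay. It uses synthetic full-batch data rather than MNIST minibatches, and every constant in it is an illustrative assumption rather than one of the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 512, 20, 3                       # samples, features, classes (synthetic stand-in data)
X = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, k))
y = np.argmax(X @ true_W + rng.normal(scale=0.1, size=(n, k)), axis=1)

def loss_and_grad(W, X, y, lam=1e-3):
    """L2-regularized multi-class logistic regression: mean NLL and its gradient."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.mean(np.log(probs[np.arange(len(y)), y])) + 0.5 * lam * np.sum(W ** 2)
    probs[np.arange(len(y)), y] -= 1.0                     # softmax gradient w.r.t. logits
    grad = X.T @ probs / len(y) + lam * W
    return nll, grad

# Adam with the 1/sqrt(t) stepsize decay used for the logistic-regression experiments.
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
W = np.zeros((d, k)); m = np.zeros_like(W); v = np.zeros_like(W)
for t in range(1, 2001):
    nll, g = loss_and_grad(W, X, y)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    W -= (alpha / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)   # alpha_t = alpha / sqrt(t)
    if t % 500 == 0:
        print(f"step {t:4d}  NLL {nll:.4f}")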
The logistic regression classifies the class label directly on the 784-dimensional image vectors. We compare Adam to accelerated SGD with Nesterov momentum and to AdaGrad, using a minibatch size of 128. According to Figure 1, we found that Adam yields similar convergence as SGD with momentum and that both converge faster than AdaGrad.

As discussed in (Duchi et al., 2011), AdaGrad can efficiently deal with sparse features and gradients as one of its main theoretical results, whereas SGD is slow at learning rare features. Adam with $1/\sqrt{t}$ decay on its stepsize should theoretically match the performance of AdaGrad. We examine the sparse feature problem using the IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000-dimensional BoW feature vector for each review is highly sparse. As suggested in (Wang & Manning, 2013), 50% dropout noise can be applied to the BoW features during training to prevent over-fitting.

Figure 1: Logistic regression training negative log likelihood on MNIST images and IMDB movie reviews with 10,000 bag-of-words (BoW) feature vectors.

In Figure 1, AdaGrad outperforms SGD with Nesterov momentum by a large margin both with and without dropout noise. Adam converges as fast as AdaGrad. The empirical performance of Adam is consistent with our theoretical findings in Sections 2 and 4. Similar to AdaGrad, Adam can take advantage of sparse features and obtain faster convergence than normal SGD with momentum.

6.2 EXPERIMENT: MULTI-LAYER NEURAL NETWORKS

Multi-layer neural networks are powerful models with non-convex objective functions. Although our convergence analysis does not apply to non-convex problems, we empirically found that Adam often outperforms other methods in such cases. In our experiments, we made model choices that are consistent with previous publications in the area; a neural network model with two fully connected hidden layers with 1000 hidden units each and ReLU activation is used for this experiment, with a minibatch size of 128.

First, we study different optimizers using the standard deterministic cross-entropy objective function with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO) method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with minibatches of data and has shown good performance on optimization of multi-layer neural networks. We used their implementation and compared it with Adam to train such models. Figure 2 shows that Adam makes faster progress in terms of both the number of iterations and wall-clock time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared to Adam, and has a memory requirement that is linear in the number of minibatches.

Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and are often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed failed to converge on cost functions with stochastic regularization.
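For orientation, the two-hidden-layer network from this section, trained with Adam under dropout regularization, can be written down compactly. The sketch below uses PyTorch and random stand-in data; PyTorch postdates the paper, the dropout rate of 0.5 and the weight-decay value are assumptions, and torch.optim.Adam supplies the update of Algorithm 1.

import torch
import torch.nn as nn

# Two fully connected hidden layers of 1000 ReLU units, as in Section 6.2;
# dropout rate 0.5 is an assumption (the text does not state it for this model).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8,
                             weight_decay=1e-4)      # L2 weight decay (illustrative value)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for MNIST minibatches of size 128.
x = torch.randn(128, 1, 28, 28)
y = torch.randint(0, 10, (128,))
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()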
We compare the effectiveness of Adam to other stochastic first-order methods on multi-layer neural networks trained with dropout noise. Figure 2 shows our results: Adam shows better convergence than the other methods.

Figure 2: Training of multilayer neural networks on MNIST images. (a) Neural networks using dropout stochastic regularization. (b) Neural networks with deterministic cost function. We compare with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014).

6.3 EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS

Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear units have shown considerable success in computer vision tasks. Unlike most fully connected neural nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller learning rate for the convolution layers is often used in practice when applying SGD. We show the effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The input images are pre-processed by whitening, and dropout noise is applied to the input layer and the fully connected layer. The minibatch size is also set to 128, similar to the previous experiments.

Figure 3: Convolutional neural networks training cost. (left) Training cost for the first three epochs. (right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.

Interestingly, although both Adam and AdaGrad make rapid progress lowering the cost in the initial stage of training, shown in Figure 3 (left), Adam and SGD eventually converge considerably faster than AdaGrad for CNNs, shown in Figure 3 (right). We notice that the second moment estimate $\hat{v}_t$ vanishes to zeros after a few epochs and is dominated by the $\epsilon$ in Algorithm 1. The second moment estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared to the fully connected network from Section 6.2, whereas reducing the minibatch variance through the first moment is more important in CNNs and contributes to the speed-up. As a result, AdaGrad converges much slower than the others in this particular experiment. Though Adam shows only marginal improvement over SGD with momentum, it adapts the learning rate scale for the different layers instead of hand-picking it manually as in SGD.

Figure 4: Effect of bias-correction terms (red line) versus no bias-correction terms (green line) after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-Encoder (VAE) (Kingma & Welling, 2013), for different settings of stepsize $\alpha$ (x-axes) and hyper-parameters $\beta_1$ and $\beta_2$.

6.4 EXPERIMENT: BIAS-CORRECTION TERM

We also empirically evaluate the effect of the bias-correction terms explained in Sections 2 and 3. As discussed in Section 5, removal of the bias-correction terms results in a version of RMSProp (Tieleman & Hinton, 2012) with momentum; a small numerical illustration of the resulting blow-up of the first steps follows.
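The sketch below (an illustration, not an experiment from the paper) feeds a constant gradient through the moment estimates and compares the per-step update magnitude with and without the bias-correction terms. Without correction, the early steps overshoot the stepsize $\alpha$ by a factor that grows as $\beta_2$ approaches 1, which is the instability visible in Figure 4; with correction, the ratio stays at 1 for a constant gradient.

import numpy as np

def early_step_ratios(beta1, beta2, eps=1e-8, steps=20, g=1.0):
    """Per-step |update| / alpha for a constant gradient, with and without the
    bias-correction terms of Algorithm 1 (without them, the rule is the
    RMSProp-with-momentum style update discussed above)."""
    m = v = 0.0
    uncorrected, corrected = [], []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        uncorrected.append(abs(m / (np.sqrt(v) + eps)))
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        corrected.append(abs(m_hat / (np.sqrt(v_hat) + eps)))
    return max(uncorrected), max(corrected)

for beta2 in (0.99, 0.999, 0.9999):
    worst_raw, worst_adam = early_step_ratios(beta1=0.9, beta2=beta2)
    print(f"beta2={beta2}: worst early step ~ {worst_raw:.1f} * alpha without correction, "
          f"~ {worst_adam:.2f} * alpha with correction")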
To probe this, we vary $\beta_1$ and $\beta_2$ when training a variational auto-encoder (VAE) with the same architecture as in (Kingma & Welling, 2013): a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. $\beta_1 \in [0, 0.9]$, $\beta_2 \in [0.99, 0.999, 0.9999]$, and $\log_{10}(\alpha) \in [-5, \dots, 1]$. Values of $\beta_2$ close to 1, required for robustness to sparse gradients, result in larger initialization bias; we therefore expect the bias-correction term to be important in such cases of slow decay, preventing an adverse effect on optimization.

In Figure 4, values of $\beta_2$ close to 1 indeed lead to instabilities in training when no bias-correction term is present, especially during the first few epochs of training. The best results were achieved with small values of $(1-\beta_2)$ and bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam performed equal to or better than RMSProp, regardless of hyper-parameter setting.

7 EXTENSIONS

7.1 ADAMAX

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) $L^2$ norm of their individual current and past gradients. We can generalize the $L^2$ norm based update rule to an $L^p$ norm based update rule. Such variants become numerically unstable for large $p$. However, in the special case where we let $p \to \infty$, a surprisingly simple and stable algorithm emerges; see Algorithm 2 below. We'll now derive the algorithm. Let, in case of the $L^p$ norm, the stepsize at time $t$ be inversely proportional to $v_t^{1/p}$, where:

$$v_t = \beta_2^p v_{t-1} + (1-\beta_2^p)|g_t|^p = (1-\beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \quad (7)$$

Note that the decay term is here equivalently parameterised as $\beta_2^p$ instead of $\beta_2$. Now let $p \to \infty$, and define $u_t = \lim_{p\to\infty}(v_t)^{1/p}$; then:

$$u_t = \lim_{p\to\infty}\left((1-\beta_2^p)\sum_{i=1}^{t}\beta_2^{p(t-i)}|g_i|^p\right)^{1/p} = \lim_{p\to\infty}(1-\beta_2^p)^{1/p}\left(\sum_{i=1}^{t}\beta_2^{p(t-i)}|g_i|^p\right)^{1/p} = \lim_{p\to\infty}\left(\sum_{i=1}^{t}\left(\beta_2^{t-i}|g_i|\right)^p\right)^{1/p} = \max\left(\beta_2^{t-1}|g_1|,\, \beta_2^{t-2}|g_2|,\, \dots,\, \beta_2|g_{t-1}|,\, |g_t|\right)$$

which corresponds to the remarkably simple recursive formula

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$

with initial value $u_0 = 0$. Note that, conveniently enough, we don't need to correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely $|\Delta_t| \le \alpha$.

Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See Section 7.1 for details. Good default settings for the tested machine learning problems are $\alpha = 0.002$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$. With $\beta_1^t$ we denote $\beta_1$ to the power $t$. Here, $(\alpha/(1-\beta_1^t))$ is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $u_0 \leftarrow 0$ (Initialize the exponentially weighted infinity norm)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$ (Update biased first moment estimate)
    $u_t \leftarrow \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (Update the exponentially weighted infinity norm)
    $\theta_t \leftarrow \theta_{t-1} - (\alpha/(1-\beta_1^t)) \cdot m_t/u_t$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
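As a companion to Algorithm 2, here is a minimal NumPy sketch of one AdaMax step, together with a numerical check that the recursion $u_t = \max(\beta_2 u_{t-1}, |g_t|)$ agrees with the closed form $\max_i \beta_2^{t-i}|g_i|$ derived above. The function name and test values are illustrative, not from the paper.

import numpy as np

def adamax_update(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax step (Algorithm 2). Only two lines differ from Adam: the running
    infinity norm u replaces the second-moment estimate, and u needs no
    initialization-bias correction. Assumes u is nonzero after the first nonzero gradient."""
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    u = np.maximum(beta2 * u, np.abs(grad))     # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta, m, u

# One illustrative step on f(theta) = ||theta||^2.
theta = np.array([1.0, -2.0]); m = np.zeros(2); u = np.zeros(2)
theta, m, u = adamax_update(theta, 2 * theta, m, u, t=1)

# Check the derivation: the recursion equals the closed form max_i beta2^(t-i) |g_i|,
# i.e. the p -> infinity limit of the L^p exponential moving average.
rng = np.random.default_rng(0)
beta2 = 0.999
g = rng.normal(size=50)
u_scalar = 0.0
for gt in g:
    u_scalar = max(beta2 * u_scalar, abs(gt))
closed_form = max(beta2 ** (len(g) - i) * abs(gi) for i, gi in enumerate(g, start=1))
assert np.isclose(u_scalar, closed_form)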
7.2 TEMPORAL AVERAGING

Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously, in Moulines & Bach (2011), Polyak-Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard SGD, where $\bar{\theta}_t = \frac{1}{t}\sum_{k=1}^{t}\theta_k$. Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loop of Algorithms 1 and 2: $\bar{\theta}_t \leftarrow \beta_2 \cdot \bar{\theta}_{t-1} + (1-\beta_2)\theta_t$, with $\bar{\theta}_0 = 0$. Initialization bias can again be corrected by the estimator $\hat{\theta}_t = \bar{\theta}_t/(1-\beta_2^t)$.

8 CONCLUSION

We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

9 ACKNOWLEDGMENTS

This paper would probably not have existed without the support of Google Deepmind. We would like to give special thanks to Ivo Danihelka and Tom Schaul for coining the name Adam. Thanks to Kai Fan from Duke University for spotting an error in the original AdaMax derivation. Experiments in this work were partly carried out on the Dutch national e-infrastructure with the support of SURF Foundation. Diederik Kingma is supported by the Google European Doctorate Fellowship in Deep Learning.

REFERENCES

Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Deng, Li, Huang, Jui-Ting, Yao, Kaisheng, Seide, Frank, Seltzer, Michael, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at Microsoft. In ICASSP 2013, 2013.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645-6649. IEEE, 2013.

Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82-97, 2012a.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b.

Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2013.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142-150. Association for Computational Linguistics, 2011.

Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451-459, 2011.

Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.