A dedicated very low power analog VLSI architecture for smart adaptive systems

Applied Soft Computing 4 (2004) 206–226

Maurizio Valle ∗, Francesco Diotalevi 1

Department of Biophysical and Electronic Engineering (DIBE), University of Genova, Via All'Opera Pia 11/A, I-16145 Genova, Italy

Received 15 January 2004; accepted 6 March 2004

Abstract

This paper deals with analog VLSI architectures addressed to the implementation of smart adaptive systems on silicon. In particular, we address the implementation of artificial neural networks with on-chip learning algorithms, with the goal of efficiency in terms of scalability, modularity, computational density, real-time operation and power consumption. We present the analog circuit architecture of a feed-forward network with on-chip weight perturbation learning in CMOS technology. The novelty of the approach lies in the circuit implementation of the feed-forward neural primitives and in the overall analog circuit architecture. The proposed circuits feature very low power consumption and robustness with respect to noise effects. We extensively tested the analog architecture with simulations at transistor level, using the netlist extracted from the physical design. The results compare favourably with those reported in the open literature. In particular, the architecture exhibits very high power efficiency and computational density, as well as remarkable modularity and scalability features. The proposed approach is aimed at the implementation of embedded intelligent systems for ubiquitous computing.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Smart adaptive systems; Analog VLSI neural networks; On-chip learning implementation; Perturbation-based gradient descent algorithms; CMOS technology; Translinear circuits; Weak inversion; Differential and balanced current mode signal coding

1. Introduction

Intelligent/smart systems have become common practice in many engineering applications.
This is mainly due to:

• the development over the last 30 years of powerful, robust and "adaptive" algorithms which can efficiently solve very computationally demanding tasks in many areas like robotics, decision making systems, artificial vision, pattern recognition, etc. We refer to the "soft computing" techniques like artificial neural networks, cellular networks, fuzzy logic, etc.;
• the exponential rate of advances in semiconductor technology, both in productivity and performance. This has allowed a tremendous, ever-increasing trend in the performance of microprocessors, digital signal processing devices, sensor solid-state interfaces, wireless systems, etc.;
• at last, an equivalent advance in silicon micro-machining techniques and micro (nano) systems engineering.

∗ Corresponding author. Tel.: +39-010-353-2775; fax: +39-010-353-2096. E-mail addresses: (M. Valle), (F. Diotalevi).
1 Present address: Accent Corporation Spa, Genova, Italy.

1568-4946/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2004.03.002

Starting from the achieved results, the next breakthrough will be the design and development of "smart adaptive systems on silicon", i.e. very power- and silicon-area-efficient devices which implement an entire system (i.e. sensing, computing and "actuating" actions) on a single silicon die. These systems should be able to "adapt" autonomously to the changing environment; they will not be programmed in the usual way but will infer (operative) knowledge directly from the environment and from raw sensed data. In other words, they will be able to implement "intelligent" behaviour and "cognitive" tasks. In this paper we will concentrate on the VLSI implementation of cognitive systems (e.g.
learning algorithms) and we will demonstrate that, by adopting weight perturbation (WP) learning algorithms and dedicated circuit design approaches, very efficient (in terms of scalability, modularity, computational density, real-time operation and power consumption) analog, and eventually mixed-mode, implementations can be achieved.

It has been demonstrated elsewhere [1] that the computational density and the energy efficiency of dedicated analog on-chip learning chips greatly exceed those of their major competitors, i.e. state-of-the-art digital signal processors (DSPs). Moreover, the computational power of analog on-chip learning implementations is comparable to that of general-purpose microprocessors (see [2]), even though their computational density and energy efficiency are much greater. In fact, the computational density and energy efficiency figures differ by many orders of magnitude in favour of dedicated analog neural chips, even if they do not use state-of-the-art (and expensive) technologies. Even though circuits implemented with low-cost technologies can achieve the above-mentioned performances, dedicated on-chip learning chips still leave large room for improvement. DSPs need A/D and D/A circuits to interface with sensors and actuators, thus increasing the energy and silicon area budget. From this perspective, dedicated analog on-chip learning implementations can compete for speed and size with state-of-the-art DSPs. Their performances can be improved significantly if state-of-the-art technologies are used as well.

One should emphasize that the performances of the dedicated analog on-chip learning chips concur with those of analog computing arrays [3,4]. They exhibit a computing power in the order of 10^9 Operations Per Second (OPS) per mW and per mm^2 and an energy per operation (OP) in the order of 10^{-12} J.
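The two figures quoted above are mutually consistent, as a quick back-of-the-envelope check shows (an illustrative sketch, not taken from the paper):

```python
# Check that 10^9 OPS per mW implies an energy per operation
# on the order of 10^-12 J.
power_w = 1e-3        # 1 mW, expressed in watts
ops_per_second = 1e9  # 10^9 operations per second

energy_per_op_j = power_w / ops_per_second  # joules per operation
print(energy_per_op_j)  # on the order of 1e-12 J
```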
Analog on-chip learning implementations can be regarded as computing arrays with learning capabilities; they can take advantage of both of these features.

In this paper, we address the analog VLSI implementation of artificial neural networks with on-chip learning algorithms with the goal of efficiency in terms of scalability, modularity, computational density and power consumption.

We address scalability and modularity both at the system and at the circuit level. In the former case, we accomplish it by adopting weight perturbation gradient descent learning algorithms, which lend themselves better to scalable architectures. In the latter case, we adopt current-mode circuits; the current-mode approach increases the dynamic range of signals and significantly lowers power consumption if one implements circuits biased in weak inversion.

To increase the robustness of computation with respect to noise and analog circuit non-idealities, we adopt a differential and balanced current-mode signalling approach and propose the on-chip learning implementation. In this perspective, one should observe that [1] the on-chip learning scheme can successfully cope with the non-idealities and errors of the analog circuit implementation, much as automatic on-chip tuning does with analog integrated filters. The feedback scheme is effective also because it exploits the speed of analog circuits.

Following the above considerations, we propose novel neural primitive circuits whose operation is based on the translinear approach: the circuits are biased in the weak inversion region of operation and achieve a very high power efficiency.

We also present the novel analog circuit architecture of a two-layer multi-layer perceptron (MLP) network with an on-chip weight perturbation learning algorithm. The design has been done using a CMOS 0.8 µm technology (AMS CYE) [5], considering a (low) power supply voltage of 2.5 V (i.e. negative power supply voltage V_SS = −1.25 V and positive power supply voltage V_DD = +1.25 V).

This paper is organised as follows. The system architecture (i.e. the learning algorithm and the overall analog circuit architecture) is detailed in Section 2. The rationale of the circuit design approach is presented in Section 3, while basic aspects of the design of the neural primitive and ancillary circuits are introduced in Section 4. The analog circuit architecture has been extensively tested in learning simulations at circuit level using the netlist extracted from the physical design. The results and a comparison of the performances with the open literature are reported in Section 5. The conclusion is drawn in Section 6.

2. System architecture

2.1. Gradient descent learning algorithms

In the following, we will refer to the standard architecture of the feed-forward multi-layer perceptron network without any loss of generality. In fact, in an MLP we may account for time by incorporating memory (e.g. a tapped delay line) in the input layer of the network and use it for solving dynamic estimation problems. Alternatively, we may build feedback around an MLP by feeding the output signals of hidden layers back to the input of preceding layer(s). Feedback configured in this manner not only turns a static MLP into a dynamic one (namely a recurrent MLP) but also offers the potential for improving system performances [6].

In the standard architecture of a feed-forward N-layer MLP network, the l-th layer (l ∈ [1 → N]) is organised as a matrix of n^l neurons (where the suffix l stands for the layer number) connected to a matrix of n^l × n^{l−1} synaptic multipliers. The neurons implement the linear summation of the synaptic outputs and the non-linear activation function Ψ(·).
The computation performed by the j-th neuron of the l-th layer is:

$$a_j^l = \sum_{i=1}^{n^{l-1}} W_{j,i}^l\, X_i^{l-1}, \qquad X_j^l = \tanh(a_j^l), \qquad l \in [1 \to N],\ j \in [1 \to n^l],\ i \in [1 \to n^{l-1}] \tag{1}$$

where X_j^l is the output of the j-th neuron of the l-th layer, X_i^{l−1} the output of the i-th neuron of the (l−1)-th layer, the term a_j^l (usually called activation) is the sum of the outputs of the synapses of the (l−1)-th layer which are connected to the j-th neuron of the l-th layer, and W_{j,i}^l is the weight of the synapse connecting the i-th neuron of the (l−1)-th layer to the j-th neuron of the l-th layer (neuron layer 0, called the input layer, does not perform any computation but only feeds the network input signals X_i^0, i ∈ [1 → n^0]). In the following, we will indicate the whole set of synaptic weights as w. The synaptic weight values (together with the network topology) determine the network function, i.e. the non-linear mapping between input and output data spaces.

Supervised algorithms need training sets made of pairs of exemplary input patterns and desired output patterns (i.e. targets) [7,8]. The weights are updated until the target output is generated for the corresponding network input (i.e. an error index ε(w) has been minimised, see below) or a pre-defined default termination condition is met (e.g. the number of iterations exceeds a given threshold). To accomplish this task, a gradient descent algorithm is generally used.

When considering supervised "gradient descent" based learning algorithms, the learning task is accomplished by minimising, with respect to the adjustable parameters w, the error index ε(w); the update learning rule for the generic weight W_{i,j} is:

$$\Delta w_{i,j} = -\eta\,\frac{\partial \varepsilon(w)}{\partial w_{i,j}} \tag{2}$$

where η is usually called the "learning rate". The main issue from the analog VLSI point of view concerns the computation of ∂ε(w)/∂w_{i,j}.
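The feed-forward computation of Eq. (1) can be sketched in a few lines of software (a minimal illustration with hypothetical layer sizes and random weights; the paper's subject is an analog circuit realisation, not this code):

```python
import math
import random

def forward(layers, x):
    """Forward pass of a feed-forward MLP, per Eq. (1).

    layers: list of weight matrices; layers[l][j][i] holds W^l_{j,i}.
    x: list of network inputs X^0_i.
    Each neuron computes a^l_j = sum_i W^l_{j,i} X^{l-1}_i and
    outputs X^l_j = tanh(a^l_j).
    """
    for W in layers:
        x = [math.tanh(sum(w_ji * xi for w_ji, xi in zip(row, x)))
             for row in W]
    return x

# Hypothetical two-layer MLP: 3 inputs, 4 hidden neurons, 2 outputs.
random.seed(0)
net = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)],
       [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]]
print(forward(net, [0.5, -0.2, 0.8]))  # two outputs, each in (-1, 1)
```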
By applying the chain rule, the back-propagation (BP) algorithm calculates the value of ∂ε(w)/∂w_{i,j}, while weight perturbation "estimates" it simply through its incremental ratio.

We define the error index ε_p for the p-th pattern example as the sum (over all the output neurons) of the square difference between the target X̄_k^N and the actual output X_k^N (the index k runs over the output layer neurons):

$$\varepsilon_p(w) = \frac{1}{2}\sum_{k=1}^{n^N} \left[\bar{X}_k^N - X_k^N\right]^2 \tag{3}$$

The error index ε(w) is computed as the average of the error indices ε_p(w) over all the N_p exemplary patterns of the training set:

$$\varepsilon(w) = \frac{1}{N_p}\sum_{p=1}^{N_p} \varepsilon_p(w) \tag{4}$$

The learning algorithm is operated by following an on-line or a by-epoch exemplary pattern presentation approach. When using the on-line approach, the patterns are sequentially (and usually in random order) given in input to the network. At each presentation, a weight contribution for each synapse is computed, and the synaptic weight values are updated. In the by-epoch approach, the weight update contributions are accumulated along each epoch while the weights are held constant. The synaptic weight values are then updated only once all the patterns have been presented to the network (i.e. once the weight contributions related to all the pattern examples have been computed).

With respect to the by-epoch approach, the on-line procedure introduces a kind of randomness in the learning process that may help in escaping from the local minima of the total error index ε(w) [9]. Moreover, this technique is usually faster and more effective when the training set is huge (e.g. consisting of thousands of pattern examples, as in the case of handwritten character recognition, speech recognition, etc.). On the other hand, the by-epoch approach usually gives better results when high "precision" is required in the non-linear mapping between input and output data spaces (e.g. function approximation) [1]. Nevertheless, the by-epoch approach requires more memory storage to accumulate the weight contributions computed at each pattern presentation [10]. The additional memory and the accumulation operation represent the main drawback of the implementation of on-chip by-epoch learning in analog VLSI technology. In the following, we will focus on a by-pattern approach.

2.2. Weight perturbation learning algorithm(s)

The WP algorithms estimate rather than calculate the gradient of the output error index. This method estimates the gradient simply through its incremental ratio: it measures the error index gradient by perturbing the network parameters (i.e. the weights) and by observing the change in the network output. If the weight perturbation p_{j,i}^{(n)} at the n-th iteration is small enough, we can neglect the higher order terms and write

$$\frac{\partial \varepsilon_p(w)}{\partial w_{j,i}} \cong \frac{\varepsilon_p(w_{j,i} + p_{j,i}^{(n)}) - \varepsilon_p(w_{j,i})}{p_{j,i}^{(n)}} \tag{5}$$

so

$$\Delta w_{j,i} = -\eta\,\frac{\varepsilon_p(w_{j,i} + p_{j,i}^{(n)}) - \varepsilon_p(w_{j,i})}{p_{j,i}^{(n)}} \tag{6}$$

where p_{j,i}^{(n)} is the random perturbation injected in the w_{j,i} synaptic weight at the n-th iteration and Δw_{j,i} is the value used to update the weight w_{j,i}. The difference between the error index before and after the perturbation of the generic weight w_{j,i} is used to estimate the gradient's value with respect to w_{j,i}. This algorithm is the simplest form of the WP algorithm and, because only one synapse's weight is perturbed at a time, this technique is called sequential [11].

To make the circuit implementation simpler, every weight perturbation p_{j,i}^{(n)} can be considered equal in value and random only in sign [12]:

$$p_{j,i}^{(n)} = \mathrm{pert}_{j,i}^{(n)}\,\mathrm{step}$$

where step is the perturbation value, while pert_{j,i}^{(n)} can assume the values +1 or −1 with equal probability. One can rewrite Eq.
(6) as follows:

$$\Delta w_{j,i} = -\eta\,\frac{\varepsilon_p(w_{j,i} + p_{j,i}^{(n)}) - \varepsilon_p(w_{j,i})}{p_{j,i}^{(n)}} = -\eta\,\frac{\Delta\varepsilon_p}{\mathrm{step}}\,\mathrm{pert}_{j,i}^{(n)} \tag{7}$$

where Δε_p denotes the difference between the perturbed and unperturbed error index, and we have used 1/p_{j,i}^{(n)} = pert_{j,i}^{(n)}/step (since pert_{j,i}^{(n)} = ±1). The term step can then be absorbed into the η value, i.e.

$$\Delta w_{j,i} = -\eta'\,\Delta\varepsilon_p\,\mathrm{pert}_{j,i}^{(n)}, \qquad \eta' = \frac{\eta}{\mathrm{step}}, \qquad \mathrm{pert}_{j,i}^{(n)} = \pm 1 \text{ with equal probability} \tag{8}$$

To compute the synapse's weight update Δw_{j,i}, it is only necessary to compute Δε_p and to know pert_{j,i}^{(n)}. The box below shows the final version of the WP learning algorithm, in the case of a by-epoch pattern presentation approach.

The sequential WP algorithm may be too slow in real applications where big networks are employed. In [13], four parallel perturbation strategies have been highlighted. The one we will focus on is the "fully parallel weight perturbation", owing to its simpler circuit implementation with respect to the other perturbation strategies. Originally developed by Cauwenberghs [14] and Alspector et al. [15], it is called stochastic error descent. It consists of the fully parallel perturbation of all synaptic weights. One can formulate the weight perturbation strategy in mathematical terms as:

$$\Delta w = -\eta\left[\varepsilon_p(w + p^{(n)}) - \varepsilon_p(w)\right] p'^{(n)}$$

where p^{(n)} is the perturbation matrix of elements p_{j,i} at the n-th iteration and p'^{(n)} is the matrix of elements 1/p_{j,i} (inverses of the perturbations) at the n-th iteration.

2.3. Analog VLSI on-chip (supervised) learning implementation issues

The advantages of on-chip learning implementation are not evident when the problem solution requires small networks (i.e. with a small number of neurons and synapses) and the training set consists of few examples. In this case, the off-chip learning implementation is more convenient: a host computer can implement the learning algorithm by using a chip-in-the-loop technique [16]. Analog on-chip learning networks ought to be pursued when: (a) large networks (e.g.
with thousands of synapses) and large training sets (e.g. thousands of examples) are considered; (b) we need to implement adaptive neural systems, i.e. systems that are continuously taught while being used.

As discussed in the previous section, multi-layer networks can be trained by using a supervised algorithm, i.e. the weight adjustment (see (2)) can be achieved by using the back-propagation or the weight perturbation rules.

BP algorithms have been used extensively and successfully to solve real-world tasks, but they do not fully cope with the pitfalls of analog VLSI circuits. Therefore, one must take care when designing analog on-chip BP learning circuits. The BP algorithms compute the gradient of the error index function according to the transfer characteristics of synapses and neurons; nevertheless, the behaviour of analog circuits (non-linearity effects, offsets, mismatches, technology process spread, etc.) affects the gradient computation and worsens the training performances.

The main drawbacks of the analog on-chip implementation of the BP algorithms are:

◦ fixed (in particular non-recursive) feed-forward network topology;
◦ need for circuits to compute the derivative of the neuron transfer function;
◦ sensitivity to circuit offsets;
◦ high signal count for the back-propagation of the error terms.

The WP rule looks more attractive for the analog on-chip implementation mainly because [1]:

◦ the WP algorithms can be applied also to recurrent multi-layer networks;
◦ the WP algorithms are more easily arranged in a scalable architecture;
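The sign-based update of Eq. (8), applied with the fully parallel (stochastic error descent) strategy, can be sketched in software as follows (an illustrative sketch with a hypothetical quadratic error index and hand-picked learning parameters; the paper's contribution is the analog circuit realisation, not this code):

```python
import random

def stochastic_error_descent(error, w, eta_prime, step, iters=2000):
    """Fully parallel weight perturbation (stochastic error descent).

    All weights are perturbed at once by +/-step (random signs); the
    resulting change in the error index, delta_eps = eps(w + p) - eps(w),
    updates every weight as in Eq. (8):
        delta_w[j] = -eta' * delta_eps * pert[j],   eta' = eta / step.
    """
    for _ in range(iters):
        pert = [random.choice((-1.0, 1.0)) for _ in w]
        eps0 = error(w)
        eps1 = error([wi + step * si for wi, si in zip(w, pert)])
        delta_eps = eps1 - eps0
        w = [wi - eta_prime * delta_eps * si for wi, si in zip(w, pert)]
    return w

# Hypothetical error index: quadratic distance from a target weight vector.
target = [0.3, -0.7, 0.5]
error = lambda w: 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

random.seed(1)
w = stochastic_error_descent(error, [0.0, 0.0, 0.0],
                             eta_prime=10.0, step=0.005)
print(error(w))  # reduced by orders of magnitude from the initial 0.415
```

Note that the loop only needs two evaluations of the error index per iteration, regardless of the number of weights — the property that makes this strategy attractive for a parallel analog implementation.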