
A connectionist model of instruction following

David C. Noelle & Garrison W. Cottrell
Department of Computer Science & Engineering
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0114
{dnoelle, gary}

Abstract

In this paper we describe a general connectionist model of "learning by being told". Unlike common network models of inductive learning, which rely on the slow modification of connection weights, our model of instructed learning focuses on rapid changes in the activation state of a recurrent network. We view stable distributed patterns of activation in such a network as internal representations of provided advice – representations which can modulate the behavior of other networks. We suggest that the stability of these configurations of activation can arise over the course of learning an instructional language and that these stable patterns should appear as articulated attractors in the activation space of the recurrent network. In addition to proposing this general model, we also report on the results of two computational experiments. In the first, networks are taught to respond appropriately to direct instruction concerning a simple mapping task. In the second, networks receive instructions describing procedures for binary arithmetic, and they are trained to immediately implement the specified algorithms on pairs of binary numbers. While the networks in these preliminary experiments were not designed to embody the attractor dynamics inherent in our general model, they provide support for this approach by demonstrating the ability of recurrent backpropagation networks to learn an instructional language in the service of some task and thereafter exhibit prompt instruction following behavior.

Introduction

Connectionist models of human learning have focused primarily on problems of inductive generalization. They view learning as a process of extracting new knowledge from the statistical properties of a long series of exemplars.
While this covers many cases, humans exhibit other learning behaviors which defy description as induction over examples. In particular, we are capable of acquiring new knowledge from single highly informative events, such as tasting sushi for the first time, seeing someone operate a new coffee machine, or hearing a lecture. A single sentence can have profound and lasting effects on our behavior. Furthermore, we are capable of integrating such rapidly acquired knowledge with knowledge gained inductively. If connectionism is to provide a sound computational framework for the entire range of human learning behaviors, it must be extended beyond induction.

Towards this end, we propose here a connectionist model of "learning by being told", and we demonstrate a somewhat more modest model of connectionist instruction following. We view "learning from instruction" as involving the demonstration of "appropriate" behavior immediately following the receipt of a linguistic directive. Our goal is to integrate standard connectionist learning methods with such rapid instructed learning to form a single cognitively plausible model. This multistrategy model should help explain both the operationalization of instruction into appropriate behavior (Hayes-Roth et al., 1980; Mostow, 1983) and the interaction effects between instructed learning and exemplar-based learning which have been observed in humans.

Our model is based on the observation that typical connectionist weight modification techniques are inherently too slow to account for instructed learning. Of course, large weight changes could be made in response to instruction, but in networks using distributed representations, such rapid weight modification tends to destroy previously acquired knowledge in dramatic and unrealistic ways. We focus instead on modeling a network's response to instruction as changes in its internal activation state.
We suggest that our prompt response to instruction is best seen as motion in activation space – as the settling of a network's activation state into a (typically novel) basin of attraction corresponding to the received instruction.[1]

[1] Thanks are due to Paul Churchland for this notion (Churchland, 1990).

This idea may be illustrated by the Necker cube network (McClelland et al., 1986), shown in Figure 1. This small constraint satisfaction network models our perception of the line drawing of a cube, focusing on our tendency to interpret the drawing in one of two distinct ways. The processing elements in this network represent particular depth assignments, "front" or "back", for each vertex in the cube. Connection weights are selected so as to embody constraints on the interpretation of vertices. The result is a recurrent attractor network with two major basins of attraction in activation space. In other words, there are two main configurations of unit activations which are stable over time – one configuration for each interpretation of the Necker cube drawing. Given some starting levels of activation for the processing elements, the network will tend to settle into one of these attractor basins, forming a coherent internal representation of the cube in activation space. We may bias the network toward one interpretation over the other by providing input activation to the appropriate units. By manipulating the input to the units on one side, we can cause the network to rapidly move from one attractor basin to the next. Similarly, humans can be told to see the cube in a different way, and this input can change their perception of it.

The reception of direct instruction may be viewed as a process similar to that embodied in this Necker cube example. In the case of instruction, natural language advice is to be
rapidly transformed into a coherent internal representation in activation space – a representation which may then be used to modulate the behavior of other networks in the performance of their tasks.

[Figure 1: Necker Cube And Attractor Network]

As in the Necker cube system, we may encode these internal representations in the attractors of a recurrent network. Learning an instructional language may be seen as training the weights of such an attractor network so as to form a distinct basin of attraction for every meaningful sequence of advice. Most of these articulated attractors[2] need not be explicitly trained over the course of language learning, but they may come into existence, nonetheless, via interactions between trained attractors. In this way, some "spurious" attractor basins may actually be seen as serendipitous basins in that, while not explicitly trained, they have interpretations that "make sense" in the behavioral domain of the network. Thus, once the language of instruction is learned, such an attractor network may rapidly encode novel advice simply by moving to the corresponding basin of attraction. As in the Necker cube network, this motion in activation space may be driven by appropriate input activation. Following the lead of many connectionist models of natural language processing (Cottrell, 1989; Elman, 1990; St. John and McClelland, 1990), we may encode linguistic instructions as temporal sequences of such input activity, allowing advice to "push" the network into the appropriate basin of attraction. For this strategy to work, such an instructable network must support a distinct stable configuration of activation levels for every instruction sequence which might be presented. If such a combinatorial set of attractors is not present, the network will be limited in the number of different instruction sequences that it can understand and operationalize.
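The settling dynamics invoked here can be sketched in a few lines. The following is a minimal illustration of our own (a Hopfield-style binary attractor network, not the McClelland et al. Necker cube model itself): two stored patterns play the role of the two cube interpretations, and a small input bias is enough to push an ambiguous starting state into one basin or the other.

```python
def hopfield_weights(patterns):
    """Hebbian weights: W[i][j] = sum over patterns of p[i]*p[j], zero diagonal."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def settle(w, state, bias=None, sweeps=20):
    """Asynchronous +/-1 unit updates until a full sweep changes nothing."""
    n = len(state)
    bias = bias if bias is not None else [0] * n
    state = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            h = sum(w[i][j] * state[j] for j in range(n)) + bias[i]
            s = 1 if h >= 0 else -1
            if s != state[i]:
                state[i], changed = s, True
        if not changed:
            break
    return state

# Two stable "interpretations", analogous to the two Necker cube readings.
A = [1, 1, 1, 1, -1, -1, -1, -1]
B = [-1, -1, -1, -1, 1, 1, 1, 1]
W = hopfield_weights([A, B])

# From an ambiguous start, a small bias on a single unit selects the basin.
print(settle(W, [1, -1, 1, -1, 1, -1, 1, -1], bias=[2, 0, 0, 0, 0, 0, 0, 0]))
# -> [1, 1, 1, 1, -1, -1, -1, -1]  (pattern A)
```

Flipping the sign of the bias drives the same starting state into pattern B instead, mirroring the way input activation selects a Necker cube interpretation.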
With articulated attractors in place, however, novel instructions may be quickly molded into a coherent activation-based modulating force on some task behavior.

Note that this strategy of modeling instructed learning in activation space leaves weight modification in the capable hands of standard connectionist inductive learning algorithms. This allows instruction and induction to proceed in tandem to solve complex learning problems. Also, in addition to weight modifications based on behavioral feedback, Hebbian learning may be used to strengthen and deepen attractors which are regularly visited. Such weight changes would increase the likelihood of visiting those attractors in the future, making common instructional memories easy to instantiate.

[2] These have also been called componential attractors (Plaut and McClelland, 1993).

[Figure 2: Generic Instructable Network Architecture]

Our approach may be illustrated by a generic network architecture, shown in Figure 2. In that diagram, boxes represent layers of processing elements, and arrows between boxes represent complete interconnections between layers. Temporal streams of tokens in an instructional language are presented at the advice layer, and this input activity is used to direct the settling of the attractor network at the plan layer. The stable configuration of activity levels at the plan is then used to modulate the behavior of a task oriented network, much like the "plan" layer of a Jordan network (Jordan, 1986). The connection weights may be trained using a standard inductive learning technique, such as backpropagation, with an error signal provided only on actual task performance. This allows the language of instruction to be learned in the service of a task (St. John, 1992). Such inductive learning may also be used to shape task performance through experience.
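As a concrete sketch of this information flow (entirely hypothetical: random untrained weights, and layer sizes of our own choosing rather than those of the paper's simulations), the fragment below wires up the Figure 2 architecture. A recurrent plan layer absorbs an advice token stream into an activation vector, which is then held fixed while modulating a separate task network.

```python
import math
import random

random.seed(0)                       # deterministic, illustrative weights only

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(vec, weights):
    """One fully connected sigmoid layer; each weight row ends with a bias."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec + [1.0])))
            for row in weights]

def rand_weights(n_out, n_in):
    return [[random.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
            for _ in range(n_out)]

ADVICE, PLAN, SENSE, HID, OUT = 4, 5, 3, 6, 3       # hypothetical layer sizes
w_plan = rand_weights(PLAN, ADVICE + PLAN)          # recurrent plan network
w_hid = rand_weights(HID, SENSE + PLAN)             # plan modulates the task net
w_out = rand_weights(OUT, HID)

TOK = [[1.0 if i == k else 0.0 for i in range(ADVICE)] for k in range(ADVICE)]

def absorb_advice(tokens):
    """Feed advice tokens one per time step; return the final plan state,
    which is then held ("frozen") during task performance."""
    plan = [0.0] * PLAN
    for tok in tokens:
        plan = layer(tok + plan, w_plan)
    return plan

def task_response(plan, sense):
    """The task network's output, modulated by the frozen plan vector."""
    hidden = layer(sense + plan, w_hid)
    return layer(hidden, w_out)

plan_a = absorb_advice([TOK[0], TOK[1], TOK[2]])    # one advice sequence
plan_b = absorb_advice([TOK[2], TOK[1], TOK[0]])    # same tokens, other order
sense = [1.0, 0.0, 0.0]
# Same weights, same sense input: only the frozen plan vector differs.
print(task_response(plan_a, sense) != task_response(plan_b, sense))
```

With trained (rather than random) weights, this is the regime the paper describes: once the advice stream has settled the plan layer, new behavior is available at the speed of activation propagation, with no weight changes.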
Once the instructional language is learned, however, new behaviors may be elicited at the speed of activation propagation, without further weight modification, simply by presenting a novel stream of advice.

In summary, our general strategy involves:

- encoding instructions as temporal input sequences;
- training a recurrent network (the plan network) to form combinatorial representations of these sequences in its basins of attraction – representations shaped by error feedback from another network (the domain task network), which uses activity in the plan network to modulate the performance of some specific task;
- providing further inductive training as appropriate, allowing for the interaction of exemplar-based inductive learning with learning from instruction.

This paper presents an initial examination of this approach to instructed learning. Of particular interest is the plausibility of the claim that a connectionist network may be trained to promptly transform a temporal sequence of instructions into appropriate behavior in some domain. To test this assertion, some networks were made to follow instructions concerning a combinatoric discrete mapping task and others were trained to immediately implement algorithmic instructions for binary arithmetic. These networks were constructed without the benefit of stable attractors at the plan layer, so activity at that layer was artificially "frozen" once advice was received (St. John and McClelland, 1990). The utility of articulated attractors at the plan layer will be the primary focus of future work. The details of these two experiments are presented below, following a brief overview of related work.

Alternative Approaches

We are not the first to propose a technique for the direct instruction of connectionist models. Indeed, early connectionist networks using "localist" approaches were frequently "instructed" through the direct assignment of network parameters by knowledgeable researchers.
This was possible since the networks involved parameters with well understood semantics. In such a framework, "instruction" was essentially "programming", involving the specification of processing elements, connections between elements, and specific connection weights. Some systems automated a large portion of this process, allowing symbolic rules to be directly compiled into network parameters (Cottrell, 1985; Davis, 1989). The main drawback of this "weight compilation" approach is the constraint that it places on the representational schemes available to the network. Specifically, since the semantics of network components are fixed, networks of this kind are not free to form arbitrary distributed representations appropriate for the demands of the task.

Some researchers have generated network models which are instructable in this "weight compilation" manner but are still free to develop arbitrary internal distributed representations through inductive training. In general, networks of this kind may only be instructed before inductive training begins, because standard connectionist learning methods often change the representational nature of weight values in hard to predict ways, making the direct manipulation of those weights in response to instruction quite problematic. One solution to this problem involves occasionally normalizing weight values back to configurations which are "meaningful" to the weight compilation process. This may be done by identifying and extracting the "rules" embodied in the trained network and then resetting weight values to encode exactly those extracted rules. Once reset in this way, new instructions may be incorporated into the network and the process of inductive learning may begin again.
Weight compilation approaches of this kind have been successfully used to encode propositional logic rules (Towell and Shavlik, 1994), "fuzzy" classification rules (Tresp et al., 1993), simple mapping rules (McMillan et al., 1991), the transitions of finite state automata (Giles and Omlin, 1993), and advice for an agent in an artificial environment (Maclin and Shavlik, 1994).

Unfortunately, none of these models provide a connectionist explanation for how instructions are compiled into the network. Also missing is a connectionist mechanism for the rule extraction process which is needed to "reset" the semantics of weight values. Both of these processes, compilation and extraction, require the direct manipulation of the processing elements and the global coordination of weight values. While a connectionist explanation for these dynamic global restructuring processes may be possible, it is not clear what form such an explanation would take.

Perhaps the most important criticism of these models, however, is that the "language of thought" is preordained by the researcher. In order to continue to receive instruction after inductive learning has begun, the models must continuously reformulate their knowledge in the terminology of explicit linguistic instruction. Inductively learned nuances are discarded during rule extraction, leaving these models with a behavior that is consistently analogous to symbolic rule following. In essence, these models are trapped in the first stage of skill acquisition and cannot escape (Rumelhart and Norman, 1978).

By placing instructions in activation space, all of these problems may be avoided. Inductive weight modifications and instruction following may proceed simultaneously, and they may complement or interfere with each other in complex ways.

Instructed Associations

A number of initial experiments have been conducted, focusing solely on the ability of recurrent backpropagation networks (Rumelhart et al., 1986) to learn to operationalize instruction.
In particular, issues concerning the benefits of attractor network dynamics for generalization to novel advice have been left for later inquiries. For these initial experiments, unit activations at the plan layer are artificially "frozen" in order to provide a stable internal representation of input instruction sequences. The goal of these early experiments is to demonstrate that a language of instruction may be learned inductively solely from error feedback on actual task performance.

Our first experiment focuses on the ability of these networks to learn to follow instructions concerning a simple associational mapping. Our domain task involves mapping inputs from a finite set into outputs from the same set. Correct mapping behavior is not specified by a collection of labeled examples, however, but by the direct communication of mapping rules. These rules may be viewed as statements such as, "When you see rock, say paper." Upon presentation of such rules, the network is to immediately change its behavior accordingly. Inductive training is used during an initial phase in which the network learns the instructional language, but once this initial training is complete, instruction following may proceed without weight modification. Also, this initial training phase exposes the model to only a fraction of the possible mappings, and the network is expected to generalize its instruction following behavior to novel instruction sequences.

The model, inspired by the architecture of the Sentence Gestalt network (St. John and McClelland, 1990), is shown in Figure 3. The boxes represent layers of sigmoidal processing elements and arrows between boxes represent complete interconnections between layers. Layer sizes are shown in parentheses. Symbolic instruction tokens, each encoded as localist "1-out-of-N" activation vectors, were presented sequentially at the advice input layer, and activation was propagated through the recurrent "Plan Network" to produce a pattern of activation at the plan layer.
Each mapping rule was encoded as a sequence of three of these instruction tokens (a delimiter followed by the input/output pair) and each complete mapping consisted of three such rules. For example, the nine token advice sequence:

) ROCK PAPER ) SCISSORS ROCK ) PAPER SCISSORS

was used to communicate the three rules, "when you see ROCK, say PAPER", "when you see SCISSORS, say ROCK", and "when you see PAPER, say SCISSORS".

[Figure 3: Instructable Mapping Network: Architecture, Training Time, & Generalization]

When the presentation of a sequence of instruction tokens was complete, the activation at the plan layer was "frozen" and used to modulate the behavior of the "Mapping Network" as it performed the desired mapping. Input tokens, also encoded in a localist fashion, were presented at the input layer, and the network's response was read from the output layer. During the initial training phase, mean squared error was then computed at the output layer, based on the most recently presented instructions, and this error was backpropagated to allow for weight modifications throughout the network. The details of this training procedure were much like those of the Sentence Gestalt model. In particular, error was backpropagated through recurrent connections for only a single time step, and error was computed after the presentation of each instruction token.
A learning rate of 0.05 was used, with no momentum. This initial inductive training period was ended when perfect performance was achieved on a training set of instruction sequences, or when this training set had been presented 5000 times.

Training sets of nine different sizes were examined, and five different random initial weight sets were used. Almost all of these trials resulted in 100% accuracy on the training set within the limit of 5000 epochs. In other words, these networks consistently learned to operationalize the instructions on which they were trained. As shown in Figure 3, generalization performance was also good, with accuracy values on non-training set instruction sequences appearing well above the chance level of 33% correct. Note that training set size is expressed in this figure as a percentage of the total number of possible instruction sequences. Three discrete inputs gave 27 possible mappings. For each mapping, there were 6 possible permutations of the mapping rules, for a total of 162 possible instruction sequences.

While very simple, this discrete mapping task poses interesting problems for inductive connectionist learning. Appropriate system behavior depends entirely on the given instructions. There are no other environmental regularities for the network to discover during training. Once completely trained to "understand" the instructional language, the network is required to modify its behavior immediately upon receipt of new instructions, without further weight modification. The way in which this discrete mapping problem forces the network to generalize in a systematic manner over the space of instruction sequences makes this a difficult learning problem, and it also makes it an ideal domain in which to test the power of the proposed network architecture. The results demonstrate that associational mapping instructions can indeed be operationalized by networks of this kind.
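The instruction space of this task can be enumerated directly. The sketch below (our own illustration, not the simulation code) builds the localist token vectors, flattens rules into nine-token advice sequences, and confirms the counts quoted above: 27 mappings and 162 distinct instruction sequences. The "#" token stands in for the delimiter.

```python
from itertools import permutations, product

SYMBOLS = ["ROCK", "PAPER", "SCISSORS"]
TOKENS = ["#"] + SYMBOLS              # "#" stands in for the delimiter token

def encode(token):
    """Localist 1-out-of-N activation vector for a single advice token."""
    return [1 if t == token else 0 for t in TOKENS]

def advice_sequence(rules):
    """Flatten (input, output) rule pairs into the nine-token advice stream."""
    return [tok for inp, out in rules for tok in ("#", inp, out)]

def operationalize(sequence):
    """Target behavior: parse an advice stream into an input -> output map."""
    return {sequence[i + 1]: sequence[i + 2] for i in range(0, len(sequence), 3)}

# Each of 3 inputs maps to one of 3 outputs: 3**3 = 27 possible mappings;
# each mapping's three rules can be stated in any of 3! = 6 orders.
mappings = [dict(zip(SYMBOLS, outs)) for outs in product(SYMBOLS, repeat=3)]
sequences = {tuple(advice_sequence(list(perm)))
             for m in mappings for perm in permutations(m.items())}
print(len(mappings), len(sequences))  # -> 27 162
```

Here `operationalize` plays the role the trained network must learn purely from error feedback: turning the token stream into the behavior it prescribes, for sequences never seen during training.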
However, these results also suggest a need for a mechanism to improve generalization performance (Noelle and Cottrell, 1994) – a need which might be met by the incorporation of an attractor network at the plan layer.

Instructed Algorithms

Our second experiment extends our task domain into the realm of sequential procedures. The goal is to demonstrate the ability of these recurrent networks to handle instructions concerning complex sequences of action. To this end, we focus on the domain of arithmetic on arbitrarily large binary integers. In previous work it was shown that recurrent neural networks may be inductively trained to perform a systematic procedure for multi-column addition (Cottrell and Tsung, 1993). Here we wish to examine the possibility of modulating such algorithmic behavior through direct instruction. Instructional tokens, each representing some atomic action, are to be used to communicate a sequential method for binary addition or subtraction to a network, and that network is to be trained to immediately implement the specified procedure.

The general structure of the Cottrell & Tsung addition model was used as the basis of our "Arithmetic Network", shown in Figure 4. Under this approach, arithmetic was seen as the iterative transformation of a written representation of two argument integers. The two numbers are assumed to be written so that columns align, and an attentional mechanism is assumed to focus processing on one digit column at a time. Solving an arithmetic problem involves iteratively performing a sequence of actions for each column and then attending to the next. In terms of the network architecture, the digits input layer contained a representation of the two digits in the current column (plus an extra input unit to signal when no columns remained).
The actions output layer specified the action to be taken on the current time step, which was one of: "write a given digit as the result for the current column", "announce a carry or borrow", "move to the next column", or "announce completion of the current problem".

[Figure 4: Instructable Arithmetic Network: Architecture, Task Format, & Learning Curve]

The "Arithmetic Network" included a recurrent hidden layer, which was needed both to produce a sequential output and to "remember" the potential presence of a carry or borrow from the processing of the previous column.

As in the discrete mapping experiment, the behavior of this domain task network was to be specified by a stream of input instruction tokens. Different instruction sequences could specify different orderings for sets of standard actions (e.g., announcing the carry before or after recording the digit sum) or could specify completely different arithmetic operations (e.g., subtraction rather than addition). Each sequence described three actions which were to be applied in the given order to each column of digits. For example, the usual form of addition was specified as, "WRITE-SUM ANNOUNCE-CARRY NEXT-COLUMN". Only six such instruction sequences constituted meaningful procedures:

ANNOUNCE-CARRY WRITE-SUM NEXT-COLUMN
WRITE-SUM NEXT-COLUMN ANNOUNCE-PREV-CARRY
WRITE-SUM ANNOUNCE-CARRY NEXT-COLUMN
ANNOUNCE-BORROW WRITE-DIFF NEXT-COLUMN
WRITE-DIFF NEXT-COLUMN ANNOUNCE-PREV-BORROW
WRITE-DIFF ANNOUNCE-BORROW NEXT-COLUMN

Despite the extremely small size of this set of possible algorithms, making generalization unlikely, the last of the subtraction sequences was avoided during the initial training phase, to be used, instead, as a test of generalization.
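To make the task concrete, the sketch below interprets these instruction sequences symbolically. This is our reconstruction of the intended action semantics, not the network itself: the three actions are applied in order to each column, least significant column first, with a single carry/borrow flag maintained between columns; for the subtraction procedures we assume the first argument is at least as large as the second.

```python
def run_procedure(actions, a_bits, b_bits):
    """Apply a three-action instruction sequence to every column.

    actions : the per-column action names, in execution order
    a_bits, b_bits : bit lists, least significant bit first
    Returns (written digits, announcements), both least significant first.
    """
    n = max(len(a_bits), len(b_bits))
    a = a_bits + [0] * (n - len(a_bits))     # pad the shorter argument
    b = b_bits + [0] * (n - len(b_bits))
    is_add = any("SUM" in act or "CARRY" in act for act in actions)
    written, announced, flag = [], [], 0     # flag: carry/borrow between columns
    for col in range(n):
        if is_add:
            total = a[col] + b[col] + flag
            digit, out_flag = total % 2, total // 2
        else:                                # subtraction (assumes a >= b)
            total = a[col] - b[col] - flag
            digit, out_flag = total % 2, (1 if total < 0 else 0)
        for act in actions:
            if act in ("WRITE-SUM", "WRITE-DIFF"):
                written.append(digit)
            elif act in ("ANNOUNCE-CARRY", "ANNOUNCE-BORROW"):
                announced.append(out_flag)
            elif act in ("ANNOUNCE-PREV-CARRY", "ANNOUNCE-PREV-BORROW"):
                announced.append(flag)       # NEXT-COLUMN has already run
            elif act == "NEXT-COLUMN":
                flag = out_flag              # commit; the loop itself advances
    if flag:                                 # a final carry adds one more digit
        written.append(flag)
    return written, announced

# 5 + 3 with the "usual" addition procedure: 101 + 011 -> 1000 (shown LSB first).
print(run_procedure(("WRITE-SUM", "ANNOUNCE-CARRY", "NEXT-COLUMN"),
                    [1, 0, 1], [1, 1]))
# -> ([0, 0, 0, 1], [1, 1, 1])
```

Under this reading, all three addition orderings (and all three subtraction orderings) write the same result digits; they differ only in when the carry or borrow is announced relative to writing and moving on, which is exactly the behavioral variation the instruction sequences are meant to express.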
Still, the primary goal was to have the network exhibit appropriate behavior when presented with any one of the training set instruction sequences.

This network was operated and trained in much the same manner as the discrete mapping network. Instruction tokens were presented sequentially at the advice input layer, using a localist code, and activity was propagated through the "Plan Network" to produce a plan layer activation vector representing the entire instruction sequence. Activation at the plan layer was then "frozen" and used as an input to the "Arithmetic Network", which then performed the specified procedure on a collection of binary number pairs.[3] Training was conducted as in the discrete mapping network, with error computed after the presentation of each instruction token and backpropagated through recurrent connections for only a single time step.

[3] Typically, all number pairs of up to three digits in length were used to provide training problems.

A large number of experiments were conducted using this architecture and task, varying hidden layer sizes and details of the training regimen. As Figure 4 demonstrates, it was possible to achieve perfect performance on the training set of instruction sequences, but generalization was essentially not achieved.[4] Still, the primary goal of this experiment was attained. The resulting model was capable of performing a number of different versions of addition and subtraction, and it could immediately modify its behavior, without weight adaptation, to match one of these algorithms upon presentation of the appropriate instructions. This experiment has shown that the proposed connectionist framework is sufficient to allow complex temporal behaviors to be modulated by input advice.

Conclusion

As an initial small step towards a comprehensive model of human learning, we have proposed a connectionist framework for "learning by being told".
Our approach views linguistic advice as temporal input streams to a recurrent network, and the operationalization of that advice is seen as motion in the activation space of that network. "Meaningful" instruction sequences are internally represented by stable articulated attractors in that space, and these attractors arise either in the process of learning the instructional language or in later Hebbian modifications resulting from the repeated instantiation of neighboring attractors. By locating instructed learning in activation space, our framework avoids the problems inherent in "weight compilation" approaches, and provides a means for integrating induction and instruction.

In this paper, we have put off an examination of attractor dynamics and have focused, instead, on establishing the ability of connectionist networks to inductively learn an instructional language. We have demonstrated successful instruction following behavior in both a combinatoric mapping task and in a domain involving the performance of systematic procedures. In both domains, networks inductively acquired

[4] The accuracy measurement shown in Figure 4 is a measure of correct actions over time. The displayed 74% generalization accuracy shows that many actions were performed correctly, but it masks the fact that systematic mistakes on the test set instruction sequence kept the network from attaining the complete correct answer for even a single subtraction problem when given this instruction sequence.