A Vision Architecture

Christoph von der Malsburg
Frankfurt Institute for Advanced Studies
malsburg@fias.uni-frankfurt.de

Keywords: Vision, Architecture, Perception, Memory

Abstract: We are offering a particular interpretation (well within the range of experimentally and theoretically accepted notions) of neural connectivity and dynamics and discuss it as the data-and-process architecture of the visual system. In this interpretation the permanent connectivity of cortex is an overlay of well-structured networks, "nets", which are formed on the slow time-scale of learning by self-interaction of the network under the influence of sensory input, and which are selectively activated on the fast perceptual time-scale. Nets serve as an explicit, hierarchically structured representation of visual structure in the various sub-modalities, as constraint networks favouring mutually consistent sets of latent variables and as projection mappings to deal with invariance.

INTRODUCTION

The performance of human visual perception is far superior to that of any computer vision system and we evidently still have much to learn from biology. Paradoxically, however, the functionality of neural vision models is worse, much worse, than that of computer vision systems. We blame this shortfall on the commonly accepted neural data structure, which is based on the single neuron hypothesis (Barlow, 1972). We propose here a radically new interpretation of neural tissue as data structure, in which the central role is played by structured neural nets, which are formed in a slow process based on synaptic plasticity and which can be activated on the fast psychological time-scale. This data structure and the attendant dynamic processes make it possible to formulate a vision architecture into which, we argue, many of the algorithmic processes developed in decades of computer vision can be adapted.
STRUCTURED NETS lthough we live in a three-dimensional world,  biologi!al vision is, to all we know, based on 21.3-dimensional2 representations, that is, two-dimensional views enri!hed with lo!al depth information. he vision modality has many sub-modalities 4 te%ture, !olour, depth, surfa!e !urvature, motion, segmentation, !ontours, illumination and more. ll of these !an naturally be represented in terms of lo!al features tied together into two-dimensional (nets) by a!tive links. *ets are naturally embedded in two-dimensional manifolds and have short-range links between neurons. *eural sheets, espe!ially also the primary visual !orte%, !an support a very large number of nets by sparse lo!al sele!tion of neurons, whi!h are then linked up in a stru!tured fashion $see 5ig. &. 6iven the !ell-number redundan!y in primary visual !orte% $e%!eeding geni!ulate numbers by estimated fa!tors of 78 or 38& there is mu!h !ombinatorial spa!e to define many nets. hese nets are formed by statisti!al learning from input and by dynami! self-intera!tion. 'n this way a distributed memory for lo!al te%ture in the various modalities !an be stored already in retinal !oordinates, that is, in primary visual !orte%.  1#e would like to stress here the !ontrast of this mode of representation to the !urrent paradigm. o !ope with the stru!ture of the visual world, a vision system has to represent a hierar!hy of sub-patterns, (features). 'n standard multi-layer per!eptrons $for an early referen!e see 5ukushima, /98& all features of the hierar!hy are represented by neurons. #hat we are proposing here amounts to repla!ing units as representatives of !omple% features by lo!al pie!es of net stru!ture tying together low-level feature neurons $or (te%ture elements), neurons representing the elementary features that are found in neurophysiologi!al e%periments in primary visual !orte%&. his has a number of de!isive advantages. 
First, structured nets represent visual patterns explicitly, as a two-dimensional arrangement of local texture elements. Second, as alluded to above, large numbers of nets can be implemented on a comparatively narrow neural basis in a combinatorial fashion. Third, partial identities of different patterns are taken care of by partial identity of the representing pieces of net structure. Fourth, a whole hierarchy of features can be represented in a flat structure already in primary visual cortex (a shade of neurophysiological evidence for the lateral connections between neurons surfaces in the form of non-classical receptive fields, see Allman et al. 1985). Fifth, nets that are homeomorphic to each other (i.e., can be put into neuron-to-neuron correspondence such that connected neurons in one net correspond to connected neurons in the other) can activate each other directly, without this interaction having to be taught, see below.

ACTIVATION OF NETS

Once local net structure has been established by learning and self-interaction, the activation by visual input takes the following form (see Fig. 1). The sensory input selects local feature types. Each feature type is (at least in a certain idealization) represented redundantly by a number of neurons with identical receptive fields. Sets of such input-identical redundant neurons form "units". Within a unit there is an inhibitory system inducing winner-take-all (WTA) dynamics (only one or a few of the redundant neurons surviving after a short time). The winners in this process are those neurons that form part of a net, that is, whose activity is supported by lateral, recurrent input. This process of selection of the input-activated neurons that happen to be laterally connected as a net is an important type of implementation of dynamic links: although the connections are actually static, nets are dynamically activated by selection of net-bearing neurons.
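
The selection of net-bearing neurons described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the model used by the author: units are placed on a one-dimensional chain rather than a two-dimensional sheet, each stored net picks exactly one neuron per unit it spans, and within-unit inhibition is modelled as divisive normalization. The names `build_lateral_links` and `select_winners` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Neuron = Tuple[int, int]  # (unit index, index of the redundant neuron inside that unit)


def build_lateral_links(nets: List[Dict[int, int]]) -> Dict[Neuron, Set[Neuron]]:
    """Static lateral connections: within each stored net, link the member
    neurons of neighbouring units (assumption: units lie on a 1-D chain)."""
    links: Dict[Neuron, Set[Neuron]] = {}
    for net in nets:  # a net maps unit index -> its member neuron in that unit
        for unit, idx in net.items():
            for neighbour in (unit - 1, unit + 1):
                if neighbour in net:
                    links.setdefault((unit, idx), set()).add((neighbour, net[neighbour]))
    return links


def select_winners(active_units: List[int], n_redundant: int,
                   links: Dict[Neuron, Set[Neuron]], steps: int = 10) -> Dict[int, int]:
    """Return, for every input-selected unit, the neuron that survives WTA."""
    # Sensory input drives all redundant neurons of a selected unit equally.
    act = {(u, i): 1.0 for u in active_units for i in range(n_redundant)}
    for _ in range(steps):
        # Lateral support: summed activity of connected neurons in selected units.
        support = {n: sum(act.get(m, 0.0) for m in links.get(n, ())) for n in act}
        for u in active_units:
            total = sum(support[(u, i)] for i in range(n_redundant))
            for i in range(n_redundant):
                # Within-unit inhibition, modelled here as divisive normalization.
                act[(u, i)] = support[(u, i)] / total if total > 0 else 1.0 / n_redundant
    return {u: max(range(n_redundant), key=lambda i: act[(u, i)]) for u in active_units}


nets = [{0: 0, 1: 1, 2: 2},   # net A spans units 0-2
        {2: 0, 3: 1, 4: 0}]   # net B spans units 2-4
links = build_lateral_links(nets)
print(select_winners([0, 1, 2], 3, links))  # {0: 0, 1: 1, 2: 2}
print(select_winners([2, 3, 4], 3, links))  # {2: 0, 3: 1, 4: 0}
```

The connections never change between the two calls. Unit 2 nevertheless contributes a different winner depending on which other units the input selects, which is the sense in which static connections behave as dynamic links.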
For another type of dynamic link, see below.

Local pieces of net structure can be connected like a continuous mosaic into a larger net. This may be compared to the image-compression scheme in which the texture within local blocks of an image is identified with a code-book entry (only the identifying number of the code-book entry being transmitted).

GENERATION OF NETS

Net structure in primary visual cortex is shaped by two influences, input statistics and self-interaction. One may assume that the genetically generated initial structure has random short-range lateral connections. In a first bout of organization, receptive fields of neurons are shaped by image statistics, presumably under the influence of a sparsity constraint (Olshausen and Field, 1996). In this period the WTA inhibition may not yet be active, letting neurons in a unit develop the same receptive field. Then, the network becomes sensitive to the statistics of visual input within somewhat larger patches (the scale being set by the range of lateral connections) and pieces of net structure are formed by synaptic plasticity strengthening connections between neurons that are often co-activated and WTA-selected, while net structure is optimized by the interplay between (spontaneous or induced) signal generation and Hebbian modification of synaptic strengths under the influence of a sparsity constraint.

MODALITIES

Different sub-modalities (texture, colour, depth, motion, ...) form their own systems of net structure, that is, representations of local patterns that are statistically dominant in the sensory input. Each modality is invariant to the others and has its own local feature space structure with its own dimensionality, three for colour, two for in-plane motion, one for (stereo-)depth, perhaps 40 for grey-level texture and so on. Different values of a given feature dimension are represented by different neurons, or rather units containing a number of value-identical neurons.
Different value-units of the same feature dimension, forming a "column", inhibit each other, again in WTA fashion.

LATENT VARIABLES

Several units standing for different values of a sub-modality feature may be simultaneously active to varying degree. They may be seen as representing different hypotheses as to the actual value of the feature dimension. These activities thus represent heuristic uncertainty, which during the perceptual process needs to be reduced to certainty. In distinction to computer graphics, realized as a deterministic process proceeding from definite values of all involved variables determining a scene, vision is an inverse problem, in which these values first have to be found in a heuristic process that is inherently non-deterministic. The initially unknown quantities are called latent variables. The task of the perceptual process is the iterative reduction of the heuristic uncertainty of latent variables ("perceptual collapse"), which is possible by the application of consistency constraints and known memory patterns.

Figure 1: Combinatorially many nets co-exist within a cortical structure (schematic). Units (vertical boxes) are sets of neurons with identical receptive field. Visual input selects a sparse subset of units (vertical arrows). Neurons within units have WTA dynamics. The winner neurons are those that are supported by lateral connections from neurons in other selected units. Lateral connections form a net structure. A neuron can be part of several nets, thus, many nets can co-exist.

CONSISTENCY CONSTRAINTS

Whereas the winner neurons within units are selected by the pattern-representing lateral connections (which may be called "horizontal nets"), thus factoring in memory patterns, the winner units inside feature columns are selected by another kind of net structure, "vertical nets", which are formed by connections running between value units in different sub-modalities.
A vertical net ties together feature value units that are consistent with each other, consistent in the sense that signals arriving at a unit over alternate pathways within the net, as well as sensory signals, agree with each other. Like the net structures representing memory for local feature distribution, consistency nets are established by a combination of learning from sensory input and self-interaction.

INTRINSIC COORDINATE DOMAIN

So far, we have spoken of structure in primary visual cortex, which is dominated by retinal coordinates, that is, image location changes with eye movements. All local texture representation must therefore be repeated for all positions. (This is possible only for a limited number of local texture patches, comparable to the sizes of codebooks in image-compression schemes.) In order to store and represent larger chunks of visual structure, such as for recurring patterns like familiar objects or abstract whole-scene lay-outs, there is another domain, see Figure 2, presumably infero-temporal cortex, in which neurons, units and columns refer to pattern-fixed, intrinsic coordinates. (For the structure of fibre projections between the retinal-coordinate and the intrinsic-coordinate domains see below.) The intrinsic domain can be much more parsimonious than the retinal one in not needing to repeat net structures over the whole visual field, so that it can afford to spend more redundancy in each intrinsic location to be able to store a very large number of pattern-spanning nets. The intrinsic domain also contains sub-structures for the representation of sub-modalities, and again there are nets for the representation of mutual constraints between the sub-modalities. Thus, the two domains are qualitatively the same but quantitatively very different.

DYNAMIC MAPPINGS

The two domains with retinal and intrinsic coordinates are connected by dynamic
point-to-point and feature-to-feature fibre projections that can be switched as quickly as retinal images move, so that correspondence between homeomorphic structures is maintained. This switching is achieved with the help of "control units" (Anderson and Van Essen, 1987). These can be realized as neurons whose outgoing synapses are co-localized with the synapses of the projection fibres they control at dendritic patches of the target neurons. If those patches have threshold properties, the projection fibres can transmit signals only if the controlling fibre is also active. The hypothesis that dendritic patches with non-linear response properties act as decision units was proposed long ago, see for instance (Polsky et al. 2004).

Figure 2: Overview of the Architecture. Each plane corresponds to one sub-modality, on the left side in retinal coordinates (primary visual cortex), on the right side in pattern-intrinsic coordinates (infero-temporal cortex). A segment in the retinal-coordinate domain is projected by dynamical mappings to the intrinsic-coordinate domain. Constraint interactions that help to single out mutually consistent latent variable values run between corresponding points (which refer to the same point on a surface within the visual scene).

Control neurons may, like any other neurons, receive synaptic inputs (e.g., by re-afferent signals that can in this way switch projection fibres such as to compensate an intended eye movement), but they may also get excited through their control fibres. We assume that these carry signals that are proportional to the similarity of the signal pattern in the controlled projection fibres on the one hand and the signal pattern in the target neurons on the other.
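
As an illustration of the gating and similarity signals just described, the following toy sketch may help. It makes several assumptions that are not in the text: both domains are one-dimensional, there is one control unit per candidate shift of the retinal pattern, "similarity" is taken to be cosine similarity, and competition between control units is a hard winner-take-all. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (an assumed measure; the text only says 'similarity')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def control_excitation(retina: Sequence[float], target: Sequence[float]) -> List[float]:
    """One control unit per shift s. Each is excited in proportion to the match
    between the signals on the fibres it controls and the target-neuron pattern."""
    n = len(target)
    return [similarity(retina[s:s + n], target) for s in range(len(retina) - n + 1)]


def winner_take_all(excitation: Sequence[float]) -> List[float]:
    """Hard competition among control units: only the best-matching one stays active."""
    best = max(range(len(excitation)), key=lambda s: excitation[s])
    return [1.0 if s == best else 0.0 for s in range(len(excitation))]


def gated_projection(retina: Sequence[float], n_target: int,
                     control: Sequence[float]) -> List[float]:
    """Signal arriving in the intrinsic domain through the gated fibres."""
    out = [0.0] * n_target
    for s, c in enumerate(control):
        if c > 0.0:  # threshold property: a fibre transmits only if its control fibre is active
            for p in range(n_target):
                out[p] += c * retina[s + p]
    return out


model = [1.0, 3.0, 2.0]  # pattern held by the target neurons
for retina in ([0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0],
               [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0]):
    control = winner_take_all(control_excitation(retina, model))
    print(control.index(1.0), gated_projection(retina, len(model), control))
# 2 [1.0, 3.0, 2.0]
# 1 [1.0, 3.0, 2.0]
```

When the retinal image moves by one position, a different control unit wins, while the pattern delivered to the intrinsic domain stays the same.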
(Processes of control neurons would thus transmit and receive signals and should correspondingly be called neurites.) Different control units stand for different transformation parameters (relative position, size or orientation of connected sets of neurons in the two domains) and may be responsible for connecting a local patch in one domain to a local patch in the other. The set of control units for different transformation parameters for a given patch in the target domain forms a column with WTA dynamics and represents the transformation parameters as a latent variable. In order to cover deformation, transformation parameters may change slowly from point to point in the target domain, and an entire coherent mapping is represented by a net of laterally connected control units (again, units contain a number of redundant neurons to give leeway for many nets to be stored side-by-side without mutual interference).

SEGMENTATION

Vision is organized as a sequence of attention flashes. During each such flash, analysis of sensory input is restricted to a segment (a coherent chunk of structure), e.g., to the region in retinal space that is occupied by the image of an object. Like perception in general, segmentation is a chicken-and-egg problem, segmentation needing recognition, recognition needing segmentation. Certain patterns indicative of a coherent structure are already available in primary cortex, such as the presence of coherent fields of motion, depth or colour, or familiar contour shapes. Others, however, need reference to patterns stored in the intrinsic coordinate domain. For this to happen, two types of latent variables have to be made to converge first: the transformation parameters identifying the segment's location and size in the retinal-coordinate domain, and the identity of structure of a fitting model in memory.
We have modelled this process for the purpose of object recognition, which we tested successfully on a benchmark, observing rather fast convergence (reference suppressed for anonymity). In general, the intrinsic representation of the segment cannot be found in memory but needs to be assembled from partial patterns (just as the extended texture in primary cortex is assembled from local texture patches). Conceiving of objects as composites of known elementary shapes is a well-established concept (Biederman, 1987). This process of assembly takes place in a coordinated fashion in the different sub-modality modules. The de-composition and re-composition of sensory patterns is the basis for a very parsimonious system of representing a large combinatorial universe of surfaces of different shape, colouring and texture under a range of illumination conditions and in different states of motion.

RECOGNITION

The actual recognition process of a pattern in the retinal coordinate domain against a pattern in the intrinsic domain may be seen as a process of finding a homeomorphic projection, or graph matching, performed by many control neurons simultaneously checking for pattern similarity while competing with alternate control neurons and cooperating with compatible ones (compatible in the sense of forming together a net structure). Recognition by graph matching has a long tradition, see for instance (Kree and Zippelius, 1988) or (Lades et al., 1993). A related approach is that of Arathorn (2002), who has pointed out the value of the information inherent in the shape of the mapping, which is produced as a by-product.

PREDICTION

Once this process has converged for a moving pattern and its motion parameters have also been determined, the system can set the fibre projection system in motion to track the object and send short-term predictions of sensory input down from the model in the intrinsic domain to the primary cortex.
Successful prediction of sensory input on the basis of a constructed dynamic model is the ultimate basis for our confidence in perceptual interpretations of the environment, and is very important for the adjustment of constraint interactions.

ONGOING WORK AND NEXT STEPS

We are at present working on a simple version of the architecture, implementing the modalities grey-level (the input signal), surface reflectance, illumination, depth, surface orientation and shading, all realized in image coordinates. We are manually creating constraint interactions between them and a small number of lateral connectivity nets. The goal is to model the perceptual collapse on simple sample images. To suppress the tendency of the system to break up spontaneously into local domains (generating spurious latent-variable discontinuities) we are working with coarse-to-fine strategies. As we are embedded in a lab that is engaged in an effort to build a computer vision system by methods of systems engineering, we plan to adapt more and more known vision algorithms into the architecture.

CONCLUSION

All we are proposing is to re-interpret neural tissue and dynamics such as to see them as the natural basis for the structures and processes that are required for vision. The essential point is the assumption that neural tissue is an overlay of well-structured "nets", which are characterized by sparsity in terms of connections per neuron and consistency of different pathways between pairs of neurons. Particular supporting assumptions are