A Natural Language Instruction System for Humanoid Robots Integrating Situated Speech Recognition, Visual Recognition and On-line Whole-Body Motion Generation

Ee Sian Neo, Takeshi Sakaguchi and Kazuhito Yokoi
Intelligent Systems Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8568, Japan.
Email: {rio.neo, sakaguchi.t, Kazuhito.Yokoi}@aist.go.jp

Abstract—This paper presents an integrated on-line operation system that enables a human user to operate humanoid robots using natural language instructions. The paper makes two major contributions. First, we present an integrated behavior system that triggers behaviors according to speech commands by recognizing objects, triggering actions and generating whole-body motions on-line. Second, we present a situated natural language instruction system that not only acts according to speech commands, but also responds to the direction of the sound source. A system that understands natural language instructions and acts accordingly requires the integration of knowledge representation, perception, decision making and on-line motion generation technologies. This paper tackles this integration problem by addressing the representation of knowledge about objects and actions in a way that facilitates natural language instructions for tasks in indoor human environments. We propose a taxonomy of objects in indoor human environments and a lexicon of actions in this preliminary attempt to construct a reliable and flexible natural language instruction system. We report on the implementation of the proposed system on the humanoid robot HRP-2, which is able to locate auditory sources and receive natural language instructions from a user within 2 meters using an 8-channel microphone array connected to a speech recognition embedded system on board the robot.

This work is partially supported by the Development Project for the Common Basis of Next-Generation Robots of the New Energy and Industrial Technology Development Organization (NEDO), and a Grant-in-Aid for Young Scientists (B), KAKENHI (19700199), of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.

Fig. 1. Scenes from one of our demonstrations using the proposed system, corresponding to the speech command "Fetch me a drink": opening a fridge to take a drink, and giving the drink.

I. INTRODUCTION

Methodologies that enable a human user to operate robots to perform desired tasks in an intuitive and flexible manner are among the key technologies for realizing truly useful robotic systems. For robots working in human environments, it is desirable to have robots that can communicate with humans using natural language, especially for people who are not robotics specialists.

From the mobile robot SHAKEY in the 1960s [1] to the recent humanoid robots ASIMO [2] and QRIO [3], natural language interfaces for robots have long been studied in the robotics community. Matsui et al. reported on the office mobile robot Jijo-2, which can perform services such as guiding visitors, delivering messages, managing office members' schedules and arranging meetings [5]. Laengle et al. constructed a dual-arm mobile robot, KAMRO, which was augmented with a natural language interface for specifying assembly tasks; spatial relations between components to be assembled on the robot's workbench are identified by an overhead camera, enabling a complete world model to be maintained [6].
McGuire et al. reported on a hybrid architecture that combines a Bayesian network, neural networks and finite state machines to interconnect visual attention with verbal information about objects for instructing grasping tasks through man-machine interaction [4]. Bischoff et al. reported on the service robot HERMES, which is built on a situation-oriented, behavior-based control architecture integrating vision, touch and natural language instruction [8]. For interactive programming of robots, Dillman et al. presented an interactive teaching technique with verbal and gestural channels [13]. Miura et al. developed a task teaching framework that describes what knowledge is necessary to achieve a task and reported on an interactive method for teaching tasks to a mobile robot [10].

Concerning computational models of dialogue systems, Thorisson presented a model of real-time task-oriented dialogue skills that includes perception and interpretation of user behavior, and demonstrated a humanoid capable of real-time face-to-face dialogues with human users using hand gestures, facial expressions, body language and meaningful utterances [9]. Inamura et al. proposed using experience shared between human beings and robots, with a stochastic experience representation, to tackle the problem of vague instructions based on the certainty factor of the candidate and the degree of localization of the object [7]. Deb Roy introduced a theoretical framework for grounding language that combines concepts from semiotics and schema theory to develop a holistic approach to linguistic meaning [12]. Mavridis et al. reported on a grounded situation model that allows objects on a table to be referenced, acted upon and visualized in mental imagery [11].

Realizing systems that reliably understand natural language instructions for humanoid robots requires the integration of technologies from different fields. Knowledge representation for tasks and actions, as well as interaction technologies between robots, environments and human users, needs to be addressed not only partially, but in an integrated and holistic manner.

This paper presents an integrated approach to realizing a natural language instruction system that enables a human user to operate humanoid robots in human environments. Fig. 1 depicts some scenes using the proposed system during one of our public demonstrations. The contributions of this paper are the following:

• We present an integrated behavior system that is able to trigger behaviors according to speech commands, by recognizing objects, triggering actions and generating whole-body motions on-line.

• We present a situated natural language instruction system that is able not only to act according to speech commands, but also to respond to the direction of the sound source.

This paper first presents the challenges in creating a natural language instruction system for humanoid robots serving in human environments. We then propose our approach to tackling these challenges from a holistic perspective by addressing the representation of knowledge about objects and actions in a way that facilitates natural language instructions for tasks in indoor human environments. We propose a taxonomy of objects in indoor human environments and a lexicon of actions in this preliminary attempt to construct a reliable and flexible natural language instruction system.
We report on the implementation of an operation system using RT middleware for the humanoid robot HRP-2, which is able to locate auditory sources and receive natural language instructions from a user within 2 meters using an 8-channel microphone array connected to a speech recognition embedded system [17] on board the robot.

II. CHALLENGES IN CREATING A NATURAL LANGUAGE INSTRUCTION SYSTEM FOR HUMANOID ROBOTS SERVING IN HUMAN ENVIRONMENTS

Apart from the general difficulties of constructing natural language interpretation systems, the following challenges need to be addressed for humanoid robots serving in human environments:

• Humanoid robots are mobile platforms moving in dynamically changing environments, so how to ground referents in such environments is crucial. To instruct a robot to perform a task in a physical environment, the referents in the instruction must be grounded to the physical environment in a reference framework common to both the robot and the human user. A systematic environment description framework is needed; without one, the robot would have to search the entire environment for a referent, which is not feasible for a service system.

• Humanoid robots are mechanisms with many degrees of freedom and multiple operational points, so natural language instructions can be more ambiguous for humanoid robots. Consider the instruction "push that can": a humanoid robot with multiple degrees of freedom has to judge whether to use its right hand, left hand, right foot, left foot or even its chest to push the can. Apart from selecting the operational points used to execute the command, the robot has to select the balance criterion, the CoM position, the stepping area and the degrees of freedom to be used to perform the task (a sketch of this parameter selection is given at the end of this section).

In addressing the first challenge, this paper first introduces an object taxonomy that categorizes objects in indoor human environments. Using the taxonomy, referents are grounded to physical objects in the world by vision recognition modules on board the robot, which consist of a model-based object recognition engine. After discussing the taxonomy of objects, we present a lexicon of actions and behaviors to address the second challenge of grounding symbolic commands to on-line whole-body motion generation.
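To make the second challenge concrete, the following minimal sketch (our own illustration, not the implementation described in this paper) shows one way the execution parameters enumerated above might be selected for an ambiguous command such as push(can). All names, thresholds and the selection heuristic itself are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical execution parameters a humanoid must fix before acting
# on an ambiguous command such as "push that can" (illustration only).
@dataclass
class ExecutionParameters:
    operational_point: str          # e.g. "right_hand", "left_foot", "chest"
    balance_criterion: str          # e.g. "zmp_in_support_polygon"
    com_height: float               # commanded CoM height in meters
    stepping_area: Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max) in m
    joints_used: List[str] = field(default_factory=list)

def select_parameters(command: str, object_height: float) -> ExecutionParameters:
    """Toy heuristic: pick an end-effector and balance settings from the
    object height alone; a real system would also consider reachability,
    the current posture and the surrounding free space."""
    if object_height < 0.2:                     # object near the floor
        op, com, joints = "right_foot", 0.55, ["right_leg"]
    elif object_height < 1.2:                   # tabletop height
        op, com, joints = "right_hand", 0.70, ["right_arm", "torso"]
    else:                                       # above shoulder height
        op, com, joints = "left_hand", 0.80, ["left_arm", "torso", "legs"]
    return ExecutionParameters(
        operational_point=op,
        balance_criterion="zmp_in_support_polygon",
        com_height=com,
        stepping_area=(-0.1, 0.3, -0.2, 0.2),
        joints_used=joints,
    )

if __name__ == "__main__":
    print(select_parameters("push(can)", object_height=0.75))
```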
III. GROUNDING NATURAL LANGUAGE REFERENTS TO OBJECTS IN THE WORLD: A TAXONOMY OF WORLD OBJECTS

Consider the way we give instructions to someone to perform a certain task: the instructions are usually of the form "go down the corridor; when you see the door of room 101A, open the door and enter", or "the drink is in the fridge", or "you will find the trash box next to the fridge". For the objects in such instructions to be semantically meaningful, the robot should be equipped with physical information that helps it ground these referents to real entities in the world. We seek to solve this grounding problem by first categorizing the objects in the environment and then using the categorization to ground objects in a systematic, hierarchical way.

We classify objects in everyday indoor environments into the following categories:

• Stationary objects: non-movable objects whose locations with respect to the environment do not change, for example a wall, a cabinet fixed to the wall, a door or a window. The locations of these objects can be described using the map of the environment.

• Semi-stationary objects: heavy objects that require an excessive force to change their locations with respect to the environment, for example a fridge, a dining table or a cabinet.

• Non-stationary objects: objects whose locations change easily under external forces, for example a chair, a cup or a can of drink.

• Moving objects: objects that can move by themselves, for example living beings such as animals and humans, or artifacts such as a remote-controlled toy car or a mobile robot.

Using this categorization, the robot grounds its own location to the environment by locating stationary objects, in a fashion similar to how landmarks are used in typical localization systems for mobile robots. Although it is practically impossible for a robot developer to have full knowledge of the environment where the robot will be used, information about stationary objects can easily be provided using the construction map of the building; the positions of walls, doors, windows and staircases can thus be obtained and used for grounding the robot's location to the environment. We ground semi-stationary objects using information that describes the locational relation between the semi-stationary objects and the stationary objects in the world. For example, we describe the location of a fridge by indicating that the fridge is about 10 centimeters to the left of the kitchen door. Non-stationary objects are grounded using information about stationary or semi-stationary objects; for example, we can describe a cup as being on the table, or a can of drink as being inside the fridge. Moving objects can be described using the same method; however, grounding moving objects requires more advanced techniques, such as high-speed visual recognition algorithms or hardware.
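The hierarchical grounding just described can be sketched as follows. This is our own minimal illustration, not the paper's code: a category tag per object and a resolver that chains relative descriptions until it reaches a stationary object whose pose is known from the building map. The class names, the relative_to field and the simple offset addition are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional, Tuple

class Category(Enum):
    STATIONARY = auto()        # wall, door, window: known from the building map
    SEMI_STATIONARY = auto()   # fridge, dining table, cabinet
    NON_STATIONARY = auto()    # chair, cup, can of drink
    MOVING = auto()            # human, pet, mobile robot

@dataclass
class WorldObject:
    name: str
    category: Category
    map_pose: Optional[Tuple[float, float, float]] = None   # only stationary objects carry a map pose
    relative_to: Optional[str] = None                        # name of the reference object
    offset: Tuple[float, float, float] = (0.0, 0.0, 0.0)     # offset from the reference, in meters

def expected_location(name: str, kb: Dict[str, WorldObject]) -> Tuple[float, float, float]:
    """Resolve an object's expected position by walking the chain of
    relative descriptions (drink -> fridge -> kitchen door -> map)."""
    obj = kb[name]
    if obj.category is Category.STATIONARY:
        return obj.map_pose
    base = expected_location(obj.relative_to, kb)
    return tuple(b + o for b, o in zip(base, obj.offset))

# Example: "the fridge is about 10 cm to the left of the kitchen door,
# the drink is inside the fridge".
kb = {
    "kitchen_door": WorldObject("kitchen_door", Category.STATIONARY, map_pose=(4.0, 2.0, 0.0)),
    "fridge": WorldObject("fridge", Category.SEMI_STATIONARY, relative_to="kitchen_door", offset=(0.0, 0.10, 0.0)),
    "drink": WorldObject("drink", Category.NON_STATIONARY, relative_to="fridge", offset=(0.0, 0.0, 0.9)),
}
print(expected_location("drink", kb))   # (4.0, 2.1, 0.9)
```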
In our system, every object possesses the following information in the robot's knowledge base:

• Name: the common name used by the user as well as within the robot's system.

• Category: one of the stationary, semi-stationary, non-stationary or moving categories.

• Geometrical model: a 3D CAD model of the object, used by the model-based visual recognition system. This information is not always available: for manufactured items such as a fridge, a table or a cabinet, CAD models can be obtained relatively easily from manufacturers, but they are difficult to obtain for natural objects such as fruits, vegetables and meat, and object categories such as a plastic bag or a piece of cloth require a dynamic representation.

• Color information: a color histogram for visual color recognition.

• Weight information: the weight of the object.

• Handling information: the standing position and handling position in the object frame, and compliance information.

• Visibility information: the distance at which the object can be recognized by the visual recognition system; this distance is large for a big object and small for a small object.

• Relative locational information: the expected location with respect to either the map, stationary objects or semi-stationary objects.

• Temporal information: the time stamp of the last time the object was located.

• Location: the position and orientation in the robot's reference frame when the object was last located.

With such an object description framework, which extends the way information is represented in artificial intelligence systems, the information for any particular object can be grounded by filling in the entries through direct teaching by a human or through the robot's perceptual processes, and can be retrieved by querying the system for information about objects. Fig. 2 shows an example of the geometrical model of a fridge and the recognition of the fridge using the model, and Fig. 3 shows the geometrical model of a mug and the recognition result of a red mug on a shelf using the model of the mug together with color information.

Fig. 2. Example of the geometrical model of a fridge and the recognition result using the model, overlaid on the captured image.

Fig. 3. Example of the geometrical model of a mug and the recognition result of a red mug using the geometrical model coupled with color information, overlaid on the captured image.

Fig. 4 shows some examples of the standing position and grasping frame used as handling information for various objects in simulation, after the object has been recognized and grounded to the robot's frame.

Fig. 4. The standing position and grasping frame for various object-based operations: approaching a table, opening a fridge and continuing to take things with the left hand, taking a can on a table, and taking a can on the floor (the object frame, the waist frame during standing, and the grasping frame are shown for each case).
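As a concrete illustration of the object description framework above, a knowledge-base entry might be represented as follows. This is a minimal sketch of our own; the field names mirror the list above, while the concrete types (a file path for the CAD model, a small histogram, tuple poses) and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class HandlingInfo:
    standing_pose: Tuple[float, float, float]   # where the robot stands, in the object frame
    grasp_pose: Tuple[float, float, float]      # handling position in the object frame
    compliance: float                           # assumed scalar compliance setting

@dataclass
class ObjectEntry:
    name: str                                   # common name shared by user and robot
    category: str                               # stationary / semi-stationary / non-stationary / moving
    cad_model: Optional[str] = None             # path to the 3D model for model-based recognition
    color_histogram: List[float] = field(default_factory=list)
    weight_kg: Optional[float] = None
    handling: Optional[HandlingInfo] = None
    visibility_range_m: float = 2.0             # distance at which recognition is reliable
    relative_location: Optional[Tuple[str, Tuple[float, float, float]]] = None  # (reference object, offset)
    last_seen_time: Optional[float] = None      # time stamp of the last successful localization
    last_seen_pose: Optional[Tuple[float, float, float]] = None  # pose in the robot's reference frame

# Example entry for a can of drink kept in the fridge door pocket (values are illustrative).
can = ObjectEntry(
    name="can_of_drink",
    category="non-stationary",
    cad_model="models/can_350ml.wrl",
    color_histogram=[0.7, 0.2, 0.1],
    weight_kg=0.37,
    handling=HandlingInfo(standing_pose=(0.45, 0.0, 0.0), grasp_pose=(0.0, 0.0, 0.06), compliance=0.3),
    visibility_range_m=1.5,
    relative_location=("fridge", (0.0, 0.25, 0.9)),
)
```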
IV. GROUNDING SYMBOLIC COMMANDS INTO ON-LINE WHOLE-BODY MOTION CONTROL

To derive a mapping from natural language instructions to robot actions, the robot requires both a set of atomic actions that allow it to perform the required behaviors and the behaviors necessary to satisfy a set of tasks. We define behaviors as functions that map a set of inputs, including sensor information as well as derived state information, to a set of atomic actions. Actions are categorized into:

• Perceptive actions: for example visual perceptive actions such as recognize(object), speech perceptive actions such as recognize(words from a speech library), and sound source perceptive actions such as detect(sound source).

• Body actions: for example walkto(location) and reach(object).

• Speech actions: for example say("hello").

Each behavior can be defined as a finite state machine with a start state and a terminal state. Behaviors can be used as fundamental building blocks to satisfy the required tasks. Fig. 5 shows how the task of "Bring me a drink from the fridge" can be decomposed into sub-tasks, behaviors and actions.

Fig. 5. The task of "Bring me a drink from the fridge".

We have reported on the construction of a behavior-level operation system in [19]. Here we discuss how basic actions can be used to construct behaviors, and how parameters such as the standing position differ depending on the combination of basic actions. Consider a robot given the following three instructions:

• (a) go to the fridge
• (b) go to the fridge and look for a drink
• (c) go to the fridge and take out a drink

To achieve (a), the basic action used is walkto(fridge). For (b), the basic actions used are:

• walkto(fridge),
• reach(fridge handle, righthand),
• touch(fridge handle, righthand),
• pull(fridge handle, righthand),
• lookat(fridge door pocket),
• locate(drink).

For (c), the basic actions used are those used for (b) plus:

• reach(drink, lefthand),
• touch(drink, lefthand),
• liftup(drink, lefthand),
• retrieve(lefthand),
• push(fridge handle, righthand).

In all three cases, the first action to be executed is going to the fridge. However, (a) would be correctly interpreted by walking to anywhere in front of the fridge, whereas in (b) the agent must infer a proper standing position from which it can open the fridge door and look for a can inside. In (c), in addition to standing at a position from which it can open the fridge door and look for a can, the robot has to keep the right hand in place while taking the can with the left hand. Fig. 6 shows the area of standing positions for all three cases, obtained by searching the stepping area for these actions.

Fig. 6. Right foot standing area for the task of holding the fridge door handle with the right hand and taking an object in the fridge with the left hand (top and side views of taking a drink with the left hand while keeping the fridge door open). In the foot placement area, white marks positions from which the right hand can reach to open the door, blue marks positions that are also good for seeing into the fridge, and red marks positions that additionally allow the left hand to reach a can in the door compartment.

Even for the same behavioral command, such as "pick up an object", the component basic actions differ depending on the situation of the object and the robot. For example, for the behavior of picking up an object on a table (Fig. 7), the basic actions used are walkto(object), reach(object, righthand), touch(object, righthand) and liftup(object, righthand). To pick up an object on the floor, the actions squat() and standup() are added to the basic action list so that the robot can reach the object on the floor (Fig. 8).

Fig. 7. The sequence of motor actions triggered to pick up a phone on the table: goto(phone), reach(phone, righthand), touch(phone, righthand), grasp(phone, righthand), liftup(phone, righthand).

Fig. 8. The sequence of motor actions triggered to pick up a can on the floor: goto(can), squat(), reach(can, righthand), touch(can, righthand), grasp(can, righthand), liftup(can, righthand), standup().
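The situation-dependent composition of basic actions described above can be sketched as follows. This is our own illustration using the action names that appear in the text and figures (walkto, reach, touch, grasp, liftup, squat, standup); the support_height threshold and the string-based action representation are assumptions.

```python
from typing import List

def pick_up(obj: str, hand: str, support_height: float) -> List[str]:
    """Compose the basic-action sequence for 'pick up <obj>'.
    For an object near the floor, squat()/standup() are inserted around
    the reaching phase; for an object on a table they are not needed."""
    actions = [f"walkto({obj})"]
    on_floor = support_height < 0.2            # assumed threshold in meters
    if on_floor:
        actions.append("squat()")
    actions += [
        f"reach({obj}, {hand})",
        f"touch({obj}, {hand})",
        f"grasp({obj}, {hand})",
        f"liftup({obj}, {hand})",
    ]
    if on_floor:
        actions.append("standup()")
    return actions

print(pick_up("phone", "righthand", support_height=0.7))   # phone on the table
print(pick_up("can", "righthand", support_height=0.0))     # can on the floor
```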
V. IMPLEMENTATION ON HUMANOID ROBOT HRP-2

A. Hardware System

For an autonomous humanoid robot, the control module, information processing modules, batteries and mechanical parts all need to be built into the robot, so the number and performance of the CPUs that can be installed on board are limited. For this reason, visual processing is usually performed on board the robot, because it is difficult to provide the high-bandwidth communication needed for high-quality on-line visual information, whereas speech information has a relatively low bandwidth and its recognition is usually performed remotely [15]. Hara et al. introduced a robust speech interface based on audio and video systems on board a humanoid robot [14].

In order to have all processing modules for visual recognition, on-line whole-body motion generation and speech recognition on board the robot, we adopted a speech recognition hardware module developed by NEC within the Development Project for the Common Basis of Next-Generation Robots (Development of Speech Recognition Device and Module) of NEDO [17]. The speech recognition module is built around an application processor, the MP211, produced by NEC Electronics. The processor has three ARM9 CPUs, one DSP and 128 MB of main memory in a single package. The module consists of two printed circuit boards (PCBs), each 55 mm by 100 mm: the MP211 processor and its connectors are integrated on one PCB (the CPU board), and 16 channels of synchronous audio input are integrated on the other (the Audio board). The module processes speech input from an 8-channel microphone array connected via the Audio board, and provides functions for speech recognition, large-vocabulary continuous speech recognition, noise cancellation, sound source direction detection and speech synthesis [17].

In our framework, 2 of the 8 microphone channels are used for speech recognition with noise cancellation, and 4 of the channels are used for sound source direction detection. Speech acts are generated using the speech synthesis function via a speaker mounted in the head of the robot.

The overview of the software system is shown in Fig. 10. The distributed server system consists of the following components:

• Knowledge Management Module: stores the databases of objects, behaviors and actions as the object library, behavior library and action library. The information from this module is used by the speech command interpreter, the Speech Recognition Module and the Visual Recognition Module.

• Speech Recognition Module: provides the functions for speech recognition, large-vocabulary continuous speech recognition, noise cancellation, sound source direction detection and speech synthesis.

• Visual Recognition Module: the implementation of the vision-based recognition module. It has access to the four cameras on board the robot and provides raw image streams, stereo vision information and object recognition results. It was implemented using the Volumetric Versatile Vision system [16], which has a segment-based object recognition engine with an expandable library of objects.

• Behavior Module: implemented on top of the On-line Motion Generation Module, which uses the Whole Body Motion Generator as its building block. This module receives commands from the Speech Recognition Module to trigger behaviors. It sends commands to the Visual Recognition Module to trigger recognition processes, and receives information on recognized objects from the Visual Recognition Module and information on the robot's condition from the On-line Motion Generation Module.

• On-line Motion Generation Module: the motion-level controller, which generates whole-body motions satisfying the desired operational point trajectories, momentum and ZMP to maintain balance [18].

The Speech Recognition Module was implemented using OpenRTM [17], while the other modules were implemented as CORBA servers.
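The data flow among these modules can be sketched as below. This is a plain-Python illustration of the coordination described above, not the actual OpenRTM/CORBA interfaces; all class names, method names and the example command mapping are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SpeechEvent:
    command: Optional[str]          # recognized command text, e.g. "fetch me a drink"
    source_direction: float         # estimated sound-source direction in degrees

class KnowledgeManagement:
    """Stores the object, behavior and action libraries (sketch)."""
    def behavior_for(self, command: str) -> str:
        return {"fetch me a drink": "fetch(drink)"}.get(command, "say('I did not understand')")

class VisualRecognition:
    def recognize(self, obj: str) -> Tuple[float, float, float]:
        # Placeholder: would return the object pose from model-based recognition.
        return (1.2, 0.3, 0.9)

class OnlineMotionGeneration:
    def execute(self, action: str) -> None:
        print("executing", action)

class BehaviorModule:
    """Receives speech events, queries vision, and drives the motion generator."""
    def __init__(self):
        self.kb = KnowledgeManagement()
        self.vision = VisualRecognition()
        self.motion = OnlineMotionGeneration()

    def on_speech(self, event: SpeechEvent) -> None:
        # Turn toward the sound source before acting on the command.
        self.motion.execute(f"lookat(direction={event.source_direction})")
        if event.command:
            behavior = self.kb.behavior_for(event.command)
            pose = self.vision.recognize("drink")
            self.motion.execute(f"{behavior} at {pose}")

BehaviorModule().on_speech(SpeechEvent(command="fetch me a drink", source_direction=30.0))
```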
VI. EXPERIMENTAL STUDIES

We have conducted several experiments using the proposed system. We have a repository of 72 geometrical models for the visual recognition of 32 objects and a repertoire of 27 behaviors constructed using 11 kinds of body actions.