
Student mental state inference from unintentional body gestures using dynamic Bayesian networks

J Multimodal User Interfaces (2010) 3: 21–31
DOI 10.1007/s12193-009-0023-7

ORIGINAL PAPER

Abdul Rehman Abbasi · Matthew N. Dailey · Nitin V. Afzulpurkar · Takeaki Uno

Received: 7 May 2009 / Accepted: 9 November 2009 / Published online: 5 December 2009
© OpenInterface Association 2009

A.R. Abbasi, Asian Institute of Technology, Mail Box No. 1192, P.O. Box 4, Klong Luang, Pathumthani 12120, Thailand. Present address: Advanced Computing Lab, KINPOE, P.O. Box 3183, Karachi, Pakistan. e-mail: qurman2000@gmail.com
M.N. Dailey · N.V. Afzulpurkar, Asian Institute of Technology, P.O. Box 4, Klong Luang, Pathumthani 12120, Thailand. e-mail: mdailey@ait.ac.th, nitin@ait.ac.th
T. Uno, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. e-mail: uno@nii.ac.jp

Abstract  Applications that interact with humans would benefit from knowing the intentions or mental states of their users. However, mental state prediction is not only uncertain but also context dependent. In this paper, we present a dynamic Bayesian network model of the temporal evolution of students' mental states and causal associations between mental states and body gestures in context. Our approach is to convert sensory descriptions of student gestures into semantic descriptions of their mental states in a classroom lecture situation. At model learning time, we use expectation maximization (EM) to estimate model parameters from partly labeled training data, and at run time, we use the junction tree algorithm to infer mental states from body gesture evidence. A maximum a posteriori classifier evaluated with leave-one-out cross validation on labeled data from 11 students obtains a generalization accuracy of 97.4% over cases where the student reported a definite mental state, and 83.2% when we include cases where the student reported no mental state. Experimental results demonstrate the validity of our approach. Future work will explore utilization of the model in real-time intelligent tutoring systems.

Keywords  Affect analysis in context · Dynamic Bayesian networks · Mental state inference · Unintentional gestures

1 Introduction

Many applications that interact with humans would benefit from predicting their users' emotional or mental state. For example, a classroom barometer could inform an instructor about a student's emotional or mental state so that the pedagogical strategy can be adapted in real time [1].

Recently, there has been an explosion of interest among computer science researchers in efficiently processing human social signals for effective HCI applications, more specifically those signals or clues that may help to understand human emotions, intentions, and mental states [2]. Lehman et al. [3] explore the relationship between affective states and student engagement levels when working with expert tutors. Similarly, Dragon et al. [4] analyze students' behavior to understand their learning ability and to decide when an intervention by the intelligent tutor is needed. In similar work, D'Mello et al. [5] propose using conversational cues, body posture, and facial features to determine when learners are confused, bored, or frustrated during tutoring sessions.
In response, their affect-sensitive AutoTutor generates tutorial hints accompanied by empathetic and motivational statements.

This work is part of a broader trend in human-computer interaction (HCI) toward the innovative use of affect (emotion or mental state) in human-computer interfaces [6]. Researchers argue that their innovations will enable systems to be more effective and efficient than traditional systems, and they support these claims with experimental results. For example, Wentzel [7] reports that students show more improvement when using a tutoring system able to adapt to their level of frustration than when using a traditional tutoring system. Similarly, students report interventions by a pedagogical agent that responds to their affective state as useful when learning with a database tutor [8].

Our work focuses on the relationship between mental state and body gestures. The idea that human gestures and body movements are important in conveying mental and affective information is very old [9]. Such gestures can be intentional or unintentional. Unintentional gestures in particular convey a great deal of useful information about affect [10]; people often react to emotional situations with unintentional body movements [11, 12] such as a knee shake showing fear [13], a backward bend showing surprise or fear [14], and a non-deliberate shoulder shrug showing uncertainty [15]. More recently, studies on extraction of affective content from real-world situations such as people drinking, knocking, and walking [16, 17] have been carried out with promising results.

Another central focus of research on the affective content of gestures is multimodal affect analysis, in which gestures are analyzed in combination with another communication channel. Ambady and Rosenthal [18] report findings from a human study indicating that body gesture information can help improve emotional facial expression recognition accuracy. Balomenos et al. [19] and Gunes et al. [20] reach similar conclusions from work combining machine recognition of facial expressions with hand gesture analysis.

However, predicting human mental state from body gestures in real-world applications such as educational systems is problematic because the relationship between the gestures and the underlying mental states is context dependent and uncertain. For example, a student who scratches her head in response to a teacher's question in a classroom situation might perform a different gesture when feeling the same way in a different situation. Similarly, a student who rubs his eyes when feeling "tired" during a lecture would probably neither perform the same action every time nor only when he is tired; at some other time, he might rub his eyes because they are itchy.

In this paper, we propose a method to create a context-specific model of such relationships and to use that model for inference of mental state. We advocate a probabilistic approach that takes context and uncertainty into account. In particular, we use Bayesian networks to model the stochastic relationship between the possible or assumed causes, i.e., student mental states, and their consequent effects, i.e., observable unintentional gestures performed by students in classroom situations. Although Bayesian networks simply model correlations that may or may not be due to cause-effect relationships, they are nevertheless extremely powerful tools for inferring hidden states from correlated observable events in context. In addition, since they provide an estimated posterior distribution over all possible hidden causes (mental states), without committing to any particular mental state, they account for the uncertain nature of the learned relationship.
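For illustration only, the following minimal sketch shows this style of posterior inference for a single hidden mental state and one observed gesture via Bayes' rule; the prior and likelihood numbers are purely hypothetical and are not parameters learned in this work:

```python
# Hypothetical Bayes-rule illustration: infer a hidden mental state from an
# observed "Chin Rest" gesture. All numbers are invented for illustration.

prior = {"Thinking": 0.3, "Tired": 0.2, "Other": 0.5}                   # P(state)
likelihood_chin_rest = {"Thinking": 0.7, "Tired": 0.1, "Other": 0.05}   # P(ChinRest | state)

# Posterior P(state | ChinRest): multiply prior by likelihood, then normalize.
unnormalized = {s: prior[s] * likelihood_chin_rest[s] for s in prior}
evidence = sum(unnormalized.values())
posterior = {s: p / evidence for s, p in unnormalized.items()}

print(posterior)  # "Thinking" dominates, but no single state is committed to
```

Even in this toy case, the output is a full distribution over states rather than a single hard label, which is the property discussed above.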
In the area of intelligent tutoring systems, the use of probabilistic models, especially Bayesian networks, is widespread. Mislevy and Gitomer's intelligent tutoring system [21] predicts students' knowledge, skills, and strategies using a probabilistic approach. Conati et al.'s student model [22] is built upon Bayesian networks. Reye [23] uses DBNs to update a probabilistic student model for an intelligent tutoring system (ITS). In a problem-based learning application, Suebnukarn and Haddawy [24] use Bayesian networks to assist medical students by generating appropriate tutorial hints. Kapoor et al. [25] use Bayesian inference to classify whether learners are in a "pre-frustration" or a "not pre-frustration" state while solving a Towers of Hanoi puzzle.

Among the work most related to ours, Ji et al. [26] propose a Bayesian-network-based dynamic framework. They model and infer human fatigue in real time using context and temporal observations. They use visual clues from facial expressions, gaze direction, head movements, and eye movement. They measure fatigue based on the percentage of time the eyelids are closed, on the assumption that this indicates that the subject is tired. They make the similar assumption that gaze direction and head movement are associated with situation awareness. In other closely related work, El Kaliouby and Robinson [27] use visual clues from facial expressions and head movements to infer affective and cognitive states in real time from video shots of actors, with a recognition accuracy of 77.4%. They use labeled data from psychologists [28] to infer complex mental states such as "agreeing," "concentrating," "disagreeing," "interested," "thinking," and "unsure." They claim that their work is the first to infer complex mental states beyond the basic emotion categories. Similarly, Conati [29] proposes a probabilistic model to monitor users' emotions and engagement during interaction with educational games. Her proposed model uses bodily expressions as effects of emotional processes. The relationship between the expressions and emotional states is described as follows: frowned eyebrows indicate negative emotions, skin conductivity reflects the level of arousal, and an increased heartbeat indicates negative emotions. Finally, Liao et al. [30] suggest a probabilistic framework to infer user stress and fatigue from multiple measures such as EEG, ECG, GSR, eyelid movement, head movement, and facial expressions. All of this recent work is based on well-justified psychological theories or reasonable assumptions about action-state relationships.

In our work, however, since there is no existing theory mapping mental states to gestures in educational contexts, we take a more speculative approach and find correlations between gestures and self-reported mental states without regard to any prior theory about these relationships. This allows us to uncover previously unknown or less known relationships between gestures and mental states [31] that may be relevant to and may be exploited in specific contexts.
We use spontaneous training data and self reports to estimate the parameters of a dynamic Bayesian network-based computational model, then use subjects' unintentional hand gestures in context to predict their mental states. A schematic view of the approach is shown in Fig. 1.

Fig. 1  Bayesian network-based model for high-level processing of gestures for student mental state inference

Currently, we label gestures manually, but we envision automating the low-level processing in future work. Automatic classification of unintentional hand gestures is, of course, a difficult problem in and of itself. The problem is two-fold: we must locate the hands and other body parts involved in the gesture, and we must model the temporal dynamics of each gesture. However, recent advances in detecting and tracking human body parts such as hands and face are promising [32]. Shan et al. [33] detect heads and hands using skin color information. Ahmad and Lee [34] track gestures using shape and motion information. Caridakis et al. [35] report classification of 30 hand gesture classes using spatial features (hand trajectory and hand motion direction) and temporal features (hand position in the trajectory), modeled through hidden Markov models. They report recognition rates between 85% and 93%. Suk et al. [36] even report recognition rates of up to 99.59% for single-hand and two-hand gesture detection.

However, in this paper, we focus on the high-level processing that converts symbolic gesture event descriptions into semantic information on mental state, or high-level mental state descriptors.

We test our model on the classroom data we acquired using leave-one-out cross validation and find a generalization accuracy of 97.4% over cases where the student reported a mental state, and 83.2% over all cases, including those in which the student reported no mental state.

2 Data description

We recorded a total of 11 students (six males and five females; two Americans, two Europeans, and seven Asians) in five classroom lecture sessions with three different instructors. All the students volunteered for the study, and to encourage spontaneous behavior, they were kept unaware of the exact nature or goal of the research study. Here we provide a brief overview of the study; we have reported the details of part of the study elsewhere [37, 38].

After recording the videos, in post-recording interviews, we asked each student what they were feeling during the lecture. The student watched each video clip during the annotation session. This retrospective "video-cued recall" technique aims to reduce bias in self-reporting by helping subjects recall the details of their experience [39]. Ericsson and Simon [40] report that the validity of self-reports increases with temporal proximity between the event causing the thought and the verbal report of that thought. Given that we could not obtain self reports in real time without disrupting video data collection, we did process the video and obtain the verbal reports as quickly as was feasible, thus maximizing the validity of the data.

In our verbal reporting interview, we attempted to obtain ground truth data. The students watched the video clips synchronized with sound during the annotation session.
However, we found that they could not recall what they were thinking except during those intervals in which they performed some body gesture (a 20-second time interval was selected, considering the maximum duration over which a gesture occurred). In the revised self-reporting protocol, the participants were shown only those video clips that contained an occurrence of a gesture, but they were not prompted by the experimenters to label it with a forced-choice mental state. Some examples of their segmented gestures are illustrated in Fig. 2.

Fig. 2  Examples of students' unintentional body gestures recorded during classroom lecture sessions. From top left to top right: Chin Rest, Head Scratch, Nose Itch, and Eye Rub. From bottom left to bottom right: Lip Touch, Ear Scratch, Locked Fingers, and Yawn

Interestingly, in some cases, the context provided supporting evidence of the students' mental state when particular gestures were made. For example, in some cases, we found that students observing their own "Head Scratch" and describing their mental state as "Recalling" were in fact being questioned by the instructor of the class at the time of occurrence of that gesture. Such conversational cues from the instructor would be quite useful as another input modality in future work.

For purposes of analysis, we clustered students' free-form responses into a limited number of categories following the Geneva Affect Label Coding (GALC) system [41]. From the set of 16 distinct gestures that we observed over the five sessions, we did, however, limit analysis to the specific gestures that were most common among our participants: Head Scratch, Chin Rest, Eye Rub, Lip Touch, Ear Scratch, Nose Itch, Locked Fingers, and Yawn. These gestures are those we observed in video recordings of students in a real classroom, and the mental states are those reported by the experimental participants in the post-experiment interviews. Our participants associated the eight gestures with six different GALC-coded mental states: Stressed, Tired, Thinking, Satisfied, Recalling, and Concentrating. The occurrences of gestures and corresponding mental states reported by all subjects are shown in Table 1.

Table 1  Correlation matrix of self-reported mental states and co-occurring gestures. G1 = "Chin Rest," G2 = "Head Scratch," G3 = "Nose Itch," G4 = "Eye Rub," G5 = "Lip Touch," G6 = "Ear Scratch," G7 = "Locked Fingers," G8 = "Yawn," S1 = "Stressed," S2 = "Tired," S3 = "Thinking," S4 = "Satisfied," S5 = "Recalling," S6 = "Concentrating." None stands for the event when the gesture is not associated with any mental state.

Gestures   S1   S2   S3   S4   S5   S6   None
G1          0    0   75    0    0    0     8
G2          0    0    0    0   21    0     2
G3          0    0    0   34    0    0    12
G4          0   12    0    0    0    0     5
G5          0    0   40    0    0    0     6
G6          0    0    0    0    0    7     0
G7          3    0    0    0    0    0     0
G8          0    2    0    0    0    0     0

Conclusively establishing relationships between mental state and observable behavior will require a large amount of data, and researchers rightly claim that it is the HCI community's greatest challenge to collect authentic data sets of the necessary volume [42]. We believe this is the first data set that correlates student mental states during tutoring with their unintentional gestures. The data will be posted for others to use.¹

¹ http://emotion-research.net/databases.
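As a brief illustration of how a co-occurrence table such as Table 1 can be tallied, the following sketch counts gesture/mental-state pairs from a list of annotated gesture events; the example events are invented for illustration and are not the study's data:

```python
from collections import defaultdict

# Hypothetical annotated events: (gesture, self-reported state or None).
events = [
    ("Chin Rest", "Thinking"),
    ("Head Scratch", "Recalling"),
    ("Nose Itch", None),
    ("Chin Rest", "Thinking"),
]

# counts[gesture][state] accumulates co-occurrences, in the spirit of Table 1.
counts = defaultdict(lambda: defaultdict(int))
for gesture, state in events:
    counts[gesture][state if state is not None else "None"] += 1

for gesture, row in counts.items():
    print(gesture, dict(row))
```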
3 Generative dynamic model for mental state estimation

As mentioned earlier, we propose a dynamic Bayesian network (DBN) based approach. DBNs provide a powerful way to represent and reason about uncertainty in time series data [43]. A DBN is a graphical representation of a complex state-space process where the nodes represent the process variables and the links between the nodes show the probabilistic causal and temporal relationships between the variables. The conditional independence assumption allows conversion of a complex joint probability distribution into a product of conditionally independent terms. For example, we can represent the joint probability distribution of an n-node Bayes net (BN) having random variables X = X_1, X_2, ..., X_n as follows:

P(x_1, x_2, ..., x_n) = P(x_1 | x_2, ..., x_n) P(x_2 | x_3, ..., x_n) ... P(x_{n-1} | x_n) P(x_n).

Our model consists of a Bayesian network structured as a first-order hidden Markov model (HMM). We could fully describe the correlations between mental states and gestures with a purely static Bayesian network, but static Bayesian networks are limited to performing inference at a single time instant, without the ability to capture temporal relationships. We add first-order Markov dynamics to the static network in order to capture temporal dependencies among mental states.

3.1 Model description

Our mental state prediction process is based on the following generative model: at time t, t ∈ {1, ..., T}, a student probabilistically selects a set of mental states represented by a random vector S_t = (S_t^1, S_t^2, ..., S_t^m), taking on the binary values "true" and "false" for each of the m elements (in our case m = 6), according to dynamics P(S_t | S_{t-1}). The subject then probabilistically selects a set of observable gestures to display, represented by the random vector G_t = (G_t^1, G_t^2, ..., G_t^n), taking on the binary values "true" and "false" for each of the n elements (in our case n = 8), according to P(G_t | S_t). The DBN, unrolled over time to make the temporal and causal dependencies explicit, is illustrated in Fig. 3, and we write the joint probability distribution over states and gestures as

P(S_0, ..., S_T, G_1, ..., G_T) = P(S_0) ∏_{t=1}^{T} P(S_t | S_{t-1}) P(G_t | S_t),

where P(S_0) = P(s_0^1, ..., s_0^m) is some initial distribution over mental states,

P(G_t | S_t) = ∏_{j=1}^{n} P(g_t^j | s_t^1, ..., s_t^m),   (1)

and

P(S_t | S_{t-1}) = ∏_{i=1}^{m} P(s_t^i | s_{t-1}^1, ..., s_{t-1}^m).   (2)

Training our model means estimating two distributions, the transition model P(S_t | S_{t-1}) and the sensor model P(G_t | S_t). For simplicity, we assume that the model is stationary, i.e., that the distributions do not vary with time. We specify the models in more detail in the following sections.

3.1.1 The sensor model

Our sensor model is described by the probability distribution given in (1) and illustrated graphically in Fig. 4. Completely specifying the discrete distributions P(g_t^j | s_t^1, ..., s_t^m) in (1) would require 2^m × n conditional probabilities, which would be impossible to estimate from a reasonably-sized dataset. The noisy-OR model, commonly used for Bayesian network nodes with multiple parents, not only reduces the number of parameters that need to be estimated but also allows us to account for unmodeled causes. The noisy-OR model makes the assumption of "exception independence" or "independent inhibition" [44]; in our case, the distribution can be written as

P(G_t^i = true | s_t^1, ..., s_t^m) = 1 − ∏_{j ∈ T_{S_t}} q_ij,   (3)

where

T_{S_t} = { j : s_t^j = true }

and

q_ij = P(G_t^i = false | S_t^j = true).

With complete training data, the mn parameters q_ij can be estimated by counting.
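The following short sketch illustrates the noisy-OR computation in (3); the q_ij values and the small problem size are invented for illustration and are not the parameters estimated from our data:

```python
import numpy as np

# Hypothetical noisy-OR parameters q[i, j] = P(G_i = false | S_j = true)
# for n = 3 gestures and m = 2 mental states (the model in the text uses
# n = 8 gestures and m = 6 states, with q estimated from data).
q = np.array([[0.3, 0.9],
              [0.8, 0.4],
              [0.95, 0.95]])

def gesture_probabilities(q, active_states):
    """Return P(G_i = true | S_t) for every gesture i under the noisy-OR model
    of Eq. (3). q[i, j] is the probability that active state j fails to trigger
    gesture i; the gesture stays absent only if every active parent fails.
    """
    active = np.where(active_states)[0]
    if active.size == 0:
        # No active causes: with no leak term, the gesture probability is 0.
        return np.zeros(q.shape[0])
    return 1.0 - np.prod(q[:, active], axis=1)

print(gesture_probabilities(q, np.array([True, False])))  # only state 1 active
print(gesture_probabilities(q, np.array([True, True])))   # both states active
```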
Fig. 3  Unrolling our dynamic Bayesian network. Shaded nodes represent observed gestures, and unshaded nodes represent hidden mental states, which are not directly observable
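As a rough illustration of how gesture evidence propagates through the unrolled network over time, the sketch below performs exact forward filtering by brute-force enumeration of the joint mental-state configurations. It is a simplified stand-in for the junction tree inference used in this work, and every parameter in it is a placeholder rather than a learned value:

```python
import numpy as np
from itertools import product

# Brute-force forward filtering over the unrolled DBN. We enumerate all 2^m
# joint mental-state configurations (m = 2 here for compactness; the model in
# the text has m = 6, i.e., 64 configurations). All parameters are placeholders.

m = 2
state_configs = list(product([False, True], repeat=m))          # 2^m configurations

rng = np.random.default_rng(0)
prior = np.full(len(state_configs), 1.0 / len(state_configs))   # P(S_0), uniform placeholder

# Placeholder transition matrix P(S_t = column | S_{t-1} = row), rows normalized.
T = rng.random((len(state_configs), len(state_configs)))
T /= T.sum(axis=1, keepdims=True)

def observation_likelihood(gestures, config):
    """Placeholder for P(G_t = gestures | S_t = config); in the full model this
    is the noisy-OR sensor model of Eqs. (1) and (3)."""
    return 0.9 if any(config) == any(gestures) else 0.1

def forward_step(belief, gestures):
    """One filtering update: predict with the transition model, then weight by
    the observation likelihood and renormalize."""
    predicted = T.T @ belief
    weighted = np.array([observation_likelihood(gestures, c) for c in state_configs]) * predicted
    return weighted / weighted.sum()

belief = prior
for gestures in [(True, False), (False, False)]:   # toy gesture observations per time step
    belief = forward_step(belief, gestures)
print(dict(zip(state_configs, np.round(belief, 3))))
```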