Cost-effective solution to synchronised audio-visual data capture using multiple sensors ☆

Jeroen Lichtenauer a,⁎, Jie Shen a, Michel Valstar a, Maja Pantic a,b

a Department of Computing, Imperial College London, 180 Queen's Gate, SW7 2AZ, UK
b Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, The Netherlands

☆ This paper has been recommended for acceptance by Jan-Michael Frahm.
⁎ Corresponding author. E-mail address: j.lichtenauer@imperial.ac.uk (J. Lichtenauer).

Image and Vision Computing (2011), doi:10.1016/j.imavis.2011.07.004. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 13 February 2011; Received in revised form 7 June 2011; Accepted 18 July 2011.

Keywords: Video recording; Audio recording; Multisensor systems; Synchronisation

Abstract

Applications such as surveillance and human behaviour analysis require high-bandwidth recording from multiple cameras, as well as from other sensors. In turn, sensor fusion has increased the required accuracy of synchronisation between sensors. Using commercial off-the-shelf components may compromise quality and accuracy due to several challenges, such as dealing with the combined data rate from multiple sensors; unknown offset and rate discrepancies between independent hardware clocks; the absence of trigger inputs or outputs in the hardware; and the different methods for time-stamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger or timestamp signals with a multi-channel audio interface. For sensors that do not have an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. Furthermore, we show that a consumer PC can currently capture 8-bit video data with 1024×1024 spatial and 59.1 Hz temporal resolution from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz. We thus improve the quality/cost ratio of multi-sensor data capture systems.

1. Introduction

In the past two decades, the use of CCTV (Closed Circuit Television) and other visual surveillance technologies has grown to unprecedented levels. Besides security applications, multi-sensorial surveillance technology has also become an indispensable building block of various systems aimed at detection, tracking and analysis of human behaviour, with a wide range of applications including proactive human-computer interfaces, personal wellbeing and independent living technologies, personalised assistance, etc. Furthermore, sensor fusion – combining video analysis with the analysis of audio, as well as other sensor modalities – is becoming an increasingly active area of research [1]. It is also considered a prerequisite to increase the accuracy and robustness of automatic human behaviour analysis [2]. Although humans tolerate an audio lag of up to 200 ms or a video lag of up to 45 ms [3], multimodal data fusion algorithms may benefit from higher synchronisation accuracy. For example, in [4], correction of a 40 ms time difference between the audio and video streams recorded by a single camcorder resulted in a significant increase in the performance of speaker identification based on Audio-Visual (A/V) data fusion.
Lienhart et al. [5] demonstrated that microsecond accuracy between audio channels helps to increase signal separation gain in distributed blind signal separation.

With the ever-increasing need for multi-sensorial surveillance systems, the commercial sector started offering multi-channel frame grabbers and Digital Video Recorders (DVR) that encode video (possibly combined with audio) in real time (e.g. see [6]). Although these systems can be the most suitable solutions for current surveillance applications, they may not allow the flexibility, quality, accuracy or number of sensors required for technological advancements in automatic human behaviour analysis. The spatial and temporal resolutions, as well as the supported camera types of real-time video encoders, are often fixed or limited to a small set of choices, dictated by established video standards. The accuracy of synchronisation between audio and video is mostly based on human perceptual acceptability, and could be inadequate for sensor fusion. Even if A/V synchronisation accuracy is maximised, an error below the time duration between subsequent video frame captures can only be achieved when it is exactly known how the recorded video frames correspond to the audio samples. Furthermore, commercial solutions are often closed systems that do not allow the accuracy of synchronisation that can be achieved with direct connections between the sensors. Some systems provide functionality for time-stamping the sensor data with GPS or IRIG-B modules. Such modules can provide microsecond synchronisation accuracy between remote systems. However, the applicability of such a solution depends on sensor hardware and software, as well as on the environment (GPS receivers need an unblocked view to the GPS satellites orbiting the Earth). Also, the actual accuracy can never exceed the uncertainty of the time lag in the I/O process that precedes time-stamping of sensor data. For PC systems, this can be in the order of milliseconds [5].

A few companies aim at custom solutions for applications with requirements that cannot be met by what is currently offered in commercial surveillance hardware. For example, Boulder Imaging [7] builds custom solutions for any application, and Cepoint Networks offers professional video equipment such as the Studio 9000™ DVR [8], which can record up to 4 video streams per module, as well as external trigger events, with an option to timestamp with IRIG-B. It also has the option of connecting an audio interface through a Serial Digital Interface (SDI) input. However, it is not clear from the specifications whether the time-stamping of audio and video can be done without being affected by the latency between the sensors and the main device. Furthermore, when more than 4 video streams have to be recorded, a single Studio 9000 will still not suffice.
The problem with the high cost of custom solutions and specialised professional hardware is that it keeps accurately synchronised multi-sensor data capture out of reach for most computer vision and pattern recognition researchers. This is an important bottleneck for research on multi-camera and multi-modal human behaviour analysis. To overcome this, we propose solutions and present findings regarding the two most important difficulties in using low-cost Commercial Off-The-Shelf (COTS) components: reaching the required bandwidth for data capture and achieving accurate multi-sensor synchronisation.

Fortunately, recent developments in computer hardware technology have significantly increased the data bandwidths of commercial PC components, allowing more audio-visual sensors to be connected to a single PC. Our low-cost PC configuration facilitates simultaneous, synchronous recordings of audio-visual data from 12 cameras with 780×580 pixels spatial resolution and 61.7 fps temporal resolution, together with eight 24-bit 96 kHz audio channels. The relevant components of our system setup are summarised in Table 1. By using six internal 1.5 TB Hard Disk Drives (HDD), 7.6 h of continuous recordings can be made. With a different motherboard and an extra HDD controller card to increase the number of HDDs to 14, we show that one PC is capable of continuously recording from 14 Gigabit Ethernet cameras with 1024×1024 pixels spatial resolution and 59.1 fps, for up to 6.7 h. In Table 2 we show the maximum number of cameras that can be used in the different configurations that we tested. A higher number of cameras per PC means a reduction of cost and complexity, as well as of space requirements, for visual data capture.

Table 1. Components of the capture system for 8 FireWire cameras with a resolution of 780×580 pixels and 61.7 fps.

Sensor component             Description
7 monochrome video cameras   AVT Stingray F-046B, 780×580 pixels resolution, max. 61.7 fps
Colour video camera          AVT Stingray F-046C, 780×580 pix. Bayer pattern, max. 61 fps
2 camera interface cards     Dual-bus IEEE 1394b PCI-E ×1, Point Grey
Room microphone              AKG C 1000 S MkIII
Head-worn microphone         AKG HC 577 L
External audio interface     MOTU 8-pre FireWire, 8-channel, 24-bit, 96 kHz
Eye tracker                  Tobii X120

Computer component           Description
6 capture disks              Seagate Barracuda 1.5 TB SATA, 32 MB cache, 7200 rpm
System disk                  PATA Seagate Barracuda 160 GB, 2 MB cache, 7200 rpm
Optical drive                PATA DVD RW
4 GB memory                  2 GB PC2-6400 DDR2 ECC KVR800D2E5/2G
Graphics card                Matrox Millennium G450 16 MB PCI
Motherboard                  Asus Maximus Formula, ATX, Intel X38 chipset
CPU                          Intel Core 2 Duo 3.16 GHz, 6 MB cache, 1333 MHz FSB
ATX case                     Antec Three Hundred
PSU                          Corsair Memory 620 Watt

Software application         Description
MS Windows Server 2003       32-bit operating system
Norpix Streampix 4           Multi-camera video recording
Audacity 1.3.5               Freeware multi-channel audio recording
AutoIt v3                    Freeware for scripting of Graphical User Interface control
Tobii Studio version 1.5.10  Eye tracking and stimuli data suite
Tobii SDK                    Eye tracker Software Development Kit

Table 2. Camera support of a single consumer PC.

Spatial resolution    Temporal resolution   Rate per camera   Max. no. of cameras
780×580 pixels        61.7 fps              26.6 MB/s         14
780×580 pixels        49.9 fps              21.5 MB/s         16
780×580 pixels        40.1 fps              17.3 MB/s         18
With controller card for 8 additional HDDs:
1024×1024 pixels      59.1 fps              59.1 MB/s         14
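As a quick plausibility check on these numbers, the per-camera rates in Table 2 and the quoted recording durations follow from simple arithmetic. The sketch below is our own back-of-the-envelope calculation (not code from the paper); it assumes 8-bit monochrome pixels, binary megabytes (MiB) and capture disks fully dedicated to video.

    def camera_rate_mib(width, height, fps, bytes_per_pixel=1):
        """Raw data rate of one camera in MiB/s (8-bit pixels by default)."""
        return width * height * bytes_per_pixel * fps / 2**20

    def recording_hours(num_cameras, rate_mib, disk_bytes, num_disks):
        """Continuous recording time with all disks dedicated to video."""
        total_rate = num_cameras * rate_mib * 2**20   # bytes per second
        return num_disks * disk_bytes / total_rate / 3600

    print(camera_rate_mib(780, 580, 61.7))        # ~26.6, as in Table 2
    print(camera_rate_mib(1024, 1024, 59.1))      # ~59.1, as in Table 2
    print(recording_hours(12, 26.6, 1.5e12, 6))   # ~7.5 h (7.6 h quoted)
    print(recording_hours(14, 59.1, 1.5e12, 14))  # ~6.7 h, as quoted

The small differences come from rounding and disk formatting overhead, but the quoted figures are consistent with raw, uncompressed capture.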
Synchronisation between COTS sensors is hindered by the offset and rate discrepancies between independent hardware clocks, the absence of trigger inputs or outputs in the hardware, as well as the different methods of time-stamping of the recorded data. To accurately derive synchronisation between the independent timings of different sensors, possibly running on multiple computer systems, we centralise the synchronisation task in a multi-channel audio interface. A system overview is shown in Fig. 1.

Fig. 1. Overview of our synchronised multi-sensor data capture system, consisting of (a) microphones, (b) video cameras, (c) a multi-channel A/D converter, (d) an A/V capture PC, (e) an eye gaze capture PC, (f) an eye gaze tracker and (g) a photo diode to capture the pulsed IR illumination from (f).

For sensors with an external trigger (b), we record the binary trigger signals directly into a separate audio track, parallel to the tracks with recorded sound; a sketch of how such a trigger track can be decoded afterwards is given below.
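The following is a minimal illustration of this decoding step (our own sketch, not the authors' published code). It assumes the soundfile and NumPy packages, a hypothetical file name, and that the trigger was recorded on channel 1; the result is the audio sample index, and hence the audio time, of every captured video frame.

    import numpy as np
    import soundfile as sf  # assumed reader; any multi-channel audio loader works

    # Hypothetical track layout: channel 0 = microphone, channel 1 = camera trigger.
    data, rate = sf.read("capture.wav")
    trigger = data[:, 1]

    # Threshold halfway between the low and high levels of the binary signal,
    # then locate upward crossings (one rising edge per video frame).
    threshold = 0.5 * (trigger.min() + trigger.max())
    high = trigger > threshold
    rising = np.flatnonzero(~high[:-1] & high[1:]) + 1

    frame_times = rising / rate  # audio time of each frame exposure
    print(f"{len(rising)} frames, first at {frame_times[0]:.6f} s of audio")

Because every frame is located on the audio time axis individually, no constant frame rate has to be assumed, which is what allows an A/V correspondence error below one frame period.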
For sensors that do not have an external trigger signal (f), we let the computer that captures the sensor data (e) periodically generate binary timestamp signals from its serial port output. These signals can be recorded in a parallel audio channel as well, and can even be used as a common time base to synchronise multiple asynchronous audio interfaces.
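A minimal sketch of such a timestamp generator is given below (our own illustration; the paper does not list its implementation, so the port name, byte pattern and one-second period are assumptions). Writing a byte with alternating bits produces a short burst of voltage transitions on the serial TX pin, which an audio input records as a distinctive marker; logging the computer's clock at each write ties the audio track to that computer's time base.

    import time
    import serial  # pySerial, assumed installed

    PORT = "/dev/ttyS0"  # hypothetical port name; e.g. "COM1" on Windows

    ser = serial.Serial(PORT, baudrate=9600)
    with open("timestamps.log", "w") as log:
        while True:
            log.write(f"{time.time():.6f}\n")  # computer clock at pulse time
            log.flush()
            ser.write(b"\x55")  # 0x55 = alternating bits: a dense marker burst
            ser.flush()
            time.sleep(1.0)     # one marker per second (arbitrary choice)

Matching the logged clock readings to the marker positions found in the audio track (e.g. with the edge detector sketched earlier) yields a set of (computer time, audio sample) pairs, from which the offset and rate between the two time bases can be estimated by a linear fit.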
Using low-cost COTS components, our approach still achieves a high synchronisation accuracy, allowing a better trade-off between quality and cost. Furthermore, because synchronisation is achieved at the hardware level, separate software can be used for the data capture from each sensor. This allows the use of COTS software, or even freeware, maximising flexibility with minimal development time and cost.

The remainder of this article consists of six parts. We begin by describing related multi-camera capture solutions that have been proposed before, in Section 2. In Section 3, we describe important choices that need to be made for sensors that will be used in a multi-sensor data capture system. The following three sections cover three different problems of synchronised multi-sensor data capture: achieving a high data throughput (Section 4), synchronisation at sensor level (Section 5) and synchronisation at computer level (Section 6), respectively. Each section describes how we have solved the respective problem and presents experimental results to evaluate the resulting performance. Finally, Section 7 contains our conclusions about the achieved knowledge and improvements.

2. Related multi-sensor video capture solutions

Because of the shortcomings and high costs of commercially available video capture systems, many researchers have already sought custom solutions that meet their own requirements.

Zitnick et al. [9] used two specially built concentrator units to capture video from eight cameras of 1024×768 pixels spatial resolution at 15 fps.

Wilburn et al. [10] built an array of 100 cameras, using 4 PCs and custom-built low-cost cameras of 640×480 pixels spatial resolution at 30 fps, connected through trees of interlinked programmable processing boards with on-board MPEG2 compression. They used a tree of trigger connections between the processing boards (each of which controls one camera) to synchronise the cameras, with a difference of 200 ns between subsequent levels of the tree. For a tree of 100 cameras, this should result in a frame time difference of 1.2 μs between the root and the leaf nodes.

More recently, a modular array of 24 cameras (1280×1024 pixels at 27 fps) was built by Tan et al. [11]. Each camera was placed in a separate specially built hardware unit that had its own storage disk, using on-line video compression to reduce the data. The synchronisation between camera units was done using a tree of trigger and clock signal connections. The delay between the tree nodes was not reported. Recorded data was transmitted off-line to a central PC via a TCP/IP network.

Svoboda et al. [12] proposed a solution for synchronous multi-camera capture involving standard PCs. They developed a software framework that manages the whole PC network. Each PC could handle up to three cameras of 640×480 pixels spatial resolution at 30 fps, although their software was limited to handling a temporal resolution of 10 fps. Camera synchronisation was done by software triggers, sent simultaneously to all cameras through the Ethernet network. This solution could reduce costs by allowing the use of low-cost cameras that do not have an external trigger input. However, the cost of multiple PCs remains. Furthermore, a software synchronisation method has a much lower accuracy than an external trigger network.

A similar system was presented in [13], which could handle 4 cameras of 640×480 pixels spatial resolution at 30 fps per PC. The synchronisation accuracy between cameras was reported to be within 15 ms.

Hutchinson et al. [14] used a high-end server PC with three Peripheral Component Interconnect (PCI) buses that provided the necessary bandwidth for 4 FireWire cards and a PCI eXtended (PCI-X) Small Computer System Interface (SCSI) Hard Disk Drive (HDD) controller card connecting 4 HDDs. This system allowed them to capture video input from 4 cameras of 658×494 pixels spatial resolution at 80 fps.

Fujii et al. [15] have developed a large-scale multi-sensorial setup capable of capturing from 100 cameras of 1392×1040 pixels spatial resolution at 29.4 fps, as well as from 200 microphones at 96 kHz. Each unit, which captures from 1 camera (connected by a Camera Link interface) and 2 microphones, consists of a PC with custom-built hardware. During recording, all data is stored to internal HDDs, to be transported off-line via Ethernet. A central host computer manages the settings of all capture units as well as the synchronous control unit that generates the video and analogue trigger signals from the same clock. By using a single, centralised trigger source for all measurements, the synchronisation error between sensors is kept below 1 μs. Disadvantages of this system are the high cost and volume of the equipment, as well as the required custom-built hardware.

Table 3 summarises the multi-camera capture solutions described above. From this, it immediately becomes clear that audio has been a neglected factor in previous multi-sensor data capture solutions. With custom hardware, only Fujii et al. achieve accurate A/V synchronisation. The only low-cost solution that has standard support for audio is a commercial surveillance DVR system. Unfortunately, microsecond synchronisation accuracy is not a key issue in surveillance applications, since the primary purpose of these systems is to facilitate playback to a human observer. However, such exact synchronisation accuracy is necessary for achieving (automatic) analysis of human behaviour.

To the best of our knowledge, the multi-sensor data capture solution proposed here is the first complete multi-sensor data capture solution that is based on commercial hardware, while achieving accurate synchronisation between audio and video, as well as with other sensors and computer systems.

Table 3. Overview of multi-sensor audio-visual data capture solutions. A 'unit' is a system in which sensor data is collected in real time; in most cases this is a PC, but for Zitnick et al. [9] it was a 'concentrator unit'. 'Cam.#/unit' indicates the maximum number of cameras that can be connected to a unit (at 640×480 pixels and 30 fps unless noted otherwise), 'Audio#/unit' the maximum number of audio channels per unit, 'Sync unit#' the maximum number of units that can be synchronised, 'Unit sync' the type or accuracy (if known) of synchronisation between units, 'Camera sync' the type or accuracy of synchronisation between cameras, and 'A/V sync' the accuracy of synchronisation between audio and video.

Solution | Cam.#/unit | Audio#/unit | Sync unit# | Unit sync | Camera sync | A/V sync
Our solution | 14 (1024×1024 p at 59.1 fps) | <7 | Unlimited | <20 μs | ~30 μs | <25 μs
Studio 9000 DVR | 4 | Optional via SDI | Unlimited with IRIG-B | Optional IRIG-B | Depends on the cameras | Not specified
Typical CCTV DVR | 16 | 16 | 1 | n.a. | Depends on the cameras | Not specified
Zitnick et al. [9] | 4 (1024×768 p at 15 fps) | n.a. | ≥2 (not specified) | By FireWire | Not specified | n.a.
Wilburn et al. [10] | 30 | n.a. | Unlimited | Hardware trigger | 1.2 μs with 100 cameras | n.a.
Tan et al. [11] | 1 (1280×1024 p at 27 fps) | n.a. | Unlimited | Hardware trigger | Hardware trigger | n.a.
Svoboda et al. [12] | 3 at 10 fps | n.a. | Unlimited | Network trigger | Software trigger | n.a.
Cao et al. [13] | 4 | n.a. | Unlimited | 15 ms with 16 units | Software trigger | n.a.
Hutchinson et al. [14] | 4 (658×494 p at 80 fps) | n.a. | 1 | n.a. | Software trigger | n.a.
Fujii et al. [15] | 1 (1392×1040 p at 29.4 fps) | 4 | Unlimited | <1 μs with 100 units | <1 μs with 100 cameras | <1 μs with 100 units

3. Sensor and measurement considerations

Having a cost-saving and accurate sensor-synchronisation method is only relevant if it is combined with a set of recording equipment that is equally cost-effective and suited for the intended purpose. The quality of the captured data is limited by the quality of the sensors and is interdependent with the data bandwidth, the synchronisation possibilities, as well as the quality of the recording conditions. Many important considerations were not familiar to us before we started building our recording setup. To facilitate a more (cost-)effective and smooth design process, this section covers the most important aspects that we had to take into consideration in choosing audio-visual sensors for a human behaviour analysis application.

We start by covering several important aspects of illumination, followed by the most important camera properties and image post-processing procedures. Subsequently, we discuss different microphone options for recording a person's vocal sounds and give some comments on the use of COTS software. This section ends with an example of the costs we spent on equipment regarding each of these different aspects.
3.1. Illumination

Illumination determines an object's appearance. The most important factors of illumination are spectrum, intensity, location, source size and stability.

3.1.1. Illumination spectrum

If a colour camera is used, it is important that the light has significant power over the entire visible colour spectrum. If a monochrome camera is used, a monochromatic light source can improve image sharpness with low-cost lenses, by preventing chromatic aberration. Most monochrome cameras are sensitive to Near Infra-Red (NIR) wavelengths (between 700 nm and 1000 nm). Since the human eye is insensitive to these wavelengths, a higher illumination intensity can be used here (within safety limits) without compromising comfort. Furthermore, the human skin is more translucent to NIR light [16]. This has a smoothing effect on wrinkles, irregularities and skin impurities, which can be beneficial to some applications of computer vision.

Note that incandescent studio lights often have a strong infrared component that can interfere with active infrared sensors. The Tobii gaze tracker we discuss in Section 6.4 was adversely affected by a 500 Watt incandescent light, while it worked well with two 50 Watt white-light LED arrays that produce a comparable brightness.

3.1.2. Illumination intensity

The intensity of light cast on the target object determines the trade-off between shutter time and image noise. Short shutter times (to reduce motion blur) require more light. Light intensity may be increased either by a more powerful light source, or by focussing the illumination onto a smaller area (using focussing reflectors or lenses).

3.1.3. Illumination source location

For most machine-vision applications, the ideal location of the illumination source is at the position of the camera lens. There are many types of lens-mountable illuminators available for this. However, for human subjects, it can be very disturbing to have the light source in front of them. It will shine brightly into the subject's eyes, reducing the visibility of the environment, such as a computer screen. Placing the illumination more sideways can solve this problem. However, when a light source shines directly onto the glass of the camera lens, lens flare may be visible in the captured images. Especially in multi-camera data capture setups, these issues can cause design conflicts.

3.1.4. Illumination source size

Small (point) light sources cause the sharpest shadows and the most intense lens flare, and are the most disturbing (possibly even harmful) to the human eye. Therefore, in many situations, it is beneficial to increase the size of the light source. This can be realised either by a large diffuser between the light source and the subject, or by reflecting the light source via a large diffusing (white, dull) surface. Note that the size and shape of the light source will directly determine the size and shape of specular reflections in wet or glossy surfaces, such as the human eyes and mouth.

3.1.5. Illumination constancy

For many computer-vision applications, as well as for data reduction in video compression, it is crucial to have constant illumination over subsequent images. However, the AC power frequency (usually around 50 or 60 Hz) causes oscillation or ripple current in most electrically powered light sources. If the illumination cannot be stabilised, there are two alternative solutions to prevent 'flicker' in the captured video. The first is to use a shutter time that is equal to a multiple of the oscillation period. In the case of a 100 Hz ripple, the minimum shutter time is 10 ms. In human behaviour analysis applications, this is not sufficiently short to prevent motion blur (e.g. by a fast moving hand).
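This first option amounts to simple arithmetic, sketched below (our own illustration, not from the paper; it assumes lighting whose intensity ripples at twice the mains frequency, as holds for most mains-powered sources).

    def flicker_free_shutters(mains_hz, count=4):
        """First few shutter times (s) that integrate whole ripple periods."""
        ripple_period = 1.0 / (2.0 * mains_hz)  # 50 Hz mains -> 100 Hz ripple
        return [k * ripple_period for k in range(1, count + 1)]

    print(flicker_free_shutters(50))  # [0.01, 0.02, 0.03, 0.04]: minimum 10 ms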
Another option is to synchronise the image capture with the illumination frequency. This requires special algorithms (e.g. [17]) or hardware (e.g. generating camera trigger pulses from the AC oscillation of the power source), and limits the video frame rate to the frequency of the illumination.

3.1.6. Illumination/camera trade-off

Experimenting with recordings of fast head and hand motions showed us that for a close-up video (where the inter-ocular distance was more than 100 pixels), the shutter time needs to be shorter than 1/200 s in order to prevent significant motion blur. Obtaining a high SNR with short shutter times requires bright illumination, a large lens aperture, or a sensitive sensor. Because illumination brightness is limited by the safety and comfort of human beings, and the lens aperture is limited by the minimum required Depth of Field (DoF), video quality for human analysis depends highly on the sensor sensitivity. Therefore, it can be worth investing in a high-quality camera, or sacrificing colour for the higher sensitivity of a monochrome camera.

3.2. Spatial and temporal video resolution

The main properties to choose in a video camera are the spatial and temporal resolution. Selecting an appropriate spatial resolution essentially involves a trade-off between Signal-to-Noise Ratio (SNR) and the level of detail. Sensors with higher spatial resolution receive less light per photo sensor (due to smaller sensor sizes), and are generally less efficient (they are more vulnerable to imperfections, and circuitry takes up relatively more area). These factors contribute to a lower SNR when a higher spatial resolution is used.

Furthermore, a higher spatial and/or temporal resolution is more costly. Not only are high-resolution cameras more expensive, but the hardware capable of real-time capture and recording of the high data rate is more expensive as well. Another issue that needs to be taken into consideration when a high temporal resolution is used is the upper limit for the shutter time, which equals the time between video frames. Depending on the optimal exposure, high-speed video may require brighter illumination and more sensitive imaging sensors in order to achieve a sufficient SNR.

For these reasons, it is crucial to choose no more than the minimum spatial and temporal resolution that provides sufficient detail for the target application. The analysis of temporal segments (onset, apex, offset) of highly dynamic human gestures, such as sudden head and hand movements, demands a limited shutter time (to prevent motion blur) as well as sufficient temporal resolution (to capture at least a couple of frames of each gesture). Previous research findings in the field of dynamics of human behaviour reported that the fastest facial movements (blinks) last 250 ms [18,19], and that the fastest hand movements (finger movements) last 80 ms [20]. Hence, in order to capture at least a couple of frames of the fastest of these gestures, the temporal resolution must be chosen accordingly.
