Analysis of Performance Evaluation of Parallel Katsevich Algorithm for 3-D CT Image Reconstruction

Jun Ni, Junjun Deng, Hengyong Yu, Tao He, and Ge Wang
Medical Imaging High Performance Computing Lab, Department of Radiology, The University of Iowa, Iowa City, IA
{jun-ni, junjun-deng, hengyong-yu, tao-he,

Abstract
The first theoretically exact spiral cone-beam CT reconstruction algorithm was developed by Katsevich [1-2]. Recently, Yu et al. [3-4] implemented the algorithm numerically. Although the method is very promising, it is computationally intensive and requires a huge amount of computer time. Recently, researchers [5-6] began to parallelize the algorithm to achieve high-performance computation. This paper presents an analysis of data decomposition and data communication in the parallel Katsevich algorithm [5] and develops an analytical expression to evaluate the performance of the algorithm's parallelism. The results of the analytical model and the numerical benchmarks are in fair agreement. The analytical model provides a useful tool for evaluating high-performance computing benchmarks of parallel Katsevich algorithms.

1. Introduction
X-ray computed tomography (CT) is an important medical-imaging modality in which projection data are used to reconstruct a cross-sectional or volumetric image of a patient. Spiral cone-beam CT is one of the promising technologies for 3-D CT. In a spiral cone-beam CT system, a data acquisition system consisting of an X-ray tube and a multi-row detector bank rotates while the patient is moved into a scanner gantry [7]. Relative to the patient, the X-ray source scans along a helix and generates cone-beam X-rays through the object. The attenuated X-ray signals are then recorded on the detectors placed on the other side of the patient. Although the mechanism of spiral cone-beam CT seems simple, the cone-beam divergence and the longitudinal truncation of the projection data make exact image reconstruction far from trivial. A landmark algorithm, contributed by Feldkamp et al. [8], allowed approximate reconstruction from cone-beam data collected along a circular trajectory. Wang et al. [9] developed a generalized Feldkamp algorithm for approximate reconstruction, which is excellent in terms of efficiency and parallelism.
A breakthrough was made in 2002 when Katsevich derived a filtered backprojection (FBP) algorithm that is similar to the Feldkamp algorithm but reconstructs images exactly [1-2]. Later, the Katsevich algorithm was numerically implemented by Yu et al. [3-4]. The Katsevich algorithm requires a significant amount of computer time, especially for large datasets. From the computing perspective, researchers began to parallelize the Katsevich algorithm using multiprocessor systems [5-6].

A parallel computing machine can be a single Symmetric Multi-Processing (SMP) system with multiple built-in processors sharing a common memory; a cluster of locally connected computer processors with distributed memories; or a cluster comprising multiple workstations linked by a LAN network. In the analysis of parallel computing, a processor participating in a computational process is called a processing element (PE). An overall computational task is typically partitioned into multiple sub-tasks, and the associated data are sent to different PEs through an interconnection (with an internal switch) or a networked connection (with an external switch). After the sub-tasks are completed, the results are assembled by a master PE to obtain the final result. The partition is also called parallel decomposition. Parallel decomposition algorithms influence the final computing performance.

Parallel computing technology has been successfully used in several medical applications involving image reconstruction, and many parallel algorithms have been developed. For example, Raman [10] developed a parallel Filtered-Backprojection (FBP) algorithm and implemented it on an Intel Paragon system with 16 processors and a Connection Machine (CM5) system with 2 processors. The performance of their parallel FBP programs was compromised by a large communication overhead, giving a speed-up of about 4 on the Paragon and 1.6 on the CM5, respectively. In the early 1990s, some parallel Expectation-Maximization (EM) algorithms were proposed [11-12]. The parallel implementation was directly based on the conventional EM algorithm with various domain partition techniques [13-14]. Ordered-subset techniques were also used to further speed up the iterative reconstruction [15].
Recently, Johnson and Sofer investigated various parallelisms in image reconstruction [16]. An OSC (Ordered-Subset Convex)-based parallel statistical cone-beam X-ray CT algorithm was proposed for shared-memory systems [17]. This algorithm employed two parallelization techniques: (1) processing all the projections within one subset in parallel (OSC-ang), and (2) dividing the whole volume into various parts and reconstructing them in parallel (OSC-vol). Both techniques rely heavily on re-projection/back-projection operations. The second parallelization strategy is suitable for distributed-memory systems. It was also found that the optimal choice of the OSC-ang and OSC-vol specifics depended on the dataset size [17]. Using multiple parallelization techniques is effective in reducing the communication cost during data transfer.

This paper highlights a parallel Katsevich algorithm developed by the authors [5]. It focuses on the data communication and the analysis of performance. The analytical results are compared with numerical ones to demonstrate the validity of the analytical expression for evaluating performance benchmarks. In the following sections, the Katsevich algorithm and its numerical implementation are briefly outlined. The data parallelism and communication are summarized. An analytical expression to evaluate the performance speedup is derived. The results from the analytical expression and the numerical experiments are compared and discussed. A conclusion is given, followed by proposed future work.

Figure 1. Coordinate systems and variables used for image reconstruction in the case of helical cone-beam CT. (Figure labels: Tam-Danielsson window, detector, PI-segment, gantry, locus.)

2. Katsevich Algorithm and its numerical implementation
2.1 Katsevich theorem
As shown in Fig. 1, a helical scanning locus $C$ in the 3-D Euclidean space $\mathbb{R}^3$ can be mathematically described as the geometrical set [2]

$$ C := \{\, y = (y_1, y_2, y_3) \in \mathbb{R}^3 : y_1 = R\cos(s),\ y_2 = R\sin(s),\ y_3 = s h / (2\pi),\ s \in \mathbb{R} \,\}, $$

where $y_1$, $y_2$ and $y_3$ are the Cartesian coordinates, $s$ is an angular parameter, and $h$ and $R$ are the pitch and radius of the locus.
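As a minimal sketch of the helical locus above, the following Python function evaluates the source position for a given angular parameter, assuming the parameterization y1 = R cos(s), y2 = R sin(s), y3 = s h / (2π). The function name and the numeric radius and pitch in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def helix_source_position(s, radius, pitch):
    """Return (y1, y2, y3) of the X-ray source on the helix.

    s      : angular parameter in radians
    radius : helix radius R (hypothetical units)
    pitch  : helix pitch h, the axial advance per full turn
    """
    y1 = radius * math.cos(s)
    y2 = radius * math.sin(s)
    y3 = s * pitch / (2.0 * math.pi)
    return (y1, y2, y3)
```

For instance, `helix_source_position(2 * math.pi, 75.0, 25.0)` advances the source by exactly one pitch along the third axis while returning it to its starting angle.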
In a practical CT system, a patient is moved through the gantry while the X-ray source rotates around the patient. Relative to the patient's position, the locus of the X-ray source can be viewed as the helix $C$. Let $S^2$ be the unit sphere in $\mathbb{R}^3$. Katsevich's formula then reconstructs $f$ from its cone-beam transform as

$$ f(x) = -\frac{1}{2\pi^2} \int_{I_{PI}(x)} \frac{1}{|x - y(s)|} \int_0^{2\pi} \left. \frac{\partial}{\partial q} D_f\big(y(q), \Theta(s, x, \gamma)\big) \right|_{q=s} \frac{d\gamma}{\sin\gamma}\, ds \tag{1} $$

where the cone-beam transform of $f$ is

$$ D_f(y, \Theta) := \int_0^{\infty} f(y + t\Theta)\, dt, \qquad \Theta \in S^2. \tag{2} $$

The detailed expressions and notations can be found in [5].

2.2 Numerical Implementation
As illustrated in Fig. 1, a local coordinate system on the planar detector is formed to numerically implement Katsevich's formula [1]. The cone-beam projection data are measured using planar detector arrays parallel to $d_1$ and $d_2$ at a distance $D$ from $y(s)$. The Katsevich algorithm can be numerically implemented in two steps: a filtration step, called Hilbert filtering, which calculates an intermediate function of $(s, u, v)$, and a weighted backprojection step, which computes $f$. Most of the computation is consumed in the second step. The detailed expressions and numerical results can be found in [3-5].

3. Data parallelism and implementation
Since a 3-D image reconstruction requires a great amount of time, the authors designed the algorithm's parallelism using data decomposition. As described above, two major computational procedures are required in the Katsevich algorithm for medical image reconstruction: filtration and backprojection. In filtration, most of the computation time is used to calculate various numerical derivatives. The projection data from different views (say, $n$ views) with corresponding angles can be distributed to multiple PEs and processed in parallel. The computation of the derivatives requires the projection data to be partitioned in such a way that the data from several view angles are sent to the same PE. Therefore, how to decompose the projection data and distribute them to each PE is a key issue. In our design, each PE receives an amount of projection data based on its computing capacity. Since the PC cluster is a homogeneous system, the projection data are simply partitioned evenly, as shown in Fig. 2.
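The even decomposition of views over a homogeneous cluster can be sketched as below. This is only an illustration under stated assumptions: the function name, the contiguous-range layout, and the choice to give any leftover views to the lowest-ranked PEs are hypothetical, not details confirmed by the paper.

```python
def partition_views(n_views, n_pes):
    """Split view indices 0..n_views-1 into n_pes contiguous half-open ranges.

    Each PE gets floor(n_views / n_pes) views; the first (n_views % n_pes)
    PEs get one extra view so that the load differs by at most one view.
    """
    if n_pes < 1:
        raise ValueError("n_pes must be at least 1")
    if n_views < n_pes:
        raise ValueError("need at least one view per PE")
    base, extra = divmod(n_views, n_pes)
    ranges = []
    start = 0
    for rank in range(n_pes):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges
```

With 10 views and 3 PEs this yields the ranges (0, 4), (4, 7) and (7, 10). Contiguous ranges keep neighboring views on the same PE, which suits the derivative computation in the filtration step.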
If a system is heterogeneous, one can decompose the projection data with load balance taken into account.

Figure 2. Data flow of the parallel Katsevich algorithm (projection, filtering and backprojection on PE 1 through PE N, followed by collection on the master PE).

The backprojection process is a voxel-driven formulation. The reconstruction of each voxel $x$ can be performed independently and requires a large amount of computation. Therefore, the volume is also partitioned over the PEs consistent with their processing capabilities. Each PE reconstructs its corresponding voxels, as also shown in Fig. 2.

The overall parallel computation proceeds in the following order. The projection data are first partitioned and distributed over all the PEs. After each PE receives its assigned projection data, it independently performs the filtering operation. Once each PE completes its filtering operation, it sends the filtered data to all the PEs. Once each PE has received all filtered data from the other PEs, it independently performs the intensive backprojection on its partitioned voxels. Finally, the backprojected data are collected on a master PE, which outputs the final reconstructed image.

4. Parallel Experiments and numerical results
4.1. System hardware and software
The parallel Katsevich algorithm was implemented on a Microway 64-bit AMD Opteron-based HPC cluster with 16 nodes. Each node has two processors (PEs) and 4 GB of memory. The system is located at the Medical Imaging High Performance Computing Lab (MIHPC Lab) of the University of Iowa. The total system storage is 8 TB for archiving and retrieval of high-resolution images. The program is written in C and compiled by the Portland C compiler with the Message Passing Interface (MPI) library. The program invokes MPI functions such as sending, receiving, broadcasting, and collecting.

4.2. Data preparation
The parallel implementation of the Katsevich algorithm was evaluated by reconstructing the 3-D Shepp-Logan phantom [17]. The spiral cone-beam projection data were collected with a planar detector, as shown in Fig. 1. Different datasets (volumes of $128^3$, $256^3$, $384^3$ and $512^3$) were used to measure the performance (mainly speedup and efficiency) and to study the effects of data size, data transfer rate, etc.
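The processing order described in Section 3 (distribute views, filter locally, exchange filtered data, backproject a voxel subset, collect on the master) can be emulated serially in a few lines. The sketch below is not the authors' MPI program: `toy_filter` and `toy_backproject` are hypothetical stand-ins for Hilbert filtering and weighted backprojection, chosen only so that the decomposed result can be checked against a single-PE run.

```python
def split(n_items, n_parts):
    """Contiguous near-equal half-open ranges; some may be empty."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for rank in range(n_parts):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def toy_filter(view):
    """Hypothetical stand-in for the filtration of one view."""
    return [2.0 * pixel for pixel in view]


def toy_backproject(voxel_index, filtered_views):
    """Hypothetical stand-in: each voxel needs ALL filtered views."""
    return (voxel_index + 1) * sum(sum(view) for view in filtered_views)


def emulate_parallel_reconstruction(views, n_voxels, n_pes):
    # Stages 1-2: the master partitions the views; each PE filters its share.
    local_filtered = [[toy_filter(views[i]) for i in range(a, b)]
                      for a, b in split(len(views), n_pes)]
    # Stage 3: every PE receives the filtered data of all other PEs.
    all_filtered = [view for part in local_filtered for view in part]
    # Stage 4: each PE backprojects only its own block of voxels.
    local_volumes = [[toy_backproject(j, all_filtered) for j in range(a, b)]
                     for a, b in split(n_voxels, n_pes)]
    # Stage 5: the master PE collects the blocks into the final volume.
    return [value for block in local_volumes for value in block]
```

Because the voxel blocks are disjoint and every PE sees the full filtered data set, the emulated result is independent of the number of PEs, which is the property the data decomposition relies on.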
The double-precision format was used for the projection data and image data.

4.3. Results of parallel computations
The measured computational times for volumes of $128^3$, $256^3$, $384^3$ and $512^3$ voxels (Case I, Case II, Case III, and Case IV, respectively) are plotted in Fig. 3(a). The detailed data are available in [5]. The results show that the reconstruction time decreases significantly as the number of PEs increases. The benchmarks of a parallel algorithm are given in terms of the speed-up $S_p = T_s / T_p$ and the parallel efficiency $S_p / p$, where $p$ (or $N$) is the number of processors, $T_s$ is the total execution time when one processor is used, and $T_p$ is the total parallel execution time when $p$ processors are used. The speed-up was calculated and plotted in Fig. 3(b) against the number of processors for each of the four datasets. The parallel efficiencies in Cases I, II, III and IV with respect to the number of processors are plotted in Fig. 3(c). The efficiency curve for the first case stays below the ideal efficiency curve and decreases relatively rapidly, whereas the curves for the other cases descend slowly and are close to the ideal efficiency curve.

Figure 3. Comparisons of the performance parameters for the parallel Katsevich algorithm. All the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) are computational time, speedup, efficiency, and ratio of the communication time to the computation time, respectively.

In addition, the efficiency curves for the latter cases show a common wavy pattern, in which the efficiency decreases first, then increases and finally decreases again. In region 1, where the number of PEs ranges from 1 to 5, the parallel efficiencies for these cases decrease. In region 2, the efficiencies increase with the number of PEs. The curves reach their peaks when the number of PEs is about 16. In region 3, also called the post-peak performance region, the efficiencies decrease again as the number of PEs further increases. The appearance of the super-linear effect (the behavior in which the speed-up is greater than the ideal linear speedup) is due to the fact that in the multiprocessor system the memory usage associated with each PE is less than that in the single-processor system [18]. For example, during the backprojection process, each processor reconstructs a portion of the object and thus allocates only that portion of memory. In Case IV, to reconstruct an object into a volume of $512^3$, at least $512^3 \times 8$ bytes = 1 GB of memory is needed for backprojection in a single-processor system, while in a multiprocessor system where $N$ ($N > 1$) PEs are used, the memory associated with each processor is $1/N$ of that memory (1 GB). The impact of memory on the computational ability of the PEs is responsible for the super-linear speed-up. Such phenomena are more evident with a larger dataset, which is why they appear more prominently in Cases II through IV.

The communication time constitutes a smaller percentage of the total reconstruction time as the reconstruction volume becomes larger. Hence, the parallel algorithm is computationally more efficient when a large dataset is processed for higher-resolution reconstruction. The ratio between the communication and computation time corresponding to different numbers of processors is also plotted in Fig. 3(d). It shows that as the size of a dataset increases this ratio decreases, resulting in higher performance.

4.4. Validity comparison
Finally, to verify the correctness of the current parallel implementation, selected slices of the reconstructed objects are compared with the corresponding slices of the 3-D Shepp-Logan phantom. An excellent agreement can be seen in Fig. 4.

5. Analytical study of parallel performance
The speedup of the parallel Katsevich algorithm can be analytically evaluated through a model. The model can be used to predict benchmark performance in terms of speedup and efficiency without conducting computations, and can be used to design a parallel CT reconstruction system. It is of great value to medical image reconstruction using the parallel Katsevich algorithm.
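The benchmark definitions of Section 4.3 and the memory estimate used to explain the super-linear effect can be written out as a small sketch. The function names are illustrative, and the timings in the usage note are hypothetical placeholders rather than measured values from the paper; only the 512-cubed, 8-bytes-per-voxel figure comes from the text.

```python
def speedup(t_serial, t_parallel):
    """S_p = T_s / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_pes):
    """Parallel efficiency S_p / p."""
    return speedup(t_serial, t_parallel) / n_pes


def backprojection_memory_bytes(side, n_pes=1, bytes_per_voxel=8):
    """Memory needed per PE to hold its share of a side^3 volume.

    bytes_per_voxel=8 corresponds to the double-precision format; the
    volume is assumed to be divided evenly over the PEs.
    """
    return side ** 3 * bytes_per_voxel / n_pes
```

For example, `backprojection_memory_bytes(512)` gives 2^30 bytes (1 GB) on one processor, and `backprojection_memory_bytes(512, n_pes=16)` gives one sixteenth of that. A hypothetical run with `speedup(100.0, 12.5)` corresponds to a speed-up of 8 and, on 16 PEs, an efficiency of 0.5.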
A model can be derived based on the workflow of parallel data communications and various computations. Let $m$ be the total pixel size of a detector plane (Fig. 5); that is, $m$ stands for the portion of the projection data recorded on one detector plane. The value of $m$ is determined by the CT scanner. If one collects $n$ perspective views of CT projections (see Fig. 5), the total projection data size $P$ is equal to the product of $m$ and $n$ (i.e., $P = m \cdot n$). The unit of projection data is the pixel. If one desires to generate a reconstruction volume in voxels, its size can be denoted as $V$.

Figure 4. Representative slices of the reconstructed $256^3$ volume. The top row shows the reconstructed slices of the 3-D Shepp-Logan phantom, while the bottom row shows the differences between the reconstructed and original slices. The gray ranges are [1., 1.5] and [-.5.5] for the reconstructed slices and the differences, respectively.

Figure 5. Image data size and decomposition.

At the initial stage, time is needed on the master node to initialize the program, input or load the projection data, dynamically allocate memory, etc. It can be expressed as

$$ T_{\mathrm{init}} = T_{\mathrm{input}} + T_{\mathrm{MPIinit}} + T_{\mathrm{allocation}} + \dots \tag{3} $$

Usually, this amount of time is negligibly small compared with the other communication and computation times.

Once the projection data are prepared, the master begins to partition the projection data and transfer the data to each computational node. There are $N$ processors that participate in a parallel computation task. If the total projection data ($n \cdot m$) are decomposed evenly on a homogeneous HPC system, each PE receives $(nm)/N + 2m$, since the upper or lower layers need information from the neighboring ones for calculating derivatives [3-5]. Therefore, each PE needs an average communication time of

$$ T^{\,i}_{\mathrm{comm1}} = \begin{cases} T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \left[ \dfrac{nm}{N} + 2m \right], & PE_i,\ i = 2, 3, 4, \dots, N-1 \\[2ex] T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \left[ \dfrac{nm}{N} + m \right], & PE_i,\ i = 1, N \end{cases} \tag{4} $$

where $T_{\mathrm{latency}}$ is the time needed to prepare and initiate a data transfer. It depends on the network latency.
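A minimal sketch of the per-PE cost of this first communication follows, assuming the reading of Eq. (4) in which interior PEs receive two extra detector planes (one from each neighbor) and the first and last PEs receive one. The function name and the 1-based rank convention are assumptions, and the data amounts must be expressed in the unit that `tau_transfer` is quoted for.

```python
def comm1_time(rank, n_pes, n_views, m, t_latency, tau_transfer):
    """Estimated time for PE number `rank` (1-based) to receive its projections.

    n_views      : number of views n
    m            : data amount of one detector plane
    t_latency    : time to prepare and initiate one transfer
    tau_transfer : time to transfer one unit of data
    """
    if n_pes < 2:
        raise ValueError("the neighbor overlap is only defined for n_pes >= 2")
    if not 1 <= rank <= n_pes:
        raise ValueError("rank must be in 1..n_pes")
    # End PEs have one neighbor, interior PEs have two.
    overlap = m if rank in (1, n_pes) else 2 * m
    return t_latency + tau_transfer * (n_views * m / n_pes + overlap)
```

The difference between end and interior PEs is a single detector plane, which is why the model later uses the interior expression for all PEs.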
For simplicity, and without losing important information, one can approximately use the following expression to estimate $T_{\mathrm{comm1}}$. The first data communication time can be expressed as

$$ T_{\mathrm{comm1}} = T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \left[ \frac{nm}{N} + 2m \right] \tag{5} $$

This is the time required to accomplish the data communication from the master node to the working nodes. Here $\tau_{\mathrm{transfer}}$ is the time used to transfer a byte between two nodes (master and working nodes). It is determined by the local or network bandwidth in units of bytes per second. For example, a Gigabit switch offers a bandwidth (BW) of 128 Mbyte/s. Therefore, typically, $\tau_{\mathrm{transfer}} = 1/BW$ is about $10^{-8}$ s.

If $\tau_{\mathrm{filter}}$ is the computational time required to filter a single projection datum, the total computational time for the filtration process on each processor can be approximately calculated as

$$ T_{\mathrm{filtration}} = \tau_{\mathrm{filter}} \left( \frac{nm}{N} \right) \qquad (PE_i,\ i = 1, 2, 3, 4, \dots, N) \tag{6} $$

Let us evaluate the second data communication, which is needed to pass the filtered data from each processor to all other processors. If $T_{\mathrm{comm2}}$ stands for the second part of the data communication time, used for all the nodes to obtain the filtered projection data, it can be estimated as

$$ T_{\mathrm{comm2}} = T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \left( \frac{nm}{N} \right) \tag{7} $$

If $\tau_{\mathrm{bp}}$ stands for the time required to accomplish a backprojection for a single voxel, the total backprojection computational time $T_{\mathrm{comp2}}$, written $T_{\mathrm{bp}}$ below, can be estimated as

$$ T_{\mathrm{bp}} = \tau_{\mathrm{bp}} \frac{V}{N} = \tau_{\mathrm{bp}} \frac{kP}{N} = \tau_{\mathrm{bp}} \frac{k(nm)}{N} \tag{8} $$

where $k$ is defined as the ratio of the total image volume to the total projection data, $k = V/P$ (in units of voxel/pixel). Finally, a communication is needed to transfer the backprojected data to the master node for assembling the final reconstructed image.
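The stage estimates of Eqs. (5) to (8) can be collected into one small sketch. It assumes the reading of Eq. (7) in which each PE's share nm/N is exchanged, and it interprets "128 Mbyte/s" as 128 x 10^6 bytes per second. The function names and the argument values in the usage note are hypothetical.

```python
def tau_transfer_from_bandwidth(bytes_per_second):
    """Per-byte transfer time, tau_transfer = 1 / BW."""
    return 1.0 / bytes_per_second


def stage_times(n_views, m, k, n_pes, t_latency, tau_transfer, tau_filter, tau_bp):
    """Per-PE time estimates for the first four stages of the model.

    k is the ratio V / P of reconstructed voxels to projection pixels.
    """
    share = n_views * m / n_pes  # projection data handled by one PE
    return {
        "comm1": t_latency + tau_transfer * (share + 2 * m),  # Eq. (5)
        "filtration": tau_filter * share,                      # Eq. (6)
        "comm2": t_latency + tau_transfer * share,             # Eq. (7)
        "backprojection": tau_bp * k * share,                  # Eq. (8)
    }
```

For a 128 x 10^6 byte/s link, `tau_transfer_from_bandwidth(128e6)` is 7.8125e-9 s, consistent with the order of magnitude quoted in the text. Only the two computation stages shrink strictly as 1/N; the latency term and the 2m overlap do not.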
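It can be calculated as

$$ T_{\mathrm{comm3}} = T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} V = T_{\mathrm{latency}} + \tau_{\mathrm{transfer}}\, kP = T_{\mathrm{latency}} + \tau_{\mathrm{transfer}}\, k(nm) \tag{9} $$

Therefore, the total parallel time using $N$ processors can be evaluated using the following expression

$$ T_{\mathrm{parallel}} = T_{\mathrm{init}} + T_{\mathrm{comm1}} + T_{\mathrm{filtration}} + T_{\mathrm{comm2}} + T_{\mathrm{backprojection}} + T_{\mathrm{comm3}} + T_{\mathrm{output}} $$

Since the total sequential computation time is

$$ T_{\mathrm{sequential}} = T_{\mathrm{init}} + \tau_{\mathrm{filter}} P + \tau_{\mathrm{bp}} V + T_{\mathrm{output}} = T_{\mathrm{init}} + \tau_{\mathrm{filter}} (nm) + \tau_{\mathrm{bp}}\, k(nm) + T_{\mathrm{output}} $$

the speedup can be expressed as

$$ S = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}} = \frac{T_{\mathrm{init}} + (nm)\,[\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}] + T_{\mathrm{output}}}{T_{\mathrm{init}} + \sum_{i=1,2,3} T_{\mathrm{comm},i} + \sum_{j=1,2} T_{\mathrm{comp},j} + T_{\mathrm{output}}} = \frac{a}{b} \tag{10} $$

with

$$ \begin{aligned} a &= T_{\mathrm{init}} + (nm)\,[\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}] + T_{\mathrm{output}} \\ b &= T_{\mathrm{init}} + T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \left[ \frac{nm}{N} + 2m \right] + \tau_{\mathrm{filter}} \frac{nm}{N} + T_{\mathrm{latency}} + \tau_{\mathrm{transfer}} \frac{nm}{N} \\ &\quad + \tau_{\mathrm{bp}} \frac{k(nm)}{N} + T_{\mathrm{latency}} + \tau_{\mathrm{transfer}}\, k(nm) + T_{\mathrm{output}} \end{aligned} \tag{11} $$

Collecting terms gives

$$ S = \frac{a}{c}, \qquad c = T_{\mathrm{init}} + 3T_{\mathrm{latency}} + \left[ \frac{2nm}{N} + 2m + k(nm) \right] \tau_{\mathrm{transfer}} + \frac{nm}{N} (\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}) + T_{\mathrm{output}} \tag{12} $$

If the values of $T_{\mathrm{init}}$, $T_{\mathrm{latency}}$ and $T_{\mathrm{output}}$ are negligibly small, as usual, dividing the numerator and denominator by $nm$ yields

$$ S = \frac{\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}}{\left[ \dfrac{2}{N} + \dfrac{2}{n} + k \right] \tau_{\mathrm{transfer}} + \dfrac{1}{N} (\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}})} \tag{13} $$

Usually $n$ is very large, on the order of 1,000 to 10,000, so the second term in the bracket of the denominator, $2/n$, is very small and can be neglected. The above expression reduces to

$$ S = \frac{\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}}{\left( \dfrac{2}{N} + k \right) \tau_{\mathrm{transfer}} + \dfrac{1}{N} (\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}})} \tag{14} $$

or

$$ S = \frac{N (\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}})}{(2 + Nk)\, \tau_{\mathrm{transfer}} + (\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}})} = N\, \frac{1}{1 + (2 + Nk)\, \dfrac{\tau_{\mathrm{transfer}}}{\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}}} = \frac{N}{1 + \alpha} \tag{15} $$

where

$$ \alpha = (2 + Nk)\, \frac{\tau_{\mathrm{transfer}}}{\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}} \tag{16} $$

Equation (16) gives a precise model to evaluate the parallel performance of the algorithm. The value $\alpha$ is a correction factor that is determined by many independent variables. Through equation (16), one can discuss many influences on the speedup, as follows.

5.1 Influence of the number of processors, N
It is clear that the speedup is mainly determined by the number of processors, $N$, in a linear relationship corrected by the factor $\alpha$, if the value of $k$ is small or if the byte transfer time is small owing to a high bandwidth of the interconnection or network. This explains why the speedup appears linear when $N$ is small. When $N$ increases, the growth of the speedup slows down. If $N$ becomes a large number, for given $k$, filtering and backprojection algorithms, and bandwidth, the speedup reaches a limit, i.e.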
$$ S_{N,\mathrm{limit}} = \frac{\tau_{\mathrm{filter}} + k\tau_{\mathrm{bp}}}{k\, \tau_{\mathrm{transfer}}} \tag{17} $$

This phenomenon was found in our image reconstruction experiments on the USA National TeraGrid, when we deployed 8 processors. When $N$ reaches a value of 2 processors, the speedup reaches $S_{N,\mathrm{limit}}$, which is about . In the current study, for the cases of reconstruction volumes $V = 128^3$ and $256^3$, the total projection data P=51**4 (m=*4, n=51), while for the case of reconstruction volume $512^3$, the total projection data P= 71*5*7 (m=5*7, n=71). Therefore, the ratios $k$ are .49918, .994, and , respectively. Since $\tau_{\mathrm{bp}}$ is proportional to the number of projection views, $n$, the ratio $\tau^{*}_{\mathrm{bp}} = \tau^{512}_{\mathrm{bp}} / \tau^{256}_{\mathrm{bp}}$ is about 2. Because the filtration time $\tau_{\mathrm{filter}}$ is relatively small compared with $\tau_{\mathrm{bp}}$, the ratio of $S_{N,\mathrm{limit}}$ in the $512^3$ case to that in the $256^3$ case can be estimated as the ratio $\tau^{*}_{\mathrm{bp}}$, that is, a factor of two. This can be seen in Fig. 6. One can use this relation to predict the effect of data size on high-performance results when dealing with large datasets.

5.2 Influence of bandwidth
The correction factor in the speedup expression is determined by the network bandwidth (in units of bits/second), the ratio of the image volume to the total projection data size, and the filtering time and backprojection time for each projection datum and voxel, respectively. It is very interesting to see that the projection data size described by $m$ and $n$ has no direct influence on the speedup. Ho