Efficient SIMD Arithmetic Modulo a Mersenne Number

Joppe W. Bos, Thorsten Kleinjung, Arjen K. Lenstra
EPFL IC LACAL, Station 14, CH-1015 Lausanne, Switzerland

Peter L. Montgomery
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

Abstract: This paper describes carry-less arithmetic operations modulo an integer $2^M - 1$ in the thousand-bit range, targeted at single instruction multiple data platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles a new record was set for the elliptic curve method for integer factorization.

Keywords: Mersenne number, Single Instruction Multiple Data, Cell processor, Elliptic curve method, Integer factorization

I. INTRODUCTION

Numbers of a special form often allow faster modular arithmetic operations than generic moduli. This is exploited in a variety of applications and has led to a substantial body of literature on the subject of fast special arithmetic. Speeding up calculations using special moduli was already proposed in the mid 1960s by Merrill [40] in the setting of residue number systems (RNS) [25]. Other applications range from speeding up fast Fourier transform based multiplication [19], enhancing the performance of digital signal processing [54], [50], [23], to faster elliptic curve cryptography (ECC; [32], [41]), such as in [3].

Another application area of special moduli is in factorization attempts of so-called Cunningham numbers, numbers of the form $b^n \pm 1$ for $b = 2, 3, 5, 6, 7, 10, 11, 12$ up to high powers. This long term factorization project, originally reported in the Cunningham tables [21] and still continuing in [15], has a long and distinguished record of inspiring algorithmic developments and large-scale computational projects [34], [42], [14], [46], [37], [13]. Factorizations from [15] with $b = 2$ are used in formal correctness proofs of floating point division methods [27].
Several of these developments [36] turned out to be applicable beyond special form moduli, and are relevant for security assessment of various common public-key cryptosystems.

This paper concerns efficient arithmetic modulo a Mersenne number, an integer of the form $2^M - 1$. These numbers, and a larger family of numbers called generalized Mersenne numbers [51], [17], [1], have found many arithmetic applications ranging from number theoretic transforms [12] to cryptography. In the latter they are used to run calculations concurrently using RNS [2] or to improve the speed of finite field arithmetic in ECC based schemes [51], [55]. The great internet Mersenne prime search project [26] is based on an implementation of the Lucas-Lehmer primality test [39], [33] for Mersenne numbers in the many million bit range. Hence, efficient arithmetic modulo a Mersenne number is a widely studied subject, not just of interest in its own right but with many applications.

Our interest in arithmetic modulo a Mersenne number was triggered by a potential (special) number field sieve (NFS) project [36], for which we need a list of composites dividing $2^M - 1$ for exponents $M$ in the range from 1000 to 1200. The Cunningham tables contain at least 20 composite Mersenne numbers (or composite factors thereof) in the desired range that have not been fully factored yet. It may be expected that some of these composites are not suitable candidates for our list because they can be factored faster using the elliptic curve method (ECM) for integer factorization [38] than by means of special NFS (SNFS). The only way to find out if ECM is indeed preferable is by subjecting each candidate to an extensive ECM effort (which, though it may be substantial, is small compared to the effort that would be required by SNFS): only candidates that ECM failed to factor should be included on the list.

The efficiency of ECM factoring attempts relies on the efficiency of integer arithmetic modulo the number being factored.
Given the need to do extensive ECM pre-testing for at least 20 composite Mersenne numbers, we developed arithmetic operations modulo a Mersenne number suitable for implementation of ECM on the platform that we intended to use for the calculations: the Cell processor as found in the Sony PlayStation 3 game console. Because each ECM effort consists of a large number of independent attempts that can be executed in single instruction multiple data (SIMD) mode and because each core of the Cell processor can be interpreted as a 4-way SIMD environment, our arithmetic modulo a Mersenne number is geared towards SIMD implementation. It is described in Section III after a brief description of the Cell architecture in Section II. Although our implementations were written for the Cell processor, our methods apply to any type of SIMD platform, including graphics cards. Section IV sketches ECM, our Cell processor implementation, and lists some of our ECM results, including a new ECM record factorization.

While the new ECM factorizations removed some of the easy cases from our list of candidate Mersenne numbers, the further practical implications of ECM records are limited to their consequence for two variants of the RSA cryptosystem [47], namely RSA multiprime [47] and unbalanced RSA [48]. The former gains a speedup by a factor of $r^2$ or $\frac{r^2}{4}$ for the private operation in vanilla RSA or CRT-RSA, respectively, by selecting RSA moduli (of appropriate size to be out of reach of NFS) consisting of the product of $r > 2$ primes of about the same size. In unbalanced RSA, the RSA modulus has two factors as usual, but one is chosen much smaller than the other. In these variants, $r$ and the smallest factor must be chosen in such a way that ECM has a sufficiently low probability to find the resulting relatively small prime factor(s).
Our ECM findings affirm that 1024-bit RSA moduli with $r \ge 4$ should be avoided [35] and may give practitioners of these variants some guidance how small the factors may be chosen.

II. THE CELL PROCESSOR AND ITS ARCHITECTURE

The Cell processor, the main processor of the PS3 and thus mainly targeted at the gaming market, is a powerful general purpose processor. On the first generation PS3s it can be accessed using Sony's hypervisor, a feature that has been disabled in current versions. This made the PS3 a relatively inexpensive and also flexible source of processing power, as witnessed by a variety of cryptanalytic projects: chosen prefix collisions for the cryptographic hash function MD5 [52], [53], the solution of a 112-bit prime field elliptic curve discrete logarithm problem [9], and implementation of elliptic curve group arithmetic over a degree-130 binary extension field [10].

The architecture of the Cell processor is quite different from that of regular server or desktop processors. Taking full advantage of it requires designing new software. It is worthwhile doing so, because architectures similar to the Cell's will soon be mainstream [44]. It not only helps us to take advantage of the Cell's inexpensive processing power, it also helps to prepare for future generations of processors. See Section IV-A for the rationale why the Cell processor was chosen as the platform for our ECM attempts.

The Cell has a Power Processing Element (PPE), a dual-threaded Power architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX SIMD unit. Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). When running Linux, six SPEs can be used: one is disabled, and one is reserved by the hypervisor. It is conceivable that this last one becomes accessible too [28]. Each SPE runs independently from the others at 3.192 GHz, using its own 256 kilobyte of fast local memory for instructions and data.
It has 128 registers of 128 bits each, allowing SIMD operations on sixteen 8-bit, eight 16-bit, or four 32-bit integers. An SPE has no $32 \times 32 \to 64$-bit or $64 \times 64 \to 128$-bit integer multiplier, but has several 4-way SIMD $16 \times 16 \to 32$-bit integer multipliers including multiply-and-add instructions.

There is an odd and an even pipeline: in ideal circumstances an SPE can dispatch one odd and one even instruction per clock cycle. Most arithmetic instructions are even. Because the SPE lacks smart branch prediction, branching is best avoided (as usual in SIMD). Multiple SIMD processes may be interleaved, filling both pipelines to increase throughput, while possibly increasing per process latency. Here we took advantage of interleaving in another manner.

The Cell processor has also been made available to the supercomputing community by placing two Cell chips in a single blade server. They come with more memory than in the PS3 and on each Cell all eight SPEs are accessible. For high-performing blade servers a newer derivative of the Cell, the PowerXCell 8i, offers enhanced double-precision floating-point capabilities. Due to their significantly higher price these compute nodes come at a price performance ratio quite different from the relatively inexpensive PS3.

III. ARITHMETIC MODULO $2^M - 1$ ON THE SPE

In this section we describe the SPE-arithmetic that we developed for arithmetic modulo $N = 2^M - 1$, for $M$ in the range from 1000 to 1200 (allowing larger values as well). Assume that $M < 13 \cdot 96 - 2 = 1246$ (larger $M$-values can be accommodated by putting $M < u \cdot v - 2$ with $v \cdot (2^{u-1})^2 < 2^{31}$). Our approach aims to optimize overall throughput as opposed to minimize per process latency. Two variants are presented: a first approach where addition and subtraction are fast at the cost of a radix conversion before and after the multiplication, and an alternative approach where radix conversions are avoided at the cost of slower addition and subtraction.
This second variant turns out to be faster for our ECM application. In applications with a different balance between the various operations the first approach could be preferable, so it is described as well. All our methods are particularly suited to SPE-implementation, but the approach may have broader applicability.

For $k \in \mathbb{Z}_{>0}$ a $k$-bit integer is an integer $w$ with $0 \le w < 2^k$. A signed $k$-bit integer is an integer $w$ with $-2^{k-1} \le w < 2^{k-1}$. For $r \in \mathbb{Z}_{>1}$ a radix-$r$ representation of an integer $z$ with $0 \le z < r^s$ is a sequence of radix-$r$ digits $(w_j)_{j=0}^{s-1}$ such that $z = \sum_{j=0}^{s-1} w_j r^j$ and $w_j \in \mathbb{Z}_{\ge 0}$. It is unique if $0 \le w_j < r$ for $0 \le j < s$. If $2^k \ge r$, a signed $k$-bit radix-$r$ representation of $z$ is a sequence $(w_j)_{j=0}^{s}$ of signed $k$-bit integers such that $z = \sum_{j=0}^{s} w_j r^j$. We use signed radix-$2^k$ representation for signed $k$-bit radix-$2^k$ representation.

A. Related work

In [18] an SPE implementation is presented using arithmetic modulo the special prime $2^{255} - 19$ introduced in [3]. SPE-arithmetic modulo a special prime is used in [9] to solve a 112-bit elliptic curve discrete logarithm problem on Cell processors. The SPE-performance of generic versus generalized Mersenne moduli is compared in [8]. SPE-arithmetic for moduli in the 200-bit range is presented in [6], [16]; on PS3s the former is more than twice as fast as the latter. Different approaches to implement arithmetic over a binary extension field on SPEs are stated in [10].

Our usage of a small radix to avoid carries (cf. below) is not new [20], [31, Section 4.6], [6]. In [6] signed radix-$2^{13}$ representation is used along with the SPE's $16 \times 16 \to 32$-bit multiplication instruction to develop fast multiplication modulo 195-bit moduli. All additions done during a single schoolbook multiplication are carry-less, requiring normalization to radix-$2^{13}$ representation only at the end of the multiplication.

B. Representation of 4-tuples of integers modulo $N$

On the SPE it is advantageous to operate on four integers modulo $N$ simultaneously, in 4-way SIMD fashion. Each 128-bit SPE register is interpreted as being partitioned into four 32-bit words. With $s$ 128-bit registers thought to be stacked on top of each other, where $32s \ge M$, four different integers modulo $N$ can be represented using four disjoint parallel columns, each consisting of $s$ words: denoting the $i$th word of the $j$th register by $w_{ij}$ for $i \in \{1, 2, 3, 4\}$ and $j = 0, 1, \ldots, s-1$, the sequence $(w_{ij})_{j=0}^{s-1}$ is interpreted as the radix-$2^{32}$ representation of the $32s$-bit integer $\sum_{j=0}^{s-1} w_{ij} 2^{32j}$. More generally, for any $t \le 32$ of one's choice, the sequence $(w_{ij})_{j=0}^{s-1}$ may represent the integer $\sum_{j=0}^{s-1} w_{ij} 2^{tj}$ whose value depends on the interpretation of the words $w_{ij}$: as an unnormalized radix-$2^t$ representation if the $w_{ij}$ are interpreted as non-negative integers (normalized and unique if $w_{ij} < 2^t$ as well), and as a signed $k$-bit radix-$2^t$ representation, for some $k \le 32$, if the $w_{ij}$ are interpreted as signed $k$-bit integers.

It should be understood that the integer operations described below are always carried out in 4-way SIMD fashion on the SPE.

C. Addition and subtraction modulo $N$

Addition and subtraction in 4-way SIMD fashion on a pair of 4-tuples of integers modulo $N$ in radix-$2^t$ representation, with each 4-tuple represented by a stack of $s$ registers of 128 bits (where $ts \ge M$), is done by applying $s$ additions or subtractions to the matching pairs of registers (one from each stack), combined with a moderate number of carry propagations.
The reduction modulo $N$ most of the time affects just two of the radix-$2^t$ digits, with probability $2^{-1-t-(M \bmod t)}$ that more digits are affected (in which case it causes a slight stall for the other three calculations in the 4-tuple).

For $t = 32$ the SPE's built-in carry generation instructions are used; for smaller $t$-values somewhat more work needs to be done. For completeness (and future reference, cf. Step 5 in Section III-G), we describe the calculation of $c = a + b \bmod N$ and $d = a - b \bmod N$ (so-called addition-subtraction of $a$ and $b$) given the signed radix-$2^{13}$ representations $a = \sum_{j=0}^{95} a_j 2^{13j}$ and $b = \sum_{j=0}^{95} b_j 2^{13j}$. The following 5 steps are carried out:

1) Let $a'_j = a_j + 2^{12}$ for $0 \le j < 96$.
2) Set $c_j = a'_j + b_j$ and $d_j = a'_j - b_j$ for $0 \le j < 96$.
3) Let the initial value of the carry $\tau$ be $0$. For $j = 0$ to $95$ in succession first replace $\tau$ by $\tau + c_j$, next replace $c_j$ by $\tau \bmod 2^{13}$, and finally replace $\tau$ by $\lfloor \tau / 2^{13} \rfloor$. The resulting $\tau$ is a carry corresponding to $\tau \cdot 2^{13 \cdot 96}$; modulo $N$ this carry is taken care of by adding $\tau \cdot 2^{\alpha}$ to $c_{\beta}$ (for $\gamma = 13 \cdot 96 - M$, $\beta = \lfloor \gamma / 13 \rfloor$ and $\alpha = \gamma - 13\beta \in [0, 12]$) followed by a few more carry propagations. If there is still a carry, which occurs rarely, use a more expensive function.
4) Repeat the previous step with $c$ replaced by $d$.
5) Set $c_j = c_j - 2^{12}$ and $d_j = d_j - 2^{12}$ for $0 \le j < 96$.

Steps 1, 2, and 5 allow arbitrary parallelization. Table I lists SPE clock cycle counts for the addition operations modulo $2^{1193} - 1$: it can be seen that for signed radix-$2^{13}$ representation they are more than twice as slow as for radix-$2^{32}$ representation.

D. Multiplication modulo $N$ using radix conversions

Given a pair of 4-tuples of $M$-bit integers, the four pairwise products result in a 4-tuple of $2M$-bit integers.
The four reductions modulo $N$ can in principle be done by means of a few of the above 4-tuple additions and subtractions modulo $N$. Here we present our first approach that uses two different radix representations, thereby making it possible to take advantage of the fast radix-$2^{32}$ addition and subtraction modulo $N$. In Section III-F another approach is described that is based on signed radix-$2^{13}$ representation.

The multiplication modulo $N$ of two $M$-bit integers $a$ and $b$ given by their radix-$2^{32}$ representations, each using 39 words of 32 bits, proceeds in three steps that are described in more detail in Sections III-D1 through III-D3. The steps are:

1) conversion of inputs $a$ and $b$ to signed radix-$2^{13}$ representation;
2) carry-less calculation of the $2M$-bit product $a \cdot b$ in signed 32-bit radix-$2^{13}$ representation;
3) reduction modulo $N$ and conversion to radix-$2^{32}$ representation of the $2M$-bit product $a \cdot b$, resulting in $c = a \cdot b \bmod N \in \{0, 1, \ldots, N-1\}$.

The following sections describe the steps in more detail.

1) Conversion of inputs to signed radix-$2^{13}$ representation: Given the radix-$2^{32}$ representation of the precomputed constant $C_0 = 2^{12} \cdot \sum_{j=0}^{95} 2^{13j}$, first calculate the radix-$2^{32}$ representation of $a + C_0$, in the usual way requiring carries. Next, using masks and shifts, extract the radix-$2^{13}$ representation $(\tilde{a}_j)_{j=0}^{95}$ of $a + C_0$, and finally subtract $C_0$ again by calculating $a_j = \tilde{a}_j - 2^{12}$, for $j = 0, 1, \ldots, 95$ (because $a_{96} = 0$ for our choice of $M$, it is dropped).
The last two steps allow various straightforward parallelizations and run twice as fast (while requiring fewer registers) if two 13-bit chunks are packed into a single 32-bit word. Applying the same method to $b$, we find signed radix-$2^{13}$ representations of the inputs, below regarded as polynomials $P_a(X) = \sum_{j=0}^{95} a_j X^j$, $P_b(X) = \sum_{j=0}^{95} b_j X^j \in \mathbb{Z}[X]$ with $P_a(2^{13}) = a$ and $P_b(2^{13}) = b$.

2) Carry-less calculation of the $2M$-bit product in signed 32-bit radix-$2^{13}$ representation: The product polynomial $P(X) = P_a(X) P_b(X) = \sum_{j=0}^{190} p_j X^j$ corresponds to the carry-less product calculation of $a$ and $b$ as represented by $(a_j)_{j=0}^{95}$ and $(b_j)_{j=0}^{95}$, respectively. Its coefficients satisfy $|p_j| \le 96 \cdot (2^{12})^2 < 2^{31}$, which allows computation modulo $2^{32}$, resulting in a signed 32-bit radix-$2^{13}$ representation $(p_j)_{j=0}^{190}$ of the product $a \cdot b = P(2^{13})$. If $M < 13 \cdot w$ with $w < 96$, the degree of $P(X)$ will be at most $2w - 2 < 190$, which leads to savings here and in the description below.

The polynomial $P(X)$ is calculated using three levels of Karatsuba multiplication [30] (but see Section III-F2 for the possibility to use more levels), resulting in 27 pairs of polynomials $(P_a^{(k)}(X), P_b^{(k)}(X))$ of degree $\le 11$, for $k = 1, 2, \ldots, 27$ (in the more general case where $M < u \cdot v - 2$ we would use $16 - u$ levels). This leads to 27 independent polynomial multiplications $Q^{(k)}(X) = P_a^{(k)}(X) P_b^{(k)}(X)$, done using carry-less schoolbook multiplications. The polynomial $P(X)$ is then obtained by carry-less additions and subtractions of the appropriate $Q^{(k)}(X)$'s.
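The conversion to signed radix-$2^{13}$ digits and the carry-less product can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' SPE code: it handles a single integer rather than a 4-tuple, it relies on Python's unbounded integers instead of 32-bit words, and the function names (`to_signed_digits`, `schoolbook`, `karatsuba`, `evaluate`) are hypothetical. The exponent $M = 1193$ is borrowed from the modulus of Table I.

```python
import random

M = 1193                      # exponent of the modulus used in Table I
N = (1 << M) - 1
BITS = 13                     # radix 2^13
DIGITS = 96                   # 13 * 96 = 1248 > M + 2
HALF = 1 << (BITS - 1)        # 2^12

def to_signed_digits(a):
    """Return 96 digits a_j in [-2^12, 2^12) with sum(a_j * 2^(13 j)) == a."""
    c0 = sum(HALF << (BITS * j) for j in range(DIGITS))  # the constant C_0
    shifted = a + c0                                      # ordinary addition with carries
    assert shifted >> (BITS * DIGITS) == 0                # digit a_96 is zero, so it is dropped
    mask = (1 << BITS) - 1
    return [((shifted >> (BITS * j)) & mask) - HALF for j in range(DIGITS)]

def evaluate(coeffs):
    """Evaluate the polynomial with the given coefficients at X = 2^13."""
    return sum(c * (1 << (BITS * j)) for j, c in enumerate(coeffs))

def schoolbook(x, y):
    """Carry-less product: plain polynomial multiplication of digit lists."""
    p = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            p[i + j] += xi * yj
    return p

def karatsuba(x, y, levels, counter):
    """Karatsuba on equal-length digit lists; counter[0] counts base products."""
    if levels == 0:
        counter[0] += 1
        return schoolbook(x, y)
    h = len(x) // 2
    x0, x1, y0, y1 = x[:h], x[h:], y[:h], y[h:]
    lo = karatsuba(x0, y0, levels - 1, counter)
    hi = karatsuba(x1, y1, levels - 1, counter)
    mid = karatsuba([u + v for u, v in zip(x0, x1)],
                    [u + v for u, v in zip(y0, y1)], levels - 1, counter)
    # P = lo + (mid - lo - hi) * X^h + hi * X^(2h), all additions carry-less
    p = [0] * (2 * len(x) - 1)
    for i, c in enumerate(lo):
        p[i] += c
        p[i + h] -= c
    for i, c in enumerate(hi):
        p[i + 2 * h] += c
        p[i + h] -= c
    for i, c in enumerate(mid):
        p[i + h] += c
    return p

rng = random.Random(1)        # fixed seed, purely for reproducibility
a, b = rng.randrange(N), rng.randrange(N)
a_digits, b_digits = to_signed_digits(a), to_signed_digits(b)
base_mults = [0]
product = karatsuba(a_digits, b_digits, 3, base_mults)
print("base multiplications:", base_mults[0])
print("all |p_j| < 2^31:", max(abs(c) for c in product) < (1 << 31))
print("P(2^13) == a*b mod N:", evaluate(product) % N == (a * b) % N)
```

Three levels split the 96 digits into halves of 48, 24, and 12 digits, which is where the $27 = 3^3$ products of polynomials of degree at most 11 come from. Intermediate Karatsuba sums can exceed 32 bits in this sketch; on the SPE the same computation is done modulo $2^{32}$, which is harmless because the final coefficients are bounded by $2^{31}$ in absolute value.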
3) Reduction modulo $N$ and conversion to radix-$2^{32}$ representation of the $2M$-bit product: Given a signed 32-bit radix-$2^{13}$ representation $(p_j)_{j=0}^{190}$ of the $2M$-bit product $a \cdot b$, regarded as the polynomial $P(X) = \sum_{j=0}^{190} p_j X^j$ with $P(2^{13}) = a \cdot b$, the radix-$2^{32}$ representation $(c_i)_{i=0}^{38}$ of the $M$-bit number $c \equiv P(2^{13}) \bmod N$ is calculated. We use the following precomputed constants:

• $C_1 \equiv -2^{31} \cdot \sum_{j=0}^{190} 2^{13j} \bmod N$, $0 \le C_1 < N$.
• Integers $k_j$, $l_j$ and $m_j$ such that $13j = m_j M + 32 l_j + k_j$ with $0 \le 32 l_j + k_j < M$ and $0 \le k_j < 32$, for $0 \le j < 191$. Note that $m_j \in \{0, 1, 2\}$ because $M > 827$ (and $M < 1246$).

Given these values, the following four steps are carried out, the correctness of which easily follows by inspection:

1) For $0 \le j < 191$, compute $\tilde{p}_j = p_j + 2^{31}$ (this allows arbitrary parallelization), so that $0 \le \tilde{p}_j < 2^{32}$. As a result $\left(\sum_{j=0}^{190} \tilde{p}_j \cdot 2^{13j}\right) + C_1 \equiv P(2^{13}) \bmod N$.
2) For $0 \le j < 191$, left shift $\tilde{p}_j$ over $k_j$ bits and right shift $\tilde{p}_j$ over $32 - k_j$ bits, to obtain $d_j, e_j$ such that $\tilde{p}_j \cdot 2^{13j} \equiv d_j \cdot 2^{32 l_j} + e_j \cdot 2^{32(l_j + 1)} \bmod N$ (this again allows arbitrary parallelization).
3) Let $v_0 = 0$. For $0 \le i < 39$, let
$$u_i = \sum_{j:\, l_j = i} d_j + \sum_{j:\, l_j + 1 = i} e_j \tag{1}$$
(where the indices $j$ can be precomputed) and compute
$$\tilde{c}_i = (v_i + u_i) \bmod 2^{32} \in \{0, 1, \ldots, 2^{32} - 1\}, \qquad v_{i+1} = \lfloor (v_i + u_i) / 2^{32} \rfloor$$
(this allows partial parallelization). Finally, compute $\tilde{c}_{39} = v_{39} + \sum_{j:\, l_j = 38} e_j$. Using Eq. (1), reduction modulo $N$ is effected by disregarding $m_j$ and grouping together identical $d_j$-values and identical $e_j$-values. As a result, $(\tilde{c}_i)_{i=0}^{39}$ is the radix-$2^{32}$ representation of a number $\tilde{c}$ with $\tilde{c} + C_1 \equiv c \bmod N$.
4) Calculate $c \equiv \tilde{c} + C_1 \bmod N$.
Although the numbers are slightly bigger, this calculation is in principle the same as regular addition modulo $N$.

E. Optimizations

Swapping even for odd instructions. Modular arithmetic mostly relies on the SPE's arithmetic instructions, which are even pipeline instructions. Following the approach from [43], [11] one may replace an even instruction by one or more odd ones with the same effect. Although this may increase the latency for the functionality of each replaced even instruction and the number of instructions, balancing the counts of even and odd instructions often increases the throughput. This method was used throughout our implementation. Examples are sketched below.

Modular squaring. When squaring polynomials of degree at most $11$, half of the mixed products, i.e., $\frac{12^2 - 12}{2} = 66$ multiplications, can be saved by doubling their resulting 21 sums (as the top elements are zero). Of these sums, the eleven for coefficients of odd degree can be doubled for free during the conversion to radix-$2^{32}$, by using for odd $j$ precomputed integers $\tilde{k}_j$, $\tilde{l}_j$, and $\tilde{m}_j$ such that $13j + 1 = \tilde{m}_j M + 32 \tilde{l}_j + \tilde{k}_j$ with $0 \le 32 \tilde{l}_j + \tilde{k}_j < M$ and $0 \le \tilde{k}_j < 32$, instead of $k_j$, $l_j$, and $m_j$, as defined earlier. The ten remaining sums need to be doubled before they are added to the corresponding squared input coefficient. Each doubling can be done by a single even pipeline addition. However, a doubling can also be performed by four odd pipeline instructions (or two doublings in six odd pipeline instructions). The ten remaining doublings could thus be squeezed in the
