Efficient SIMD arithmetic modulo a Mersenne number
Joppe W. Bos, Thorsten Kleinjung, Arjen K. Lenstra
EPFL IC LACAL, Station 14, CH-1015 Lausanne, Switzerland
Peter L. Montgomery
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
Abstract
This paper describes carry-less arithmetic operations modulo an integer $2^M - 1$ in the thousand-bit range, targeted at single instruction multiple data platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles a new record was set for the elliptic curve method for integer factorization.
Keywords
Mersenne number, Single Instruction Multiple Data, Cell processor, Elliptic curve method, Integer factorization
I. INTRODUCTION
Numbers of a special form often allow faster modular arithmetic operations than generic moduli. This is exploited in a variety of applications and has led to a substantial body of literature on the subject of fast special arithmetic. Speeding up calculations using special moduli was already proposed in the mid 1960s by Merrill [40] in the setting of residue number systems (RNS) [25]. Other applications range from speeding up fast Fourier transform based multiplication [19], enhancing the performance of digital signal processing [54], [50], [23], to faster elliptic curve cryptography (ECC; [32], [41]), such as in [3].

Another application area of special moduli is in factorization attempts of so-called Cunningham numbers, numbers of the form $b^n \pm 1$ for $b = 2, 3, 5, 6, 7, 10, 11, 12$ up to high powers. This long term factorization project, originally reported in the Cunningham tables [21] and still continuing in [15], has a long and distinguished record of inspiring algorithmic developments and large-scale computational projects [34], [42], [14], [46], [37], [13]. Factorizations from [15] with
$b = 2$ are used in formal correctness proofs of floating point division methods [27]. Several of these developments [36] turned out to be applicable beyond special form moduli, and are relevant for security assessment of various common public-key cryptosystems.

This paper concerns efficient arithmetic modulo a Mersenne number, an integer of the form $2^M - 1$. These numbers, and a larger family of numbers called generalized Mersenne numbers [51], [17], [1], have found many arithmetic applications ranging from number theoretic transforms [12] to cryptography. In the latter they are used to run calculations concurrently using RNS [2] or to improve the speed of finite field arithmetic in ECC based schemes [51], [55]. The great internet Mersenne prime search project [26] is based on an implementation of the Lucas-Lehmer primality test [39], [33] for Mersenne numbers in the many million-bit range. Hence, efficient arithmetic modulo a Mersenne number is a widely studied subject, not just of interest in its own right but with many applications.

Our interest in arithmetic modulo a Mersenne number was triggered by a potential (special) number field sieve (NFS) project [36], for which we need a list of composites dividing
$2^M - 1$ for exponents $M$ in the range from 1000 to 1200. The Cunningham tables contain at least 20 composite Mersenne numbers (or composite factors thereof) in the desired range that have not been fully factored yet. It may be expected that some of these composites are not suitable candidates for our list because they can be factored faster using the elliptic curve method (ECM) for integer factorization [38] than by means of special NFS (SNFS). The only way to find out if ECM is indeed preferable is by subjecting each candidate to an extensive ECM effort (which, though it may be substantial, is small compared to the effort that would be required by SNFS): only candidates that ECM failed to factor should be included on the list.

The efficiency of ECM factoring attempts relies on the efficiency of integer arithmetic modulo the number being factored. Given the need to do extensive ECM pretesting for at least 20 composite Mersenne numbers, we developed arithmetic operations modulo a Mersenne number suitable for implementation of ECM on the platform that we intended to use for the calculations: the Cell processor as found in the Sony PlayStation 3 game console. Because each ECM effort consists of a large number of independent attempts that can be executed in single instruction multiple data (SIMD) mode and because each core of the Cell processor can be interpreted as a 4-way SIMD environment, our arithmetic modulo a Mersenne number is geared towards SIMD implementation. It is described in Section III after a brief description of the Cell architecture in Section II. Although our implementations were written for the Cell processor, our methods apply to any type of SIMD platform, including graphics cards. Section IV sketches ECM, our Cell processor implementation, and lists some of our ECM results, including a new ECM record factorization.

While the new ECM factorizations removed some of the easy cases from our list of candidate Mersenne numbers, the further practical implications of ECM records are limited to their consequence for two variants of the RSA cryptosystem [47], namely
RSA multiprime [47] and unbalanced RSA [48]. The former gains a speedup by a factor of $r^2$ or $r^2/4$ for the private operation in vanilla RSA or CRT-RSA, respectively, by selecting RSA moduli (of appropriate size to be out of reach of NFS) consisting of the product of $r > 2$ primes of about the same size. In unbalanced RSA, the RSA modulus has two factors as usual, but one is chosen much smaller than the other. In these variants, $r$ and the smallest factor must be chosen in such a way that ECM has a sufficiently low probability to find the resulting relatively small prime factor(s). Our ECM findings affirm that 1024-bit RSA moduli with $r \ge 4$ should be avoided [35] and may give practitioners of these variants some guidance how small the factors may be chosen.

II. THE CELL PROCESSOR AND ITS ARCHITECTURE
The Cell processor, the main processor of the PS3 and thus mainly targeted at the gaming market, is a powerful general purpose processor. On the first generation PS3s it can be accessed using Sony's hypervisor, a feature that has been disabled in current versions. This made the PS3 a relatively inexpensive and also flexible source of processing power, as witnessed by a variety of cryptanalytic projects: chosen prefix collisions for the cryptographic hash function MD5 [52], [53], the solution of a 112-bit prime field elliptic curve discrete logarithm problem [9], and implementation of elliptic curve group arithmetic over a degree-130 binary extension field [10].

The architecture of the Cell processor is quite different from that of regular server or desktop processors. Taking full advantage of it requires designing new software. It is worthwhile doing so, because architectures similar to the Cell's will soon be mainstream [44]. It not only helps us to take advantage of the Cell's inexpensive processing power, it also helps to prepare for future generations of processors. See Section IV-A for the rationale why the Cell processor was chosen as the platform for our ECM attempts.

The Cell has a Power Processing Element (PPE), a dual-threaded Power architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX SIMD unit. Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). When running Linux, six SPEs can be used: one is disabled, and one is reserved by the hypervisor. It is conceivable that this last one becomes accessible too [28]. Each SPE runs independently from the others at 3.192 GHz, using its own 256 kilobytes of fast local memory for instructions and data. It has 128 registers of 128 bits each, allowing SIMD operations on sixteen 8-bit, eight 16-bit, or four 32-bit integers. An SPE has no
$32 \times 32 \to 64$-bit or $64 \times 64 \to 128$-bit integer multiplier, but has several 4-way SIMD $16 \times 16 \to 32$-bit integer multipliers including multiply-and-add instructions.

There is an odd and an even pipeline: in ideal circumstances an SPE can dispatch one odd and one even instruction per clock cycle. Most arithmetic instructions are even. Because the SPE lacks smart branch prediction, branching is best avoided (as usual in SIMD). Multiple SIMD processes may be interleaved, filling both pipelines to increase throughput, while possibly increasing per process latency. Here we took advantage of interleaving in another manner.

The Cell processor has also been made available to the supercomputing community by placing two Cell chips in a single blade server. They come with more memory than in the PS3 and on each Cell all eight SPEs are accessible. For high-performing blade servers a newer derivative of the Cell, the PowerXCell 8i, offers enhanced double-precision floating-point capabilities. Due to their significantly higher price these compute nodes come at a price performance ratio quite different from the relatively inexpensive PS3.

III. ARITHMETIC MODULO $2^M - 1$ ON THE SPE

In this section we describe the SPE-arithmetic that we developed for arithmetic modulo
$N = 2^M - 1$, for $M$ in the range from 1000 to 1200 (allowing larger values as well). Assume that $M < 13 \cdot 96 - 2 = 1246$ (larger $M$ values can be accommodated by putting $M < u \cdot v - 2$ with $v \cdot (2^{u-1})^2 < 2^{31}$). Our approach aims to optimize overall throughput as opposed to minimize per process latency. Two variants are presented: a first approach where addition and subtraction are fast at the cost of a radix conversion before and after the multiplication, and an alternative approach where radix conversions are avoided at the cost of slower addition and subtraction. This second variant turns out to be faster for our ECM application. In applications with a different balance between the various operations the first approach could be preferable, so it is described as well. All our methods are particularly suited to SPE-implementation, but the approach may have broader applicability.

For $k \in \mathbb{Z}_{>0}$ a $k$-bit integer is an integer $w$ with $0 \le w < 2^k$. A signed $k$-bit integer is an integer $w$ with $-2^{k-1} \le w < 2^{k-1}$. For $r \in \mathbb{Z}_{>1}$ a radix $r$ representation of an integer $z$ with $0 \le z < r^s$ is a sequence of radix $r$ digits $(w_j)_{j=0}^{s-1}$ such that $z = \sum_{j=0}^{s-1} w_j r^j$ and $w_j \in \mathbb{Z}_{\ge 0}$. It is unique if $0 \le w_j < r$ for $0 \le j < s$. If $2^k \ge r$, a signed $k$-bit radix $r$ representation of $z$ is a sequence $(w_j)_{j=0}^{s}$ of signed $k$-bit integers such that $z = \sum_{j=0}^{s} w_j r^j$. We use signed radix $2^k$ representation for signed $k$-bit radix $2^k$ representation.
A. Related work
In [18] an SPE implementation is presented using arithmetic modulo the special prime $2^{255} - 19$ introduced in [3]. SPE-arithmetic modulo a special prime is used in [9] to solve a 112-bit elliptic curve discrete logarithm problem on Cell processors. The SPE-performance of generic versus generalized Mersenne moduli is compared in [8]. SPE-arithmetic for moduli in the 200-bit range is presented in [6], [16]; on PS3s the former is more than twice as fast as the latter. Different approaches to implement arithmetic over a binary extension field on SPEs are stated in [10].

Our usage of a small radix to avoid carries (cf. below) is not new [20], [31, Section 4.6], [6]. In [6] signed radix $2^{13}$ representation is used along with the SPE's $16 \times 16 \to 32$-bit multiplication instruction to develop fast multiplication modulo 195-bit moduli. All additions done during a single schoolbook multiplication are carry-less, requiring normalization to radix $2^{13}$ representation only at the end of the multiplication.
B. Representation of 4-tuples of integers modulo $N$

On the SPE it is advantageous to operate on four integers modulo $N$ simultaneously, in 4-way SIMD fashion. Each 128-bit SPE register is interpreted as being partitioned into four 32-bit words. With $s$ 128-bit registers thought to be stacked on top of each other, where $32s \ge M$, four different integers modulo $N$ can be represented using four disjoint parallel columns, each consisting of $s$ words: denoting the $i$th word of the $j$th register by $w_{ij}$ for $i \in \{1, 2, 3, 4\}$ and $j = 0, 1, \ldots, s-1$, the sequence $(w_{ij})_{j=0}^{s-1}$ is interpreted as the radix $2^{32}$ representation of the $32s$-bit integer $\sum_{j=0}^{s-1} w_{ij} 2^{32j}$. More generally, for any $t \le 32$ of one's choice, the sequence $(w_{ij})_{j=0}^{s-1}$ may represent the integer $\sum_{j=0}^{s-1} w_{ij} 2^{tj}$ whose value depends on the interpretation of the words $w_{ij}$: as an unnormalized radix $2^t$ representation if the $w_{ij}$ are interpreted as non-negative integers (normalized and unique if $w_{ij} < 2^t$ as well), and as a signed $k$-bit radix $2^t$ representation, for some $k \le 32$, if the $w_{ij}$ are interpreted as signed $k$-bit integers.

It should be understood that the integer operations described below are always carried out in 4-way SIMD fashion on the SPE.
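The column layout described above can be illustrated with a minimal Python sketch. It is not SPE code: a "register" is modelled as a list of four Python integers, the function names `pack_columns`, `unpack_columns` and `simd_add` are hypothetical, and columns are indexed 0 to 3 rather than 1 to 4.

```python
def pack_columns(values, s, t=32):
    """Spread four integers over s registers of four words each.

    Register j holds digit j (radix 2^t) of each of the four values,
    so column i of the register stack represents values[i].
    """
    assert len(values) == 4 and 0 < t <= 32
    assert all(0 <= v < (1 << (t * s)) for v in values)
    mask = (1 << t) - 1
    return [[(v >> (t * j)) & mask for v in values] for j in range(s)]


def unpack_columns(regs, t=32):
    """Recover the four integers sum_j w_ij * 2^(t*j) from the register stack."""
    return [sum(reg[i] << (t * j) for j, reg in enumerate(regs)) for i in range(4)]


def simd_add(regs_a, regs_b):
    """Word-wise addition of matching registers, without carry propagation.

    For t < 32 the result is an unnormalized radix 2^t representation
    of the four sums, as long as every word still fits in 32 bits.
    """
    out = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(regs_a, regs_b)]
    assert all(w < (1 << 32) for reg in out for w in reg)
    return out
```

With `t=13` and `s=96`, `unpack_columns(simd_add(A, B), t=13)` returns the four sums even though no carries were propagated, which is the sense in which a small radix makes additions carry-less.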
C. Addition and subtraction modulo $N$

Addition and subtraction in 4-way SIMD fashion on a pair of 4-tuples of integers modulo $N$ in radix $2^t$ representation, with each 4-tuple represented by a stack of $s$ registers of 128 bits (where $ts \ge M$), is done by applying $s$ additions or subtractions to the matching pairs of registers (one from each stack), combined with a moderate number of carry propagations. The reduction modulo $N$ most of the time affects just two of the radix $2^t$ digits, with probability $2^{-1-t-(M \bmod t)}$ that more digits are affected (in which case it causes a slight stall for the other three calculations in the 4-tuple).

For $t = 32$ the SPE's built-in carry generation instructions are used; for smaller $t$ values somewhat more work needs to be done. For completeness (and future reference, cf. Step 5 in Section III-G), we describe the calculation of $c = a + b \bmod N$ and $d = a - b \bmod N$ (so-called addition-subtraction of $a$ and $b$) given the signed radix $2^{13}$ representations $a = \sum_{j=0}^{95} a_j 2^{13j}$ and $b = \sum_{j=0}^{95} b_j 2^{13j}$. The following 5 steps are carried out:
1) Let $a'_j = a_j + 2^{12}$ for $0 \le j < 96$.
2) Set $c_j = a'_j + b_j$ and $d_j = a'_j - b_j$ for $0 \le j < 96$.
3) Let the initial value of the carry $\tau$ be $0$. For $j = 0$ to $95$ in succession first replace $\tau$ by $\tau + c_j$, next replace $c_j$ by $\tau \bmod 2^{13}$, and finally replace $\tau$ by $\lfloor \tau / 2^{13} \rfloor$. The resulting $\tau$ is a carry corresponding to $\tau \cdot 2^{13 \cdot 96}$; modulo $N$ this carry is taken care of by adding $\tau \cdot 2^{\alpha}$ to $c_{\beta}$ (for $\gamma = 13 \cdot 96 - M$, $\beta = \lfloor \gamma / 13 \rfloor$ and $\alpha = \gamma - 13\beta \in [0, 12]$) followed by a few more carry propagations. If there is still a carry, which occurs rarely, use a more expensive function.
4) Repeat the previous step with $c$ replaced by $d$.
5) Set $c_j = c_j - 2^{12}$ and $d_j = d_j - 2^{12}$ for $0 \le j < 96$.
Steps 1, 2, and 5 allow arbitrary parallelization. Table I lists SPE clock cycle counts for the addition operations modulo $2^{1193} - 1$: it can be seen that for signed radix $2^{13}$ representation they are more than twice as slow as for radix $2^{32}$ representation.
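The five steps of the addition-subtraction can be sketched in Python as follows. This is a scalar illustration for a single column, not the 4-way SIMD SPE code. The function names are hypothetical, and the rare leftover carry, which the text hands to "a more expensive function", is handled here by simply repeating the fold; that choice is an assumption of the sketch.

```python
RADIX_BITS = 13                      # t = 13, as in the text
DIGITS = 96                          # 96 digits cover M < 13 * 96 - 2
HALF = 1 << (RADIX_BITS - 1)         # 2^12
MASK = (1 << RADIX_BITS) - 1


def carry_and_fold(c, M):
    """Steps 3 and 4: normalize digits to [0, 2^13) and fold the carry modulo 2^M - 1."""
    gamma = RADIX_BITS * DIGITS - M              # 2^(13*96) is congruent to 2^gamma mod N
    beta, alpha = divmod(gamma, RADIX_BITS)      # gamma = 13 * beta + alpha
    tau = 0
    for j in range(DIGITS):
        tau += c[j]
        c[j] = tau & MASK                        # tau mod 2^13, non-negative in Python
        tau >>= RADIX_BITS                       # floor(tau / 2^13), also for negative tau
    while tau != 0:                              # assumption: repeat instead of a special routine
        c[beta] += tau << alpha
        tau = 0
        for j in range(beta, DIGITS):
            tau += c[j]
            c[j] = tau & MASK
            tau >>= RADIX_BITS
    return c


def add_sub(a, b, M):
    """Return signed radix 2^13 digits of a + b and a - b modulo N = 2^M - 1."""
    a_shift = [aj + HALF for aj in a]                        # step 1
    c = [x + y for x, y in zip(a_shift, b)]                  # step 2
    d = [x - y for x, y in zip(a_shift, b)]
    c = carry_and_fold(c, M)                                 # step 3
    d = carry_and_fold(d, M)                                 # step 4
    return [x - HALF for x in c], [x - HALF for x in d]      # step 5


def value(digits):
    """Integer represented by a list of (signed) radix 2^13 digits."""
    return sum(d << (RADIX_BITS * j) for j, d in enumerate(digits))
```

The outputs are again signed radix $2^{13}$ digits, congruent to $a + b$ and $a - b$ modulo $N$; like the inputs, they are not necessarily the least non-negative residues.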
D. Multiplication modulo $N$ using radix conversions

Given a pair of 4-tuples of $M$-bit integers, the four pairwise products result in a 4-tuple of $2M$-bit integers. The four reductions modulo $N$ can in principle be done by means of a few of the above 4-tuple additions and subtractions modulo $N$. Here we present our first approach that uses two different radix representations, thereby making it possible to take advantage of the fast radix $2^{32}$ addition and subtraction modulo $N$. In Section III-F another approach is described that is based on signed radix $2^{13}$ representation.

The multiplication modulo $N$ of two $M$-bit integers $a$ and $b$ given by their radix $2^{32}$ representations, each using 39 words of 32 bits, proceeds in three steps that are described in more detail in Sections III-D1 through III-D3. The steps are:
1) conversion of inputs $a$ and $b$ to signed radix $2^{13}$ representation;
2) carry-less calculation of the $2M$-bit product $a \cdot b$ in signed 32-bit radix $2^{13}$ representation;
3) reduction modulo $N$ and conversion to radix $2^{32}$ representation of the $2M$-bit product $a \cdot b$, resulting in $c = a \cdot b \bmod N \in \{0, 1, \ldots, N-1\}$.
The following sections describe the steps in more detail.
1) Conversion of inputs to signed radix $2^{13}$ representation:
Given the radix $2^{32}$ representation of the precomputed constant $C_0 = 2^{12} \cdot \sum_{j=0}^{95} 2^{13j}$, first calculate the radix $2^{32}$ representation of $a + C_0$, in the usual way requiring carries. Next, using masks and shifts, extract the radix $2^{13}$ representation $(\tilde{a}_j)_{j=0}^{95}$ of $a + C_0$, and finally subtract $C_0$ again by calculating $a_j = \tilde{a}_j - 2^{12}$, for $j = 0, 1, \ldots, 95$ (because $a_{96} = 0$ for our choice of $M$, it is dropped). The last two steps allow various straightforward parallelizations and run twice as fast (while requiring fewer registers) if two 13-bit chunks are packed into a single 32-bit word. Applying the same method to $b$, we find signed radix $2^{13}$ representations of the inputs, below regarded as polynomials $P_a(X) = \sum_{j=0}^{95} a_j X^j$, $P_b(X) = \sum_{j=0}^{95} b_j X^j \in \mathbb{Z}[X]$ with $P_a(2^{13}) = a$ and $P_b(2^{13}) = b$.
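A minimal Python sketch of this conversion, using arbitrary-precision integers in place of the 39-word radix $2^{32}$ arithmetic; the function name `to_signed_radix13` is hypothetical.

```python
T = 13                                   # digit width in bits
S = 96                                   # number of digits
HALF = 1 << (T - 1)                      # 2^12
C0 = HALF * sum(1 << (T * j) for j in range(S))   # 2^12 * sum_j 2^(13 j)


def to_signed_radix13(a):
    """Signed radix 2^13 digits a_0..a_95 of a, for 0 <= a < 2^M with M < 1246."""
    shifted = a + C0                     # one multi-word addition with carries
    # For M < 13 * 96 - 2 the sum still fits in 96 digits, so a_96 = 0 is dropped.
    assert shifted >> (T * S) == 0
    unsigned = [(shifted >> (T * j)) & ((1 << T) - 1) for j in range(S)]  # masks and shifts
    return [d - HALF for d in unsigned]  # subtracting C0 digit by digit
```

Each returned digit lies in $[-2^{12}, 2^{12})$, and the digits sum back to $a$ exactly (not just modulo $N$), because adding and then subtracting $C_0$ leaves the value unchanged.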
2) Carry-less calculation of the $2M$-bit product in signed 32-bit radix $2^{13}$ representation:
The product polynomial $P(X) = P_a(X) P_b(X) = \sum_{j=0}^{190} p_j X^j$ corresponds to the carry-less product calculation of $a$ and $b$ as represented by $(a_j)_{j=0}^{95}$ and $(b_j)_{j=0}^{95}$, respectively. Its coefficients satisfy $|p_j| \le 96 \cdot (2^{12})^2 < 2^{31}$, which allows computation modulo $2^{32}$, resulting in a signed 32-bit radix $2^{13}$ representation $(p_j)_{j=0}^{190}$ of the product $a \cdot b = P(2^{13})$. If $M < 13 \cdot w$ with $w < 96$, the degree of $P(X)$ will be at most $2w - 2 < 190$, which leads to savings here and in the description below.

The polynomial $P(X)$ is calculated using three levels of Karatsuba multiplication [30] (but see Section III-F2 for the possibility to use more levels), resulting in 27 pairs of polynomials $(P_a^{(k)}(X), P_b^{(k)}(X))$ of degree $\le 11$, for $k = 1, 2, \ldots, 27$ (in the more general case where $M < u \cdot v - 2$ we would use $16 - u$ levels). This leads to 27 independent polynomial multiplications $Q^{(k)}(X) = P_a^{(k)}(X) P_b^{(k)}(X)$, done using carry-less schoolbook multiplications. The polynomial $P(X)$ is then obtained by carry-less additions and subtractions of the appropriate $Q^{(k)}(X)$'s.
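The following Python sketch illustrates why computing modulo $2^{32}$ suffices: intermediate Karatsuba values may wrap around, but the final coefficients are recovered exactly because $|p_j| < 2^{31}$. The helper `wrap32` simulates 32-bit two's complement words. The additive Karatsuba variant and all function names are assumptions of this sketch and need not match the authors' implementation.

```python
MASK32 = (1 << 32) - 1


def wrap32(x):
    """Reduce x modulo 2^32 into the signed range [-2^31, 2^31)."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def schoolbook(a, b):
    """Carry-less product of two coefficient lists, coefficients kept modulo 2^32."""
    p = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            p[i + j] = wrap32(p[i + j] + ai * bj)
    return p


def karatsuba(a, b, levels, counter):
    """Product of equal-length lists: `levels` Karatsuba splits, then schoolbook.

    counter[0] is incremented once per schoolbook multiplication.
    """
    if levels == 0:
        counter[0] += 1
        return schoolbook(a, b)
    h = len(a) // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    lo = karatsuba(a0, b0, levels - 1, counter)
    hi = karatsuba(a1, b1, levels - 1, counter)
    mid = karatsuba([x + y for x, y in zip(a0, a1)],
                    [x + y for x, y in zip(b0, b1)], levels - 1, counter)
    p = [0] * (2 * len(a) - 1)
    for i in range(len(lo)):
        # middle term is mid - lo - hi, all additions carry-less modulo 2^32
        p[i] = wrap32(p[i] + lo[i])
        p[i + h] = wrap32(p[i + h] + mid[i] - lo[i] - hi[i])
        p[i + 2 * h] = wrap32(p[i + 2 * h] + hi[i])
    return p
```

Calling `karatsuba(a, b, 3, counter)` on two lists of 96 signed 13-bit digits performs $3^3 = 27$ schoolbook multiplications of 12-coefficient polynomials (degree at most 11), consistent with the counts in the text.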
3) Reduction modulo $N$ and conversion to radix $2^{32}$ representation of the $2M$-bit product:
Given a signed 32-bit radix $2^{13}$ representation $(p_j)_{j=0}^{190}$ of the $2M$-bit product $a \cdot b$, regarded as the polynomial $P(X) = \sum_{j=0}^{190} p_j X^j$ with $P(2^{13}) = a \cdot b$, the radix $2^{32}$ representation $(c_i)_{i=0}^{38}$ of the $M$-bit number $c \equiv P(2^{13}) \bmod N$ is calculated. We use the following precomputed constants:
• $C_1 \equiv -2^{31} \cdot \sum_{j=0}^{190} 2^{13j} \bmod N$, $0 \le C_1 < N$.
• Integers $k_j$, $l_j$ and $m_j$ such that $13j = m_j M + 32 l_j + k_j$ with $0 \le 32 l_j + k_j < M$ and $0 \le k_j < 32$, for $0 \le j < 191$. Note that $m_j \in \{0, 1, 2\}$ because $M > 827$ (and $M < 1246$).
Given these values, the following four steps are carried out, the correctness of which easily follows by inspection:
1) For $0 \le j < 191$, compute $\tilde{p}_j = p_j + 2^{31}$ (this allows arbitrary parallelization), so that $0 \le \tilde{p}_j < 2^{32}$. As a result $\left(\sum_{j=0}^{190} \tilde{p}_j \cdot 2^{13j}\right) + C_1 \equiv P(2^{13}) \bmod N$.
2) For $0 \le j < 191$, left shift $\tilde{p}_j$ over $k_j$ bits and right shift $\tilde{p}_j$ over $32 - k_j$ bits, to obtain $d_j, e_j$ such that $\tilde{p}_j \cdot 2^{13j} \equiv d_j \cdot 2^{32 l_j} + e_j \cdot 2^{32(l_j + 1)} \bmod N$ (this again allows arbitrary parallelization).
3) Let $v_0 = 0$. For $0 \le i < 39$, let
$u_i = \sum_{j : l_j = i} d_j + \sum_{j : l_j + 1 = i} e_j$,   (1)
(where the indices $j$ can be precomputed) and compute $\tilde{c}_i = (v_i + u_i) \bmod 2^{32} \in \{0, 1, \ldots, 2^{32} - 1\}$, $v_{i+1} = \lfloor (v_i + u_i) / 2^{32} \rfloor$ (this allows partial parallelization). Finally, compute $\tilde{c}_{39} = v_{39} + \sum_{j : l_j = 38} e_j$. Using Eq. (1), reduction modulo $N$ is effected by disregarding $m_j$ and grouping together identical $d_j$ values and identical $e_j$ values. As a result, $(\tilde{c}_i)_{i=0}^{39}$ is the radix $2^{32}$ representation of a number $\tilde{c}$ with $\tilde{c} + C_1 \equiv c \bmod N$.
4) Calculate $c \equiv \tilde{c} + C_1 \bmod N$. Although the numbers are slightly bigger, this calculation is in principle the same as regular addition modulo $N$.
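The four steps above can be traced with a small Python sketch. It is a scalar illustration with a hypothetical function name; the constants $k_j$, $l_j$ and $C_1$ are recomputed on every call rather than precomputed, and step 4 uses Python's big-integer `%` in place of a word-level modular addition. Both simplifications are assumptions of the sketch.

```python
WORDS = 39                      # 39 words of 32 bits cover M < 1246
MASK32 = (1 << 32) - 1


def reduce_product(p, M):
    """Radix 2^32 words (c_0..c_38) of c = P(2^13) mod N for N = 2^M - 1.

    p holds the 191 signed 32-bit coefficients of the product polynomial.
    """
    assert len(p) == 191 and 827 < M < 1246
    N = (1 << M) - 1
    C1 = (-(1 << 31) * sum(1 << (13 * j) for j in range(191))) % N

    u = [0] * (WORDS + 1)
    for j, pj in enumerate(p):
        # 13 j = m_j M + 32 l_j + k_j; m_j is disregarded since 2^M = 1 modulo N
        l_j, k_j = divmod((13 * j) % M, 32)
        pt = pj + (1 << 31)                     # step 1: 0 <= pt < 2^32
        assert 0 <= pt <= MASK32
        d_j = (pt << k_j) & MASK32              # step 2: left shift inside a 32-bit word
        e_j = pt >> (32 - k_j)                  #         bits shifted out on the left
        u[l_j] += d_j                           # grouping as in Eq. (1)
        u[l_j + 1] += e_j

    v = 0
    c_tilde = []
    for i in range(WORDS):                      # step 3: carry chain over 39 words
        total = v + u[i]
        c_tilde.append(total & MASK32)
        v = total >> 32
    c_tilde.append(v + u[WORDS])                # the extra word c~_39

    c_val = (sum(w << (32 * i) for i, w in enumerate(c_tilde)) + C1) % N   # step 4
    return [(c_val >> (32 * i)) & MASK32 for i in range(WORDS)]
```

Feeding in the coefficients of $P_a(X) P_b(X)$ for signed radix $2^{13}$ inputs should reproduce $a \cdot b \bmod N$, which gives a simple end-to-end check of the constants.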
E. Optimizations

Swapping even for odd instructions. Modular arithmetic mostly relies on the SPE's arithmetic instructions, which are even pipeline instructions. Following the approach from [43], [11] one may replace an even instruction by one or more odd ones with the same effect. Although this may increase the latency for the functionality of each replaced even instruction and the number of instructions, balancing the counts of even and odd instructions often increases the throughput. This method was used throughout our implementation. Examples are sketched below.

Modular squaring. When squaring polynomials of degree at most $11$, half of the mixed products, i.e., $\frac{12^2 - 12}{2} = 66$ multiplications, can be saved by doubling their resulting 21 sums (as the top elements are zero). Of these sums, the eleven for coefficients of odd degree can be doubled for free during the conversion to radix $2^{32}$, by using for odd $j$ precomputed integers $\tilde{k}_j$, $\tilde{l}_j$, and $\tilde{m}_j$ such that $13j + 1 = \tilde{m}_j M + 32 \tilde{l}_j + \tilde{k}_j$ with $0 \le 32 \tilde{l}_j + \tilde{k}_j < M$ and $0 \le \tilde{k}_j < 32$, instead of $k_j$, $l_j$, and $m_j$, as defined earlier. The ten remaining sums need to be doubled before they are added to the corresponding squared input coefficient. Each doubling can be done by a single even pipeline addition. However, a doubling can also be performed by four odd pipeline instructions (or two doublings in six odd pipeline instructions). The ten remaining doublings could thus be squeezed in the