IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, VOL. XX, NO. XX, XXXX 2006 1
A Full-Adder-based Methodology for the Design of Scaling Operation in Residue Number System
M. Dasygenis, K. Mitroglou, D. Soudris and A. Thanailakis
Abstract: Over the last three decades there has been considerable interest in the implementation of digital computer elements using hardware based on the residue number system, due to the carry-free addition and other beneficial characteristics of this system. The scaling operation is one of the essential operations in this number system, and is required for almost every digital signal processing application. Up to now, researchers have suggested costly and low-throughput ROM-based approaches to address this need. We also address this need by presenting a novel graph-based methodology for designing high-throughput and low-cost VLSI Residue Number System scaling architectures, based completely on full adders. Our formalized methodology consists of a number of steps, which specify the minimum number of full adders for performing the scaling operation as well as the interconnections among the full adders. We present our formalized methodology together with a running example to aid in comprehension. Negative residue numbers are covered as well, requiring no additional effort. Finally, we have developed a design support tool that can provide structural VHDL descriptions of our residue number system scalers, which can be synthesized in VLSI tools.
Index Terms: residue number system, scaling, pipelining, digital integrated circuits
I. INTRODUCTION
SINCE the evolution of computers, different arithmetic number systems have been developed. These systems focus on simplifying the basic mathematical operations, in order to assist in utilizing the computer more efficiently. One of these systems is the Residue Number System (RNS) [1], which has been proposed as a means to efficiently perform computations in digital signal processing (DSP) applications [2], [3]. An RNS is composed of moduli that are independent of each other. A number in the RNS is represented by its residue with respect to each modulus. Since the moduli are independent of each other, there is no carry propagation among them, and it is easy to implement RNS computations (addition, subtraction and multiplication) on a multi-ALU system. The operation based on each modulus can be performed by a separate ALU, and all the ALUs can work concurrently. These characteristics allow RNS computations to be completed more quickly, an attractive feature for people who design Very High Speed Integrated Circuits (VHSIC) with real-time processing requirements.
The authors are with the VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, GR671 00, Xanthi, Greece, emails: {mdasyg,kmitr,dsoudris,thanail@ee.duth.gr}
This work was supported by a scholarship from the Public Benefit Foundation of Alexander S. Onassis. The authors would like to thank the anonymous peer reviewers for their insightful comments and suggestions.
Fig. 1. In order to avoid overflow errors (a), RNS scalers should be placed after RNS computation stages for every modulus (b).
Scaling by a known constant is an essential operation in several signal processing algorithms. Especially in recursive algorithms, such as the adaptive or infinite impulse response (IIR) filters, scaling is a throughput bottleneck. Data scaling is always necessary in order to ensure that every computed result represents the real value, and not its residue modulo $M$ (the total dynamic range of the considered RNS); this can be guaranteed by scaling data after every specific number of processing stages (Figure 1), so data is kept within a range that ensures a correct result for every possible operation. As a reminder, RNS scalers accept input in RNS form and produce results in RNS for the same moduli set; they are also placed in every modulus channel. Thus, for a three-moduli set three scalers should be inserted, one for every modulus channel. Finally, when an RNS scaler is to be inserted, the designer has to estimate the maximum computed output of the previous stage, in order to find a suitable scaling factor.

We overcome the drawbacks of ROM-based approaches for the RNS scaling operation by proposing a novel systematic methodology for designing efficient scaling modules in the RNS, based completely on Full Adders (FAs) as building blocks. Our proposed architecture outperforms conventional look-up table approaches when very high throughput rates are required, and provides a systematic design framework for deriving array architectures for performing the scaling operation. Moreover, our methodology is further supported and verified by a Computer Aided Design (CAD) tool. The tool is implemented in C and its output is synthesizable VHSIC Hardware Description Language (VHDL) code, which can be used as input to VLSI design systems. Our methodology significantly extends an existing methodology for implementing FA-based Inner Product Step Processors [4], since it adopts its basic mathematical principles.

This paper commences with an overview of the RNS scaling literature, while in Section III we present the proposed methodology. In the next section (Section IV) we describe the design of the FA-based architectures. Section V presents the estimated and experimental results showing the efficiency of this methodology compared to existing ones. Finally, a summary (Section VI) reviews the main points of this work and concludes the paper.

II. RELATED WORK
Several research attempts have been reported for the systematic design of scaling modules for the RNS. The great majority of the residue scaling techniques are based on the decomposition of the Chinese Remainder Theorem (CRT) representation of a residue number into the sum of its components, the successive division of each component by the scaling factor, and the summation in the final multi-operand modulo adders.

For an RNS with $n$ moduli, scaling by a product of any subset of the moduli can be performed in $n$ clock cycles as described by Szabo and Tanaka [1]. A pioneering research work was presented by Jullien [5], whose algorithm performs scaling by a product of $s$ out of the $n$ moduli. Although Jullien's work requires fewer look-up tables (LUTs) and fewer cycles than the traditional method [1], it gives a scaled integer with an absolute error of $(s+1)/2$. Other researchers [6] have proposed more efficient techniques, but they have imposed hard restrictions on the scaling factor, which has to be of a certain class (like $2^n + 1$). An interesting research work was presented by Taylor and Huang [7], who extended the previous research works and presented an autoscaler for scaling factors which are a power of two using look-up tables, achieving scaling in dynamic ranges of about 18 bits. Miller and Polky [8] proposed the use of mixed radix conversion (MRC) to obtain the full residue representation of the scaling result, or CRT [9], in a technique that confines the scaling error to unity. However, the conversion from RNS to binary in order to use the CRT implies the use of 32-bit word-length modular adders and large multipliers. Thus, a direct implementation of the CRT for scaling results in excessive hardware usage and slow performance. Griffin et al. [10] disclosed a technique to overcome these drawbacks using only LUTs and conventional binary adders, embedding the scaling in the first-level look-ups.

Moreover, researchers have used RNS scaling techniques that are not based on CRT. For example, some authors have considered scaling in the less popular symmetric RNS [11]. Other scaling schemes [12] follow the same research track, by using an iterative algorithm, which has the drawback that it leads to high computational time and memory requirements that are proportional to the number of moduli. In another research work, Aichholzer and Hassler [13] proposed the use of a parallel network for a binary-to-RNS converter which can be extended to perform scaling as well, covering negative numbers with no additional effort, but also using LUTs. Recently, Garcia and Lloris [14] presented a technique based partly on ROM-based arithmetic modules and partly on full adders, which exhibited good pipelining and could be extended to cover zero multiplication and error detection. Finally, Ulman et al. [15] presented a scaling approach with LUTs to store the full residue representations, computing fixed parts of the proposed scaling equation and storing them in binary form, which also incorporates full adders and multipliers, exploiting the CRT.

The common characteristic of all previous scaling techniques, to the best of our knowledge, is that all authors incorporate, in one or more levels of the scaling algorithm, look-up tables, which have a direct impact on the throughput and pipelineability of the system, as our estimations reveal. We present an alternative solution: an RNS scaling architecture based completely on FAs, which overcomes some limitations of the LUT-based RNS scaling approaches.

III. THE PROPOSED METHODOLOGY
The residue number system is a way of representing integer data with a set of relatively prime numbers $\{m_1, m_2, \ldots, m_N\}$, called the moduli set. Given that

$$M = \prod_{i=1}^{N} m_i \tag{1}$$

an integer $X < M$ is represented by the $N$-tuple $(x_1, x_2, \ldots, x_N)$, with $x_i = \langle X \rangle_{m_i}$; the integers $x_i$, $i = 1, 2, \ldots, N$, are called the residues of $X$, and are the smallest non-negative remainders of the division of the integer $X$ by the positive integers $m_i$, respectively. Generally, the notation $\langle F \rangle_m$ denotes the operation $F$ modulo $m$.

If $K$ is the scaling constant and $Y$ is the result of scaling $X$ by $K$, then it can be deduced ([14]) that:
$$X = Y K + \langle X \rangle_K \tag{2}$$

Thus, it is clear that
$$Y = \frac{X - \langle X \rangle_K}{K} = (X - \langle X \rangle_K) \cdot K^{-1} \tag{3}$$

which can be computed over an RNS defined by the moduli set $\{m_1, m_2, \ldots, m_N\}$.

The RNS scaling operation involves the operations of multiplication and subtraction (Eq. (3)). We can further simplify this equation by performing mathematical transformations, in order to reach a more usable form. In this section we will deduce an RNS scaling equation equivalent to Eq. (3), we will define the required operations for this, and we will describe the computational stages required to implement the new equation.

If we define the multiplicative inverse of $K$ as $\langle K^{-1} \rangle_{m_i}$, the residue $y_i$ of $Y$ for modulus $m_i$ (belonging to the moduli set $\{m_1, m_2, \ldots, m_i, \ldots, m_N\}$) is given by Eq. (4).
$$y_i = \langle Y \rangle_{m_i} = \left\langle \langle X - \langle X \rangle_K \rangle_{m_i} \cdot \langle K^{-1} \rangle_{m_i} \right\rangle_{m_i} \tag{4}$$

Therefore, scaling is reduced to a subtraction and a multiplication for each modulus, with the generation of $\langle X \rangle_K$.
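Equation (4) can be computed over each modulus independently, since Eq. (5) holds true in RNS arithmetic.

$$\langle X - \langle X \rangle_K \rangle_{m_i} = \left\langle x_i - \langle \langle X \rangle_K \rangle_{m_i} \right\rangle_{m_i} \tag{5}$$

From Eq. (4) and (5), it can be easily deduced that: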
$$y_i = \left\langle \left\langle x_i - \langle \langle X \rangle_K \rangle_{m_i} \right\rangle_{m_i} \cdot \langle K^{-1} \rangle_{m_i} \right\rangle_{m_i} \tag{6}$$

The main disadvantage of Eq. (6) is the need for $\langle K^{-1} \rangle_{m_i}$ to exist, a restriction which can be lifted by a proper selection of the moduli set. It is known [5] that scaling in the RNS is more easily implemented when the scaling factor is a product of moduli from the moduli set.

If a rounding-down estimation function is assumed, the scaling operation can be described by Eq. (7).
$$Y = \left\lfloor \frac{X}{K} \right\rfloor \tag{7}$$

where $\lfloor a \rfloor$ denotes the integer part of $a$. Defining $(K, m_i)$ as the highest common factor of $K$ and $m_i$, a sufficient condition for the existence of $\langle K^{-1} \rangle_{m_i}$ is that $(K, m_i) = 1$ and $K \neq 0$. This condition is met for $K$ being one of the moduli or a product of moduli of the RNS base. Thus, a valid solution for $\langle K^{-1} \rangle_{m_i}$ exists when

$$K = \prod_{i=1}^{S} m_i, \quad S < N \tag{8}$$
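The existence condition $(K, m_i) = 1$, $K \neq 0$ can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later (for the three-argument `pow` with exponent -1); the helper name `inverse_mod` is hypothetical and is not part of the authors' CAD tool.

```python
from math import gcd


def inverse_mod(k: int, m: int):
    """Return <K^{-1}>_m when (K, m) = 1 and K != 0, otherwise None.

    Hypothetical helper for illustration only.
    """
    if k == 0 or gcd(k, m) != 1:
        return None
    return pow(k, -1, m)  # built-in modular inverse (Python 3.8+)


# Scaling constant K = 255 against two moduli that are coprime to it.
print(inverse_mod(255, 256), inverse_mod(255, 257))
```

Returning `None` instead of raising keeps the check usable when scanning a whole moduli set for channels in which the inverse is defined.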
Example: Suppose our RNS system (Figure 1(a)) uses the coprime moduli set $\{255, 256, 257\}$. We estimate that after the first stage of computation we have a maximum value of 1100 on the $m_1 = 255$ channel, a maximum value of 900 on the $m_2 = 256$ channel, and a maximum value of 800 on the $m_3 = 257$ channel. According to Eq. (8) we select the scaling constant $K = m_1 = 255$. This scaler is sufficient, because it succeeds in avoiding the overflow error in the following stage. The scaled values of $\{1100, 900, 800\}$ are computed using Eq. (3) as $\frac{1100 - \langle 1100 \rangle_{255}}{255} = 4$, $\frac{900 - \langle 900 \rangle_{255}}{255} = 3$, and $\frac{800 - \langle 800 \rangle_{255}}{255} = 3$. Our results can be verified by Eq. (4). For example, for the third modulus $m_3 = 257$ we have:

$$y_3 = \langle Y \rangle_{m_3} = \left\langle \langle X - \langle X \rangle_K \rangle_{m_3} \cdot \langle K^{-1} \rangle_{m_3} \right\rangle_{m_3} = \left\langle \langle 800 - \langle 800 \rangle_{255} \rangle_{257} \cdot \langle 255^{-1} \rangle_{257} \right\rangle_{257} = \langle 765 \cdot 128 \rangle_{257} = 3$$
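The same check can be run on the channel residues rather than on $X$ itself, which is what Eq. (6) expresses. The sketch below is illustrative only: the function name `scale_residue` is an assumption, $\langle X \rangle_K$ is passed in as a plain integer, and Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
def scale_residue(x_i: int, x_k: int, k: int, m_i: int) -> int:
    """Eq. (6): residue y_i of the scaled value in channel m_i.

    x_i is <X>_{m_i}, x_k is <X>_K; requires gcd(k, m_i) == 1.
    Hypothetical helper, not the authors' implementation.
    """
    k_inv = pow(k, -1, m_i)            # <K^{-1}>_{m_i} (Python 3.8+)
    diff = (x_i - x_k % m_i) % m_i     # <x_i - <<X>_K>_{m_i}>_{m_i}
    return (diff * k_inv) % m_i


X, K = 800, 255
x_k = X % K                            # <800>_255
y_2 = scale_residue(X % 256, x_k, K, 256)
y_3 = scale_residue(X % 257, x_k, K, 257)
print(y_2, y_3)
```

Both channels are computed independently of each other, which mirrors the placement of one scaler per modulus channel.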
Formulae (5) and (6) can be used for the negative case as well, even though the discussion up to this point assumed that $X$ was a non-negative integer. This is based on the existence of the additive inverse in RNS arithmetic. The additive inverse of a number $X$ is a number that, when added to $X$, yields a result of 0. Since the modulus is congruent to 0, the additive inverse can be defined as:

$$X + (-X) = m, \qquad (-X) = m - X \tag{9}$$

This property is used to define RNS negative numbers, with the top half of the range representing the negatives of the numbers in the bottom half of the range.

In the case of scaling in a moduli set with prime numbers $\{m_1, m_2, \ldots, m_N\}$, the range $[0, M-1]$ used for the unsigned numbers becomes $\left[\frac{-M+1}{2}, \ldots, \frac{M-1}{2}\right]$ when RNS signed numbers are used. Thus, binary operations can be carried out between signed numbers without explicit knowledge of the sign of either number. The only concern that a designer has to keep in mind is that of overflow prevention. Negation of a number is easy since multiplication by $-1$ is a defined operation.

After the mathematical transformations we can derive our RNS scaling methodology. The methodology consists of some fundamental stages, all derived from Eq. (6). This equation was decomposed into fundamental operations, which are implemented in stages. Specifically, we have derived three types of stages from Eq. (6):
i) Gradual Reduction of Word Length Stage, which implements the operations $\langle \cdot \rangle_K$ and $\langle \cdot \rangle_{m_i}$,
ii) Subtraction Stage, which implements the operation of subtraction $x_i - \langle X \rangle_K$, and
iii) Multiplication Stage, which implements the operation of multiplication $\left\langle x_i - \langle \langle X \rangle_K \rangle_{m_i} \right\rangle_{m_i} \cdot \langle K^{-1} \rangle_{m_i}$.
Moreover, the first stage performs bit reduction and actually consists of two steps:
a) Bit reduction step
b) Final Mapping step
The Bit reduction step is used to gradually reduce the number of bits of the variable $X$ and actually consists of $r$ substeps, which are also called recursions. The Final Mapping step maps the output of the last recursion of the bit reduction step to its modulus value. The second stage performs the operation of subtraction of Eq. (6), while the third stage is used to calculate the multiplication of a variable with the constant $\langle K^{-1} \rangle_{m_i}$.

The order of the types of stages is fixed for every modulus and number of input bits (Figure 2). As a reminder, the scaler's input is in RNS with the moduli set $\{m_1, \ldots, m_N\}$, and consequently its RNS output has the same moduli set.

In detail, we can use the three aforementioned types of fundamental stages to derive the five stages that are required in order to design the RNS scaler. These five stages are placed in the following sequence:
i) (First) Gradual Reduction of Word Length Stage, the output of which is $\langle X \rangle_K$.
ii) Subtraction Stage, the output of which is $x_i - \langle X \rangle_K$.
iii) (Second) Gradual Reduction of Word Length Stage, the output of which is $\langle x_i - \langle X \rangle_K \rangle_{m_i}$.
iv) Multiplication Stage, the output of which is $\langle x_i - \langle X \rangle_K \rangle_{m_i} \cdot \langle K^{-1} \rangle_{m_i}$.
v) (Third) Gradual Reduction of Word Length Stage, the output of which is $\left\langle \left\langle x_i - \langle \langle X \rangle_K \rangle_{m_i} \right\rangle_{m_i} \cdot \langle K^{-1} \rangle_{m_i} \right\rangle_{m_i}$.
Assuming that the RNS number $X$ for the RNS digit is $\langle X \rangle_{m_i}$ (footnote 1) and consists of $l$ bits, Eq. (10) holds true.
$$\langle X \rangle_{m_i} = \sum_{j=0}^{l-1} x_j 2^j, \quad x_j \in \{0, 1\} \tag{10}$$

In the next paragraphs we describe the three fundamental stages, and we give a running example of how they are used. In the running example, we use the moduli set $\{5, 17\}$, which, even though it is not practical (real RNS systems have more and bigger moduli), helps the reader follow the mathematics and better visualize the derived architecture. At the end of this section, as a second example, we briefly describe how to design RNS scalers for the realistic moduli set $\{255, 256, 257\}$.
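The five-stage sequence can be traced numerically on the running example before looking at the hardware. The sketch below is purely behavioural: each stage is modelled with integer arithmetic instead of full adders, the input $X$ is passed as a plain integer for convenience, and the name `scaler_stages` and its dictionary output are illustrative assumptions. Python 3.8 or later is assumed for the modular inverse.

```python
def scaler_stages(X: int, K: int, m_i: int) -> dict:
    """Behavioural trace of the five scaler stages for one modulus channel.

    Hypothetical model; requires gcd(K, m_i) == 1.
    """
    x_i = X % m_i                # channel input residue <X>_{m_i}
    k_inv = pow(K, -1, m_i)      # constant <K^{-1}>_{m_i} (Python 3.8+)
    s1 = X % K                   # i)   first reduction:  <X>_K
    s2 = x_i - s1                # ii)  subtraction:      x_i - <X>_K (may be negative)
    s3 = s2 % m_i                # iii) second reduction: <x_i - <X>_K>_{m_i}
    s4 = s3 * k_inv              # iv)  multiplication by <K^{-1}>_{m_i}
    s5 = s4 % m_i                # v)   third reduction:  y_i
    return {"i": s1, "ii": s2, "iii": s3, "iv": s4, "v": s5}


# Running example: moduli set {5, 17}, K = 5, channel m_2 = 17, largest 5-bit input.
trace = scaler_stages(X=31, K=5, m_i=17)
print(trace)
```

For $X = 31$ the final stage yields 6, which agrees with $\lfloor 31 / 5 \rfloor$ from Eq. (7).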
A. Gradual Reduction of Word Length Stage
This stage is used in i) the (First) Gradual Reduction of Word Length Stage, iii) the (Second) Gradual Reduction of Word Length Stage, and v) the (Third) Gradual Reduction of Word Length Stage. The function of this stage is based on the modulo arithmetic property:
$$\left\langle \sum_i b_i 2^i \right\rangle_m = \left\langle \sum_i b_i \langle 2^i \rangle_m \right\rangle_m \tag{11}$$

where $b_i \in \{0, 1\}$. The bit reduction step recursively replaces any bit that corresponds to a power of 2 larger than $2^n - 1$ with a set of bits that correspond to powers of 2 whose sum is less than or equal to $2^n - 1$. This operation may be repeated as needed in order to eventually reduce the number of bits of the input number to this stage from $l$ to $n_m$. When the system has a single modulus we write $m$ without an index; when it consists of more than one modulus we write $m_i$. All formulas written for $m$ hold unchanged when $m_i$ is used. Here $n_m$ is the number of bits of the modulus $m$ (Eq. (12)).
$$n_m = \lceil \log_2 m \rceil \tag{12}$$

where $\lceil \cdot \rceil$ denotes the ceiling function and $l$ is the number of input bits to the scaling architecture. In the $k$th ($0 < k \le r$) recursion, we obtain the output of the $k$th recursion from the $(k-1)$th recursion:
$$\left\langle \sum_{i=0}^{n_k - 1} b_{k,i} 2^i \right\rangle_m = \left\langle \sum_{i=0}^{n_{k-1} - 1} b_{k-1,i} \langle 2^i \rangle_m \right\rangle_m \tag{13}$$

Footnote 1: We apply the presented methodology for every RNS digit of the moduli set of the RNS number $X$. Here, we will illustrate the steps for only one modulus.
Fig. 2. Our FA-based RNS scaler consists of a number of stages, computing step by step the operations of Eq. (6).

The actual number of required recursions, $r$, and the word length $n_k$ ($0 < n_m \le n_k \le n_{k-1} \le l$) are specified by the algorithm developed in [4]. Furthermore, that work also presents a detailed description of how the replacement occurs in the bit reduction step. This stage is implemented using only FAs.

In the second step, the output $Y_r$ of the bit reduction step has to be mapped to its residue $Y_m$. In case we want to implement the operation $\langle \cdot \rangle_{m_i}$ for modulus $m = m_i$, then we use Eq. (14).
$$Y_m = \begin{cases} Y_r, & Y_r < m \\ Y_r - m, & Y_r \ge m \end{cases} \tag{14}$$

In case we want to implement the operation $\langle \cdot \rangle_K$, then we use Eq. (15).
m
=
Y
r
, Y
r
< K Y
r
−
K, Y
r
≥
K
(15)This step can be implemented by an
n
-bit adder.
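One way an $n$-bit adder can realize Eq. (14) or Eq. (15) is to add the two's complement of the modulus and let the carry-out select between the two cases, as the paper describes for the Final Mapping substage of its running example. The sketch below models that behaviour with integers; the function name `final_mapping` is hypothetical, and it assumes $0 \le Y_r < 2^n$ and $Y_r$ smaller than twice the modulus, so that a single subtraction is enough.

```python
def final_mapping(y_r: int, modulus: int, n_bits: int) -> int:
    """Eq. (14)/(15) via an n-bit addition of the two's complement.

    Assumes 0 <= y_r < 2**n_bits and y_r < 2 * modulus (illustrative model).
    """
    mask = (1 << n_bits) - 1
    neg_mod = (~modulus + 1) & mask     # two's complement of the modulus on n bits
    total = y_r + neg_mod               # n-bit addition; the carry-out lands in bit n
    carry = total >> n_bits             # 1 exactly when y_r >= modulus
    return (total & mask) if carry else y_r


# All 3-bit inputs mapped for modulus 5.
print([final_mapping(y, 5, 3) for y in range(8)])
```

Selecting on the carry avoids a separate comparator: the adder that performs the subtraction also produces the decision bit.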
Example 1: Suppose that we would like to implement the first gradual reduction of word length stage for an RNS system with moduli set $\{m_1, m_2\} = \{5, 17\}$, input bits $l = 5$ and scaling factor $K = 5$, and specifically for the modulus $m = m_1 = 5$ (the same procedure applies to modulus $m_2 = 17$). $Y_m$ will have word length $n_m = 3$ (Eq. (12)). This stage consists of two substages: (a) the bit reduction step and (b) the final mapping step.

First, we have to compute the replacement bits of a 5-bit input word for modulus $K = 5$ (we want to implement the operation $\langle \cdot \rangle_{K=5}$), according to Eq. (11). We compute these as follows:
TABLE I
IN THE GRADUAL REDUCTION OF WORD LENGTH WE REPLACE THE MOST SIGNIFICANT BITS WITH LOWER SIGNIFICANCE BITS ACCORDING TO THE RESIDUE ARITHMETIC.

         j = 4   j = 3   j = 2   j = 1   j = 0
x_0        0       0       0       0       1
x_1        0       0       0       1       0
x_2        0       0       1       0       0
x_3        0       0       0       1       1
x_4        0       0       0       0       1
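Table I can be reproduced programmatically from the residues $\langle 2^i \rangle_5$. The following is a small sketch; the function name `replacement_table` is an illustrative assumption, and each row lists the output columns from $j = l - 1$ down to $j = 0$.

```python
def replacement_table(modulus: int, l_bits: int) -> list:
    """Row i holds the binary digits (j = l_bits-1 ... 0) of <2^i>_modulus.

    Hypothetical helper for illustration only.
    """
    table = []
    for i in range(l_bits):
        residue = pow(2, i, modulus)                          # <2^i>_m
        row = [(residue >> j) & 1 for j in reversed(range(l_bits))]
        table.append(row)
    return table


for i, row in enumerate(replacement_table(5, 5)):
    print(f"x_{i}", *row)
```

A '1' in row $x_i$ and column $j$ means that input bit $x_i$ has to be added into output column $j$.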
$$\left\langle \sum_{i=0}^{4} x_i 2^i \right\rangle_5 = \left\langle \sum_{i=0}^{4} x_i \langle 2^i \rangle_5 \right\rangle_5 = \left\langle (x_0 \langle 2^0 \rangle_5) + (x_1 \langle 2^1 \rangle_5) + (x_2 \langle 2^2 \rangle_5) + (x_3 \langle 2^3 \rangle_5) + (x_4 \langle 2^4 \rangle_5) \right\rangle_5 \tag{16}$$

$$= \left\langle (x_0 2^0) + (x_1 2^1) + (x_2 2^2) + (x_3 (3)) + (x_4 (1)) \right\rangle_5 \tag{17}$$

$$= \left\langle (x_0 2^0) + (x_1 2^1) + (x_2 2^2) + (x_3 (2^1 + 2^0)) + (x_4 (2^0)) \right\rangle_5 \tag{18}$$

Eq. (16)-(18) hold true because $\langle 2^3 \rangle_5$ is equal to $\langle 8 \rangle_5 = 3 = 2^1 + 2^0$. In this case the outcome of the architecture (for modulus 5) will be the same if bit $x_3 2^3$ is replaced with bits $x_3 2^1 + x_3 2^0$. Similarly, the bit $x_4 2^4$ can be replaced with the bit $x_4 2^0$.

We can visualize the replacement of the bits in a bit reduction table (Table I). Considering Eq. (16)-(17), we have placed (Table I) the output column $j$ and the input bits $x_i$. We use '1' to show the replacement positions of the input bits. For example, bit $x_0 2^0$ is one of the bits that have to be added in order to compute the output $j = 0$ column. Also, bit $x_4 2^4$ can be replaced with the bit $x_4 2^0$. This means that the input bit $x_4$ has to be added to other bits to compute the output $j = 0$ column. For the same reason, input bit $x_3 2^3$ has to be added in the $j = 1$ and $j = 0$ columns.

After computing the replacement bits, the next step is to compute the number of FAs. We do this considering that every FA has 2 input bits and a carry. In our example, in column $j = 0$ we have to add bits (Table I) $x_0$, $x_3$ and $x_4$. One FA is sufficient for this column. Similarly, we compute the FAs for the other columns. This completes the design of the first substep of the bit reduction step. The bit reduction step consists of a number of recursions. The actual number of recursions can be computed using the equations of [4]. For the sake of completeness, however, we will show how to practically compute the number of recursions.

The maximum input number $X$ for $l = 5$ input bits is 31. If we use the bit replacement table (Table I), we observe that the maximum output number is $2^0 + 2^1 + 2^2 + 2^1 + 2^0 + 2^0 = 11$ and it needs $n_1 = 4$ bits. Our modulus is 5, which has 3 bits. Thus, another bit reduction substep is required to reduce the number of bits from 4 to 3. As in the previous substep, we construct a bit replacement table for the second substep of bit reduction and check the output again. In our example, we see that after the second substep of bit reduction, the maximum output needs $n_2 = 3$ bits, thus we can proceed with the design of the next stage, which is the final mapping stage.

Fig. 3. (a) Architecture for implementing the 1st gradual reduction of word length stage, and (b) architecture for implementing the subtraction stage for $K = 5$, $m_1 = 5$, $m_2 = 17$, $l = 5$.

The Final Mapping substage for our architecture is implemented according to Eq. (15). This substage determines whether the output of the previous substage will be passed unchanged or will have to be reduced by $K$. Thus, every $j$ column of the output is connected to two processing paths: one processing path that just transfers the bits of $Y_r$ to the output (in Eq. (15) the $Y_r < K$ case), and one processing path that computes $Y_r - K$ (in Eq. (15) the $Y_r \ge K$ case). The subtraction is achieved by taking the 2's complement of $K$ and adding it to $Y_r$. The 2's complement of $K$ is created by inverting all the bits of $K$ and adding a unity bit. In Figure 3(a) we have omitted this circuit, in order to simplify it. After having computed the subtraction, the decision of whether to use the subtracted number or $Y_r$ without any operation follows. This decision is made by the carry bit of the last $j$ column's FA. The existence or not of a carry decides, using a circuit of gates and pass-through switches, whether we will have as output $Y_r$ or $Y_r - K$.
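The word lengths $n_1 = 4$ and $n_2 = 3$ obtained in the bit reduction substeps of Example 1 can be checked numerically. The sketch below is a brute-force illustration that enumerates every possible input; it is not the closed-form algorithm of [4], and the names `reduce_once` and `recursion_widths` are hypothetical.

```python
def reduce_once(value: int, modulus: int) -> int:
    """One bit-reduction substep: keep the low n_m bits and replace every
    higher set bit 2^j by its residue <2^j>_modulus (Eq. (11))."""
    n_m = (modulus - 1).bit_length()      # equals ceil(log2(modulus)) for modulus >= 2
    low = value & ((1 << n_m) - 1)
    j, high, extra = n_m, value >> n_m, 0
    while high:
        if high & 1:
            extra += pow(2, j, modulus)   # replacement value for bit j
        high >>= 1
        j += 1
    return low + extra


def recursion_widths(l_bits: int, modulus: int) -> list:
    """Word length after each substep until the result fits in n_m bits."""
    n_m = (modulus - 1).bit_length()
    values = set(range(1 << l_bits))      # every possible l-bit input
    widths = []
    while max(values).bit_length() > n_m:
        values = {reduce_once(v, modulus) for v in values}
        widths.append(max(values).bit_length())
    return widths


print(recursion_widths(5, 5))
```

The second substep is evaluated only on values that the first substep can actually produce, which is why its maximum output fits in 3 bits even though an arbitrary 4-bit word would not.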
B. Subtraction Stage
This stage performs the subtraction

$$x_i - \langle X \rangle_K, \tag{19}$$
