IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 3, MARCH 1991

A General-Purpose High-Speed Equalizer
Serge Maginot, Freddy Balestro, Christophe Joanblanq, Patrice Senn, and Jacques Palicot
Abstract

The circuit presented in this paper is a high-speed self-adaptive filter achieving equalization over a wide range of signals, with a frequency of up to 40.5 MHz, as for the European D2-MAC and High-Definition Multiplexed Analog Components (HD-MAC) transmission standards. It is composed of a 16-tap transversal filter and a separate operative part computing the gradient algorithm and periodically updating the filter coefficients. This 105 000-transistor chip has been designed in a 1.0-µm CMOS technology and is at this time being used in a D2-MAC reception environment.
I. SYSTEM OVERVIEW

In the framework of the European High-Definition Television System EUREKA EU95 Project, a new European transmission standard has been developed: the High-Definition Multiplexed Analog Components (HD-MAC) standard. It is designed to be compatible with the present European transmission standards (D-MAC and D2-MAC), already including duobinary data and MAC components [1].
Indeed, both D2-MAC and HD-MAC signals have approximately the same structure, consisting of a time-division multiplex between the analog components (color difference and luminance) and the digital data (including sound) (Fig. 1). These data are coded in a duobinary form, i.e., each bit to be transmitted is added to its predecessor, and the resulting number, which is the one actually transmitted, thus has three possible values. Transmission data rates are equal to 10.125 Mb/s for D2-MAC and HD-MAC duobinary data, and 20.25 Mb/s for HD-MAC binary data, with a bandwidth of 5 and 10 MHz, respectively. With an oversampling factor of 2, the sampling frequency ranges from 20.25 MHz for D2-MAC to 40.5 MHz for HD-MAC.

Because of their digital part, decoding of these signals could be strongly affected by the distortions induced by the transmission channel. Fig. 2 shows the eye diagram of duobinary data from a D2-MAC signal without perturbation and with an echo (100-ns delay, 10-dB attenuation). As these perturbations depend on the consumer's receiver chain (for example, D2-MAC via cable in different buildings or city districts), they can only be corrected by self-adaptive filtering, i.e., channel equalization.

The circuit being presented is a self-adaptive 16-tap transversal filter achieving equalization on any 8-b coded signal with a frequency of up to 40.5 MHz and periodically containing a window of binary or duobinary data samples, such as D-, D2-, and HD-MAC signals. This chip includes a delay line of 240 8-b data samples which are used for the internal gradient computations. Only linear distortions (echoes) can be corrected by this chip.

Fig. 1. Structure of a D2-MAC line (signal level versus time: duobinary sound/data, color difference, luminance).

Manuscript received July 30, 1990; revised October 26, 1990.
S. Maginot, F. Balestro, C. Joanblanq, and P. Senn are with France Télécom, Centre National d'Études des Télécommunications (CNET), 38243 Meylan Cedex, France.
J. Palicot is with Centre Commun d'Études des Télécommunications et de Télédiffusion (CCETT), 35512 Cesson-Sévigné Cedex, France.
IEEE Log Number 9041399.
0018-9200/91/0300-0209$01.00 © 1991 IEEE

Fig. 2. Eye diagram of D2-MAC data (a) with and (b) without distortion (echo with 100-ns delay and 10-dB attenuation).

II. ALGORITHM

The algorithm used to compute the filter coefficients is the well-known gradient algorithm (least mean squares, or LMS, algorithm) [2]-[4]. It is computed in deferred time (i.e., not sample by sample in real time) on a window of 240 consecutive damaged data samples $(x_{239}, x_{238}, \ldots, x_1, x_0)$ included in the digital part of the signal. The same 240 samples are used during the entire computation of one set of coefficients (they are stored in a looped shift register). At each step of the algorithm, 16 of them are taken to compute the error.

If sampling is achieved, without any degradation, using a synchronized clock, each of these data should take one of two possible values $V_{\mathrm{low}}$ and $V_{\mathrm{high}}$ (if binary data) or one of three possible values $V_{\mathrm{low}}$, $V_{\mathrm{med}}$, and $V_{\mathrm{high}}$ (if duobinary data). This restricted number of data values is the only characteristic the signal is supposed to observe in order to be equalized by the following computations.

The process executed at step number $n$ of the algorithm is as follows. Let $(x_n, x_{n-1}, \ldots, x_{n-15})$ be the set of perturbed data, and $(h_0^{(n)}, h_1^{(n)}, \ldots, h_{15}^{(n)})$ be the set of filter coefficients. Filter output is

$$y_n = \sum_{i=0}^{15} h_i^{(n)} x_{n-i}. \tag{1}$$

Let $\hat{y}_n$ be the estimated true data corresponding to $y_n$. Then the error is given by

$$e_n = y_n - \hat{y}_n$$

and the new set of filter coefficients for step $n+1$ is

$$\forall i \in [0 \ldots 15], \quad h_i^{(n+1)} = h_i^{(n)} - \mu e_n x_{n-i} \tag{2}$$

where $\mu = 2^{-7} \approx 0.01$ is the algorithm step.

The value of $\hat{y}_n$ is computed according to the following. In the case of binary data, if

$$S = \frac{V_{\mathrm{low}} + V_{\mathrm{high}}}{2}$$

then

$$y_n \le S \Rightarrow \hat{y}_n = V_{\mathrm{low}}, \qquad y_n > S \Rightarrow \hat{y}_n = V_{\mathrm{high}}.$$

In the case of duobinary data, if

$$S_{D1} = \frac{V_{\mathrm{low}} + V_{\mathrm{med}}}{2} \quad \text{and} \quad S_{D2} = \frac{V_{\mathrm{med}} + V_{\mathrm{high}}}{2}$$

then

$$y_n \le S_{D1} \Rightarrow \hat{y}_n = V_{\mathrm{low}}, \qquad S_{D1} < y_n \le S_{D2} \Rightarrow \hat{y}_n = V_{\mathrm{med}}, \qquad y_n > S_{D2} \Rightarrow \hat{y}_n = V_{\mathrm{high}}.$$
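As an illustration only, one step of the computation in (1) and (2), together with the threshold decision, can be sketched in floating-point Python. The names `decide` and `lms_step` and the list layout (`x_window[i]` holding $x_{n-i}$) are assumptions made for this sketch; the fixed-point word lengths of the chip are not modeled.

```python
MU = 2 ** -7  # algorithm step given in the text (about 0.01)


def decide(y, levels):
    """Estimate the true datum for output y by thresholding at midpoints.

    levels holds the nominal values in increasing order: two for binary
    data, three for duobinary data. A value lying exactly on a threshold
    maps to the lower level, matching the "<=" in the decision rules.
    """
    for low, high in zip(levels, levels[1:]):
        if y <= (low + high) / 2:
            return low
    return levels[-1]


def lms_step(h, x_window, levels, mu=MU):
    """Run one gradient step; x_window[i] is assumed to hold x_{n-i}."""
    y = sum(h_i * x_i for h_i, x_i in zip(h, x_window))  # filter output, (1)
    e = y - decide(y, levels)                            # error e_n
    h_next = [h_i - mu * e * x_i for h_i, x_i in zip(h, x_window)]  # update, (2)
    return y, e, h_next
```

Calling `lms_step` repeatedly over 16-sample slices of the stored window, and feeding `h_next` back in as `h`, would reproduce the iteration described in the text under these assumptions.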
Of course, in the case of oversampled digital data (such as the duobinary part of the D2-MAC signal), the error $e_n$ must not be computed on interpolated data, which are not information data. To handle this, an additional input signal is required, equal to ONE for information data and ZERO for interpolated data: the filter output is computed for both, whereas the error is only evaluated for information data (for an interpolated datum, the error computed for the previous information datum is reused).

Equalization on oversampled data shows another important restriction of this algorithm: such a gradient algorithm could be computed indefinitely, except for fractional tap-spacing equalization (i.e., equalization on oversampled data), where random coefficient overflows could be reached [5], [6]. In such a case, the equalizer will have many sets of tap values resulting in nearly equal values of mean squared error, some of these tap settings being large enough to cause overflows. In order to avoid this phenomenon, it is possible to introduce a bias ($\beta$) to balance the tap wandering.
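Thus, with a bias $\beta$, (2) could be replaced by the following:

$$\forall i \in [0 \ldots 15], \quad h_i^{(n+1)} = (1 - \beta)\, h_i^{(n)} - \mu e_n x_{n-i} \qquad (0 < \beta < \mu). \tag{3}$$

Another solution (easier to implement) is given by

$$\forall i \in [0 \ldots 15], \quad h_i^{(n+1)} = h_i^{(n)} - \beta \operatorname{sgn} h_i^{(n)} - \mu e_n x_{n-i} \qquad (0 < \beta < \mu) \tag{4}$$

where $\operatorname{sgn} H$ is the sign of $H$: if $H$ is positive (resp. negative), $\operatorname{sgn} H$ is equal to $+1$ (resp. $-1$). According to [6], if $\mu = 2^{-7}$, $\beta$ should be nearly equal to $2^{-14}$, thus leading to a large increase in internal data width.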
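The two biased updates (3) and (4) can be compared side by side in a short floating-point sketch. The function names are hypothetical, `x_window[i]` is assumed to hold $x_{n-i}$, and returning 0 for the sign of a zero coefficient is an assumption, since the text only defines the sign for positive and negative values.

```python
MU = 2 ** -7     # algorithm step from the text
BETA = 2 ** -14  # bias value the text quotes for this step


def sgn(h):
    """Sign of h: +1 if positive, -1 if negative, 0 for zero (assumed)."""
    return (h > 0) - (h < 0)


def leaky_update(h, e, x_window, mu=MU, beta=BETA):
    """Update (3): scale every tap by (1 - beta) before the gradient term."""
    return [(1 - beta) * h_i - mu * e * x_i for h_i, x_i in zip(h, x_window)]


def sign_leak_update(h, e, x_window, mu=MU, beta=BETA):
    """Update (4): move every tap toward zero by the constant beta."""
    return [h_i - beta * sgn(h_i) - mu * e * x_i for h_i, x_i in zip(h, x_window)]
```

With $\mu = 2^{-7}$ and $\beta = 2^{-14}$, the bias term sits seven binary places below the step, which is one way to read the remark about the increase in internal data width.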
To avoid this increase in data width, we will assume that, in the case of fractionally spaced equalization (i.e., oversampling of input data), it will be possible to find a criterion of transmission quality, enabling the algorithm computations to be stopped. By doing this, we only need to implement (2).

This will be possible for our main application, the D2-MAC receiver. In this system, data are oversampled by a factor of 2 but are also duobinary coded. Consequently, the data sample $\mathrm{duobit}_n$ ($= 0$, 1, or 2) received at time $n$ is given by

$$\mathrm{duobit}_n = \mathrm{bit}_{n-1} + \mathrm{bit}_n \tag{5}$$

where $\mathrm{bit}_n$ ($= 0$ or 1) is the decoded information bit transmitted at time $n$. On reception, it is thus possible to decode duobinary data and to find errors: if the decoded bit at time $n-1$ ($\mathrm{bit}_{n-1}$) is equal to 0 and the received duobinary datum at time $n$ ($\mathrm{duobit}_n$) is equal to 2, it is impossible to decode a $\mathrm{bit}_n$ satisfying (5): there is a violation. Such a violation appears each time the system receives an even number of 1's between a 0 and a 2 or a 2 and a 0, or an odd number of 1's between two 0's or two 2's. The number of such errors provides us with a quality criterion which can be used to stop or restart the equalization process.
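The duobinary relation (5) and the violation count used as a quality criterion can be sketched as follows. The function names, the initial value of the previous bit, and the recovery rule applied after a violation are assumptions made for this sketch; the paper does not describe how the decoder resynchronizes.

```python
def duobinary_encode(bits, previous_bit=0):
    """Apply (5): each transmitted value is the bit plus its predecessor."""
    symbols = []
    for bit in bits:
        symbols.append(previous_bit + bit)
        previous_bit = bit
    return symbols


def duobinary_decode(symbols, previous_bit=0):
    """Decode symbols 0, 1, 2 and count violations of (5).

    On a violation the bit is forced to the value implied by the symbol
    (0 for symbol 0, 1 for symbol 2); this recovery rule is an assumption.
    """
    bits, violations = [], 0
    for symbol in symbols:
        bit = symbol - previous_bit
        if bit not in (0, 1):
            violations += 1
            bit = symbol // 2
        bits.append(bit)
        previous_bit = bit
    return bits, violations
```

For instance, the sequence 0, 1, 1, 2 contains an even number of 1's between a 0 and a 2 and is counted as one violation, whereas 0, 1, 2 decodes cleanly.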
Fig. 3. Equalizer architecture.
III. CIRCUIT SPECIFICATIONS
As the method for implementing such signal processing is very sensitive in terms of computing speed and hardware cost, a specific integrated circuit has been designed for this purpose. This circuit, the EQUALIZER, processes a wide range of signals, as mentioned above. The signals to be treated have to be 8-b coded and contain a periodic subset of binary or duobinary data. The frequency is also limited to 40.5 MHz.

The gradient algorithm is performed according to the following scheme: 240 damaged binary or duobinary data samples are stored in a looped delay line and used in an algorithmic part of the circuit to compute a new set of coefficients, which is then transferred to the filter. Typically, this process makes the algorithm converge after 1.5 ms at 40.5 MHz (depending on circuit options). Computations can be continuously performed with a periodic coefficient update (a new window of 240 data is used for each coefficient update). For fractional tap-spacing equalization, computations need to be stopped before coefficients reach the limit values authorized by the implementation.

Many circuit parameters are programmable by the user. Among these:
a) filter coefficient initialization: transparent filter (one (central) coefficient at 1, others at 0) or half-Nyquist filter (low-pass filter) with a 20% roll-off;
b) position of the central coefficient (in the case of a transparent filter): in this case, it is sometimes useful to shift the central coefficient in order to be able to correct echoes which have a longer delay;
c) type of data used for the equalization processing: binary data (0 or 1), for example, reference data inserted in line 312 of the vertical blanking interval in the HD-MAC signal; or duobinary data (0, 1, or 2), for example, sound and synchronization data inserted in each line of the D2-MAC, D-MAC, and HD-MAC signals;
d) values of minimum and maximum data levels: for example, digital values corresponding to -0.4 and +0.4 V for the D2-MAC signal (duobinary data levels), or 0 and 0.4 V for the HD-MAC signal (binary data levels);
e) number of computing steps for the gradient algorithm between two consecutive filter coefficient updates.

The circuit also has two operating modes: 1) the equalizer mode (the main goal), and 2) the stand-alone programmable 16-tap filter mode (the equalization algorithm is inhibited and the filter coefficients are loaded by the user).
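To make the programmable options a) to e) concrete, they can be gathered in a small configuration object. Everything in this sketch is hypothetical: the class name, field names, and default values are not taken from the chip's register map, the half-Nyquist initialization is not modeled, and placing the middle duobinary level halfway between the extremes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EqualizerSettings:
    """Hypothetical container for the user-programmable options a) to e)."""

    half_nyquist_init: bool = False  # a) False selects the transparent filter
    center_tap: int = 8              # b) index of the tap set to 1 (assumed default)
    duobinary: bool = True           # c) True: data 0, 1, 2; False: data 0, 1
    level_min: int = 0               # d) 8-b code of the minimum data level (assumed)
    level_max: int = 255             # d) 8-b code of the maximum data level (assumed)
    steps_per_update: int = 240      # e) gradient steps between updates (assumed)

    def transparent_coefficients(self, taps=16):
        """Transparent-filter initialization: one tap at 1, the others at 0."""
        return [1.0 if i == self.center_tap else 0.0 for i in range(taps)]

    def nominal_levels(self):
        """Two or three nominal levels spread evenly between the extremes."""
        if self.duobinary:
            middle = (self.level_min + self.level_max) / 2
            return [self.level_min, middle, self.level_max]
        return [self.level_min, self.level_max]
```

Shifting `center_tap` toward index 0 leaves more taps after the main one, which corresponds to the remark in b) about correcting echoes with a longer delay.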
IV. ARCHITECTURE

The circuit architecture, as shown in Fig. 3, is composed of five main parts: a 16-tap transversal FIR filter, a dynamic delay line of 240 eight-bit data samples, an operative part performing the gradient algorithm, a set of four programmable registers, and, finally, a control part supervising the whole circuit.
A. Filter

The transversal filter performs a 16-tap filtering with 8-b data and 9-b two's complement coefficients. It has a classical transposed architecture, with parallel broadcasting of data to all multipliers and serial accumulation of partial products (Fig. 4). The multiplier output is rounded to 15 b; the accumulation word length is 18 b, and the output is saturated and rounded to 8 b. The basic macrocells, such as an 8 × 9 Booth carry-save array multiplier and an 18-b ripple-carry adder, have been reused from a previous design [7].

Synchronous updating of the coefficients is performed by serial downloading of the coefficient register bus to the multipliers. The filter can operate either in the stand-alone mode, with external coefficient loading on a dedicated ten-pin I/O bus, or in the equalization mode, with internal updating of the coefficients. In the latter case, the above-mentioned bus outputs the new coefficient values after each algorithm iteration for test purposes.
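The transposed structure (one input sample broadcast to all multipliers, partial sums passed along a chain of registers) can be illustrated with a behavioral sketch. It uses unbounded Python numbers; the 15-b, 18-b, and 8-b rounding and saturation stages of the chip are deliberately left out, and the function name is an assumption.

```python
def transposed_fir(samples, h):
    """Filter samples with coefficients h using the transposed structure."""
    partial = [0.0] * (len(h) - 1)  # registers holding partial sums
    outputs = []
    for x in samples:
        products = [h_k * x for h_k in h]  # same x broadcast to all multipliers
        outputs.append(products[0] + (partial[0] if partial else 0.0))
        # Each register takes its neighbor's previous value plus one product.
        for k in range(len(partial) - 1):
            partial[k] = partial[k + 1] + products[k + 1]
        if partial:
            partial[-1] = products[-1]
    return outputs
```

The output sequence equals that of the direct form $y_n = \sum_i h_i x_{n-i}$; only the order in which products are accumulated differs.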
Fig. 4. Filter architecture.

B. Delay Line

The delay line stores 240 consecutive data samples (padded with zeros if the window signal does not include 240 data), which are used in the gradient algorithm computation. At each step, that is, every 41 clock cycles, a new datum is sent from the delay line to the operative part.

Several options were available to implement this block: a RAM with FIFO addressing, a shift register, or the chosen structure, which relies on three-transistor-cell bit planes with pointer-controlled addressing. Our approach has the following advantages over the first two:
a) transistor count is reduced to a minimum with a standard CMOS technology (excluding one-transistor-cell dynamic RAM technology, which cannot be integrated with standard logic blocks). The overall number of transistors per memorized bit is below 3.5;
b) the structure is fast enough for our application: the allowable frequencies range between 500 kHz and more than 72 MHz;
c) furthermore, we developed a generator for such delay lines of arbitrary length and word width [9].
Fig. 5 shows the internal structure of the delay line and the way data are transferred from one cell to the next. Each memory cell has two independent bit lines, allowing a read and a write operation to be performed in one half clock cycle, the first half being the precharge phase of the read bit line to VDD. Read/write word lines are shared between two adjacent cells, thus simplifying the pointer circuitry. At each cycle (Fig. 6), any datum in an addressed column is shifted up to the right by one position. The column just read is ready to be written at the next cycle: this assumes that there is always an empty "buffer" column in the array. The delay line is organized into eight bit planes of six rows and 41 columns each, and has a capacity of $6 \times 40 = 240$ eight-bit samples. The pointer circuitry is shared between all the bit planes.

Fig. 5. Delay-line internal structure.

However, this structure had to be slightly modified because of our need to access consecutive data at every $n$th clock cycle (here $n = 41$) and not every cycle. This could not have been done by simply working at $F/n$, if $F$ is the main clock frequency. In fact, we would have been very close to the minimum frequency required for the delay line itself (500 kHz) because of its dynamic structure; this would have constrained the circuit to a main frequency over 20 MHz ($41 \times 500$ kHz $= 20.5$ MHz) and consequently would have limited its potential range of applications. To avoid this constraint, we forced the delay line to loop after the 240-data acquisition phase and to operate at the main circuit clock frequency.

Another problem encountered was the selection of the right datum after $n$ cycles, while all data are shifting permanently. Indeed, consecutive input data become output every clock cycle and not every $n$ clock cycles. This problem was solved by using not only the last line output, but all the read line outputs, and selecting the right datum through a multiplexer. Let $l$ and $c + 1$ denote the number of lines and columns, respectively (the global delay achieved is equal to $lc$). Supposing the line is looped, at time $i$, if $D(i \bmod lc)$ is the output of the $l$th line, then data $D(i - c \bmod lc)$, $D(i - 2c \bmod lc)$, ..., $D(i - (l-1)c \bmod lc)$ are the outputs of lines $1, 2, \ldots, l-1$ (Fig. 6). If data $D(i \bmod lc)$ is read at time $i$, we want data $D(i + j \bmod lc)$ to be read at time $i + nj$. Thus, for all $j > 0$,

$$i + j \bmod lc \ \text{must belong to} \ \{\, i + nj - c \bmod lc,\ \ldots,\ i + nj - (l-1)c \bmod lc,\ i + nj \bmod lc \,\}. \tag{6}$$

We can verify that (6) has solutions if and only if $c$ divides $n - 1$. The simplest solution is obtained for $c = n - 1$. (Note that this constrains the global delay $lc$ to be a multiple of $n - 1$, which was not a problem in our case.) Thus, data $D(i + j \bmod lc)$ can be taken at line $j \bmod l$ (a result of 0 standing for line $l$). Here, $l = 6$, $n = c + 1 = 41$, and $lc = 240$. Consecutive input data can then be taken at consecutive lines of the structure every 41 clock cycles.
Fig. 6. Delay-line architecture.

Fig. 7. Algorithmic part architecture.
Compared with the original structure, the multiplexer overhead is very small and the select block (looped register) can be shared between all 8 bits.
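The line-selection rule of the looped delay line can be checked numerically with an index-only model. The sketch below does not simulate the memory cells or the pointer circuitry: it assumes, as in the text, that at time $t$ the last line outputs datum number $t \bmod lc$ and that line $k$ outputs datum number $(t - kc) \bmod lc$. Function names are hypothetical.

```python
def line_output(t, line, l, c):
    """Index of the datum at the output of a line at time t.

    Line l (also written 0 here) outputs D(t mod lc); line k outputs
    D((t - k*c) mod lc). This indexing model is taken from the text.
    """
    return (t - (line % l) * c) % (l * c)


def read_indices(start, n, l, c, count):
    """Read line (j mod l) at time start + n*j for j = 0 .. count-1."""
    return [line_output(start + n * j, j % l, l, c) for j in range(count)]


def has_solution(n, l, c):
    """Check (6) directly: some line must offer D(i + j) at time i + n*j."""
    return all(
        any((n * j - k * c - j) % (l * c) == 0 for k in range(l))
        for j in range(1, l * c + 1)
    )
```

With the values of the paper ($l = 6$, $c = 40$, $n = 41$), `read_indices(0, 41, 6, 40, 240)` returns the indices 0 to 239 in order, while `has_solution(40, 6, 40)` is false because 40 does not divide 39.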
C. Algorithmic Operative Part

As seen above, the algorithm (including filter output computation) is processed in deferred time in the operative part (see Fig. 7). This part therefore needs to memorize 16 data $x_i$ and 16 filter coefficients $h_i^{(n)}$ at each step of the algorithm. Thus, this part includes four different static shift registers: the first of sixteen 8-b data, the second of sixteen 17-b coefficients, and the two others of sixteen 9-b minimum and maximum coefficient values.

The choice of fully static shift registers, instead of a dynamic delay line (as in Section IV-B), was dictated by the following consideration: coefficient values can be stored in the operative part for a long and undetermined period. This can depend on the arrival time of the next window signal, but also on the system configuration: in the D2-MAC application, equalization can be suspended as long as the bit error rate is satisfactory, and then be restarted with the last computed coefficient set. It is thus impossible to use a dynamic delay line, which has a minimum shifting frequency (unless complicated logic is used). Furthermore, it was easy to implement, in such a shift register, two different kinds of initialization (transparent filter or half-Nyquist filter).

This part also contains one 8-b × 9-b multiplier (identical to the filter multipliers), a 19-b carry-lookahead adder to perform the accumulation (equation (1)), and a 17-b carry-lookahead adder to compute the coefficient incrementation (equation (2)). These optimized adder architectures are included in our datapath generator. Because of the accuracy required by (2), coefficients are 17-b two's complement coded during the whole algorithm computation, except during the multiplication and, of course, before the transfer to the transversal filter, where they are rounded to 9 b (Fig. 7).

The values of computed coefficients are continuously checked in the operative part. They are first constrained to a certain range around their initial values: two 9-b comparators are used with the two sixteen 9-b "minimum and maximum value" shift registers. Second, their sum has to be greater than half the value of the central coefficient: a 13-b ripple-carry adder performing the coefficient accumulation and a 13-b comparator are implemented. These checks are performed in order to detect any possible algorithm divergence. In such a case, computations are reset to coefficient initial values.
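The coefficient handling described above (17-b internal values, 9-b values sent to the filter, and two divergence checks) can be sketched with integer arithmetic. Several points are assumptions not stated in the text: that the 9-b value keeps the nine most significant bits, that rounding is half-up with saturation, and that the sum is compared with a `center_value` supplied by the caller. All function names are hypothetical.

```python
def round_17b_to_9b(value):
    """Round a 17-b two's complement integer to 9 b by dropping 8 LSBs.

    Assumed behavior: add half an LSB of the 9-b result, shift right
    arithmetically, then saturate to the 9-b range [-256, 255].
    """
    rounded = (value + 128) >> 8
    return max(-256, min(255, rounded))


def coefficients_valid(h, h_min, h_max, center_value):
    """Apply the two divergence checks described in the text."""
    in_range = all(lo <= h_i <= hi for h_i, lo, hi in zip(h, h_min, h_max))
    sum_ok = sum(h) > center_value / 2
    return in_range and sum_ok


def checked_update(h_new, h_init, h_min, h_max, center_value):
    """Keep the new set if it passes the checks, else reset to initial values."""
    if coefficients_valid(h_new, h_min, h_max, center_value):
        return h_new
    return list(h_init)
```

Resetting to the initial set on failure follows the last sentence of the paragraph; how the chip schedules the restart is not modeled.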
D. Programmable Registers

Four programmable configuration registers can be loaded in an asynchronous bus-like mode (independent of the system clock). With a chip select (CS, active high), two address bits (REGAD0, REGAD1), and a clock signal (WRITE, active on the rising edge), a master chip can load configuration values into the registers using the 8-b input data bus.
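The load protocol can be mimicked with a toy behavioral model. The class and method names are hypothetical, and two details are assumptions: that REGAD1 is the most significant address bit and that the registers hold 0 before the first write. Timing (setup and hold) is not modeled.

```python
class ConfigRegisters:
    """Toy model of the four 8-b registers and their asynchronous load."""

    def __init__(self):
        self.registers = [0, 0, 0, 0]   # assumed initial contents
        self._write_was_high = False    # previous level of the WRITE pin

    def drive(self, cs, regad1, regad0, write, data):
        """Present pin levels; load on a rising WRITE edge while CS is high."""
        rising = bool(write) and not self._write_was_high
        self._write_was_high = bool(write)
        if cs and rising:
            address = (regad1 << 1) | regad0  # REGAD1 assumed to be the MSB
            self.registers[address] = data & 0xFF
```

A master would call `drive` once with WRITE low and once with WRITE high to produce the rising edge that latches the data byte.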