# Design and Analysis of a Hierarchical Clock Distribution System for Synchronous Standard Cell/Macrocell VLSI

EBY G. FRIEDMAN, MEMBER, IEEE, AND SCOTT POWELL

Abstract: This paper describes the synchronous clock distribution problem in VLSI and techniques for its solution. In particular, a hierarchical design technique for minimizing clock skew within a VLSI circuit is discussed, along with its relative advantages and disadvantages. In addition, a model for clock distribution networks which considers the effects of distributed interconnect impedances on clock skew is described.

## I. INTRODUCTION

In most digital systems, the transfer of data between functional elements is synchronized by a single control signal, the processing clock. This signal typically constrains the timing and performance behavior of an entire system [1]. It is therefore imperative that a design methodology exist for distributing clocks which permits the system to operate as fast as possible without creating unnecessary timing uncertainties or decreasing functional chip yield. This clock distribution design methodology must also provide an environment for hierarchical design of a VLSI circuit, so that the VLSI design effort can be partitioned into smaller design problems. This paper describes a design methodology for distributing clock networks in a VLSI circuit design environment. The methodology considers all hierarchical levels of design detail and permits design of a clock distribution network that provides optimal performance, such as negligible clock skew. The overall design problem remains functionally partitionable and is well suited to a multi-person design team environment. Lastly, a clock distribution model is discussed which considers the effects of distributed interconnect impedances on clock skew.

Section II of this paper describes the requirements of a clock distribution system for a classical sequential VLSI circuit. In Section III, the particular design approach of the partitionable clock distribution scheme is discussed, while in Section IV the advantages and disadvantages of this clock distribution scheme are described. In Section V, an approach for characterizing distributed resistive and capacitive interconnect impedances for optimal performance evaluation of clocking circuitry is discussed. Section VI describes an example in which this clock distribution methodology has been implemented in a complex VLSI circuit. Finally, in Section VII, some concluding comments are made describing the relative benefits of this design approach to VLSI circuits.

## II. DESIGN REQUIREMENTS OF A CLOCK DISTRIBUTION SYSTEM

Clock distribution systems perform the task of synchronizing the flow of data in digital systems. As a clock signal arrives at a sequential register, it triggers the transfer of data from one bank of sequential registers to the next through a combinatorial network, which manipulates the data in an appropriate functional manner (see Fig. 1) [1]. When designing the specific combinatorial logic resident between the sequential registers, careful attention must be paid to ensuring that each path's timing requirements are met. The delay of a specific sequential signal path consists of five factors.

1)  $t_{C \to Q}$: the clock-to-Q delay of the originating register in the signal path.

2)  $t_{\text{logic}}$: the propagation delay due to the particular *RC* time constants of the specific path's logic and interconnect.

3)  $t_{\text{setup}}$: the setup time of the final register in the signal path.

4)  $t_{\text{skew}}$: the time difference between the triggering edges of the processing clock presented to two different sequential registers in the signal path. A critical race condition can result if the final register clock signal significantly leads or lags the originating register clock signal.

5)  $\Delta t_{t/r}$: the additional delay caused by the variation of input transition time on each transistor within a specific sequential path. For slow rise and fall times, the propagation delay of a device can increase significantly. This delay is described as a time difference since the delay components of a synchronous signal path ($t_{C \to Q}$, $t_{\text{logic}}$, and $t_{\text{setup}}$) change as a function of input transition time.

Therefore the total propagation delay between two registers in a sequential signal path is given by

$$t_{PD} = t_{C \to Q} + t_{\text{logic}} + t_{\text{setup}} + t_{\text{skew}} + \Delta t_{t/r}. \tag{1}$$

Fig. 1. Classical sequential data flow.

Manuscript received September 1, 1985; revised December 16, 1985. The authors are with Hughes Aircraft Company, Carlsbad, CA 92008. IEEE Log Number 8607450.

For a design to meet its specified requirements, the largest propagation delay of any signal path enabled by the clock system must be less than the inverse of the circuit's maximum clock frequency, as expressed in (2) [1]-[3]:

$$t_{PD_{\max}} < t_{\text{clock}} = \frac{1}{f_{\text{clk}}}. \tag{2}$$
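As a rough illustration of (1) and (2), the following sketch sums the five delay components of each path and derives the clock-frequency bound set by the slowest path. The function names and all numerical values are hypothetical and chosen only for illustration; they are not taken from the design described in this paper.

```python
def total_path_delay(t_c_to_q, t_logic, t_setup, t_skew, delta_t_tr):
    """Eq. (1): sum of the five delay components (all in the same time unit)."""
    return t_c_to_q + t_logic + t_setup + t_skew + delta_t_tr


def max_clock_frequency(path_delays):
    """Eq. (2): the slowest path bounds the clock period, so f_clk < 1 / t_PD_max."""
    return 1.0 / max(path_delays)


# Hypothetical register-to-register paths, delays in nanoseconds.
paths_ns = [
    total_path_delay(2.0, 12.0, 1.5, 3.0, 0.5),  # path carrying 3 ns of clock skew
    total_path_delay(2.0, 10.0, 1.5, 0.5, 0.5),  # shorter path with little skew
]

f_limit_ghz = max_clock_frequency(paths_ns)  # 1/ns is GHz
print(f"t_PD_max = {max(paths_ns):.1f} ns, f_clk < {f_limit_ghz * 1e3:.1f} MHz")
```

Under these assumed numbers the critical path is the one with the larger skew term, so reducing $t_{\text{skew}}$ on that path directly raises the permissible clock frequency.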

Therefore, in order to permit maximum performance in a VLSI circuit, special attention must be given to decreasing nonfunctional parasitic time delays. This is commonly exhibited within the industry by an effort to decrease parasitic device and interconnect impedances [4] common in all integrated circuits. The intention of this paper is to describe a design technique for minimizing clock skew in sequential VLSI circuits while still maintaining a useful design team environment. By minimizing clock skew, the propagation delay of a path decreases, thereby permitting a higher frequency of operation and improving the data throughput of the overall system being integrated.

## III. A DESIGN TECHNIQUE FOR MINIMIZING CLOCK SKEW

When designing a clock distribution system in a VLSI circuit, it is imperative that the clock skew between each register in the circuit be limited. A common technique for minimizing the clock skew between each clock branch is the use of a general symmetric organization of the clock lines (e.g., equalizing line lengths) which, to first order, equalizes the parasitic load of each clock signal as seen by each clock buffer [2], [5], [6]. The difficulty occurs with this technique when one wants to hierarchically partition the overall chip design into separate functional elements, as is commonly done in large VLSI circuits. Ideally, each large functional element would contain its own locally optimized clock distribution system to satisfy its own particular timing constraints. However, local optimization within a functional element does not necessarily lead to global optimization of the overall chip-level clock distribution system.

If the processing clock interconnect had relatively low resistance, a chip-level centralized clock-buffer circuit would satisfy chip-level buffering requirements [7]. However, in many VLSI circuits, top-level interconnect typically is highly resistive due to the length of these connections. By centralizing the clock buffering, all sequential registers must be driven by the central clock-buffer circuit. However, due to the large capacitive fan-out of these registers, the load seen by the central clock-buffer circuit has a very large capacitive component. Also, for the central clock circuit, the combined capacitance of the registers is typically much greater than the parasitic capacitance of the clock interconnect. The delay through the final stage of the clock circuit is then proportional to  $(R_{buffer} + R_{line}) \times C_{load}$ . The large capacitive component of the fan-out makes the delay along a path very sensitive to line resistance. Since different branches will have different line resistances, a significant skew between processing clock signals can result [2], [8].
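The sensitivity described above can be made concrete with a first-order sketch based on the proportionality $(R_{buffer} + R_{line}) \times C_{load}$. The component values below are hypothetical, and the distributed case deliberately omits the delay of the local final stage, which by assumption sees little line resistance.

```python
def final_stage_delay(r_buffer, r_line, c_load):
    """First-order delay estimate (R_buffer + R_line) * C_load, in seconds."""
    return (r_buffer + r_line) * c_load


def skew(delays):
    """Skew taken as the spread between the slowest and fastest branch."""
    return max(delays) - min(delays)


R_BUFFER = 50.0                   # ohms, hypothetical driver output resistance
LINE_RESISTANCES = [20.0, 200.0]  # ohms, two hypothetical chip-level branches

# Centralized buffering: each resistive line drives the full register fan-out.
C_REGISTERS = 20e-12              # farads, hypothetical lumped register load
centralized = [final_stage_delay(R_BUFFER, r, C_REGISTERS) for r in LINE_RESISTANCES]

# Distributed buffering: each resistive line sees only a local buffer input.
C_LOCAL_BUFFER_INPUT = 0.2e-12    # farads, hypothetical
distributed = [final_stage_delay(R_BUFFER, r, C_LOCAL_BUFFER_INPUT) for r in LINE_RESISTANCES]

print(f"centralized skew: {skew(centralized) * 1e12:.0f} ps")
print(f"distributed skew: {skew(distributed) * 1e12:.0f} ps")
```

In this model the skew contributed by unequal line resistances scales with the capacitance those lines must charge, which is why a smaller load on the resistive lines reduces it.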

To decrease delay and minimize clock skew, it is advantageous to configure the centralized clock-buffering circuit so that the clock signals with higher resistances drive loads with lower capacitances. In a hierarchical VLSI circuit, due to the length of the lines, clock lines connecting the various chip-level functional elements are likely to be more resistive than clock lines within a functional element. In these situations, the final stages of the clock buffering should be distributed within the individual functional elements. In this way, the inverter stage driving the large capacitive fan-out dependent loads has lower interconnect resistance.

Therefore, in general, the initial clock-buffering stages are centralized at the chip level. Parallel connections are made from the centralized clock-buffer circuit to each of the functional elements, as shown in Fig. 2. The clock load seen at the input to each of the functional elements is relatively small, permitting these loads to be driven by the higher resistive chip-level clock interconnect. Thus clock skew due to differing line resistances is minimized by distributing the clock buffering and reducing the capacitive load driven by large chip-level resistive interconnect.

Fig. 2. Clock distribution system for minimal clock skew.

As described in [9], parameterized buffer cells can be used to geometrically size an inverter to provide an appropriate level of current drive. The effective output impedance of these devices, coupled with the distributed resistive and capacitive impedances of the buffer's interconnect and fan-out, defines the timing response of that logic stage [1], [10], [11]. Therefore one can compensate for the variation of clock propagation delay between the functional elements by parameterizing each functional element's clock buffer resident in the central clock-buffering circuit. As shown in Fig. 2, each of the J functional elements requires its own precisely tuned clock signal to minimize the clock skew between the J functional elements. Also as shown, each of the J functional elements is constrained to a K-stage clock distribution system. This ensures that each functional element is triggered by the same clock edge. A list of recommended constraints in the design of a minimal clock skew distribution system is as follows:

1)  each of the J functional elements should utilize the same number (K stages) of clock buffering;

2)  the clock signal rise and fall times seen by each sequential register within each functional element should be constrained to a maximum level; and

3)  the internal functional element clock skew should be constrained to a maximum level using this same minimal clock-skew technique in a hierarchical fashion.

Accurate resistive and capacitive interconnect parasitics at the inter-element level are extracted [12], [13] and used to characterize the interconnect impedances between the functional elements and the central clock-buffer circuit. To compensate for the variation in interconnect and fan-out loading of each of the functional elements, finely tuned parameterized buffers are placed within the central clock-buffer circuit to drive each particular clock line. Each of these parameterized buffer cells is chosen with the level of current drive (i.e., channel resistance) needed to compensate for the variation in interconnect and fan-out loading that each functional element presents. Thus the chip-level clock distribution system produces extremely small clock skew across the entire chip.
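A minimal sketch of this compensation step is given below, assuming the same first-order delay model $(R_{buffer} + R_{line}) \times C_{load}$ used in the discussion of centralized buffering. The function `tune_buffer_resistances`, the minimum buffer resistance, and the line values are hypothetical; in practice the tuning would rely on extracted parasitics and circuit simulation rather than this closed-form estimate.

```python
def tune_buffer_resistances(lines, r_buffer_min):
    """Pick one buffer output resistance per clock line so all delays match.

    lines        -- list of (r_line, c_load) pairs, one per functional element
    r_buffer_min -- resistance of the strongest available buffer (ohms)

    The slowest line driven by the strongest buffer sets the target delay;
    every other line gets a weaker buffer (higher resistance) to match it.
    """
    target = max((r_buffer_min + r_line) * c_load for r_line, c_load in lines)
    r_buffers = [target / c_load - r_line for r_line, c_load in lines]
    return target, r_buffers


# Hypothetical chip-level lines: (line resistance in ohms, input load in farads).
LINES = [(150.0, 0.5e-12), (300.0, 0.4e-12), (80.0, 0.8e-12)]

target_delay, buffers = tune_buffer_resistances(LINES, r_buffer_min=40.0)
for (r_line, c_load), r_buf in zip(LINES, buffers):
    delay_ps = (r_buf + r_line) * c_load * 1e12
    print(f"R_line={r_line:6.1f}  R_buffer={r_buf:6.1f}  delay={delay_ps:6.1f} ps")
```

Matching every line to the slowest one keeps all chosen resistances at or above the assumed minimum, so no line requires a buffer stronger than the strongest available cell.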

The number of stages necessary to implement an optimal clock distribution system is dependent upon the fan-out and interconnect loading and the application specific speed/area trade-offs. Techniques are described in the literature for determining the optimal number and geometric size of cascaded buffer stages [1], [14]–[16]. The optimal configuration of buffer stages contained within each functional element is dependent upon the application. The configuration illustrated in Fig. 3(a) can be used in functional elements implemented with standard cells. Since the line resistance is typically low, all inverters can be placed in close physical proximity and final adjustments for skew are made at the final stage. For functional elements implemented with macrocells, the configuration illustrated in Fig. 3(b) is more accurate since macrocell-to-macrocell connections tend to be longer than cell-to-cell connections in a standard cell environment. Therefore line resistances tend to be larger, requiring the delay through each branch to be defined independently of other branches.

common resistance and (b) without common resistance.

For minimal clock skew, care must be taken to ensure that there are no common resistive paths shared by more than one sequential gate. Fig. 4(a) illustrates a standard cell layout and the resulting clock line network. Resistance $R_f$ is connected to several gates and causes a difference in delay between paths P1 and P2, resulting in a clock skew between these two paths. Fig. 4(b) shows the same circuit without $R_f$. Notice that the resistors in the resulting network only drive one gate, as opposed to the situation in Fig. 4(a) where resistor $R_f$ drives several gates. In Fig. 4(b), the capacitance seen by any one resistance is minimized, thereby minimizing the effect of the resistance on the overall delay (which is directly proportional to RC) and decreasing the clock skew between paths P1 and P2.
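The effect of a shared resistance can be sketched with an Elmore-style first-order delay estimate for an RC tree. This is a standard approximation introduced here for illustration, not necessarily the estimation method used by the authors. The topology is also an assumption, since Fig. 4 is not reproduced: in case (a) one gate connects directly to the driver while three others share a hypothetical $R_f$, and in case (b) every gate has its own branch. All component values are hypothetical.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the driver to every node of an RC tree.

    parent[node]      -> upstream node (the driver itself has no entry)
    resistance[node]  -> resistance of the wire segment feeding `node`
    capacitance[node] -> capacitance to ground at `node`
    """
    def segments_to_driver(node):
        path = []
        while node in parent:
            path.append(node)
            node = parent[node]
        return path

    # Each segment charges all of the capacitance located downstream of it.
    downstream = {node: 0.0 for node in resistance}
    for node, cap in capacitance.items():
        for seg in segments_to_driver(node):
            downstream[seg] += cap

    return {
        node: sum(resistance[seg] * downstream[seg] for seg in segments_to_driver(node))
        for node in resistance
    }


def gate_skew(delays, gates):
    values = [delays[g] for g in gates]
    return max(values) - min(values)


C_GATE = 0.05e-12  # farads per sequential gate (hypothetical)
R_BRANCH = 10.0    # ohms per private branch (hypothetical)
R_F = 100.0        # ohms, shared resistance (hypothetical)
GATES = ["g1", "g2", "g3", "g4"]
CAPS = {g: C_GATE for g in GATES}

# (a) g1 hangs directly off the driver; g2..g4 share R_F through node "x".
shared = elmore_delays(
    parent={"x": "drv", "g1": "drv", "g2": "x", "g3": "x", "g4": "x"},
    resistance={"x": R_F, **{g: R_BRANCH for g in GATES}},
    capacitance=CAPS,
)

# (b) every gate has its own branch back to the driver.
private = elmore_delays(
    parent={g: "drv" for g in GATES},
    resistance={g: R_BRANCH for g in GATES},
    capacitance=CAPS,
)

print("skew with shared R_f   :", gate_skew(shared, GATES))
print("skew with private lines:", gate_skew(private, GATES))
```

In this model the shared resistor is charged with the capacitance of every gate behind it, so its contribution to delay grows with the number of gates it feeds, whereas a private branch only ever sees one gate capacitance.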

## IV. ADVANTAGES AND DISADVANTAGES OF THE MINIMAL CLOCK-SKEW DESIGN TECHNIQUE

A key advantage of this technique is the reduction of chip-level clock skew. The overall clock delay, from the input pin to the sequential registers, is also reduced as a result of improved partitioning of the RC loads. The inverters within each functional element drive large capacitive loads. Because these inverters are placed within their functional elements, they are physically close to their loads, which minimizes the interconnect impedance, particularly the resistance, of these nodes [17]–[20]. The fairly long inter-element interconnect introduces a large resistance between the central clock-buffer circuit and the intermediate-stage clock inverters, but these inverters have relatively low input capacitance. Thus the RC time constants are reduced, which reduces the overall clock delay and skew.

Another important advantage in using this design technique in a synchronous VLSI circuit is the ability to partition each of the functional elements among a VLSI design team. The overall chip design can be partitioned hierarchically into a manageable domain of information with an emphasis on optimal clock distribution. Each functional element's clock distribution system is therefore optimized for its particular requirements while still maintaining overall chip-level clock integrity. The usefulness of this technique is dependent upon the ability to characterize the VLSI circuit's device and interconnect impedances. Careful attention must be paid to characterizing and modeling these resistances and capacitances in order to minimize the chip-level clock skew.

Lastly, unlike interconnect impedances, transistor transconductances tend to be very sensitive to both process and environmental variations (e.g., temperature, radiation, etc.). Therefore the performance of an optimally designed clock distribution system designed under nominal conditions would tend to fluctuate under worst-case and best-case conditions. In particular, a processing clock signal path whose performance is dominated by interconnect impedances changes differently than a clock signal path that is dominated by device impedances. For example, the delay through an inverter driving a large resistive load will tend to be fairly insensitive to variations in temperature. Conversely, the delay of an inverter driving a small resistive load will vary widely with temperature. Thus situations occur where the absolute propagation delays of clock signals significantly decrease under best-case conditions but the clock skew actually increases under these same conditions due to the different sensitivities of different signal paths to changes in device transconductance. However, if close attention is given to reducing the interconnect resistance when using the minimal clock-skew technique, the effect on clock skew is fairly minimal [17].
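The following sketch illustrates how delay can fall while skew rises. It assumes a first-order RC delay in which only the device resistance scales with process and environmental conditions while the interconnect resistance stays fixed. The two paths, the scale factors, and the load are hypothetical values chosen so that the paths are balanced under nominal conditions.

```python
def path_delay(r_device, r_line, c_load, device_scale=1.0):
    """First-order RC delay; only the device resistance scales with conditions."""
    return (r_device * device_scale + r_line) * c_load


C_LOAD = 1e-12                                # farads, hypothetical
PATH_A = {"r_device": 100.0, "r_line": 900.0}  # interconnect-dominated path
PATH_B = {"r_device": 900.0, "r_line": 100.0}  # device-dominated path

# Hypothetical device-resistance scale factors for each operating corner.
CORNERS = [("nominal", 1.0), ("best case", 0.6), ("worst case", 1.5)]

for label, scale in CORNERS:
    d_a = path_delay(c_load=C_LOAD, device_scale=scale, **PATH_A)
    d_b = path_delay(c_load=C_LOAD, device_scale=scale, **PATH_B)
    print(f"{label:10s}  A={d_a * 1e12:7.1f} ps  B={d_b * 1e12:7.1f} ps  "
          f"skew={abs(d_a - d_b) * 1e12:6.1f} ps")
```

With these assumed values both paths speed up in the best-case corner, yet they no longer match, because the device-dominated path responds far more strongly to the change in device resistance.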

## V. PERFORMANCE ANALYSIS

An accurate model of the clock distribution network within a VLSI circuit is necessary for proper performance evaluation. With the exception of the final stage, the clock-buffering system consists of a chain of inverters with simple interconnect between stages. The final stage consists of an inverter driving a multitude of sequential gates connected by a multibranch interconnection tree. The model for the load seen by the final clock-buffer stage is, in the general case, a large RC tree. Due to the size and complexity this tree can attain, circuit simulation of the entire network becomes prohibitive and a delay estimation technique must be used [10]. When using the clock-buffer distribution scheme described within this paper, the load seen by the final clock-buffer stage is of a simpler form, as is illustrated in Fig. 5(a). Clock skew is evaluated by comparing the delay through the branch with the largest *RC* time constant to the delay through the branch with the smallest RC time constant. Since the network consists entirely of parallel combinations of RC branches, several assumptions can be made to simplify the circuit used to model the load seen by the final clock-buffer stage.

The time constant  $R_jC_j$  of any particular branch j is normally one or more orders of magnitude less than the rise time of the clock signal. Approximations involving these branch time constants will have minimal effect on the accuracy of the result. The first simplifying approximation involves branch resistances. If the individual branch resistances do not vary significantly from the average branch resistance  $R_{av}$ , the impedance looking into the network of Fig. 5(a) can be approximated by the following equation (excluding the parasitic line capacitance which will be accounted for later):

$$Z_{in} \cong \left(R_1 \| R_2 \| \cdots \| R_n\right) \cdot \frac{(s+1/R_1C_1)(s+1/R_2C_2)\cdots(s+1/R_nC_n)}{s\,(s+1/R_{eq}C_{eq})^{n-1}} \tag{3}$$

where

$$n = \text{number of sequential gates}$$

$$R_{eq} = R_1 + R_2 + \dots + R_n = n \times R_{av}$$

$$C_{eq} = (1/C_1 + 1/C_2 + \dots + 1/C_n)^{-1} = C_{av}/n$$

$$R_1 \| R_2 \| \dots \| R_n = (1/R_1 + 1/R_2 + \dots + 1/R_n)^{-1} = R_{av}/n$$

$$s = j\omega = j2\pi f.$$

The capacitances $C_1, C_2, \dots, C_n$ are all approximately equal to each other since each represents the capacitance of a single sequential gate. Thus $C_{eq}$ can be approximated by $C_{av}/n$ where $C_{av}$ ($= C_1 = C_2 = \cdots = C_n$) is the input capacitance of a single sequential gate. $R_{eq}$ can be equivalently represented as $R_{av} \times n$ where $R_{av}$ is the average of all the branch resistances. If branch resistances $R_1, R_2, \dots, R_n$ do not vary significantly from $R_{av}$, the following equations hold true:

$$\frac{1}{R_{eq}C_{eq}} \cong \frac{1}{R_1C_1} \cong \frac{1}{R_2C_2} \cong \cdots \cong \frac{1}{R_nC_n} \tag{4}$$

$$R_1 \|R_2\| \cdots \|R_n = R_{av} / n. \tag{5}$$

Using (4), the n-1 poles of (3) will cancel all but one of the zeros. Substituting (4) and (5) into (3), the following clock load impedance equations result:

$$Z_{in} \approx \frac{R_{av}}{n} \cdot \frac{s+1/R_{av}C_{av}}{s} = \frac{R_{av}}{n} + \frac{1}{s\,(nC_{av})}. \tag{6}$$

Equation (6) suggests that the load seen by the final stage of the clock-buffering system can be modeled by a single-series RC branch where the resistance of the branch is the parallel combination of all the branch resistances and the branch capacitance is the sum of all the branch capacitances. The circuit represented by (6) is illustrated in Fig. 5(b). Additional simplification will result if a large number of sequential gates are driven by the final buffer stage and the individual branch resistances are sufficiently small such that  $R_1 \| R_2 \| \cdots \| R_n$  can be taken as zero. Thus, for many cases, the clock distribution scheme presented in this paper can be adequately modeled by a simple summation of the input capacitance of all the sequential gates driven by a particular buffer with an additional parallel capacitance to account for parasitic line capacitance  $C_{\text{line}}$ .
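The lumped model of (6) can be checked numerically against the exact impedance of $n$ parallel series-RC branches. The sketch below does this at a single frequency. The branch values, the frequency, and the function names are hypothetical; the branch resistances are assumed to vary by roughly 10% around their average, in keeping with the approximation leading to (4) and (5).

```python
import math


def exact_impedance(branches, freq_hz):
    """Impedance of parallel series-RC branches, each given as (R, C)."""
    s = 1j * 2 * math.pi * freq_hz
    admittance = sum(1.0 / (r + 1.0 / (s * c)) for r, c in branches)
    return 1.0 / admittance


def lumped_impedance(branches, freq_hz):
    """Eq. (6): a single series RC with R = R_av / n and C = n * C_av."""
    n = len(branches)
    r_av = sum(r for r, _ in branches) / n
    c_av = sum(c for _, c in branches) / n
    s = 1j * 2 * math.pi * freq_hz
    return r_av / n + 1.0 / (s * n * c_av)


# Hypothetical load: eight gates of 50 fF each, branch resistances near 100 ohms.
BRANCHES = [(r, 50e-15) for r in (90.0, 95.0, 100.0, 105.0, 110.0, 98.0, 102.0, 100.0)]
FREQ_HZ = 100e6

z_exact = exact_impedance(BRANCHES, FREQ_HZ)
z_lumped = lumped_impedance(BRANCHES, FREQ_HZ)
rel_error = abs(z_exact - z_lumped) / abs(z_exact)
print(f"|Z_exact| = {abs(z_exact):.1f} ohm, |Z_lumped| = {abs(z_lumped):.1f} ohm, "
      f"relative error = {rel_error:.2e}")
```

The agreement depends on the stated assumptions; branches whose resistances differ widely from the average, or frequencies approaching $1/R_jC_j$, would be expected to show larger discrepancies.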

## VI. IMPLEMENTATION OF CLOCK DISTRIBUTION DESIGN TECHNIQUE

The clock distribution technique described in this paper has been implemented in a complex VLSI circuit. The chip was partitioned into many levels of hierarchy and required a large diverse team of VLSI designers. The VLSI design system described in [12] was used in the design of this chip.

Utilizing the constraints listed in Section III, each functional element's clock distribution system was optimized for its particular load environment. The delay from a functional element's clock input to any internal sequential register was globally constrained to a fixed value for each functional element. Each designer ensured that, within their functional element, all clock skews and transition times seen by each sequential register were kept below a preset maximum limit.

Once each functional element's clock distribution system was completed, interconnect parasitics were extracted [12], [13] at the top level. These were described in SPICE format and incorporated with the chip-level clock buffering in the centralized clock-buffer circuit into one large SPICE file. Considering various process and environmental conditions, each of the parameterized buffers resident within the central clock-buffer circuit was tuned for minimal chip-level clock skew. Thus, with an accurate description of device and interconnect parasitic impedances embedded within a SPICE nodal description of the complete clock distribution system, the difference in delay between clock lines, i.e., the clock skew, was designed to be of almost negligible magnitude.

## VII. CONCLUSIONS

An efficient design technique for a clock distribution system oriented for sequential VLSI circuits has been described. The technique can be used to minimize clock skew within a chip, thereby improving the circuit's overall performance. The design technique also provides an environment for concurrent VLSI circuit design, thereby permitting the VLSI circuit design effort to be hierarchically partitionable among a VLSI design team without impacting chip-level performance or cohesiveness. The technique exhibits some sensitivity to process and environmental variations, but this effect is considered to be relatively small. An accurate model for a clock distribution network in a VLSI circuit was presented. The model considers the effects of distributed interconnect impedances on clock skew. Finally, an example that utilizes this clock distribution system and the methodology for its implementation has been described.

The authors would like to acknowledge the technical contributions of G. Yacoub, W. Marking, P. Pandya, L. Tsu, D. Barach, T. Cesear, and R. Rapoza. They are also grateful to G. Persky for his sincere encouragement and advice in writing this paper.

## REFERENCES

- [1] L. A. Glasser and D. W. Dobberpuhl, *The Design and Analysis of VLSI Circuits*. Reading, MA: Addison-Wesley, 1985.
- [2] D. F. Wann and M. A. Franklin, "Asynchronous and clocked control structures for VLSI-based interconnection networks," *IEEE Trans. Comput.*, vol. C-32, no. 3, pp. 284–293, Mar. 1983.
- [3] J. Beausang and A. Albicki, "A method to obtain an optimal clocking scheme for a digital system," in *Proc. Int. Conf. Computer Design* (ICCD), Oct. 1985, pp. 68–72.
- [4] A. K. Sinha, J. A. Cooper, Jr., and H. J. Levinstein, "Speed limitations due to interconnect time constants in VLSI-integrated circuits," *IEEE Electron Device Lett.*, vol. EDL-3, no. 4, pp. 90–92, Apr. 1982.
- [5] S. Dhar, M. A. Franklin, and D. Wann, "Reduction of clock delays in VLSI structures," in *Proc. Int. Conf. Computer Design* (ICCD), Oct. 1984, pp. 778–783.
- [6] K. D. Wagner and E. J. McCluskey, "Tuning, clock distribution, and communication in VLSI high-speed chips," Stanford Univ., Stanford, CA, CRC Tech. Rep. 84-5, June 1984.
- [7] R. Woudsma and J. M. Noteboom, "The modular design of clock-generator circuits in a CMOS building-block system," *IEEE J. Solid-State Circuits*, vol. SC-20, no. 3, pp. 770–774, June 1985.
- [8] F. Anceau, "A synchronous approach for clocking VLSI systems," *IEEE J. Solid-State Circuits*, vol. SC-17, no. 1, pp. 51–56, Feb. 1982.
- [9] E. Friedman, W. Marking, E. Iodice, and S. Powell, "Parameterized buffer cells integrated into an automated layout system," in *Proc. Custom Integrated Circuits Conf.* (CICC), May 1985, pp. 389–392.
- [10] P. Penfield, Jr. and J. Rubinstein, "Signal delay in RC tree networks," in *Proc. 18th Design Auto. Conf.*, June 1981, pp. 613–617.
- [11] C. M. Lee and H. Soukup, "An algorithm for CMOS timing and area optimization," *IEEE J. Solid-State Circuits*, vol. SC-19, no. 5, pp. 781–787, Oct. 1984.
- [12] S. Powell, W. R. Smith, and G. Persky, "A parasitics extraction program for closely-spaced VLSI interconnects," in *Proc. Int. Conf. Computer-Aided Design* (ICCAD), Nov. 1985, pp. 193–195.
- [13] E. Friedman, G. Yacoub, and S. Powell, "A CMOS/SOS VLSI design system," *J. Semicustom IC's*, vol. 2, no. 4, pp. 5–11, June 1985.
- [14] H. C. Lin and L. Linholm, "An optimized output stage for MOS-integrated circuits," *IEEE J. Solid-State Circuits*, vol. SC-10, no. 2, pp. 106–109, Apr. 1975.
- [15] R. Jaeger, "Comments on 'An optimized output stage for MOS-integrated circuits,'" *IEEE J. Solid-State Circuits*, vol. SC-10, no. 3, pp. 185–186, June 1975.
- [16] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits," *IEEE J. Solid-State Circuits*, vol. SC-19, no. 4, pp. 468–473, Aug. 1984.
- [17] H-T. Yuan, Y-T. Lin, and S-Y. Chiang, "Properties of interconnect on silicon, sapphire, and semi-insulating gallium arsenide substrates," *IEEE Trans. Electron Devices*, vol. ED-29, no. 4, pp. 639–644, Apr. 1982.
- [18] R. J. Antinone and G. W. Brown, "The modeling of resistive interconnects for integrated circuits," *IEEE J. Solid-State Circuits*, vol. SC-18, no. 2, pp. 200–203, Apr. 1983.
- [19] G. De Mey, "A comment on 'The modeling of resistive interconnects for integrated circuits,'" *IEEE J. Solid-State Circuits*, vol. SC-19, no. 4, pp. 542–543, Aug. 1984.
- [20] H. B. Bakoglu and J. D. Meindl, "Optimal interconnection circuits for VLSI," *IEEE Trans. Electron Devices*, vol. ED-32, no. 5, pp. 903–909, May 1985.



Eby G. Friedman (S'78-M'79) was born in Jersey City, NJ, in 1957. He received the B.S. degree in electrical engineering from Lafayette College, Easton, PA, in 1979 and the M.S. degree in electrical engineering from the University of California at Irvine in 1981. He has recently begun working toward the Ph.D. degree in electrical engineering at the University of California at Irvine.

He was previously employed by Philips Gloeilampen Fabrieken in Eindhoven, The Netherlands, in 1978 and worked on the design of bipolar differential amplifiers. From 1979 to 1983, he was employed by Hughes Aircraft Company in Newport Beach, CA, working in the areas of custom IC design, software compatible gate array design, one- and two-dimensional device modeling, circuit modeling, and double-level metal process development. He is currently Head of the VLSI Circuit Design Section in the Semiconductor Division of Hughes Aircraft Company in Carlsbad, CA, responsible for custom and semicustom CMOS VLSI design, development of VLSI design methodologies, and the generation of cell compiler-oriented CAD tools for IC synthesis. He is the author of several articles and presentations in the fields of VLSI, CMOS design techniques and CAD tools, and silicon compilation.



Scott Powell received the B.S. and M.S. degrees in electrical engineering from Oregon State University, Corvallis, in 1982 and 1983, respectively.

He was previously employed by Hewlett-Packard and Tektronix, working in the areas of analog and digital circuit design, LSI testing, and CAD development. He is currently a Group Head at Hughes Aircraft Company in Carlsbad, CA, where he is involved in the research and development of VLSI circuit design techniques and CAD tools. He has authored several papers in the field of digital VLSI circuit design.