Ph.D. Public Defense

Architectures for More Effective and Efficient Decoupled Look-Ahead

Sushant Kondguli

Supervised by Professor Michael Huang

Tuesday, April 9, 2019
9 a.m.

Computer Studies Building, Room 703

Single-thread performance is still a central component of general-purpose microarchitectures. In the past, technological drivers (faster clocks and increasing on-chip resources) guaranteed continued growth in single-thread performance. Going forward, however, single-thread performance benefits (if any) from these technological drivers will come at significant cost. Innovative improvements in microarchitectural techniques offer a potential way forward for continued improvements in single-thread performance, mainly because today’s general-purpose applications continue to have significant levels of implicit parallelism. Conventional microarchitectures are unable to exploit this parallelism because of the significant barrier posed by the data and instruction supply subsystem. One possible way of improving this subsystem is to use Decoupled Look-Ahead Architectures (DLA). In DLA, a self-sufficient thread guides the look-ahead activities largely independently of the main thread, which performs the actual program execution. In principle, the effectiveness of DLA does not depend on any particular access pattern or program behavior, and this general-purpose nature makes it an attractive platform for continued improvements in single-thread performance.

To show the effectiveness of DLA at improving single-thread performance, we first evaluate it as an on-demand performance-boosting technique and compare it against traditional performance-boosting techniques such as scaling clock frequency and increasing on-chip resources with wide cores. We show that DLA offers effectiveness comparable to that of traditional performance-boosting techniques, with better efficiency. We also note that the effectiveness of boosting techniques varies across the phases of an application, and we propose an effective and efficient mechanism to enable the optimal boosting technique for each phase of the application.

The look-ahead thread in DLA tries to optimize the supply of data and instructions to the main thread. The overall speed of DLA in a given phase is limited by the slower of the two threads, so improving one thread helps only until the other thread becomes the bottleneck. Conventionally, the two threads run on two separate, identical cores or thread contexts that share the on-chip resources equally. This is inefficient because the resource requirements of the two threads vary at runtime. Similarly, by convention, the look-ahead thread tries to perform all look-ahead activities that could benefit the main thread, and the main thread redundantly repeats many of the look-ahead thread’s activities. We first propose an efficient implementation that optimally distributes on-chip resources between the look-ahead thread and the main thread. Next, we propose various optimizations to both threads that improve the look-ahead thread and extract more utility from it for the main thread. Because DLA is bottlenecked by its slower thread, our techniques offer relatively small performance gains individually. Together, however, their benefit is synergistic. When all proposed techniques are combined, the DLA architecture can obtain an overall performance benefit of more than 50% compared to commercially available, aggressive, state-of-the-art designs, making it a compelling feature for general-purpose microarchitectures.