Non-Boolean Accelerators for Deep Learning Networks

Active Sites

University of Notre Dame

Motivation

We will study non-Boolean information-processing primitives derived from spin-based media and from oscillator networks built from existing and emerging transistor technologies. We target hardware-based solutions that support computational kernels such as convolutional transforms and wavelet transforms (WTs). These kernels form the basis of computational models such as holographic computing (inspired by optical signal processing) and convolutional neural networks (CNNs), which in turn support higher-level application targets such as object recognition. Our estimates suggest that such hardware can be substantially more energy efficient than conventional CMOS equivalents.

Information-processing primitives based on (spin) waves and oscillators are particularly well suited to convolutional transforms. By properly configuring a spin-wave-based system, different types of computational kernels (mimicking optically inspired computing functions such as Fourier transforms (FTs)) can be realized. A spin-wave lens can provide this functionality, as well as any other linear mapping from one 1D vector to another. Spintronic devices that generate and detect spin waves can be placed on a continuous magnetic film. The film may have a 'hard-wired' structure (i.e., an exchange bias that provides the local field for a lens function), or additional spintronic devices may provide reconfigurability. All components are planar and compatible with standard CMOS technology. Transistor-based analog circuits can also emulate an optical system.
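To make the linear-mapping view concrete, the sketch below (Python/NumPy; the matrix construction and test signal are illustrative assumptions, not a proposed hardware interface) expresses a discrete FT as a single matrix-vector product, i.e., exactly the class of 1D-vector-to-1D-vector operation a spin-wave lens would realize physically:

    import numpy as np

    # Any fixed 1D-to-1D linear map is a matrix-vector product y = F @ x.
    # The unitary DFT is one such map: F[k, n] = exp(-2j*pi*k*n/N) / sqrt(N).
    N = 8
    n = np.arange(N)
    F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

    x = np.random.rand(N)  # illustrative input signal
    y = F @ x              # "lens" output: the FT of x

    # Sanity check against NumPy's FFT with the same normalization.
    assert np.allclose(y, np.fft.fft(x, norm="ortho"))

In this picture, a 'hard-wired' exchange bias corresponds to fixing F at fabrication time, while additional spintronic devices correspond to swapping F for a different matrix at run time.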

A key aspect distinguishing deep convolutional neural networks (CNNs) from older hand-tuned filtering is feature learning. Rather than making a priori assumptions about the configuration of filters that represent a feature space, CNNs learn a set of convolutional filters that minimizes error on a training set. Furthermore, the deeper the network and the more data it is trained with, the better the performance that is typically observed. Scalability is therefore essential, and the very best networks require vast computational resources (e.g., GPU clusters) and weeks or months of training time. Thus, hardware that can efficiently perform convolutional transforms could have a significant impact in this space. Moreover, recent work from the machine learning community suggests that CNNs based on wavelet transforms (WTs) can provide superior accuracy for certain classes of problems compared to other classifiers. Whereas an FT reports which frequencies are present in a signal, a WT also localizes a given frequency in time. Wavelet scattering neural networks can compute translation-invariant image representations that are more stable under image deformations (e.g., images of a flag flying in a breeze). WT convolutions (i.e., the filtering part of a CNN) may be particularly amenable to the hardware kernels that form the basis of this project.
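As a toy illustration of this distinction (a sketch only; the Haar filters and synthetic signal below are our assumptions, not filters a trained network would learn), a single-scale Haar wavelet convolution localizes a high-frequency burst in time, while the FT magnitude only reports that the frequency is present somewhere:

    import numpy as np

    # Signal: a high-frequency burst confined to the middle of the window.
    t = np.arange(256)
    x = np.sin(2 * np.pi * t / 64.0)
    x[96:160] += np.sin(2 * np.pi * t[96:160] / 8.0)

    # FT magnitude: reports WHICH frequencies occur, not WHEN.
    spectrum = np.abs(np.fft.rfft(x))

    # Haar filter pair (assumed; analogous to one CNN filter bank).
    h = np.array([1.0, 1.0]) / np.sqrt(2)    # scaling (low-pass) filter
    g = np.array([1.0, -1.0]) / np.sqrt(2)   # wavelet (high-pass) filter
    approx = np.convolve(x, h, mode="same")  # coarse approximation
    detail = np.convolve(x, g, mode="same")  # the "filtering part of a CNN"

    # The detail coefficients are large only where the burst actually is.
    inside = np.abs(detail[96:160]).mean()
    outside = np.abs(detail[:96]).mean()
    print(f"detail response inside burst: {inside:.3f}, outside: {outside:.3f}")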

Notably, we have begun to quantify the impact of I/O overhead for spin-wave-based kernels. Using an FT as a representative case study, the spin-wave approach could be 40X-3000X more energy efficient than state-of-the-art CMOS implementations, even after accounting for I/O overhead.
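The comparison reduces to a simple accounting identity: the kernel's intrinsic advantage survives only if conversion and data movement do not dominate. A minimal sketch of that accounting follows (in Python; every per-operation energy value is a placeholder chosen for illustration, not a measured or projected number):

    # Toy I/O-inclusive energy model for an N-point transform.
    # All per-operation energies are PLACEHOLDERS, not measurements.
    def total_energy(n_points, e_kernel, e_adc, e_dac, e_move):
        """Kernel energy plus per-sample I/O cost (DAC in, ADC out, movement)."""
        return e_kernel + n_points * (e_dac + e_adc + e_move)

    # Hypothetical values (joules) for a 1024-point FT:
    spinwave = total_energy(1024, e_kernel=1e-12, e_adc=5e-13, e_dac=5e-13, e_move=1e-13)
    cmos = total_energy(1024, e_kernel=1e-7, e_adc=0.0, e_dac=0.0, e_move=1e-13)
    print(f"projected advantage: {cmos / spinwave:.0f}x")

With placeholders like these, I/O (rather than the kernel itself) dominates the spin-wave energy budget, which is one reason the projected range is so wide.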

Objectives

We will consider how emerging technologies and hardware can impact the energy, performance, and accuracy of learning-based architectures for the applications targeted by this project.

Team

Yiyu Shi and X. Sharon Hu (CSE, ND) will be responsible for the simulation and modeling, and Michael Niemier (CSE, ND) will be responsible for benchmarking.

Deliverables

By the end of the first year, the proposed deliverables include preliminary simulations of filters that address the computational problems described herein. The final deliverables will also include projections of how emerging technologies impact the power, performance, and/or accuracy achievable for the computational problems of interest.

Experimental Plan and Industrial Relevance

We will compare our projections to existing solutions for similar application-level problems, assuming computational models running on state-of-the-art CPUs/GPUs. We will perform direct measurements of power, delay, etc. on said processors, assuming best-performing algorithms for the computational problems. PIs Niemier and Hu have also developed a well-validated architectural-level benchmarking infrastructure that projects how new device technologies impact processor performance. This will allow us to properly account for how technology scaling, voltage scaling, etc. affect the power/performance of algorithms running on future von Neumann architectures. As such, we can make meaningful comparisons of our non-von Neumann solutions to a “moving target.”
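As one example of the kind of direct baseline measurement intended (a sketch; the workload, sizes, and wall-clock harness are our choices, and the actual baselines would pair vendor-tuned libraries with hardware power counters rather than timing alone):

    import time
    import numpy as np

    # Wall-clock baseline for a batch of 2D FFTs, standing in for a
    # best-performing CPU implementation of an FT/convolution kernel.
    batch, size, trials = 64, 256, 20
    x = np.random.rand(batch, size, size).astype(np.complex64)

    np.fft.fft2(x)  # warm-up run so one-time setup is excluded
    start = time.perf_counter()
    for _ in range(trials):
        np.fft.fft2(x)
    elapsed = (time.perf_counter() - start) / trials
    print(f"{batch} {size}x{size} FFTs: {elapsed * 1e3:.2f} ms/iteration")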

The project will be of great interest to design companies such as Intel and IBM, as well as to certain defense industries.

Milestones and Time-to-Completion

The estimated duration of this project is 2 years. The milestones are listed in the following table.

Year 1: Determine subset of problems for case studies; develop a complete description of hardware needs; design and simulate hardware kernels.

Year 2: Refine simulations and designs; preliminary projections for I/O overhead, etc.; final power, performance, and accuracy projections.

Number of Graduate Students Supported

2

Budget

$100K/year

Total Cost to Completion

$200K