Duke

ID: D1

Title: Advanced network pruning and quantization

Abstract: Pruning and quantization are two basic methodologies for reducing computation time in
both the training and inference stages of Deep Neural Networks (DNNs). However, two challenging
problems remain unsolved: (1) Pruned DNNs have irregular sparse weights and pseudo-quantized
weights. Irregular sparse computation incurs indexing overhead and poor data locality, and
pseudo-quantized weights must be converted back to floating precision for computation. These
facts limit the speed gains from DNN pruning and quantization; (2) state-of-the-art pruning and
quantization methods only compress weights but ignore activations, which account for the
majority of data involved in DNN computation. To tackle these problems, we will
(1) leverage hardware and system expertise to discover the most computation-efficient sparsity
and quantization patterns and guide pruning and quantization toward those patterns, so as to
exploit the power of both hardware and algorithms; (2) explore pruning and quantization on
both weights and activations, so as to convert computation between sparse weights and dense
activations into computation between sparse weights and sparse activations, and computation
between quantized weights and floating-point activations into computation between
quantized weights and quantized activations.
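
As a minimal sketch of why sparse-sparse computation pays off, the toy NumPy example below (all names and sizes are hypothetical, not the proposed method) computes a dot product between a pruned weight row and a sparse activation vector using only the intersection of their nonzero indices, so the work scales with the overlap rather than the full vector length:

```python
import numpy as np

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (sorted index, value) lists.
    Only matching indices contribute, so the work scales with the overlap."""
    out, i, j = 0.0, 0, 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            out += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return out

# Toy example: a pruned weight row and a sparse (e.g. ReLU-thresholded) activation.
w_idx, w_val = [1, 4, 7], [0.5, -2.0, 1.0]   # 3 of 10 weights survive pruning
a_idx, a_val = [4, 7, 9], [3.0, 2.0, 5.0]    # 3 of 10 activations are nonzero

dense_w = np.zeros(10); dense_w[w_idx] = w_val
dense_a = np.zeros(10); dense_a[a_idx] = a_val

# The sparse-sparse result matches the dense computation.
assert np.isclose(sparse_dot(w_idx, w_val, a_idx, a_val), dense_w @ dense_a)
```
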
Deliverable: 1) A submission to a premier ML conference or journal; 2) Code and benchmarks
used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D2

Title: Fine-grained quantization for deep neural networks and its interpretation

Abstract: Many recent works have proven the effectiveness of deep neural network (DNN)
compression in reducing computational cost. However, most of the existing methods and tools,
such as TensorFlow Lite, simply quantize the whole DNN by assigning the same bit-width to all
the layers. Our previous work, accepted at the NIPS 2018 Workshop on Compact Deep Neural
Networks with Industrial Applications, demonstrated a systematic approach to optimizing the
bit-width for each layer individually. This approach can be further expanded by taking into
account other design factors such as hardware constraints. One open question raised by the
NIPS workshop paper is how to accurately and effectively interpret the sensitivity of model
accuracy to the quantization of each layer of the DNN. To answer this question, we plan to
develop an automated approach that reads in the model, automatically interprets the
quantization sensitivity of each layer, block, and cell, and then compresses the neural
network at different granularities with minimal or no accuracy loss.
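
A minimal sketch of per-layer bit-width assignment, assuming simple uniform symmetric quantization (the layer names, shapes, and bit-widths below are illustrative, not the proposed method); the per-layer quantization error serves as a crude proxy for the sensitivity the project aims to interpret:

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=(16, 16)), "fc": rng.normal(size=(16, 16))}

# Per-layer bit-widths instead of one global setting; the mean quantization
# error each bit-width induces on each layer is a simple sensitivity proxy.
for name, w in layers.items():
    for bits in (8, 4, 2):
        err = np.abs(quantize(w, bits) - w).mean()
        print(f"{name} @ {bits} bits: mean abs error {err:.4f}")
```
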
Deliverable: 1) A submission to a premier ML or CV conference or journal; 2) Code and benchmarks
used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Yiran Chen).

ID: D3

Title: Versatile, privacy-preserving, and efficient cloud-edge image rendering

Abstract: Traditional mobile-based image rendering systems find it increasingly difficult to
satisfy users' varied requirements due to the versatility limitations of ad-hoc rendering
algorithms and the limited availability of image resources at the edge. Additional image
resources can be collected from other edge devices; doing so, however, raises serious concerns
about information privacy and rendering efficiency in cloud-edge environments. In this project,
we propose a versatile image rendering method based on image hashtag similarity and
unsupervised image decomposition. We expect to achieve two goals. The first is to maximally
capture the hidden rendering styles in the training data, instead of the single pre-defined
style of traditional methods. In our method, the hidden rendering styles are automatically
discovered by performing image translation between similar images, where similarity is measured
by the hashtag distance between the images. The second goal is to achieve high image rendering
efficiency while preserving users' privacy. In our method, an image to be rendered is decomposed
into two components (to enhance privacy) which are separately sent from an edge device to the
cloud, and only one component is processed (to improve efficiency) for rendering in the cloud.
Deliverable: 1) A submission to a premier mobile computing conference or journal; 2) Code and
benchmarks used in the experiments.


Proposed budget: $100K.

Participating site: Duke (PI: Yiran Chen).

ID: D4

Title: Reinforcement learning for datacenter scheduling

Abstract: Datacenters require sophisticated management frameworks for resource allocation
and job scheduling. Most existing mechanisms and policies are statically optimized and defined,
hindering their ability to respond to system dynamics: users may arrive or depart, and jobs may
transition from one computational phase to another. We propose reinforcement learning methods
to model system conditions and optimize the allocation of server resources. We will show how
reinforcement learning can learn effective policies as the datacenter operates and adapt
quickly to new system conditions and workloads.
Deliverable: (1) A manuscript submitted to a premier computer systems or architecture venue.
(2) Code and benchmarks used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Benjamin Lee).

ID: D5

Title: Hybrid parallelism and communication methods for decentralized DNN training

Abstract: Decentralized methods were recently proposed to boost the performance of distributed
DNN training. However, the spontaneous communication between workers demands high
bandwidth and thus significantly slows down computing performance. We propose hybrid
parallelism, which combines data and model parallelism, to reduce the intrinsic
communication of partial-sum accumulation and weight updating. We also propose to develop
a hybrid communication scheme in which communication within a worker group is
synchronous and communication between worker groups is asynchronous, further reducing
communication while still maintaining training accuracy.
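
The partial-sum structure that hybrid parallelism exploits can be sketched in NumPy as follows; this is an illustrative single-process simulation with made-up sizes, not the proposed distributed implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 6))   # mini-batch of 8 samples, input dimension 6
W = rng.normal(size=(6, 4))   # layer weights

# Model parallelism: each worker holds a slice of the input dimension and
# produces a partial sum; the partials are accumulated within the group.
partials = [x[:, s] @ W[s, :] for s in (slice(0, 3), slice(3, 6))]
y_model = sum(partials)

# Data parallelism: each worker holds all of W but only a slice of the batch,
# so no partial-sum traffic is needed, only weight-update synchronization.
y_data = np.vstack([x[:4] @ W, x[4:] @ W])

# Both decompositions reproduce the full computation.
assert np.allclose(y_model, x @ W) and np.allclose(y_data, x @ W)
```

In the proposed hybrid scheme, partial-sum accumulation of this kind would remain synchronous inside a worker group, while weight updates across groups would be exchanged asynchronously.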
Deliverable: 1) A submission to a premier ML conference or journal; 2) Code and benchmarks
used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D6

Title: Joint optimization of deep neural network acceleration in speech recognition applications

Abstract: DNN models, especially CTC+LSTM, have been successfully used in speech recognition
and show great performance. To better deploy these models on mobile devices, quantization,
sparsification, and pruning methods are often used to accelerate the execution of speech
recognition models. Pruning is often used to reduce the number of nodes in the CNN, while
sparsification is often applied to the LSTM. However, these acceleration methods optimize the
CNN and LSTM parts only separately. Moreover, speeding up one part of the model may affect the
accuracy of the entire speech recognition model. We propose to optimize the CNN and LSTM
jointly in order to further improve overall performance. By finding a way to balance the
optimization of the LSTM and CNN, we can achieve high accuracy while reducing the execution
time as much as possible.
Deliverable: 1) A submission to a premier ML conference or journal; 2) Code and benchmarks
used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D7

Title: Robust DNN model compression for noisy scenarios

Abstract: Traditional DNN model compression methods reduce the number of weights and neurons
via quantization, sparsification, pruning, and compact network design. These methods,
however, usually do not consider noisy application scenarios, especially when multiple noise
sources exist. There are at least two major challenges in improving the resiliency of model
compression methods to noise: 1) the noise distribution is complex and often unknown in
real-world applications; 2) quantitatively analyzing the impact of the noise on the accuracy of
the compressed model is difficult. We propose to track and interpret the changes in DNN
decision-boundary patterns induced by model compression under various noisy scenarios.
The regularization applied during model compression will then be adaptively tuned, according to
the captured boundary-change patterns, to enhance the robustness of the DNN under noisy
scenarios.
Deliverable: 1) A submission to a premier ML conference or journal; 2) Code and benchmarks
used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D8

Title: A secure and privacy preserving cloud-based online learning system

Abstract: Traditional cloud-based learning systems require all users to upload their data to
an online server before model training can start, demanding huge data storage space and a
large amount of data communication for each user, and posing a potential threat of leaking
private user data. The recently proposed federated learning provides a way to train the model
locally with each user's data and then communicate the model parameters to reach a combined
model. Such methods eliminate the threat of privacy leakage, but their performance may be
limited by the data volume and computational resources of each individual user. Here we propose
to divide our model into a public part plus a local part. The local part can be efficiently
deployed on the edge devices held by the users, where it preprocesses the user data to shrink
its size and eliminate privacy concerns while preserving information useful for the learning
task. This information will then be sent to the cloud for training/inference with the public
model. Similar to federated learning, parameters and gradients of the public model and local
models will be communicated with each other to collaboratively reach a high-performance model.
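
One possible shape of the public/local split, sketched with a hypothetical random-projection local part (the projection, dimensions, and function names are purely illustrative; the actual local model is a subject of the proposed research):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical local part: a random projection held on the user's device that
# shrinks the raw data and obscures the original features before upload.
raw_dim, shared_dim = 100, 16
local_proj = rng.normal(size=(raw_dim, shared_dim)) / np.sqrt(raw_dim)

def local_part(user_data):
    return user_data @ local_proj          # runs on the edge device

def public_part(features, public_w):
    return features @ public_w             # runs in the cloud

user_data = rng.normal(size=(5, raw_dim))
uploaded = local_part(user_data)           # only this leaves the device
assert uploaded.shape == (5, shared_dim)   # 100-dim raw data never uploaded
```
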
Deliverable: 1) One or more submissions to premier ML conferences or journals; 2) A
cloud-based online learning and data storage system implemented on popular cloud computing
platforms (e.g., AWS).

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D9

Title: Customized machine learning estimators on timing and IR drop

Abstract: EDA (Electronic Design Automation) technology has made remarkable progress
over the decades, from attaining merely functionally correct designs of thousand-transistor
circuits to handling multi-million-gate circuits. However, there are two significant challenges in
current EDA methods: 1) traditional tools are largely restricted to their own design stage, which
forces designers to make pessimistic estimates at early stages. Such a policy can lead to very
long turnaround times for the desired QoR (quality of results); 2) many tools rely heavily on
manual tuning, which imposes a stringent requirement on VLSI designers' experience. Moreover,
major design problems such as IR drop, DRC violations, and negative timing slack grow
increasingly critical as technology scales. We rethink these EDA challenges from the perspective
of machine learning (ML): ML methods will be customized to make fast, high-fidelity predictions
of different design goals, including power, timing, and DRC violations. Two promising
research directions are: 1) for early timing analysis before placement, we propose a graph
convolution model that can effectively learn from the gate connection topology; 2) for IR drop
and power analysis, we plan to incorporate timing and spatial information into features and
customize a CNN model with a max structure that captures the moment with the most
severe power hotspot.
Deliverable: 1) A submission to a premier EDA conference or journal; 2) Code used in the
experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Yiran Chen).

ID: D10

Title: Processing-in-memory architecture supporting GAN executions

Abstract: Processing-in-memory (PIM) techniques have recently been extensively explored in the
design of DNN accelerators. However, we found that existing solutions cannot efficiently
support the computational needs of unsupervised Generative Adversarial Network
(GAN) training due to the lack of the following two features: 1) Computation efficiency: GANs
utilize a new operator, called transposed convolution, which introduces significant resource
underutilization because it inserts massive numbers of zeros into its input before a convolution
operation; 2) Data traffic: The data-intensive training process of GANs often incurs heavy
structured data traffic as well as frequent massive data swaps. We propose a novel computation
deformation technique that can skip zero insertions in transposed convolution to improve
computation efficiency. Moreover, we will explore an efficient training procedure to reduce
on-chip memory accesses, design flexible dataflow to achieve high data reuse, and implement
specific circuits to support the proposed GAN architecture with minimum area and energy cost.
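
The zero-insertion waste that the proposed computation deformation targets can be illustrated with a 1D transposed convolution in NumPy (a simplified stand-in for the 2D hardware case); both variants below produce the same output, but the second never multiplies by an inserted zero:

```python
import numpy as np

def conv_transpose_naive(x, w, stride):
    """Zero-insert upsampling followed by a dense convolution: most
    multiply-adds hit the inserted zeros and are wasted."""
    up = np.zeros((len(x) - 1) * stride + 1)
    up[::stride] = x                       # insert stride-1 zeros between inputs
    return np.convolve(up, w)

def conv_transpose_skip(x, w, stride):
    """Scatter-add formulation: each input directly accumulates into the
    output window it affects, skipping every zero-insertion multiply."""
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        out[i * stride : i * stride + len(w)] += xi * w
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5, -1.0])
assert np.allclose(conv_transpose_naive(x, w, 2), conv_transpose_skip(x, w, 2))
```
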
Deliverable: 1) A submission to a premier computer architecture conference or journal; 2) Code
and benchmarks used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Yiran Chen).

ID: D11

Title: Efficient attention-based model designed for mobile devices

Abstract: Recently, attention-based DNN models have achieved state-of-the-art accuracy in
modern Neural Machine Translation (NMT) systems. In NMT systems, however, the hidden state of
the current target word must be compared with all the hidden states of the words in the source
sentence. The resulting large-scale vector-matrix multiplications introduce large memory
consumption and high computation cost, preventing these models from being deployed on
resource-constrained mobile devices. To accelerate vector-matrix multiplication,
technologies such as random weight pruning could be utilized to obtain sparse weights and thus
decrease the total FLOPs. However, random sparsity can hardly lead to practical speedup
on general computation units because of the poor data locality associated with the
scattered weight distribution. In this project, we propose to explore structured sparsity in
attention-based NMT models, which prunes whole rows or columns of
the weight matrix to reduce computation cost. We will realize a prototype of the proposed
technique on mobile platforms.
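
A minimal sketch of row-wise structured pruning, assuming an L2-norm importance criterion (illustrative only; the criterion, sizes, and prototype are part of the proposed work):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 6))

# Structured pruning: drop whole rows by L2 norm, so the remaining matrix
# stays dense and hardware-friendly (no scattered indices to chase).
keep = np.sort(np.argsort(np.linalg.norm(W, axis=1))[-4:])  # 4 strongest rows
W_pruned = W[keep]                                          # smaller dense matrix

x = rng.normal(size=6)
y = W_pruned @ x      # half the FLOPs of W @ x, and still a dense matmul
assert W_pruned.shape == (4, 6) and y.shape == (4,)
```

Unlike random sparsity, the pruned layer is simply a smaller dense matrix, so any general-purpose matmul kernel gets the speedup directly; the consumers of the pruned outputs must of course be adjusted to the reduced dimension.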
Deliverable: 1) A submission to a premier ML/NLP conference or journal; 2) Code and
benchmarks used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Yiran Chen).

ID: D12

Title: BW-SVD-DNNs: A block-wise SVD-based method to balance the processing pipeline in DNN
accelerators and minimize inference accuracy loss

Abstract: Deep Neural Networks (DNNs) have achieved phenomenal success in many real-world
applications. However, as the size of DNNs continues to grow, it is difficult to improve
energy efficiency and performance while still maintaining good accuracy. Many techniques,
e.g., model compression and data reuse, have been proposed to reduce the computational cost of
DNN executions and to efficiently deploy large-scale DNNs on various hardware platforms.
Nonetheless, most existing DNN models lack efficient hardware acceleration solutions.
One major problem is that the transmission time of the data (i.e., weights and inputs/features)
is O(n) while the computation time is O(n^2) for a network layer. Balancing the data
transmission time and the computation time is crucial to avoid long system idle times waiting
for incoming data, or requiring large storage to buffer the data. To solve this
problem, we propose BW-SVD-DNNs, a Block-Wise Singular Value Decomposition (BW-SVD)
method to balance the processing pipeline in DNN accelerators. An optimal trade-off between the
transmission time and the computation time can be achieved with minimal inference
accuracy loss. In particular, we plan to 1) use BW-SVD to decompose large pieces of
data in the DNN models to balance the computation time and the transmission time; 2) design an
efficient acceleration engine and data control unit according to the characteristics of the data
processed by BW-SVD; and 3) use Group Lasso-based retraining to minimize the impact of
BW-SVD data decomposition and retain inference accuracy.
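
The block-wise SVD idea can be sketched as follows; with a full-rank factorization the reconstruction is exact, while a smaller `rank` trades accuracy for fewer multiply-adds per block (the sizes and helper names below are illustrative, not the proposed design):

```python
import numpy as np

def blockwise_lowrank(W, block, rank):
    """Approximate each block of W with a rank-`rank` SVD factorization.
    A rank-r block costs 2*block*r multiplies per input instead of block*block."""
    n = W.shape[0]
    factors = []
    for i in range(0, n, block):
        for j in range(0, n, block):
            B = W[i:i+block, j:j+block]
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            factors.append((i, j, U[:, :rank] * s[:rank], Vt[:rank]))
    return factors

def apply_factors(factors, x, n, block):
    """Apply the factorized blocks to an input vector, accumulating per block."""
    y = np.zeros(n)
    for i, j, Us, Vt in factors:
        y[i:i+block] += Us @ (Vt @ x[j:j+block])
    return y

rng = np.random.default_rng(4)
n, block, rank = 8, 4, 4            # rank == block: exact reconstruction
W, x = rng.normal(size=(n, n)), rng.normal(size=n)
factors = blockwise_lowrank(W, block, rank)
assert np.allclose(apply_factors(factors, x, n, block), W @ x)
```
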
Deliverable: 1) A submission to a premier ML/Architecture/FPGA conference or journal; 2) Code
and benchmarks used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Yiran Chen).

ID: D13

Title: Systolic processing unit based reconfigurable accelerator for both CNN and LSTM

Abstract: LSTM operations usually include vector-matrix multiplications in fully-connected
(fc) layers and element-wise forget/addition gate operations. Although accelerator designs for
convolutional (conv) layers have been extensively studied, there are two major differences
between the computations of fc layers and conv layers: 1) the ratios between the numbers of
weights and activations in fc and conv layers are very different; and 2) there is no
data reuse of weights and activations in fc layers when the batch size is 1, which is the
typical case in real-time applications. In this project, we propose a reconfigurable accelerator
design based on a systolic processing unit to efficiently support both CNNs and LSTMs. In
particular, the systolic processing array can flexibly switch between conv mode and fc mode. A
programmable fused on-chip buffer is introduced to adapt to the different ratios between weights
and activations in conv and fc layers. We will also explore proper quantization and
mixed-precision techniques for the target applications.
Deliverable: 1) A submission to a premier conference or journal on solid-state circuits; 2) A
processing unit design IP for CNN and LSTM.

Proposed budget: $50K.

Participating site: Duke (PI: Hai Li).

ID: D14

Title: Machine learning for datacenter performance analysis

Abstract: Datacenters deliver performance at scale but suffer from performance anomalies and
stragglers, atypically slow tasks that degrade job completion times. Although varied heuristics
and mechanisms have been proposed to mitigate stragglers, they rarely diagnose the stragglers'
root causes. We propose causal inference methods to diagnose performance anomalies at
datacenter scale. We will develop these methods for offline diagnosis as well as online
detection and mitigation.
Deliverable: (1) A manuscript submitted to a premier computer systems or architecture venue.
(2) Code and benchmarks used in the experiments.

Proposed budget: $50K.

Participating site: Duke (PI: Benjamin Lee).

ID: D15

Title: Edge computing-aided intelligent multi-user augmented reality

Abstract: Modern augmented reality (AR) applications, while already impressive, have multiple
limitations, including excessive energy consumption, restricted multi-user capabilities, and
limited adaptiveness and intelligence. Edge computing, the use of local computing resources to
bring advanced computing capabilities closer to the end users, has the potential to address all
these limitations. In this project, building on our ongoing experiments with Google ARCore,
Microsoft HoloLens, and Magic Leap One AR systems, we will develop techniques for aiding
mobile multi-user AR experiences via persistent edge computing-based applications and
persistent edge-integrated sensors.
Deliverable: 1) A submission to a premier mobile systems conference; 2) An interactive
demonstration at a premier mobile systems conference; 3) Code and data for edge-aided
augmented reality applications.

Proposed budget: $50K.

Participating site: Duke (PI: Maria Gorlatova).