Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators
General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (...
Coherence Attacks and Countermeasures in Interposer-based Chiplet Systems
Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, and so on via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect ...
A Concise Concurrent B+-Tree for Persistent Memory
Persistent memory (PM) presents a unique opportunity for designing data management systems that offer improved performance, scalability, and instant restart capability. As a widely used data structure for managing data in such systems, B+-Tree must ...
An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs
Resource-efficient Convolutional Neural Networks (CNNs) are gaining more attention. These CNNs have relatively low computational and memory requirements. A common denominator among such CNNs is having more heterogeneity than traditional CNNs. This ...
Assessing the Impact of Compiler Optimizations on GPUs Reliability
Graphics Processing Unit (GPU) compilers have evolved in order to support general-purpose programming languages for multiple architectures. NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex ...
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey
Performance in scientific and engineering applications, such as computational physics, algebraic graph problems, or Convolutional Neural Networks (CNN), is dominated by the manipulation of large sparse matrices—matrices with a large number of zero elements. ...
An Instruction Inflation Analyzing Framework for Dynamic Binary Translators
- Benyi Xie,
- Yue Yan,
- Chenghao Yan,
- Sicheng Tao,
- Zhuangzhuang Zhang,
- Xinyu Li,
- Yanzhi Lan,
- Xiang Wu,
- Tianyi Liu,
- Tingting Zhang,
- Fuxin Zhang
Dynamic binary translators (DBTs) are widely used to migrate applications between different instruction set architectures (ISAs). Despite extensive research to improve DBT performance, noticeable overhead remains, preventing near-native performance, ...
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum
The edge to data center computing continuum is the aggregation of computing resources located anywhere between the network edge (e.g., close to 5G antennas), and servers in traditional data centers. Kubernetes is the de facto standard for the ...
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs
Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing ...
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching
“Learned” admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache ...
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources
FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization, by multiplexing customers on a single device, similar to how processing ...
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small but fast regions to cache frequently accessed data, thereby reducing the average latency. ...
The Droplet Search Algorithm for Kernel Scheduling
Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and ...
Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces
Trace-based simulation is a widely used methodology for system design exploration. It relies on realistic traces that represent a range of behaviors necessary to be evaluated, containing a lot of information about the application, its inputs and the ...
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture
- Chengying Huan,
- Yongchao Liu,
- Heng Zhang,
- Shuaiwen Song,
- Santosh Pandey,
- Shiyang Chen,
- Xiangfei Fang,
- Yue Jin,
- Baptiste Lepers,
- Yanjun Wu,
- Hang Liu
Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a ...
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication
The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm ...
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage
Object storage has been widely used in the cloud. Traditionally, the size of object metadata is much smaller than that of object data, and thus existing object storage systems (such as Ceph and Oasis) can place object data and metadata, respectively, on ...
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals
Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals enhance the performance of applications. Fusing multiple traversals improves the locality of the application. The runtime of an ...