Publications

Journal Papers

T5. Theseus: Exploring Efficient Wafer-Scale Chip Design for Large Language Models

T4. DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

T3. Fine-Grained Structured Sparse Computing for FPGA-Based AI Inference

T2. TSCompiler: Efficient Compilation Framework for Dynamic-Shape Models

T1. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Conference Papers

C28. H^2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-Based Low-Batch LLM Inference

C27. DATIS: DRAM Architecture and Technology Integrated Simulation

C26. Tb-STC: Transposable Block-Wise N:M Structured Sparse Tensor Core

C25. SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU

C24. Oltron: Algorithm-Hardware Co-design for Outlier-Aware Quantization of LLMs with Inter-/Intra-Layer Adaptation

C23. Amanda: Unified Instrumentation Framework for Deep Neural Networks

C22. Cambricon-R: A Fully Fused Accelerator for Real-Time Learning of Neural Scene Representation

C21. RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration

C20. OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization

C19. Nesting Forward Automatic Differentiation for Memory-Efficient Deep Neural Network Training

C18. ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization

C17. SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation

C16. Dual-Side Sparse Tensor Core

C15. Boosting Mobile CNN Inference through Semantic Memory

C14. SCYLLA: QoE-Aware Continuous Mobile Vision with FPGA-Based Dynamic Deep Neural Network Reconfiguration

C13. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

C12. Live Video Analytics with FPGA-Based Smart Cameras

C11. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

C10. Balanced Sparsity for Efficient DNN Inference on GPU

C9. SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization

C8. Best-Effort FPGA Programming: A Few Steps Can Go a Long Way

C7. Using Data Compression for Optimizing FPGA-Based Convolutional Neural Network Accelerators

C6. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster

C5. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

C4. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks

C3. An Efficient Design and Implementation of LSM-Tree Based Key-Value Store on Open-Channel SSD

C2. Memory Partitioning for Multidimensional Arrays in High-Level Synthesis

C1. Automatic Multidimensional Memory Partitioning for FPGA-Based Accelerators

Patents