Publications

Journal Papers

T5. Theseus: Exploring Efficient Wafer-Scale Chip Design for Large Language Models

T4. DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

T3. Fine-Grained Structured Sparse Computing for FPGA-Based AI Inference

T2. TSCompiler: Efficient Compilation Framework for Dynamic-Shape Models

T1. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Conference Papers

C28. H^2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-Based Low-Batch LLM Inference

C27. DATIS: DRAM Architecture and Technology Integrated Simulation

C26. Tb-STC: Transposable Block-Wise N:M Structured Sparse Tensor Core

C25. SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU

C24. Oltron: Algorithm-Hardware Co-design for Outlier-Aware Quantization of LLMs with Inter-/Intra-Layer Adaptation

C23. Amanda: Unified Instrumentation Framework for Deep Neural Networks

C22. Cambricon-R: A Fully Fused Accelerator for Real-Time Learning of Neural Scene Representation

C21. RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration

C20. OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization

C19. Nesting Forward Automatic Differentiation for Memory-Efficient Deep Neural Network Training

C18. ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization

C17. SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation

C16. Dual-Side Sparse Tensor Core

C15. Boosting Mobile CNN Inference through Semantic Memory

C14. SCYLLA: QoE-Aware Continuous Mobile Vision with FPGA-Based Dynamic Deep Neural Network Reconfiguration

C13. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

C12. Live Video Analytics with FPGA-Based Smart Cameras

C11. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

C10. Balanced Sparsity for Efficient DNN Inference on GPU

C9. SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization

C8. Best-Effort FPGA Programming: A Few Steps Can Go a Long Way

C7. Using Data Compression for Optimizing FPGA-Based Convolutional Neural Network Accelerators

C6. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster

C5. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

C4. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks

C3. An Efficient Design and Implementation of LSM-Tree Based Key-Value Store on Open-Channel SSD

C2. Memory Partitioning for Multidimensional Arrays in High-Level Synthesis

C1. Automatic Multidimensional Memory Partitioning for FPGA-Based Accelerators

Patents