Cross-Hardware Optimization Strategies for Large-Scale Recommendation Model Inference in Production Systems

Authors

  • Zijian Shen, Carnegie Mellon University, United States
  • Zimeng Wang, New England College, United States
  • Yang Liu, Worcester Polytechnic Institute, United States

DOI:

https://doi.org/10.71465/fair524

Keywords:

recommendation systems, cross-hardware optimization, model inference, heterogeneous computing, neural architecture search, model compression, production systems, GPU acceleration

Abstract

Large-scale recommendation systems have become indispensable components of modern digital platforms, processing billions of user interactions daily to deliver personalized content and services. The computational demands of recommendation model inference in production environments present significant challenges, particularly when deploying across heterogeneous hardware architectures. This review examines cross-hardware optimization strategies for large-scale recommendation model inference, focusing on techniques that enable efficient deployment across graphics processing units (GPUs), central processing units (CPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). We systematically analyze recent advances in model compression, including quantization and pruning techniques specifically designed for recommendation models. We explore hardware-aware neural architecture search (NAS) methods that optimize model structures for target hardware platforms while maintaining prediction accuracy. We investigate dynamic resource allocation strategies and load balancing mechanisms that improve throughput in multi-device production systems. Additionally, we examine emerging heterogeneous computing frameworks that enable seamless model deployment across diverse hardware infrastructures. Our analysis reveals that successful cross-hardware optimization requires careful consideration of model architecture, hardware characteristics, and system-level constraints. The review identifies critical research gaps in real-time inference optimization, automated hardware selection, and energy-efficient deployment strategies. We conclude that integrated optimization approaches combining multiple techniques offer the most promising path toward efficient large-scale recommendation system deployment in heterogeneous production environments.
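To illustrate the model compression techniques surveyed in the abstract, the sketch below applies symmetric per-row int8 post-training quantization to a toy embedding table. This is a minimal, generic example of the technique, not an implementation from any system reviewed in the paper; the table shape and all function names are hypothetical.

```python
import numpy as np

# Hypothetical embedding table: 1000 items, 64-dim float32 vectors.
rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 64)).astype(np.float32)

def quantize_per_row(weights):
    """Symmetric int8 quantization with one scale per embedding row."""
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    q = np.round(weights / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    """Recover approximate float32 weights from int8 codes and scales."""
    return q.astype(np.float32) * scales

q, scales = quantize_per_row(table)
recon = dequantize(q, scales)

# int8 storage is 4x smaller than float32, at the cost of a bounded
# per-weight rounding error of at most half a quantization step.
print("max abs error:", float(np.abs(table - recon).max()))
```

Per-row scaling is the usual choice for embedding tables because row norms vary widely across items; a single global scale would waste precision on rarely updated rows.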

Published

2025-12-11