Cross-Hardware Optimization Strategies for Large-Scale Recommendation Model Inference in Production Systems
DOI:
https://doi.org/10.71465/fair524

Keywords:
recommendation systems, cross-hardware optimization, model inference, heterogeneous computing, neural architecture search, model compression, production systems, GPU acceleration

Abstract
Large-scale recommendation systems have become indispensable components of modern digital platforms, processing billions of user interactions daily to deliver personalized content and services. The computational demands of recommendation model inference in production environments present significant challenges, particularly when deploying across heterogeneous hardware architectures. This review examines cross-hardware optimization strategies for large-scale recommendation model inference, focusing on techniques that enable efficient deployment across graphics processing units (GPUs), central processing units (CPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). We systematically analyze recent advances in model compression, including quantization and pruning techniques specifically designed for recommendation models. The paper explores hardware-aware neural architecture search (NAS) methods that optimize model structures for target hardware platforms while maintaining prediction accuracy. We investigate dynamic resource allocation strategies and load balancing mechanisms that improve throughput in multi-device production systems. Additionally, we examine emerging heterogeneous computing frameworks that enable seamless model deployment across diverse hardware infrastructures. Our analysis reveals that successful cross-hardware optimization requires careful consideration of model architecture, hardware characteristics, and system-level constraints. The review identifies critical research gaps in real-time inference optimization, automated hardware selection, and energy-efficient deployment strategies. We conclude that integrated optimization approaches combining multiple techniques offer the most promising path toward efficient large-scale recommendation system deployment in heterogeneous production environments.
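As an illustrative sketch only (not drawn from the reviewed works), the snippet below shows one of the model compression techniques the abstract mentions: post-training dynamic int8 quantization of the dense layers in a toy ranking model using PyTorch. The model class ToyRankingModel, its layer sizes, and the example inputs are hypothetical assumptions introduced here for illustration, not the authors' implementation.

```python
# Minimal sketch of dynamic int8 quantization for a toy recommendation model.
# All names and sizes are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn

class ToyRankingModel(nn.Module):
    def __init__(self, num_items=100_000, embed_dim=64):
        super().__init__()
        # Sparse feature path: a single item-id embedding table, sum-pooled per user.
        self.item_embedding = nn.EmbeddingBag(num_items, embed_dim, mode="sum")
        # Dense interaction MLP producing a single ranking logit.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, item_ids, offsets):
        pooled = self.item_embedding(item_ids, offsets)
        return self.mlp(pooled)

model = ToyRankingModel().eval()

# Dynamic quantization stores Linear weights in int8 and quantizes activations
# on the fly; it mainly reduces memory and speeds up CPU-side inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Example inference on two users with variable-length item-id histories.
item_ids = torch.tensor([3, 41, 7, 19, 5])
offsets = torch.tensor([0, 2])  # user 0 -> items [3, 41]; user 1 -> [7, 19, 5]
with torch.no_grad():
    scores = quantized(item_ids, offsets)
print(scores.shape)  # torch.Size([2, 1])
```

In this sketch only the MLP is quantized while the embedding table remains in floating point, reflecting the common trade-off that embedding lookups are memory-bound and are often compressed with separate, table-specific schemes.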
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.