Urban Sustainability Assessment through Multi-Modal Learning: A Vision Transformer–Graph Neural Network Framework Integrating Remote Sensing and Economic Indicators
DOI:
https://doi.org/10.71465/fair722
Keywords:
Urban sustainability, Remote sensing, Vision Transformer, Graph Neural Network, Multi-modal fusion, Smart city, Data-driven governance
Abstract
Urban sustainability assessment is a crucial challenge in balancing economic growth with environmental protection in modern cities. Traditional statistical evaluation methods often overlook the spatial heterogeneity and environmental patterns that remote sensing imagery can capture. To address this limitation, this study proposes a multi-modal deep learning framework that integrates high-resolution remote sensing data with socioeconomic indicators for comprehensive urban sustainability assessment. Specifically, a Vision Transformer (ViT) extracts fine-grained spatial and environmental representations (such as vegetation coverage, surface temperature, and built-up density) from Sentinel-2 satellite imagery, while a Graph Neural Network (GNN) models the spatial and economic dependencies between cities, enabling cross-modal and inter-city information fusion. The proposed ViT–GNN framework captures both environmental and socioeconomic dynamics to generate a composite sustainability score. Experiments on a dataset covering 120 major Chinese cities from 2019 to 2023 show that the model achieves an MSE of 0.012, an MAE of 0.071, and an R² of 0.931, outperforming existing regression and CNN-based baselines. The results indicate that the model can accurately evaluate urban sustainability levels, providing an interpretable, data-driven tool that policymakers and planners can use to support sustainable urban development, resource allocation, and green policy formulation.
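For readers unfamiliar with the reported error metrics, the following is a minimal sketch of how MSE, MAE, and a coefficient of determination are conventionally computed for a regression target such as a sustainability score. It assumes the third reported metric is the standard R²; the function name `regression_metrics` and the sample scores are hypothetical illustrations, not the authors' code or data.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error and mean absolute error.
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R^2 = 1 - SS_res / SS_tot, undefined when the targets are constant.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mse, mae, r2


# Hypothetical sustainability scores in [0, 1]; NOT data from the study.
scores_true = [0.62, 0.48, 0.75, 0.55]
scores_pred = [0.60, 0.50, 0.70, 0.58]

mse, mae, r2 = regression_metrics(scores_true, scores_pred)
print(f"MSE={mse:.5f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Because MSE squares each error, it penalizes large deviations more heavily than MAE, which is one reason the two are often reported together alongside R².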
License
Copyright (c) 2026 Jiajiang Shen, Huaiyu Wang (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.