Transformer-based vision models are increasingly popular, and we need better ways to interpret and visualize their predictions. Previous work has largely been limited to visualizing attention maps; we apply a Shapley-value-based method (FastSHAP) to Vision Transformers and Masked Autoencoders, comparing the results to a classical ResNet. We find that choosing ResNet as the surrogate model for FastSHAP lets us successfully interpret and visualize transformer-based vision models. We observe that the estimated Shapley values of ResNet and ViT trained on CIFAR-10 are qualitatively different, even though the models’ predictions are mostly consistent.

Keywords: Interpretability, Visualization, Shapley values, Vision Transformer, Masked Autoencoder
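To make the surrogate-based pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the FastSHAP idea, not the paper's implementation or the fastshap library API. A ResNet surrogate (assumed here to have already been trained to mimic the ViT's predictions on masked images) supplies the value function, and an explainer network is trained with the Shapley regression objective to output per-patch attributions. All names (GRID, sample_masks, fastshap_step, etc.) are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

NUM_CLASSES = 10      # CIFAR-10
GRID = 4              # 4x4 grid of maskable regions -> 16 "players"
PATCHES = GRID * GRID

def shapley_kernel_sizes(batch, device):
    """Sample coalition sizes |S| in 1..d-1 with Shapley-kernel weights."""
    d = PATCHES
    sizes = torch.arange(1, d, device=device, dtype=torch.float)
    weights = (d - 1) / (sizes * (d - sizes))
    probs = weights / weights.sum()
    return sizes[torch.multinomial(probs, batch, replacement=True)].long()

def sample_masks(batch, device):
    """Draw random coalitions S as binary masks over the patch grid."""
    sizes = shapley_kernel_sizes(batch, device)
    scores = torch.rand(batch, PATCHES, device=device)
    # Keep the |S| highest-scoring patches in each sample.
    thresh = torch.sort(scores, dim=1, descending=True).values.gather(
        1, (sizes - 1).unsqueeze(1))
    return (scores >= thresh).float()

def apply_mask(images, mask):
    """Zero out masked-off patches; mask has shape (B, PATCHES)."""
    b, _, h, w = images.shape
    m = mask.view(b, 1, GRID, GRID)
    m = F.interpolate(m, size=(h, w), mode="nearest")
    return images * m

# Surrogate: a ResNet assumed to be fine-tuned on masked images to
# reproduce the ViT's class probabilities (value function v).
surrogate = resnet18(num_classes=NUM_CLASSES)

# Explainer: predicts one Shapley estimate per patch and per class.
explainer = resnet18(num_classes=PATCHES * NUM_CLASSES)

def fastshap_step(images, optimizer):
    """One explainer training step against the frozen surrogate."""
    b, device = images.size(0), images.device
    mask = sample_masks(b, device)

    with torch.no_grad():
        v_S = surrogate(apply_mask(images, mask)).softmax(-1)       # v(S)
        v_empty = surrogate(torch.zeros_like(images)).softmax(-1)   # v(empty)
        v_full = surrogate(images).softmax(-1)                      # v(full)

    phi = explainer(images).view(b, PATCHES, NUM_CLASSES)
    # Efficiency: shift values so they sum to v(full) - v(empty).
    gap = v_full - v_empty - phi.sum(dim=1)
    phi = phi + gap.unsqueeze(1) / PATCHES

    # Weighted least-squares Shapley regression objective.
    pred = v_empty + torch.einsum("bp,bpc->bc", mask, phi)
    loss = F.mse_loss(pred, v_S)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a setting like the paper's, the surrogate would first be trained to match the transformer's predictions on randomly masked CIFAR-10 images; only then is the explainer trained, and its per-patch outputs can be upsampled and overlaid on the input image for visualization.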
