The transformer in the VQGAN paper is separate from the autoencoder; it is only used to predict the quantized tokens. So no, you don't need to slide a window with the U-Net: you can directly generate images with different aspect ratios and resolutions, as with Stable Diffusion (SD).
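
To make the "no sliding window" point concrete, here is a minimal sketch in plain Python. It assumes the relevant property is that convolutional layers reuse the same weights at every spatial position, so the input size is not baked into the parameters. The `conv2d_same` helper, the kernel values, and the grid sizes are all made up for illustration; this is not code from VQGAN or SD.

```python
def conv2d_same(image, kernel):
    """Apply one 2D kernel with zero padding, so output size equals input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    # Positions outside the grid count as zero padding.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def shape(grid):
    """Return (height, width) of a nested-list grid."""
    return (len(grid), len(grid[0]))


# Hypothetical 3x3 kernel: the "weights" are fixed and independent of input size.
kernel = [
    [0.0, 0.125, 0.0],
    [0.125, 0.5, 0.125],
    [0.0, 0.125, 0.0],
]

square = [[1.0] * 8 for _ in range(8)]   # 8x8 grid
wide = [[1.0] * 12 for _ in range(4)]    # 4x12 grid, different aspect ratio

print(shape(conv2d_same(square, kernel)))  # (8, 8)
print(shape(conv2d_same(wide, kernel)))    # (4, 12)
```

The same kernel handles both grids in one pass, with no tiling or windowing. A real U-Net also has downsampling stages, so in practice the height and width usually need to be multiples of the total downsampling factor.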