Towards Multilingual Vision-Language Models