Combine And Conquer: Representation Learning From Multimodal Data