Towards Efficient And Effective Representation Learning For Image And Video Understanding