iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
— AK (@_akhaliq) June 18, 2022
abs: https://t.co/dkZUd4QC81 pic.twitter.com/pJFpxd7ckU
Disentangling visual and written concepts in CLIP
— AK (@_akhaliq) June 18, 2022
abs: https://t.co/VsyuDV4HNI
project page: https://t.co/2hTQnhR2o1 pic.twitter.com/LbWpnpTTHT
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach the network to be uncertain on them: https://t.co/lJf5aVv3Jr
— Sergey Levine (@svlevine) June 17, 2022
A thread: pic.twitter.com/pHj76WHnUA
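The thread describes RCAD only at a high level: ascend the loss to manufacture hard inputs, then train the model toward uncertainty on them. A minimal NumPy sketch of that two-step idea, assuming a toy linear softmax classifier (the variable names, step size `eps`, and uniform-distribution target are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy linear classifier p(y|x) = softmax(W x); purely illustrative.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 classes, 4 input features
x = rng.normal(size=4)
y = 1                         # true label

# 1) Generate a "hard" example: one gradient-ascent step on the
#    cross-entropy loss w.r.t. the input (grad_x CE = W^T (p - onehot)).
p = softmax(W @ x)
grad_x = W.T @ (p - np.eye(3)[y])
eps = 0.5
x_adv = x + eps * grad_x / (np.linalg.norm(grad_x) + 1e-12)

# 2) "Unlearning" loss on the hard example: push predictions toward the
#    uniform distribution, i.e. cross-entropy against 1/K per class.
p_adv = softmax(W @ x_adv)
uniform = np.full(3, 1.0 / 3.0)
unlearn_loss = -(uniform * np.log(p_adv + 1e-12)).sum()

# Sanity check: the ascent step makes the true-label loss worse.
ce = lambda q: -np.log(q[y] + 1e-12)
assert ce(p_adv) >= ce(p)
```

In training, `unlearn_loss` would be added to the ordinary supervised loss and the perturbation recomputed per batch; the sketch above only shows a single input to keep the gradient algebra visible.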
Action recognition is a challenging task but has seen improvements through multimodality. In this latest blog post, learn how @Disney uses PyTorch to improve activity recognition through multimodal approaches. https://t.co/Xqy79N7Eh9 pic.twitter.com/t5kJq5msxx
— PyTorch (@PyTorch) June 17, 2022
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
— AK (@_akhaliq) June 17, 2022
abs: https://t.co/SunKnQH3NJ
project page: https://t.co/MgIrCZlQJv pic.twitter.com/yMHmZmk3J2
Efficient Decoder-free Object Detection with Transformers
— AK (@_akhaliq) June 15, 2022
abs: https://t.co/YW4QcqztiW
Experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with a 28% computation cost reduction and more than 10× fewer training epochs. pic.twitter.com/ICOvqgA8xQ
Peripheral Vision Transformer
— AK (@_akhaliq) June 15, 2022
abs: https://t.co/c6R8BfNDPS
Proposes incorporating peripheral position encoding into the multi-head self-attention layers, letting the network learn from training data to partition the visual field into diverse peripheral regions. pic.twitter.com/S78e7WXDKh
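The tweet gives only the gist; one common way such a position encoding can enter self-attention is as a bias on the attention logits that depends on query-key distance. A rough NumPy sketch under that assumption (the distance buckets, bias values, and grid size are illustrative, not the paper's actual parameterization):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy 4x4 grid of tokens. Bucket query-key distances into "rings" and add a
# per-ring bias to the attention logits, so near vs. far (peripheral) tokens
# can be weighted differently; in practice the bias would be learned.
H, W, d = 4, 4, 8
n = H * W
coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
                  axis=-1).reshape(n, 2)
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
rings = np.digitize(dist, bins=[1.0, 2.0, 3.0])    # 4 distance buckets
ring_bias = np.array([0.5, 0.0, -0.5, -1.0])       # illustrative bias values

rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
q = x @ rng.normal(size=(d, d))
k = x @ rng.normal(size=(d, d))
v = x

logits = q @ k.T / np.sqrt(d) + ring_bias[rings]   # content + peripheral bias
attn = softmax(logits)                             # rows sum to 1
out = attn @ v
```

A multi-head version would simply keep a separate bias table per head, which is what lets different heads specialize to different peripheral regions.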
DETR++: Taming Your Multi-Scale Detection Transformer
— AK (@_akhaliq) June 8, 2022
abs: https://t.co/kOQ5V4vC3C
DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction over existing baselines pic.twitter.com/Kt3EQRXwuH
Yes, machine learning is everywhere. But this is one application where it really delivers, my ugly handwriting and all. (Fun fact: my students implemented and live-demoed something similar as their class project.) pic.twitter.com/6VU8BsGcFX
— Sebastian Raschka (@rasbt) June 6, 2022
EfficientFormer: Vision Transformers at MobileNet Speed
— AK (@_akhaliq) June 3, 2022
abs: https://t.co/lfdbJdS46J
EfficientFormer-L1 achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12, which is even a bit faster than MobileNetV2 (1.7 ms, 71.8% top-1). pic.twitter.com/4zcvunXe57
X-ViT: High Performance Linear Vision Transformer without Softmax
— AK (@_akhaliq) May 30, 2022
abs: https://t.co/A6HZ2vXKDB pic.twitter.com/kArY0Tm4VE
GIT: A Generative Image-to-text Transformer for Vision and Language
— AK (@_akhaliq) May 30, 2022
abs: https://t.co/iFly0pcoXM
The model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr). pic.twitter.com/vn9LV98Dwr