Computer Vision

Segmentation

Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

PortraitNet: Real-time Portrait Segmentation Network for Mobile Device

Real-time Hair Segmentation and Recoloring on Mobile GPUs

TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss

SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder

(PP-LiteSeg) A Superior Real-Time Semantic Segmentation Model

Object Detection

Scaled-YOLOv4: Scaling Cross Stage Partial Network

Pose Estimation

(OpenPose) Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Image Classification

(Background Splitting) Finding Rare Classes in a Sea of Background

Image Inpainting

(PiiGAN) Generative Adversarial Networks for Pluralistic Image Inpainting

Recurrent Feature Reasoning for Image Inpainting

Image Editing

Spatially-invariant Style-codes Controlled Makeup Transfer

Adaptive semantic attribute decoupling for precise face image editing

(Arbitrary Facial Attribute Editing) Only Change What You Want

Face Swap

(SimSwap) An Efficient Framework For High Fidelity Face Swapping

(MobileFaceSwap) A Lightweight Framework for Video Face Swapping

(MobileFSGAN) Migrating Face Swap to Mobile Devices: A Lightweight Framework and a Supervised Training Solution

A New Face Swap Method for Image and Video Domains: A Technical Report

(Smooth-Swap) A Simple Enhancement for Face-Swapping with Smoothness

Region-Aware Face Swapping

(GHOST) A New Face Swap Approach for Image and Video Domains

Video Generation

PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

(MakeItTalk) Speaker-Aware Talking-Head Animation

First Order Motion Model for Image Animation

(DaGAN) Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Thin-Plate Spline Motion Model for Image Animation

(SadTalker) Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Diffusion Model

(InstructPix2Pix) Learning to Follow Image Editing Instructions

High-Resolution Image Synthesis with Latent Diffusion Models

Null-text Inversion for Editing Real Images using Guided Diffusion Models

Volume Rendering

(NeRF) Representing Scenes as Neural Radiance Fields for View Synthesis

(R2L) Distilling Neural Radiance Field to Neural Light Field for Efficient Novel View Synthesis

Real-Time Neural Light Field on Mobile Devices

(Instant-NGP) Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

(MobileNeRF) Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

(Re-ReND) Real-time Rendering of NeRFs across Devices

(BakedSDF) Meshing Neural SDFs for Real-Time View Synthesis

Virtual Try On

(ARShoe) Real-Time Augmented Reality Shoe Try-on System on Smartphones


Large Language Model

(Video-LLaVA) Learning United Visual Representation by Alignment Before Projection

(ChipNeMo) Domain-Adapted LLMs for Chip Design

Continual Pre-Training of Language Models

(Reuse, Don’t Retrain) A Recipe for Continued Pretraining of Language Models


Natural Language

Text-to-Speech

(YourTTS) Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

(VITS) Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

(NaturalSpeech 2) Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

(NaturalSpeech) End-to-End Text to Speech Synthesis with Human-Level Quality

Voice Conversion

Voice Conversion With Just Nearest Neighbors

Low-Latency Real-Time Voice Conversion on CPU

(QuickVC) Any-To-Many Voice Conversion Using Inverse Short-Time Fourier Transform for Faster Conversion

Speech Recognition

(Whisper) Robust Speech Recognition via Large-Scale Weak Supervision

(WhisperX) Time-Accurate Speech Transcription of Long-Form Audio

Music Fingerprinting

(SpectroMap) Peak detection algorithm for audio fingerprinting

Music Augmentation and Denoising for Peak-Based Audio Fingerprinting


Fundamental

Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Searching for MobileNetV3

Supervised Contrastive Learning

(Wavelet Knowledge Distillation) Towards Efficient Image-to-Image Translation

(Teachers Do More Than Teach) Compressing Image-to-Image Models

Coordinate Attention for Efficient Mobile Network Design

Image Augmentations for GAN Training

Improved Consistency Regularization for GANs

(GraN-GAN) Piecewise Gradient Normalization for Generative Adversarial Networks

Towards Faster and Stabilized GAN Training for High-Fidelity Few-Shot Image Synthesis

(GAN Compression) Efficient Architectures for Interactive Conditional GANs

Improving GANs with A Dynamic Discriminator

Systematic Analysis and Removal of Circular Artifacts for StyleGAN