Publications

You can also find my articles on my Google Scholar profile.

First-Author Papers

Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS21
Conditional Positional Encodings for Vision Transformers, ICLR23
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV24
GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR26
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV21
Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV20
DARTS-: Robustly stepping out of performance collapse without indicators, ICLR21
ROME: Robustifying memory-efficient NAS via topology disentanglement and gradients accumulation, ICCV23
Make RepVGG Greater Again: A Quantization-aware Approach, AAAI24
MixPath: A unified approach for one-shot neural architecture search, ICCV23
USP: Unified self-supervised pretraining for image generation and understanding, ICCV25
Noisy differentiable architecture search, BMVC21
A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC22
Multi-objective reinforced evolution in mobile neural architecture search, ECCVW20
Fast, accurate and lightweight super-resolution with neural architecture search, ICPR20
MoGA: Searching beyond MobileNetV3, ICASSP20
Scarlet-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCVW21
Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning
Improved crowding distance for NSGA-II
Policy optimization with penalized point probability distance: An alternative to PPO

Collaborative Papers

Image Generation & Editing

Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers, CVPR26
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing, CVPR26
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
From editor to dense geometry estimator, CVPR26
RAGSR: Regional attention guided diffusion for image super-resolution
S-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR26
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV25
FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
FlowDreamer: exploring high fidelity text-to-3D generation via rectified flow
TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP26
Accurate and efficient single image super-resolution with matrix channel attention network, ACCV20

Video Generation & Understanding

Video-CoE: Reinforcing Video Event Prediction via Chain of Events, CVPR26
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on, CVPR26 Findings
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI26
Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR26
Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI26
NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR26
VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV25
FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM25
Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP26
Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP26
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR22

LLM Reasoning & Agents

Code2World: A GUI World Model via Renderable Code Generation
Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR26
AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI26
Tree search for LLM agent reinforcement learning, ICLR26
AutoDrive-R: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR26
Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP25 oral
HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP25 oral
Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP26

Multimodal & Vision-Language

Detection, Segmentation & 3D Perception

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV25
PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR25
SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI24
FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
YOLOv6 v3.0: A full-scale reloading
AeDet: Azimuth-invariant multi-view 3D object detection, CVPR23
SegViT: Semantic segmentation with plain vision transformers, NeurIPS22
YOLOv6: A single-stage object detection framework for industrial applications, arXiv
Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS22
PromptDet: Towards open-vocabulary detection using uncurated images, ECCV22
CCTrans: Simplifying and improving crowd counting with transformer

Foundation Model Architectures

Model Compression & AutoML

LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR24
Masked Autoencoders Are Robust Neural Architecture Search Learners
A Speed Odyssey for Deployable Quantization of LLMs
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI24
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC22
DAAS: Differentiable architecture and augmentation policy search
AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP21
Neural Architecture Search on Acoustic Scene Classification, InterSpeech20

Maps, Mobility & Recommendation