Publications
You can also find my articles on my Google Scholar profile.
First-Author Papers
- Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS21
- Conditional Positional Encodings for Vision Transformers, ICLR23
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV24
- GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR26
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV21
- Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV20
- DARTS-: Robustly stepping out of performance collapse without indicators, ICLR21
- ROME: Robustifying memory-efficient NAS via topology disentanglement and gradient accumulation, ICCV23
- Make RepVGG Greater Again: A Quantization-aware Approach, AAAI24
- MixPath: A unified approach for one-shot neural architecture search, ICCV23
- USP: Unified self-supervised pretraining for image generation and understanding, ICCV25
- Noisy differentiable architecture search, BMVC21
- A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC22
- Multi-objective reinforced evolution in mobile neural architecture search, ECCVW20
- Fast, accurate and lightweight super-resolution with neural architecture search, ICPR20
- MoGA: Searching beyond MobileNetV3, ICASSP20
- Scarlet-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCVW21
- Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning
- Improved crowding distance for NSGA-II
- Policy optimization with penalized point probability distance: An alternative to PPO
Collaborative Papers
Image Generation & Editing
- Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers, CVPR26
- Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
- From Scale to Speed: Adaptive Test-Time Scaling for Image Editing, CVPR26
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
- From editor to dense geometry estimator, CVPR26
- RAGSR: Regional attention guided diffusion for image super-resolution
- S-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR26
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV25
- FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
- Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
- FlowDreamer: Exploring high fidelity text-to-3D generation via rectified flow
- TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP26
- Accurate and efficient single image super-resolution with matrix channel attention network, ACCV20
Video Generation & Understanding
- Video-CoE: Reinforcing Video Event Prediction via Chain of Events, CVPR26
- Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
- Eevee: Towards Close-up High-resolution Video-based Virtual Try-on, CVPR26 Findings
- ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI26
- Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR26
- Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI26
- NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR26
- VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV25
- FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM25
- Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP26
- Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP26
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR22
LLM Reasoning & Agents
- Code2World: A GUI World Model via Renderable Code Generation
- Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
- Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR26
- AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI26
- Tree search for LLM agent reinforcement learning, ICLR26
- AutoDrive-R: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR26
- Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP25 oral
- HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP25 oral
- Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP26
Multimodal & Vision-Language
- What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
- Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models, ICLR26
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning, ICLR26
- Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning, AAAI26
- UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning
- Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
- MMGenBench: Evaluating the limits of LMMs from the text-to-image generation perspective
- Lenna: Language Enhanced Reasoning Detection Assistant, ICASSP25
Detection, Segmentation & 3D Perception
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV25
- PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR25
- SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI24
- FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
- YOLOv6 v3.0: A full-scale reloading
- AeDet: Azimuth-invariant multi-view 3D object detection, CVPR23
- SegViT: Semantic segmentation with plain vision transformers, NeurIPS22
- YOLOv6: A single-stage object detection framework for industrial applications, arXiv
- Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS22
- PromptDet: Towards open-vocabulary detection using uncurated images, ECCV22
- CCTrans: Simplifying and improving crowd counting with transformer
Foundation Model Architectures
- FASA: Frequency-Aware Sparse Attention, ICLR26
- AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?
- Semantic Context Matters: Improving Conditioning for Autoregressive Models
- There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training, ICLR26
- SCALAR: Scale-wise controllable visual autoregressive learning, AAAI26
- Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition, ECCV24
- Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness, ICML24
- PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution, CVPR24
- EfficientRep: An efficient RepVGG-style ConvNets with hardware-aware neural network design
Model Compression & AutoML
- LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR24
- Masked Autoencoders Are Robust Neural Architecture Search Learners
- A Speed Odyssey for Deployable Quantization of LLMs
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI24
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models
- EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC22
- DAAS: Differentiable architecture and augmentation policy search
- AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP21
- Neural Architecture Search on Acoustic Scene Classification, InterSpeech20
Maps, Mobility & Recommendation
- MobilityBench: A Scalable Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
- IntRR: A Framework for Integrating SID Redistribution and Length Reduction for Generative Recommendation
- IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation
- GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation
- SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation
- Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
- IntSR: An integrated generative framework for search and recommendation
- Comprehensive Comparison Network: A framework for locality-aware, routes-comparable and interpretable route recommendation
- Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
- DSFNet: Learning Disentangled Scenario Factorization for Multi-Scenario Route Ranking, WWW25
