Publications
You can also find my articles on my Google Scholar profile.
First-Author Papers
- Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS21
- Conditional Positional Encodings for Vision Transformers, ICLR23
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV24
- GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR26
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV21
- Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV20
- DARTS-: Robustly stepping out of performance collapse without indicators, ICLR21
- ROME: Robustifying memory-efficient NAS via topology disentanglement and gradients accumulation, ICCV23
- Make RepVGG Greater Again: A Quantization-aware Approach, AAAI24
- MixPath: A unified approach for one-shot neural architecture search, ICCV23
- USP: Unified self-supervised pretraining for image generation and understanding, ICCV25
- Noisy differentiable architecture search, BMVC21
- A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC22
- Multi-objective reinforced evolution in mobile neural architecture search, ECCVW20
- Fast, accurate and lightweight super-resolution with neural architecture search, ICPR20
- MoGA: Searching beyond MobileNetV3, ICASSP20
- Scarlet-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCVW21
- Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning
- Improved crowding distance for NSGA-II
- Policy optimization with penalized point probability distance: An alternative to PPO
Collaborative Papers
- Code2World: A GUI World Model via Renderable Code Generation
- Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR26
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models, ICLR26
- There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training, ICLR26
- Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR26
- Tree search for LLM agent reinforcement learning, ICLR26
- AutoDrive-R: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR26
- S-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR26
- NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR26
- Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP26
- Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP26
- Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP26
- TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP26
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning
- Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
- Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
- Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
- Semantic Context Matters: Improving Conditioning for Autoregressive Models
- Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning, AAAI26
- SCALAR: Scale-wise controllable visual autoregressive learning, AAAI26
- Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI26
- AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI26
- ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI26
- IntSR: An integrated generative framework for search and recommendation
- From editor to dense geometry estimator
- RAGSR: Regional attention guided diffusion for image super-resolution
- Comprehensive Comparison Network: A framework for locality-aware, routes-comparable and interpretable route recommendation
- UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning
- Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
- FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
- Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP25 oral
- HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP25 oral
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV25
- VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV25
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV25
- FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM25
- Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition, ECCV24
- Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness, ICML24
- PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution, CVPR24
- LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR24
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI24
- YOLOv6: A single-stage object detection framework for industrial applications, arXiv
- A Speed Odyssey for Deployable Quantization of LLMs
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models
- Lenna: Language Enhanced Reasoning Detection Assistant, ICASSP25
- SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI24
- PromptDet: Towards open-vocabulary detection using uncurated images, ECCV22
- SegViT: Semantic segmentation with plain vision transformers, NeurIPS22
- Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS22
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR22
- AeDet: Azimuth-invariant multi-view 3D object detection, CVPR23
- EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC22
- AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP21
- Neural Architecture Search on Acoustic Scene Classification, InterSpeech20
- Accurate and efficient single image super-resolution with matrix channel attention network, ACCV20
- STRETCH meat grinder with ICCOS, IEEE Transactions on Plasma Science
- Comparisons of three inductive pulse power supplies, IEEE Transactions on Plasma Science
- FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
- Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
- Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
- MMGenBench: Evaluating the limits of LMMs from the text-to-image generation perspective
- FlowDreamer: Exploring high fidelity text-to-3D generation via rectified flow
- PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR25
- AdaFedFR: Federated face recognition with adaptive inter-class representation learning
- DSFNet: Learning Disentangled Scenario Factorization for Multi-Scenario Route Ranking, WWW25
- Masked Autoencoders Are Robust Neural Architecture Search Learners
- EfficientRep: An efficient RepVGG-style ConvNets with hardware-aware neural network design
- YOLOv6 v3.0: A full-scale reloading
- DAAS: Differentiable architecture and augmentation policy search
- CCTrans: Simplifying and improving crowd counting with transformer
