Publications

My publications follow a long arc from neural architecture search and efficient vision backbones to multimodal foundation models, LLM reasoning, world models, and spatial intelligence. Recent AMAP-ML work is designed to connect top-tier research, reproducible open source, and AMAP products serving 300M+ users every day.

120+ research papers and preprints
15K+ Google Scholar citations
6K+ citations from first-authored works
30+ open-source AMAP-ML projects

You can also find my articles on my Google Scholar profile.


Representative Works

Representative works are intentionally weighted toward first-author contributions, because they best show my own research taste, technical judgment, and long-term arc. A smaller set of team-led and open-source systems is included to show how that arc scales through engineering leadership, released code, and product-facing AI.

First-Author Representative Works

Reasoning RL
A minimal reinforcement-learning baseline for model reasoning: no critic, no reference model, no KL penalty. Adopted by ByteDance's VERL as an official algorithm.
ICLR 2026 · First Author · code

Generation
Unified self-supervised pretraining that bridges image generation and understanding, continuing the first-author line from efficient architectures to foundation models.
ICCV 2025 · First Author · code

Architecture
A unified LLaMA-style backbone for vision tasks, introducing auto-scaling 2D RoPE for multimodal Transformers across generation, classification, segmentation, and detection.
ECCV 2024 · First Author · code

Mobile VLM
A compact open vision-language assistant designed for real-time on-device deployment, with follow-up V2 work improving the mobile VLM baseline.
First Author · code

Vision Transformer
Revisited spatial attention design in Vision Transformers, pairing strong accuracy with a simpler architecture and practical deployment properties.
NeurIPS 2021 · First Author · Most Influential · code

Position Encoding
Conditional positional encodings for Vision Transformers, a clean architectural contribution later recognized on PaperDigest's Most Influential list.
ICLR 2023 · First Author · Most Influential · code

AutoML
A fairness-centered rethink of weight-sharing NAS evaluation, representing the earlier AutoML line that shaped the transition into efficient vision backbones.
ICCV 2021 · First Author · Most Influential · code

Quantization
A quantization-aware solution for RepVGG-style re-parameterized networks, addressing the structural quantization challenge behind YOLOv6-like industrial detectors.
AAAI 2024 · First Author · code

Team-Led & Open-Source Systems

Industrial Vision
An industrial object-detection framework with a full training-to-deployment toolchain and broad open-source adoption.
Open Source · code

Spatial AI
A real-world benchmark for evaluating route-planning agents in mobility scenarios, anchoring AMAP-ML's spatial-intelligence research direction.
KDD 2026 Oral · code

World Models
A general-purpose interactive world model release with controllable camera navigation, prompt-driven world events, model weights, and inference code.
AMAP-ML · 2026 · code

Agents
An agentic skill-evolution system that turns real interaction traces into reusable skill libraries across sessions, devices, and agents.
AMAP-ML · code

First-Author Papers

  1. GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR 2026 [code]
  2. USP: Unified self-supervised pretraining for image generation and understanding, ICCV 2025 [code]
  3. VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV 2024 [code]
  4. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model [code]
  5. MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [code]
  6. Make RepVGG Greater Again: A Quantization-aware Approach, AAAI 2024 [code]
  7. Conditional Positional Encodings for Vision Transformers, ICLR 2023 [code]
  8. ROME: Robustifying memory-efficient NAS via topology disentanglement and gradients accumulation, ICCV 2023
  9. MixPath: A unified approach for one-shot neural architecture search, ICCV 2023 [code]
  10. A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC 2022
  11. Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS 2021 [code]
  12. DARTS-: Robustly stepping out of performance collapse without indicators, ICLR 2021 [code]
  13. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV 2021 [code]
  14. Noisy differentiable architecture search, BMVC 2021 [code]
  15. Scarlet-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCV Workshops 2021 [code]
  16. Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV 2020 [code]
  17. MoGA: Searching beyond MobileNetV3, ICASSP 2020 [code]
  18. Fast, accurate and lightweight super-resolution with neural architecture search, ICPR 2020 [code]
  19. Multi-objective reinforced evolution in mobile neural architecture search, ECCV Workshops 2020
  20. Policy optimization with penalized point probability distance: An alternative to PPO
  21. Improved crowding distance for NSGA-II
  22. Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning

Collaborative Papers

Image Generation & Editing

  1. E2PO: Embedding-perturbed Exploration Preference Optimization for Flow Models, ICML 2026
  2. MAR-GRPO: Stabilized GRPO for AR-Diffusion Hybrid Image Generation
  3. ConceptWeaver: Weaving Disentangled Concepts with Flow
  4. Elucidating the SNR-t Bias of Diffusion Probabilistic Models, CVPR 2026 [code]
  5. Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers, CVPR 2026
  6. Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing [code]
  7. From Scale to Speed: Adaptive Test-Time Scaling for Image Editing, CVPR 2026
  8. Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning, CVPR 2026
  9. From editor to dense geometry estimator, CVPR 2026 [code]
  10. RAGSR: Regional attention guided diffusion for image super-resolution
  11. S2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR 2026 [code]
  12. LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV 2025 [code]
  13. FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
  14. Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
  15. FlowDreamer: Exploring high fidelity text-to-3D generation via rectified flow
  16. TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP 2026
  17. Accurate and efficient single image super-resolution with matrix channel attention network, ACCV 2020

Video Generation & Understanding

  1. MIGA: Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos, ICML 2026
  2. Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
  3. Video-CoE: Reinforcing Video Event Prediction via Chain of Events, CVPR 2026
  4. Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
  5. Eevee: Towards Close-up High-resolution Video-based Virtual Try-on, CVPR 2026 Findings [code]
  6. ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI 2026 [code]
  7. Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR 2026
  8. Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI 2026 [code]
  9. NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR 2026 [code]
  10. VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV 2025 [code]
  11. FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM 2025 [code]
  12. Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP 2026
  13. Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP 2026
  14. Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR 2022

LLM Reasoning & Agents

  1. D2Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning, ICML 2026
  2. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver [code]
  3. Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
  4. Learning Agentic Policy from Action Guidance
  5. CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution, ACL 2026
  6. Code2World: A GUI World Model via Renderable Code Generation [code]
  7. Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
  8. Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR 2026 [code]
  9. AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI 2026
  10. Tree search for LLM agent reinforcement learning, ICLR 2026 [code]
  11. AutoDrive-R2: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR 2026
  12. Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP 2025 Oral [code]
  13. HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP 2025 Oral [code]
  14. Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP 2026

Multimodal & Vision-Language

  1. UniMRG: Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation, ICML 2026
  2. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  3. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics, ACL 2026
  4. What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
  5. Visually-Guided Policy Optimization for Multimodal Reasoning, ACL 2026
  6. Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty, WWW 2026
  7. Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment [code]
  8. Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models, ICLR 2026 [code]
  9. Urban Socio-Semantic Segmentation with Vision-Language Reasoning, ICLR 2026 [code]
  10. Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning, AAAI 2026 [code]
  11. UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning
  12. Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
  13. MMGenBench: Evaluating the limits of LMMs from the text-to-image generation perspective
  14. Lenna: Language Enhanced Reasoning Detection Assistant, ICASSP 2025 [code]

Detection, Segmentation & 3D Perception

  1. UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV 2025 [code]
  2. PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR 2025
  3. SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI 2024
  4. FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
  5. YOLOv6 v3.0: A full-scale reloading
  6. AeDet: Azimuth-invariant multi-view 3D object detection, CVPR 2023
  7. SegViT: Semantic segmentation with plain vision transformers, NeurIPS 2022
  8. YOLOv6: A single-stage object detection framework for industrial applications, arXiv [code]
  9. Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS 2022
  10. PromptDet: Towards open-vocabulary detection using uncurated images, ECCV 2022
  11. CCTrans: Simplifying and improving crowd counting with transformer

Foundation Model Architectures

  1. FASA: Frequency-Aware Sparse Attention, ICLR 2026 [code]
  2. Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models, ACL 2026
  3. AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models? [code]
  4. Semantic Context Matters: Improving Conditioning for Autoregressive Models, CVPR 2026
  5. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation, CVPR 2026
  6. There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training, ICLR 2026 [code]
  7. SCALAR: Scale-wise controllable visual autoregressive learning, AAAI 2026 [code]
  8. Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition, ECCV 2024
  9. Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness, ICML 2024
  10. PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution, CVPR 2024
  11. EfficientRep: An efficient RepVGG-style ConvNets with hardware-aware neural network design

Model Compression & AutoML

  1. Robust MAE-Driven NAS: From Mask Reconstruction to Architecture Innovation, ICASSP 2026
  2. LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR 2024
  3. Masked Autoencoders Are Robust Neural Architecture Search Learners
  4. A Speed Odyssey for Deployable Quantization of LLMs
  5. Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI 2024
  6. FPTQ: Fine-grained Post-Training Quantization for Large Language Models
  7. EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC 2022
  8. DAAS: Differentiable architecture and augmentation policy search
  9. AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP 2021
  10. Neural Architecture Search on Acoustic Scene Classification, InterSpeech 2020

Maps, Mobility & Recommendation

  1. MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios, KDD 2026 Oral [code]
  2. IntRR: A Framework for Integrating SID Redistribution and Length Reduction for Generative Recommendation [code]
  3. IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation [code]
  4. GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation
  5. SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation
  6. Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization, ACL 2026 Findings
  7. IntSR: An integrated generative framework for search and recommendation
  8. Comprehensive Comparison Network: A framework for locality-aware, routes-comparable and interpretable route recommendation
  9. Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
  10. DSFNet: Learning Disentangled Scenario Factorization for Multi-Scenario Route Ranking, WWW 2025 [code]