Publications
My publications follow a long arc from neural architecture search and efficient vision backbones to multimodal foundation models, LLM reasoning, world models, and spatial intelligence. Recent AMAP-ML work is designed to connect top-tier research, reproducible open source, and AMAP products serving 300M+ users every day.
120+ research papers and preprints
15K+ Google Scholar citations
6K+ citations from first-authored works
30+ open-source AMAP-ML projects
You can also find my articles on my Google Scholar profile.
Representative Works
Representative works are intentionally weighted toward first-author contributions, because they best show my own research taste, technical judgment, and long-term arc. A smaller set of team-led and open-source systems is included to show how that arc scales through engineering leadership, released code, and product-facing AI.
First-Author Representative Works
Reasoning RL
A minimal reinforcement-learning baseline for model reasoning: no critic, no reference model, no KL penalty. Adopted by ByteDance's VERL as an official algorithm.
Generation
Unified self-supervised pretraining that bridges image generation and understanding, continuing the first-author line from efficient architectures to foundation models.
Architecture
A unified LLaMA-style backbone for vision tasks, introducing auto-scaling 2D RoPE for multimodal Transformers across generation, classification, segmentation, and detection.
Mobile VLM
A compact open vision-language assistant designed for real-time on-device deployment, with follow-up V2 work improving the mobile VLM baseline.
Vision Transformer
Revisited spatial attention design in Vision Transformers, pairing strong accuracy with a simpler architecture and practical deployment properties.
Position Encoding
Conditional positional encodings for Vision Transformers, a clean architectural contribution later recognized on PaperDigest's Most Influential list.
AutoML
A fairness-centered rethink of weight-sharing NAS evaluation, representing the earlier AutoML line that shaped the transition into efficient vision backbones.
Quantization
A quantization-aware solution for RepVGG-style re-parameterized networks, addressing the structural quantization challenge behind YOLOv6-like industrial detectors.
Team-Led & Open-Source Systems
Industrial Vision
An industrial object-detection framework with a full training-to-deployment toolchain and broad open-source adoption.
Spatial AI
A real-world benchmark for evaluating route-planning agents in mobility scenarios, anchoring AMAP-ML's spatial-intelligence research direction.
World Models
A general-purpose interactive world model release with controllable camera navigation, prompt-driven world events, model weights, and inference code.
Agents
An agentic skill-evolution system that turns real interaction traces into reusable skill libraries across sessions, devices, and agents.
First-Author Papers
- GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR 2026 [code]
- USP: Unified self-supervised pretraining for image generation and understanding, ICCV 2025 [code]
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV 2024 [code]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model [code]
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [code]
- Make RepVGG Greater Again: A Quantization-aware Approach, AAAI 2024 [code]
- Conditional Positional Encodings for Vision Transformers, ICLR 2023 [code]
- ROME: Robustifying memory-efficient NAS via topology disentanglement and gradients accumulation, ICCV 2023
- MixPath: A unified approach for one-shot neural architecture search, ICCV 2023 [code]
- A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC 2022
- Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS 2021 [code]
- DARTS-: Robustly stepping out of performance collapse without indicators, ICLR 2021 [code]
- FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV 2021 [code]
- Noisy differentiable architecture search, BMVC 2021 [code]
- Scarlet-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCV Workshops 2021 [code]
- Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV 2020 [code]
- MoGA: Searching beyond MobileNetV3, ICASSP 2020 [code]
- Fast, accurate and lightweight super-resolution with neural architecture search, ICPR 2020 [code]
- Multi-objective reinforced evolution in mobile neural architecture search, ECCV Workshops 2020
- Policy optimization with penalized point probability distance: An alternative to PPO
- Improved crowding distance for NSGA-II
- Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning
Collaborative Papers
Image Generation & Editing
- E2PO: Embedding-perturbed Exploration Preference Optimization for Flow Models, ICML 2026
- MAR-GRPO: Stabilized GRPO for AR-Diffusion Hybrid Image Generation
- ConceptWeaver: Weaving Disentangled Concepts with Flow
- Elucidating the SNR-t Bias of Diffusion Probabilistic Models, CVPR 2026 [code]
- Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers, CVPR 2026
- Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing [code]
- From Scale to Speed: Adaptive Test-Time Scaling for Image Editing, CVPR 2026
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning, CVPR 2026
- From editor to dense geometry estimator, CVPR 2026 [code]
- RAGSR: Regional attention guided diffusion for image super-resolution
- S2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR 2026 [code]
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV 2025 [code]
- FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
- Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
- FlowDreamer: exploring high fidelity text-to-3D generation via rectified flow
- TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP 2026
- Accurate and efficient single image super-resolution with matrix channel attention network, ACCV 2020
Video Generation & Understanding
- MIGA: Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos, ICML 2026
- Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
- Video-CoE: Reinforcing Video Event Prediction via Chain of Events, CVPR 2026
- Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
- Eevee: Towards Close-up High-resolution Video-based Virtual Try-on, CVPR 2026 Findings [code]
- ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI 2026 [code]
- Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR 2026
- Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI 2026 [code]
- NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR 2026 [code]
- VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV 2025 [code]
- FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM 2025 [code]
- Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP 2026
- Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP 2026
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR 2022
LLM Reasoning & Agents
- D2Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning, ICML 2026
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver [code]
- Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
- Learning Agentic Policy from Action Guidance
- CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution, ACL 2026
- Code2World: A GUI World Model via Renderable Code Generation [code]
- Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
- Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR 2026 [code]
- AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI 2026
- Tree search for LLM agent reinforcement learning, ICLR 2026 [code]
- AutoDrive-R2: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR 2026
- Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP 2025 oral [code]
- HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP 2025 oral [code]
- Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP 2026
Multimodal & Vision-Language
- UniMRG: Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation, ICML 2026
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
- LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics, ACL 2026
- What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
- Visually-Guided Policy Optimization for Multimodal Reasoning, ACL 2026
- Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty, WWW 2026
- Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment [code]
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models, ICLR 2026 [code]
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning, ICLR 2026 [code]
- Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning, AAAI 2026 [code]
- UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning
- Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
- MMGenBench: Evaluating the limits of LMMs from the text-to-image generation perspective
- Lenna: Language Enhanced Reasoning Detection Assistant, ICASSP 2025 [code]
Detection, Segmentation & 3D Perception
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV 2025 [code]
- PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR 2025
- SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI 2024
- FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
- YOLOv6 v3.0: A full-scale reloading
- AeDet: Azimuth-invariant multi-view 3D object detection, CVPR 2023
- SegViT: Semantic segmentation with plain vision transformers, NeurIPS 2022
- YOLOv6: A single-stage object detection framework for industrial applications, arXiv [code]
- Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS 2022
- PromptDet: Towards open-vocabulary detection using uncurated images, ECCV 2022
- CCTrans: Simplifying and improving crowd counting with transformer
Foundation Model Architectures
- FASA: Frequency-Aware Sparse Attention, ICLR 2026 [code]
- Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models, ACL 2026
- AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models? [code]
- Semantic Context Matters: Improving Conditioning for Autoregressive Models, CVPR 2026
- Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation, CVPR 2026
- There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training, ICLR 2026 [code]
- Scalar: Scale-wise controllable visual autoregressive learning, AAAI 2026 [code]
- Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition, ECCV 2024
- Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness, ICML 2024
- PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution, CVPR 2024
- EfficientRep: An efficient RepVGG-style convnets with hardware-aware neural network design
Model Compression & AutoML
- Robust MAE-Driven NAS: From Mask Reconstruction to Architecture Innovation, ICASSP 2026
- LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR 2024
- Masked Autoencoders Are Robust Neural Architecture Search Learners
- A Speed Odyssey for Deployable Quantization of LLMs
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI 2024
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models
- EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC 2022
- DAAS: Differentiable architecture and augmentation policy search
- AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP 2021
- Neural Architecture Search on Acoustic Scene Classification, InterSpeech 2020
Maps, Mobility & Recommendation
- MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios, KDD 2026 Oral [code]
- IntRR: A Framework for Integrating SID Redistribution and Length Reduction for Generative Recommendation [code]
- IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation [code]
- GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation
- SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation
- Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization, ACL 2026 Findings
- IntSR: An integrated generative framework for search and recommendation
- Comprehensive Comparison Network: a framework for locality-aware, routes-comparable and interpretable route recommendation
- Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
- DSFNet: Learning Disentangled Scenario Factorization for Multi-Scenario Route Ranking, WWW 2025 [code]
