Publications

My publications follow a long arc from neural architecture search and efficient vision backbones to multimodal foundation models, LLM reasoning, world models, and spatial intelligence. Recent AMAP-ML work is designed to connect top-tier research, reproducible open source, and AMAP products serving 300M+ users every day.

120+ research papers and preprints

15K+ Google Scholar citations

6K+ citations from first-authored works

30+ open-source AMAP-ML projects


You can also find my articles on my Google Scholar profile.

Representative Works

Representative works are intentionally weighted toward first-author contributions, because they best show my own research taste, technical judgment, and long-term arc. A smaller set of team-led and open-source systems is included to show how that arc scales through engineering leadership, released code, and product-facing AI.

First-Author Representative Works

Reasoning RL

GPG: a minimal reinforcement-learning baseline for model reasoning, with no critic, no reference model, and no KL penalty. Adopted by ByteDance's VERL as an official algorithm.

ICLR 2026 · First Author · code

Generation

USP: unified self-supervised pretraining that bridges image generation and understanding, continuing the first-author line from efficient architectures to foundation models.

ICCV 2025 · First Author · code

Architecture

VisionLLaMA: a unified LLaMA-style backbone for vision tasks, introducing auto-scaling 2D RoPE for multimodal Transformers across generation, classification, segmentation, and detection.

ECCV 2024 · First Author · code

Mobile VLM

MobileVLM: a compact open vision-language assistant designed for real-time on-device deployment, with the follow-up MobileVLM V2 improving the mobile VLM baseline.

First Author · code

Vision Transformer

Twins: a revisit of spatial attention design in Vision Transformers that pairs strong accuracy with a simpler architecture and practical deployment properties.

NeurIPS 2021 · First Author · Most Influential · code

Position Encoding

Conditional positional encodings for Vision Transformers, a clean architectural contribution later recognized on PaperDigest's Most Influential list.

ICLR 2023 · First Author · Most Influential · code

AutoML

FairNAS: a fairness-centered rethink of weight-sharing NAS evaluation, representing the earlier AutoML line that shaped the transition into efficient vision backbones.

ICCV 2021 · First Author · Most Influential · code

Quantization

A quantization-aware solution for RepVGG-style re-parameterized networks, addressing the structural quantization challenge behind YOLOv6-like industrial detectors.

AAAI 2024 · First Author · code

Team-Led & Open-Source Systems

Industrial Vision

YOLOv6: an industrial object-detection framework with a full training-to-deployment toolchain and broad open-source adoption.

Open Source · code

Spatial AI

A real-world benchmark for evaluating route-planning agents in mobility scenarios, anchoring AMAP-ML's spatial-intelligence research direction.

KDD 2026 Oral · code

World Models

A general-purpose interactive world model release with controllable camera navigation, prompt-driven world events, model weights, and inference code.

AMAP-ML · 2026 · code

Agents

An agentic skill-evolution system that turns real interaction traces into reusable skill libraries across sessions, devices, and agents.

AMAP-ML · code

First-Author Papers

GPG: A simple and strong reinforcement learning baseline for model reasoning, ICLR 2026 [code]
USP: Unified self-supervised pretraining for image generation and understanding, ICCV 2025 [code]
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks, ECCV 2024 [code]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model [code]
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [code]
Make RepVGG Greater Again: A Quantization-aware Approach, AAAI 2024 [code]
Conditional Positional Encodings for Vision Transformers, ICLR 2023 [code]
ROME: Robustifying memory-efficient NAS via topology disentanglement and gradient accumulation, ICCV 2023
MixPath: A unified approach for one-shot neural architecture search, ICCV 2023 [code]
A Unified Mixture-View Framework for Unsupervised Representation Learning, BMVC 2022
Twins: Revisiting the design of spatial attention in vision transformers, NeurIPS 2021 [code]
DARTS-: Robustly stepping out of performance collapse without indicators, ICLR 2021 [code]
FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search, ICCV 2021 [code]
Noisy differentiable architecture search, BMVC 2021 [code]
SCARLET-NAS: Bridging the gap between stability and scalability in weight-sharing NAS, ICCV Workshops 2021 [code]
Fair DARTS: Eliminating unfair advantages in differentiable architecture search, ECCV 2020 [code]
MoGA: Searching beyond MobileNetV3, ICASSP 2020 [code]
Fast, accurate and lightweight super-resolution with neural architecture search, ICPR 2020 [code]
Multi-objective reinforced evolution in mobile neural architecture search, ECCV Workshops 2020
Policy optimization with penalized point probability distance: An alternative to PPO
Improved crowding distance for NSGA-II
Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning

Collaborative Papers

Image Generation & Editing

E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models, ICML 2026
MAR-GRPO: Stabilized GRPO for AR-Diffusion Hybrid Image Generation
ConceptWeaver: Weaving Disentangled Concepts with Flow
Elucidating the SNR-t Bias of Diffusion Probabilistic Models, CVPR 2026 [code]
Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers, CVPR 2026
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing [code]
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing, CVPR 2026
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning, CVPR 2026
From editor to dense geometry estimator, CVPR 2026 [code]
RAGSR: Regional attention guided diffusion for image super-resolution
S2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models, ICLR 2026 [code]
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling, ICCV 2025 [code]
FLUX-Text: A simple and advanced diffusion transformer baseline for scene text editing
Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
FlowDreamer: exploring high fidelity text-to-3D generation via rectified flow
TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution, ICASSP 2026
Accurate and efficient single image super-resolution with matrix channel attention network, ACCV 2020

Video Generation & Understanding

MIGA: Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos, ICML 2026
Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Video-CoE: Reinforcing Video Event Prediction via Chain of Events, CVPR 2026
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on, CVPR 2026 Findings [code]
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints, AAAI 2026 [code]
Video-STAR: Reinforcing open-vocabulary action recognition with tools, ICLR 2026
Omni-Effects: Unified and spatially-controllable visual effects generation, AAAI 2026 [code]
NarrLV: Towards a comprehensive narrative-centric evaluation for long video generation models, ICLR 2026 [code]
VMBench: A Benchmark for Perception-Aligned Video Motion Generation, ICCV 2025 [code]
FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos, ACM MM 2025 [code]
Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V, ICASSP 2026
Artifact-Aware Evaluation for High-Quality Video Generation, ICASSP 2026
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, CVPR 2022

LLM Reasoning & Agents

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning, ICML 2026
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver [code]
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Learning Agentic Policy from Action Guidance
CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution, ACL 2026
Code2World: A GUI World Model via Renderable Code Generation [code]
Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation, ICLR 2026 [code]
AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting, AAAI 2026
Tree search for LLM agent reinforcement learning, ICLR 2026 [code]
AutoDrive-R2: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving, ICLR 2026
Position bias mitigates position bias: Mitigate position bias through inter-position knowledge distillation, EMNLP 2025 Oral [code]
HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation, EMNLP 2025 Oral [code]
Ranking-aware Reinforcement Learning for Ordinal Ranking, ICASSP 2026

Multimodal & Vision-Language

Detection, Segmentation & 3D Perception

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement, ICCV 2025 [code]
PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus, CVPR 2025
SCTNet: Single Branch CNN with Transformer Semantic Information for Real-time Segmentation, AAAI 2024
FastPillars: A Deployment-friendly Pillar-based 3D Detector, IEEE TCSVT
YOLOv6 v3.0: A full-scale reloading
AeDet: Azimuth-invariant multi-view 3D object detection, CVPR 2023
SegViT: Semantic segmentation with plain vision transformers, NeurIPS 2022
YOLOv6: A single-stage object detection framework for industrial applications, arXiv [code]
Fully convolutional one-stage 3D object detection on LiDAR range images, NeurIPS 2022
PromptDet: Towards open-vocabulary detection using uncurated images, ECCV 2022
CCTrans: Simplifying and improving crowd counting with transformer

Foundation Model Architectures

Model Compression & AutoML

Robust MAE-Driven NAS: From Mask Reconstruction to Architecture Innovation, ICASSP 2026
LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection, ICLR 2024
Masked Autoencoders Are Robust Neural Architecture Search Learners
A Speed Odyssey for Deployable Quantization of LLMs
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, AAAI 2024
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, BMVC 2022
DAAS: Differentiable architecture and augmentation policy search
AutoKWS: Keyword Spotting with Differentiable Architecture Search, ICASSP 2021
Neural Architecture Search on Acoustic Scene Classification, InterSpeech 2020

Maps, Mobility & Recommendation