Technical Report · XPeng Foundation Model Team

TuringViT: Making SOTA Vision Transformers Accessible to All

A VLM-native, data-efficient and linear-complexity Vision Transformer. We show a concrete, reproducible path for designing and training your own SOTA-level visual encoder under a controlled resource budget: TuringViT outperforms leading open-source ViT baselines using roughly 10% of the pretraining data (0.85B samples vs. 10B for SigLIP2-L), with substantially flatter latency scaling for high-resolution and dynamic-resolution inputs.

  • Faster than Seed1.5-ViT at 1536²
  • 10% of baseline data
  • 83.6 avg. zero-shot acc. (24L)
Three Coordinated Designs

One ViT, designed end-to-end for the VLM era.

TuringViT rethinks visual encoders along three orthogonal axes — architecture, data, and training — to make SOTA-level encoders trainable and customizable under realistic resource budgets.

Architecture-Efficient

Turing Linear Attention (TLA) dominates the stack; only 1 of every 6 layers keeps softmax MHA for global routing. Attention cost moves from quadratic to near-linear in the number of visual tokens, with input-dependent gating that preserves local and high-frequency cues.

  • 5× TLA + 1× MHA
  • 2D-RoPE
  • RMSNorm + SwiGLU
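
The 5:1 ratio can be written down as a simple layer schedule. The sketch below is illustrative only: the function name is hypothetical, and placing the softmax MHA layer last in each group of six is an assumption, since the text gives the ratio but not the position.

```python
from collections import Counter


def turing_layer_pattern(num_blocks, tla_per_block=5, mha_per_block=1):
    """Return the per-layer attention types for a hybrid stack.

    Assumption: the softmax MHA layer closes each block. The source gives
    the 5:1 ratio but not where the MHA layer sits inside a block.
    """
    block = ["TLA"] * tla_per_block + ["MHA"] * mha_per_block
    return block * num_blocks


if __name__ == "__main__":
    for name, blocks in (("TuringViT-18L", 3), ("TuringViT-24L", 4)):
        layers = turing_layer_pattern(blocks)
        print(name, len(layers), dict(Counter(layers)))
```

With 3 and 4 blocks this reproduces the 15× TLA + 3× MHA and 20× TLA + 4× MHA layer counts of the 18L and 24L models.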

Data-Efficient

VISTA-Curation turns noisy web image–text and video–text data into grounded, discriminative, temporally informative supervision. Result: 10× data reduction without sacrificing quality.

  • multi-candidate recap
  • relative scoring
  • temporal aggregation

VLM-Native

Trained natively at dynamic resolution from the start; no post-hoc adaptation, no extra parameters when plugged into downstream VLMs. Generation-aligned objectives further bridge contrastive ViT and generative VLM training.

  • progressive res. schedule
  • plug-and-play
  • generation alignment
Architecture

The Turing Block — linear where it matters, softmax where it counts.

Five Turing Linear Attention layers handle global context at O(N) in the number of visual tokens N. One vanilla MHA layer per block keeps token-level precision. The result: latency grows far more slowly as resolution climbs.
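
To make the O(N) claim concrete, here is a minimal sketch of generic kernelized linear attention. It is not TLA itself: the gating and the exact feature map of TLA are not described here, so the `elu(x) + 1` feature map and all function names are assumptions. The point is only that reordering the matrix products gives the same output without ever forming the N × N weight matrix.

```python
import math
import random


def phi(x):
    """elu(x) + 1: a positive feature map (an assumption, not TLA's)."""
    return x + 1.0 if x > 0 else math.exp(x)


def featurize(m):
    return [[phi(x) for x in row] for row in m]


def attention_quadratic(q, k, v):
    """Kernel attention computed via explicit per-query weights over all N keys."""
    fq, fk = featurize(q), featurize(k)
    out = []
    for qi in fq:
        weights = [sum(a * b for a, b in zip(qi, kj)) for kj in fk]  # N weights
        z = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / z
                    for c in range(len(v[0]))])
    return out


def attention_linear(q, k, v):
    """Same result, but K^T V is summarized once in a d x d_v state."""
    fq, fk = featurize(q), featurize(k)
    d, dv = len(fk[0]), len(v[0])
    # d x d_v summary of all keys and values: cost O(N * d * d_v)
    kv = [[sum(kj[i] * vj[c] for kj, vj in zip(fk, v)) for c in range(dv)]
          for i in range(d)]
    k_sum = [sum(kj[i] for kj in fk) for i in range(d)]  # normalizer state
    out = []
    for qi in fq:
        z = sum(a * b for a, b in zip(qi, k_sum))
        out.append([sum(qi[i] * kv[i][c] for i in range(d)) / z
                    for c in range(dv)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 32, 8
    q, k, v = (random_matrix(n, d, rng) for _ in range(3))
    a, b = attention_quadratic(q, k, v), attention_linear(q, k, v)
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(f"max abs difference: {max_diff:.2e}")
```

The second function touches each token a constant number of times, so its cost grows linearly with N, while the first grows with N².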

Figure: TuringViT structure, a hybrid linear attention backbone.

TuringViT-18L · 3 Turing Blocks
15× TLA + 3× MHA
Efficient & deployment-friendly under dynamic-resolution inputs.

TuringViT-24L · 4 Turing Blocks
20× TLA + 4× MHA
Stronger representation, computation still dominated by linear attention.
Linear Scaling

The longer the sequence, the wider the gap.

The attention cost of standard ViTs scales quadratically with visual token count. TuringViT's latency curve stays much flatter.

Figure: Inference latency scaling, TuringViT vs. standard ViT baselines.

Latency measured per image using FP16 TensorRT on an NVIDIA GeForce RTX 3080 Ti (CUDA 11.6). TuringViT shows flatter high-resolution scaling than standard ViT baselines.
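
A back-of-the-envelope count shows why the gap widens with resolution. This is a rough sketch, not a latency model: the patch size of 16, `d_model=1024` and `d_head=64` are placeholder assumptions rather than published TuringViT dimensions, and only the token-mixing multiply-accumulates are counted (projections, MLPs and gating are ignored).

```python
def num_tokens(height, width, patch=16):
    """Visual tokens for an image; patch=16 is an assumed patch size."""
    return (height // patch) * (width // patch)


def mixing_macs(n, kind, d_model=1024, d_head=64):
    """Rough multiply-accumulates for token mixing in one attention layer.

    MHA (softmax): Q K^T plus weights @ V    -> 2 * n^2 * d_model
    TLA (linear):  K^T V plus Q @ (K^T V)    -> 2 * n * d_model * d_head
    d_model and d_head are placeholder values, not TuringViT's dimensions.
    """
    if kind == "MHA":
        return 2 * n * n * d_model
    if kind == "TLA":
        return 2 * n * d_model * d_head
    raise ValueError(f"unknown layer kind: {kind}")


def hybrid_to_softmax_ratio(n, tla=5, mha=1, **dims):
    """Token-mixing cost of a 5:1 hybrid group relative to 6 softmax layers."""
    hybrid = tla * mixing_macs(n, "TLA", **dims) + mha * mixing_macs(n, "MHA", **dims)
    full = (tla + mha) * mixing_macs(n, "MHA", **dims)
    return hybrid / full


if __name__ == "__main__":
    for side in (256, 512, 1024, 1536):
        n = num_tokens(side, side)
        print(f"{side}x{side}: {n} tokens, ratio {hybrid_to_softmax_ratio(n):.3f}")
```

Under these assumptions the ratio falls toward 1/6 as the token count grows, because the one remaining softmax layer per block dominates at long sequence lengths. That is consistent with describing the scaling as near-linear rather than strictly linear.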

VISTA-Curation

More supervision per sample, not more samples.

Vision Data Curation via Image–Video Scoring and Temporal Aggregation. Each sample is filtered, recaptioned, scored, grounded, and (for video) temporally aggregated — turning noisy web data into denser supervision.
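
One way to read "multi-candidate recap" plus "relative scoring" is that several candidate captions for the same sample compete with each other, rather than each being judged against a fixed absolute threshold. The sketch below illustrates that idea only; the scoring function is a toy placeholder, all names are hypothetical, and the actual VISTA-Curation scorer is not described here.

```python
def relative_scores(raw_scores):
    """Center scores on the per-sample mean so candidates compete with each other."""
    mean = sum(raw_scores) / len(raw_scores)
    return [s - mean for s in raw_scores]


def pick_caption(candidates, score_fn):
    """Return (best_caption, margin) among the candidates for one image or clip."""
    rel = relative_scores([score_fn(c) for c in candidates])
    best = max(range(len(candidates)), key=rel.__getitem__)
    return candidates[best], rel[best]


def toy_grounding_score(caption, visible_tags=frozenset({"dog", "frisbee", "grass"})):
    """Placeholder scorer: fraction of visible tags mentioned in the caption."""
    words = set(caption.lower().replace(".", "").split())
    return len(words & visible_tags) / len(visible_tags)


if __name__ == "__main__":
    candidates = [
        "A nice photo.",
        "A dog on grass.",
        "A dog catches a frisbee on grass.",
    ]
    caption, margin = pick_caption(candidates, toy_grounding_score)
    print(caption, round(margin, 3))
```

In a real pipeline the placeholder scorer would be replaced by a learned image-text or video-text scoring model.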

Figure: Image-language curation engine.
Figure: Video-language data curation pipeline.
Pre-training Recipe

A four-stage progressive recipe, native to VLM usage.

The goal is not to introduce complex objectives, but to progressively expose the encoder to the same inputs downstream VLMs see — dynamic-resolution images, high-resolution visual content, and mixed image–video data. Both 18L and 24L follow the same recipe.
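
For budgeting purposes, the per-stage costs listed with the four stages in this section can be totaled directly. The snippet below is plain arithmetic on those figures; the dictionary keys and function name are illustrative.

```python
# Per-stage compute cost in H800 hours, as listed for each stage.
STAGE_COST_H800H = {
    "Stage 01: MIM-based visual initialization": 9_984,
    "Stage 02: bounded dynamic image-text": 30_720,
    "Stage 03: high-resolution refinement": 12_288,
    "Stage 04: image-video mixed training": 460,
}


def budget_summary(costs):
    """Return the total cost and each stage's share of that total."""
    total = sum(costs.values())
    shares = {name: hours / total for name, hours in costs.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = budget_summary(STAGE_COST_H800H)
    print(f"total: {total:,} H800h")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The four stages sum to 53,452 H800h, with Stage 02 accounting for a little over half of the total.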

STAGE 01

MIM-based Visual Initialization

Reconstruct masked EVA02-CLIP-E features (40% mask rate) to bootstrap geometric priors before language alignment.

  • data: 55M images
  • resolution: 256²
  • objective: MIM distillation
  • cost: 9,984 H800h
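
As a rough illustration of the Stage 01 setup, the sketch below draws a 40% token mask and scores a student against teacher features on the masked positions only. Uniform random masking, the cosine-distance loss and the 16 × 16 token grid (256² input with an assumed patch size of 16) are all assumptions; only the 40% mask rate and the use of EVA02-CLIP-E features as targets come from the description above.

```python
import math
import random


def random_token_mask(num_tokens, mask_rate=0.4, seed=None):
    """Boolean mask with round(mask_rate * num_tokens) masked positions.

    Uniform random masking is an assumption; the source gives only the rate.
    """
    rng = random.Random(seed)
    num_masked = round(mask_rate * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_cosine_loss(student, teacher, mask):
    """Mean (1 - cosine similarity) over masked tokens only (assumed loss form)."""
    losses = []
    for s, t, m in zip(student, teacher, mask):
        if not m:
            continue
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        losses.append(1.0 - dot / norm)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    grid = 16  # 256 / 16, assuming a patch size of 16
    mask = random_token_mask(grid * grid, mask_rate=0.4, seed=0)
    print(f"masked {sum(mask)} of {len(mask)} tokens")
```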
STAGE 02

Bounded Dynamic Image–Text

Aspect-preserving dynamic resolution with longer side ∈ [256, 512]; SigLIP + SuperClass loss for large-scale alignment.

  • data: 850M pairs
  • resolution: bounded dynamic
  • objective: SigLIP + SuperClass
  • cost: 30,720 H800h
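
The bounded dynamic-resolution rule can be sketched as a small size calculation: clamp the longer side into [256, 512] and scale the other side by the same factor. Snapping both sides to a multiple of the patch size, and the patch size of 16 itself, are assumptions added so the result tiles cleanly into tokens; the function name is hypothetical.

```python
def bounded_dynamic_size(height, width, lo=256, hi=512, patch=16):
    """Target (height, width) with the longer side clamped into [lo, hi].

    The aspect ratio is preserved up to rounding to a patch multiple.
    patch=16 and the snapping step are assumptions, not from the source.
    """
    longer = max(height, width)
    target = min(max(longer, lo), hi)  # clamp the longer side into [lo, hi]

    def snap(side):
        # Scale by target / longer, then round to the nearest patch multiple.
        return max(patch, round(side * target / (longer * patch)) * patch)

    return snap(height), snap(width)


if __name__ == "__main__":
    for h, w in ((1080, 1920), (100, 200), (300, 400)):
        nh, nw = bounded_dynamic_size(h, w)
        tokens = (nh // 16) * (nw // 16)
        print(f"{h}x{w} -> {nh}x{nw} ({tokens} tokens)")
```

Large images are scaled down, small images are scaled up, and images already inside the range only get snapped to the patch grid.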
STAGE 03

High-Resolution Refinement

Preserve original aspect ratio & resolution whenever possible, making visual preprocessing consistent with downstream VLM usage.

  • data: 850M pairs
  • resolution: unrestricted dynamic
  • objective: SigLIP + SuperClass
  • cost: 12,288 H800h
STAGE 04

Image–Video Mixed Training

Mix video–text with image–text replay. Each video contributes T=8 frames; frame features are attention-pooled, then mean-pooled into a single video embedding that is aligned via LVT.

  • data: 2M video + replay
  • resolution: unrestricted dynamic
  • objective: SigLIP + SuperClass
  • cost: 460 H800h
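
The frame-to-video pooling in Stage 04 can be sketched in a few lines. This is a minimal illustration under stated assumptions: attention pooling is modeled as single-query softmax pooling over a frame's tokens, and the query vector stands in for a learned parameter. The actual pooling head is not described here.

```python
import math


def attention_pool(tokens, query):
    """Pool a list of token vectors into one vector with a single query.

    Single-query softmax pooling is an assumption; the source does not
    describe the pooling head.
    """
    d = len(query)
    logits = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [sum(e * tok[c] for e, tok in zip(exps, tokens)) / z for c in range(d)]


def video_embedding(frames, query):
    """Attention-pool each frame, then mean-pool across the T frames."""
    frame_embs = [attention_pool(tokens, query) for tokens in frames]
    t = len(frame_embs)
    return [sum(emb[c] for emb in frame_embs) / t for c in range(len(query))]


if __name__ == "__main__":
    T = 8
    # Toy input: 8 frames, 4 tokens per frame, 2-dimensional features.
    frames = [[[float(i), 1.0] for _ in range(4)] for i in range(T)]
    print(video_embedding(frames, query=[0.5, -0.5]))
```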
Results

Stronger backbone, smaller data, faster inference.

  • 83.9 · ImageNet-1K zero-shot (TuringViT-24L, vs. 83.1 for SigLIP2-L)
  • 89.7 · ImageNet-A (+5.4 over SigLIP2-L at 84.3)
  • 79.4 · Zero-shot retrieval avg. (TuringViT-24L, COCO + Flickr30K)
  • 0.85B · Pretraining samples (vs. 10B for SigLIP2-L)

Zero-shot performance across classification & retrieval

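| | MobileCLIP2-L | SigLIP2-L | Seed1.5-ViT | TuringViT-18L | TuringViT-24L |
|---|---|---|---|---|---|
| Model configuration | | | | | |
| Resolution | 224 | 384 | dyn. | dyn. | dyn. |
| Pretraining data | 1.9B | 10B | 4.8B | 0.85B | 0.85B |
| Zero-shot classification | | | | | |
| Avg | 77.1 | 83.4 | 82.5 | 82.7 | 83.6 |
| ImageNet-1K | 81.9 | 83.1 | 83.6 | 83.1 | 83.9 |
| ImageNet-v2 | 74.7 | 77.4 | 77.6 | 77.6 | 78.0 |
| ObjectNet | 75.3 | 84.4 | 79.2 | 79.7 | 81.1 |
| ImageNet-A | 69.0 | 84.3 | 85.5 | 87.9 | 89.7 |
| ImageNet-R | 91.7 | 95.7 | 95.2 | 95.0 | 95.3 |
| ImageNet-Sketch | 69.8 | 75.5 | 74.1 | 72.8 | 73.6 |
| Zero-shot retrieval | | | | | |
| Avg | 72.5 | 76.7 | - | 78.9 | 79.4 |
| COCO T→I | 51.6 | 55.3 | - | 58.6 | 59.4 |
| COCO I→T | 69.0 | 71.4 | - | 75.9 | 76.0 |
| Flickr30K T→I | 77.2 | 85.0 | - | 85.7 | 86.0 |
| Flickr30K I→T | 92.0 | 95.2 | - | 95.3 | 96.3 |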
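
The headline numbers can be cross-checked against the table with a few lines of arithmetic. The snippet below recomputes the zero-shot classification averages from the six per-benchmark rows and the pretraining-data ratio; it uses `decimal` so that rounding to one decimal place is exact. Variable and function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

MODELS = ["MobileCLIP2-L", "SigLIP2-L", "Seed1.5-ViT", "TuringViT-18L", "TuringViT-24L"]

# Zero-shot classification rows, copied from the table above.
CLASSIFICATION = {
    "ImageNet-1K":     ["81.9", "83.1", "83.6", "83.1", "83.9"],
    "ImageNet-v2":     ["74.7", "77.4", "77.6", "77.6", "78.0"],
    "ObjectNet":       ["75.3", "84.4", "79.2", "79.7", "81.1"],
    "ImageNet-A":      ["69.0", "84.3", "85.5", "87.9", "89.7"],
    "ImageNet-R":      ["91.7", "95.7", "95.2", "95.0", "95.3"],
    "ImageNet-Sketch": ["69.8", "75.5", "74.1", "72.8", "73.6"],
}

# Pretraining samples in billions, copied from the table above.
PRETRAINING_SAMPLES_B = dict(zip(MODELS, ["1.9", "10", "4.8", "0.85", "0.85"]))


def column_average(table, model):
    """Mean of one model's column, rounded half-up to one decimal place."""
    col = MODELS.index(model)
    values = [Decimal(row[col]) for row in table.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def data_fraction(model, baseline):
    """Pretraining samples of `model` as a fraction of `baseline`."""
    return Decimal(PRETRAINING_SAMPLES_B[model]) / Decimal(PRETRAINING_SAMPLES_B[baseline])


if __name__ == "__main__":
    for model in MODELS:
        print(model, column_average(CLASSIFICATION, model))
    print("data vs. SigLIP2-L:", data_fraction("TuringViT-24L", "SigLIP2-L"))
```

The recomputed averages match the Avg row, and 0.85B samples is 8.5% of SigLIP2-L's 10B, in line with the "roughly 10% of the data" claim.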