Technical Report · XPeng Foundation Model Team

TuringViT: Making SOTA Vision Transformers Accessible to All

A VLM-native, data-efficient and linear-complexity Vision Transformer. We show a concrete, reproducible path for designing and training your own SOTA-level visual encoder under a controlled resource budget — outperforming leading open-source ViT baselines using only 10% of the data, with substantially flatter latency scaling for high-resolution and dynamic-resolution inputs.

Faster at 1536² vs. Seed1.5-ViT

10% of baseline data

83.6 avg. zero-shot acc. (24L)


Three Coordinated Designs

One ViT, designed end-to-end for the VLM era.

TuringViT rethinks visual encoders along three orthogonal axes — architecture, data, and training — to make SOTA-level encoders trainable and customizable under realistic resource budgets.

Architecture-Efficient

Turing Linear Attention (TLA) dominates the stack; only 1 of every 6 layers keeps softmax MHA for global routing. Scaling drops from quadratic to near-linear, with input-dependent gating that preserves local & high-frequency cues.

5× TLA + 1× MHA
2D-RoPE
RMSNorm + SwiGLU

Data-Efficient

VISTA-Curation turns noisy web image–text and video–text data into grounded, discriminative, temporally informative supervision. Result: 10× data reduction without sacrificing quality.

multi-candidate recap
relative scoring
temporal aggregation

VLM-Native

Trained natively at dynamic resolution from the start; no post-hoc adaptation, no extra parameters when plugged into downstream VLMs. Generation-aligned objectives further bridge contrastive ViT and generative VLM training.

progressive res. schedule
plug-and-play
generation alignment

Architecture

The Turing Block — linear where it matters, softmax where it counts.

Five Turing Linear Attention layers handle global context at O(N). One vanilla MHA per block keeps token-level precision. The result: latency grows far more slowly as resolution climbs.

TuringViT structure: hybrid linear attention backbone

TuringViT-18L · 3 Turing Blocks

15× TLA + 3× MHA

Efficient & deployment-friendly under dynamic-resolution inputs.

TuringViT-24L · 4 Turing Blocks

20× TLA + 4× MHA

Stronger representation, computation still dominated by linear attention.
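
Both configurations follow from repeating the 5× TLA + 1× MHA block. The sketch below lays out the resulting layer stack. It assumes the MHA layer sits last inside each block, which is not stated here, and the names `build_schedule`, `TLA`, and `MHA` are hypothetical labels rather than identifiers from the report.

```python
from collections import Counter

TLA, MHA = "TLA", "MHA"  # hypothetical layer-type labels


def build_schedule(num_blocks, tla_per_block=5, mha_per_block=1):
    """Return the flat list of layer types for a stack of Turing Blocks.

    Assumption: each block runs its TLA layers first and its MHA layer last.
    """
    block = [TLA] * tla_per_block + [MHA] * mha_per_block
    return block * num_blocks


schedule_18l = build_schedule(num_blocks=3)  # TuringViT-18L
schedule_24l = build_schedule(num_blocks=4)  # TuringViT-24L

for name, schedule in (("18L", schedule_18l), ("24L", schedule_24l)):
    print(name, len(schedule), dict(Counter(schedule)))
```

Whatever the real ordering is, the layer counts per type depend only on the number of blocks, so the 15/3 and 20/4 splits hold either way.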

Linear Scaling

The longer the sequence, the wider the gap.

Standard ViTs scale quadratically with visual token count. TuringViT's latency grows far more slowly.

Inference latency scaling: TuringViT vs standard ViT baselines

Latency measured per image using FP16 TensorRT on an NVIDIA GeForce RTX 3080 Ti (CUDA 11.6). TuringViT shows flatter high-resolution scaling than standard ViT baselines.
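
A back-of-the-envelope cost model helps show why the gap widens with resolution. The sketch below counts only token-mixing multiply-adds, using generic softmax attention (cost growing as N²) and generic kernelized linear attention (cost growing as N) as stand-ins; it is not TLA itself and it does not predict the measured latencies. The patch size of 16, the per-head dimension of 64, and all function names are assumptions made for illustration.

```python
def num_tokens(side, patch=16):
    """Visual tokens for a square side x side image (patch size is an assumption)."""
    return (side // patch) ** 2


def softmax_mixing_cost(n, d):
    """Multiply-adds for QK^T and AV in softmax attention: grows as N^2."""
    return 2 * n * n * d


def linear_mixing_cost(n, d):
    """Multiply-adds for K^T V and Q(K^T V) in generic linear attention: grows as N."""
    return 2 * n * d * d


def stack_cost(n, d, layers, mha_every=None):
    """Token-mixing cost of a stack; mha_every=6 keeps one softmax layer in six."""
    if mha_every is None:
        return layers * softmax_mixing_cost(n, d)
    n_mha = layers // mha_every
    n_linear = layers - n_mha
    return n_mha * softmax_mixing_cost(n, d) + n_linear * linear_mixing_cost(n, d)


D = 64  # hypothetical per-head dimension

for side in (256, 512, 1024, 1536):
    n = num_tokens(side)
    full = stack_cost(n, D, layers=24)
    hybrid = stack_cost(n, D, layers=24, mha_every=6)
    print(f"{side:>5}^2  tokens={n:>5}  full/hybrid mixing cost={full / hybrid:.2f}")
```

In this toy model the ratio rises with token count and approaches 6 from below, because the one remaining softmax layer in every six eventually dominates. Projections, MLPs, and kernel efficiency are ignored, so real speedups will differ.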

VISTA-Curation

More supervision per sample, not more samples.

Vision Data Curation via Image–Video Scoring and Temporal Aggregation. Each sample is filtered, recaptioned, scored, grounded, and (for video) temporally aggregated — turning noisy web data into denser supervision.

Figure: Image-language curation engine overview
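
To make the idea of scoring candidates against each other concrete, here is a toy sketch of one possible reading of multi-candidate recaptioning with relative scoring. It assumes each sample arrives with several candidate captions and a raw image-text score per candidate from some external scorer. The function name `pick_caption`, the margin threshold, and the example scores are all hypothetical; this is not the VISTA-Curation pipeline.

```python
def pick_caption(candidates, min_margin=0.05):
    """Select a caption by comparing candidates against each other.

    candidates: list of (caption, raw_score) pairs for ONE sample.
    Returns the top-scoring caption if it beats the candidate mean by at
    least min_margin, otherwise None (the sample is treated as not
    discriminative enough and dropped).
    """
    if not candidates:
        return None
    mean_score = sum(score for _, score in candidates) / len(candidates)
    best_caption, best_score = max(candidates, key=lambda pair: pair[1])
    if best_score - mean_score >= min_margin:
        return best_caption
    return None


# Hypothetical scores, for illustration only.
clear_winner = [
    ("a dog", 0.21),
    ("a brown dog catching a frisbee", 0.34),
    ("an animal", 0.19),
]
near_tie = [("a", 0.30), ("b", 0.31)]

print(pick_caption(clear_winner))
print(pick_caption(near_tie))
```

Scoring relative to the other candidates, rather than against a fixed global threshold, is one way to keep samples whose best caption is clearly more specific than the alternatives.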

Pre-training Recipe

A four-stage progressive recipe, native to VLM usage.

The goal is not to introduce complex objectives, but to progressively expose the encoder to the same inputs downstream VLMs see — dynamic-resolution images, high-resolution visual content, and mixed image–video data. Both 18L and 24L follow the same recipe.

STAGE 01

MIM-based Visual Initialization

Reconstruct masked EVA02-CLIP-E features (40% mask rate) to bootstrap geometric priors before language alignment.

data: 55M images
res.: 256²
obj.: MIM distill.
cost: 9,984 H800h

STAGE 02

Bounded Dynamic Image–Text

Aspect-preserving dynamic resolution with longer side ∈ [256, 512]; SigLIP + SuperClass loss for large-scale alignment.

data: 850M pairs
res.: bounded dyn.
obj.: SigLIP + SuperClass
cost: 30,720 H800h
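
The bounded dynamic-resolution rule can be sketched as a small resize helper: keep the aspect ratio and clamp the longer side to [256, 512]. Snapping each side to a multiple of the patch size, the patch size of 16, and the name `bounded_dynamic_size` are assumptions added for illustration.

```python
def bounded_dynamic_size(width, height, min_long=256, max_long=512, patch=16):
    """Return (new_width, new_height) with the longer side clamped to a range.

    The aspect ratio is preserved up to rounding. Snapping to a multiple of
    `patch` is an assumption, not something stated in the recipe.
    """
    long_side = max(width, height)
    target_long = min(max(long_side, min_long), max_long)
    scale = target_long / long_side

    def snap(x):
        # Round the scaled side to the nearest patch multiple, at least one patch.
        return max(patch, round(x * scale / patch) * patch)

    return snap(width), snap(height)


for size in ((1920, 1080), (128, 64), (400, 300)):
    print(size, "->", bounded_dynamic_size(*size))
```

Images already inside the range are left at their native scale, large images are shrunk, and small ones are enlarged until the longer side reaches 256.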

STAGE 03

High-Resolution Refinement

Preserve original aspect ratio & resolution whenever possible, making visual preprocessing consistent with downstream VLM usage.

data: 850M pairs
res.: unrestricted dyn.
obj.: SigLIP + SuperClass
cost: 12,288 H800h

STAGE 04

Image–Video Mixed Training

Mix video–text with image–text replay. T=8 frames → attention-pool → mean-pool video embedding aligned via L_VT.

data: 2M video + replay
res.: unrestricted dyn.
obj.: SigLIP + SuperClass
cost: 460 H800h
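
The video branch can be read as two pooling steps: attention-pool the tokens of each frame, then mean-pool the T frame vectors into one video embedding. The sketch below follows that reading with a single dot-product query per frame. The single-query form, the function names, and the toy dimensions are assumptions; the alignment loss L_VT is not shown.

```python
import math


def attention_pool(tokens, query):
    """Pool token vectors into one vector using softmax weights from one query."""
    d = len(query)
    logits = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) / total for i in range(d)]


def video_embedding(frames, query):
    """frames: list of T frames, each a list of token vectors of equal length."""
    per_frame = [attention_pool(tokens, query) for tokens in frames]
    num_frames = len(per_frame)
    return [sum(vec[i] for vec in per_frame) / num_frames for i in range(len(query))]


# Toy input: T=8 frames, 4 tokens per frame, 3-dimensional features.
frames = [[[1.0, 2.0, 3.0] for _ in range(4)] for _ in range(8)]
query = [0.1, 0.2, 0.3]  # hypothetical pooling query (learned in practice)
print(video_embedding(frames, query))
```

Because every token in the toy input is identical, both pooling steps return that same vector, which makes the example easy to check by hand.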

Results

Stronger backbone, smaller data, faster inference.

ImageNet-1K · zero-shot

83.9 · TuringViT-24L vs. SigLIP2-L 83.1

ImageNet-A

89.7 · +5.4 over SigLIP2-L (84.3)

Zero-shot retrieval avg.

79.4 · TuringViT-24L · COCO + Flickr30K

Pretraining samples

0.85B · vs. 10B for SigLIP2-L

Zero-shot performance across classification & retrieval

	MobileCLIP2-L	SigLIP2-L	Seed1.5-ViT	TuringViT-18L	TuringViT-24L
Model configuration
Resolution	224	384	dyn.	dyn.	dyn.
Pretraining data	1.9B	10B	4.8B	0.85B	0.85B
Zero-shot classification
Avg	77.1	83.4	82.5	82.7	83.6
ImageNet-1K	81.9	83.1	83.6	83.1	83.9
ImageNet-v2	74.7	77.4	77.6	77.6	78.0
ObjectNet	75.3	84.4	79.2	79.7	81.1
ImageNet-A	69.0	84.3	85.5	87.9	89.7
ImageNet-R	91.7	95.7	95.2	95.0	95.3
ImageNet-Sketch	69.8	75.5	74.1	72.8	73.6
Zero-shot retrieval
Avg	72.5	76.7	—	78.9	79.4
COCO T→I	51.6	55.3	—	58.6	59.4
COCO I→T	69.0	71.4	—	75.9	76.0
Flickr30K T→I	77.2	85.0	—	85.7	86.0
Flickr30K I→T	92.0	95.2	—	95.3	96.3
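
As a quick consistency check on the table, the classification Avg row can be recomputed from the six benchmark rows, and the data budget can be compared with SigLIP2-L. The sketch assumes Avg is an unweighted mean, which is not stated here; the numbers are copied from the table and the variable names are hypothetical.

```python
# Per-model scores in table order: ImageNet-1K, ImageNet-v2, ObjectNet,
# ImageNet-A, ImageNet-R, ImageNet-Sketch.
classification = {
    "SigLIP2-L": [83.1, 77.4, 84.4, 84.3, 95.7, 75.5],
    "TuringViT-18L": [83.1, 77.6, 79.7, 87.9, 95.0, 72.8],
    "TuringViT-24L": [83.9, 78.0, 81.1, 89.7, 95.3, 73.6],
}
reported_avg = {"SigLIP2-L": 83.4, "TuringViT-18L": 82.7, "TuringViT-24L": 83.6}


def mean(values):
    """Unweighted mean (assumed to be how the Avg row is computed)."""
    return sum(values) / len(values)


for name, scores in classification.items():
    print(f"{name}: recomputed {mean(scores):.1f}, reported {reported_avg[name]}")

# Pretraining data in billions of samples, from the table.
data_fraction = 0.85 / 10
print(f"TuringViT uses {data_fraction:.1%} of SigLIP2-L's pretraining data")
```

The recomputed means agree with the reported Avg row to one decimal place, and the data fraction lands close to the 10% headline figure.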