StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving

1AIR, Tsinghua University, 2The University of Manchester, 3The University of Hong Kong

*Corresponding author: zaiqing@air.tsinghua.edu.cn.

Abstract

Despite the growing capability of end-to-end autonomous driving (E2EAD) systems, personalization remains largely overlooked. Aligning driving behavior with individual user preferences is essential for comfort, trust, and real-world deployment, but existing datasets and benchmarks lack the structure and scale needed to support this goal.

StyleDrive bridges this gap with the following key contributions:

  • Large-scale dataset: A real-world dataset for personalized E2EAD, annotated with both objective behavior and subjective style preferences across diverse traffic scenarios.
  • Multi-stage annotation: A hybrid pipeline combining rule-based heuristics, vision-language model reasoning, and human verification to ensure consistent and interpretable labels.
  • Personalized benchmark: The first benchmark for style-conditioned driving evaluation across different model families.
  • Empirical validation: Extensive experiments show that style-aware models align better with human behavior, demonstrating the value of personalization in autonomous driving.

Method Overview

Overview Figure

The figure illustrates the motivation and overview of StyleDrive. Users increasingly expect AVs not only to drive safely, but to drive like them. Integrating personalization into E2EAD is challenging due to (1) the lack of real-world datasets with style annotations suitable for E2EAD, and (2) the scarcity of architectures that take style preference as a conditioning signal in an end-to-end manner. To bridge this gap, we present the first real-world dataset and benchmark tailored for personalized E2EAD.

The StyleDrive Dataset

StyleDrive is the first large-scale real-world dataset tailored for personalized end-to-end autonomous driving (E2EAD). It is constructed on top of the OpenScene dataset and contains nearly 30,000 driving scenarios across urban and rural environments.

Each scenario is annotated with both:

  • Objective behavioral features: e.g., velocity, acceleration, yaw rate, lane position, motion history.
  • Subjective driving style preferences: categorized as Aggressive, Normal, or Conservative.

To ensure high-quality and interpretable annotations, we propose a multi-stage labeling pipeline:

  • Rule-based heuristics: motion patterns are classified with interpretable, scenario-specific thresholds over speed, jerk, and safety ratio (a minimal sketch of this stage follows this list).
  • VLM-based reasoning: a fine-tuned vision-language model extracts semantic intents from bird's-eye-view images.
  • Human-in-the-loop verification: final labels are verified and calibrated by human annotators for robustness and consistency.
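
For intuition, the rule-based stage can be pictured as a handful of threshold checks that vote on a label. The sketch below is illustrative only: the thresholds, vote counts, and per-scenario speed bands are assumptions rather than the values used to build StyleDrive; the feature names mirror the dataset schema described later.

def rule_based_style(features, scenario_type):
    # Hypothetical per-scenario speed bands in m/s (assumed, not from the paper).
    speed_bands = {"intersection": (6.0, 10.0), "merge": (8.0, 14.0)}
    lo, hi = speed_bands.get(scenario_type, (7.0, 12.0))

    aggressive = 0
    aggressive += features["v_avg"] > hi              # sustained high speed
    aggressive += features["a_max"] > 3.0             # hard acceleration spike
    aggressive += features["unsafe_ratio"] > 0.3      # frequently too close to lead

    conservative = 0
    conservative += features["v_avg"] < lo            # consistently slow
    conservative += features["oversafe_ratio"] > 0.7  # unusually large gaps

    if aggressive >= 2:
        return "Aggressive"
    if conservative >= 2:
        return "Conservative"
    return "Normal"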

The dataset spans 11 real-world traffic scenario types (e.g., intersections, merges, roundabouts), and provides per-scenario safety scores, relative positioning, and temporal style labels.

With its hybrid annotation strategy and diverse behavior coverage, StyleDrive offers a unique resource for modeling, analyzing, and benchmarking human-aligned autonomous driving.

Dataset Construction

The StyleDrive dataset is constructed via a multi-stage annotation pipeline designed to capture both low-level behavior and high-level style preferences, grounded in rich semantic contexts.

The process begins by segmenting raw driving clips into traffic scenarios using high-definition map topology. Within each segment, semantic context—such as proximity to lead vehicles, pedestrians, or lane merges—is extracted through a fine-tuned Vision-Language Model (VLM), enabling interpretable behavioral grounding.

Style labeling is performed in three key stages:

  • Rule-based analysis: scenario-specific thresholds over motion features (speed, acceleration, yaw rate) detect patterns such as rapid acceleration or risky merges.
  • VLM-based semantic inference: VLMs reason over BEV semantic images and infer behavioral intent via multimodal prompts.
  • Human-in-the-loop refinement: annotation outputs are reviewed and refined by human annotators to ensure style consistency and realism.

A risk-aware fusion strategy combines rule-based and VLM outputs, yielding consistent, interpretable, and scalable style labels. The final dataset includes style annotations across 11 real-world traffic scenario types, enabling robust personalized driving policy learning.
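
The exact fusion rule is specified in the paper; as a minimal sketch of what "risk-aware" can mean here, the rule-based label could take precedence whenever kinematic safety statistics flag elevated risk, with the VLM's semantic label trusted otherwise. The threshold below is an assumption.

def fuse_style_labels(rule_label, vlm_label, unsafe_ratio):
    # Agreement needs no arbitration.
    if rule_label == vlm_label:
        return rule_label
    # Risky segment: trust the kinematics-driven rule label.
    if unsafe_ratio > 0.3:
        return rule_label
    # Otherwise defer to the VLM's semantic reading of intent.
    return vlm_label

Disagreements that survive such a fusion step are natural candidates for the human-in-the-loop stage.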

Annotation Framework Diagram

Annotation pipeline integrating topology segmentation, semantic extraction, rule-based analysis, VLM inference, and human verification.

Dataset Structure

Each annotated sample in the StyleDrive dataset encapsulates a wide array of semantic, dynamic, and contextual attributes, enabling rich representation of driving behaviors and personalized preferences. The data schema is designed to support learning models that condition on style, motion, perception, and safety cues.

Below is a representative structure of one data entry:

Field                                        Description
vx_ego, vy_ego, v_ego                        Velocity vector (x, y) and magnitude of the ego vehicle
ax_ego, ay_ego, a_ego                        Acceleration vector and overall acceleration magnitude
yaw, yaw_diff                                Heading angle and angular change across frames
v_avg, v_std                                 Average and variability of velocity (last 10 frames)
vy_max                                       Peak lateral velocity (side-slip indicator)
a_max, a_std, ax_avg                         Max and std of acceleration; average forward acceleration
ini_direction_judge                          Initial motion direction classification
scenario_type                                Semantic scenario type parsed by the VLM (e.g., roundabout, crosswalk)
scene_token                                  Globally unique identifier for each driving scene
has_left_rear / right_rear                   Flags indicating adjacent vehicles to the rear left/right
left_rear_min / right_rear_min               Minimum distances to the left/right rear vehicles
speed_mode                                   Preferred driving style label (Aggressive, Normal, Conservative)
front_frame                                  Temporal distance series to the front vehicle
safe_frame                                   Boolean safety label per frame
safe_ratio, unsafe_ratio, oversafe_ratio     Style-based safety distribution over the entire scenario

The structured schema supports learning from motion, interaction, semantics, and style.
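
For concreteness, a single entry might look as follows. Field names follow the schema above (the right-rear flag is assumed to be named has_right_rear, mirroring the left); all values are fabricated for illustration.

sample = {
    "scene_token": "e93a1b2c4d5f",        # hypothetical identifier
    "scenario_type": "roundabout",
    "vx_ego": 7.8, "vy_ego": 0.3, "v_ego": 7.81,
    "ax_ego": 0.9, "ay_ego": 0.1, "a_ego": 0.91,
    "yaw": 1.42, "yaw_diff": 0.03,
    "v_avg": 7.5, "v_std": 0.6, "vy_max": 0.8,
    "a_max": 2.1, "a_std": 0.5, "ax_avg": 0.7,
    "ini_direction_judge": "straight",
    "has_left_rear": True, "has_right_rear": False,
    "left_rear_min": 6.2, "right_rear_min": None,
    "speed_mode": "Normal",               # subjective style label
    "front_frame": [14.2, 13.8, 13.5],    # distance to lead vehicle per frame
    "safe_frame": [True, True, True],
    "safe_ratio": 0.92, "unsafe_ratio": 0.03, "oversafe_ratio": 0.05,
}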

Traffic Scenario Videos

StyleDrive includes 11 diverse real-world traffic scenarios, capturing personalized driving behaviors across structured and unstructured environments.

StyleDrive Benchmark

The StyleDrive Benchmark introduces a standardized evaluation suite for personalized end-to-end autonomous driving. It extends the StyleDrive dataset into a closed-loop testing environment built upon the NavSim simulator, allowing models to be evaluated in realistic, interactive traffic scenes.

At the core of this benchmark is the Style-Modulated Predictive Driver Model Score (SM-PDMS). This metric jointly captures:

  • Feasibility: lane adherence, goal completion, and safety (e.g., collision-free trajectories, time-to-collision).
  • Style Alignment: how well the generated behaviors reflect the target preference across axes such as comfort, caution, and responsiveness (see the sketch after this list).
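
The precise SM-PDMS definition is given in the paper. As a rough sketch of its structure, one can picture the standard NavSim PDM score, in which hard safety terms gate a weighted average of comfort and progress, modulated by a style-alignment term. The 5/2/5 weights below follow the common PDMS convention; the multiplicative style modulation is an assumption.

def sm_pdms(nc, dac, ttc, comfort, ep, style_alignment):
    # All sub-scores in [0, 1]. NC and DAC act as multiplicative gates,
    # as in the NavSim PDM score.
    feasibility = nc * dac * (5 * ttc + 2 * comfort + 5 * ep) / 12
    # Assumed: scale feasibility by how well the trajectory matches the
    # requested style (1.0 = perfect alignment).
    return feasibility * style_alignment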

Evaluation is conducted under three canonical driving styles — Aggressive, Normal, and Conservative. Each policy is conditioned on a target style label and deployed in multiple real-world driving contexts.

We implement baseline controllers spanning multiple model families — including MLP-based predictors, transformer architectures, and diffusion-policy networks. Experimental results consistently demonstrate that style-aware models achieve higher SM-PDMS scores and produce behaviors more closely aligned with human demonstrations.
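
Architecturally, "conditioning on a target style label" can be as simple as appending a learned style embedding to the scene features before the trajectory head. The PyTorch snippet below shows this generic pattern; it is a sketch, not the exact architecture of any baseline above.

import torch
import torch.nn as nn

class StyleConditionedHead(nn.Module):
    # Generic style-conditioned trajectory head; dimensions are illustrative.
    def __init__(self, feat_dim=256, num_styles=3, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.style_emb = nn.Embedding(num_styles, 32)  # Aggressive/Normal/Conservative
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 32, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * 2),               # one (x, y) waypoint per step
        )

    def forward(self, scene_feat, style_id):
        z = torch.cat([scene_feat, self.style_emb(style_id)], dim=-1)
        return self.mlp(z).view(-1, self.horizon, 2)

Conditioning through a compact embedding leaves the backbone untouched, which is one reason the same recipe can be applied across MLP, transformer, and diffusion families.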

This benchmark provides a reproducible, quantitative platform for evaluating style-conditioned policy learning. It serves as a crucial step toward developing human-aligned, preference-aware driving agents in real-world environments.

Benchmark Performance across Model Families

The following table summarizes the performance of various E2EAD models under style-conditioned evaluation using the SM-PDMS framework (NC: no at-fault collisions; DAC: drivable area compliance; TTC: time-to-collision; Comf.: comfort; EP: ego progress). Higher scores indicate better feasibility and alignment with target driving preferences.

Model                   NC ↑     DAC ↑    TTC ↑    Comf. ↑   EP ↑     SM-PDMS ↑
AD-MLP                  92.63    77.68    83.83    99.75     78.01    63.72
TransFuser              96.74    88.43    91.08    99.65     84.39    78.12
WoTE                    97.29    92.39    92.53    99.13     76.31    79.56
DiffusionDrive          96.66    91.45    90.63    99.73     80.39    79.33
AD-MLP-Style            92.38    73.23    83.14    99.90     78.55    60.02
TransFuser-Style        97.23    90.36    92.61    99.73     84.95    81.09
WoTE-Style              97.58    93.44    93.70    99.26     77.38    81.38
DiffusionDrive-Style    97.81    93.45    92.81    99.85     84.84    84.10

Trajectory Consistency with Human Demonstrations

We further evaluate trajectory-level similarity via L2 error across prediction horizons. Style-aware models consistently reduce average L2 distance, indicating stronger alignment with human-like driving behavior.

Model                   L2 (2s) ↓   L2 (3s) ↓   L2 (4s) ↓   L2 Avg ↓
WoTE                    0.733       1.434       2.349       1.506
AD-MLP                  0.503       1.262       2.383       1.382
TransFuser              0.431       0.963       1.701       1.032
DiffusionDrive          0.471       1.086       1.945       1.167
WoTE-Style              0.673       1.340       2.223       1.412
AD-MLP-Style            0.510       1.230       2.321       1.354
TransFuser-Style        0.424       0.937       1.656       1.006
DiffusionDrive-Style    0.417       0.940       1.646       1.001
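
For reference, the L2 columns can be computed from predicted and ground-truth waypoints as below. Exact conventions vary across papers (displacement at the horizon waypoint vs. averaged up to it, and the waypoint sampling rate), so this is a sketch under the displacement-at-horizon reading with an assumed 2 Hz waypoint rate.

import numpy as np

def l2_errors(pred, gt, hz=2, horizons=(2, 3, 4)):
    # pred, gt: (N, T, 2) arrays of (x, y) waypoints sampled at `hz` per second.
    errs = {}
    for t in horizons:
        idx = t * hz - 1  # waypoint reached at t seconds
        errs[f"L2 ({t}s)"] = np.linalg.norm(pred[:, idx] - gt[:, idx], axis=-1).mean()
    errs["L2 Avg"] = float(np.mean(list(errs.values())))
    return errs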


Qualitative Case Study of Style Effects

To further analyze the impact of style-conditioning, we visualize DiffusionDrive-Style predictions under controlled scenario conditions. The same traffic environment is used while varying only the driving style input: Aggressive, Normal, and Conservative.

Style-Conditioned Trajectories

Red dotted lines represent predicted trajectories generated by the model under each style condition, while green dotted lines show the corresponding human demonstrations. Distinct patterns emerge:

  • Aggressive: higher speeds, sharper turns, assertive merging behavior.
  • Normal: stable lane following, moderate acceleration, balanced risk.
  • Conservative: longer following distances, reduced velocity, safer decisions at intersections.

These visual comparisons illustrate the model's controllability and its capability to align outputs with high-level human preferences across complex driving scenes.

Dataset View

Alternative Annotation Framework

Leaderboard (Coming Soon)

Evaluate your model on the StyleDrive Benchmark and see how it ranks.

📖 Cite StyleDrive

If you find StyleDrive useful in your research, please consider citing:

@article{hao2025styledrive,
  title={StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving},
  author={Hao, Ruiyang and Jing, Bowen and Yu, Haibao and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2506.23982},
  year={2025}
}