StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving

1AIR, Tsinghua University, 2The University of Manchester, 3The University of Hong Kong

*Corresponding author: zaiqing@air.tsinghua.edu.cn.

Abstract

Despite the growing capability of end-to-end autonomous driving (E2EAD) systems, personalization remains largely overlooked. Aligning driving behavior with individual user preferences is essential for comfort, trust, and real-world deployment, but existing datasets and benchmarks lack the structure and scale needed to support this goal.

StyleDrive bridges this gap with the following key contributions:

  • Large-scale dataset: A real-world dataset for personalized E2EAD, annotated with both objective behavior and subjective style preferences across diverse traffic scenarios.
  • Multi-stage annotation: A hybrid pipeline combining rule-based heuristics, vision-language model reasoning, and human verification to ensure consistent and interpretable labels.
  • Personalized benchmark: The first benchmark for style-conditioned driving evaluation across different model families.
  • Empirical validation: Extensive experiments show that style-aware models align better with human behavior, demonstrating the value of personalization in autonomous driving.

Method Overview

Overview Figure

The figure illustrates the motivation and overview of StyleDrive. Users increasingly expect AVs not only to drive safely, but to drive like them. Integrating personalization into E2EAD is challenging due to (1) the lack of real-world datasets with style annotations suitable for E2EAD, and (2) the scarcity of architectures that take style preference as a conditioning signal in an end-to-end manner. To bridge this gap, we present the first real-world dataset and benchmark tailored for personalized E2EAD.

The StyleDrive Dataset

StyleDrive is the first large-scale real-world dataset tailored for personalized end-to-end autonomous driving (E2EAD). It is constructed on top of the OpenScene dataset and contains nearly 30,000 driving scenarios across urban and rural environments.

Each scenario is annotated with both:

  • Objective behavioral features: e.g., velocity, acceleration, yaw rate, lane position, motion history.
  • Subjective driving style preferences: categorized as Aggressive, Normal, or Conservative.

To ensure high-quality and interpretable annotations, we propose a multi-stage labeling pipeline:

  • Rule-based heuristics: motion patterns are classified with interpretable, scenario-specific thresholds over speed, jerk, and safety ratio (a minimal sketch of this stage follows this list).
  • VLM-based reasoning: a fine-tuned vision-language model extracts semantic intents from bird's-eye-view images.
  • Human-in-the-loop verification: final labels are verified and calibrated by human annotators for robustness and consistency.
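
For intuition, the rule-based stage can be pictured as a handful of threshold checks that vote on a label. The sketch below is illustrative only: the thresholds, vote counts, and per-scenario speed bands are assumptions rather than the values used to build StyleDrive; the feature names mirror the dataset schema described later.

def rule_based_style(features, scenario_type):
    # Hypothetical per-scenario speed bands in m/s (assumed, not from the paper).
    speed_bands = {"intersection": (6.0, 10.0), "merge": (8.0, 14.0)}
    lo, hi = speed_bands.get(scenario_type, (7.0, 12.0))

    aggressive = 0
    aggressive += features["v_avg"] > hi              # sustained high speed
    aggressive += features["a_max"] > 3.0             # hard acceleration spike
    aggressive += features["unsafe_ratio"] > 0.3      # frequently too close to lead

    conservative = 0
    conservative += features["v_avg"] < lo            # consistently slow
    conservative += features["oversafe_ratio"] > 0.7  # unusually large gaps

    if aggressive >= 2:
        return "Aggressive"
    if conservative >= 2:
        return "Conservative"
    return "Normal"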

The dataset spans 11 real-world traffic scenario types (e.g., intersections, merges, roundabouts), and provides per-scenario safety scores, relative positioning, and temporal style labels.

With its hybrid annotation strategy and diverse behavior coverage, StyleDrive offers a unique resource for modeling, analyzing, and benchmarking human-aligned autonomous driving.

Dataset Construction

The StyleDrive dataset is constructed via a multi-stage annotation pipeline designed to capture both low-level behavior and high-level style preferences, grounded in rich semantic contexts.

The process begins by segmenting raw driving clips into traffic scenarios using high-definition map topology. Within each segment, semantic context—such as proximity to lead vehicles, pedestrians, or lane merges—is extracted through a fine-tuned Vision-Language Model (VLM), enabling interpretable behavioral grounding.

Style labeling is performed in three key stages:

  • Rule-based analysis: scenario-specific thresholds over motion features (speed, acceleration, yaw rate) detect patterns such as rapid acceleration or risky merges.
  • VLM-based semantic inference: VLMs reason over BEV semantic images and infer behavioral intent via multimodal prompts.
  • Human-in-the-loop refinement: annotation outputs are reviewed and refined by human annotators to ensure style consistency and realism.

A risk-aware fusion strategy combines rule-based and VLM outputs, yielding consistent, interpretable, and scalable style labels. The final dataset includes style annotations across 11 real-world traffic scenario types, enabling robust personalized driving policy learning.
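
The exact fusion rule is specified in the paper; as a minimal sketch of what "risk-aware" can mean here, the rule-based label could take precedence whenever kinematic safety statistics flag elevated risk, with the VLM's semantic label trusted otherwise. The threshold below is an assumption.

def fuse_style_labels(rule_label, vlm_label, unsafe_ratio):
    # Agreement needs no arbitration.
    if rule_label == vlm_label:
        return rule_label
    # Risky segment: trust the kinematics-driven rule label.
    if unsafe_ratio > 0.3:
        return rule_label
    # Otherwise defer to the VLM's semantic reading of intent.
    return vlm_label

Disagreements that survive such a fusion step are natural candidates for the human-in-the-loop stage.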

Annotation Framework Diagram

Annotation pipeline integrating topology segmentation, semantic extraction, rule-based analysis, VLM inference, and human verification.

Dataset Structure

Each annotated sample in the StyleDrive dataset encapsulates a wide array of semantic, dynamic, and contextual attributes, enabling rich representation of driving behaviors and personalized preferences. The data schema is designed to support learning models that condition on style, motion, perception, and safety cues.

Below is a representative structure of one data entry:

Field                                        Description
vx_ego, vy_ego, v_ego                        Velocity vector (x, y) and magnitude of the ego vehicle
ax_ego, ay_ego, a_ego                        Acceleration vector and overall acceleration magnitude
yaw, yaw_diff                                Heading angle and angular change across frames
v_avg, v_std                                 Average and variability of velocity (last 10 frames)
vy_max                                       Peak lateral velocity (side-slip indicator)
a_max, a_std, ax_avg                         Max and std of acceleration; average forward acceleration
ini_direction_judge                          Initial motion direction classification
scenario_type                                Semantic scenario type parsed by the VLM (e.g., roundabout, crosswalk)
scene_token                                  Globally unique identifier for each driving scene
has_left_rear / right_rear                   Flags indicating adjacent vehicles to the rear left/right
left_rear_min / right_rear_min               Minimum distances to the left/right rear vehicles
speed_mode                                   Preferred driving style label (Aggressive, Normal, Conservative)
front_frame                                  Temporal distance series to the front vehicle
safe_frame                                   Boolean safety label per frame
safe_ratio, unsafe_ratio, oversafe_ratio     Style-based safety distribution over the entire scenario

The structured schema supports learning from motion, interaction, semantics, and style.
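
For concreteness, a single entry might look as follows. Field names follow the schema above (the right-rear flag is assumed to be named has_right_rear, mirroring the left); all values are fabricated for illustration.

sample = {
    "scene_token": "e93a1b2c4d5f",        # hypothetical identifier
    "scenario_type": "roundabout",
    "vx_ego": 7.8, "vy_ego": 0.3, "v_ego": 7.81,
    "ax_ego": 0.9, "ay_ego": 0.1, "a_ego": 0.91,
    "yaw": 1.42, "yaw_diff": 0.03,
    "v_avg": 7.5, "v_std": 0.6, "vy_max": 0.8,
    "a_max": 2.1, "a_std": 0.5, "ax_avg": 0.7,
    "ini_direction_judge": "straight",
    "has_left_rear": True, "has_right_rear": False,
    "left_rear_min": 6.2, "right_rear_min": None,
    "speed_mode": "Normal",               # subjective style label
    "front_frame": [14.2, 13.8, 13.5],    # distance to lead vehicle per frame
    "safe_frame": [True, True, True],
    "safe_ratio": 0.92, "unsafe_ratio": 0.03, "oversafe_ratio": 0.05,
}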

Traffic Scenario Videos

StyleDrive includes 11 diverse real-world traffic scenarios, capturing personalized driving behaviors across structured and unstructured environments.

StyleDrive Benchmark

The StyleDrive Benchmark introduces a standardized evaluation suite for personalized end-to-end autonomous driving. It extends the StyleDrive dataset into a closed-loop testing environment built upon the NavSim simulator, allowing models to be evaluated in realistic, interactive traffic scenes.

At the core of this benchmark is the Style-Modulated Predictive Driver Model Score (SM-PDMS). This metric jointly captures:

  • Feasibility: lane adherence, goal completion, and safety (e.g., collision-free trajectories, time-to-collision).
  • Style Alignment: how well the generated behaviors reflect the target preference across axes such as comfort, caution, and responsiveness (see the sketch after this list).
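
The precise SM-PDMS definition is given in the paper. As a rough sketch of its structure, one can picture the standard NavSim PDM score, in which hard safety terms gate a weighted average of comfort and progress, modulated by a style-alignment term. The 5/2/5 weights below follow the common PDMS convention; the multiplicative style modulation is an assumption.

def sm_pdms(nc, dac, ttc, comfort, ep, style_alignment):
    # All sub-scores in [0, 1]. NC and DAC act as multiplicative gates,
    # as in the NavSim PDM score.
    feasibility = nc * dac * (5 * ttc + 2 * comfort + 5 * ep) / 12
    # Assumed: scale feasibility by how well the trajectory matches the
    # requested style (1.0 = perfect alignment).
    return feasibility * style_alignment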

Evaluation is conducted under three canonical driving styles — Aggressive, Normal, and Conservative. Each policy is conditioned on a target style label and deployed in multiple real-world driving contexts.

We implement baseline controllers spanning multiple model families — including MLP-based predictors, transformer architectures, and diffusion-policy networks. Experimental results consistently demonstrate that style-aware models achieve higher SM-PDMS scores and produce behaviors more closely aligned with human demonstrations.
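
Architecturally, "conditioning on a target style label" can be as simple as appending a learned style embedding to the scene features before the trajectory head. The PyTorch snippet below shows this generic pattern; it is a sketch, not the exact architecture of any baseline above.

import torch
import torch.nn as nn

class StyleConditionedHead(nn.Module):
    # Generic style-conditioned trajectory head; dimensions are illustrative.
    def __init__(self, feat_dim=256, num_styles=3, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.style_emb = nn.Embedding(num_styles, 32)  # Aggressive/Normal/Conservative
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 32, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * 2),               # one (x, y) waypoint per step
        )

    def forward(self, scene_feat, style_id):
        z = torch.cat([scene_feat, self.style_emb(style_id)], dim=-1)
        return self.mlp(z).view(-1, self.horizon, 2)

Conditioning through a compact embedding leaves the backbone untouched, which is one reason the same recipe can be applied across MLP, transformer, and diffusion families.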

This benchmark provides a reproducible, quantitative platform for evaluating style-conditioned policy learning. It serves as a crucial step toward developing human-aligned, preference-aware driving agents in real-world environments.

Benchmark Performance across Model Families

The following table summarizes the performance of various E2EAD models under style-conditioned evaluation using the SM-PDMS framework (NC: no at-fault collisions; DAC: drivable area compliance; TTC: time-to-collision; Comf.: comfort; EP: ego progress). Higher scores indicate better feasibility and alignment with target driving preferences.

Model                   NC ↑     DAC ↑    TTC ↑    Comf. ↑   EP ↑     SM-PDMS ↑
AD-MLP                  92.63    77.68    83.83    99.75     78.01    63.72
TransFuser              96.74    88.43    91.08    99.65     84.39    78.12
WoTE                    97.29    92.39    92.53    99.13     76.31    79.56
DiffusionDrive          96.66    91.45    90.63    99.73     80.39    79.33
AD-MLP-Style            92.38    73.23    83.14    99.90     78.55    60.02
TransFuser-Style        97.23    90.36    92.61    99.73     84.95    81.09
WoTE-Style              97.58    93.44    93.70    99.26     77.38    81.38
DiffusionDrive-Style    97.81    93.45    92.81    99.85     84.84    84.10

Trajectory Consistency with Human Demonstrations

We further evaluate trajectory-level similarity via L2 error across prediction horizons. Style-aware models consistently reduce average L2 distance, indicating stronger alignment with human-like driving behavior.

Model                   L2 (2s) ↓   L2 (3s) ↓   L2 (4s) ↓   L2 Avg ↓
WoTE                    0.733       1.434       2.349       1.506
AD-MLP                  0.503       1.262       2.383       1.382
TransFuser              0.431       0.963       1.701       1.032
DiffusionDrive          0.471       1.086       1.945       1.167
WoTE-Style              0.673       1.340       2.223       1.412
AD-MLP-Style            0.510       1.230       2.321       1.354
TransFuser-Style        0.424       0.937       1.656       1.006
DiffusionDrive-Style    0.417       0.940       1.646       1.001
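
For reference, the L2 columns can be computed from predicted and ground-truth waypoints as below. Exact conventions vary across papers (displacement at the horizon waypoint vs. averaged up to it, and the waypoint sampling rate), so this is a sketch under the displacement-at-horizon reading with an assumed 2 Hz waypoint rate.

import numpy as np

def l2_errors(pred, gt, hz=2, horizons=(2, 3, 4)):
    # pred, gt: (N, T, 2) arrays of (x, y) waypoints sampled at `hz` per second.
    errs = {}
    for t in horizons:
        idx = t * hz - 1  # waypoint reached at t seconds
        errs[f"L2 ({t}s)"] = np.linalg.norm(pred[:, idx] - gt[:, idx], axis=-1).mean()
    errs["L2 Avg"] = float(np.mean(list(errs.values())))
    return errs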


Qualitative Case Study of Style Effects

To further analyze the impact of style-conditioning, we visualize DiffusionDrive-Style predictions under controlled scenario conditions. The same traffic environment is used while varying only the driving style input: Aggressive, Normal, and Conservative.

Style-Conditioned Trajectories

Red dotted lines represent predicted trajectories generated by the model under each style condition, while green dotted lines show the corresponding human demonstrations. Distinct patterns emerge:

  • Aggressive: higher speeds, sharper turns, assertive merging behavior.
  • Normal: stable lane following, moderate acceleration, balanced risk.
  • Conservative: longer following distances, reduced velocity, safer decisions at intersections.

These visual comparisons illustrate the model's controllability and its capability to align outputs with high-level human preferences across complex driving scenes.

Dataset View

Alternative Annotation Framework

Leaderboard (Coming Soon)

Evaluate your model on the StyleDrive Benchmark and see how it ranks.

📖 Cite StyleDrive

If you find StyleDrive useful in your research, please consider citing:

@article{hao2025styledrive,
  title={StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving},
  author={Hao, Ruiyang and Jing, Bowen and Yu, Haibao and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2506.23982},
  year={2025}
}