A Matter of Time: Revealing the Structure of Time in Vision-Language Models

Research Overview - Time Assessment Pipeline

Overview: (a) Time prompts and query images serve as inputs, (b) the VLM processes them through its text and image encoders, (c) the time embeddings form a chronological manifold, shown here in 3D, (d) images are mapped to positions on the timeline, (e) the final temporal predictions are output.
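As a rough illustration of steps (a), (b), and (e), the sketch below picks the year whose prompt embedding is most similar to an image embedding. It is a minimal example under stated assumptions, not the pipeline's actual implementation: the 2-D vectors are toy stand-ins for encoder outputs, the prompt wording in the comment is hypothetical, and choosing the most similar prompt by cosine similarity is one plausible readout among several.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def predict_year(image_emb, year_embs):
    """Return the year whose prompt embedding is most similar to the image."""
    return max(year_embs, key=lambda year: cosine(image_emb, year_embs[year]))

# Toy stand-ins for text-encoder outputs of time prompts such as
# "a photo from the year 1950" (the prompt wording is an assumption).
year_embs = {
    1900: [1.0, 0.0],
    1950: [0.7, 0.7],
    2000: [0.0, 1.0],
}

# Toy stand-in for the image-encoder output of one query image.
image_emb = [0.6, 0.8]

predicted = predict_year(image_emb, year_embs)
print(predicted)
```

In a real setting the two dictionaries would be filled by the VLM's text and image encoders, and one prompt would be embedded per candidate year.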

Highlights

Timeline Construction

We discover that temporal information forms a low-dimensional manifold in VLM embedding spaces. Our Bézier curve approach explicitly models chronological progression, enabling efficient temporal inference through geometric timeline representations.

TIME10k Dataset

A comprehensive benchmark with 10,091 temporally annotated images spanning 309 years (1715-2024) across 6 object categories: Aircraft, Cars, Mobile Phones, Music Instruments, Ships, and Weapons & Ammunition.

Time Probing Benchmark

A systematic evaluation of 37 state-of-the-art VLMs reveals substantial temporal awareness. The best performers, EVA-CLIP and OpenCLIP, reach a Mean Absolute Error of 6.2-6.3 years and a Time Awareness Index of 0.85-0.86.

Temporal Manifold Discovery

We show for the first time that temporal information can be represented as a roughly 13-dimensional non-linear manifold within high-dimensional VLM embedding spaces, which enables both analysis and practical applications.

Timeline Construction with Bézier Curves

Timeline Construction: (a) Time embeddings with control points, (b) Bézier curve fitting process, (c) Final timeline with image embeddings mapped to temporal positions.
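Step (c) can be sketched in a few lines once the control points are known. The example below is a minimal illustration, not the authors' implementation: it assumes the control points are already given rather than fitted, it projects a query embedding onto the curve by dense sampling, and it converts the curve parameter t to a year with a linear mapping. All three choices, along with the function names, are assumptions made for this sketch.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        # Repeatedly interpolate between neighbouring points.
        pts = [
            [(1 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]

def project_to_timeline(query, control_points, samples=1000):
    """Return the parameter t of the sampled curve point closest to the query."""
    best_t, best_dist = 0.0, float("inf")
    for i in range(samples + 1):
        t = i / samples
        point = bezier_point(control_points, t)
        dist = sum((a - b) ** 2 for a, b in zip(point, query))
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t

def t_to_year(t, first_year, last_year):
    """Map t linearly onto the year range (a simplifying assumption)."""
    return first_year + t * (last_year - first_year)

# Toy 2-D control points; real embeddings are high-dimensional.
control_points = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
query = (1.0, 1.5)  # toy stand-in for an image embedding

t = project_to_timeline(query, control_points)
year = t_to_year(t, 1715, 2024)
print(t, year)
```

Dense sampling keeps the projection simple and dimension-agnostic; a numerical optimizer over t would be a natural refinement when the curve has many control points.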

Benchmark Results: Performance evaluation of 37 VLMs showing Mean Absolute Error (left) and Time Awareness Index (right) across different model releases and architectures.
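For reference, Mean Absolute Error is the average absolute gap, in years, between the predicted and the annotated year. The snippet below is a small sketch with made-up toy values, not benchmark data. The Time Awareness Index is not defined on this page, so it is left out.

```python
def mean_absolute_error(true_years, predicted_years):
    """Average absolute difference, in years, between labels and predictions."""
    if not true_years or len(true_years) != len(predicted_years):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_years, predicted_years))
    return total / len(true_years)

# Toy values for illustration only (not taken from the benchmark).
true_years = [1950, 1984, 2010, 2024]
predicted_years = [1956, 1980, 2010, 2014]

mae = mean_absolute_error(true_years, predicted_years)
print(mae)  # absolute errors are 6, 4, 0, 10, so the mean is 5.0
```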

Key Contributions

TIME10k Dataset: A temporally annotated dataset with over 10,000 images from 6 classes of objects, enabling systematic evaluation and comparison of VLMs with respect to temporal awareness and time prediction capabilities.

Comprehensive Evaluation: A framework for objectively evaluating time awareness, applied to 37 state-of-the-art VLMs across various backbones, architectures, and prompting strategies.

Timeline Modeling: A novel approach for deriving explicit "timeline" representations from VLM embedding spaces using UMAP and Bézier curve approximation.

Temporal Structure Discovery: The first demonstration that temporal information forms a low-dimensional, non-linear manifold in VLM embedding spaces with strong chronological structure.

Dataset Statistics

Total Images: 10,091 | Object Categories: 6 | Years Covered: 309 | VLMs Evaluated: 37

Aircraft: 69 images (1893-2017) | Cars: 4,393 images (1888-2024) | Mobile Phones: 4,337 images (1984-2024)
Music Instruments: 436 images (1715-2009) | Ships: 841 images (1744-1999) | Weapons & Ammo: 15 images (1939-2003)
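
The per-category figures above are consistent with the headline numbers, which a short tally makes easy to check. The dictionary layout below is just a convenient representation chosen for this sketch; the counts and year ranges are copied from the breakdown above.

```python
# category: (image count, first year, last year), as listed above
categories = {
    "Aircraft": (69, 1893, 2017),
    "Cars": (4393, 1888, 2024),
    "Mobile Phones": (4337, 1984, 2024),
    "Music Instruments": (436, 1715, 2009),
    "Ships": (841, 1744, 1999),
    "Weapons & Ammo": (15, 1939, 2003),
}

total_images = sum(count for count, _, _ in categories.values())
first_year = min(start for _, start, _ in categories.values())
last_year = max(end for _, _, end in categories.values())
years_covered = last_year - first_year

print(total_images, first_year, last_year, years_covered)
# → 10091 1715 2024 309
```

The tally also shows how uneven the categories are: Cars and Mobile Phones together account for 8,730 of the 10,091 images.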

TIME10k Dataset Samples: Representative images from all six object categories spanning different time periods, showcasing the temporal diversity and historical range of our benchmark dataset.

Citation

@article{todo }