A multimodal search system for dance video archives.
Project Overview
This project builds a multimodal system that transforms raw video into a structured, queryable data representation. Instead of treating video as a continuous stream, we break it into meaningful segments enriched with text, visual features, and semantic tags. This allows users to search video content with natural language queries and jump directly to the relevant moments.
The Problem
Unstructured Data
Raw video is a continuous stream of frames with no built-in structure to index or query.
Disconnected Modalities
Speech transcription (Whisper) and object detection (YOLO) run independently, so their outputs share no unified representation.
No Temporal Alignment
Frame-based detection and time-based transcripts are not naturally aligned, preventing effective retrieval.
This is not just a model — it is a system
Our goal is to build an infrastructure layer for video understanding that integrates vision, language, and time into a unified representation. We convert video into timestamped segments enriched with both textual and visual metadata, so a single query can search across modalities.
Overview of Method

From Video to Structured Data

Segmentation + tagging converts video into a searchable representation
System Pipeline

Each segment becomes a structured, searchable unit rather than part of a continuous video stream.
This pipeline converts raw video into structured, searchable segments by aligning text, motion, and semantic tags.
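The tagging step can be illustrated with a minimal sketch. It assumes each segment already has a per-frame person count (for example, from object detections) and a per-frame motion score normalized to the range 0 to 1. The function names, the thresholds, and the use of a median are illustrative assumptions, not the project's actual rules; only the tag names come from this page.

```python
import statistics


def group_tag(person_count: int) -> str:
    """Map a typical person count to a grouping tag."""
    if person_count <= 1:
        # 0 or 1 detected person; this sketch has no separate "empty" tag
        return "solo"
    if person_count == 2:
        return "duo"
    return "group"


def energy_tag(motion_score: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a normalized motion score to an energy tag (thresholds are assumed)."""
    if motion_score >= high:
        return "high_energy"
    if motion_score >= low:
        return "medium_energy"
    return "low_energy"


def tag_segment(person_counts: list[int], motion_scores: list[float]) -> list[str]:
    """Tag one segment from its per-frame person counts and motion scores."""
    # median_low always returns an observed count, never a fractional value
    typical_count = statistics.median_low(person_counts)
    mean_motion = statistics.fmean(motion_scores)
    return [group_tag(typical_count), energy_tag(mean_motion)]


print(tag_segment([1, 1, 2], [0.9, 0.8, 0.7]))
```

Using the median rather than the maximum keeps a single spurious detection from turning a solo segment into a duo.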
Dataset
We evaluated our system on a diverse dataset of 6–8 videos, including both sponsor-provided and external clips. The dataset contains over 20 minutes of footage, approximately 30,000 frames, and more than 6,000 object detections. It includes high-motion dance sequences that stress both object detection and temporal alignment.
Exploratory Results
Segments per Video
badbunny: ~75
single ladies: ~8
katseye touch: ~11
Tag Distribution
solo: 23
duo: 24
group: 44
high_energy: 10
medium_energy: 15
low_energy: 69
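To make the balance of tags easier to read, the counts above can be turned into shares within each tag family. This is a small sketch; the split into a "grouping" family and an "energy" family is our reading of the tag names, not something the pipeline is stated to enforce.

```python
# Tag counts copied from the distribution above
grouping = {"solo": 23, "duo": 24, "group": 44}
energy = {"high_energy": 10, "medium_energy": 15, "low_energy": 69}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each tag's fraction of its family's total."""
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


for family in (grouping, energy):
    for tag, share in shares(family).items():
        print(f"{tag}: {share:.1%}")
```

On these counts, low_energy accounts for roughly 73% of the energy tags and group for roughly 48% of the grouping tags.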
Example Output
Example 1:
[badbunny] | 30–33 sec
Tags: solo, high_energy
Text: “This is the time when it’s dark.”
Description: solo dancer with fast, high-energy movement
Example 2:
[singleladies] | 9–12 sec
Tags: duo, high_energy
Description: two dancers with fast, high-energy movement
Each segment is converted into a structured unit that supports search and retrieval.
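One way to picture such a structured unit is as a small record type. The sketch below encodes Example 1 as a Python dataclass; the class name and field names are hypothetical and chosen only to mirror the fields shown in the examples.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One timestamped, searchable unit of a video (field names are illustrative)."""
    video_id: str
    start_sec: float
    end_sec: float
    text: str = ""                                 # transcript text for this time span
    tags: list[str] = field(default_factory=list)  # semantic tags such as "solo"
    description: str = ""                          # short visual summary

    @property
    def duration(self) -> float:
        """Length of the segment in seconds."""
        return self.end_sec - self.start_sec


# Example 1 from above, expressed as a record
seg = Segment(
    video_id="badbunny",
    start_sec=30.0,
    end_sec=33.0,
    text="This is the time when it’s dark.",
    tags=["solo", "high_energy"],
    description="solo dancer with fast, high-energy movement",
)
print(seg.video_id, seg.duration, seg.tags)
```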
Results
The system successfully processes raw video into structured data and generates over 100 timestamped segments across multiple videos. Each segment includes multimodal features such as text, motion characteristics, and semantic tags. This enables cross-video retrieval using natural language queries and motion-based filtering.
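A minimal sketch of what retrieval over such segments could look like is shown below. It uses plain keyword matching plus a tag filter as a stand-in for the system's query handling, which this page does not specify. The `search` function and the dictionary layout are assumptions, and the two records are the examples shown earlier.

```python
segments = [
    {"video": "badbunny", "start": 30, "end": 33,
     "tags": ["solo", "high_energy"],
     "text": "This is the time when it’s dark.",
     "description": "solo dancer with fast, high-energy movement"},
    {"video": "singleladies", "start": 9, "end": 12,
     "tags": ["duo", "high_energy"],
     "text": "",
     "description": "two dancers with fast, high-energy movement"},
]


def search(segments, query, required_tags=()):
    """Return segments carrying all required tags, ranked by matched query terms."""
    terms = query.lower().split()
    scored = []
    for seg in segments:
        # Tag filter: every required tag must be present on the segment
        if not set(required_tags) <= set(seg["tags"]):
            continue
        haystack = f'{seg["text"]} {seg["description"]}'.lower()
        score = sum(term in haystack for term in terms)
        # Keep segments that match at least one term (or all, if the query is empty)
        if score > 0 or not terms:
            scored.append((score, seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored]


for hit in search(segments, "fast dancers", required_tags=["high_energy"]):
    print(hit["video"], f'{hit["start"]}-{hit["end"]} sec')
```

Because every result carries its video name and time span, a hit points to a specific moment rather than to a whole file.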
Technical Challenges
- Temporal alignment between Whisper (time-based) and YOLO (frame-based)
- Multimodal integration across vision, language, and time
- High computational cost of motion detection
- Complexity of capturing subtle and fast dance movements
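The first challenge can be made concrete with a short sketch. It assumes a constant frame rate and transcript spans given as (start, end) pairs in seconds, sorted and non-overlapping. The helper names are hypothetical, and a real pipeline would also need to handle variable frame rates and gaps between spans.

```python
from bisect import bisect_right


def frame_to_seconds(frame_idx: int, fps: float) -> float:
    """Convert a frame index to a timestamp, assuming a constant frame rate."""
    return frame_idx / fps


def assign_frames_to_spans(frame_indices, fps, spans):
    """Group frame indices by the time span that contains them.

    spans: list of (start_sec, end_sec), sorted by start and non-overlapping.
    Frames that fall outside every span are dropped.
    """
    starts = [start for start, _ in spans]
    grouped = {i: [] for i in range(len(spans))}
    for frame_idx in frame_indices:
        t = frame_to_seconds(frame_idx, fps)
        i = bisect_right(starts, t) - 1  # last span starting at or before t
        if i >= 0 and t < spans[i][1]:
            grouped[i].append(frame_idx)
    return grouped


spans = [(0.0, 3.0), (3.0, 6.0)]
print(assign_frames_to_spans([0, 45, 90, 150, 200], fps=30, spans=spans))
```

Binary search keeps the lookup cheap even with many spans, which matters when a video yields thousands of detection frames.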
Key Insight
This is not a computer vision problem.
This is a data coordination problem.
Value comes from structuring data, not just detecting it.
Next Steps
Short Term
- Improve timestamp alignment
- Reduce segmentation noise
Medium Term
- Introduce vector embeddings for semantic search
- Improve ranking of results
Long Term
- Scale to large video datasets
- Build an interactive search interface
Conclusion
We built a working multimodal pipeline that converts video into structured, searchable data. By aligning vision, language, and time, our system enables efficient query-based retrieval across videos.
Final takeaway:
We make video as searchable as text.
Acknowledgements
Thank you to our sponsor Kurt Johnston and Professor Caliskan for their guidance and support throughout this project.