Mosaic: Multimodal Video Retrieval for Ravenous Design Studios

A multimodal search system for dance video archives. 

Project Overview

This project builds a multimodal system that transforms raw video into a structured, queryable data representation. Instead of treating video as a continuous stream, we break it into meaningful segments enriched with text, visual features, and semantic tags. This allows users to search video content using natural language queries and retrieve relevant moments instantly.

The Problem

Unstructured Data

Videos are continuous streams with no structure.

Disconnected Modalities

Speech transcription (Whisper) and object detection (YOLO) run independently and produce no unified representation.

No Temporal Alignment

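YOLO detections are indexed by frame and Whisper transcripts by time, so the two outputs do not line up without an explicit conversion. Until they share a timeline, a query cannot match what is said against what is seen.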
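To make the alignment problem concrete, here is a minimal sketch of mapping a frame index onto a transcript span. It assumes a constant frame rate (the 25.0 used below is an illustrative value, not a measured one) and transcript spans that carry start and end times in seconds. The names `TranscriptSpan`, `frame_to_seconds`, and `span_for_frame` are hypothetical and not taken from our codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranscriptSpan:
    start: float  # seconds from the beginning of the video
    end: float    # seconds, exclusive
    text: str


def frame_to_seconds(frame_index: int, fps: float) -> float:
    """Convert a frame index to a timestamp, assuming a constant frame rate."""
    return frame_index / fps


def span_for_frame(frame_index: int, fps: float,
                   spans: List[TranscriptSpan]) -> Optional[TranscriptSpan]:
    """Return the transcript span that covers the frame, or None if no speech overlaps it."""
    t = frame_to_seconds(frame_index, fps)
    for span in spans:
        if span.start <= t < span.end:
            return span
    return None


spans = [TranscriptSpan(30.0, 33.0, "This is the time when it's dark.")]
match = span_for_frame(750, 25.0, spans)   # frame 750 at 25 fps is t = 30.0 s
silent = span_for_frame(0, 25.0, spans)    # no transcript covers t = 0.0 s
```

A linear scan is enough for short clips; a longer archive would likely want a sorted lookup instead.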

This is not just a model — it is a system

Our goal is to build an infrastructure layer for video understanding by integrating vision, language, and time into a unified representation. We convert video into timestamped segments enriched with both textual and visual metadata, enabling powerful search capabilities across modalities.

Overview of Method

From Video to Structured Data 

Segmentation and tagging convert video into a searchable representation.

System Pipeline

Each segment becomes a structured, searchable unit rather than part of a continuous video stream.

This pipeline converts raw video into structured, searchable segments by aligning text, motion, and semantic tags.
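The segmentation step can be sketched as follows. This is an illustration under stated assumptions, not our exact implementation: it uses fixed 3-second windows (matching the length of the example segments on this page) and derives the solo, duo, or group tag from the median number of people detected in each window. The function names and the input format, a list of `(timestamp_seconds, person_count)` pairs, are hypothetical.

```python
from collections import defaultdict
from statistics import median


def count_tag(people: int) -> str:
    """Map a person count to a tag (assumed rule: 1 is solo, 2 is duo, 3 or more is group)."""
    if people <= 1:
        return "solo"
    if people == 2:
        return "duo"
    return "group"


def segment_detections(detections, window_s=3.0):
    """Group (timestamp_seconds, person_count) pairs into fixed windows and tag each one."""
    buckets = defaultdict(list)
    for t, people in detections:
        buckets[int(t // window_s)].append(people)

    segments = []
    for index in sorted(buckets):
        typical = int(median(buckets[index]))
        if typical < 1:
            continue  # skip windows where no dancer is typically visible
        segments.append({
            "start": index * window_s,
            "end": (index + 1) * window_s,
            "tags": [count_tag(typical)],
        })
    return segments


detections = [(30.1, 1), (31.0, 1), (32.9, 1), (33.2, 2), (34.0, 2)]
segments = segment_detections(detections)
```

Using the median rather than the maximum keeps a single spurious detection from changing a window's tag.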

Dataset

We evaluated our system on 6–8 videos, a mix of sponsor-provided and external clips. Together they contain over 20 minutes of footage, approximately 30,000 frames, and more than 6,000 object detections. The set includes high-motion dance sequences chosen to stress both object detection and temporal alignment.

Exploratory Results

Segments per Video

badbunny: ~75

single ladies: ~8

katseye touch: ~11

Tag Distribution

solo: 23

duo: 24

group: 44

high_energy: 10

medium_energy: 15

low_energy: 69

Example Output

Example 1:
[badbunny] | 30–33 sec
Tags: solo, high_energy
Text: “This is the time when it’s dark.”
Description: solo dancer with fast, high-energy movement

Example 2:
[singleladies] | 9–12 sec
Tags: duo, high_energy
Description: two dancers with fast, high-energy movement

Each segment is converted into a structured unit that supports search and retrieval.
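One way to represent such a unit is a small record type that serializes to JSON. The sketch below encodes Example 1; the class name `Segment` and its field names are assumptions chosen for illustration, and the transcript text is optional because not every segment contains speech (Example 2 has none).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    video: str
    start_s: float
    end_s: float
    tags: List[str] = field(default_factory=list)
    text: Optional[str] = None   # transcript text, if any speech overlaps the segment
    description: str = ""


example = Segment(
    video="badbunny",
    start_s=30,
    end_s=33,
    tags=["solo", "high_energy"],
    text="This is the time when it’s dark.",
    description="solo dancer with fast, high-energy movement",
)

# Serialize to JSON so the segment can be stored or indexed as a plain record.
record = json.dumps(asdict(example))
```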

Results

The system successfully processes raw video into structured data and generates over 100 timestamped segments across multiple videos. Each segment includes multimodal features such as text, motion characteristics, and semantic tags. This enables cross-video retrieval using natural language queries and motion-based filtering.
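As an illustration of retrieval over these segments, the sketch below filters by required tags and ranks by how many query terms appear in a segment's text or description. Plain substring matching is a stand-in chosen for brevity and should not be read as our actual query-matching method; the `search` function and the dictionary layout are hypothetical.

```python
def search(segments, query="", required_tags=()):
    """Return segments that carry every required tag, ranked by matching query terms."""
    terms = query.lower().split()
    hits = []
    for seg in segments:
        if not set(required_tags) <= set(seg["tags"]):
            continue  # motion or count filter: all required tags must be present
        haystack = f'{seg.get("text") or ""} {seg.get("description") or ""}'.lower()
        score = sum(term in haystack for term in terms)
        if terms and score == 0:
            continue  # a text query was given but nothing matched
        hits.append((score, seg))
    # The sort is stable, so ties keep their original archive order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in hits]


archive = [
    {"video": "badbunny", "start": 30, "end": 33,
     "tags": ["solo", "high_energy"],
     "text": "This is the time when it's dark.",
     "description": "solo dancer with fast, high-energy movement"},
    {"video": "singleladies", "start": 9, "end": 12,
     "tags": ["duo", "high_energy"],
     "text": None,
     "description": "two dancers with fast, high-energy movement"},
]

by_text = search(archive, "two dancers")
by_tag = search(archive, required_tags=["solo"])
combined = search(archive, "fast", ["high_energy"])
```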

Technical Challenges

  • Temporal alignment between Whisper (time-based) and YOLO (frame-based)
  • Multimodal integration across vision, language, and time
  • High computational cost of motion detection
  • Complexity of capturing subtle and fast dance movements
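A cheap proxy for motion, shown here only as a sketch, is the average speed of a dancer's bounding-box center across a segment. It avoids per-pixel motion analysis, at the cost of missing movement that keeps the box in place, such as arm or hand gestures. The thresholds of 50 and 200 pixels per second are placeholder values that would need tuning per video resolution, and the function names are hypothetical.

```python
from math import hypot


def mean_speed(track):
    """Average center speed in pixels per second.

    track: list of (t_seconds, cx, cy) box centers for one dancer, sorted by time.
    """
    if len(track) < 2:
        return 0.0
    distance = sum(
        hypot(x1 - x0, y1 - y0)
        for (_, x0, y0), (_, x1, y1) in zip(track, track[1:])
    )
    elapsed = track[-1][0] - track[0][0]
    return distance / elapsed if elapsed > 0 else 0.0


def energy_tag(speed, low=50.0, high=200.0):
    """Bucket a speed into an energy tag using assumed, untuned thresholds."""
    if speed >= high:
        return "high_energy"
    if speed >= low:
        return "medium_energy"
    return "low_energy"


track = [(0.0, 0, 0), (1.0, 300, 400)]   # center moves 500 px in one second
tag = energy_tag(mean_speed(track))
```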

Key Insight

This is not a computer vision problem.
This is a data coordination problem.

Value comes from structuring data, not just detecting it.

Next Steps

Short Term

  • Improve timestamp alignment
  • Reduce segmentation noise

Medium Term

  • Introduce vector embeddings for semantic search
  • Improve ranking of results
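Embedding-based ranking is not built yet, but the idea can be sketched as follows: each segment and each query is mapped to a vector, and results are ordered by cosine similarity. The two-dimensional vectors below are toy values standing in for the output of a real embedding model, and the segment ids are made up for the example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, segment_vecs):
    """Return segment ids ordered from most to least similar to the query vector."""
    scored = [(cosine(query_vec, vec), seg_id) for seg_id, vec in segment_vecs.items()]
    return [seg_id for _, seg_id in sorted(scored, reverse=True)]


# Toy embeddings; a real system would obtain these from an embedding model.
segment_vecs = {"seg_a": [1.0, 0.0], "seg_b": [0.0, 1.0], "seg_c": [0.7, 0.7]}
ordering = rank([1.0, 0.1], segment_vecs)
```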

Long Term

  • Scale to large video datasets
  • Build an interactive search interface

Conclusion

We built a working multimodal pipeline that converts video into structured, searchable data. By aligning vision, language, and time, our system enables efficient query-based retrieval across videos.

Final takeaway:
We make video as searchable as text.

Acknowledgements

Thank you to our sponsor, Kurt Johnston, and to Professor Caliskan for their guidance and support throughout this project.