URMC-Otolaryngology: Predicting Sphenoidotomy Performance in Residents

I. Introduction

Reliable assessment of surgical performance is essential for resident education, feedback, and progression within procedural training. Although attending-based evaluations such as the Objective Structured Assessment of Technical Skills (OSATS) provide clinically meaningful judgments of technical skill, these assessments remain partly subjective and may vary across evaluators and operative settings. As motion-tracking systems become more available in surgical education, an important question is whether objective movement-derived metrics can help explain or predict expert ratings of resident performance.

In this study, we investigated whether motion-tracking and procedural features could predict OSATS-related outcomes in sphenoidotomy. By comparing baseline, motion-only, and engineered mixed models, we aimed to determine whether objective motion and procedural data capture meaningful signal related to attending-rated technical skill. We also investigated whether resident-based kinematic features can predict resident experience level, comparing baseline nonlinear models with more complex models. More broadly, this work explores the potential role of quantitative motion analysis as a complementary approach to expert surgical assessment.

II. Data Overview

To analyze the performance and experience levels of Otolaryngology residents during sphenoidotomy, we utilized data from four distinct sources. A key characteristic of the collected data is its paired structure, where attending surgeons and residents operate on the same case on opposite nostrils. This enables direct comparison of performance under similar clinical conditions. This data is categorized into three primary domains:

Survey Data: Includes two post-sphenoidotomy surveys: an attending survey evaluating technical features and resident performance, and a resident self-assessment survey.

Motion Data: Objective kinematic data captured from the surgical device to measure movement and instrument efficiency.

Experience Level Data: A longitudinal dataset tracking total surgical volume and specific Endoscopic Sinus Surgery (ESS) cases.

The dataset included 8 residents and 2 attendings across approximately 45–51 surgical procedures.

Within each case, the attending operated on one sphenoid and the resident on the other, which gives the data its paired structure.

Overall, the survey data provides subjective ratings, motion data captures objective kinematic metrics, and experience data defines proficiency levels based on surgical volume.

III. Exploratory Data Analysis

The EDA focused on transforming raw surgical motion data into features that better captured operative efficiency and resident–attending differences. Time metrics were decomposed into static, idle, and active times, path-based efficiency measures were created, and skewed variables such as time, distance, and jerk were log-transformed to reduce the influence of extreme cases. Because each resident and attending worked on the same case, the data was reorganized into a paired wide format, allowing direct comparison through raw and percent difference metrics.
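The paired reshaping and difference metrics described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the field names (`case_id`, `role`, `total_time`), the use of `log1p`, and the choice of the attending value as the percent-difference denominator are all assumptions, and the numbers are made up.

```python
import math
from collections import defaultdict

def to_paired_wide(rows, metric):
    """Pivot long rows (one per case and role) into {case_id: {role: value}}.

    Only cases that have both a resident and an attending value are kept.
    """
    wide = defaultdict(dict)
    for row in rows:
        wide[row["case_id"]][row["role"]] = row[metric]
    return {case: roles for case, roles in wide.items()
            if roles.keys() >= {"resident", "attending"}}

def gap_features(resident, attending):
    """Raw, percent, and log-scale resident-minus-attending differences."""
    raw = resident - attending
    return {
        "raw_diff": raw,
        # Assumed convention: percent difference relative to the attending.
        "pct_diff": 100.0 * raw / attending,
        # log1p keeps zero values finite and damps extreme cases.
        "log_diff": math.log1p(resident) - math.log1p(attending),
    }

# Hypothetical long-format records (values are illustrative only).
rows = [
    {"case_id": 1, "role": "resident", "total_time": 150.0},
    {"case_id": 1, "role": "attending", "total_time": 100.0},
    {"case_id": 2, "role": "resident", "total_time": 90.0},  # no attending row
]
paired = to_paired_wide(rows, "total_time")
features = {case: gap_features(v["resident"], v["attending"])
            for case, v in paired.items()}
```

Case 2 is dropped because it lacks an attending measurement, which mirrors the requirement that every comparison be made within a matched case.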

Overall, the exploratory analysis showed clear differences between resident and attending motion patterns. Residents generally required more time, had longer idle and non-static periods, and produced larger movement paths, suggesting lower motion efficiency and less direct instrument use. Some of the strongest differences appeared in mean static time, scope idle time, scope non-static time, and Hosemann total path length. Jerk metrics showed a more mixed pattern: attendings sometimes had higher jerk values, which may reflect faster and more decisive movements rather than poorer control.

Experience and performance analyses suggested that sphenoidotomy-specific experience was more meaningful than total surgical experience. Residents with more sphenoidotomy cases tended to receive higher OSATS-related performance scores, especially for time-motion skills, while total case volume showed weaker and less consistent relationships with performance.

Regression analysis further showed that performance was most closely tied to efficiency-related motion features rather than every motion metric equally. In particular, suction-related time inefficiency was negatively associated with performance, meaning residents who spent proportionally more time using suction relative to attendings tended to receive lower ratings. Normalized movement-efficiency measures were also meaningful predictors, suggesting that strong performance depends not only on how long instruments are used, but how efficiently motion is produced during active operative time.

IV. Statistical Hypothesis Testing

We tested whether resident performance was related to operative participation, case difficulty, experience level, and motion-tracking features. Because the data included ordinal ratings, paired resident-attending comparisons, and non-normal motion variables, we used Spearman correlations, Wilcoxon signed-rank tests, and Cohen’s kappa.
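For readers unfamiliar with two of the statistics named above, the sketch below computes Spearman's rho and unweighted Cohen's kappa from scratch. It is illustrative only: the example values are invented, the report does not state which rating pairs kappa was applied to, and in practice a statistics library would normally be used instead.

```python
from collections import Counter
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical data: completion percentage vs. an ordinal rating.
completion = [20, 40, 60, 80, 100]
rating = [2, 2, 3, 4, 5]
rho = spearman_rho(completion, rating)
kappa = cohens_kappa([1, 1, 2, 2], [1, 2, 1, 2])
```

Rank-based correlation is a natural fit here because the OSATS ratings are ordinal, so only the ordering of scores is used, not their spacing.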

Greater participation was linked to higher OSATS ratings
Residents who completed a larger portion of the procedure tended to receive higher attending-rated technical skill, flow, and overall performance scores.
Key result: ρ = 0.44–0.52, p ≤ 0.004

Higher technical challenge was linked to lower performance
More difficult cases were associated with lower attending-rated time/motion skill, flow, and overall sphenoidotomy performance.
Key result: ρ = -0.325 to -0.385, p < 0.05

OSATS domains were internally consistent
Technical skill, flow, and overall performance were strongly correlated, suggesting that these rating categories captured closely related aspects of operative competency.
Key result: ρ = 0.68–0.75, p < 0.001

Residents also recognized performance patterns
Resident self-ratings showed similar trends: greater participation was associated with higher self-rated performance, while higher technical challenge was associated with lower self-rated performance.
Key result: participation ρ = 0.52, p < 0.001; challenge ρ = -0.41, p = 0.008

Motion differences were strongest in efficiency metrics
Paired resident-attending comparisons showed significant differences in total time, idle time, and static time, indicating that the clearest motion-based gap was procedural efficiency.
Key result: total time p = 0.000008; idle time p = 0.000024; static time p = 0.000029
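The paired comparison above relies on the Wilcoxon signed-rank test. The sketch below shows the mechanics using the large-sample normal approximation, with no tie or continuity correction; that simplification is an assumption made for brevity, and the times are invented. A library routine such as `scipy.stats.wilcoxon` would be the usual choice for a real analysis, especially with small samples.

```python
from math import erfc, sqrt

def wilcoxon_signed_rank(resident, attending):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Zero differences are dropped; no tie or continuity correction is applied.
    """
    diffs = [r - a for r, a in zip(resident, attending) if r != a]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(rank for rank, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean) / sd
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return statistic, p_value

# Hypothetical total times in seconds for ten matched cases.
resident_time = [310, 295, 420, 380, 350, 400, 330, 365, 390, 345]
attending_time = [200, 210, 250, 240, 230, 260, 220, 235, 245, 225]
statistic, p_value = wilcoxon_signed_rank(resident_time, attending_time)
```

Because the test works on within-case differences, it uses the matched design directly and does not assume the motion metrics are normally distributed.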

V. Experience Level Prediction

A resident-only machine learning model was built to classify sphenoidotomy procedures as Advanced or Non-Advanced using motion-based features. To keep the model generalizable and avoid leakage, we removed attending-only variables, resident/attending ratios, identifiers, administrative fields, and direct experience labels. The final dataset included 40 procedures from 8 residents, split evenly into 20 Advanced and 20 Non-Advanced cases after combining the single beginner case with the intermediate group.

Starting from 111 motion features, LASSO logistic regression was used to narrow the feature set to 12 predictors. Next, highly correlated/redundant variables were removed to produce a final set of four resident-focused features capturing motion efficiency, smoothness, and performance-related behavior. These features were used in a tuned RBF-kernel SVM with median imputation, standardization, and GridSearchCV hyperparameter tuning. Because the dataset was small, performance was evaluated using leave-one-out cross-validation.
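The final classification step can be sketched with scikit-learn as shown below. This is a hedged illustration, not the project's code: the hyperparameter grid, the inner 5-fold search, and the synthetic data are assumptions, and the LASSO and correlation-filtering steps are omitted. Imputation, scaling, and tuning all sit inside the pipeline so that they are refit within each leave-one-out split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_search():
    """Median imputation -> standardization -> RBF SVM, tuned by grid search."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("svm", SVC(kernel="rbf")),
    ])
    # Assumed grid; the report does not list the values that were searched.
    param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1, 1.0]}
    return GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")

def loocv_accuracy(X, y):
    """Leave-one-out accuracy with tuning repeated inside every outer split."""
    predictions = cross_val_predict(build_search(), X, y, cv=LeaveOneOut())
    return float(np.mean(predictions == y))

# Synthetic stand-in for 40 procedures x 4 features (not study data).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)                       # 20 Non-Advanced, 20 Advanced
X = rng.normal(size=(40, 4)) + 2.0 * y[:, None]
X[rng.integers(0, 40, size=5), rng.integers(0, 4, size=5)] = np.nan
accuracy = loocv_accuracy(X, y)
```

Nesting the grid search inside the outer loop is one way to keep the held-out procedure from influencing hyperparameter choice; whether the original analysis nested the search this way is not stated.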

Under a permutation test, the observed accuracy of 0.85 corresponded to a p-value of 0.0020, meaning that only 0.2% of the permuted runs achieved similar or better performance. This suggests that the model captures a real relationship between the kinematic features and surgical expertise rather than fitting noise, even with the small sample size.
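A permutation p-value of this kind can be computed as in the sketch below. The scoring function here is a toy threshold classifier on one made-up feature, standing in for the cross-validated SVM accuracy; the number of permutations and the add-one correction in the p-value are assumptions, since the report does not state them.

```python
import random
from statistics import median

def permutation_p_value(observed, null_scores):
    """Share of permuted scores at least as good as the observed score.

    The +1 terms count the observed run itself, so p is never exactly 0.
    """
    n_as_good = sum(score >= observed for score in null_scores)
    return (n_as_good + 1) / (len(null_scores) + 1)

def permutation_test(score_fn, features, labels, n_permutations=1000, seed=0):
    """Score the real labels, then re-score after shuffling the labels."""
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    null_scores = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # breaks any real feature-label relationship
        null_scores.append(score_fn(features, shuffled))
    return observed, permutation_p_value(observed, null_scores)

def threshold_accuracy(xs, ys):
    """Toy scorer: predict class 1 when the feature exceeds its median."""
    cutoff = median(xs)
    return sum((x > cutoff) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical single feature that separates the two groups perfectly.
feature = [float(i) for i in range(40)]
labels = [0] * 20 + [1] * 20
observed, p_value = permutation_test(threshold_accuracy, feature, labels)
```

Shuffling the labels keeps the feature distribution and class balance intact while removing any true association, so the null scores show what accuracy chance alone can produce on a dataset of this size.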

VI. OSATS

We tested whether objective motion-tracking and procedural variables could predict attending-rated OSATS performance. The main outcome was attending-rated resident time-motion skill, while overall sphenoidotomy performance was used as a secondary outcome.

Model	Target	Main Predictors	R²	RMSE	MAE
Elastic Net baseline	Time-motion skill	Experience, completion %, idle-time difference	0.552	0.694	0.564
Gradient Boosting	Time-motion skill	Experience, completion %, idle-time difference, engineered gap features	0.634	0.628	0.500
Elastic Net baseline	Overall performance	Experience, completion %, idle-time difference	0.532	0.660	0.551
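For reference, the three metrics reported in the table can be computed as in this short sketch. The ratings and predictions are invented, and the report does not state whether its values were computed on held-out folds or in-sample, so this shows only the formulas.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)             # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Hypothetical attending ratings and model predictions.
ratings = [2, 3, 3, 4, 5]
predictions = [2.5, 3.0, 3.5, 4.0, 4.5]
metrics = regression_metrics(ratings, predictions)
```

RMSE and MAE are in rating-scale units, so an MAE near 0.5 means predictions are off by about half a rating point on average, while R² expresses the share of rating variance the model accounts for.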

Baseline model showed meaningful prediction
The Elastic Net model used prior sphenoidotomy experience, attending-rated sphenoidotomy completion percentage, and mean idle-time difference.
Key result: R² = 0.552 for predicting time-motion skill

Gradient Boosting performed best
Adding engineered resident-attending motion-gap features improved prediction of attending-rated time-motion skill.
Key result: R² = 0.634, RMSE = 0.628, MAE = 0.500

Top predictors were clinically interpretable
The strongest predictors were prior sphenoidotomy experience, mean idle-time difference, and attending-rated sphenoidotomy completion percentage.

Main takeaway: performance was most strongly linked to experience, procedural completion, and efficiency.

Overall performance was also predictable
The same baseline predictor set performed well for attending-rated overall sphenoidotomy performance.
Key result: R² = 0.532, RMSE = 0.660, MAE = 0.551

Main conclusion
OSATS ratings can be meaningfully predicted using a combination of resident experience, procedural completion, and time-efficiency features. This supports motion-tracking data as an objective supplement to traditional attending-based surgical assessment.

VII. K-Means Clustering

We used unsupervised clustering to explore whether surgical motion data naturally formed meaningful groups without relying on performance labels. Three K-means clustering approaches were tested: resident-only motion features, combined resident and attending motion features, and paired resident–attending difference features. Across these analyses, the clusters reflected distinct motion patterns involving scope path length, jerk, active time, idle time, and non-scope movement behavior.

The most informative results came from clustering the paired resident–attending difference features, which preserved the matched structure of the dataset. These clusters were strongly separated by scope-based motion differences, including active path length, jerk, active time, idle time, and path efficiency measures. Unlike the resident-only clusters, the difference-based clusters showed a significant relationship with sphenoidotomy-specific experience and a near-significant relationship with attending-rated overall performance. This suggests that the motion gap between a resident and the attending on the same case may capture clinically meaningful variation better than resident motion alone.
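Clustering on paired difference features can be sketched as below. This is a plain implementation of Lloyd's K-means algorithm for illustration; the deterministic farthest-point initialization, the z-score scaling, the choice of k = 2, and the two made-up gap features are all assumptions and do not reproduce the study's configuration.

```python
from math import dist

def standardize(rows):
    """Z-score each column so no single metric dominates the distances."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    # A constant column would give sd = 0, so fall back to 1.0.
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

# Hypothetical (idle-time gap, path-length gap) pairs for eight cases.
gaps = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels, centers = kmeans(standardize(gaps), k=2)
```

Cluster labels obtained this way can then be cross-tabulated against experience or rating categories to check whether the unsupervised groups line up with clinically meaningful ones.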

We also explored an attending-based similarity approach, where PCA and distance-based scoring were used to compare each resident’s motion profile to attending motion patterns. More experienced residents generally had higher similarity scores, while less experienced residents showed larger improvement over time as their motion patterns moved closer to attending behavior. Overall, the clustering analysis showed that unsupervised learning can reveal hidden surgical motion patterns and may provide a useful foundation for automated, objective feedback in surgical training.
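One way such a similarity score could be built is sketched below. The report does not specify the scoring rule, so everything here is an assumption: PCA is fit on standardized attending data only, residents are projected into that space, and similarity is defined as 1 / (1 + distance to the attending centroid). The feature values are invented.

```python
import numpy as np

def fit_attending_pca(attending, n_components=2):
    """Standardize attending rows and return (mean, scale, principal axes)."""
    mean = attending.mean(axis=0)
    scale = attending.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for constant features
    z = (attending - mean) / scale
    # Rows of vt are the principal axes of the standardized attending data.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, scale, vt[:n_components]

def similarity_scores(residents, mean, scale, components):
    """Map each resident row to (0, 1]; 1 means identical to the centroid."""
    z = (residents - mean) / scale
    projected = z @ components.T
    # The attending centroid sits at the origin of the projected space.
    distance = np.linalg.norm(projected, axis=1)
    return 1.0 / (1.0 + distance)

# Hypothetical two-feature motion profiles.
attending = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
residents = np.array([[2.5, 2.5], [10.0, 10.0]])
mean, scale, components = fit_attending_pca(attending)
scores = similarity_scores(residents, mean, scale, components)
```

Fitting the projection on attending data alone makes the attendings the reference frame, so a rising score over successive cases would indicate a resident's motion profile moving toward attending behavior.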

VIII. Conclusions

This study investigated how well surgical motion-tracking data could predict resident performance during sphenoidotomy and support more objective feedback for surgical training. Exploratory analysis, hypothesis testing, and preliminary supervised and unsupervised modeling showed that time-efficiency metrics were the strongest indicators of attending OSATS ratings.

Residents with more experience generally received higher OSATS scores, especially in time-motion skills. Motion analysis further showed that resident-attending differences were most consistent in temporal efficiency, including higher scope idle time, static time, and mean idle time for residents. In contrast, path length and jerk metrics were less consistently informative, suggesting that performance gaps were driven more by how efficiently residents used time than by overall movement magnitude or smoothness.

Experience-level classification also showed promising results. Using only four resident motion features, an SVM classifier achieved 85% accuracy under LOOCV in distinguishing advanced from non-advanced residents. This suggests that motion features alone may support scalable proficiency assessment without requiring paired attending observations.

The main limitation was the small sample size, so future work should validate these findings with more data. Additional integration of computer vision could help capture visual and spatial aspects of performance, supporting the development of an automated resident feedback tool to complement existing surgical assessment methods.

IX. Acknowledgements

We would like to acknowledge the URMC Otolaryngology team, particularly Dr. Schmale and Dr. Ryan, for their sponsorship and support throughout this project. We also sincerely thank Professor Anand of the Goergen Institute for Data Science for his guidance and expertise.

X. References

Lindgren, F., Hansen, B. G., Karcher, W., Sjöström, M., & Eriksson, L. (1996). Model validation by permutation tests: Applications to variable selection. Journal of Chemometrics, 10(5-6), 521–532. https://doi.org/10.1002/(sici)1099-128x(199609)10:5/6%3C521::aid-cem448%3E3.0.co;2-j
Lam, K., Cheng, A., Haider, H., Jafferji, M., Dasgupta, P., Ahmed, K., & Khan, M. S. (2022). Machine learning for technical skill assessment in surgery: A systematic review. NPJ Digital Medicine, 5, Article 24.