
Rochester Transit Service


Yihe Chen

Harry Huang

Junting Chen

Kehan Yu


Cantay Caliskan


Predictive Analytics for Demand Responsive Para-transportation

Vision & Goal

● Create a productive schedule for Demand Responsive Para-transportation by predicting customer cancellations.

● Provide executable Python code and classification model.

● Discover best performance metrics.

● Generate well-organized supporting

Data Overview

Internal data

●  We acquired the internal data from our sponsor

●  Our original dataset contains 102,754 observations and 21 explanatory variables, covering May 17, 2021 to December 5, 2021.

External Data

●  We acquired daily weather information from NOAA (National Oceanic and Atmospheric Administration)

Data Visualization

Feature engineering

●  Created a label for the cancellation (1 for canceled, 0 for performed)

●  Transformed ‘date’ variables into informative variables (e.g., month, day, weekday)

● Encoded categorical variables

● Aggregated passengers by type (with children, need lift)
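The label and date steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the field names (`trip_date`, `status`) and the status values are assumptions, not the sponsor's actual schema, and the encoding and passenger aggregation steps are left out.

```python
from datetime import date

# Hypothetical trip records; field names and status values are assumptions.
trips = [
    {"trip_date": "2021-05-17", "status": "canceled"},
    {"trip_date": "2021-12-05", "status": "performed"},
]

def engineer(trip):
    """Derive the cancellation label and calendar features for one trip."""
    d = date.fromisoformat(trip["trip_date"])
    return {
        "canceled": 1 if trip["status"] == "canceled" else 0,  # 1 = canceled, 0 = performed
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),  # Monday = 0 ... Sunday = 6
    }

features = [engineer(t) for t in trips]
for row in features:
    print(row)
```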


Handle the Class Imbalance

Random Forest with SMOTE

Accuracy: 81.5% -> 84.8%

Precision: 32.2% -> 62.1%

Recall: 53.9% -> 57.7%

Precision improves significantly (a relative increase of 92.8%), while recall and accuracy improve slightly
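For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea, not the code used in this project; the function name `smote_sketch` and the toy data are hypothetical. In practice, a library implementation such as `SMOTE` from imbalanced-learn would normally be used, and oversampling should be applied to the training split only.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority-class feature vectors (illustrative values only).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic samples created")
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them.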

Weighted Random Forest Classifier

Compared to the standard Random Forest Classifier, the Weighted Random Forest Classifier penalizes misclassification of the minority class (canceled trips) more heavily
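The source does not state which weighting scheme was used. One common choice, and the formula behind scikit-learn's `class_weight="balanced"` option, is to weight each class by n_samples / (n_classes * class_count), so the rarer class receives the larger weight. A minimal sketch, using a hypothetical 20/80 label split:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical labels: 20 canceled trips (1) and 80 performed trips (0).
labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a canceled trip costs four times as much as an error on a performed trip during training.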

Confusion matrix

                      Actually Canceled   Actually Uncanceled
Predicted Canceled    TP = 1808           FP = 598
Predicted Uncanceled  FN = 2207           TN = 15938

XGBoost Classifier

Accuracy: 81.5% -> 86.1%

Precision: 32.2% -> 65.2%

Recall: 53.9% -> 62.3%

Confusion matrix

                      Actually Canceled   Actually Uncanceled
Predicted Canceled    TP = 2503           FP = 1336
Predicted Uncanceled  FN = 1512           TN = 15200
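As a check, accuracy, precision, and recall can be recomputed directly from the two confusion matrices reported here. The helper name `metrics` is illustrative. The XGBoost counts reproduce the reported 86.1% accuracy, 65.2% precision, and 62.3% recall; the Weighted Random Forest counts give about 86.4% accuracy, 75.1% precision, and 45.0% recall.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),  # share of predicted cancellations that were real
        "recall": tp / (tp + fn),     # share of real cancellations that were caught
    }

# Counts taken from the confusion matrices in this section.
weighted_rf = metrics(tp=1808, fp=598, fn=2207, tn=15938)
xgboost = metrics(tp=2503, fp=1336, fn=1512, tn=15200)

for name, m in [("Weighted Random Forest", weighted_rf), ("XGBoost", xgboost)]:
    print(name, {k: f"{v:.1%}" for k, v in m.items()})
```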

Key Insights

● Our sponsor (RTS) has an extra bus on standby to cover any missing cases.

● During busy hours (from 8 am to 3 pm):

○ Running the extra bus excessively is costly when the prediction is not precise
○ It’s better to use the Weighted Random Forest Classifier, which gives the highest precision

● During other times:

○ Running extra buses is less costly since fewer clients use the service
○ It’s better to use the XGBoost Classifier, which balances recall (covering more canceled trips) and precision (making fewer errors when predicting cancellations)
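The recommendation above amounts to a simple routing rule on the trip's time of day. A minimal sketch follows; the function name, the returned model labels, and the boundary handling (8:00 am inclusive, 3:00 pm exclusive) are assumptions made for illustration.

```python
from datetime import time

# Busy window taken from the insight above; boundary handling is an assumption.
BUSY_START, BUSY_END = time(8, 0), time(15, 0)

def choose_model(pickup: time) -> str:
    """Pick the classifier suited to the trip's time of day."""
    if BUSY_START <= pickup < BUSY_END:
        return "weighted_random_forest"  # highest precision during busy hours
    return "xgboost"  # balances recall and precision at other times

print(choose_model(time(9, 30)))
print(choose_model(time(16, 0)))
```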