
Predictive Maintenance for Trucks

Team members

Yu Hao

Graduate Student ’21
Data Science Major

Xinyu Guo

Undergraduate Student ’21
Data Science, Economics

Yuman Xie

Undergraduate Student ’21
Data Science, Financial Econ

Pinyi Wu

Undergraduate Student ’21
Data Science, Brain and Cognitive Science

Yuqi Zeng

Undergraduate Student ’21
Data Science, Business, Economics


Ajay Anand, Pedro Fernandez

Department of Data Science, UR


Lloyd Palum, Mathew Mayo

Vnomics Corporation


Vision: Identify scenarios where DPF (Diesel Particulate Filter) failure is likely to happen so that the customer can be alerted in advance to avoid costly roadside breakdowns.

Goal: Through data preprocessing, data visualization, and classification modeling, we aim to produce recall scores and confusion matrices for the classification results.

Terminology and Features

DPF (Diesel Particulate Filter): a filter that removes environmentally harmful particulate matter from the truck’s exhaust gases; the truck self-cleans the filter during operation.

DPF failure: a situation where the filter’s self-cleaning (regeneration) does not work properly, to the point that the filter is so clogged it cannot be cleaned through regeneration alone and needs maintenance.

dpf_regen_inhibited_duration: the total duration (minutes) for the day during which DPF regeneration is inhibited, meaning regeneration cannot occur even when it is needed; dpf_regen_not_inhibited_duration is the reverse.

dpf_regen_active: the total duration (minutes) for the day during which DPF regeneration is active, meaning regeneration is taking place; dpf_regen_not_active is the reverse.

dpf_regen_needed_duration_mins: the total duration (minutes) for the day during which DPF regeneration is reported as needed but is not taking place. We regard this parameter as a critical sign of an upcoming failure.

DTC (Diagnostic Trouble Codes): leading up to DPF failure, a truck may throw one or more DTCs. Each DTC consists of an SPN code and an FMI code; the combination of the two identifies a specific issue, yielding thousands of possible code pairs.

Data Preprocessing & Cleaning

  • Data merging

The raw data contains one dataset per truck, 161 CSV files in total. We merged them all into one large dataframe.
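As a rough sketch (the file layout and the `truck_id` column here are hypothetical, not our exact schema), the merging step could look like this:

```python
import glob
import pandas as pd

def merge_truck_csvs(pattern):
    """Concatenate one CSV per truck into a single dataframe,
    tagging each row with its source file as a truck identifier."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path)
        df["truck_id"] = path  # hypothetical id derived from the file name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

With 161 files, `pd.concat(..., ignore_index=True)` yields one continuous index across all trucks.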

  • Data cleaning

We imputed null values from the previous day’s data, since we needed a continuous time series. We added the features “month” and “vehicle year”, which we thought might reveal patterns, and we also added the DTCs. After communicating with our sponsor, we kept only the 8 DTC pairs related to DPF failures. We kept only the trucks with DPF failures and dropped all others. Within these trucks, we kept the data from the 30 days before the service date; later, based on our tuning results, we extended this to 60 days. As illustrated in the graph below, the service date is indicated by the arrow, and the bold lines are the data we chose. No matter when the service date was, we kept only the 60 days before it. If a truck, like truck D, didn’t have enough data, we dropped it.

In the end, 87 trucks remained, giving 5220 rows of raw data in total.
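A minimal sketch of this cleaning step, assuming a per-truck dataframe with a hypothetical `date` column: it imputes each missing day from the previous day, then keeps only the fixed window before the service date, dropping trucks without full coverage.

```python
import pandas as pd

def clean_truck(df, service_date, window_days=60):
    """Forward-fill gaps from the previous day's data, then keep only
    the window_days of data before the service date; return None for
    trucks that do not cover the full window (they get dropped)."""
    df = df.sort_values("date").set_index("date")
    # reindex to a continuous daily series, imputing from the previous day
    full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
    df = df.reindex(full_range).ffill()
    start = service_date - pd.Timedelta(days=window_days)
    window = df.loc[start:service_date - pd.Timedelta(days=1)]
    return window if len(window) == window_days else None
```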

  • Target variable

We set our target variable to y = 1 three days before the actual failure date, so that we can predict the failure in advance. All other dates (rows) were set to y = 0.
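In code, this labeling step might look as follows (the `date` and `service_date` column names are illustrative):

```python
import pandas as pd

def label_failures(df, lead_days=3):
    """Set y = 1 on the row lead_days before each truck's service date,
    and y = 0 everywhere else."""
    df = df.copy()
    df["y"] = 0
    target = df["service_date"] - pd.Timedelta(days=lead_days)
    df.loc[df["date"] == target, "y"] = 1
    return df
```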

  • Data Normalization

We normalized the data by duration mins, because data from a truck that works over 10 hours per day is not comparable to a truck that works only a few minutes per day. So we divided every numeric variable by the corresponding duration mins. This normalized the numerical data and let us compare it across different trucks.
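The normalization amounts to one elementwise division per day (the `duration_mins` column name follows the feature list above; this is a sketch, not our exact code):

```python
import pandas as pd

def normalize_by_duration(df, duration_col="duration_mins"):
    """Divide every other numeric column by the day's operating
    duration, so trucks with very different daily usage become
    comparable."""
    df = df.copy()
    num_cols = df.select_dtypes("number").columns.drop(duration_col)
    df[num_cols] = df[num_cols].div(df[duration_col], axis=0)
    return df
```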

Exploratory Data Analysis

As mentioned earlier, it is necessary to normalize the numerical data. Here, we used histograms and correlation coefficients to examine which normalization method we should use. We first normalized the data by “distance_miles”. However, as can be seen in the histograms, the distributions of all features were heavily affected by a large number of outliers, which we were concerned might degrade the quality of our model. We therefore tried normalizing by “duration_mins”, and the distributions became much more reasonable: all values fell in the range of 0 to 1, the shapes of the distributions were clearer, and we didn’t see many outliers.

Histogram distribution of data normalized by distance miles
Histogram distribution of data normalized by duration mins

The graph below shows the correlation coefficients across all features; the data here has been normalized by duration mins. Compared with the correlations from data normalized by distance miles, we saw an increase in the linear relationship between the dependent variable “dpf_failure_2weeks” and all other explanatory variables. In addition, we found two pairs of explanatory variables with very high correlations. The first is “fuel_used_gallons” and “distance_miles”, highlighted in blue boxes, with a correlation of 0.96. The other is “dpf_regen_not_active_duration_mins” and “dpf_regen_inhibited_duration_mins”, highlighted in yellow boxes, with a correlation of 0.85.

We also noticed that the correlation coefficients between most of the explanatory features decreased considerably after we normalized the data by “duration_mins”.

Correlation heat map

Since the data was more reasonable when normalized by duration_mins, we decided to use this normalization going forward.
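A small helper like the following (illustrative, not our exact code) can flag highly correlated pairs such as the two found in the heat map:

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the
    threshold, e.g. to spot redundant columns before modeling."""
    corr = df.corr().abs()
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs
```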

We then compared line plots for different windowing intervals: 7, 14, and 30 days. The x-axis is the number of days until the service date, so the right end of the x-axis is the service date (failure date); the y-axis is the value of the corresponding feature.

The graph below is the plot for 7 days. The dark green lines are the average trend, and the light green areas are the 95 percent confidence intervals. Starting 4 days before failure, there is a downward trend in the first two rows (8 graphs) but an upward trend in the last graph. Our interpretation: when a failure is about to occur, the truck is driven less, so distance miles, fuel used, etc. decrease, while the regeneration duration needed increases because the truck can no longer regenerate itself.

Line plot for 7 days

We then wanted to see for how many days this downward trend holds, so we extended the windowing interval to 14 and 30 days.

The graph below is for 14 days. We see the trend holds between 3 and 5 days before failure.

Line plot for 14 days

The plot below is for 30 days; here we can see more clearly that the trend holds from 5 days before the failure date.

Line plot for 30 days

Thus, according to our exploratory data analysis, our raw data (also called the base data) was normalized by “duration_mins”, with the 5 days of data before the service date counted as failure; that is, only rows within 5 days before the service date are labeled y = 1. Later, based on our tsfresh results, we extended this to 10 days before the service date.

Methods and Results

  • tsfresh

For feature engineering, we used the Python package tsfresh to generate windowed data and to calculate and select a large number of time-series features. We also incorporated SMOTE for oversampling, and PCA to further reduce both the dependency between calculated features and the data’s dimensionality.

The graph below this paragraph gives an overview of our feature-engineering process. We considered two scenarios: in one, the train/test split is based on windows; in the other, it is based on trucks. The key difference is where the train/test split takes place: splitting by windows happens after feature generation rather than at the start. Since only 88 trucks are used in our data, we believe splitting by windows lets the model focus on driving behaviors and eliminates the differences between individual trucks. We therefore proceeded with splitting by windows.

To help explain the process, we will use our baseline model to walk through each stage, as illustrated in the graph below. (Because it is the baseline, the day counts differ from our final model; it is the concept that matters.) The first part is rolling and windowing to convert the data into windows, powered by tsfresh. To predict failure in advance, we left 1 day out and moved y = 1 to one day before the service date, shown in blue. We used 30 days of data, so the data runs from 31 days before the service date to 1 day before it. The rectangles in the graph represent the windowing period. Based on our EDA results, we used a window size of 5 days, shown in yellow. To generate more data, we used an overlap percentage of 40%, giving a 2-day overlap between windows, shown in red. After windowing, every window except the last has y = 0; the last window, which includes the day 1 day before the service date, has y = 1. The resulting data frame has 4752 rows and 21 columns.
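In our pipeline the rolling is done by tsfresh; the windowing idea itself can be sketched in plain Python (a hypothetical helper, shown with the baseline’s 5-day window and 2-day overlap):

```python
def make_windows(values, window_size=5, overlap=2):
    """Slice a daily series into overlapping fixed-size windows.
    A 5-day window with a 2-day overlap (40%) steps forward 3 days."""
    step = window_size - overlap
    windows = []
    for start in range(0, len(values) - window_size + 1, step):
        windows.append(values[start:start + window_size])
    return windows
```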

tsfresh feature selection

After obtaining the windowed data, we used tsfresh to generate a large number of time-series features. Each window has 5 days of readings and 21 features to feed into tsfresh. We used ComprehensiveFCParameters, which calculates all tsfresh features. After feature generation, each window becomes a feature vector of length 13379. The resulting data frame has 792 rows and 13379 columns.

tsfresh feature generation
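The real extraction is done by tsfresh’s feature calculators; the following toy stand-in only conveys the idea of collapsing a (days × features) window into one feature vector, where tsfresh computes hundreds of such characteristics per column rather than four:

```python
import numpy as np

def window_features(window):
    """Toy stand-in for tsfresh feature extraction: summarize each
    feature column of a (days x features) window with a few
    statistics, concatenated into one flat feature vector."""
    window = np.asarray(window, dtype=float)
    stats = [window.mean(axis=0), window.std(axis=0),
             window.min(axis=0), window.max(axis=0)]
    return np.concatenate(stats)
```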

Out of these 792 rows, only 88 have y = 1. To address this class imbalance, we used SMOTE to oversample the minority class, giving balanced classes. We then performed the train/test split. After splitting, we ran feature selection, powered by tsfresh, on the train set, and then extracted the same selected features from the test set. Because tsfresh produces highly correlated features and we still had too many of them, we used PCA to further reduce the dimensionality and handle the correlated features. We fit PCA on the train set and found that 4 principal components preserve 99% of the variance. We then applied the same PCA transformation to the test data. At this point, our train and test sets were ready to feed into the models.
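These two stages can be sketched as below. Note the hedges: our pipeline uses imbalanced-learn’s SMOTE, while the first function is a simplified interpolation-based stand-in; the second mirrors the fit-on-train, transform-test PCA step with scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

def smote_like_oversample(X, y, rng=None):
    """Simplified SMOTE-style sketch: synthesize minority (y = 1)
    samples by interpolating between random pairs of minority rows
    until the classes are balanced."""
    rng = np.random.default_rng(rng)
    minority = X[y == 1]
    n_needed = (y == 0).sum() - (y == 1).sum()
    i = rng.integers(0, len(minority), n_needed)
    j = rng.integers(0, len(minority), n_needed)
    lam = rng.random((n_needed, 1))
    synth = minority[i] + lam * (minority[j] - minority[i])
    return np.vstack([X, synth]), np.concatenate([y, np.ones(n_needed, dtype=int)])

def fit_pca_99(X_train, X_test):
    """Fit PCA on the train set only, keeping 99% of the variance,
    then apply the same transformation to the test set."""
    pca = PCA(n_components=0.99).fit(X_train)
    return pca.transform(X_train), pca.transform(X_test)
```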

  • Modeling

We evaluated our models by recall, which is calculated by dividing the true-positive count by the sum of the true-positive and false-negative counts. Recall tells us how sensitive the model is: of the actual positive cases, how many we predict correctly. Since the cost of predicting a DPF failure is relatively low, and since predicting ahead of time is the main goal of this project, we chose recall as our criterion. In short, it’s better to be safe than sorry.
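The metric itself is simple to state in code (equivalent to scikit-learn’s `recall_score`):

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): of the actual failures, the fraction
    the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0
```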

We fit the data obtained above into 9 models: Logistic Regression, SVM, Random Forest, KNN, Extra Trees, Naive Bayes, Decision Tree, Bagging, and Gradient Boosting. Here we use Logistic Regression as an example to show our results.
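For illustration, a Logistic Regression run producing a confusion matrix and recall score might look like the following; the features here are random synthetic stand-ins, not our actual PCA-transformed tsfresh features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

# Synthetic stand-in data: 4 features, labels driven by feature 0.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 4))
y_test = (X_test[:, 0] > 0).astype(int)

# Fit the classifier and report the two outputs our project uses.
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
print("recall:", recall_score(y_test, pred))
```

The other eight models plug into the same fit/predict/score pattern via the shared scikit-learn estimator interface.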

  • Tuning

After constructing the baseline model, we tuned five groups of parameters.

The first is moving the y = 1 date further in advance. Our baseline model sets the y = 1 date only one day before the actual recorded service date, which means we can only predict a failure one day ahead. So we moved y = 1 to 2 to 5 days before the service date. Results show that 1 day and 3 days before the service date give similar model recall, higher than any other setting. Since 3 days before the service date lets us predict failure earlier and gives truck drivers more flexibility, we chose 3 days before the service date for y = 1.

The second parameter we tuned was the overlap in tsfresh windowing. Starting from the baseline model, we tried 40 to 80 percent overlap, i.e., 2 to 4 days of overlap within a 5-day window. Results show that 80 percent (4 days) of overlap gives the highest recall from the best-performing model.

Our third experiment tuned the failure date together with the overlap days. For the failure date, we tried 5, 6, and 10 days. Along the way, we found that increasing the windowing period from 5 days to 10 days cut the number of after-rolling data entries almost in half, so to keep the settings comparable, we enlarged the total data used for 10-day windowing from 30 days to 60 days. For the overlap we tried a range between 70% and 90%. Note that the percentage is the actual percentage: for example, in our 5-day baseline model, 70% to 90% overlap all round to 4 days of overlap, but we report it as 80%, since 4 days out of 5 is exactly 80%. The values we compared were the recall rates for y = 1 from the best-performing models for each combination of failure and overlap days. Due to a small mistake in our method, we are currently re-processing this part; final results should be available soon.

The fourth parameter we tuned was a setting in tsfresh’s extraction function. There are three predefined dictionaries: ComprehensiveFCParameters, EfficientFCParameters, and MinimalFCParameters. We ran each of them. Results show that comprehensive and efficient differ by only 2% in recall, but comprehensive takes around half an hour to run while efficient takes only 5 minutes, so we decided to use EfficientFCParameters.

The fifth group, which we will focus on after resolving the issue above, is the models’ own parameters. By tuning each model’s internal parameters, we were able to raise model performance by as much as around 27%.


Due to the mistake we made in tsfresh, we are still tuning the last part of the model. The highest recall we obtained before tuning is around 60%, meaning that of the trucks experiencing failure, we correctly predict about 60% of them. This result is better than we expected, given that we did not have a lot of data and the data is very unbalanced (for 161 trucks over two years, only 87 rows of data count as y = 1, that is, 87 out of 10225110). We think we can improve model performance with further tuning, but beyond that, there are several other things to try.

First, we could split our data by trucks at the tsfresh rolling step, which would let us better test the model’s validity on unseen trucks.

Second, we could try to model normal trucks: predict the performance of normal trucks and set a standard from it, then predict a failure for any truck that does not meet this standard. The problem with this method is that, without enough normal trucks, we cannot capture all normal driving behaviors in our “normal standard”.

We will try these methods, among other steps, to improve model performance before the semester ends. We will post updates as we make further changes; if you have any questions or suggestions, please leave us a comment below!


We would like to express our sincere gratitude to Professor Anand, Professor Palum, and Mr. Mayo. Thank you for all your help and support throughout the semester!

