Predictive Maintenance with DataRobot
Updated: May 25, 2020
Author: Panos Zisopoulos
Predicting whether hardware equipment is about to fail can prove highly beneficial for modern companies that rely on machinery to provide their services.
For example, consider a bus company that has recently bought a fleet of new buses. Engine failures can occur due to the wear of mechanical parts and/or human error, and they result in many dissatisfied customers and a dip in the company's ticket sales. If the company decides to perform regular maintenance on the engines in order to catch upcoming failures, it can reduce this risk, but at the price of higher costs.
This is one of the reasons why predictive maintenance has become popular, and why major companies employ data scientists to build machine learning models that make such predictions. One market sector in which profit depends significantly on machinery failure is the oil industry. Whenever drilling equipment breaks during operation, oil extraction stops. As a result, the oil supply falls, which results in losses for the company and potentially in a disturbance of the global energy supply and markets.
The natural next question is whether we, as data scientists, can come up with an analysis that helps companies predict whether their machinery needs maintenance. Fortunately, machine learning fits this scenario well: with the right data, we can develop models that tell us whether such a company should proceed with maintenance.
Automated Machine Learning with DataRobot
It takes a lot of time and dedication to master the basic notions of machine learning, and even longer to train and produce models that can be put into production and exposed to end users on the business side. The reasons are the manual data wrangling and writing of code to train, test, validate and compare different models.
DataRobot has developed a platform for Automated Machine Learning (AML). This platform frees the data scientist or analyst from the time-consuming and repetitive task of writing code to train, test and compare each model. DataRobot also greatly simplifies the process of pushing a model to the production environment, since it allows for the automatic deployment and monitoring of machine learning models.
The impact of this product is huge in modern companies, as they can incorporate machine learning models in production, in order to optimize their performance. Here at Argusa, we are proud to be partners of DataRobot, and with this article, we would like to demonstrate how the DataRobot application can help us to understand and analyze business problems and eventually lead us to useful conclusions.
In this example, we are using a dataset which contains information on the quantity of machinery that has been replaced over a specific period of time. This dataset is part of a rich collection of use cases that are only available to DataRobot partners, and it has been deliberately anonymized for reasons of privacy. The goal of the analysis is to use the DataRobot AML functionality to identify the parameters in the dataset that increase or decrease the probability of hardware replacement, i.e. machinery failure.
Description of the dataset
The dataset consists of 10,000 rows and 28 features (columns) with information on the machinery (such as weight, material identifiers, etc.), plus a Boolean variable (qty_replaced) that records whether a replacement took place (value 1) or not (value 0). The problem we are dealing with is therefore one of binary classification.
This variable is our target variable, i.e. we will build models in DataRobot in order to predict the probability that a piece of drilling equipment is replaced.
The features that we will use in our model are of categorical, numerical and Boolean types, and they act as the input (independent) variables from which DataRobot learns to predict the target. Examples of such features in our dataset are the weight of the machinery components and the type of material they are made of.
To meet the needs of modern businesses, DataRobot can be used either through a web application or through a Python API. In this example, we demonstrate the power of DataRobot through the web application.
When we visit the web page of the application, we can import our data in several ways: by uploading a local file (such as a CSV or Excel file), by using a URL (e.g. for data stored in AWS), by using HDFS files if we are working with Hadoop, or by connecting to a relational database (e.g. SQL Server).
A new feature of DataRobot is that we can browse the AI Catalog, which includes projects that we have worked on and shared with our team. It provides easy access for all collaborators to the data which are required for answering business problems, in a consistent and secure way.
Fig. 1: The DataRobot page where the user can upload data in many ways.
After this short introduction, let’s import some data! We choose to upload data from a local file, and for the current dataset it takes DataRobot about 3 seconds to read and pre-analyze the data. The pre-analysis includes statistical calculations for each column of our dataset, together with information on the data distributions in the form of histograms. After the upload, the main page of DataRobot consists of a field to enter the target variable, a button to choose the modeling mode, and an option to activate Time Series modeling, in case our data are time-dependent (for example, seasonal) and we would like to predict future values.
When we choose the variable qty_replaced as a target variable, DataRobot offers the histogram of these observations, together with the automatic recognition of the machine learning analysis that our data need (Classification). The big Start button is there to initiate the analysis, and we can choose the Modeling Mode that we prefer. The choices are:
Autopilot, where DataRobot performs every operation, from data analysis to model building, automatically.
Quick, which offers the same functionality as Autopilot but runs the analysis on a subset of our data. This is a great option when our dataset is huge and we want to perform some modeling quickly.
Manual, where we have full control over the analysis and choose which models to run on our data.
Fig. 2: After uploading our data, we can configure the analysis according to our needs.
In this example, we are going to use the Autopilot option. DataRobot also suggests the optimization metric (loss function) that we should use to measure the quality of our models. For our case, the recommended metric is LogLoss.
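To make the metric more concrete, here is a minimal sketch of how a log loss can be computed for a binary target. It uses the standard textbook definition (the average binary cross-entropy); the function name, the clipping constant and the toy numbers are our own assumptions for illustration and are not taken from DataRobot.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy labels and predictions (illustrative values only).
labels = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]   # mostly correct and confident
uninformative = [0.5] * 5               # always answers "50/50"

print(log_loss(labels, confident))
print(log_loss(labels, uninformative))  # equals ln(2) for a constant 0.5 prediction
```

Lower values are better: a model that always answers 0.5 scores ln(2), roughly 0.693, so a useful binary classifier should end up below that.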
Before jumping to the analysis, we can have a quick look at the features of our data and inspect their statistical properties. DataRobot offers a preview of the raw data, a histogram of each feature’s distribution, and automatic recognition of each feature’s variable type (categorical, Boolean, etc.). For example, we can see that our target variable, which is Boolean, has a mean of 0.57 and a standard deviation of 0.50, which immediately tells us that our target variable is adequately balanced: about 57% of the rows record a replacement. Being balanced, for a classification problem, means having an equal or almost equal number of observations of the two outcomes (0 or 1 in this case). The result of our machine learning analysis will be a prediction of the probability that a piece of equipment needs replacement. These probabilities are continuous values from 0 to 1; a value below 0.5 is read as no replacement, whereas a value above 0.5 is read as replacement.
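The two numbers quoted above are consistent with each other: for a 0/1 variable with mean p, the standard deviation is sqrt(p(1 - p)). The short sketch below checks this and also shows the 0.5 cut-off applied to a few made-up probabilities. The helper names are hypothetical, and treating a probability of exactly 0.5 as a replacement is our own assumption.

```python
import math

def bernoulli_std(p):
    """Standard deviation of a 0/1 variable whose mean is p."""
    return math.sqrt(p * (1 - p))

def to_label(probability, threshold=0.5):
    """Turn a predicted probability into 1 (replace) or 0 (no replacement).

    Assumption: a probability exactly equal to the threshold counts as 1.
    """
    return 1 if probability >= threshold else 0

share_replaced = 0.57  # mean of qty_replaced reported in the text
print(round(bernoulli_std(share_replaced), 2))  # 0.5

example_probabilities = [0.12, 0.49, 0.51, 0.93]  # made-up model outputs
print([to_label(p) for p in example_probabilities])  # [0, 0, 1, 1]
```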
Fig. 3: DataRobot offers a statistical view of our data before starting to build models.
In the same list, we can also click on a feature and DataRobot will automatically draw a histogram for us. For example, when we click on the m_weight variable, we can inspect the distribution of the weights of our equipment. We immediately conclude that our machinery is mostly “light”, although some units are significantly heavier.
Fig. 4: Histogram of the m_weight variable. The values on the horizontal axis are in abstract units, while the vertical axis shows the number of rows that exhibit these values.
And now to the fun part!
We click on the Start button and use the default Autopilot option, which lets DataRobot automatically perform feature engineering. This includes many tedious tasks, such as the identification and imputation of missing values, the measurement of correlations between features, and the automatic creation of feature lists according to feature importance. This functionality saves the DataRobot user a great deal of time and effort.
In Fig. 5, the features of the dataset are shown ordered by importance, a statistical measure of the impact of each feature on the target variable. This view is very convenient, since it gives us a first feeling for the relationships between the target variable and the rest of our dataset.
Fig. 5: View from the data loading pane in DataRobot after initiating the analysis.
In the same view, we notice the Feature List button, where we can see automatically generated feature lists with different numbers of features. DataRobot takes subsets of our initial data (along the column axis) in order to reduce the number of features. In general, a simpler model is preferable to a more complicated one, as long as it neither underfits (high bias) nor overfits (high variance) the data. DataRobot has automatically created a list of 12 features (out of the initial 28). Besides giving a simpler model, this results in shorter computation times and an easier interpretation of the results. However, we are always free to choose the initial list, which includes all the features, or even to create our own lists.
After the initial feature engineering, Autopilot begins to build the first models. A total of 75 models are run automatically on our data in less than 20 minutes, although these figures depend largely on the data size and the network speed. It is important to mention that while DataRobot keeps on building models, we are free to work in the application and explore any models that have been built so far. Note also that DataRobot uses k-fold cross-validation to assign scores to the models. This technique is a machine learning best practice: it gives a more reliable estimate of how a model performs on unseen data, and it reduces the risk of selecting a model that merely fits one particular split of the data. The general way this is done in DataRobot is:
1. The rows of the dataset are randomly shuffled.
2. The training data (80% of our dataset) are divided into k = 5 groups.
3. One of the k = 5 groups is reserved for the validation of our models, i.e. the measurement of the model’s accuracy. The model is trained on the other 4 groups and then validated on the reserved group.
4. The previous step is repeated k = 5 times, using a different group for validation each time.
5. The final result is taken as the average of these 5 validation scores.
The remaining 20% of our data (see Step 2) is reserved as a Holdout for the final validation of our models. As an extra safeguard, DataRobot allows models to be tested on the Holdout data only as a final step. This reduces bias in the model selection that the user performs.
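The partitioning described above can be sketched in a few lines of plain Python. This is only an illustration of the general scheme (shuffle, 20% holdout, 5 folds on the remainder), not DataRobot's internal implementation; the function names, the seed and the placeholder scorer are our own assumptions.

```python
import random

def partition(n_rows, holdout_fraction=0.2, k=5, seed=0):
    """Shuffle row indices, set a holdout aside and split the rest into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # shuffle rows, not columns
    n_holdout = round(n_rows * holdout_fraction)
    holdout = indices[:n_holdout]
    training = indices[n_holdout:]
    folds = [training[i::k] for i in range(k)]    # k groups of (almost) equal size
    return folds, holdout

def cross_validate(folds, fit_and_score):
    """Use each fold once for validation and average the resulting scores."""
    scores = []
    for i, validation in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(fit_and_score(train, validation))
    return sum(scores) / len(scores)

n_rows = 10_000  # size of the dataset in this article
folds, holdout = partition(n_rows)

# Placeholder "scorer": instead of training a model, it reports which share
# of all rows is available for training in each round.
train_share = cross_validate(folds, lambda train, validation: len(train) / n_rows)

print(len(holdout), [len(fold) for fold in folds], round(train_share, 2))
```

With 10,000 rows this gives a holdout of 2,000 rows and five folds of 1,600 rows, so in every round a model is trained on 6,400 rows (64% of the data) and validated on 1,600. In a real run, the placeholder scorer would be replaced by a function that fits a model on the training rows and returns its LogLoss on the validation rows.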
After the analysis is finished and the models are built, they are sorted in the Leaderboard section according to each model's LogLoss score, where smaller values indicate more accurate models. As we mentioned in the beginning, DataRobot is designed for modern business needs, so it also flags models that did a great job in terms of both computation speed and accuracy. These models are very good candidates for deployment to make predictions in production. DataRobot informs us about the model that is most suitable for deployment.
The accuracy is given by two values in the Leaderboard. The Validation field shows the score obtained by using a single chunk of our training data for validation, while the Cross-Validation field shows the cross-validation score. Between the two, it is good practice to prefer the cross-validation score when choosing our model. We also have the opportunity to filter the Leaderboard according to the feature lists that we have included in the analysis. In this example, we will look at the models built on the feature list that DataRobot generated automatically and trained on 64% of our data. The most accurate models, as can be seen in Fig. 6, are the blender models, which are mixtures of various models.
Fig. 6: Leaderboard of the DataRobot models for the predictive maintenance analysis.
We can obtain rich information on each model with a single click on its name. For example, the Describe section provides information on the blueprint (i.e. the flow diagram of the DataRobot operations that were performed for this model) and on the computation speed, and we can even read the raw log file. The blueprint of the Advanced GLM Blender is shown in Fig. 7. Blender here means “mixture of models”, and indeed the first step in the flow is the preprocessing phase of each of these models.
Fig. 7: Blueprint of the Advanced GLM Blender model.
After the analysis from DataRobot is finished, we can visit the Insights section, to evaluate some qualitative characteristics of our dataset.
For example, in the Word Cloud tab, we can see a word cloud of the most frequent labels in our dataset, where the size of each word reflects how often it occurs. In addition, the word cloud shows a qualitative relationship between these words and the target variable (positive correlation in red, negative in blue).
More red means a greater probability that the equipment is changed. The word cloud is shown in Fig. 8, from which we can make deductions such as:
Drills are more prone to damage than pipes.
Hoses are the equipment least likely to be changed.
Fig. 8: Word Cloud in the Insights section.
Understanding our model
DataRobot offers many tools for understanding the produced models, and evaluating useful metrics like accuracy.
The model that we choose for the analysis is the Advanced GLM Blender, which DataRobot also suggests as the most accurate. This can also be confirmed in the Learning Curves section, where this particular model indeed has the best LogLoss score. If we visit the Speed vs Accuracy section, we can confirm that blender models are usually not fast, which one should take into account before deploying such a model for real-time predictions.
By visiting the Evaluate section, we can see the Lift Chart in Fig. 9, which compares the response of the chosen model (Advanced GLM Blender in this case) with the actual values of the data. The rows of the training data are sorted by predicted value from low to high and grouped into 10 bins (800 rows each, given the 8,000 training rows), and the chart shows the average value of the target variable (qty_replaced) in each bin. The average predicted by the model is shown in blue, whereas the average of the actual data is shown in orange.
We can confirm that the model agrees very well with the actual data, although the model slightly underestimates the probability that the equipment will not be replaced. This can be seen in the section of the plot where a bump is visible in the response of the actual data.
Fig. 9: Lift chart for the selected model in DataRobot. The predicted probability of the target variable is in blue, the actual in orange.
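For readers who want to see what goes into such a chart, the sketch below computes the two curves of a lift chart: rows are sorted by predicted probability, cut into equal bins, and the predicted and actual values are averaged per bin. The data are synthetic stand-ins generated with a fixed seed, not the article's dataset, and the function name is our own.

```python
import random

def lift_chart(y_true, y_prob, n_bins=10):
    """Return (rows in bin, mean predicted, mean actual) for each bin."""
    ranked = sorted(zip(y_prob, y_true))  # low to high predicted probability
    n = len(ranked)
    chart = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        mean_actual = sum(y for _, y in chunk) / len(chunk)
        chart.append((len(chunk), mean_predicted, mean_actual))
    return chart

# Synthetic stand-in for 8,000 training rows: each label is drawn with the
# predicted probability, so this toy "model" is well calibrated by design.
rng = random.Random(0)
probs = [rng.random() for _ in range(8000)]
labels = [1 if rng.random() < p else 0 for p in probs]

chart = lift_chart(labels, probs)
for size, mean_predicted, mean_actual in chart:
    print(size, round(mean_predicted, 2), round(mean_actual, 2))
```

When the two averages stay close to each other in every bin, the model is well calibrated; a gap in some bins corresponds to the kind of bump discussed above.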
In Fig. 10, DataRobot shows the Receiver Operating Characteristic (ROC) curve, which gives a qualitative and quantitative overview of the accuracy of our model. In other words, we can use the ROC curve to judge how successful our binary classifier is. The horizontal axis shows the False Positive Rate (FPR), i.e. the share of equipment that did not need to be replaced but that the model predicted should be replaced. The vertical axis shows the True Positive Rate (TPR), i.e. the share of equipment that needed to be replaced and that the model predicted should be replaced. At an FPR of 60%, we get a TPR of more than 90%, which shows that our classifier does its job reasonably well. DataRobot informs us that the overall Accuracy of our classifier is 0.71, which is a relatively good score.
Fig. 10: The ROC curve of our binary predictor.
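The quantities on the two axes can be computed directly from labels and predicted probabilities. The following sketch uses the standard definitions of TPR, FPR and accuracy, and builds the ROC curve by sweeping the threshold over the predicted values. The eight labels and probabilities are made up for illustration, and the function names are our own.

```python
def confusion_rates(y_true, y_prob, threshold=0.5):
    """Return (TPR, FPR, accuracy) at a threshold; assumes both classes occur."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn)   # replaced equipment that the model caught
    fpr = fp / (fp + tn)   # healthy equipment that the model flagged
    accuracy = (tp + tn) / len(y_true)
    return tpr, fpr, accuracy

def roc_points(y_true, y_prob):
    """(FPR, TPR) pairs obtained by lowering the threshold step by step."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        tpr, fpr, _ = confusion_rates(y_true, y_prob, threshold)
        points.append((fpr, tpr))
    return points

# Made-up example: 4 replaced units (1) and 4 healthy units (0).
labels = [1, 1, 0, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

print(confusion_rates(labels, probs))  # (0.75, 0.5, 0.625)
print(roc_points(labels, probs))
```

Each threshold gives one point of the curve, which is why a single accuracy value (computed at one threshold) tells only part of the story.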
One of the most important functionalities of DataRobot is the Feature Impact, which offers the user the ability to see the importance of each feature to the target variable. This view can drive future decisions and strategy planning with respect to the problem at hand.
For example, in Fig. 11, the impact of each feature is shown as a percentage on the horizontal axis. The material_id feature is the strongest predictor of qty_replaced, which means that for some materials (we have one id number per material) there is a higher chance that the equipment is replaced. Some other strong predictors are also highlighted, such as m_weight, which corresponds to the weight of each machinery unit.
Fig. 11: Weight of each feature in the response of our model.
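One common way to measure this kind of impact is permutation importance: shuffle the values of one feature across the rows and see how much the model's score drops. The sketch below illustrates the idea on a toy model; we do not claim that Fig. 11 was computed in exactly this way. The toy model, the "noise" column and the function names are hypothetical, and only the column name material_id comes from the dataset.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's 0/1 prediction matches the label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling one column across the rows."""
    baseline = accuracy(model, rows, labels)
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    permuted = [{**row, column: value} for row, value in zip(rows, values)]
    return baseline - accuracy(model, permuted, labels)

def toy_model(row):
    """Hypothetical model: only material "A" is predicted to be replaced."""
    return 1 if row["material_id"] == "A" else 0

# Toy data: the label depends on material_id only; "noise" is irrelevant.
rows = [{"material_id": "A" if i % 2 == 0 else "B", "noise": i % 7}
        for i in range(200)]
labels = [toy_model(row) for row in rows]

impact_material = permutation_importance(toy_model, rows, labels, "material_id")
impact_noise = permutation_importance(toy_model, rows, labels, "noise")
print(impact_material, impact_noise)
```

Shuffling the irrelevant column leaves the score untouched (impact 0), whereas shuffling material_id breaks the link between feature and target and the accuracy drops.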
We can further inspect how exactly any variable affects the response of our model. DataRobot can automatically create plots that allow for a quantitative and qualitative analysis of the relationships between the values of our features and the target variable. We would like to produce such a plot for the material_id feature, since we concluded that it is among the strongest predictors of our target variable.
In Fig. 12, we can see this plot for the variable material_id, where DataRobot plots the actual and predicted probabilities of the target variable for each value of the material_id column, together with the partial dependence. We can safely conclude that the model nicely predicts that some materials are more likely to be replaced than others, which makes sense since the properties of a material are a strong predictor of its endurance. Partial dependence here is the average response of the model when material_id is set to a given value in every row, while all other features keep their observed values. In other words, it isolates the effect that changing this one feature has on the predicted probability.
Fig. 12: Predictions with respect to material_id.
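Partial dependence can be sketched as follows, using its usual definition: for each candidate value of a feature, set that feature to the value in every row, keep the other features as observed, and average the predictions. The toy model and its coefficients are hypothetical; only the column names material_id and m_weight come from the dataset.

```python
def partial_dependence(model, rows, column, values):
    """Average prediction when `column` is forced to each candidate value."""
    result = {}
    for value in values:
        predictions = [model({**row, column: value}) for row in rows]
        result[value] = sum(predictions) / len(predictions)
    return result

def toy_model(row):
    """Hypothetical model: a base probability per material plus a weight effect."""
    base = {"A": 0.2, "B": 0.6}[row["material_id"]]
    return base + 0.01 * row["m_weight"]

# Three made-up rows; the other features keep these observed values.
rows = [
    {"material_id": "A", "m_weight": 10},
    {"material_id": "B", "m_weight": 20},
    {"material_id": "A", "m_weight": 30},
]

dependence = partial_dependence(toy_model, rows, "material_id", ["A", "B"])
print({material: round(value, 2) for material, value in dependence.items()})
# {'A': 0.4, 'B': 0.8}
```

In this toy example, switching every unit from material A to material B raises the average predicted probability of replacement from 0.4 to 0.8, which is the kind of per-material comparison the plot makes visible.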
With the automated machine learning analysis delivered by DataRobot, the oil company can make decisions about its maintenance strategy. For instance, instead of bearing the expense of periodic maintenance, even for equipment that does not need to be changed, the company can use the model for daily predictions, as soon as data are loaded for each piece of equipment.
These predictions can then show which machinery needs to be replaced, substantially reducing the risk of stalling oil production due to broken equipment. DataRobot offers extremely useful analytics, plus the automation of tedious processes such as model deployment and the management of the prediction server.
If you are curious about DataRobot and you think your company can benefit from this platform, do not hesitate to contact us for more information at email@example.com.