• Data pre-processing techniques and tools for predictive modelling using unstructured inputs

      Maslowski, Przemyslaw (University of Bedfordshire, 2020-07)
      Data is a crucial factor in machine learning, as most neural networks and machine learning models are data-driven. A trained neural network can make predictions on new data that the model has not seen, provided the data follows the patterns learned during training. The performance of a predictive model varies with the data used for training. Multiple metrics are produced after a model is trained to evaluate its performance. However, it is difficult to obtain an intuitive measurement that indicates whether the data pre-processing for a model has improved. Therefore, a constructive performance indicator tool that intuitively measures the performance of pre-processing mechanisms for a given model has been developed through multiple experiments with 32 datasets. The experiments are set up by collecting multiple unstructured datasets, which are subsequently converted into structured datasets and then evaluated by their modelling performance. The experiment results are used to evaluate the importance of each metric and to set priorities via weights, contextualising the pre-processing experience within the constructivist paradigm. Furthermore, a set of tools has been developed throughout the project to improve the efficiency of machine learning experiments. These tools are part of the main software, named the pre-processing assistant. The pre-processing assistant has been released publicly and can be used for preparing, processing, and analysing data. The software tools allow users to manipulate datasets and generate Python scripts to train a predictive model. The TensorFlow framework and its machine learning algorithms have been used to develop the Python scripts for training models and making predictions on datasets. The software has been used to carry out the experiments that helped configure the performance indicator tool.
Finally, the most important metrics were identified through various experiments, which consist of training the model with and without data pre-processing techniques. The increase in each metric has been used to identify the significant metrics. Metrics that improve more frequently are considered more critical and are assigned a higher weight. The performance indicator has been configured based on the final experiment results, and it can be used by others to measure the performance of a predictive model.
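
The weighting idea described above can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the function names, the metric names, the sample values, and the normalisation of weights to sum to one are all assumptions made for illustration. It also assumes that a higher value is better for every metric, so a metric such as loss would need its sign flipped first.

```python
from collections import Counter


def improvement_weights(experiments):
    """Derive metric weights from how often each metric improves.

    `experiments` is a list of (baseline, preprocessed) pairs, where each
    element is a dict mapping a metric name to its value. The structure and
    the normalisation are illustrative assumptions, not the thesis's method.
    """
    counts = Counter()
    names = set()
    for baseline, preprocessed in experiments:
        for name, before in baseline.items():
            names.add(name)
            # Count an improvement when pre-processing raised the metric.
            if preprocessed[name] > before:
                counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        # No metric ever improved: fall back to equal weights.
        return {name: 1.0 / len(names) for name in names}
    return {name: counts[name] / total for name in names}


def performance_indicator(baseline, preprocessed, weights):
    """Weighted sum of metric changes; positive means pre-processing helped."""
    return sum(w * (preprocessed[n] - baseline[n]) for n, w in weights.items())


# Hypothetical results for three datasets (values invented for illustration).
experiments = [
    ({"accuracy": 0.70, "precision": 0.65, "recall": 0.60},
     {"accuracy": 0.80, "precision": 0.64, "recall": 0.70}),
    ({"accuracy": 0.75, "precision": 0.70, "recall": 0.72},
     {"accuracy": 0.78, "precision": 0.74, "recall": 0.71}),
    ({"accuracy": 0.60, "precision": 0.55, "recall": 0.58},
     {"accuracy": 0.66, "precision": 0.55, "recall": 0.63}),
]

weights = improvement_weights(experiments)
score = performance_indicator(*experiments[0], weights)
```

In this sketch a metric that improves in more experiments receives a larger share of the total weight, so the resulting score is dominated by the metrics that respond most consistently to pre-processing.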