This project studies how effectively machine learning can predict algal blooms. Using meteorological data from the Southern California Coastal Ocean Observing System (SCCOOS), I designed and developed models to predict phytoplankton concentrations well before a bloom's onset.
I worked through every step of the machine learning workflow! Some details follow below.
Data processing: I had to find datasets for coastal sites, then used pandas and numpy to inspect and clean the data. The biggest problem was the number of 0/NaN concentrations, so I used XGBoost to impute phytoplankton concentrations that were missing (NaN) or recorded as 0.
Feature Importance: Using XGBoost's feature importance method, I determined the relative contribution of each feature, then compared different feature subsets to find the optimal one.
Model training: I studied and trained XGBoost, AdaBoost, and KNN models. To tune each model's hyperparameters, I used sklearn's GridSearchCV to perform an exhaustive search over specified parameter values.
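A grid search of this kind can be sketched as follows, here with a KNN regressor. The parameter grid, the 5-fold cross-validation, and the synthetic data are assumptions for illustration; the grids used in the project are not stated.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(0, 0.1, 200)

# Hypothetical grid: every combination (3 x 2 = 6) is tried.
param_grid = {"n_neighbors": [3, 5, 9], "weights": ["uniform", "distance"]}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(best_params, best_rmse)
```

The same pattern applies to the other models by swapping the estimator and its parameter grid.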
Evaluation: I evaluated the three models with RMSE to find the best one for predicting across all 3 piers. The models performed regression to output an estimated phytoplankton concentration, and I then split the concentrations into tiers based on how much risk they pose: healthy water, warning of a potential upcoming bloom, and infected.
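The evaluation and tiering can be sketched as below. The RMSE formula is standard, but the thresholds `WARNING_AT` and `INFECTED_AT`, the helper names, and the sample values are hypothetical, since the cutoffs used in the project are not given.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical concentration thresholds, chosen only for this example.
WARNING_AT = 100.0
INFECTED_AT = 1000.0


def risk_tier(conc):
    """Map predicted concentrations to healthy / warning / infected tiers."""
    return pd.cut(
        conc,
        bins=[-np.inf, WARNING_AT, INFECTED_AT, np.inf],
        labels=["healthy", "warning", "infected"],
        right=False,  # each tier includes its lower bound
    )


# Made-up values to show the calls.
predicted = np.array([12.0, 100.0, 450.0, 2500.0])
observed = np.array([10.0, 120.0, 400.0, 2400.0])

score = rmse(observed, predicted)
tiers = risk_tier(predicted)
print(score, list(tiers))
```

Keeping the model as a regressor and tiering afterwards means the thresholds can be adjusted without retraining.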
I learned a lot about using Python's visualization libraries, including seaborn and matplotlib! These tools were crucial for data processing and model evaluation.