Water Quality Gradient Boost Model

I will share a raw table of laboratory-measured water-quality parameters (pH, turbidity, dissolved oxygen, heavy-metal readings and several others). Using those records I want to end up with a supervised, decision-tree-based solution—specifically a Gradient Boosting model—that predicts overall water quality with the highest possible accuracy. Here is what I need from you: • Inspect and clean the dataset, handling outliers, missing values and inconsistent units. • Explore feature importance and, where helpful, create additional engineered features. • Train and fine-tune a Gradient Boosting decision-tree model (scikit-learn, XGBoost or LightGBM are all acceptable) with solid cross-validation. • Compare its performance against at least one other tree-based approach so I can see the gains. • Deliver a concise report containing the final metrics (accuracy, precision/recall, F1, ROC-AUC) plus a short interpretation of the most influential parameters. • Hand over the cleaned dataset, Python scripts or Jupyter notebook, and the serialized model file so I can reproduce results on my end. Keep the code readable, add comments where decisions might need context, and structure the repository so I can drop in new samples later and get predictions immediately.

Python

Регистрация