
Question on "rules" for evaluating predictions, as related to the purity of the testing/validation period

We are predicting daily water temperature at multiple depths for a variety of lakes using a machine learning (ML) model that has some physics integrated into it. The loss function in the ML model includes a penalty for the magnitude of violations of conservation of energy, calculated per time step. Additionally, we have found that the output of an existing uncalibrated process-based lake temperature model is excellent "pre-training" data for an ML model, since it is complete in time and space and represents a fairly physically realistic response to the inputs. We train the ML model on the process-based model's water temperature outputs (treating them the same way as if they were actual observations), and then use the ML network state at the end of "pre-training" as the initial network state for the actual training, which is done as a separate step with real temperature observations.
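To make the two-stage setup concrete, here is a minimal sketch of the kind of thing we mean (PyTorch-style; the network, the energy-penalty formula, and all names are illustrative placeholders, not our actual code):

import torch
import torch.nn as nn

class LakeTempNet(nn.Module):
    """Toy stand-in for the ML lake-temperature model (an LSTM over daily drivers)."""
    def __init__(self, n_drivers, n_depths, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_drivers, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_depths)

    def forward(self, drivers):               # drivers: (batch, time, n_drivers)
        h, _ = self.rnn(drivers)
        return self.head(h)                   # predicted temps: (batch, time, n_depths)

def energy_penalty(pred_temps, drivers):
    """Stand-in for the per-time-step conservation-of-energy violation.
    The real penalty compares the change in depth-integrated heat content
    against the surface energy fluxes implied by the drivers."""
    heat = pred_temps.mean(dim=-1)            # crude proxy for lake heat content
    d_heat = heat[:, 1:] - heat[:, :-1]
    flux_proxy = drivers[:, 1:, 0]            # pretend driver 0 is a net surface flux
    return ((d_heat - flux_proxy) ** 2).mean()

def train_stage(model, drivers, targets, mask, lam=0.1, epochs=50, lr=1e-3):
    """One training stage: supervised loss on (possibly sparse) targets + energy penalty."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = model(drivers)
        sup = ((pred - targets) ** 2 * mask).sum() / mask.sum().clamp(min=1)
        loss = sup + lam * energy_penalty(pred, drivers)
        loss.backward()
        opt.step()
    return model

# Stage 1: "pre-train" on the dense, uncalibrated process-model output.
# Stage 2: fine-tune the same network (same weights) on the sparse observations.
# model = LakeTempNet(n_drivers=5, n_depths=10)
# model = train_stage(model, drivers_all, pb_temps_all, pb_mask_all)        # pre-training
# model = train_stage(model, drivers_all, obs_temps_train, obs_mask_train)  # training

The key detail is that the second stage starts from the first stage's weights rather than from a fresh initialization.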
For example, if we have a modeling period of 2000-2010 and the first 5 years are training and the second 5 years are testing, then the ML model is 1) first "pre-trained" on weather drivers + uncalibrated process-model output of water temperature from 2000-2010, and then 2) trained on weather drivers + observations of water temperature from 2000-2005. Then, we make predictions of water temperature in 2000-2010 using our trained ML model driven by the weather dataset, and finally test model performance using water temperature observations from 2006-2010.
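To spell out which years feed which stage in this example, here is a sketch of the period masks (dates only, no actual data; the variable names are just for illustration):

import pandas as pd

dates = pd.date_range("2000-01-01", "2010-12-31", freq="D")

pretrain_period = (dates >= "2000-01-01") & (dates <= "2010-12-31")  # drivers + process-model output
train_period    = (dates >= "2000-01-01") & (dates <= "2005-12-31")  # drivers + observations
test_period     = (dates >= "2006-01-01") & (dates <= "2010-12-31")  # observations held out

# Stage 1 (pre-training): fit to process-model temperatures wherever pretrain_period is True.
# Stage 2 (training):     fit to observed temperatures only where train_period is True.
# Evaluation:             compare predictions to observations only where test_period is True.

The point of the sketch is that the weather drivers span the full 2000-2010 window in the pre-training stage, while the observations are only ever touched during training (2000-2005) and evaluation (2006-2010).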
Figure. Blue line represents the uncalibrated process-based model water temperature predictions and the yellow dots represent the water temperature observations.
The question we have is whether there is an established set of rules regarding the "purity" of the model test period. We know the water temperature observations in the test dataset are completely off-limits - we do nothing with them except use them to evaluate the model. But is there anything that suggests the model inputs that exist during the test period (e.g., the weather data, which is used to drive both the process-based and ML models) are also off-limits? Since generating more "pre-training" data using the uncalibrated process-based model for longer time periods is essentially free and involves no feedback from the water temperature observations, we include the test period in the pre-training.
Any thoughts on this would be much appreciated! Links or citations supporting or refuting our evaluation method would also be great.