Agenda
Part 1: Synthetic health dataset
Based on publicly available data and the medical literature, we will first construct a simplified but realistic function that maps individual health risk factors such as BMI, blood pressure, and age to one or more health outcomes, e.g. incidence rates of cardiovascular diseases or mortality rates. We will then simulate a longitudinal dataset from this function.
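As a rough illustration of what such a simulation could look like, here is a minimal sketch using only the Python standard library. The risk function `annual_incidence_prob`, its coefficients, and the risk factor distributions are invented for illustration and are not taken from the tutorial or from medical literature.

```python
import math
import random


def annual_incidence_prob(age, bmi, sbp):
    """Hypothetical annual incidence probability, logistic in the risk factors.

    All coefficients are made up for illustration only.
    """
    eta = -5.0 + 0.05 * (age - 50) + 0.08 * (bmi - 25) + 0.02 * (sbp - 120)
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_cohort(n_people, n_years, seed=0):
    """Simulate one record per person and year until the first event."""
    rng = random.Random(seed)
    records = []
    for person_id in range(n_people):
        # Baseline risk factors (assumed distributions, held constant over time).
        age = rng.uniform(30, 70)
        bmi = rng.gauss(26, 4)
        sbp = rng.gauss(125, 15)  # systolic blood pressure
        for year in range(n_years):
            p = annual_incidence_prob(age + year, bmi, sbp)
            event = int(rng.random() < p)
            records.append({
                "id": person_id,
                "year": year,
                "age": age + year,
                "bmi": bmi,
                "sbp": sbp,
                "event": event,
            })
            if event:
                break  # follow-up ends at the first incidence
    return records


records = simulate_cohort(n_people=1000, n_years=10, seed=42)
print(len(records), "person-years,", sum(r["event"] for r in records), "events")
```

Each row is one person-year, so the same individual appears in several rows, which is what makes the dataset longitudinal.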
Part 2: Generalized linear models
Using the synthetic dataset, we will show how to set up generalized linear models (GLMs) to model, or essentially "reverse-engineer", the function that was used to create the dataset. We will very briefly go through the usual challenges:
- Defining the "formula" of the GLM.
- Choosing the functional form of each term: linear, quadratic, higher-order polynomial, or other types of functions.
- How to select interactions?
- How to treat missing values?
- How to make use of the information provided by an earlier or later incidence?
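To make the "reverse-engineering" idea concrete, the following sketch fits a logistic GLM with an intercept and a single standardized risk factor by Newton-Raphson, which for this model coincides with iteratively reweighted least squares. The true coefficients, the sample size, and the function name `fit_logistic_glm` are assumptions chosen for illustration. In practice one would more likely use a statistics package than hand-written code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_glm(xs, ys, n_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient (score) components
        h00 = h01 = h11 = 0.0    # entries of the Fisher information matrix
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H * step = g in closed form.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Synthetic data from a known (illustrative) function: logit(p) = -1.0 + 0.8 * x
rng = random.Random(0)
TRUE_B0, TRUE_B1 = -1.0, 0.8
xs = [rng.uniform(-2.0, 2.0) for _ in range(4000)]
ys = [1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic_glm(xs, ys)
print("estimated intercept and slope:", b0_hat, b1_hat)
```

With enough data the estimates should land close to the coefficients used in the simulation. Quadratic terms or interactions would enter as additional columns of the design matrix.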
Part 3: Neural networks
In this part of the tutorial, we will show (and visualize where possible)
- that neural networks with linear activation functions (and an output activation equal to the inverse link function) are equivalent to GLMs,
- how activation functions, network structures, network depths, etc. affect the family of functions that can be modelled by the neural network,
- how GLMs can be used to "pre-train" neural networks.
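The first point can be checked numerically. In the sketch below, a small network with one hidden layer and identity (linear) hidden activations is collapsed into a single linear predictor. With a sigmoid output, the result has the same form as a logistic GLM. The weights are arbitrary numbers picked for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_layer(weights, bias, inputs):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


# Arbitrary illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]
b1 = [0.1, -0.2, 0.0]
W2 = [[1.0, -0.5, 2.0]]
b2 = [0.3]


def network(x):
    hidden = linear_layer(W1, b1, x)        # identity activation in the hidden layer
    eta = linear_layer(W2, b2, hidden)[0]   # linear predictor
    return sigmoid(eta)                     # output activation = inverse logit link


# Collapse both layers: eta = (W2 W1) x + (W2 b1 + b2)
n_hidden, n_inputs = len(W1), len(W1[0])
beta = [sum(W2[0][h] * W1[h][j] for h in range(n_hidden)) for j in range(n_inputs)]
beta0 = sum(W2[0][h] * b1[h] for h in range(n_hidden)) + b2[0]


def glm(x):
    """Logistic GLM with the collapsed coefficients."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))


for x in ([0.0, 0.0], [1.0, -2.0], [3.5, 0.25]):
    print(x, network(x), glm(x))
```

Replacing the identity activation with a non-linear one such as ReLU breaks this collapse, which is what lets the network represent functions outside the GLM family.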
Part 4: Model explainability and risk factor importance
- Why explainability is always relevant.
- Permutation importance: How to measure feature importance for any model?
- From Individual Conditional Expectations to Partial Dependence: Studying feature effects for any model.
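Both techniques treat the model as a black box that maps a row of features to a prediction. The sketch below shows one possible dependency-free implementation. The `model` function and the data are hypothetical stand-ins: the model depends on age and BMI but ignores a third pure-noise feature.

```python
import random


def model(row):
    """Hypothetical fitted model: uses age and BMI, ignores the noise feature."""
    age, bmi, _noise = row
    return 0.03 * age + 0.05 * bmi


def mse(predict, X, y):
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(X)


def replace_feature(row, col, value):
    return row[:col] + (value,) + row[col + 1:]


def permutation_importance(predict, X, y, col, rng):
    """Increase in MSE after shuffling one feature column."""
    baseline = mse(predict, X, y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [replace_feature(row, col, v) for row, v in zip(X, shuffled)]
    return mse(predict, X_perm, y) - baseline


def ice_curves(predict, X, col, grid):
    """One curve per individual: predictions as the feature sweeps over the grid."""
    return [[predict(replace_feature(row, col, v)) for v in grid] for row in X]


def partial_dependence(predict, X, col, grid):
    """Partial dependence is the pointwise average of the ICE curves."""
    curves = ice_curves(predict, X, col, grid)
    return [sum(curve[k] for curve in curves) / len(curves) for k in range(len(grid))]


rng = random.Random(0)
X = [(rng.uniform(30, 70), rng.uniform(18, 35), rng.gauss(0, 1)) for _ in range(500)]
y = [model(row) + rng.gauss(0, 0.1) for row in X]

importances = {name: permutation_importance(model, X, y, col, rng)
               for col, name in enumerate(["age", "bmi", "noise"])}
age_grid = [30, 40, 50, 60, 70]
pd_age = partial_dependence(model, X, 0, age_grid)
print(importances)
print(pd_age)
```

Shuffling a feature the model ignores leaves the error unchanged, so its importance is zero. Because this toy model is additive, all ICE curves for age are parallel and the partial dependence curve is a straight line.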