- Ground-motion model is a nonphysical (subsymbolic) polynomial function of predictor variables (M_{w}, r_{jb}, V_{s,30}, fault mechanism and depth to top of rupture) with 48 coefficients (not reported): 14 for M_{w}, 5 for r_{jb}, 4 for V_{s,30}, 6 for rupture depth, 15 for combination of M_{w} and r_{jb}, intercept parameter, pseudo-depth and 2 for mechanism. Use polynomials because simple, flexible and easy to understand.
- Characterize sites using V_{s,30}.
- Use three faulting mechanisms:
  - Reverse: rake angle between 30 and 150^{∘}. 19 earthquakes and 1870 records.
  - Normal: rake angle between -150 and -30^{∘}. 11 earthquakes and 49 records.
  - Strike slip: other rake angle. 30 earthquakes and 741 records.
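
The rake-angle rule for the three mechanism classes can be sketched as a small function. This is an illustrative sketch only: the function name `classify_mechanism` is hypothetical, and treating the bounds (30, 150, -150 and -30 degrees) as inclusive is an assumption, since the summary does not say how boundary values are handled.

```python
def classify_mechanism(rake_deg: float) -> str:
    """Classify faulting mechanism from rake angle in degrees.

    Assumes rake is given in the range -180 to 180 degrees and that the
    class boundaries are inclusive (an assumption, not stated in the source).
    """
    if 30.0 <= rake_deg <= 150.0:
        return "reverse"
    if -150.0 <= rake_deg <= -30.0:
        return "normal"
    # Every other rake angle is treated as strike slip.
    return "strike-slip"


# Example usage with a few representative rake angles.
for rake in (90.0, -90.0, 0.0, 180.0):
    print(rake, classify_mechanism(rake))
```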

- Use data from NGA project because best dataset currently available. Note that a significant amount of metadata is missing. Discuss the problems of missing metadata. Assume that metadata are missing at random, which means that it is possible to perform unbiased statistical inference. To overcome missing metadata, select only records where all metadata exist, but note that this is strictly valid only when metadata are missing completely at random.
- Select only records that are representative of free-field conditions based on Geomatrix classification C1.
- Exclude some data from Chi-Chi sequence due to poor quality or co-located instruments.
- Exclude data from r_{jb} > 200 km because of low engineering significance and to reduce correlation between magnitude and distance. Also note that this reduces possible bias due to different attenuation in different regions.
- In original selection one record with M_{w} 5.2 and the next at M_{w} 5.61. Record with M_{w} 5.2 had a dominant role for small magnitudes so it was removed.
- Discuss the problem of over-fitting (modelling more spurious details of sample than are supported by data generating process) and propose the use of generalization error (estimated using cross validation), which directly estimates the average prediction error for data not used to develop model, to counteract it. Judge quality of model primarily in terms of predictive power. Conclude that approach is viable for large datasets.
- State that objective is not to develop a fully-fledged alternative NGA model but to present an extension to traditional modelling strategies, based on intelligent data analysis from the fields of machine learning and artificial intelligence.
- For k-fold cross validation, split data into k roughly equal-sized subsets. Fit model to k - 1 subsets and compute prediction error for unused subset. Repeat for all k subsets. Combine k prediction error estimates to obtain estimate of generalization error. Use k = 10, which is often used for this approach.
- Use r_{jb} because some trials with simple functional form show that it gives a smaller generalization error than, e.g., r_{rup}.
- Start with simple functional form and add new terms and retain those that lead to a reduction in generalization error.
- Note that some coefficients are not statistically significant at the 5% level, but also that 5% is an arbitrary level and that these coefficients result in lower generalization error.
- Compare generalization error of final model to that from fitting the functional form of Akkar and Bommer (2007b) and an over-fit polynomial model with 58 coefficients and find they have considerably higher generalization errors.
- After having found the functional form, refit equation using random-effects regression.
- Note that little data for r_{jb} < 5 km.
- Note that weakness of model is that it is not physically interpretable and it cannot be extrapolated. Also note that could have problems if dataset is not representative of underlying data generating process.
- Note that there is a problem with the magnitude scaling of the model because the available data are not representative of the underlying distribution.
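
The k-fold cross-validation procedure and the rule of retaining a term only if it lowers the generalization error can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the helper names (`kfold_generalization_error`, `fit_constant`, `fit_line`), the use of mean squared error and the averaging of the k fold errors are all assumptions made for the example. The actual model is a 48-coefficient polynomial refitted with random-effects regression, which is not reproduced here.

```python
import random
from statistics import fmean


def kfold_generalization_error(xs, ys, fit, k=10, seed=0):
    """Estimate generalization error (mean squared error) by k-fold cross validation.

    `fit(train_xs, train_ys)` must return a function that predicts y from x.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k roughly equal-sized subsets.
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit to the other k - 1 subsets, then predict the unused subset.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(fmean((ys[i] - predict(xs[i])) ** 2 for i in held_out))
    # Combine the k estimates (here by averaging, an assumption).
    return fmean(fold_errors)


def fit_constant(xs, ys):
    """Simplest functional form: intercept only."""
    mean_y = fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate form with one added term: intercept plus linear term."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


# Synthetic data (hypothetical): linear trend plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

err_const = kfold_generalization_error(xs, ys, fit_constant, k=10)
err_line = kfold_generalization_error(xs, ys, fit_line, k=10)

# Retain the added term only if it reduces the generalization error.
keep_linear_term = err_line < err_const
print(err_const, err_line, keep_linear_term)
```

Both candidate forms are scored on the same folds because the seed is fixed, so the comparison reflects the added term and not a different split of the data.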