2.315 Kuehn et al. (2009)

• Ground-motion model is a nonphysical (subsymbolic) polynomial function of predictor variables (Mw, rjb, Vs,30, fault mechanism and depth to top of rupture) with 48 coefficients (not reported): 14 for Mw, 5 for rjb, 4 for Vs,30, 6 for rupture depth, 15 for combination of Mw and rjb, an intercept parameter, a pseudo-depth and 2 for mechanism. Use polynomials because they are simple, flexible and easy to understand.
• Characterize sites using Vs,30.
• Use three faulting mechanisms:
  Reverse: Rake angle between 30° and 150°. 19 earthquakes and 1870 records.
  Normal: Rake angle between -150° and -30°. 11 earthquakes and 49 records.
  Strike slip: Other rake angle. 30 earthquakes and 741 records.
• Use data from NGA project because it is the best dataset currently available. Note that a significant amount of metadata is missing and discuss the problems this causes. Assume that metadata are missing at random, which means that it is possible to perform unbiased statistical inference. To overcome missing metadata, select only records for which all metadata exist, which they note is strictly valid only when metadata are missing completely at random.
• Select only records that are representative of free-field conditions based on Geomatrix classification C1.
• Exclude some data from Chi-Chi sequence due to poor quality or co-located instruments.
• Exclude data from rjb > 200 km because of low engineering significance and to reduce correlation between magnitude and distance. Also note that this reduces possible bias due to different attenuation in different regions.
• Original selection contained one record at Mw 5.2, with the next smallest at Mw 5.61. The Mw 5.2 record had a dominant role for small magnitudes, so it was removed.
• Discuss the problem of over-fitting (modelling more spurious details of sample than are supported by data generating process) and propose the use of generalization error (estimated using cross validation), which directly estimates the average prediction error for data not used to develop model, to counteract it. Judge quality of model primarily in terms of predictive power. Conclude that approach is viable for large datasets.
• State that objective is not to develop a fully-fledged alternative NGA model but to present an extension to traditional modelling strategies, based on intelligent data analysis from the fields of machine learning and artificial intelligence.
• For k-fold cross validation, split data into k roughly equal-sized subsets. Fit model to k - 1 subsets and compute prediction error for unused subset. Repeat for all k subsets. Combine k prediction error estimates to obtain estimate of generalization error. Use k = 10, which is often used for this approach.
• Use rjb because some trials with simple functional form show that it gives a smaller generalization error than, e.g., rrup.
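
The k-fold cross-validation procedure summarized above can be sketched as follows. This is a minimal illustration only, assuming a one-dimensional polynomial fitted to synthetic data rather than the NGA records; the function name, the squared-error loss, the simple averaging of the fold errors and the polynomial degrees are assumptions made for the example and are not taken from Kuehn et al. (2009).

```python
import numpy as np


def k_fold_generalization_error(x, y, degree, k=10, seed=0):
    """Estimate the mean squared prediction error of a polynomial fit by k-fold CV.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, k)  # k roughly equal-sized subsets

    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Fit on the k - 1 subsets that exclude fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)

        # Prediction error on the unused subset.
        residuals = y[test_idx] - np.polyval(coeffs, x[test_idx])
        fold_errors.append(np.mean(residuals ** 2))

    # Combine the k estimates into one generalization-error estimate.
    return float(np.mean(fold_errors))


# Synthetic data (assumption): quadratic trend plus Gaussian noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + data_rng.normal(0.0, 0.3, size=200)

# Compare an under-fitted, a well-matched and a more flexible polynomial.
cv_errors = {d: k_fold_generalization_error(x, y, d, k=10) for d in (1, 2, 8)}
for d, err in cv_errors.items():
    print(f"degree {d}: estimated generalization error = {err:.4f}")
```

In this sketch the model with the smallest estimated generalization error would be preferred, since the estimate depends only on data not used in each fit and therefore penalizes both under-fitting and over-fitting.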