Analyzing and Interpreting Marathon Results: A Deeper Jog
Camille Dominguez, Christina Bui, Youqi Qi
May 2nd, 2024
1. Preliminary Work
Boston Marathon Finishers (2017 data)
- kaggle.com
- "Finishers Boston Marathon 2015, 2016 & 2017"
- This data has the names, times and general demographics of the finishers
- Focus on 2017 data (most relevant)
- Why the Boston Marathon?
- Oldest marathon in US
- You have to qualify to participate!
- Predictors: Age group, countries, gender
2. Goals
Description of Data
- 26,410 observations, 8 variables
- Age, Gender, City, Country, Half time, 40K time, Pace, Official Time
Goals of Analysis
- Develop a model to predict finishing times based on demographic information and past running records
- Categorize certain groups and conduct analysis on specific factors
- Use the model to predict current or future groups and their finishing times
- Methodologies:
- Data Preprocessing (loading sketch below)
- Exploratory Data Analysis
- Splitting Data
- Model Building & Diagnostics
- Variable Selection
- Cross-Validation
- Model Evaluation
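A minimal loading and preprocessing sketch in R; the file name and the time-to-seconds helper are assumptions, and the column names follow those used in the rest of the deck:

```{r, include = TRUE}
# Minimal loading/preprocessing sketch. The file name is an assumption;
# column names match the ones referenced on later slides.
marathon <- read.csv("marathon_results_2017.csv", check.names = FALSE)

# Hypothetical helper: convert "H:MM:SS" (or "MM:SS") time strings to seconds
to_seconds <- function(x) {
  sapply(strsplit(as.character(x), ":"), function(p) {
    p <- as.numeric(p)
    sum(p * 60^(rev(seq_along(p)) - 1))
  })
}

time_cols <- c("Half", "40K", "Pace", "Official")
marathon[time_cols] <- lapply(marathon[time_cols], to_seconds)
marathon <- na.omit(marathon)  # keep only complete finisher records
```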
3. Graphs
Exploratory Data Analysis: Plots and Histograms
- Code: hist(marathon$Age, main="Histogram of Ages")
- Skewed right, centered at approximately 45 years
4. Graphs Cont.
Histogram of 40K Times
- Code: hist(marathon$`40K`, main="Histogram of 40K Times", xlab="40K Time (seconds)", col="red")
- Skewed right, centered at approximately 12,500 seconds
Histogram of Half Marathon Times
- Code: hist(marathon$Half, main="Histogram of Half Marathon Times", xlab="Half Marathon Time (seconds)", col="green")
- Skewed right, centered at approximately 6,250 seconds
5. Graphs Cont.
Histogram of Pace
- Code: hist(marathon$Pace, main="Histogram of Pace", xlab="Pace (seconds per mile)", col="purple")
- Skewed right, centered at approximately 500 seconds per mile (consistent with official times near 13,500 seconds over 26.2 miles)
Histogram of Official Times
- Code: hist(marathon$Official, main="Histogram of Official Times", xlab="Official Time (seconds)", col="orange")
- Skewed right, centered at approximately 13,500 seconds
Pairwise Scatter Plot
- Correlations between finishers' running times (plotting sketch below)
- Age vs. Official
- Half vs. 40K
- Pace vs. Official Time
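A sketch of the pairwise scatter plot, assuming the time columns have already been converted to seconds as above:

```{r, include = TRUE}
# Scatter-plot matrix of age and the running-time variables
pairs(marathon[, c("Age", "Half", "40K", "Pace", "Official")],
      main = "Pairwise Scatter Plot", pch = 20, cex = 0.3)
```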
Scatter Plot of Marathon Performance by Gender
- Visualization of the data (plotting sketch below)
- The Male (blue) trend line sits slightly higher than the Female (pink) line
- Overlap and Spread
- Significant overlap between the groups
- Blue points are more widely dispersed
- Linear Correlation
- Faster Half times correspond closely to faster Official times
- Interpretation
- Athletic preparation, physiology, and training differences
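A base-R sketch of this plot; the gender coding ("M"/"F") and the blue/pink colors are assumptions matching the slide's description:

```{r, include = TRUE}
# Half vs. Official times, colored by gender
cols <- ifelse(marathon$Gender == "M", "blue", "pink")
plot(marathon$Half, marathon$Official, col = cols, pch = 20, cex = 0.3,
     xlab = "Half Marathon Time (seconds)", ylab = "Official Time (seconds)",
     main = "Marathon Performance by Gender")
# Per-gender trend lines (the "lines" compared above)
abline(lm(Official ~ Half, data = subset(marathon, Gender == "M")), col = "blue")
abline(lm(Official ~ Half, data = subset(marathon, Gender == "F")), col = "deeppink")
legend("topleft", c("Male", "Female"), col = c("blue", "pink"), pch = 20)
```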
Splitting Data & Model Building
- Split the data into training and test sets (sketch after this list)
- Set a random seed, then drew a random sample of rows
- Sampled 80% of the dataset for the training set, 20% for the test set
- Fit a linear regression model using the training data
- Interpretation of results:
- Age, Gender, Location vs. `40K` & `Pace`
- Model fit: Adjusted R-squared and R-squared of 1
- Indicators of overfitting
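A sketch of the split and the linear fit. The seed value and the top-city cutoff are assumptions; the binary recodes below produce the CityTopCity and CountryUSA dummies that appear on later slides:

```{r, include = TRUE}
# Assumed recodes so City/Country match the dummies reported later
marathon$Country <- factor(ifelse(marathon$Country == "USA", "USA", "Other"))
top.cities <- names(sort(table(marathon$City), decreasing = TRUE))[1:10]
marathon$City <- factor(ifelse(marathon$City %in% top.cities, "TopCity", "Other"))

# 80/20 train/test split with a fixed seed (seed value assumed)
set.seed(42)
train.idx <- sample(nrow(marathon), size = floor(0.8 * nrow(marathon)))
train <- marathon[train.idx, ]
test  <- marathon[-train.idx, ]

# Linear regression fitted on the training set
fit <- lm(Official ~ ., data = train)
summary(fit)$adj.r.squared  # a value of ~1 flags overfitting
```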
Model Building Cont.
- Data Preparation:
- Prepare model matrices (x.train and x.test) and response variable (y.train) for training and test data.
- Ridge Regression:
- Fit a Ridge regression model (glmnet) using a sequence of lambda values.
- Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
- Refit the Ridge model with the selected lambda (final.ridge).
- Use the final Ridge model to predict on the test data (ridge.predict).
- Lasso Regression:
- Fit a Lasso regression model (glmnet) using a sequence of lambda values.
- Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
- Refit the Lasso model with the selected lambda (final.lasso).
- Use the final Lasso model to predict on the test data (lasso.predict); both pipelines are sketched below
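A glmnet sketch of both pipelines, reusing the object names from this slide; the lambda grid is an assumption:

```{r, include = TRUE}
library(glmnet)

# Model matrices (drop the intercept column) and response
x.train <- model.matrix(Official ~ ., data = train)[, -1]
y.train <- train$Official
x.test  <- model.matrix(Official ~ ., data = test)[, -1]

grid <- 10^seq(4, -2, length = 100)  # assumed lambda sequence

# Ridge (alpha = 0): cross-validate, refit at lambda.min, predict
cv.ridge          <- cv.glmnet(x.train, y.train, alpha = 0, lambda = grid)
best.lambda.ridge <- cv.ridge$lambda.min
final.ridge       <- glmnet(x.train, y.train, alpha = 0, lambda = best.lambda.ridge)
ridge.predict     <- predict(final.ridge, newx = x.test)

# Lasso (alpha = 1): identical pipeline with the L1 penalty
cv.lasso          <- cv.glmnet(x.train, y.train, alpha = 1, lambda = grid)
best.lambda.lasso <- cv.lasso$lambda.min
final.lasso       <- glmnet(x.train, y.train, alpha = 1, lambda = best.lambda.lasso)
lasso.predict     <- predict(final.lasso, newx = x.test)
```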
Model Diagnostics
- Ridge Regression
- Utilized optimal lambda value (best.lambda.ridge).
- Achieved a minimum test error of 41070.41.
- Lasso Regression
- Utilized optimal lambda value (best.lambda.lasso).
- Achieved a significantly lower minimum test error of 5687.40.
- Key Findings
- Ridge: test MSE = 41070.41
- Lasso: test MSE = 5687.40
- Interpretation
- Lasso outperformed Ridge with a notably lower test error (comparison sketched below).
- Lasso's feature selection likely contributed to enhanced model precision.
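A sketch of the test-error comparison, assuming the predictions from the previous slide:

```{r, include = TRUE}
# Mean squared error of each model on the held-out test set
mse.ridge <- mean((test$Official - ridge.predict)^2)
mse.lasso <- mean((test$Official - lasso.predict)^2)
c(ridge = mse.ridge, lasso = mse.lasso)
```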
Backward Elimination
Code/Stats
We explored three variable selection methods: Backward Elimination, Ridge Regression, and Lasso Regression.
Method: Starting with a full model containing all predictors, we repeatedly removed the least significant variables until the model's performance improved or stabilized.
The final model retained the following predictors: Gender (M), City (Top City), 40K, Pace.
```{r, include = TRUE}
# Backward Elimination for the linear regression model
# Fit the full model with all predictors
full.model = lm(Official ~ ., data = train)
# Perform backward elimination
reduced.model = step(full.model, direction = "backward")
# Check the summary of the reduced model to see which variables were kept
summary(reduced.model)
```
F Stat: 5.912e+08, p-value: < 2.2e-16
Ridge Regression
Method: Ridge Regression adds a penalty term to the coefficients to prevent overfitting by shrinking them toward zero.
Coefficients for Ridge:
- Age: 1.5672
- GenderM: 40.9327
- CityTopCity: -13.437
- CountryUSA: 16.5415
- Half: 0.4066
- 40K: 0.4057
- Pace: 10.708
Lasso Regression
Method: Lasso Regression adds an L1 penalty term, creating sparsity in the coefficients and effectively selecting variables (coefficient extraction for both models is sketched below).
Coefficients for Lasso:
- 40K: 0.109
- Pace: 22.741
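A sketch of extracting the coefficients reported above from the fitted glmnet objects:

```{r, include = TRUE}
# Ridge keeps every predictor, shrunk toward zero;
# Lasso sets most coefficients exactly to zero
coef(final.ridge)
coef(final.lasso)
```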
Interpretation of Results
Effect of Predictors:
- Backward Elimination (Linear Regression):
- The reduced model includes variables such as GenderM, CityTopCity, 40K, and Pace, indicating that these factors have an impact on the predicted finishing times
- Ridge Regression:
- Notably, the coefficients for GenderM, CityTopCity, CountryUSA, 40K, and Pace demonstrate their influence on predicted marathon finishing times
- Lasso Regression:
- The coefficients retained by Lasso, like 40K and Pace, suggest a significant impact on the predicted marathon finishing times
Model Selection
- The basic linear regression model provides a straightforward understanding of predictor effects but can suffer from overfitting if too many variables are included
- Ridge regression generally outperforms the basic linear model by penalizing large coefficients, which helps prevent overfitting and improves generalization to new data
- Lasso regression uses variable selection to enhance prediction accuracy and reduce model complexity by selecting the most relevant predictors while setting less important coefficients to zero
- Therefore, both ridge and lasso regression techniques offer improvements over the basic linear model in terms of balancing complexity and accuracy.
Possible Modifications and Additional Steps
- Future work could focus on refining the predictive models by incorporating additional features such as weather conditions and course elevation
- Weather plays a significant role in outdoor events like races
- Organizers can collect historical weather data for the race location and time of year and include this information as features in their predictive models
- By analyzing the elevation data along the race route, organizers can obtain metrics like total elevation gain and the distribution of uphill and downhill sections (see the sketch below)
- Including these features in the model allows for a better understanding of how course elevation influences finishing times.
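A hypothetical sketch of deriving such elevation metrics; the elevation profile below (sampled roughly per mile, in meters) is invented for illustration:

```{r, include = TRUE}
# Invented course elevation profile; real values would come from route data
elev <- c(140, 120, 95, 80, 60, 50, 45, 40, 35, 30, 25, 20,
          25, 35, 55, 70, 85, 80, 60, 40, 30, 25, 20, 15, 12, 10, 8)
d <- diff(elev)                 # per-segment elevation change
total.gain  <- sum(d[d > 0])    # total elevation gain (m)
total.loss  <- -sum(d[d < 0])   # total elevation loss (m)
frac.uphill <- mean(d > 0)      # share of segments that climb
c(gain = total.gain, loss = total.loss, uphill = frac.uphill)
```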
Thank you!