Learning Stat Final Project

Christina Bui

Created on April 28, 2024


Analyzing and Interpreting Marathon Results: A Deeper Jog

Camille Dominguez, Christina Bui, Youqi Qi

May 2nd, 2024

1. Preliminary Work

Boston Marathon Finishers (2017 data)

  • kaggle.com
    • "Finishers Boston Marathon 2015, 2016 & 2017"
      • This data has the names, times, and general demographics of the finishers
  • Focus on 2017 data (most relevant)
  • Why the Boston Marathon?
    • Oldest annual marathon in the US
      • You have to qualify to participate!
  • Predictors: age group, country, gender


2. Goals

Description of Data

Goals of Analysis

  • Develop a model to predict finishing times based on demographic information and past running records
  • Categorize certain groups and conduct analysis on specific factors
  • Use model to predict current or future groups and their finishing times
  • 26,410 observations, 8 variables
    • Age, Gender, City, Country, Half time, 40K time, Pace, Official Time
  • Methodologies:
    • Data Preprocessing
    • Exploratory Data Analysis
    • Splitting Data
    • Model Building & Diagnostics
    • Variable Selection
    • Cross-Validation
    • Model Evaluation
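Before these steps can run, the split times in the raw CSV (which arrive as `H:MM:SS` strings) have to be converted to seconds. A minimal preprocessing sketch; the filename and column names are assumptions based on the Kaggle dataset description, so adjust them to the actual file:

```r
# Convert "H:MM:SS" time strings to seconds; malformed entries become NA.
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  sapply(parts, function(p) {
    p <- suppressWarnings(as.numeric(p))
    if (length(p) != 3 || any(is.na(p))) return(NA_real_)
    p[1] * 3600 + p[2] * 60 + p[3]
  })
}

# Assumed filename/columns from the Kaggle listing -- not confirmed here:
# marathon <- read.csv("marathon_results_2017.csv")
# marathon$Half     <- to_seconds(marathon$Half)
# marathon$`40K`    <- to_seconds(marathon$`40K`)
# marathon$Official <- to_seconds(marathon$Official.Time)
# marathon <- na.omit(marathon)

to_seconds("1:44:10")  # 6250 seconds, a typical half split
```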

3. Graphs

Exploratory Data Analysis: Plots and Histograms

  • Code: hist(marathon$Age, main="Histogram of Ages")
    • Skewed right, center approx. ~45 years

4. Graphs Cont.

Histogram of Half Marathon Times

  • Code: hist(marathon$Half, main="Histogram of Half Marathon Times", xlab="Half Marathon Time (seconds)", col="green")
  • Skewed right, center approx. ~6250 seconds

Histogram of 40K Times

  • Code: hist(marathon$`40K`, main="Histogram of 40K Times", xlab="40K Time (seconds)", col="red")
  • Skewed right, center approx. ~12500 seconds

5. Graphs Cont.

Histogram of Pace

  • Code: hist(marathon$Pace, main="Histogram of Pace", xlab="Pace (seconds per mile)", col="purple")
  • Skewed right, center approx. ~500 seconds per mile (consistent with an official time near 13500 seconds over 26.2 miles)

Histogram of Official Times

  • Code: hist(marathon$Official, main="Histogram of Official Times", xlab="Official Time (seconds)", col="orange")
  • Skewed right, center approx. ~13500 seconds

Pairwise Scatter Plot

  • Correlations between finishers' running times
    • Age vs. Official
    • Half vs. 40k
    • Pace vs. Official Time
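The pairwise view can be produced with base R's pairs(). A sketch on simulated stand-in data, since the Kaggle file is not bundled here; the column names follow the deck, but the scales are rough assumptions:

```r
# Simulated stand-in for the marathon data (real analysis uses the Kaggle CSV).
set.seed(1)
n <- 200
half <- rnorm(n, 6250, 600)                  # half split, seconds
marathon <- data.frame(
  Age      = sample(18:80, n, replace = TRUE),
  Half     = half,
  `40K`    = half * 2 + rnorm(n, 0, 200),    # 40K split tracks the half closely
  Pace     = rnorm(n, 500, 40),              # seconds per mile
  Official = half * 2.16 + rnorm(n, 0, 250),
  check.names = FALSE
)
# One scatter plot per variable pair, mirroring the slide's figure
pairs(marathon, main = "Pairwise Scatter Plot of Marathon Variables")
```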

Scatter Plot of Marathon Performance by Gender

  • Visualization of data
    • The Male (blue) trend line sits slightly higher than the Female (pink) line
  • Overlap and spread
    • Significant overlap between the two groups
    • Blue points are more widely dispersed
  • Linear correlation
    • Faster Half times correspond closely to faster Official times
  • Interpretation
    • Athletic preparation, physiology, and training differences
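A sketch of how such a gender-coloured scatter with per-group trend lines can be drawn in base R; the data here are simulated stand-ins, and the slight male/female offset is baked in purely for illustration:

```r
# Simulated stand-in data with a small gender offset in half-marathon times
set.seed(2)
n <- 300
gender <- sample(c("M", "F"), n, replace = TRUE)
half <- rnorm(n, 6250, 600) - ifelse(gender == "M", 300, 0)
official <- half * 2.16 + rnorm(n, 0, 250)

# Colour points by gender, then overlay one fitted line per group
cols <- ifelse(gender == "M", "blue", "pink")
plot(half, official, col = cols, pch = 19,
     xlab = "Half Marathon Time (seconds)",
     ylab = "Official Time (seconds)",
     main = "Marathon Performance by Gender")
abline(lm(official[gender == "M"] ~ half[gender == "M"]), col = "blue")
abline(lm(official[gender == "F"] ~ half[gender == "F"]), col = "pink")
```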

Splitting Data & Model Building

  • Split the data into training and test sets
    • Used random sampling with a fixed seed
    • Sampled 80% of the dataset for the training set, 20% for the test set
  • Performed linear regression (fitted using training data)
    • Interpretation of results:
      • Age, Gender, Location vs. '40K' & 'Pace'
    • Model fit: R-squared and adjusted R-squared of 1
      • Indicators of overfitting
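The split-and-fit step can be sketched as follows; the data frame is simulated as a stand-in for the real file, and because Official is constructed from Half here, the near-perfect fit mirrors the overfitting signal noted above:

```r
# Simulated stand-in data (real analysis uses the Kaggle marathon file)
set.seed(123)                                      # fixed seed for reproducibility
n <- 1000
half <- rnorm(n, 6250, 600)
marathon <- data.frame(
  Age      = sample(18:80, n, replace = TRUE),
  Gender   = factor(sample(c("M", "F"), n, replace = TRUE)),
  Half     = half,
  Official = half * 2.16 + rnorm(n, 0, 250)
)

# 80% of rows to the training set, the remaining 20% to the test set
train.idx <- sample(seq_len(n), size = 0.8 * n)
train <- marathon[train.idx, ]
test  <- marathon[-train.idx, ]

# Linear regression fitted on the training data only
fit <- lm(Official ~ ., data = train)
summary(fit)$adj.r.squared   # very high here, since Official is built from Half
```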

Model Building Cont.

  • Data Preparation:
    • Prepare model matrices (x.train and x.test) and response variable (y.train) for training and test data.
  • Ridge Regression:
    • Fit a Ridge regression model (glmnet) using a sequence of lambda values.
    • Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
    • Refit the Ridge model with the selected lambda (final.ridge).
    • Use the final Ridge model to predict on the test data (ridge.predict).
  • Lasso Regression:
    • Fit a Lasso regression model (glmnet) using a sequence of lambda values.
    • Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
    • Refit the Lasso model with the selected lambda (final.lasso).
    • Use the final Lasso model to predict on the test data (lasso.predict).
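The Ridge and Lasso pipelines above can be sketched with the glmnet package. The design matrix is simulated as a stand-in (the real analysis uses the marathon training/test split), and "K40" is this sketch's stand-in name for the 40K column:

```r
library(glmnet)

# Simulated stand-in design matrix and response
set.seed(42)
n <- 500
x <- matrix(rnorm(n * 4), n, 4,
            dimnames = list(NULL, c("Age", "Half", "K40", "Pace")))
y <- 2 * x[, "Half"] + 0.5 * x[, "Pace"] + rnorm(n)

# Prepare x.train / x.test / y.train, as in the Data Preparation step
train.idx <- sample(seq_len(n), 0.8 * n)
x.train <- x[train.idx, ]; y.train <- y[train.idx]
x.test  <- x[-train.idx, ]

# Ridge (alpha = 0): cross-validate to pick lambda.min, refit, predict
cv.ridge    <- cv.glmnet(x.train, y.train, alpha = 0)
final.ridge <- glmnet(x.train, y.train, alpha = 0,
                      lambda = cv.ridge$lambda.min)
ridge.predict <- predict(final.ridge, newx = x.test)

# Lasso (alpha = 1): same pipeline with an L1 penalty
cv.lasso    <- cv.glmnet(x.train, y.train, alpha = 1)
final.lasso <- glmnet(x.train, y.train, alpha = 1,
                      lambda = cv.lasso$lambda.min)
lasso.predict <- predict(final.lasso, newx = x.test)
```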

Model Diagnostics

  • Ridge Regression
    • Utilized optimal lambda value (best.lambda.ridge).
    • Achieved a minimum test error of 41070.41.
  • Lasso Regression
    • Utilized optimal lambda value (best.lambda.lasso).
    • Achieved a significantly lower minimum test error of 5687.40.
  • Key Findings
    • Ridge: MSE= 41070.41
    • Lasso: MSE = 5687.40
  • Interpretation
    • Lasso outperformed Ridge with a notably lower test error.
    • Lasso's feature selection likely contributed to enhanced model precision.
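The test errors compared above are mean squared errors between the held-out responses and the model's predictions; a small helper with toy numbers (not the real predictions) makes the computation explicit:

```r
# Mean squared error on a held-out test set
test.mse <- function(y.true, y.pred) mean((y.true - y.pred)^2)

# Toy example: three held-out official times (seconds) and predictions
y.true <- c(13500, 12800, 14200)
y.pred <- c(13450, 12900, 14100)
test.mse(y.true, y.pred)  # 7500 -- average squared error, in seconds^2
```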

Backward Elimination

Code/Stats

We explored three variable selection methods: Backward Elimination, Ridge Regression, and Lasso Regression. Method: starting with a full model containing all predictors, we repeatedly removed the least significant variables until the model's performance improved or stabilized. The final model retained the following predictors: Gender (M), City (Top City), 40K, and Pace.

```{r, include = TRUE}
# Backward elimination for the linear regression model
# Fit the full model with all predictors
full.model = lm(Official ~ ., data = train)
# Perform backward elimination
reduced.model = step(full.model, direction = "backward")
# Check the summary of the reduced model to see which variables were kept
summary(reduced.model)
```

F statistic: 5.912e+08; p-value: < 2.2e-16

Ridge Regression

Method: Ridge Regression adds a penalty term to the coefficients to prevent overfitting by shrinking them towards zero.

  • Coefficients for Ridge:
    • Age: 1.5672
    • GenderM: 40.9327
    • CityTopCity: -13.437
    • CountryUSA: 16.5415
    • Half: 0.4066
    • 40K: 0.4057
    • Pace: 10.708

Lasso Regression

Method: Lasso Regression adds an L1 penalty term, creating sparsity in the coefficients and effectively selecting variables.

  • Coefficients for Lasso:
    • 40K: 0.109
    • Pace: 22.741
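The sparsity effect can be seen directly on simulated stand-in data: with an L1 penalty, predictors carrying no signal typically get coefficients of exactly zero, while ridge would keep them small but nonzero. Coefficient sizes and the penalty value below are illustrative assumptions, not the deck's fitted values:

```r
library(glmnet)

# Simulated data: only K40 and Pace actually drive y
set.seed(7)
x <- matrix(rnorm(300 * 4), 300, 4,
            dimnames = list(NULL, c("Age", "Half", "K40", "Pace")))
y <- 0.5 * x[, "K40"] + 2 * x[, "Pace"] + rnorm(300, 0, 0.5)

# Lasso with a fixed penalty; noise-level predictors (Age, Half)
# typically shrink to exactly zero at this lambda
lasso <- glmnet(x, y, alpha = 1, lambda = 0.4)
coef(lasso)
```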

Interpretation of Results

Effect of Predictors:

  • Backward Elimination (Linear Regression):
    • The reduced model includes variables such as GenderM, CityTopCity, 40K, and Pace, indicating that these factors have an impact on the predicted finishing times
  • Ridge Regression:
    • Notably, the coefficients for GenderM, CityTopCity, CountryUSA, 40K, and Pace demonstrate their influence on predicted marathon finishing times
  • Lasso Regression:
    • The coefficients retained by Lasso, like 40K and Pace, suggest their significant impact on the predicted marathon finishing times

Model Selection

  • The basic linear regression model provides a straightforward understanding of predictor effects but can suffer from overfitting if too many variables are included
  • Ridge regression generally outperforms the basic linear model by penalizing large coefficients, which helps prevent overfitting and improves generalization to new data
  • Lasso regression uses variable selection to enhance prediction accuracy and reduce model complexity by selecting the most relevant predictors while setting less important coefficients to zero
  • Therefore, both ridge and lasso regression techniques offer improvements over the basic linear model in terms of balancing complexity and accuracy.

Possible Modifications and Additional Steps

  • Future work could focus on refining the predictive models by incorporating additional features such as weather conditions and course elevation
  • Weather plays a significant role in outdoor events like races
  • Organizers can collect historical weather data for the race location and time of year and can include this information as features in their predictive models
  • By analyzing the elevation data along the race route, organizers can obtain metrics like total elevation gain, and the distribution of uphill and downhill sections
  • Including these features in the model allows for better understanding of how course elevation influences finishing times.

Thank you!