Learning Stat Final Project

Christina Bui

Created on April 28, 2024


Analyzing and Interpreting Marathon Results: A Deeper Jog

Camille Dominguez, Christina Bui, Youqi Qi

May 2nd, 2024

1. Preliminary Work

Boston Marathon Finishers (2017 data)

  • kaggle.com
    • "Finishers Boston Marathon 2015, 2016 & 2017"
      • This data has the names, times, and general demographics of the finishers
  • Focus on 2017 data (most relevant)
  • Why the Boston Marathon?
    • Oldest annual marathon in the US
      • You have to qualify to participate!
  • Predictors: age group, country, gender


2. Goals

Description of Data

Goals of Analysis

  • Develop a model to predict finishing times based on demographic information and past running records
  • Categorize certain groups and conduct analysis on specific factors
  • Use model to predict current or future groups and their finishing times
  • 26,410 observations, 8 variables
    • Age, Gender, City, Country, Half time, 40K time, Pace, Official Time
  • Methodologies:
    • Data Preprocessing
    • Exploratory Data Analysis
    • Splitting Data
    • Model Building & Diagnostics
    • Variable Selection
    • Cross-Validation
    • Model Evaluation
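Before these steps can run, the split times in the raw CSV (which arrive as `H:MM:SS` strings) have to be converted to seconds. A minimal preprocessing sketch; the filename and column names are assumptions based on the Kaggle dataset description, so adjust them to the actual file:

```r
# Convert "H:MM:SS" time strings to seconds; malformed entries become NA.
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  sapply(parts, function(p) {
    p <- suppressWarnings(as.numeric(p))
    if (length(p) != 3 || any(is.na(p))) return(NA_real_)
    p[1] * 3600 + p[2] * 60 + p[3]
  })
}

# Assumed filename/columns from the Kaggle listing -- not confirmed here:
# marathon <- read.csv("marathon_results_2017.csv")
# marathon$Half     <- to_seconds(marathon$Half)
# marathon$`40K`    <- to_seconds(marathon$`40K`)
# marathon$Official <- to_seconds(marathon$Official.Time)
# marathon <- na.omit(marathon)

to_seconds("1:44:10")  # 6250 seconds, a typical half split
```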

3. Graphs

Exploratory Data Analysis: Plots and Histograms

  • Code: hist(marathon$Age, main="Histogram of Ages")
    • Skewed right, center approx. ~45 years

4. Graphs Cont.

Histogram of Half Marathon Times

  • Code: hist(marathon$Half, main="Histogram of Half Marathon Times", xlab="Half Marathon Time (seconds)", col="green")
  • Skewed right, center approx. ~6250 seconds

Histogram of 40K Times

  • Code: hist(marathon$`40K`, main="Histogram of 40K Times", xlab="40K Time (seconds)", col="red")
  • Skewed right, center approx. ~12500 seconds

5. Graphs Cont.

Histogram of Pace

  • Code: hist(marathon$Pace, main="Histogram of Pace", xlab="Pace (seconds per mile)", col="purple")
  • Skewed right, center approx. ~500 seconds per mile (consistent with an official time near 13500 seconds over 26.2 miles)

Histogram of Official Times

  • Code: hist(marathon$Official, main="Histogram of Official Times", xlab="Official Time (seconds)", col="orange")
  • Skewed right, center approx. ~13500 seconds

Pairwise Scatter Plot

  • Correlations between finishers' running times
    • Age vs. Official
    • Half vs. 40k
    • Pace vs. Official Time
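The pairwise view can be produced with base R's pairs(). A sketch on simulated stand-in data, since the Kaggle file is not bundled here; the column names follow the deck, but the scales are rough assumptions:

```r
# Simulated stand-in for the marathon data (real analysis uses the Kaggle CSV).
set.seed(1)
n <- 200
half <- rnorm(n, 6250, 600)                  # half split, seconds
marathon <- data.frame(
  Age      = sample(18:80, n, replace = TRUE),
  Half     = half,
  `40K`    = half * 2 + rnorm(n, 0, 200),    # 40K split tracks the half closely
  Pace     = rnorm(n, 500, 40),              # seconds per mile
  Official = half * 2.16 + rnorm(n, 0, 250),
  check.names = FALSE
)
# One scatter plot per variable pair, mirroring the slide's figure
pairs(marathon, main = "Pairwise Scatter Plot of Marathon Variables")
```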

Scatter Plot of Marathon Performance by Gender

  • Visualization of data
    • The Male (blue) trend line sits slightly higher than the Female (pink) line
  • Overlap and spread
    • Significant overlap between the two groups
    • Blue points are more widely dispersed
  • Linear correlation
    • Faster Half times correspond closely to faster Official times
  • Interpretation
    • Athletic preparation, physiology, and training differences
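A sketch of how such a gender-coloured scatter with per-group trend lines can be drawn in base R; the data here are simulated stand-ins, and the slight male/female offset is baked in purely for illustration:

```r
# Simulated stand-in data with a small gender offset in half-marathon times
set.seed(2)
n <- 300
gender <- sample(c("M", "F"), n, replace = TRUE)
half <- rnorm(n, 6250, 600) - ifelse(gender == "M", 300, 0)
official <- half * 2.16 + rnorm(n, 0, 250)

# Colour points by gender, then overlay one fitted line per group
cols <- ifelse(gender == "M", "blue", "pink")
plot(half, official, col = cols, pch = 19,
     xlab = "Half Marathon Time (seconds)",
     ylab = "Official Time (seconds)",
     main = "Marathon Performance by Gender")
abline(lm(official[gender == "M"] ~ half[gender == "M"]), col = "blue")
abline(lm(official[gender == "F"] ~ half[gender == "F"]), col = "pink")
```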

Splitting Data & Model Building

  • Split the data into training and test sets
    • Used random sampling with a fixed seed
    • Sampled 80% of the dataset for the training set, 20% for the test set
  • Performed linear regression (fitted using training data)
    • Interpretation of results:
      • Age, Gender, Location vs. '40K' & 'Pace'
    • Model fit: R-squared and adjusted R-squared of 1
      • Indicators of overfitting
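The split-and-fit step can be sketched as follows; the data frame is simulated as a stand-in for the real file, and because Official is constructed from Half here, the near-perfect fit mirrors the overfitting signal noted above:

```r
# Simulated stand-in data (real analysis uses the Kaggle marathon file)
set.seed(123)                                      # fixed seed for reproducibility
n <- 1000
half <- rnorm(n, 6250, 600)
marathon <- data.frame(
  Age      = sample(18:80, n, replace = TRUE),
  Gender   = factor(sample(c("M", "F"), n, replace = TRUE)),
  Half     = half,
  Official = half * 2.16 + rnorm(n, 0, 250)
)

# 80% of rows to the training set, the remaining 20% to the test set
train.idx <- sample(seq_len(n), size = 0.8 * n)
train <- marathon[train.idx, ]
test  <- marathon[-train.idx, ]

# Linear regression fitted on the training data only
fit <- lm(Official ~ ., data = train)
summary(fit)$adj.r.squared   # very high here, since Official is built from Half
```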

Model Building Cont.

  • Data Preparation:
    • Prepare model matrices (x.train and x.test) and response variable (y.train) for training and test data.
  • Ridge Regression:
    • Fit a Ridge regression model (glmnet) using a sequence of lambda values.
    • Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
    • Refit the Ridge model with the selected lambda (final.ridge).
    • Use the final Ridge model to predict on the test data (ridge.predict).
  • Lasso Regression:
    • Fit a Lasso regression model (glmnet) using a sequence of lambda values.
    • Perform cross-validation (cv.glmnet) to select the best lambda (lambda.min).
    • Refit the Lasso model with the selected lambda (final.lasso).
    • Use the final Lasso model to predict on the test data (lasso.predict).
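The Ridge and Lasso pipelines above can be sketched with the glmnet package. The design matrix is simulated as a stand-in (the real analysis uses the marathon training/test split), and "K40" is this sketch's stand-in name for the 40K column:

```r
library(glmnet)

# Simulated stand-in design matrix and response
set.seed(42)
n <- 500
x <- matrix(rnorm(n * 4), n, 4,
            dimnames = list(NULL, c("Age", "Half", "K40", "Pace")))
y <- 2 * x[, "Half"] + 0.5 * x[, "Pace"] + rnorm(n)

# Prepare x.train / x.test / y.train, as in the Data Preparation step
train.idx <- sample(seq_len(n), 0.8 * n)
x.train <- x[train.idx, ]; y.train <- y[train.idx]
x.test  <- x[-train.idx, ]

# Ridge (alpha = 0): cross-validate to pick lambda.min, refit, predict
cv.ridge    <- cv.glmnet(x.train, y.train, alpha = 0)
final.ridge <- glmnet(x.train, y.train, alpha = 0,
                      lambda = cv.ridge$lambda.min)
ridge.predict <- predict(final.ridge, newx = x.test)

# Lasso (alpha = 1): same pipeline with an L1 penalty
cv.lasso    <- cv.glmnet(x.train, y.train, alpha = 1)
final.lasso <- glmnet(x.train, y.train, alpha = 1,
                      lambda = cv.lasso$lambda.min)
lasso.predict <- predict(final.lasso, newx = x.test)
```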

Model Diagnostics

  • Ridge Regression
    • Utilized optimal lambda value (best.lambda.ridge).
    • Achieved a minimum test error of 41070.41.
  • Lasso Regression
    • Utilized optimal lambda value (best.lambda.lasso).
    • Achieved a significantly lower minimum test error of 5687.40.
  • Key Findings
    • Ridge: MSE= 41070.41
    • Lasso: MSE = 5687.40
  • Interpretation
    • Lasso outperformed Ridge with a notably lower test error.
    • Lasso's feature selection likely contributed to enhanced model precision.
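The test errors compared above are mean squared errors between the held-out responses and the model's predictions; a small helper with toy numbers (not the real predictions) makes the computation explicit:

```r
# Mean squared error on a held-out test set
test.mse <- function(y.true, y.pred) mean((y.true - y.pred)^2)

# Toy example: three held-out official times (seconds) and predictions
y.true <- c(13500, 12800, 14200)
y.pred <- c(13450, 12900, 14100)
test.mse(y.true, y.pred)  # 7500 -- average squared error, in seconds^2
```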

Backward Elimination

Code/Stats

We explored three variable selection methods: Backward Elimination, Ridge Regression, and Lasso Regression. Method: starting with a full model containing all predictors, we repeatedly removed the least significant variables until the model's performance improved or stabilized. The final model retained the following predictors: Gender (M), City (Top City), 40K, and Pace.

```{r, include = TRUE}
# Backward elimination for the linear regression model
# Fit the full model with all predictors
full.model = lm(Official ~ ., data = train)
# Perform backward elimination
reduced.model = step(full.model, direction = "backward")
# Check the summary of the reduced model to see which variables were kept
summary(reduced.model)
```

F statistic: 5.912e+08; p-value: < 2.2e-16

Ridge Regression

Method: Ridge Regression adds a penalty term to the coefficients to prevent overfitting by shrinking them towards zero.

  • Coefficients for Ridge:
    • Age: 1.5672
    • GenderM: 40.9327
    • CityTopCity: -13.437
    • CountryUSA: 16.5415
    • Half: 0.4066
    • 40K: 0.4057
    • Pace: 10.708

Lasso Regression

Method: Lasso Regression adds an L1 penalty term, creating sparsity in the coefficients and effectively selecting variables.

  • Coefficients for Lasso:
    • 40K: 0.109
    • Pace: 22.741
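The sparsity effect can be seen directly on simulated stand-in data: with an L1 penalty, predictors carrying no signal typically get coefficients of exactly zero, while ridge would keep them small but nonzero. Coefficient sizes and the penalty value below are illustrative assumptions, not the deck's fitted values:

```r
library(glmnet)

# Simulated data: only K40 and Pace actually drive y
set.seed(7)
x <- matrix(rnorm(300 * 4), 300, 4,
            dimnames = list(NULL, c("Age", "Half", "K40", "Pace")))
y <- 0.5 * x[, "K40"] + 2 * x[, "Pace"] + rnorm(300, 0, 0.5)

# Lasso with a fixed penalty; noise-level predictors (Age, Half)
# typically shrink to exactly zero at this lambda
lasso <- glmnet(x, y, alpha = 1, lambda = 0.4)
coef(lasso)
```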

Interpretation of Results

Effect of Predictors:

  • Backward Elimination (Linear Regression):
    • The reduced model includes variables such as GenderM, CityTopCity, 40K, and Pace, indicating that these factors have an impact on the predicted finishing times
  • Ridge Regression:
    • Notably, the coefficients for GenderM, CityTopCity, CountryUSA, 40K, and Pace demonstrate their influence on predicted marathon finishing times
  • Lasso Regression:
    • The coefficients retained by Lasso, like 40K and Pace, suggest their significant impact on the predicted marathon finishing times

Model Selection

  • The basic linear regression model provides a straightforward understanding of predictor effects but can suffer from overfitting if too many variables are included
  • Ridge regression generally outperforms the basic linear model by penalizing large coefficients, which helps prevent overfitting and improves generalization to new data
  • Lasso regression uses variable selection to enhance prediction accuracy and reduce model complexity by selecting the most relevant predictors while setting less important coefficients to zero
  • Therefore, both ridge and lasso regression techniques offer improvements over the basic linear model in terms of balancing complexity and accuracy.

Possible Modifications and Additional Steps

  • Future work could focus on refining the predictive models by incorporating additional features such as weather conditions and course elevation
  • Weather plays a significant role in outdoor events like races
  • Organizers can collect historical weather data for the race location and time of year and can include this information as features in their predictive models
  • By analyzing the elevation data along the race route, organizers can obtain metrics like total elevation gain, and the distribution of uphill and downhill sections
  • Including these features in the model allows for better understanding of how course elevation influences finishing times.

Thank you!