Evaluation Metrics for Machine Learning Algorithms
Why Evaluate a Supervised Learning-Based Model?
Is my smartwatch detecting my stress levels accurately?
Evaluation is about measuring how well a trained machine learning model performs when tested on new data it has never seen before. In supervised learning, we have ground truth labels (e.g., “stressed” vs. “calm”), which allows us to compute key metrics.
Key Evaluation Metrics for Supervised Learning:
Accuracy
Precision
Recall
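The three metrics above can be computed directly from prediction counts. Below is a minimal pure-Python sketch using made-up smartwatch labels (1 = "stressed", 0 = "calm"); the data is invented for illustration, not taken from this unit:

```python
# Hypothetical ground truth and smartwatch predictions (1 = stressed, 0 = calm).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives (false alarms)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (missed stress)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)  # fraction of all predictions that are correct
precision = tp / (tp + fp)          # of predicted "stressed", how many were right
recall = tp / (tp + fn)             # of truly "stressed", how many were caught
```

Accuracy summarizes overall correctness, while precision and recall separate the two ways the watch can fail: crying wolf (low precision) and missing real stress (low recall).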
Four Possibilities in Model Prediction (Confusion Matrix)
Example: Good Model
Imagine the smartwatch is very reliable; that is, it usually detects stress correctly and rarely makes mistakes. The model's confusion matrix could look like the one below:
Example: Bad Model
Now, the smartwatch often gets it wrong — it predicts stress too often and misses many true stress moments. The confusion matrix could look as follows:
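The contrast between the two models can be made concrete with hypothetical confusion-matrix counts (the numbers below are invented for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Compute the three supervised metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,  # fraction of all predictions correct
        "precision": tp / (tp + fp),    # of predicted "stressed", fraction truly stressed
        "recall": tp / (tp + fn),       # of truly "stressed", fraction detected
    }

# Good model: rarely wrong in either direction.
good = metrics(tp=45, fp=3, fn=5, tn=47)
# Bad model: many false alarms (fp) and many missed stress moments (fn).
bad = metrics(tp=20, fp=25, fn=30, tn=25)
```

With these counts, the good model scores high on all three metrics, while the bad model's precision suffers from the false alarms and its recall from the missed stress moments.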
Evaluating an Unsupervised Learning-Based Model
In unsupervised learning, we do not have labels. Instead, the model discovers structure or groups in the data (for example, clustering sleep patterns or glucose cycles), so we cannot compare predictions to true labels directly. We therefore evaluate how well the discovered structure makes sense, internally and sometimes externally. It is also important to verify the discovered structures clinically (domain-driven evaluation).
External Evaluation Metrics
Domain-Driven Evaluation
Internal Evaluation Metrics
Example: Metrics to Evaluate Quality of Identified Clusters in Unsupervised Learning
Evaluating cluster quality means checking how well-separated and coherent your sleep clusters are.
Good Quality Clusters
Poor Quality Clusters
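One way to quantify the difference between good and poor clusters is the silhouette score. Below is a small pure-Python sketch on made-up 1-D sleep-duration values (the numbers and cluster assignments are invented); tight, well-separated clusters score near +1, while overlapping clusters score near 0:

```python
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette score for 1-D points (pure Python, for illustration)."""
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # a: mean distance from x to the other points in its own cluster
        a = mean(abs(x - y) for j, (y, l) in enumerate(zip(points, labels))
                 if l == lab and j != i)
        # b: mean distance from x to the nearest other cluster
        b = min(mean(abs(x - y) for y, l in zip(points, labels) if l == other)
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Two tight, well-separated clusters -> score close to +1.
good = silhouette([1.0, 1.1, 1.2, 9.0, 9.1, 9.2], [0, 0, 0, 1, 1, 1])
# Two overlapping clusters -> score close to 0.
poor = silhouette([1.0, 1.5, 2.0, 1.8, 2.3, 2.8], [0, 0, 0, 1, 1, 1])
```

In practice a library implementation such as scikit-learn's `silhouette_score` would be used; the hand-rolled version above just makes the per-point a/b computation visible.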
Complete the Key Characteristic Table for Two Main ML Algorithms
Feature    | Supervised                     | Unsupervised
Goal       | Predict outcomes               | Find hidden structures
Data       | Labeled (X, y)                 | Unlabeled (X only)
Output     | Classifications or regressions | Clusters or groups
Evaluation | Accuracy, precision, recall    | Silhouette score, visual clusters
Example    | Stress prediction              | Sleep pattern grouping
Internal Evaluation Metrics
Measure how well data points fit their assigned clusters — based on distance or compactness. These metrics tell you if your clusters are tight and well-separated.
Silhouette score: measures how close each point is to its own cluster vs. other clusters. Ranges from +1 (best) through 0 (overlapping clusters) to -1 (likely assigned to the wrong cluster).
Davies-Bouldin index: evaluates the separation between different clusters as well as their compactness. Lower values are better.
Calinski-Harabasz index: the ratio of between-cluster to within-cluster dispersion, which quantifies how distinct the clusters are. Higher values are better.
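The between/within dispersion ratio can be illustrated with a short pure-Python sketch of the Calinski-Harabasz index on invented 1-D data (real uses would call a library implementation such as scikit-learn's `calinski_harabasz_score`):

```python
from statistics import mean

def calinski_harabasz(points, labels):
    """Calinski-Harabasz index for 1-D points: between-cluster dispersion
    divided by within-cluster dispersion, each scaled by its degrees of
    freedom. Higher values indicate more distinct clusters."""
    overall = mean(points)
    clusters = {lab: [x for x, l in zip(points, labels) if l == lab]
                for lab in set(labels)}
    between = sum(len(c) * (mean(c) - overall) ** 2 for c in clusters.values())
    within = sum((x - mean(c)) ** 2 for c in clusters.values() for x in c)
    k, n = len(clusters), len(points)
    return (between / (k - 1)) / (within / (n - k))

# Tight, well-separated clusters produce a large index...
separated = calinski_harabasz([1.0, 1.1, 1.2, 9.0, 9.1, 9.2], [0, 0, 0, 1, 1, 1])
# ...while overlapping clusters produce a much smaller one.
mixed = calinski_harabasz([1.0, 1.5, 2.0, 1.8, 2.3, 2.8], [0, 0, 0, 1, 1, 1])
```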
External Evaluation Metrics
Sometimes, you have “reference” labels to compare your clusters against. These are used in validation studies, not during model training.
Adjusted Rand Index (ARI): measures the similarity between the clustering and the true labels. Example: compare the algorithm's patient clusters to known disease groups. The ideal value is 1, indicating a perfect match.
Normalized Mutual Information (NMI): measures the information shared between the clusters and the true labels. Example: compare discovered behavioral segments to survey-based groups. Values closer to 1 indicate closer agreement with the reference labels.
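The ARI can be computed by counting pairs of points that the clustering and the reference labels group the same way. A pure-Python sketch with invented labelings (library code such as scikit-learn's `adjusted_rand_score` would normally be used):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI via pair counting: 1 means identical groupings (up to renaming
    the clusters); values near 0 mean agreement no better than chance."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)

# Same grouping with swapped cluster names -> perfect agreement.
perfect = adjusted_rand_index([0, 0, 1, 1, 1], [1, 1, 0, 0, 0])
# A grouping that cuts across the reference labels scores much lower.
crossed = adjusted_rand_index([0, 0, 1, 1, 1], [0, 1, 0, 1, 0])
```

Note that ARI is invariant to cluster naming: what matters is which points end up grouped together, not which numeric label each group receives.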
Domain-Driven Metrics
In health contexts, even high Silhouette or ARI values mean little unless the clusters are meaningful clinically. Practical evaluation in mHealth often includes:
Interpretability: Can clinicians or users understand what each cluster represents?
Stability: Do clusters remain consistent when new data arrives?
Actionability: What is the practical utility of the identified clusters? Do they help tailor interventions or insights? Do they reveal something useful about health behavior or risk?
Beenish Chaudhry
Created on October 8, 2025