
Evaluation Metrics for Machine Learning Algorithms

Beenish Chaudhry

Created on October 8, 2025

Transcript

Evaluation Metrics for Machine Learning Algorithms

Why Evaluate a Supervised Learning-Based Model?

Is my smartwatch detecting my stress levels accurately?

Evaluation is about measuring how well a trained machine learning model performs when tested on new data it has never seen before. In supervised learning, we have ground truth labels (e.g., “stressed” vs. “calm”), which allows us to compute key metrics.

Key Evaluation Metrics for Supervised Learning:
Accuracy
Precision
Recall

Four Possibilities in Model Prediction (Confusion Matrix)
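The metrics above can be computed directly from the four confusion-matrix counts. A minimal Python sketch (the counts are illustrative assumptions, not taken from this unit):

```python
# The four possibilities for a stress detector:
#   TP: truly stressed, predicted stressed    FP: calm, predicted stressed
#   FN: stressed, predicted calm              TN: calm, predicted calm

def accuracy(tp, fp, fn, tn):
    # Fraction of all predictions that were correct.
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Of all "stressed" predictions, how many were truly stressed?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all truly stressed moments, how many did the model catch?
    return tp / (tp + fn)

# Illustrative counts (assumed for this sketch):
tp, fp, fn, tn = 40, 5, 10, 45
print(accuracy(tp, fp, fn, tn))  # 0.85
print(precision(tp, fp))         # ≈ 0.889
print(recall(tp, fn))            # 0.8
```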

Example: Good Model

Imagine the smartwatch is very reliable: it usually detects stress correctly and rarely makes mistakes. Its confusion matrix would be dominated by true positives and true negatives, with very few false positives and false negatives.

Example: Bad Model

Now imagine the smartwatch often gets it wrong: it predicts stress too often and misses many true stress moments. Its confusion matrix would show many false positives and false negatives.
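To make the contrast concrete, here is a sketch comparing hypothetical confusion matrices for the two watches (all counts are invented for illustration):

```python
# Compute the three metrics for a confusion matrix given as four counts.
def metrics(tp, fp, fn, tn):
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Good model: mostly correct, few mistakes (illustrative numbers).
good = metrics(tp=45, fp=3, fn=2, tn=50)

# Bad model: over-predicts stress (many FP) and misses real stress (many FN).
bad = metrics(tp=20, fp=30, fn=25, tn=25)

print(good)  # all three metrics high
print(bad)   # all three metrics noticeably lower
```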

Evaluating an Unsupervised Learning-Based Model

In unsupervised learning, we do not have labels. Instead, the model discovers structure or groups in the data (such as clusters of sleep patterns or glucose cycles), so we cannot compare predictions to true labels directly. We evaluate instead how well the discovered structure makes sense, internally and sometimes externally. It is also important to verify the discovered structures clinically (domain-driven evaluation).

External Evaluation Metrics
Domain Driven Evaluation
Internal Evaluation Metrics

Example: Metrics to Evaluate Quality of Identified Clusters in Unsupervised Learning

Evaluating cluster quality means checking how well separated and internally coherent the discovered clusters (e.g., your sleep clusters) are.

Good Quality Clusters
Poor Quality Clusters

Complete the Key Characteristic Table for Two Main ML Algorithms

Drag the words and place them in the correct column. The completed table:

Feature     | Supervised                     | Unsupervised
Data        | Labeled (X, y)                 | Unlabeled (X only)
Goal        | Predict outcomes               | Find hidden structures
Output      | Classifications or regressions | Clusters or groups
Evaluation  | Accuracy, precision, recall    | Silhouette score, visual clusters
Example     | Stress prediction              | Sleep pattern grouping

Congratulations, you have completed this activity.

Internal Evaluation Metrics

Measure how well data points fit their assigned clusters, based on distance or compactness. These metrics tell you whether your clusters are tight and well separated.

Silhouette score: clarifies how close each point is to its own cluster vs. other clusters. Ranges from +1 (best) through 0 (overlapping clusters) to -1 (point likely in the wrong cluster).

Davies-Bouldin index: evaluates separation between different clusters as well as their compactness. Lower values are better.

Calinski-Harabasz index: ratio of between-cluster to within-cluster dispersion, used to evaluate the distinction between clusters. Higher values are better.
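As a sketch of how an internal metric works, the silhouette score described above can be computed from scratch for a tiny one-dimensional example (the data points are assumed, purely for illustration):

```python
# Silhouette score, computed per point as (b - a) / max(a, b), where
#   a = mean distance to the other points in the same cluster (compactness)
#   b = mean distance to the points of the nearest other cluster (separation)

def silhouette(points, labels):
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean intra-cluster distance for this point
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = sum(same) / len(same)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == ol) / labels.count(ol)
            for ol in set(labels) - {lab}
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters -> score close to +1
points = [1.0, 1.2, 1.1, 8.0, 8.3, 8.1]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette(points, labels))  # close to +1: tight, well-separated clusters
```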

External Evaluation Metrics

Sometimes you have "reference" labels to compare your clusters against. These are used in validation studies, not during model training.

Adjusted Rand Index (ARI): measures similarity between the clustering and the reference labels. Example: comparing an algorithm's patient clusters to known disease groups. The ideal value is 1, indicating a perfect match.

Normalized Mutual Information (NMI): measures the information shared between the clustering and the reference labels. Example: comparing discovered behavioral segments to survey-based groups. Values closer to 1 indicate closer agreement.
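The cluster-to-label similarity described above is commonly quantified with the Adjusted Rand Index. A minimal from-scratch sketch using the standard pair-counting formula (the toy labelings are assumptions for illustration; it presumes at least two distinct clusters):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    n = len(labels_a)
    # Contingency counts: points falling in each (cluster_a, cluster_b) cell.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    # Chance-corrected agreement: subtract the expected index, normalize by the max.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Same partition under different label names -> perfect match
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], ["a", "a", "b", "b", "c", "c"]))  # 1.0
```

Note that ARI is invariant to how clusters are named; only the grouping matters.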

Domain-Driven Metrics

In health contexts, even high silhouette or ARI values mean little unless the clusters are clinically meaningful. Practical evaluation in mHealth often includes:

Interpretability

Can clinicians or users understand what each cluster represents?

Stability

Do clusters remain consistent when new data arrives?

Actionability

What is the practical utility of the identified clusters? Do they help tailor interventions or insights? Do they reveal something useful about health behavior or risk?