Data Science 101

Luiz Gabriel Bongiol

Created on March 28, 2025

Data Science 101

History, Methods & Applications

Overview

01

WHO AM I?

02

LUIZ GABRIEL BONGIOLO

Popularization was fueled by several factors, including advancements in big data technologies, greater computational power (such as GPUs), and the open-source movement, which led to the development and enhancement of numerous machine learning frameworks.

Supervised Learning

  • Data Cleaning: Ensuring data quality and relevance.
  • Normalization: Standardizing data features.
  • Categorization: Converting categories to numerical values.
  • Pick the model based on the problem

Set Target & Split Data: Define the target variable (binary or multi-class) and divide the dataset into training, validation, and test sets.
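The split step can be sketched in plain Python as follows. This is a minimal illustration, not the presenter's code: the function name `train_val_test_split`, the 70/15/15 proportions, the seed, and the toy dataset are all assumptions made for the example. In practice a library such as scikit-learn is commonly used for this.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows and cut them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical toy dataset: (feature, target) pairs with a binary target.
dataset = [(x, x % 2) for x in range(100)]
train, val, test = train_val_test_split(dataset)
print(len(train), len(val), len(test))
```

Shuffling before cutting keeps the three sets from inheriting any ordering present in the raw data; the fixed seed only makes the example repeatable.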

WHAT I ACTUALLY DO

03

Not plug and play

WHAT YOU THINK YOU WILL DO

WHAT YOU ACTUALLY DO

BUILDING MODELS

CLEANING DATA

SQL

BUILDING DASHBOARDS

Performance

Accuracy

We've trained 1,000 models by resampling the data, and the results are:

83%

The model is robust on unseen data.
Accuracy for not churn: 80%
Accuracy for churn: 85%
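Overall and per-class accuracy figures of this kind can be computed as in the sketch below. The labels are hypothetical toy values, not the churn data behind the numbers above, and `per_class_accuracy` is a name made up for the example.

```python
def per_class_accuracy(y_true, y_pred):
    """Return overall accuracy and the accuracy within each true class."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    by_class = {}
    for label in sorted(set(y_true)):
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        by_class[label] = sum(p == label for p in preds) / len(preds)
    return overall, by_class

# Hypothetical labels: 1 = churn, 0 = not churn.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
overall, by_class = per_class_accuracy(y_true, y_pred)
print(overall, by_class)
```

Reporting accuracy per class matters when classes are imbalanced, since a high overall number can hide poor performance on the rarer class.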

Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

Probabilistic models

03

These models explicitly use probability distributions to make predictions.

Naïve Bayes Classifier

Inbox

Probabilistic classifier based on Bayes' Theorem, assuming that features are independent given the class label

Mail

Spam?

Type:

  • Supervised Machine Learning
  • Classifier
Use Cases:
  • Spam detection in email.
  • Sentiment analysis in social media.
  • Document classification
  • Disease prediction (diabetes)

Models

P(A | B): probability of A given that B happened
P(A): probability of A happening regardless of B
We need to know P(A) and P(B): the probability of each event happening regardless of the other.
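For reference, these quantities combine in Bayes' Theorem as written below in LaTeX. The term P(B | A), the probability of B given that A happened, is not listed on the slide but is part of the standard statement.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

In the spam example, A could be "the email is spam" and B "the email contains a given word".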

Regression models

05

Leveraging Relationships in Data for Continuous Predictions

Models

Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

DBSCAN: Density-Based Clustering

K-Means: Partitioning Data into K Distinct Groups

K-NN: Finding the Nearest Neighbors

Tree models

07

Structured Decision Making for Complex Data

XGBoost

Random Forest

Decision Trees

Isolation Forest

Neural Networks

06

Deep Learning for Advanced Pattern Recognition

What is a Neural Network?

Challenges
+ Models

Model Selection

10

Strategies for Optimal Algorithm Choice

What is the Target?

Resources

10

RESOURCES

www.kaggle.com

https://huggingface.co/

https://medium.com/

11

THANK YOU

Deep Learning

When neural networks have multiple hidden layers, they are often referred to as deep neural networks, allowing them to model complex patterns and interactions within the data through deep learning.

Other types of Regression Methods

Isolation Forests

Isolation Forests are an anomaly detection algorithm that isolates outliers instead of profiling normal data points. The algorithm uses a forest of trees to partition the data and identifies anomalies by how easily samples can be isolated, detecting deviations with less susceptibility to overfitting.

  • Fraud Detection
  • Network Security
  • Health Monitoring Systems
  • Quality Control in Manufacturing
DBSCAN

Unlike simpler clustering methods such as K-Means, which group points by proximity to a cluster center, DBSCAN identifies clusters by density: it considers not only the distances between points but also how many neighboring points surround each one. This lets DBSCAN discover clusters of varied shapes and sizes, making it effective for complex data sets.
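A minimal sketch of the density idea, in plain Python. The parameter names `eps` (neighborhood radius) and `min_pts` (points needed for a dense region) follow common DBSCAN terminology but are assumptions here, as are the toy points. A point with too few neighbors is labeled -1 (noise).

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)       # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:        # not dense enough to start a cluster
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:         # noise reachable from a core point
                labels[j] = cluster     # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:    # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is not specified in advance; it falls out of the density parameters.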

K-Nearest Neighbors

The K-Nearest Neighbors (K-NN) algorithm classifies new cases based on a similarity measure (usually a distance function). It selects the K closest data points in the feature space and predicts the label by majority vote among these neighbors.

  • Customer Segmentation.
  • Recommendation Systems.
  • Fraud Detection.
  • Image Recognition.
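The majority-vote step described above can be sketched as follows. This is an illustrative version using Euclidean distance; the function name `knn_predict` and the toy points are assumptions for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label). Return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points in a 2-D feature space.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
near_a = knn_predict(train, (2, 2))
near_b = knn_predict(train, (7, 9))
print(near_a, near_b)
```

There is no training phase beyond storing the labeled points; all the work happens at prediction time.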

Model Types

Gini

Decision trees are a supervised learning algorithm used for classification and regression. They split the data into nodes based on certain criteria, forming a tree structure with branches leading to outcomes. Each node tests an attribute, guiding decisions down to the final leaves where predictions are made. The process stops when criteria such as maximum depth or minimum node size are met.

  • Credit Scoring
  • Medical Diagnosis
  • Customer Segmentation
  • Inventory Management
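One common splitting criterion is Gini impurity, which the slide label above refers to. The sketch below shows how a single node might pick a threshold on one numeric feature. The helper names and the credit-scoring toy data are hypothetical, and a real tree repeats this search over every feature and every node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        t = (a + b) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: income (in thousands) and whether a loan was repaid.
income = [20, 25, 30, 60, 65, 70]
repaid = ["no", "no", "no", "yes", "yes", "yes"]
root_impurity = gini(repaid)
split = best_threshold(income, repaid)
print(root_impurity, split)
```

A split that drives the weighted impurity to zero separates the classes perfectly, so the two child nodes become leaves.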

Applications


Network Intrusion Dataset

Customer Churn

Boosting employs a "wisdom of the crowds" technique with a twist: it assigns weights to individual models. In this method, each decision tree prioritizes previously misclassified data points, adjusting their weights in the next iteration. This iterative refining enhances accuracy but requires careful management to avoid overfitting. By aggregating the strength of multiple weak predictors, boosting creates a highly robust and accurate model, proving invaluable in predictive analytics.
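The reweighting loop described above can be sketched with an AdaBoost-style example on one-dimensional data. This is an assumption-laden illustration: it uses decision stumps (one-threshold trees) and the classic AdaBoost weight update, which is not the gradient-boosting procedure that libraries such as XGBoost implement. All names and the toy data are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Pick the threshold/polarity with the lowest weighted error. ys in {-1, +1}."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= t else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1 / n] * n
    model = []                                   # (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        model.append((alpha, t, polarity))
        # Raise the weight of misclassified points, lower the rest.
        preds = [polarity if x <= t else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    score = sum(alpha * (pol if x <= t else -pol) for alpha, t, pol in model)
    return 1 if score >= 0 else -1

# Hypothetical data that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
print(predictions)
```

Each stump alone misclassifies at least two of the six points; the weighted vote of three stumps fits all of them, which is the "weak predictors combined" idea in miniature.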

Restaurant Revenue

What Sets Neural Networks Apart from Other Machine Learning Models?

Neural Networks differ from other ML models in their capacity to automatically learn and model complex, non-linear relationships, without the need for manual feature engineering or reliance on linear assumptions.

Random Forest

Random Forests aggregate multiple decision trees to improve predictive accuracy and control over-fitting. By building numerous trees and averaging their predictions, Random Forests ensure robustness and reduce variance, making them effective for a wide range of tasks.

  • Predictive Maintenance
  • Biomedical Applications
  • Stock Market Analysis
  • E-commerce Personalization
SELL YOURSELF

How Spam Detection Works Using Naïve Bayes

Training Phase
  • Collect a dataset of emails labeled as Spam or Not Spam.
  • Extract features from emails (e.g., individual words, word frequencies).
  • Calculate probabilities of each word appearing in Spam vs. Not Spam emails.
Prediction Phase (Classifying a New Email)
  • Extract words from the new email.
  • Use Bayes' Theorem to compute the probability that the email is Spam or Not Spam based on the words.
  • Assume feature independence: Treat each word as contributing independently to the final probability.
  • Assign the label (Spam or Not Spam) based on the highest probability.
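The training and prediction phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the label "ham" stands for Not Spam, words are split on whitespace, and add-one (Laplace) smoothing is added so that unseen words do not zero out a probability; none of these details come from the slides. The four emails are made up.

```python
import math
from collections import Counter

def train_nb(emails):
    """emails: list of (text, label). Return word counts, class counts, vocabulary."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in emails:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)       # log prior
        for word in text.lower().split():
            # Independence assumption: each word adds its own log-likelihood.
            score += math.log((word_counts[label][word] + 1)
                              / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training emails.
emails = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_nb(emails)
first = classify("free money", *model)
second = classify("monday meeting", *model)
print(first, second)
```

Working with log-probabilities turns the product of many small word probabilities into a sum, which avoids numerical underflow on long emails.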

Challenges

Building deep neural networks presents significant challenges, requiring meticulous design and tuning of numerous layers and parameters to optimize performance.

K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm that sorts data into a specified number (K) of distinct clusters based on similarity. It is useful for identifying patterns and insights by grouping similar data points together.

  • Market Segmentation
  • Document Clustering
  • Image Segmentation
  • Anomaly Detection
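A minimal sketch of the grouping procedure, using the usual assign-then-average loop (Lloyd's algorithm). The starting centers are passed in by hand to keep the example deterministic, which is an assumption; implementations typically choose them randomly. The points are toy values.

```python
import math

def kmeans(points, centers, iterations=10):
    """Assign each point to its nearest center, then move centers to cluster means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(old)   # keep an empty cluster's center in place
        if new_centers == centers:        # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points forming two groups; K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Because no labels are involved, the result depends only on the geometry of the data and on K, which must be chosen up front.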

Learning Process

Neural networks learn by adjusting their connection weights based on prediction errors during the training phase, using algorithms such as backpropagation combined with an optimization technique like gradient descent.
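The weight-adjustment idea can be illustrated with the smallest possible case: a single sigmoid neuron trained by gradient descent. With only one layer, backpropagation reduces to one application of the chain rule; deeper networks repeat the same step layer by layer. The learning rate, epoch count, and the choice of the logical OR function as training data are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_neuron(samples, lr=0.5, epochs=2000):
    """Gradient descent on one sigmoid neuron with cross-entropy loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            error = pred - y               # gradient of the loss w.r.t. the pre-activation
            w[0] -= lr * error * x[0]      # chain rule: d(pre-activation)/d(w_i) = x_i
            w[1] -= lr * error * x[1]
            b -= lr * error
    return w, b

# Hypothetical training data: the logical OR of two inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(samples)
outputs = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in samples]
print(outputs)
```

Each update nudges the weights in the direction that reduces the error on the current sample, which is the same principle that scales up to networks with many layers.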