
Data Science 101

Luiz Gabriel Bongiol

Created on March 28, 2025


Data Science 101

History, Methods & Applications

Start

Overview

01

02

LUIZ GABRIEL BONGIOLO

Its popularization was fueled by several factors, including advances in big data technologies, greater computational power (such as GPUs), and the open-source movement, which led to the development and enhancement of numerous machine learning frameworks.

Supervised Learning

Data Cleaning: Ensuring data quality and relevance.

Normalization: Standardizing data features.

Categorization: Converting categories to numerical values.

Pick the model based on the problem

Set Target & Split Data: Define the target variable (binary or multi-class) and divide the dataset into training, testing, and validation sets.
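The preparation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's pipeline: the toy columns (spend, plan, churn), the min-max scaling choice, the one-hot encoding, and the 70/15/15 split ratios are all assumptions made for the example.

```python
import random

def min_max_normalize(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Convert category labels to one-hot vectors (columns in sorted order)."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

def split_rows(rows, train=0.7, validation=0.15, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical toy data: monthly spend, plan type, and a binary churn target.
spend = [20.0, 50.0, 80.0, 35.0, 65.0, 95.0, 10.0, 40.0, 70.0, 100.0]
plan = ["basic", "pro", "pro", "basic", "pro", "basic", "basic", "pro", "pro", "basic"]
churn = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Normalization + categorization produce one numeric feature vector per row.
features = [[s] + p for s, p in zip(min_max_normalize(spend), one_hot(plan))]
rows = list(zip(features, churn))
train_set, val_set, test_set = split_rows(rows)
print(len(train_set), len(val_set), len(test_set))
```

The model is fit on the training set, tuned on the validation set, and evaluated once on the test set, so the test score reflects data the model has not seen.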


WHAT I ACTUALLY DO

03

Not plug and play

WHAT YOU ACTUALLY DO

BUILDING MODELS

CLEANING DATA

SQL

BUILDING DASHBOARDS

Performance

Accuracy

We've trained 1,000 models by resampling the data, and the results are:

83%

The model is robust on unseen data.
Accuracy for not churn: 80%
Accuracy for churn: 85%


Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

03

These models explicitly use probability distributions to make predictions.

Inbox

Probabilistic classifier based on Bayes' Theorem, assuming that features are independent given the class label

Mail

Spam?

Type:

Supervised Machine Learning
Classifier

Use Cases:

Spam detection in email.
Sentiment analysis in social media.
Document classification.
Disease prediction (diabetes).


Models

P(A | B): probability of A given that B happened.
P(A): probability of A happening regardless of B.
We need to know P(A) and P(B), the probability of each event happening regardless of the other.
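These quantities combine in Bayes' Theorem, written out below. The worked line uses made-up numbers purely for illustration: it assumes 20% of emails are spam, the word "free" appears in 50% of spam emails, and "free" appears in 14% of all emails.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Hypothetical example: A = "email is spam", B = "email contains the word free"
P(\text{spam} \mid \text{free}) = \frac{0.5 \times 0.2}{0.14} \approx 0.71
```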

Regression models

05

Leveraging Relationships in Data for Continuous Predictions


Models

Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

K-Means Partitioning Data into K Distinct Groups

K-NN Finding the nearest Neighbors


Tree models

07

Structured Decision Making for Complex Data

Random Forest

Decision Trees

Isolation Forest

Neural Networks

06

Deep Learning for Advanced Pattern Recognition

Challenges

+ Models

Neural Networks differ from other ML models in their capacity to automatically learn and model complex, non-linear relationships, without the need for manual feature engineering or reliance on linear assumptions.

Model Selection

10

Strategies for Optimal Algorithm Choice

What is the Target?


Resources

10

RESOURCES

www.kaggle.com

https://huggingface.co/

https://medium.com/

11 THANK YOU

Deep Learning

When neural networks have multiple hidden layers, they are often referred to as deep neural networks, allowing them to model complex patterns and interactions within the data through deep learning.

Other types of Regression Methods

Isolation Forests

Isolation Forest is an anomaly detection algorithm that isolates outliers instead of profiling normal data points. It uses a forest of trees to partition the data and identifies anomalies by how easily a sample can be isolated: outliers need fewer random splits than normal points. This makes it effective at detecting deviations while being less susceptible to overfitting.
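The isolation idea can be sketched as follows. This is a simplified illustration under stated assumptions, not a full Isolation Forest: it skips the subsampling and score normalization of the published algorithm, and the helper names, the toy grid of points, and the parameters (200 trees, depth cap of 10) are all hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth                           # cannot split on this feature
    split = rng.uniform(lo, hi)                # pick a random split value
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees; lower means more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data: a tight grid of normal points plus one obvious outlier.
normal_points = [(x / 4, y / 4) for x in range(5) for y in range(4)]
outlier = (10.0, 10.0)
data = normal_points + [outlier]

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth((0.5, 0.5), data)
print(outlier_depth, inlier_depth)
```

The outlier is usually cut off by the very first split, while a point inside the grid needs several splits, which is the signal the algorithm turns into an anomaly score.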

Fraud Detection
Network Security
Health Monitoring Systems
Quality Control in Manufacturing

DBSCAN

Unlike simpler clustering methods such as K-Means, which group points by proximity to a cluster center, DBSCAN identifies clusters by the density of the surrounding data points: regions where many points lie close together form clusters, and points in sparse regions are treated as noise. This lets DBSCAN discover clusters of varied shapes and sizes, making it effective for complex data sets.
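A compact sketch of the density-based procedure is shown below. It follows the standard DBSCAN outline (core points, expansion, noise), but the function name, the toy points, and the parameter values eps=1.5 and min_pts=3 are assumptions chosen for illustration; a brute-force neighbor search is used for clarity rather than speed.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # provisional noise
            continue
        cluster += 1                        # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also a core point: keep expanding
    return labels

# Toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is not specified up front; it falls out of the density parameters.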

K-Nearest Neighbors

The K-Nearest Neighbors (K-NN) algorithm classifies new cases based on a similarity measure (usually a distance function). It selects the K closest data points in the feature space and predicts the label by majority vote of these neighbors.
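A minimal sketch of this voting rule, assuming Euclidean distance and K = 3; the function name and the toy points with labels "A" and "B" are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),   # Euclidean distance to the query
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points in a two-dimensional feature space.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2, 2)))  # A
print(knn_predict(train_points, train_labels, (6, 5)))  # B
```

There is no training step beyond storing the data, which is why features are usually normalized first: a feature on a large scale would otherwise dominate the distance.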

Customer Segmentation.
Recommendation Systems.
Fraud Detection.
Image Recognition.

Model Types


Gini

Decision trees are a supervised learning algorithm used for classification and regression. They split the data into nodes based on certain criteria, forming a tree structure with branches leading to outcomes. Each node tests an attribute, guiding decisions down to the final leaves where predictions are made. The process stops when criteria such as maximum depth or minimum node size are met.
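One common splitting criterion is Gini impurity, which the slide title refers to. The sketch below computes it and searches one numeric feature for the threshold with the lowest weighted impurity. The helper names and the toy income/approval data are assumptions for illustration; a real tree repeats this search recursively over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one numeric feature that minimizes weighted Gini impurity."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical credit-scoring data: income (in thousands) and loan approval.
income = [20, 25, 30, 60, 70, 80]
approved = ["no", "no", "no", "yes", "yes", "yes"]

print(gini(approved))                # 0.5
print(best_split(income, approved))  # (30, 0.0)
```

An impurity of 0.5 means the node is an even mix of two classes; splitting at income <= 30 yields two pure child nodes, so the weighted impurity drops to 0.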

Credit Scoring
Medical Diagnosis

Customer Segmentation
Inventory Management

Applications

K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm that sorts data into a specified number (K) of distinct clusters based on similarity. It is useful for identifying patterns and insights by grouping similar data points together.
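The grouping can be sketched with the usual assign-then-update loop (Lloyd's algorithm). In this illustration the function name, the toy points, and the fixed starting centroids are assumptions; in practice the starting centroids are usually chosen at random.

```python
import math

def k_means(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = mean of the assigned points (unchanged if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Toy data with two visible groups; K = 2 with fixed starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Because K must be chosen in advance, it is common to run the algorithm for several values of K and compare the results.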


Network Intrusion Dataset

Customer Churn


Boosting employs a "wisdom of the crowds" technique with a twist: it assigns weights to individual models. In this method, each decision tree prioritizes previously misclassified data points, adjusting their weights in the next iteration. This iterative refining enhances accuracy but requires careful management to avoid overfitting. By aggregating the strength of multiple weak predictors, boosting creates a highly robust and accurate model, proving invaluable in predictive analytics.
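The reweighting step can be illustrated with one round of AdaBoost, a classic boosting algorithm that matches the description above. This is an assumption about which variant is meant: gradient-boosting libraries such as XGBoost fit each new tree to the errors of the previous ones rather than reweighting samples. The function name and the five-sample example are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the model's vote weight and the updated sample weights."""
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)      # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - error) / error)    # more accurate models get a larger vote
    # Increase weights of misclassified samples, decrease the rest, then renormalize.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five samples with equal starting weights; the weak model gets the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round the single misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.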

Restaurant Revenue

What Sets Neural Networks Apart from Other Machine Learning Models?


Random Forest

Random Forests aggregate multiple decision trees to improve predictive accuracy and control over-fitting. By building numerous trees and averaging their predictions, Random Forests ensure robustness and reduce variance, making them effective for a wide range of tasks.
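The aggregation idea can be sketched with very small trees. This is a simplified illustration, not a production Random Forest: each "tree" here is a one-split stump that thresholds a randomly chosen feature at its mean, and the helper names, toy data, and the choice of 25 trees are assumptions. The two ingredients it does show are bootstrap sampling and majority voting.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, feature):
    """One-level tree: split one feature at its mean and predict the majority label per side."""
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    fallback = majority([y for _, y in rows])
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)

def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample using a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]      # sample with replacement
        feature = rng.randrange(len(rows[0][0]))       # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest

def predict(forest, x):
    """Aggregate the trees by majority vote."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Hypothetical two-feature data with two classes.
rows = [((1, 2), "low"), ((2, 1), "low"), ((2, 3), "low"), ((3, 2), "low"),
        ((8, 9), "high"), ((9, 8), "high"), ((8, 7), "high"), ((7, 8), "high")]
forest = fit_forest(rows)
print(predict(forest, (1, 1)), predict(forest, (9, 9)))
```

Individual stumps trained on unlucky samples can be wrong, but the vote across many of them is much more stable, which is the variance reduction described above. For regression, the vote is replaced by an average of the trees' predictions.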

Predictive Maintenance
Biomedical Applications
Stock Market Analysis
E-commerce Personalization

SELL YOURSELF

DBSCAN

How Spam Detection Works Using Naïve Bayes

Training Phase

Collect a dataset of emails labeled as Spam or Not Spam.
Extract features from emails (e.g., individual words, word frequencies).
Calculate probabilities of each word appearing in Spam vs. Not Spam emails.

Prediction Phase (Classifying a New Email)

Extract words from the new email.
Use Bayes' Theorem to compute the probability that the email is Spam or Not Spam based on the words.
Assume feature independence: Treat each word as contributing independently to the final probability.
Assign the label (Spam or Not Spam) based on the highest probability.
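The training and prediction phases above can be sketched as a small word-count classifier. The function names and the four example emails are invented for illustration. Two implementation details are assumptions not stated in the slides: probabilities are added in log space to avoid numeric underflow, and add-one (Laplace) smoothing is used so a word unseen in one class does not force a zero probability.

```python
import math
from collections import Counter

def train_naive_bayes(emails):
    """Training phase: count emails per class and words per class."""
    class_counts = Counter(label for _, label in emails)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in emails:
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Prediction phase: pick the class with the highest posterior score."""
    class_counts, word_counts, vocabulary = model
    total_emails = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_emails)      # log P(class)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue                                          # ignore words never seen in training
            # log P(word | class) with add-one smoothing; words are treated as independent.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled training emails.
emails = [
    ("win free money now", "Spam"),
    ("free prize claim now", "Spam"),
    ("meeting agenda for monday", "Not Spam"),
    ("lunch on monday", "Not Spam"),
]
model = train_naive_bayes(emails)
print(classify("free money", model))      # Spam
print(classify("monday meeting", model))  # Not Spam
```

Summing the per-word log probabilities is exactly the independence assumption: each word contributes to the score without regard to the other words in the email.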

Challenges

Building deep neural networks presents significant challenges, requiring meticulous design and tuning of numerous layers and parameters to optimize performance.

K-Means Clustering

Market Segmentation
Document Clustering
Image Segmentation
Anomaly Detection


Learning Process

Neural networks learn by adjusting their connection weights based on the errors in their predictions during the training phase, using algorithms such as backpropagation combined with an optimization technique like gradient descent.
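The weight-adjustment loop can be shown on the smallest possible network: a single sigmoid neuron trained with gradient descent. This is a sketch under stated assumptions, with a hypothetical OR-gate data set, a learning rate of 0.5, and cross-entropy loss. With one neuron, backpropagation reduces to a single chain-rule step; in a multi-layer network the same error signal is passed backward through each layer in turn.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Forward pass: weighted sum of the inputs followed by a sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, learning_rate=0.5, epochs=2000):
    """Adjust weights after each sample in the direction that reduces the prediction error."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Gradient of the cross-entropy loss with respect to the pre-activation value.
            error = predict(weights, bias, inputs) - target
            # Gradient descent step: move each weight against its gradient.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical training data: the logical OR of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
for inputs, target in samples:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

Each pass nudges the weights so that the prediction moves closer to the target, and repeated passes drive the error down.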


Transcript

Data Science 101

History, Methods & Applications

Overview

01

WHO AM I?

02

LUIZ GABRIEL BONGIOLO

Supervised Learning

WHAT I ACTUALLY DO

03

Not plug and play

WHAT YOU THINK YOU WILL DO

WHAT YOU ACTUALLY DO

BUILDING MODELS

CLEANING DATA

SQL

BUILDING DASHBOARDS

Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

Probabilistic models

03

These models explicitly use probability distributions to make predictions.

Naïve Bayes Classifier

Inbox

Mail

Spam?

Regression models

05

Leveraging Relationships in Data for Continuous Predictions

Classification & Clustering Models

06

Categorizing Data and Discovering Patterns

DBSCAN Density Based Clustering

K-Means Partitioning Data into K Distinct Groups

K-NN Finding the nearest Neighbors

Tree models

07

Structured Decision Making for Complex Data

XGBoost

Random Forest

Decision Trees

Isolation Forest

Neural Networks

06

Deep Learning for Advanced Pattern Recognition

What is a Neural Network?

Challenges

+ Models

Model Selection

10

Strategies for Optimal Algorithm Choice

What is the Target?

Resources

10

RESOURCES

11

THANK YOU

Deep Learning

Other types of Regression Methods

Isolation Forests

DBSCAN

K-Nearest Neighbors