Data Science 101
Luiz Gabriel Bongiol
Created on March 28, 2025
Transcript
Data Science 101
History, Methods & Applications
Overview
01
WHO AM I?
02
LUIZ GABRIEL BONGIOLO
Popularization was fueled by several factors, including advancements in big data technologies, greater computational power (such as GPUs), and the open-source movement, which led to the development and enhancement of numerous machine learning frameworks.
Supervised Learning
- Data Cleaning: Ensuring data quality and relevance.
- Normalization: Standardizing data features.
- Categorization: Converting categories to numerical values.
- Pick the model based on the problem
- Set Target & Split Data: Define the target variable (binary or multi-class) and divide the dataset into training, validation, and testing sets.
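The preparation steps above can be sketched with scikit-learn as follows. This is a minimal sketch, not the presenter's pipeline: the toy columns (monthly_spend, plan), the random target, and the 60/20/20 split are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric feature, one categorical feature, binary target.
rng = np.random.default_rng(0)
n = 100
monthly_spend = rng.normal(50.0, 15.0, size=(n, 1))
plan = rng.choice(["basic", "plus", "pro"], size=(n, 1))
y = rng.integers(0, 2, size=n)

# Split row indices: 60% training, 20% validation, 20% testing (assumed ratios).
idx = np.arange(n)
idx_train, idx_rest = train_test_split(
    idx, test_size=0.4, random_state=0, stratify=y
)
idx_val, idx_test = train_test_split(
    idx_rest, test_size=0.5, random_state=0, stratify=y[idx_rest]
)

# Fit the transformers on the training rows only, to avoid leaking information.
scaler = StandardScaler().fit(monthly_spend[idx_train])           # normalization
encoder = OneHotEncoder(handle_unknown="ignore").fit(plan[idx_train])  # categories -> numbers

def transform(rows):
    """Return the scaled numeric column next to the one-hot encoded category."""
    numeric = scaler.transform(monthly_spend[rows])
    categorical = encoder.transform(plan[rows]).toarray()
    return np.hstack([numeric, categorical])

X_train, X_val, X_test = transform(idx_train), transform(idx_val), transform(idx_test)
print(X_train.shape, X_val.shape, X_test.shape)
```

Fitting the scaler and encoder on the training rows alone is a design choice: it keeps the validation and testing sets untouched until evaluation.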
WHAT I ACTUALLY DO
03
Not plug and play
WHAT YOU THINK YOU WILL DO
WHAT YOU ACTUALLY DO
BUILDING MODELS
CLEANING DATA
SQL
BUILDING DASHBOARDS
Performance
Accuracy
We've trained 1,000 models by resampling the data, and the results are:
83%
The model is robust on unseen data.
Accuracy for not churn: 80%
Accuracy for churn: 85%
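One way to obtain figures of this kind is to retrain on many random resamples of the data and average the scores. The sketch below is an assumption about the general approach, not the presenter's code: it uses synthetic data, a logistic regression model, and 50 rounds instead of 1,000 to keep it fast. Per-class accuracy is computed here as per-class recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset (class 0 = not churn, class 1 = churn).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

n_rounds = 50  # the slide reports 1,000 rounds; fewer here for speed
overall, per_class = [], []
for seed in range(n_rounds):
    # Each round draws a different random train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    overall.append(accuracy_score(y_te, pred))
    # average=None returns one recall value per class.
    per_class.append(recall_score(y_te, pred, average=None))

mean_accuracy = float(np.mean(overall))
mean_per_class = np.mean(per_class, axis=0)
print(f"overall accuracy: {mean_accuracy:.2f}")
print(f"not churn: {mean_per_class[0]:.2f}, churn: {mean_per_class[1]:.2f}")
```

A small spread of scores across rounds is what supports a claim of robustness on unseen data.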
Probabilistic models
03
These models explicitly use probability distributions to make predictions.
Naïve Bayes Classifier
Inbox
Probabilistic classifier based on Bayes' Theorem, assuming that features are independent given the class label
Spam?
Type:
- Supervised Machine Learning
- Classifier
- Spam detection in email.
- Sentiment analysis in social media.
- Document classification
- Disease prediction (diabetes)
Models
P(A | B): probability of A given that B happened.
P(A): probability of A happening regardless of B.
We need to know P(A) and P(B): the probability of each event happening regardless of the other.
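These quantities combine through Bayes' Theorem, P(A | B) = P(B | A) * P(A) / P(B), where P(B | A) is the probability of B given that A happened (a term the slide does not list). The worked example below uses made-up numbers, with A as "the email is spam" and B as "the email contains a given word".

```python
# Hypothetical probabilities, chosen only for illustration.
p_spam = 0.20             # P(A): an email is spam, regardless of its words
p_word_given_spam = 0.60  # P(B | A): the word appears in a spam email
p_word_given_ham = 0.05   # P(B | not A): the word appears in a non-spam email

# P(B): the word appears in any email (law of total probability).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 2))
```

With these numbers, seeing the word raises the probability of spam from the 20% prior to 75%.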
Regression models
05
Leveraging Relationships in Data for Continuous Predictions
Models
Classification & Clustering Models
06
Categorizing Data and Discovering Patterns
DBSCAN: Density-Based Clustering
K-Means: Partitioning Data into K Distinct Groups
K-NN: Finding the Nearest Neighbors
Tree models
07
Structured Decision Making for Complex Data
XGBoost
Random Forest
Decision Trees
Isolation Forest
Neural Networks
06
Deep Learning for Advanced Pattern Recognition
What is a Neural Network?
Challenges
+ Models
Model Selection
10
Strategies for Optimal Algorithm Choice
What is the Target?
Resources
10
RESOURCES
www.kaggle.com
https://huggingface.co/
https://medium.com/
11
THANK YOU
Deep Learning
When neural networks have multiple hidden layers, they are often referred to as deep neural networks, allowing them to model complex patterns and interactions within the data through deep learning.
Other types of Regression Methods
Isolation Forests
Isolation Forests are an anomaly detection algorithm that isolates outliers instead of profiling normal data points. The algorithm uses a forest of trees to partition the data and identifies anomalies by how easily samples can be isolated, detecting deviations with less susceptibility to overfitting.
- Fraud Detection
- Network Security
- Health Monitoring Systems
- Quality Control in Manufacturing
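A minimal sketch of the idea with scikit-learn's IsolationForest is shown below. The data is synthetic, and the three planted outliers and the contamination value of 0.02 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 "normal" points near the origin plus 3 planted outliers.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -10.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies in the data.
forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
forest.fit(X)

labels = forest.predict(X)        # +1 = inlier, -1 = anomaly
scores = forest.score_samples(X)  # lower score = easier to isolate

print("flagged as anomalies:", int((labels == -1).sum()))
print("scores of planted outliers:", np.round(scores[-3:], 2))
```

Points far from the bulk of the data need fewer random splits to be separated, which is why they receive the lowest scores.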
DBSCAN
Unlike simpler clustering methods such as K-means, which primarily group points based on proximity, DBSCAN explores more intricate relationships. It identifies clusters not only by examining the straightforward distances between points but also by considering the density of the surrounding data points and their interactions. This enables DBSCAN to effectively discover varied shapes and sizes of clusters, making it highly effective for complex data sets.
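The contrast with K-Means can be illustrated on two interleaved half-moon shapes. This is a sketch under stated assumptions: the dataset is synthetic, and the eps and min_samples values were picked by hand for this data rather than taken from the presentation.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters that a centroid-based method cannot separate well.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# DBSCAN grows clusters through dense neighborhoods of radius eps.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means assigns each point to the nearest of K centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
db_ari = adjusted_rand_score(true_labels, db_labels)
km_ari = adjusted_rand_score(true_labels, km_labels)
print(f"DBSCAN ARI:  {db_ari:.2f}")
print(f"K-Means ARI: {km_ari:.2f}")
```

DBSCAN labels any point it cannot attach to a dense region as noise (label -1), so it also needs no cluster count up front.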
K-Nearest Neighbors
The K-Nearest Neighbors (K-NN) algorithm classifies new cases based on a similarity measure (usually a distance function). It selects the 'K' closest data points in the feature space and predicts the label by majority vote among these neighbors.
- Customer Segmentation.
- Recommendation Systems.
- Fraud Detection.
- Image Recognition.
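The majority-vote rule described above fits in a few lines of NumPy. The function name knn_predict and the toy points are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labeled "A" and "B".
X_train = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
)
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))
```

There is no training step: the algorithm stores the data and does all its work at prediction time.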
Model Types
Gini
Decision trees are a supervised learning algorithm used for classification and regression. They split the data into nodes based on certain criteria, forming a tree structure with branches leading to outcomes. Each node tests an attribute, guiding decisions down to the final leaves where predictions are made. The process stops when criteria such as maximum depth or minimum node size are met.
- Credit Scoring
- Medical Diagnosis
- Customer Segmentation
- Inventory Management
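The sketch below shows one common splitting criterion, Gini impurity, together with the two stopping rules mentioned above (maximum depth and minimum node size). The helper gini_impurity is hypothetical, and the Iris dataset and parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - float(np.sum(proportions ** 2))

print(gini_impurity([0, 0, 1, 1]))  # evenly mixed node
print(gini_impurity([1, 1, 1]))     # pure node

# A tree that splits on Gini impurity and stops at depth 3
# or when a leaf would hold fewer than 5 samples.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

At each node the tree picks the attribute test that reduces impurity the most, so purer child nodes mean more confident predictions at the leaves.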
Applications
Network Intrusion Dataset
Customer Churn
Boosting employs a "wisdom of the crowds" technique with a twist: it assigns weights to individual models. In this method, each decision tree prioritizes previously misclassified data points, adjusting their weights in the next iteration. This iterative refining enhances accuracy but requires careful management to avoid overfitting. By aggregating the strength of multiple weak predictors, boosting creates a highly robust and accurate model, proving invaluable in predictive analytics.
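As a rough illustration, the sketch below compares a single one-split tree with a boosted ensemble of such trees. AdaBoost is used here as one boosting variant that reweights misclassified points between rounds; the presentation does not say which variant it has in mind, and the dataset and the number of rounds are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A single weak predictor: a decision tree with only one split (a "stump").
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

# AdaBoost fits weak learners in sequence (stumps by default); after each round
# it raises the weights of the samples the previous learners got wrong.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

stump_acc = stump.score(X_te, y_te)
boosted_acc = boosted.score(X_te, y_te)
print(f"single stump: {stump_acc:.2f}")
print(f"boosted:      {boosted_acc:.2f}")
```

Comparing scores on held-out data, as done here, is one simple check against the overfitting risk mentioned above.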
Restaurant Revenue
What Sets Neural Networks Apart from Other Machine Learning Models?
Neural Networks differ from other ML models in their capacity to automatically learn and model complex, non-linear relationships, without the need for manual feature engineering or reliance on linear assumptions.
Random Forest
Random Forests aggregate multiple decision trees to improve predictive accuracy and control overfitting. By building numerous trees and averaging their predictions, Random Forests reduce variance and gain robustness, making them effective for a wide range of tasks.
- Predictive Maintenance
- Biomedical Applications
- Stock Market Analysis
- E-commerce Personalization
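The averaging step can be made visible with scikit-learn's RandomForestClassifier. This is a minimal sketch: the dataset and the choice of 100 trees are assumptions, not details from the presentation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each tree is trained on a bootstrap sample with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The forest's probabilities are the average of its trees' probabilities.
forest_proba = forest.predict_proba(X_te)
manual_average = np.mean(
    [tree.predict_proba(X_te) for tree in forest.estimators_], axis=0
)

forest_acc = forest.score(X_te, y_te)
print("trees:", len(forest.estimators_))
print(f"test accuracy: {forest_acc:.2f}")
print("matches manual average:", np.allclose(forest_proba, manual_average))
```

Because each tree sees a different sample of the data, their individual errors partly cancel out when averaged.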
SELL YOURSELF
How Spam Detection Works Using Naïve Bayes
Training Phase
- Collect a dataset of emails labeled as Spam or Not Spam.
- Extract features from emails (e.g., individual words, word frequencies).
- Calculate probabilities of each word appearing in Spam vs. Not Spam emails.
Prediction Phase (Classifying a New Email)
- Extract words from the new email.
- Use Bayes' Theorem to compute the probability that the email is Spam or Not Spam based on the words.
- Assume feature independence: Treat each word as contributing independently to the final probability.
- Assign the label (Spam or Not Spam) based on the highest probability.
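The training and prediction phases above can be sketched with scikit-learn. The six training emails and two test emails are invented for illustration, and word counts with a multinomial Naïve Bayes model are assumed as the feature and model choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training phase: a tiny, made-up labeled dataset.
emails = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting agenda for monday",
    "please review the project report",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# Extract features: one column per word, holding word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Estimate per-class word probabilities (with smoothing for unseen words).
model = MultinomialNB().fit(X, labels)

# Prediction phase: extract words from new emails and pick the likelier class.
new_emails = ["claim your free prize", "project meeting on monday"]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)

for text, label, proba in zip(new_emails, predictions, probabilities):
    print(f"{text!r} -> {label} {dict(zip(model.classes_, proba.round(2)))}")
```

The independence assumption is what lets the model multiply per-word probabilities instead of modeling whole sentences.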
Challenges
Building deep neural networks presents significant challenges, requiring meticulous design and tuning of numerous layers and parameters to optimize performance.
K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm that sorts data into a specified number (K) of distinct clusters based on similarity. It is useful for identifying patterns and insights by grouping similar data points together.
- Market Segmentation
- Document Clustering
- Image Segmentation
- Anomaly Detection
Learning Process
Neural networks learn by adjusting their connection weights based on prediction errors during the training phase, using algorithms such as backpropagation combined with an optimization technique like gradient descent.
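The loop below is a minimal sketch of that process in NumPy: a forward pass, backpropagation of the error, and a gradient descent update. The network size, the XOR toy data, the learning rate, and the number of steps are all assumptions made for illustration.

```python
import numpy as np

# Toy data: the XOR function, which a model without a hidden layer cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 units; weights start as small random values.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 1.0, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0.0, 1.0, (4, 1)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
losses = []
for step in range(5000):
    # Forward pass: compute predictions and the mean squared error.
    hidden = np.tanh(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    losses.append(float(np.mean((output - y) ** 2)))

    # Backpropagation: apply the chain rule from the output back to the input.
    d_output = 2.0 * (output - y) / len(X) * output * (1.0 - output)
    grad_W2 = hidden.T @ d_output
    grad_b2 = d_output.sum(axis=0, keepdims=True)
    d_hidden = (d_output @ W2.T) * (1.0 - hidden ** 2)
    grad_W1 = X.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0, keepdims=True)

    # Gradient descent: move each weight against its gradient.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(f"loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.4f}")
```

Frameworks automate the backward pass, but the update rule they apply is the same idea at a larger scale.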