Machine Learning
Quentin Saguer
Created on November 27, 2024
Team Members : SAGUER Quentin / GUESSOUS Samy / CARRE Pablo / PONTHIEU Gabriel / TALLARON Matéo
Machine Learning
Dataset Overview
- Source : Cyber Threat Detection document
- This dataset contains 1430 rows and 23 columns
- Problem statement :
- Goal : Classify network activities as either malicious or benign
- Target : Label column, where 1 = malicious and 0 = benign
Exploring Dataset
Data Cleaning
- Tasks performed :
- Removed irrelevant columns
- Verified there were no missing values in the dataset
- Checked for duplicates : "No duplicate rows found"
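The cleaning tasks above could be sketched in pandas roughly as follows. This is a minimal illustration, not the team's actual code: the small DataFrame stands in for the real dataset, and `Session_ID` is a hypothetical example of an irrelevant column (the slides do not say which columns were removed).

```python
import pandas as pd

# Toy stand-in for the real dataset (1430 rows, 23 columns).
# "Session_ID" is a hypothetical irrelevant column, used for illustration only.
df = pd.DataFrame({
    "Session_ID": ["a1", "b2", "c3", "d4"],
    "Packet_Length": [512, 1500, 64, 980],
    "Bytes_Sent": [2048, 90000, 128, 4096],
    "Label": [0, 1, 0, 1],
})

# 1. Remove columns that carry no predictive information.
df = df.drop(columns=["Session_ID"])

# 2. Verify there are no missing values.
missing_per_column = df.isnull().sum()
print("Missing values per column:")
print(missing_per_column)

# 3. Check for duplicate rows.
n_duplicates = int(df.duplicated().sum())
if n_duplicates == 0:
    print("No duplicate rows found")
else:
    print(f"{n_duplicates} duplicate rows found")
```

If duplicates or missing values did turn up, `df.drop_duplicates()` and `df.dropna()` would be the usual first steps.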
Splitting features and target
- Features : All relevant columns except Label
- Target : The Label column
Data Visualization
- Charts to include :
- Bar chart : Distribution of Label values
- Histogram : Distribution of Packet_Length or another numeric feature
- Heatmap : Correlation between features (use seaborn or similar)
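The three charts listed above can be sketched with pandas and matplotlib. The data here is randomly generated as a stand-in for the real dataset, so the plots only show the mechanics; the heatmap uses plain matplotlib as the "or similar" option, and `seaborn.heatmap(corr, annot=True)` would be a drop-in alternative.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): only the column names come from the slides.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Packet_Length": rng.integers(40, 1500, size=200),
    "Bytes_Sent": rng.integers(100, 100_000, size=200),
    "Label": rng.choice([0, 1], size=200, p=[0.8, 0.2]),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart: distribution of Label values (reveals class imbalance).
label_counts = df["Label"].value_counts().sort_index()
axes[0].bar(label_counts.index.astype(str), label_counts.values)
axes[0].set_title("Distribution of Label")
axes[0].set_xlabel("Label")
axes[0].set_ylabel("Count")

# Histogram: distribution of one numeric feature.
axes[1].hist(df["Packet_Length"], bins=30)
axes[1].set_title("Distribution of Packet_Length")
axes[1].set_xlabel("Packet_Length")

# Heatmap: pairwise correlation between all numeric columns.
corr = df.corr()
im = axes[2].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Feature correlation")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
```

Call `plt.show()` in an interactive session, or `fig.savefig(...)` with a path of your choice, to view the result.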
Splitting Data
- Process :
- Split dataset into training (80%) and testing (20%)
- Use the Python code shown on the slide
- Why ?
- To ensure the model is tested on unseen data
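Separating features from the target and then making the 80/20 split might look like the sketch below, using scikit-learn's `train_test_split`. The data is synthetic, and `random_state=42` and `stratify=y` are assumptions added here for reproducibility and to keep the class ratio equal in both subsets; the slides do not say whether the team used them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows instead of the real 1430.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Packet_Length": rng.integers(40, 1500, size=100),
    "Bytes_Sent": rng.integers(100, 100_000, size=100),
    "Label": rng.choice([0, 1], size=100, p=[0.8, 0.2]),
})

# Features: every column except Label. Target: the Label column.
X = df.drop(columns=["Label"])
y = df["Label"]

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # assumption: fixed seed for reproducibility
    stratify=y,       # assumption: preserve the malicious/benign ratio
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```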
Training Models
- Models used :
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
- Process :
- Train each model using the training data
- Use the Python code shown on the slide
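Training the three models listed above could be sketched as follows. The data comes from scikit-learn's `make_classification` as a stand-in for the real dataset. The hyperparameters and the `StandardScaler` step in front of Logistic Regression and the SVM are assumptions, not something the slides specify; scaling is a common choice because both models are sensitive to feature magnitudes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption): roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The three models named on the slide.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Train each model on the training data, then score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to run the same evaluation over all of them later.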
Model Evaluation
- Key Insights :
- Class imbalance is a challenge but can be managed with Random Forest
- Key features such as Packet_Length and Bytes_Sent are critical for classification
- Visualizing the data helped in understanding distributions and feature importance, leading to a better model
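One way to check these insights is to look beyond accuracy: a confusion matrix and per-class precision/recall show how the minority class is handled, and a Random Forest exposes feature importances. The sketch below uses synthetic data; `feature_3` and `feature_4` are hypothetical column names, and `class_weight="balanced"` is one possible way to address imbalance, assumed here rather than taken from the slides.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption).
X_arr, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
# Only Packet_Length and Bytes_Sent appear in the slides; the rest are hypothetical.
feature_names = ["Packet_Length", "Bytes_Sent", "feature_3", "feature_4"]
X = pd.DataFrame(X_arr, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" reweights classes inversely to their frequency (assumption).
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 per class matter more than accuracy under imbalance.
print(classification_report(y_test, y_pred, target_names=["benign", "malicious"]))

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

On the real dataset, the same report run for each of the three models would give the basis for comparing them.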
Conclusion
- Best Model :
- Random Forest showed the best overall performance for this classification task
Any questions?