ML_PRESENTATION

163_SHREYA SUMBLY

Created on May 4, 2023

MACHINE LEARNING WITH PYTHON

Fake News

Build a system to identify unreliable news articles

Presented by: 1163 - Shreya Sumbly, 2004 - Aditi Jagtap, 2005 - Shreya Ambeti, 2007 - Mrudula Arvikar

CONTENT

01 INTRODUCTION
02 DATA COLLECTION AND PROCESSING
03 VISUALIZATION
05 MODEL DESCRIPTION
06 MODEL SELECTION
07 TESTING AND EVALUATION OF MODEL
08 OUTCOME
09 CONCLUSION

INTRODUCTION

Do you trust all the news you hear from social media?

Not all news is real, right?

How will you detect fake news?

What is Fake News? A type of yellow journalism, fake news consists of stories that may be hoaxes and that are generally spread through social media and other online media.
This project builds a machine learning model that classifies news articles as real or fake. The model is trained on a dataset of news articles and their corresponding labels. The goal is to classify each article accurately based on its textual content.
The project involves several steps: data preprocessing, feature extraction, model training, and evaluation. Data preprocessing cleans the raw data to remove noise and inconsistencies. Feature extraction converts the textual data into numerical form using TF-IDF weighting, via scikit-learn's TfidfVectorizer. Model training fits a logistic regression model on the preprocessed data. Finally, the model is evaluated using accuracy, a confusion matrix, and a classification report.
The project is useful for detecting fake news articles and limiting the spread of misinformation. It can be applied in domains such as social media, news websites, and online forums. It could be improved further by using more advanced machine learning algorithms and by incorporating other features such as image analysis, social network analysis, and sentiment analysis.

DATA COLLECTION AND PROCESSING

  • In this project, we used a publicly available dataset of news articles that has been collected from various sources.
  • The dataset contains both real and fake news articles, which are labeled accordingly.
  • The news dataset used in this project was downloaded from Kaggle, a popular platform for data science projects. The dataset was originally compiled by William Yang Wang from the University of California, Santa Barbara, and it contains 20,000 news articles, half of which are labeled as real news and the other half as fake news. The dataset can be downloaded from [here](https://www.kaggle.com/c/fake-news/data).

PROCESSING

Before feeding the textual data to the machine learning model, we need to preprocess it to make it suitable for analysis. Here are the steps we followed for preprocessing:
  • Handling missing values
  • Merging the author name and news title
  • Text cleaning
  • Stemming
  • Converting textual data to numerical data
By following these preprocessing steps, we converted the textual data into a format that can be fed to a machine learning model.
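
The missing-value handling, merging, cleaning, and stemming steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the stop-word set is a tiny hand-picked subset, and `crude_stem` is a simple suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer. The function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word subset, not a full English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "for"}


def crude_stem(word):
    """Strip a few common suffixes. A rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(author, title):
    """Build the cleaned, stemmed 'content' string for one article."""
    # Handling missing values: treat None as an empty string.
    author = author or ""
    title = title or ""
    # Merging the author name and news title.
    content = f"{author} {title}"
    # Text cleaning: keep letters only, then lowercase.
    content = re.sub(r"[^a-zA-Z]", " ", content).lower()
    # Stop-word removal followed by stemming.
    words = [crude_stem(w) for w in content.split() if w not in STOP_WORDS]
    return " ".join(words)


print(preprocess("Jane Doe", "The Senate Is Voting on Taxes"))
print(preprocess(None, "Scientists Discover 3 Hidden Planets!"))
```

The crude stemmer over-trims some words ("voting" becomes "vot"), which is typical of rule-based stemming: the goal is a consistent token, not a dictionary word.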

VISUALIZATION

[Figures: word cloud; two bar plots; confusion matrix; precision, recall, and F1-score]

FEATURE SELECTION

  • Feature selection was done as part of the pre-processing stage using the TfidfVectorizer class from the sklearn.feature_extraction.text module.
  • TfidfVectorizer converts the textual data into a matrix of features by building a vocabulary of the unique words in the text corpus and weighting each word by how often it appears in a document, discounted by how many documents in the corpus contain it.
  • Two steps reduce the vocabulary before weighting:
1. Removing stop words (TfidfVectorizer also supports this through its stop_words parameter)
2. Stemming (not built into TfidfVectorizer; applied to the text in the preprocessing step)
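
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which matches scikit-learn's default settings; unlike TfidfVectorizer, it simply splits on whitespace. The function name `tfidf` and the two example documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows


vocab, rows = tfidf(["senate vote tax", "alien vote hoax"])
for word, weight in zip(vocab, rows[0]):
    print(f"{word}: {weight:.3f}")
```

In the first document, "vote" gets a lower weight than "senate" because it appears in both documents and so carries less information about either one.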

MODEL SELECTION

We used logistic regression because it is one of the most popular algorithms for binary classification problems. Before settling on it, we also evaluated other algorithms, such as Naive Bayes. Here are the steps we followed for model selection:
1. Split the data into training and testing sets.
2. Train the model on the training set using different algorithms.
3. Evaluate the performance of each algorithm using the testing set.
4. Select the algorithm with the highest accuracy score.
After evaluating all the algorithms, we found that logistic regression performed best in terms of accuracy, so we used it to classify the news articles as real or fake.
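
The four model-selection steps could look roughly like this with scikit-learn. This is a sketch under assumptions: the eight headlines are invented stand-ins for the Kaggle data (label 1 = fake, 0 = real), `compare_models` is a hypothetical helper, and the vectorizer is fitted on the training split only, which is a design choice of this sketch rather than something the slides state.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumption: toy, made-up headlines; 0 = real, 1 = fake.
texts = [
    "senate passes budget bill after long debate",
    "city council approves new transit funding",
    "central bank holds interest rates steady",
    "health agency publishes annual flu report",
    "aliens secretly control world banks claims insider",
    "miracle pill cures every disease overnight",
    "celebrity clone replaces star in secret plot",
    "shocking secret the government hides from you",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def compare_models(texts, labels, seed=2):
    """Return the test accuracy of each candidate algorithm."""
    # 1. Split the data into training and testing sets.
    train_texts, test_texts, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    # Fit TF-IDF on the training texts only, then reuse it for the test texts.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)  # 2. Train each algorithm.
        scores[name] = accuracy_score(y_test, model.predict(X_test))  # 3. Evaluate.
    return scores


scores = compare_models(texts, labels)
best = max(scores, key=scores.get)  # 4. Select the highest accuracy.
print(scores, "->", best)
```

With only eight examples the scores are not meaningful; the point is the shape of the comparison loop, which stays the same on the full dataset.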

Model Description/Algorithm

The model used in this project is Logistic Regression, which is a popular machine learning algorithm for binary classification problems.
ALGORITHM
1. Import the required libraries
2. Download the stopwords from the NLTK package
3. Load the news dataset into a Pandas DataFrame
4. Replace any missing values in the dataset with empty strings
5. Merge the author name and news title columns into a single 'content' column
6. Perform text preprocessing on the 'content' column using stemming and stopword removal
7. Convert the textual data to numerical data using TfidfVectorizer
8. Split the dataset into training and test data using train_test_split
9. Train a Logistic Regression model on the training data
10. Evaluate the model using accuracy score, confusion matrix, and classification report
11. Visualize the confusion matrix using seaborn and matplotlib
12. Calculate precision, recall, and F1-score for the classification report
13. Print the precision, recall, and F1-score values

Results

Logistic Regression Model:
  • Accuracy: 0.89
  • Precision: 0.91
  • Recall: 0.85
  • F1-Score: 0.88
Naive Bayes Model:
  • Accuracy: 0.85
  • Precision: 0.89
  • Recall: 0.78
  • F1-Score: 0.83

OUTCOME

1. Comparison:
  • The results show that the Logistic Regression model outperformed the Naive Bayes model in terms of accuracy and F1-Score.
  • The precision and recall values were also higher for the Logistic Regression model.
  • Therefore, we can conclude that the Logistic Regression model is a better choice for this problem.
2. The confusion matrix for the logistic regression algorithm showed that it correctly identified 596 out of 623 real news articles and 677 out of 712 fake news articles.
Our project was successful in developing an accurate model for predicting fake news articles. Our logistic regression algorithm achieved high accuracy scores and demonstrated strong performance in identifying both fake and real news articles.
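
As a worked check, the confusion-matrix counts quoted above can be turned into accuracy, precision, recall, and F1-score directly. This sketch assumes "fake" is the positive class, which the slides do not state; `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the slide: 677 of 712 fake and 596 of 623 real articles were correct.
fake_total, fake_correct = 712, 677
real_total, real_correct = 623, 596

metrics = metrics_from_counts(
    tp=fake_correct,               # fake articles flagged as fake
    fn=fake_total - fake_correct,  # fake articles missed
    tn=real_correct,               # real articles kept as real
    fp=real_total - real_correct,  # real articles flagged as fake
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

These counts work out to an accuracy of roughly 0.95, which is higher than the 0.89 listed under Results, so the two sets of figures may come from different runs or data splits.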

CONCLUSION

Our project demonstrates the effectiveness of machine learning algorithms in detecting fake news. The model we developed can be used to identify fake news and prevent its spread, thereby improving the quality and credibility of news sources. Future work could involve exploring other machine learning algorithms, improving the dataset used for training the model, and implementing the model in real-world scenarios.