Build a system to identify fake news articles!

Solving a Real-World Problem Using Text Analytics and a Classical Machine Learning Approach!

The epidemic spread of fake news is a side effect of social networks becoming a channel for circulating news, in contrast to traditional mass media such as newspapers, magazines, radio, and television. Because people struggle to distinguish true from false facts, fake news becomes a threat to logical truth, democracy, journalism, and the credibility of government institutions.

The veracity of information is an essential part of its integrity. Combating fake news therefore makes integrity and veracity checking inseparable from how information and data on social networks are consumed at the application layer. Spreading fake content wastes processing and network resources. It is also a serious threat to the integrity of the information and to the credibility of the service that provides it. Hence, sharing untrue information bears on the Quality of Trust (QoT) of news dissemination, that is, how much a user trusts the content of a particular source.

In this post, we survey methods for preprocessing natural-language data, vectorization, dimensionality reduction, machine learning, and quality assessment of information retrieval. We also put the identification of fake news into context.

The objective is to develop a machine learning program that identifies when an article might be fake news.

What is fake news? The term "fake news" originally referred to false and often sensationalist information disseminated under the guise of relevant news. However, its use has evolved, and it is now considered synonymous with the spread of false information on social media. It is noteworthy that, according to Google Trends, the term "fake news" reached significant popularity in 2017.

The data is in a file named train.csv, which contains 20,800 observations (rows) and 5 columns: id (a unique id for a news article), title (the title of the article), author (the author of the article), text (the text of the article, which could be incomplete), and label (marks the article as potentially unreliable: 1 for unreliable or fake, 0 for reliable).

First 10 data observations
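A minimal sketch of loading the file and looking at the first observations. It uses only the Python standard library, and the three rows below are invented stand-ins for train.csv, not real data; only the five column names come from the description above.

```python
import csv
import io

# Tiny invented sample with the same five columns as train.csv (illustrative only).
SAMPLE_CSV = """id,title,author,text,label
0,Example title A,Author One,Some article text,1
1,Example title B,Author Two,Another article text,0
2,Example title C,,Text without an author,1
"""

def load_rows(handle):
    """Read CSV rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

# For the real file, pass open("train.csv", newline="", encoding="utf-8") instead.
rows = load_rows(io.StringIO(SAMPLE_CSV))
columns = list(rows[0].keys())

print(columns)
for row in rows[:10]:  # first 10 observations (the sample has only 3)
    print(row)
```

In practice a data-frame library such as pandas would do the same job with `read_csv` and `head`.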

This is a binary classification problem: for a given news article, we need to predict whether it is reliable or not.

What is the total number of observations for training? 20,800.

Are there any missing values in the data?
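One way to answer this question is to count the empty fields in each column. The sketch below assumes rows loaded as dictionaries and treats an empty or whitespace-only string as missing; the sample rows are invented for illustration.

```python
import csv
import io

# Invented sample rows: one is missing its title, another its author.
SAMPLE_CSV = """id,title,author,text,label
0,Example title A,Author One,Some article text,1
1,,Author Two,Another article text,0
2,Example title C,,Text without an author,1
"""

def count_missing(rows):
    """Return {column: number of rows where the field is empty or absent}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
print(missing)
```

With pandas, `df.isnull().sum()` gives an equivalent per-column count.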

What is the distribution of class labels?

There are 7,924 unreliable articles (fake, label 1) and 10,361 reliable articles (label 0). The data is not severely imbalanced, but almost 43% of the articles are FAKE! 😥
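The percentage can be checked with a few lines of Python. This sketch rebuilds a label list from the two counts quoted above rather than reading the real file; the helper name `label_distribution` is only illustrative.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, percentage of all labels)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, 100.0 * count / total) for label, count in counts.items()}

# Label list rebuilt from the counts quoted in the text.
labels = [1] * 7924 + [0] * 10361
dist = label_distribution(labels)

for label, (count, pct) in sorted(dist.items()):
    print(f"label={label}: {count} articles ({pct:.1f}%)")
# label=0: 10361 articles (56.7%)
# label=1: 7924 articles (43.3%)
```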

Which are the common authors of fake news articles?

Which are the common authors of reliable news articles?
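Both author questions come down to counting authors within each class. A minimal sketch, assuming rows are dictionaries with `author` and `label` keys; the author names and rows are invented placeholders, not results from the dataset.

```python
from collections import Counter

def top_authors(rows, label, n=5):
    """Return the n most frequent non-empty authors among rows with the given label."""
    counts = Counter(
        row["author"] for row in rows
        if row["label"] == label and row["author"]
    )
    return counts.most_common(n)

# Invented rows for illustration only.
rows = (
    [{"author": "Author A", "label": 1}] * 3
    + [{"author": "Author B", "label": 1}]
    + [{"author": "", "label": 1}]          # missing author is skipped
    + [{"author": "Author C", "label": 0}] * 2
    + [{"author": "Author A", "label": 0}]
)

fake_top = top_authors(rows, label=1)
reliable_top = top_authors(rows, label=0)
print("fake:", fake_top)
print("reliable:", reliable_top)
```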

Why do we need to convert text data to numeric data?

Before we proceed, let’s understand two terms that will be used frequently in this post.

Now we are good to go!

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words and a measure of the presence of those known words.

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

Let us understand from the following example:

Here, the, cat, sat, in, hat, and with are the 6 features for the 3 documents, and the entries are the frequencies of those features in each document. Very simple, isn't it?
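A minimal sketch of how such a table can be built with the standard library. The three documents are an assumption, picked so that they produce exactly the six features listed above; tokenization is a plain lowercase whitespace split.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect each distinct lowercase token, in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in documents:
        for token in document.lower().split():
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

def vectorize(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]

# Assumed example documents (not taken from the dataset).
documents = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]

vocabulary = build_vocabulary(documents)
matrix = [vectorize(document, vocabulary) for document in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is one document, and each column is the frequency of one vocabulary word; word order within the document plays no role.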

We will get 5000 features from the BoW.
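The post does not say how the 5000 features are selected. One common choice, assumed here, is to keep only the most frequent tokens across the whole corpus; scikit-learn's `CountVectorizer` offers this through its `max_features` parameter. A small standard-library sketch of the idea, with `k=2` instead of 5000 and invented documents:

```python
from collections import Counter

def top_k_vocabulary(documents, k):
    """Keep the k tokens with the highest total count across all documents."""
    totals = Counter()
    for document in documents:
        totals.update(document.lower().split())
    return [token for token, _ in totals.most_common(k)]

# Assumed example documents (not taken from the dataset).
documents = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]

vocabulary = top_k_vocabulary(documents, k=2)
print(vocabulary)  # ['the', 'cat']
```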

Top 7 feature names

Here, we split the data into train and validation (test) sets in a 75:25 ratio. The train set contains 21,141 data points and the validation set contains 7,048.
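A minimal sketch of a shuffled 75:25 split using the standard library. Rounding the validation size up and the fixed seed are assumptions of this sketch; the item count is simply the sum of the two sizes quoted above.

```python
import math
import random

def train_validation_split(items, validation_fraction=0.25, seed=42):
    """Shuffle a copy of items and split off ceil(fraction * n) items for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_validation = math.ceil(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Row indices standing in for the data points: 21141 + 7048 = 28189.
indices = range(28189)
train, validation = train_validation_split(indices)

print(len(train), len(validation))  # 21141 7048
```

Shuffling before splitting keeps both parts representative of the whole; the seed only makes the split reproducible.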

After splitting the data into train and validation sets, we get the following distribution of class labels, which shows that neither set is imbalanced.

A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any predictions for your problem, for example with classification accuracy or loss. The random prediction algorithm predicts a random outcome drawn from those observed in the training data; here, that means the random model predicts label 0 or 1 at random. The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem…
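A minimal sketch of such a random baseline, scored with accuracy. The label lists are invented, and drawing predictions from the training labels (so that class proportions are roughly preserved) is one reading of "as observed in the training data".

```python
import random

def random_baseline(train_labels, n_predictions, seed=0):
    """Predict by drawing labels at random from those seen in training."""
    rng = random.Random(seed)
    return [rng.choice(train_labels) for _ in range(n_predictions)]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Invented labels for illustration only.
train_labels = [0, 1, 0, 0, 1, 1, 0, 1]
validation_labels = [1, 0, 0, 1, 0, 1]

predictions = random_baseline(train_labels, len(validation_labels))
baseline_accuracy = accuracy(validation_labels, predictions)
print(predictions, baseline_accuracy)
```

Any real model should score clearly better than this number; otherwise it has learned nothing useful from the features.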

The Multinomial Naive Bayes algorithm is a probabilistic learning method that is mostly used in Natural Language Processing (NLP). It is based on Bayes' theorem and predicts the tag of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and then outputs the tag with the highest probability.

Naive Bayes is a family of algorithms that share one common principle: each feature is assumed to be independent of every other feature, given the class. In other words, the presence or absence of one feature is assumed not to affect the presence or absence of any other feature.
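The idea can be sketched from scratch in a few lines. This is an illustrative implementation with Laplace smoothing (`alpha=1.0` is an assumption), trained on four invented headlines; it is not the model or the data used in this post.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for document, label in zip(documents, labels):
        tokens = document.lower().split()
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(document, log_prior, log_likelihood):
    """Return the class with the highest posterior log probability."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for token in document.lower().split():
            if token in log_likelihood[c]:  # ignore out-of-vocabulary tokens
                score += log_likelihood[c][token]
        scores[c] = score
    return max(scores, key=scores.get)

# Invented training headlines: 1 = fake, 0 = reliable.
documents = [
    "shocking secret cure revealed",
    "you will not believe this shocking trick",
    "senate passes budget bill",
    "city council approves budget",
]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(documents, labels)
print(predict("shocking cure trick", log_prior, log_likelihood))   # 1
print(predict("council budget bill", log_prior, log_likelihood))   # 0
```

Summing log probabilities per token is exactly where the independence assumption shows up: each word contributes to the score on its own, regardless of the other words.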

Passive-Aggressive algorithms are generally used for large-scale learning. They are among the few online-learning algorithms. In online ML algorithms, the input data arrives in sequential order and the model is updated step by step. This is very useful when the amount of data is so large that training on the entire dataset at once is computationally infeasible. Put simply, an online-learning algorithm gets a training example, updates the classifier, and then throws the example away. A good example is detecting fake news on Twitter, where new data is added every second.

How does this algorithm work?

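In simple terms: the algorithm stays passive when a prediction is correct with enough margin and leaves the model unchanged, and it turns aggressive when a prediction is wrong or too uncertain, updating the model just enough to correct that example.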
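A minimal sketch of one common variant of the update rule (often called PA-I), written with plain lists. The label encoding (+1 for fake, -1 for reliable), the aggressiveness cap `C=1.0`, and the tiny stream of count vectors are all assumptions made for illustration.

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for a single example (x, y), with y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return w  # passive: margin is already large enough, keep the model
    tau = min(C, loss / norm_sq)  # aggressive: step size, capped at C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

def predict(w, x):
    """Predict +1 (fake) or -1 (reliable) from the sign of the score."""
    return 1 if dot(w, x) >= 0 else -1

# Invented stream of (bag-of-words counts, label) pairs, seen one at a time.
stream = [([2, 0, 1], 1), ([0, 2, 1], -1)] * 2

w = [0.0, 0.0, 0.0]
for x, y in stream:
    w = pa_update(w, x, y)  # the example is discarded after the update

print(w)
print(predict(w, [2, 0, 1]), predict(w, [0, 2, 1]))
```

Because each example is used once and then dropped, memory use stays constant no matter how long the stream runs.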

Thanks for reading ❤

For any suggestions or queries, leave your comments below and follow for updates.

If you liked the article, please hit the 👏 icon to support it. This will help other Medium users find it. Share it, so that others can read it!

Happy Learning! 😊
