Before we go through the code, here are some packages and libraries that should be installed and imported first:
# Installing packages
!pip install pandas
!pip install seaborn
!pip install matplotlib
!pip install tqdm
!pip install nltk
!pip install wordcloud

# Importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Datasets
Then, we need to download the dataset to our local repository. The dataset contains news from many fields, such as politics, health, and others. In this project, we'll use the health news, the politics news, and all news as training data. Here are the train-test variations that we'll evaluate:
- All news as training data and politics news as test data
- All news as training data and health news as test data
- Politics news as training data and test data
- Health news as training data and test data
- Politics news as training data and health news as test data
- Health news as training data and politics news as test data
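The six variations above can be written down as plain (training subset, test subset) pairs, which makes it easy to loop over them later. This is only a sketch: the subset names and the `describe` helper are hypothetical and do not come from the original gists.

```python
# Hypothetical subset names; the original post does not name its variables.
VARIATIONS = [
    ("all", "politics"),
    ("all", "health"),
    ("politics", "politics"),
    ("health", "health"),
    ("politics", "health"),
    ("health", "politics"),
]


def describe(train_name, test_name):
    """Return a readable label for one train/test variation."""
    return f"train on {train_name} news, test on {test_name} news"


for train_name, test_name in VARIATIONS:
    print(describe(train_name, test_name))
```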
The dataset can be downloaded: here
To prevent bias during training, we should train on the same number of fake news and real news samples (we'll drop the surplus rows from the larger class).
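One possible way to do this balancing with pandas is to downsample every class to the size of the smallest one. The sketch below is not taken from the original gists: the column names `text` and `label`, the `balance_classes` helper, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd


def balance_classes(df, label_col="label", seed=42):
    """Downsample every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle the combined rows so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (made up): 5 fake rows and 3 real rows.
news = pd.DataFrame({
    "text": [f"headline {i}" for i in range(8)],
    "label": ["fake"] * 5 + ["real"] * 3,
})

balanced = balance_classes(news)
print(balanced["label"].value_counts())
```

Fixing `random_state` keeps the dropped rows the same between runs, which helps when comparing classifiers on the same balanced set.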
Gist Codes
Gist code for all news as training data: here
Gist code for health news as training data: here
Gist code for politics news as training data: here
After training on all news, politics news, and health news, we can find the accuracy of each classifier model as shown below.
The training accuracy of the decision tree method is higher than that of the logistic regression method for all news, politics news, and health news.
Apart from that, we can also find the test accuracy for each type of training data used, as shown below.
As shown above, test accuracy with the decision tree is considerably higher than with logistic regression. However, with health news as both training and test data, the test accuracy of the decision tree is slightly lower than that of logistic regression. Accuracy is low for both methods when the training data is of a different type than the test data. With all news as training data and health news as test data, accuracy is the lowest among all the variations we tested. When we checked the all-news data itself, we found that it is mostly about politics (health news makes up less than 10% of the data). That is probably why we got the lowest accuracy, 47.87%, with all news as training data and health news as test data.
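To make the comparison concrete, here is a minimal sketch of how training and test accuracy could be measured for both classifiers on one train/test variation. It assumes scikit-learn and TF-IDF features, neither of which is confirmed by the post, and the `compare_classifiers` helper and the toy headlines are made up for illustration. The numbers it prints say nothing about the results reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(train_texts, train_labels, test_texts, test_labels, seed=0):
    """Fit both models on the training texts and report train/test accuracy."""
    # Assumption: TF-IDF features; the vocabulary is learned from training data only.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, train_labels)
        results[name] = {
            "train_accuracy": accuracy_score(train_labels, model.predict(X_train)),
            "test_accuracy": accuracy_score(test_labels, model.predict(X_test)),
        }
    return results


# Toy headlines (made up), standing in for one train/test variation.
train_texts = [
    "miracle pill cures every disease overnight",
    "senator secretly replaced by a robot",
    "health ministry publishes new vaccination schedule",
    "parliament passes the annual budget bill",
]
train_labels = ["fake", "fake", "real", "real"]
test_texts = [
    "miracle diet melts fat overnight",
    "ministry publishes hospital funding report",
]
test_labels = ["fake", "real"]

results = compare_classifiers(train_texts, train_labels, test_texts, test_labels)
for name, scores in results.items():
    print(name, scores)
```

Fitting the vectorizer on the training texts only means words that appear solely in the test domain are ignored, which is one plausible reason cross-domain accuracy drops.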
From this project we can conclude several interesting things, such as:
- Training with the decision tree classifier gives better accuracy than logistic regression, with accuracy of up to 98%.
- Test accuracy improves when the training data is of the same type as the test data.
- Test accuracy is below 80% when the test data is of a different type than the training data.