Hey there!
In this article, you're going to create a model that predicts the gold price using the dataset below.
Dataset: Gold Price Dataset🪙
If you have been reading my articles, you know what you should be doing now. If this is the first time you are reading my article, all I want you to do is,
=> Download the dataset using the above link (you will get a zip file).
=> Extract the dataset from the zip file.
=> Open your Google Colab.
=> Upload the dataset into your Google Colab and follow along.
All set?
Let’s go.
Import pandas and load the dataset into a variable called df.
import pandas as pd
df = pd.read_csv('gld_price_data.csv')
Let's take a look at the dataset and learn more about it using different methods and attributes.
df.head()
"""
OUTPUT:
Date SPX GLD USO SLV EUR/USD
0 1/2/2008 1447.160034 84.860001 78.470001 15.180 1.471692
1 1/3/2008 1447.160034 85.570000 78.370003 15.285 1.474491
2 1/4/2008 1411.630005 85.129997 77.309998 15.167 1.475492
3 1/7/2008 1416.180054 84.769997 75.500000 15.053 1.468299
4 1/8/2008 1390.189941 86.779999 76.059998 15.590 1.557099
"""
As you can see from the above output, there are values of gold, stocks, and currencies with their respective dates. Our focus here is to create a model that predicts the value of "GLD", which stands for gold, using the other features: "SPX", which I believe is the S&P 500 index; "USO", which is the United States Oil Fund; "SLV", which I believe tracks silver (SLV is the ticker of the iShares Silver Trust); and "EUR/USD", which is the euro to US dollar exchange rate.
Let's check the shape of this dataset and whether it has any null values in it.
df.shape #(2290, 6)
df.isnull().sum()
"""
OUTPUT:
Date 0
SPX 0
GLD 0
USO 0
SLV 0
EUR/USD 0
"""
Our dataset has 2290 records and 6 columns, and there are no null values in it.
Before moving further, let's drop the "Date" column because we won't use it as a feature.
df = df.drop(columns=['Date'])
I want to know the correlation between all the columns in the dataset. Let's use Plotly to create a heatmap.
import plotly.express as px
px.imshow(df.corr(), text_auto=True)
Only the "SLV" column has a strong correlation with the "GLD" column, at about 0.87.
df.corr()['GLD']
"""
OUTPUT:
SPX 0.049345
GLD 1.000000
USO -0.186360
SLV 0.866632
EUR/USD -0.024375
"""
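For intuition, each number in that output is a Pearson correlation coefficient, which is what df.corr() computes by default. Here is a minimal pure-Python sketch of the formula. The helper name pearson and the toy lists are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two variance terms (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, NOT taken from the dataset.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # moves exactly with a
c = [8.0, 6.0, 4.0, 2.0]   # moves exactly against a

print(pearson(a, b))  # close to +1: perfect positive relationship
print(pearson(a, c))  # close to -1: perfect negative relationship
```

A value near 0, like the one for "EUR/USD", means there is almost no linear relationship with "GLD". A nonlinear relationship could still exist.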
This is a very simple dataset. We don't have much left to do. Let's simply split it into a train set and a test set.
X = df.drop(columns=['GLD'])
y = df['GLD']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=345)
X_train.shape, y_train.shape #((1832, 4), (1832,))
X_test.shape, y_test.shape #((458, 4), (458,))
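The sizes in those comments follow from test_size=0.2. Here is a quick arithmetic check. It assumes scikit-learn rounds the test share up and gives the remaining rows to the train set, which is my understanding of train_test_split when only test_size is passed.

```python
import math

n_samples = 2290   # number of rows, from df.shape
test_size = 0.2    # same value passed to train_test_split

# Assumed rounding rule: the test share is rounded up, the rest goes to train.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1832 458
```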
(Note: We're going to build two models, namely RandomForestRegressor and DecisionTreeRegressor, so there is no need to standardize the dataset. Tree-based models split on thresholds and are not affected by feature scale. With scale-sensitive models, for example regularized linear regression, you should standardize the features before fitting the model.)
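To make the note concrete, standardizing means rescaling each feature to zero mean and unit variance, using statistics learned from the train rows only. Here is a minimal pure-Python sketch on one made-up feature. The function names fit_scaler and transform and all the numbers are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Rescale values using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up numbers for a single feature, NOT taken from the dataset.
train_feature = [10.0, 20.0, 30.0, 40.0]
test_feature = [25.0, 50.0]

mean, std = fit_scaler(train_feature)              # fit on the train set only
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)   # reuse the train statistics

print(train_scaled)
print(test_scaled)
```

In practice, scikit-learn's StandardScaler does the same thing for every column at once. Fitting it on the train set and only transforming the test set keeps information from the test rows out of training.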
Let’s first begin with,
Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
rfr_prediction = rfr.predict(X_test)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mean_absolute_error(y_test, rfr_prediction) #1.3395399058296906
mean_squared_error(y_test, rfr_prediction) #7.142444162289108
r2_score(y_test, rfr_prediction) #0.9863465735466982
The above evaluation metrics are really good. You can loosely think of the r2_score in regression like accuracy in classification: the closer it is to 1, the better.
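If you are curious what these three metrics measure, here is a minimal pure-Python sketch of the standard formulas. The function names and the toy values are made up for illustration, and the numbers are not model output.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average squared error, punishes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, NOT taken from the dataset or from the models above.
y_true = [80.0, 90.0, 100.0, 110.0]
y_pred = [82.0, 89.0, 101.0, 108.0]

print(mae(y_true, y_pred))
print(mse(y_true, y_pred))
print(r2(y_true, y_pred))
```

One difference from accuracy: R squared can go below 0 when a model does worse than always predicting the mean of y_true.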
Now let's try,
Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_prediction = dtr.predict(X_test)
mean_absolute_error(y_test, dtr_prediction) #1.5539302336244543
mean_squared_error(y_test, dtr_prediction) #12.746302031895944
r2_score(y_test, dtr_prediction) #0.9756342936129743
This is good, but Random Forest Regressor is better. Let's stick with that.
That's it.