Will They Click on the Advertisement?
--
In this project, we will be working with a made-up advertising dataset, indicating whether or not a particular internet user clicked on an advertisement. We will try to create a model that will predict whether or not they will click on an ad based off the features of that user.
This project and the dataset can also be found on my GitHub.
Exploratory Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_palette('viridis')
df = pd.read_csv('advertising.csv')
df.head()
df.info()
This data set contains the following features:
- ‘Daily Time Spent on Site’: consumer time on site in minutes
- ‘Age’: customer age in years
- ‘Area Income’: Avg. Income of geographical area of consumer
- ‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
- ‘Ad Topic Line’: Headline of the advertisement
- ‘City’: City of consumer
- ‘Male’: Whether or not consumer was male
- ‘Country’: Country of consumer
- ‘Timestamp’: Time at which consumer clicked on Ad or closed window
- ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad
df.describe()
df.isna().sum()
Visualize the data
sns.pairplot(df, hue='Clicked on Ad')
sns.countplot(x='Clicked on Ad', data=df)
plt.title('Label Count')
Both labels have the same count, so the classes are balanced.
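The balance seen in the count plot can also be checked numerically. A minimal sketch, using a short synthetic label series rather than the real column (the values below are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the 'Clicked on Ad' column (not the real dataset).
labels = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='Clicked on Ad')

# Absolute count per class, then the fraction of rows in each class.
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()

print(counts.to_dict())   # {0: 4, 1: 4}
print(shares.to_dict())   # {0: 0.5, 1: 0.5}
```

On the real data, the same two calls on df['Clicked on Ad'] would give the exact counts behind the bars.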
sns.displot(x='Age', data=df, bins=30, kde=True)
plt.title('Age Distributions')
It looks like ages 30–40 dominate our data.
# jointplot is figure-level, so the title is set on the figure rather than with plt.title
g = sns.jointplot(x='Age', y='Area Income', data=df, hue='Clicked on Ad')
g.fig.suptitle('Age vs Area Income', y=1.02)
g = sns.jointplot(x='Age', y='Daily Time Spent on Site', data=df, kind='kde')
g.fig.suptitle('Age vs Daily Time Spent on Site', y=1.02)
g = sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=df, hue='Clicked on Ad')
g.fig.suptitle('Daily Time Spent on Site vs Daily Internet Usage', y=1.02)
Datetime Object
The Timestamp column is still stored as strings, so we need to convert it to datetime.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Timestamp'].head()
From this datetime object, we can extract features like year, month, day, etc.
# The .dt accessor extracts each datetime component for the whole column at once
df['year'] = df['Timestamp'].dt.year
df['month'] = df['Timestamp'].dt.month
df['day'] = df['Timestamp'].dt.day
df['dayofweek'] = df['Timestamp'].dt.dayofweek
df['hour'] = df['Timestamp'].dt.hour
df.head()
df['year'].value_counts()
Since there is only one unique year value, the column carries no information, so we’re going to drop it.
df['month'].value_counts()
Not all months are present, but it’s not a problem
df['day'].nunique()
df['dayofweek'].value_counts()
0: Monday, 6: Sunday.
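To sanity-check the 0 = Monday convention, here is a small sketch on a single made-up timestamp (the same format as the Timestamp column). It uses the pandas .dt accessor and adds day_name() purely as a readable cross-check; the frame name features is just for this illustration.

```python
import pandas as pd

# One made-up timestamp, in the same format as the 'Timestamp' column.
ts = pd.to_datetime(pd.Series(['2016-05-21 10:32:19']))

features = pd.DataFrame({
    'month': ts.dt.month,
    'day': ts.dt.day,
    'dayofweek': ts.dt.dayofweek,   # 0 = Monday ... 6 = Sunday
    'hour': ts.dt.hour,
    'day_name': ts.dt.day_name(),   # human-readable check of dayofweek
})
print(features)
```

21 May 2016 was a Saturday, so dayofweek comes out as 5.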
df['hour'].nunique()
sns.countplot(x='month', data=df, palette='winter')
plt.title('Month Count')
sns.countplot(x='dayofweek', data=df, palette='summer_r')
plt.title('Day of Week Count')
plt.xlabel('dayofweek. 0=monday, 6=sunday')
sns.displot(x='day', data=df, alpha=0.4, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Day Distributions')
sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Hour Distributions')
Based on the day plot, ads are clicked mostly in the middle of the month and near the end of the month. Based on the hour plot, ads are clicked mostly in the late morning and, probably, around the time people head home from work or school.
Let’s drop the Timestamp and year columns.
df.drop('Timestamp', axis=1, inplace=True)
df.drop('year', axis=1, inplace=True)
Categorical Features
df.info()
Ad Topic Line
df['Ad Topic Line'].nunique()
df['Ad Topic Line'].head(30)
All of them are unique values. Maybe we can extract some features with natural language processing, but it would take a long time. For now, we will not do that and will not be using this column.
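For reference, the simplest form such text features could take is a bag-of-words count. This is only a sketch of the idea, not something used in this project: the headlines are made up, and CountVectorizer is one of several possible choices.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines standing in for the 'Ad Topic Line' column.
headlines = [
    'Cloud based analytics platform',
    'Reactive cloud monitoring tool',
    'Organic analytics dashboard',
]

# Bag-of-words: one column per distinct lowercase word, values are word counts.
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(headlines)

print(word_counts.shape)                 # (number of headlines, number of distinct words)
print(sorted(vectorizer.vocabulary_))    # the words that became columns
```

With about a thousand unique headlines the resulting matrix would be wide and sparse, which is part of why this step is skipped here.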
City
df['City'].nunique()
This column also has almost as many unique values as there are rows. We won’t be using this column either.
Country
df['Country'].nunique()
We can actually group this column by continents like Asia, Africa, North America, etc. But it would also take time to list them. We will do it next time, but for now we won’t be using this column.
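A rough sketch of what that grouping could look like, assuming a hand-written lookup table. The dictionary COUNTRY_TO_CONTINENT below is hypothetical and deliberately incomplete, and the sample countries are made up:

```python
import pandas as pd

# Hypothetical, deliberately incomplete lookup; a real one would cover every
# country that appears in the 'Country' column.
COUNTRY_TO_CONTINENT = {
    'Indonesia': 'Asia',
    'Japan': 'Asia',
    'Kenya': 'Africa',
    'Canada': 'North America',
}

countries = pd.Series(['Japan', 'Kenya', 'Canada', 'Atlantis'], name='Country')

# Unmapped countries fall into an explicit 'Other' bucket instead of NaN.
continents = countries.map(COUNTRY_TO_CONTINENT).fillna('Other')

# One-hot encode so the result is numeric and usable by logistic regression.
encoded = pd.get_dummies(continents, prefix='continent')
print(encoded)
```

The 'Other' bucket keeps unknown countries from turning into missing values at prediction time.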
Let’s drop these columns.
df.drop('Ad Topic Line', axis=1, inplace=True)
df.drop('City', axis=1, inplace=True)
df.drop('Country', axis=1, inplace=True)
So what we did so far:
- Converted Timestamp to a datetime object and extracted month, day, dayofweek, and hour
- Dropped Timestamp, Ad Topic Line, City, and Country
We can now move on to the next step, which is building the model.
Build the Model
from sklearn.model_selection import train_test_split
X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
len(X_train), len(X_test)
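The split above is purely random. An optional variation, shown here on synthetic data, is to pass stratify so that both splits keep the same label ratio. The names X_demo and y_demo are made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, perfectly balanced labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([0, 1] * 50)

# stratify=y_demo keeps the 50/50 label ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))        # 70 30
print(y_tr.mean(), y_te.mean())    # 0.5 0.5
```

With labels as balanced as ours, the plain random split is already close to this, so stratifying is a safeguard rather than a necessity.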
We will be using logistic regression for the classification
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=300)
model.fit(X_train, y_train)
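Raising max_iter is one way to help the solver when features sit on very different scales (minutes next to incomes in the tens of thousands). Another common option, not used in this project, is to standardize the features inside a pipeline. A minimal sketch on synthetic data, where the labeling rule is made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (think minutes vs. income).
rng = np.random.default_rng(0)
n = 200
minutes = rng.normal(65, 15, n)
income = rng.normal(55000, 13000, n)
X_demo = np.column_stack([minutes, income])
# Made-up rule for the label: less time on site means a click.
y_demo = (minutes < 65).astype(int)

# The scaler is fitted inside the pipeline, so it only ever sees the data
# passed to fit() and is re-applied automatically at predict time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

print(scaled_model.score(X_demo, y_demo))  # training accuracy on the toy data
```

Because the scaler lives in the pipeline, saving the pipeline with joblib would also save the scaling step.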
Predictions and Evaluations
from sklearn.metrics import classification_report, confusion_matrix
preds = model.predict(X_test)
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
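The numbers in the classification report can be derived directly from the confusion matrix. The sketch below shows the arithmetic on made-up labels and predictions; they are not the results of the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, not the results of the model above.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted clicks, how many were real
recall = tp / (tp + fn)      # of the real clicks, how many were found

print(tn, fp, fn, tp)        # 3 1 1 5
print(accuracy)              # 0.8
```

Here accuracy matches accuracy_score(y_true, y_pred), which is the figure reported on the accuracy row of the classification report.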
The model did a pretty good job with 90% accuracy. Now let’s save the model.
import joblib
joblib.dump(model, 'log-reg-adv.joblib')
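A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. This sketch uses a tiny synthetic model and a temporary directory; the file name demo-model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic model, only to demonstrate the save/load round trip.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
original = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-model.joblib')  # hypothetical file name
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```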
Test on Brand New Data
We will now test the model on new data it has never seen before.
test_data = {
    'Daily Time Spent on Site': 75.19,
    'Age': 41,
    'Area Income': 51473.48,
    'Daily Internet Usage': 174.26,
    'Ad Topic Line': 'Hi friend, click me please!!!',
    'City': 'Aakhen',
    'Male': 1,
    'Country': 'Cidonia',
    'Timestamp': '2016-05-21 10:32:19'
}
df_test = pd.DataFrame([test_data])
df_test
Dealing with Timestamp
df_test['Timestamp'] = pd.to_datetime(df_test['Timestamp'])
df_test['month'] = df_test['Timestamp'].dt.month
df_test['day'] = df_test['Timestamp'].dt.day
df_test['dayofweek'] = df_test['Timestamp'].dt.dayofweek
df_test['hour'] = df_test['Timestamp'].dt.hour
df_test.drop('Timestamp', axis=1, inplace=True)
Dealing with other features
df_test.drop('Ad Topic Line', axis=1, inplace=True)
df_test.drop('City', axis=1, inplace=True)
df_test.drop('Country', axis=1, inplace=True)
df.info()
Now that our test data has the same structure as the data that was ‘fed’ into the model, we can predict this new data.
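Since the same preprocessing has to be repeated for every new row, it can be wrapped in one function. This is a sketch of that idea; prepare_features and UNUSED_COLUMNS are hypothetical names, and the function simply mirrors the steps used in this walkthrough.

```python
import pandas as pd

# Columns the walkthrough drops before modelling.
UNUSED_COLUMNS = ['Ad Topic Line', 'City', 'Country']


def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the walkthrough's preprocessing to a copy of raw."""
    out = raw.copy()
    ts = pd.to_datetime(out['Timestamp'])
    out['month'] = ts.dt.month
    out['day'] = ts.dt.day
    out['dayofweek'] = ts.dt.dayofweek
    out['hour'] = ts.dt.hour
    return out.drop(columns=['Timestamp'] + UNUSED_COLUMNS)


new_row = pd.DataFrame([{
    'Daily Time Spent on Site': 75.19, 'Age': 41, 'Area Income': 51473.48,
    'Daily Internet Usage': 174.26, 'Ad Topic Line': 'Hi friend, click me please!!!',
    'City': 'Aakhen', 'Male': 1, 'Country': 'Cidonia',
    'Timestamp': '2016-05-21 10:32:19',
}])

prepared = prepare_features(new_row)
print(list(prepared.columns))
```

The resulting column order matches the order of the training features, which matters because the model identifies features by position and name.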
Load the model
model = joblib.load('log-reg-adv.joblib')
Prediction
pred = model.predict(df_test)
df_predicted = pd.DataFrame([test_data])
df_predicted['predicted'] = pred
df_predicted
The model predicted that when an internet user has the above features, they will click on the ad.
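A hard 0/1 prediction hides how confident the model is. Logistic regression also exposes class probabilities through predict_proba; the sketch below shows this on a toy one-feature model with synthetic numbers, not on the model trained above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature model; the numbers are synthetic.
X_demo = np.array([[20.0], [30.0], [40.0], [50.0], [60.0], [70.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

new_point = np.array([[65.0]])
label = demo_model.predict(new_point)[0]
# Columns of predict_proba follow demo_model.classes_, here [0, 1].
proba = demo_model.predict_proba(new_point)[0]

print(label, proba)
```

Applied to our case, model.predict_proba(df_test) would show how far the prediction for the new user is from the 0.5 decision threshold.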
Conclusion
So which user characteristics make a click on the ad more likely?
Age between 35–45 years old
sns.displot(x='Age', data=df, bins=30, kde=True, hue='Clicked on Ad')
Daily time spent on site less than 60 minutes
sns.displot(x='Daily Time Spent on Site', data=df, bins=30, kde=True, hue='Clicked on Ad')
Area income between 45000–55000
sns.displot(x='Area Income', data=df, bins=30, kde=True, hue='Clicked on Ad')
Daily internet usage less than 175 minutes
sns.displot(x='Daily Internet Usage', data=df, bins=30, kde=True, hue='Clicked on Ad')
In the middle of the month or near the end
sns.displot(x='day', data=df, kde=True, hue='Clicked on Ad')
In the late morning or around the time people head home from work or school
sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad')
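The ranges above are read off the plots by eye. One way to put numbers behind them is a per-class summary with groupby. The sketch below runs on a small synthetic frame with made-up values; on the real data the same call on df would give the actual figures.

```python
import pandas as pd

# Small synthetic frame with the same column names as the project's data.
demo = pd.DataFrame({
    'Age': [28, 31, 45, 42, 26, 39],
    'Daily Time Spent on Site': [80.2, 78.5, 52.1, 48.9, 84.0, 55.3],
    'Clicked on Ad': [0, 0, 1, 1, 0, 1],
})

# Per-class mean and median for each numeric feature.
summary = demo.groupby('Clicked on Ad').agg(['mean', 'median'])
print(summary)
```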