Will They Click on the Advertisement?

Jafar Shodiq
Jan 26, 2021

In this project, we will work with a made-up advertising dataset that indicates whether or not a particular internet user clicked on an advertisement. We will try to build a model that predicts whether a user will click on an ad based on that user's features.

This project and the dataset can also be found on my GitHub.

Exploratory Data Analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_palette('viridis')
df = pd.read_csv('advertising.csv')
df.head()
Data
df.info()
Data info

This data set contains the following features:

  • ‘Daily Time Spent on Site’: consumer time on site in minutes
  • ‘Age’: customer age in years
  • ‘Area Income’: Avg. Income of geographical area of consumer
  • ‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
  • ‘Ad Topic Line’: Headline of the advertisement
  • ‘City’: City of consumer
  • ‘Male’: Whether or not consumer was male
  • ‘Country’: Country of consumer
  • ‘Timestamp’: Time at which consumer clicked on Ad or closed window
  • ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad
df.describe()
Data describe
df.isna().sum()
Missing value

Visualize the data

sns.pairplot(df, hue='Clicked on Ad')
Pairplot
sns.countplot(x='Clicked on Ad', data=df)
plt.title('Label Count')
Label count

Both labels have the same count, so the classes are balanced.
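
The count plot shows the balance visually; the same check can be done numerically. Below is a minimal sketch using a tiny made-up frame (`demo`) in place of the real `df`, so the numbers are illustrative only.

```python
import pandas as pd

# Tiny made-up stand-in for the article's `df`; only the label column matters here.
demo = pd.DataFrame({"Clicked on Ad": [0, 1, 0, 1, 1, 0]})

# Absolute number of rows per class.
counts = demo["Clicked on Ad"].value_counts()

# Share of rows per class; values near 0.5 indicate a balanced binary label.
shares = demo["Clicked on Ad"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, replacing `demo` with `df` gives the per-class counts behind the plot.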

sns.displot(x='Age', data=df, bins=30, kde=True)
plt.title('Age Distributions')
Age distributions

It looks like users aged 30–40 dominate our data.

sns.jointplot(x='Age', y='Area Income', data=df, hue='Clicked on Ad')
plt.title('Age vs Area Income')
Age vs area income
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=df, kind='kde')
plt.title('Age vs Daily Time Spent on Site')
Age vs daily time spent on site
sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=df, hue='Clicked on Ad')
plt.title('Daily Time Spent on Site vs Daily Internet Usage')
Daily time spent on site vs daily internet usage

Datetime Object

The Timestamp column is still stored as strings, so we need to convert it to datetime.

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Timestamp'].head()
Convert str to datetime

From this datetime object, we can extract features like year, month, day, etc.

df['year'] = df['Timestamp'].apply(lambda x: x.year)
df['month'] = df['Timestamp'].apply(lambda x: x.month)
df['day'] = df['Timestamp'].apply(lambda x: x.day)
df['dayofweek'] = df['Timestamp'].apply(lambda x: x.dayofweek)
df['hour'] = df['Timestamp'].apply(lambda x: x.hour)
df.head()
Data plus datetime features
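
As an alternative to `apply` with a lambda, pandas also offers the vectorized `.dt` accessor, which reads the same fields for the whole column at once. Here is a small sketch on two made-up timestamps (the `demo` frame is an assumption, not part of the dataset).

```python
import pandas as pd

# Made-up timestamps standing in for the article's 'Timestamp' column.
demo = pd.DataFrame({"Timestamp": ["2016-03-27 00:53:11", "2016-04-04 01:39:02"]})
demo["Timestamp"] = pd.to_datetime(demo["Timestamp"])

# The .dt accessor exposes datetime fields for the entire column.
demo["month"] = demo["Timestamp"].dt.month
demo["day"] = demo["Timestamp"].dt.day
demo["dayofweek"] = demo["Timestamp"].dt.dayofweek  # 0 = Monday, 6 = Sunday
demo["hour"] = demo["Timestamp"].dt.hour

print(demo)
```

Both approaches produce the same values; the accessor is usually shorter and faster on large columns.
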
df['year'].value_counts()
Year unique value

Since there is only 1 unique year value, we’re going to drop this.

df['month'].value_counts()
Month unique value

Not all months are present, but it’s not a problem

df['day'].nunique()
Day unique value
df['dayofweek'].value_counts()
Day of week unique value

0: Monday, 6: Sunday.

df['hour'].nunique()
Hour unique value
sns.countplot(x='month', data=df, palette='winter')
plt.title('Month Count')
Month count
sns.countplot(x='dayofweek', data=df, palette='summer_r')
plt.title('Day of Week Count')
plt.xlabel('dayofweek. 0=monday, 6=sunday')
Day of week count
sns.displot(x='day', data=df, alpha=0.4, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Day Distributions')
Day distributions
sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Hour Distributions')
Hour distributions

Based on the day plot, ads are clicked mostly in the middle of the month and near the end of the month. Based on the hour plot, ads are clicked mostly in the late morning and, probably, around the time people head home from work or school.

Let’s drop the Timestamp and year columns.

df.drop('Timestamp', axis=1, inplace=True)
df.drop('year', axis=1, inplace=True)

Categorical Features

df.info()
Data info

Ad Topic Line

df['Ad Topic Line'].nunique()
Ad topic line unique value
df['Ad Topic Line'].head(30)
Ad topic line values

All of them are unique values. Maybe we can extract some features with natural language processing, but it would take a long time. For now, we will not do that and will not be using this column.
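
Short of full natural language processing, a few cheap numeric features can still be pulled from a text column. The sketch below is only an illustration of that idea, on made-up headlines; the column names `topic_word_count` and `topic_char_count` are hypothetical, and we do not use them in the model in this article.

```python
import pandas as pd

# Made-up headlines standing in for the 'Ad Topic Line' column.
demo = pd.DataFrame({"Ad Topic Line": [
    "Hi friend, click me please!!!",
    "Big summer sale today",
]})

# Number of whitespace-separated words in each headline.
demo["topic_word_count"] = demo["Ad Topic Line"].str.split().str.len()

# Number of characters in each headline.
demo["topic_char_count"] = demo["Ad Topic Line"].str.len()

print(demo)
```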

City

df['City'].nunique()
City unique value

This column also has almost as many unique values as there are rows. We won’t be using this column either.
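
One way to put a number on "almost as many unique values as rows" is the ratio of distinct values to row count. This is a minimal sketch on a made-up frame; the variable name `cardinality_ratio` is our own.

```python
import pandas as pd

# Made-up stand-in: 'City' is unique on every row, 'Country' repeats.
demo = pd.DataFrame({
    "City": ["Aaa", "Bbb", "Ccc", "Ddd"],
    "Country": ["X", "X", "Y", "Y"],
    "Male": [0, 1, 0, 1],
})

# Fraction of rows holding a distinct value, for the text columns only.
text_cols = ["City", "Country"]
cardinality_ratio = demo[text_cols].nunique() / len(demo)

print(cardinality_ratio)
```

A ratio close to 1.0 means nearly every row has its own value, so the column carries little that a simple model can generalize from.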

Country

df['Country'].nunique()
Country unique value

We can actually group this column by continents like Asia, Africa, North America, etc. But it would also take time to list them. We will do it next time, but for now we won’t be using this column.
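
For reference, the grouping idea could look roughly like the sketch below. The lookup table `country_to_continent` is hypothetical and deliberately incomplete; a real one would need an entry for every country in the data.

```python
import pandas as pd

# Made-up countries; 'Atlantis' is intentionally missing from the lookup.
demo = pd.DataFrame({"Country": ["Indonesia", "Brazil", "Kenya", "Atlantis"]})

# Hypothetical, incomplete country-to-continent lookup table.
country_to_continent = {
    "Indonesia": "Asia",
    "Brazil": "South America",
    "Kenya": "Africa",
}

# Map each country to its continent; unknown countries fall back to 'Other'.
demo["Continent"] = demo["Country"].map(country_to_continent).fillna("Other")

# One-hot encode the continent so a model can consume it.
continent_dummies = pd.get_dummies(demo["Continent"], prefix="continent")

print(demo)
print(continent_dummies)
```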

Let’s drop these columns.

df.drop('Ad Topic Line', axis=1, inplace=True)
df.drop('City', axis=1, inplace=True)
df.drop('Country', axis=1, inplace=True)

Here is what we have done so far:

  • Converted Timestamp to a datetime object and extracted month, day, dayofweek, and hour
  • Dropped Timestamp, Ad Topic Line, City, and Country

We can now move on to the next step, which is building the model.

Build the Model

from sklearn.model_selection import train_test_split
X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
len(X_train), len(X_test)
Length of train and test

We will use logistic regression for the classification.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=300)
model.fit(X_train, y_train)
Logistic regression
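
The features here live on very different scales (area income in the tens of thousands, `Male` as 0 or 1). A common option, which is not what we do in this article, is to standardize the features before fitting. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo`, and `scaled_model` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one large-scale feature and one small-scale feature.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(55000, 13000, n)   # large scale, unrelated to the label
minutes = rng.normal(65, 15, n)        # small scale, drives the label
X_demo = np.column_stack([income, minutes])
y_demo = (minutes < 65).astype(int)

# Standardize each feature, then fit logistic regression on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print(train_accuracy)
```

Because the scaler sits inside the pipeline, the same scaling is applied automatically whenever `scaled_model.predict` is called on new rows.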

Predictions and Evaluations

from sklearn.metrics import classification_report, confusion_matrix
preds = model.predict(X_test)
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
Classification report and confusion matrix

The model did a pretty good job with 90% accuracy. Now let’s save the model.
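
To see how accuracy relates to the confusion matrix, here is a small worked example. The matrix below is hypothetical, chosen only to illustrate the arithmetic; it is not the output of our model. It follows scikit-learn's layout, with rows as actual labels and columns as predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted).
cm = np.array([[140, 10],
               [20, 130]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted clicks that were real clicks
recall = tp / (tp + fn)           # share of real clicks that were predicted

print(accuracy, precision, recall)
```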

import joblib
joblib.dump(model, 'log-reg-adv.joblib')

Test on Brand New Data

We will now test the model on new data it has never seen before.

test_data = {
'Daily Time Spent on Site': 75.19,
'Age': 41,
'Area Income': 51473.48,
'Daily Internet Usage': 174.26,
'Ad Topic Line': 'Hi friend, click me please!!!',
'City': 'Aakhen',
'Male': 1,
'Country': 'Cidonia',
'Timestamp': '2016-05-21 10:32:19'
}
df_test = pd.DataFrame([test_data])
df_test
Test data

Dealing with Timestamp

df_test['Timestamp'] = pd.to_datetime(df_test['Timestamp'])
df_test['month'] = df_test['Timestamp'].apply(lambda x: x.month)
df_test['day'] = df_test['Timestamp'].apply(lambda x: x.day)
df_test['dayofweek'] = df_test['Timestamp'].apply(lambda x: x.dayofweek)
df_test['hour'] = df_test['Timestamp'].apply(lambda x: x.hour)
df_test.drop('Timestamp', axis=1, inplace=True)

Dealing with other features

df_test.drop('Ad Topic Line', axis=1, inplace=True)
df_test.drop('City', axis=1, inplace=True)
df_test.drop('Country', axis=1, inplace=True)
df_test.info()
Test info

Now that our test data has the same structure as the data that was ‘fed’ into the model, we can predict this new data.
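
Since the same steps have to be repeated for every new row, they could be wrapped in one function. The sketch below is one possible way to do that; `prepare_features` and `DROPPED_COLUMNS` are hypothetical names, and the function assumes the input has the same raw columns as the original dataset (without the label).

```python
import pandas as pd

# Columns the article removes before modelling.
DROPPED_COLUMNS = ["Timestamp", "Ad Topic Line", "City", "Country"]

def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the article's preprocessing steps to a copy of `raw`."""
    out = raw.copy()
    ts = pd.to_datetime(out["Timestamp"])
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["dayofweek"] = ts.dt.dayofweek  # 0 = Monday, 6 = Sunday
    out["hour"] = ts.dt.hour
    return out.drop(columns=DROPPED_COLUMNS)

# The same single test row used in the article.
new_row = pd.DataFrame([{
    "Daily Time Spent on Site": 75.19,
    "Age": 41,
    "Area Income": 51473.48,
    "Daily Internet Usage": 174.26,
    "Ad Topic Line": "Hi friend, click me please!!!",
    "City": "Aakhen",
    "Male": 1,
    "Country": "Cidonia",
    "Timestamp": "2016-05-21 10:32:19",
}])

features = prepare_features(new_row)
print(list(features.columns))
```

Keeping the steps in one place makes it harder for the training data and new data to drift apart in column names or order.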

Load the model

model = joblib.load('log-reg-adv.joblib')

Prediction

pred = model.predict(df_test)
df_predicted = pd.DataFrame([test_data])
df_predicted['predicted'] = pred
df_predicted
Predicted test

The model predicted that when an internet user has the above features, they will click on the ad.
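
Besides the hard 0/1 label, logistic regression can also report how confident it is through `predict_proba`. The sketch below uses a small synthetic model (`demo_model`, trained on made-up data) rather than the saved one, so the printed numbers say nothing about our dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: the label depends only on the first feature.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# One new observation to score.
new_point = np.array([[1.5, 0.0]])

# Columns follow demo_model.classes_, here [0, 1]: P(no click), P(click).
proba = demo_model.predict_proba(new_point)
label = demo_model.predict(new_point)

print(proba, label)
```

With the real model, the equivalent call would be `model.predict_proba(df_test)`.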

Conclusion

So what kind of user is likely to click on the ad?

Age between 35 and 45 years

sns.displot(x='Age', data=df, bins=30, kde=True, hue='Clicked on Ad')
Age

Daily time spent on site of less than 60 minutes

sns.displot(x='Daily Time Spent on Site', data=df, bins=30, kde=True, hue='Clicked on Ad')
Daily time on site

Area income between 45,000 and 55,000

sns.displot(x='Area Income', data=df, bins=30, kde=True, hue='Clicked on Ad')
Area income

Daily internet usage of less than 175 minutes

sns.displot(x='Daily Internet Usage', data=df, bins=30, kde=True, hue='Clicked on Ad')
Daily internet usage

In the middle of the month or near its end

sns.displot(x='day', data=df, kde=True, hue='Clicked on Ad')
Day

In the late morning, or around the time people head home from work or school

sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad')
Hour
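
The ranges above are read off the plots by eye. One way to back them with numbers is to bin a feature and compute the click rate per bin. The sketch below does this for age on a tiny made-up frame; the bin edges and the data are assumptions, so the resulting rates are illustrative only.

```python
import pandas as pd

# Made-up ages and labels standing in for the article's `df`.
demo = pd.DataFrame({
    "Age": [22, 28, 31, 38, 41, 44, 52, 58],
    "Clicked on Ad": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Right-closed bins: (15, 25], (25, 35], (35, 45], (45, 65]
bins = [15, 25, 35, 45, 65]
demo["age_band"] = pd.cut(demo["Age"], bins=bins)

# Mean of a 0/1 label within a group is that group's click rate.
click_rate = demo.groupby("age_band", observed=True)["Clicked on Ad"].mean()

print(click_rate)
```

The same pattern works for the other numeric features, such as daily time spent on site or area income, by swapping the column and the bin edges.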