Will They Click on the Advertisement?

Jan 26, 2021

In this project, we will work with a made-up advertising dataset that indicates whether or not a particular internet user clicked on an advertisement. We will try to create a model that predicts whether or not a user will click on an ad based on that user’s features.

This project and the dataset can also be found on my GitHub.

Exploratory Data Analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')
sns.set_palette('viridis')

df = pd.read_csv('advertising.csv')
df.head()

df.info()

This data set contains the following features:

‘Daily Time Spent on Site’: consumer time on site in minutes
‘Age’: customer age in years
‘Area Income’: Avg. Income of geographical area of consumer
‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
‘Ad Topic Line’: Headline of the advertisement
‘City’: City of consumer
‘Male’: Whether or not consumer was male
‘Country’: Country of consumer
‘Timestamp’: Time at which consumer clicked on Ad or closed window
‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the Ad

df.describe()

df.isna().sum()

Visualize the data

sns.pairplot(df, hue='Clicked on Ad')

sns.countplot(x='Clicked on Ad', data=df)
plt.title('Label Count')

Both labels have the same count, so the classes are balanced.
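The same check can be done numerically instead of by eye. Below is a minimal sketch, assuming a small made-up label series in place of `df['Clicked on Ad']`; the values are hypothetical and only illustrate the call.

```python
import pandas as pd

# Hypothetical stand-in for df['Clicked on Ad']; the values are made up.
labels = pd.Series([0, 1, 1, 0, 1, 0], name='Clicked on Ad')

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```

With a balanced target, plain accuracy is a reasonable first metric, which is why we can read it directly from the classification report later on.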

sns.displot(x='Age', data=df, bins=30, kde=True)
plt.title('Age Distributions')

It looks like ages 30–40 dominate our data.

sns.jointplot(x='Age', y='Area Income', data=df, hue='Clicked on Ad')
plt.title('Age vs Area Income')

sns.jointplot(x='Age', y='Daily Time Spent on Site', data=df, kind='kde')
plt.title('Age vs Daily Time Spent on Site')

sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=df, hue='Clicked on Ad')
plt.title('Daily Time Spent on Site vs Daily Internet Usage')

Daily time spent on site vs daily internet usage

Datetime Object

The Timestamp column is still stored as strings, so we need to convert it to datetime.

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Timestamp'].head()

From this datetime object, we can extract features like year, month, day, etc.

df['year'] = df['Timestamp'].dt.year
df['month'] = df['Timestamp'].dt.month
df['day'] = df['Timestamp'].dt.day
df['dayofweek'] = df['Timestamp'].dt.dayofweek
df['hour'] = df['Timestamp'].dt.hour
df.head()

df['year'].value_counts()

Year unique value

Since there is only one unique year, the column carries no information, so we’re going to drop it.

df['month'].value_counts()

Not all months are present, but that’s not a problem.

df['day'].nunique()

Day unique value

df['dayofweek'].value_counts()

0: Monday, 6: Sunday.
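To double-check that numbering, here is a small sketch using one arbitrary timestamp. It relies on the pandas `.dt` accessor; `day_name()` is used only to make the number readable and is not part of the original workflow.

```python
import pandas as pd

# A one-row Series holding an arbitrary example timestamp.
ts = pd.Series(pd.to_datetime(['2016-05-21 10:32:19']))

day_number = ts.dt.dayofweek.iloc[0]   # Monday=0 ... Sunday=6
day_name = ts.dt.day_name().iloc[0]    # English weekday name
print(day_number, day_name)            # 5 Saturday
```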

df['hour'].nunique()

Hour unique value

sns.countplot(x='month', data=df, palette='winter')
plt.title('Month Count')

sns.countplot(x='dayofweek', data=df, palette='summer_r')
plt.title('Day of Week Count')
plt.xlabel('dayofweek. 0=monday, 6=sunday')

sns.displot(x='day', data=df, alpha=0.4, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Day Distributions')

sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad', palette='Dark2')
plt.title('Hour Distributions')

Based on the day plot, ads are clicked mostly in the middle of the month and near the end of the month. Based on the hour plot, ads are clicked mostly in the late morning and, probably, when people are going home from work or school.

Let’s drop the Timestamp and year columns.

df.drop('Timestamp', axis=1, inplace=True)
df.drop('year', axis=1, inplace=True)

Categorical Features

df.info()

Ad Topic Line

df['Ad Topic Line'].nunique()

Ad topic line unique value

df['Ad Topic Line'].head(30)

Every value is unique. We could extract some features with natural language processing, but that would take a long time. For now, we will not do that and will not use this column.

City

df['City'].nunique()

City unique value

This column also has almost as many unique values as there are rows. We won’t be using this column either.

Country

df['Country'].nunique()

Country unique value

We could group this column by continent (Asia, Africa, North America, etc.), but listing the countries would also take time. We will do it next time; for now we won’t be using this column.
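For reference, the continent grouping could look roughly like the sketch below. The lookup table `CONTINENT_BY_COUNTRY` and the sample country names are hypothetical and deliberately incomplete; a real version would need an entry for every country in the dataset.

```python
import pandas as pd

# Hypothetical, deliberately incomplete lookup table (an assumption for illustration).
CONTINENT_BY_COUNTRY = {
    'Indonesia': 'Asia',
    'Japan': 'Asia',
    'Kenya': 'Africa',
    'Canada': 'North America',
}

# Made-up sample values standing in for df['Country'].
countries = pd.Series(['Japan', 'Kenya', 'Canada', 'Atlantis'], name='Country')

# Countries missing from the lookup become NaN, so give them an explicit bucket.
continents = countries.map(CONTINENT_BY_COUNTRY).fillna('Other')

# One-hot encode so a linear model can use the grouped feature.
continent_dummies = pd.get_dummies(continents, prefix='continent')
print(continent_dummies.columns.tolist())
```

Grouping first keeps the number of dummy columns small, which matters when the raw column has hundreds of distinct values.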

Let’s drop these columns.

df.drop('Ad Topic Line', axis=1, inplace=True)
df.drop('City', axis=1, inplace=True)
df.drop('Country', axis=1, inplace=True)

So here is what we have done so far:

Converted Timestamp to a datetime object and extracted month, day, dayofweek, and hour
Dropped Timestamp, year, Ad Topic Line, City, and Country

We can now move on to the next step, which is building the model.
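Since the same steps have to be repeated on any new data before prediction, it can help to wrap them in one function. The sketch below is one possible way to do that; the function name `prepare_features`, the constant `UNUSED_COLUMNS`, and the sample row are assumptions made for illustration, not part of the original notebook. It skips the year column entirely because we drop it anyway.

```python
import pandas as pd

# Columns we decided not to use (hypothetical constant name).
UNUSED_COLUMNS = ['Timestamp', 'Ad Topic Line', 'City', 'Country']

def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of raw with datetime parts added and unused columns dropped."""
    out = raw.copy()
    timestamp = pd.to_datetime(out['Timestamp'])
    out['month'] = timestamp.dt.month
    out['day'] = timestamp.dt.day
    out['dayofweek'] = timestamp.dt.dayofweek
    out['hour'] = timestamp.dt.hour
    return out.drop(columns=UNUSED_COLUMNS)

# Made-up single row, only to show the call.
sample = pd.DataFrame([{
    'Daily Time Spent on Site': 60.0,
    'Age': 30,
    'Area Income': 50000.0,
    'Daily Internet Usage': 200.0,
    'Ad Topic Line': 'Example headline',
    'City': 'Example City',
    'Male': 0,
    'Country': 'Example Country',
    'Timestamp': '2016-03-27 00:53:11',
}])
features = prepare_features(sample)
print(features.columns.tolist())
```

Because the function returns a copy, the raw frame stays untouched and the same call works for both training data and new rows.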

Build the Model

from sklearn.model_selection import train_test_split

X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
len(X_train), len(X_test)

Length of train and test
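To see how `test_size=0.3` translates into row counts, here is a small sketch on ten made-up rows. The arrays `X_demo` and `y_demo` are hypothetical and have nothing to do with the advertising data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each and a binary label.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# 30% of 10 rows go to the test set, the remaining 70% to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(len(X_tr), len(X_te))   # 7 3
```

Fixing `random_state` makes the split reproducible, so rerunning the notebook gives the same train and test rows.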

We will be using logistic regression for the classification.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=300)
model.fit(X_train, y_train)

Logistic regression
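As a reminder of what the fitted model computes, the sketch below reproduces the predicted probability by hand on a tiny made-up dataset: a linear score passed through the sigmoid function. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical and separate from the model trained above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, binary label.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

demo_model = LogisticRegression(max_iter=300).fit(X_demo, y_demo)

# Linear score w*x + b for each row, then the logistic (sigmoid) function.
scores = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# For a binary problem, column 1 of predict_proba is P(label == 1).
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

A row is assigned to class 1 when this probability is above 0.5, which is the same as the linear score being positive.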

Predictions and Evaluations

from sklearn.metrics import classification_report, confusion_matrix

preds = model.predict(X_test)

print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Classification report and confusion matrix

The model did a pretty good job, with 90% accuracy on the test set. Now let’s save the model.
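The accuracy figure can be derived directly from a confusion matrix: correct predictions on the diagonal divided by all predictions. Here is a worked sketch on made-up labels (these are not the results of our model).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, not the article's results.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

# Accuracy = (true positives + true negatives) / all predictions.
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)   # 0.8
```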

import joblib

joblib.dump(model, 'log-reg-adv.joblib')

Test on Brand New Data

We will now test the model on new data it has never seen before.

test_data = {
    'Daily Time Spent on Site': 75.19,
    'Age': 41,
    'Area Income': 51473.48,
    'Daily Internet Usage': 174.26,
    'Ad Topic Line': 'Hi friend, click me please!!!',
    'City': 'Aakhen',
    'Male': 1,
    'Country': 'Cidonia',
    'Timestamp': '2016-05-21 10:32:19'
}

df_test = pd.DataFrame([test_data])
df_test

Test data

Dealing with Timestamp

df_test['Timestamp'] = pd.to_datetime(df_test['Timestamp'])

df_test['month'] = df_test['Timestamp'].dt.month
df_test['day'] = df_test['Timestamp'].dt.day
df_test['dayofweek'] = df_test['Timestamp'].dt.dayofweek
df_test['hour'] = df_test['Timestamp'].dt.hour

df_test.drop('Timestamp', axis=1, inplace=True)

Dealing with other features

df_test.drop('Ad Topic Line', axis=1, inplace=True)
df_test.drop('City', axis=1, inplace=True)
df_test.drop('Country', axis=1, inplace=True)

# Inspect the test frame (not the training frame) to confirm its structure
df_test.info()

Now that our test data has the same structure as the data that was ‘fed’ into the model, we can predict this new data.
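“Same structure” here means the same columns in the same order as at fit time. A defensive check could look like the sketch below; `train_columns` and `new_row` are hypothetical stand-ins (in the notebook, the training column list would come from `X.columns`).

```python
import pandas as pd

# Hypothetical column order seen at fit time (in the notebook: list(X.columns)).
train_columns = ['Age', 'Male', 'hour']

# A new row whose columns arrive in a different order, plus one extra column.
new_row = pd.DataFrame([{'hour': 10, 'Age': 41, 'City': 'Aakhen', 'Male': 1}])

# Fail early if a column the model expects is absent.
missing = set(train_columns) - set(new_row.columns)
if missing:
    raise ValueError(f'Missing columns: {sorted(missing)}')

# Selecting by the training column list drops extras and fixes the order.
aligned = new_row[train_columns]
print(aligned.columns.tolist())   # ['Age', 'Male', 'hour']
```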

Load the model

model = joblib.load('log-reg-adv.joblib')

Prediction

pred = model.predict(df_test)

df_predicted = pd.DataFrame([test_data])
df_predicted['predicted'] = pred
df_predicted

Predicted test

The model predicted that when an internet user has the above features, they will click on the ad.

Conclusion

So what sort of user features make a click on the ad likely?

Age between 35 and 45 years old

sns.displot(x='Age', data=df, bins=30, kde=True, hue='Clicked on Ad')

Daily time spent on site of less than 60 minutes

sns.displot(x='Daily Time Spent on Site', data=df, bins=30, kde=True, hue='Clicked on Ad')

Area income between 45,000 and 55,000

sns.displot(x='Area Income', data=df, bins=30, kde=True, hue='Clicked on Ad')

Daily internet usage of less than 175 minutes

sns.displot(x='Daily Internet Usage', data=df, bins=30, kde=True, hue='Clicked on Ad')

In the middle of the month or near the end

sns.displot(x='day', data=df, kde=True, hue='Clicked on Ad')

At around late morning or when people are going home from work or school

sns.displot(x='hour', data=df, kde=True, hue='Clicked on Ad')
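The ranges above are read off the plots by eye. One way to back them with numbers is to bin a feature and compute the click rate per bin, as in the sketch below. The rows in `demo` and the bin edges are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
demo = pd.DataFrame({
    'Age': [22, 28, 31, 36, 38, 41, 44, 52],
    'Clicked on Ad': [0, 0, 0, 1, 1, 1, 0, 1],
})

# Bin ages into ranges, then take the mean of the 0/1 label per bin,
# which is the share of users in that bin who clicked.
age_bins = pd.cut(demo['Age'], bins=[20, 35, 45, 60])
click_rate = demo.groupby(age_bins, observed=True)['Clicked on Ad'].mean()
print(click_rate)
```

The same pattern works for the other numeric columns, such as daily time spent on site or hour, by swapping the column name and bin edges.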