In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier

Predicting whether a customer will purchase an insurance policy after receiving a quote¶

In [2]:
import pytz
import datetime

timestamp = datetime.datetime.now(pytz.timezone('Europe/Helsinki'))
print("Notebook exported at", timestamp, "(Europe/Helsinki)")
Notebook exported at 2026-04-16 20:20:34.780711 (Europe/Helsinki)
In [3]:
insurance_quotes = pd.read_excel("input/telepass.xlsx", sheet_name="Insurance Quotes")
print("First 5 rows of insurance quotes")
insurance_quotes.head()
First 5 rows of insurance quotes
Out[3]:
client_id quotation_id driving_type car_immatriculation_date car_brand car_model insurance_expires_at birth_date gender county ... natural_events theft_fire kasko license_revoked collision vandalism key_loss price_sale price_full discount_percent
0 23789 3649 more_than_26 2011-07-20 FORD C-MAX 2020-06-12 NaN NaN FC ... NaN 22.70 NaN NaN 121.83 NaN NaN 325.48 325.48 0.000000
1 10133 28552 more_than_26 2007-03-01 FIAT - INNOCENTI FIAT CROMA 2020-01-22 NaN NaN MS ... NaN 22.70 NaN NaN 89.99 NaN NaN 271.25 306.80 0.115874
2 20785 31958 more_than_26 2017-08-30 NISSAN NISSAN QASHQAI 2020-10-02 1992-09-22 00:00:00 M SA ... NaN NaN NaN NaN NaN NaN NaN 640.84 684.01 0.063113
3 33892 27614 more_than_26 2017-07-25 FIAT - INNOCENTI FIAT PANDA 2020-07-25 1996-02-09 00:00:00 M NaN ... NaN NaN NaN NaN NaN NaN NaN 2185.38 2204.87 0.008840
4 34153 1529 more_than_26 2006-12-19 FIAT - INNOCENTI FIAT PUNTO 2020-10-13 2000-01-08 00:00:00 M CN ... NaN 24.34 NaN NaN 156.11 NaN NaN 207.68 231.20 0.101730

5 rows × 40 columns

In [4]:
transactions = pd.read_excel("input/telepass.xlsx", sheet_name="Transactions")
print("First 5 rows of transactions")
transactions.head()
First 5 rows of transactions
Out[4]:
client_id year_month service_type telepass_pay number_transactions expenditures
0 20785 2020-05-01 PARCHEGGI/TRAGHETTI 0 1 2.00
1 20785 2019-09-01 PARCHEGGI/TRAGHETTI 0 1 2.00
2 34153 2020-02-01 PARCHEGGI/TRAGHETTI 0 1 3.75
3 21141 2020-01-01 PARCHEGGI/TRAGHETTI 0 1 3.50
4 21141 2020-02-01 PARCHEGGI/TRAGHETTI 0 1 7.00
In [5]:
total_transactions = (
    transactions[["client_id", "number_transactions", "year_month", "expenditures"]]
    .sort_values(by=["client_id", "year_month"])
    .groupby(["client_id"])
    .agg(
        first_month_when_transactions=('year_month', 'min'),
        latest_month_when_transactions=('year_month', 'max'),
        total_year_months_when_transactions=('year_month', 'count'),
        total_number_transactions=('number_transactions', 'sum'),
        total_expenditures=('expenditures', 'sum')
    )
)
total_transactions.loc[total_transactions["total_year_months_when_transactions"].isna(), 'total_year_months_when_transactions'] = 0
total_transactions.head()
Out[5]:
first_month_when_transactions latest_month_when_transactions total_year_months_when_transactions total_number_transactions total_expenditures
client_id
2 2019-12-01 2020-01-01 2 2 615.707998
3 2019-12-01 2019-12-01 1 1 4.200000
5 2020-01-01 2020-01-01 1 1 5.000000
8 2019-07-01 2019-12-01 3 5 27.000000
10 2020-01-01 2020-05-01 2 2 24.700000
In [6]:
base_subscriptions = insurance_quotes[["client_id", "base_subscription"]].merge(
    total_transactions, on="client_id", how="left")

# Months between first and last transaction month, -1 when the client has no transactions.
has_transactions = (base_subscriptions["first_month_when_transactions"].notnull()
                    & base_subscriptions["latest_month_when_transactions"].notnull())
base_subscriptions.loc[has_transactions, 'months_between_first_last_transactions'] = (
    base_subscriptions["latest_month_when_transactions"].dt.to_period('M').astype(int)
    - base_subscriptions["first_month_when_transactions"].dt.to_period('M').astype(int))
base_subscriptions.loc[base_subscriptions["months_between_first_last_transactions"].isna(), 'months_between_first_last_transactions'] = -1

# Days from the base subscription to the first transaction month, -1 when either is missing.
has_subscription = (base_subscriptions["base_subscription"].notnull()
                    & base_subscriptions["first_month_when_transactions"].notnull())
base_subscriptions.loc[has_subscription, 'days_between_base_subscription_and_first_transaction'] = (
    base_subscriptions["first_month_when_transactions"].dt.to_period('D').astype(int)
    - base_subscriptions["base_subscription"].dt.to_period('D').astype(int))
base_subscriptions.loc[base_subscriptions["days_between_base_subscription_and_first_transaction"].isna(), 'days_between_base_subscription_and_first_transaction'] = -1

base_subscriptions = base_subscriptions.filter(items=["client_id", "months_between_first_last_transactions", "days_between_base_subscription_and_first_transaction"])

base_subscriptions.head()
Out[6]:
client_id months_between_first_last_transactions days_between_base_subscription_and_first_transaction
0 23789 -1.0 -1.0
1 10133 -1.0 -1.0
2 20785 8.0 6.0
3 33892 -1.0 -1.0
4 34153 0.0 156.0
In [7]:
subscription_status = insurance_quotes[["client_id", "driving_type", "pay_subscription", "pay_cancellation", "premium_subscription", "premium_cancellation"]].copy()

subscription_status.loc[subscription_status["pay_subscription"].notnull(), 'has_pay_subscribed'] = True
subscription_status.loc[subscription_status["pay_cancellation"] == "9998-12-31", 'active_pay_subscription'] = True
subscription_status.loc[(subscription_status["pay_cancellation"] != "9998-12-31") & (subscription_status["pay_cancellation"].notnull()), 'cancelled_pay_subscription'] = True

subscription_status.loc[subscription_status["premium_subscription"].notnull(), 'has_premium_subscribed'] = True
subscription_status.loc[subscription_status["premium_cancellation"] == "9998-12-31", 'active_premium_subscription'] = True
subscription_status.loc[(subscription_status["premium_cancellation"] != "9998-12-31") & (subscription_status["premium_cancellation"].notnull()), 'cancelled_premium_subscription'] = True

subscription_status.loc[subscription_status["driving_type"] == "more_than_26", 'age_group'] = 1
subscription_status.loc[subscription_status["driving_type"] == "between_23_and_25", 'age_group'] = 2
subscription_status.loc[subscription_status["driving_type"] == "less_than_23", 'age_group'] = 3

subscription_status = subscription_status.sort_values(by=["client_id"])
subscription_status = subscription_status[["client_id", "age_group", "has_pay_subscribed", "active_pay_subscription", "cancelled_pay_subscription", "has_premium_subscribed", "active_premium_subscription", "cancelled_premium_subscription"]]

subscription_status.head()
Out[7]:
client_id age_group has_pay_subscribed active_pay_subscription cancelled_pay_subscription has_premium_subscribed active_premium_subscription cancelled_premium_subscription
18155 1 1.0 True True NaN NaN NaN NaN
10634 2 1.0 True True NaN True NaN True
28159 3 1.0 NaN NaN NaN True True NaN
31965 4 1.0 NaN NaN NaN NaN NaN NaN
15937 5 1.0 True True NaN True NaN True
In [8]:
quotes_issued = insurance_quotes[["client_id", "issued"]].copy()
quotes_issued = quotes_issued.sort_values(by=["client_id"])
quotes_issued.loc[quotes_issued["issued"] == True, 'is_issued'] = 1
quotes_issued.loc[quotes_issued["is_issued"].isna(), 'is_issued'] = 0
quotes_issued = quotes_issued.drop(["issued"], axis=1)
quotes_issued.head()
Out[8]:
client_id is_issued
18155 1 1.0
10634 2 1.0
28159 3 0.0
31965 4 0.0
15937 5 0.0
In [9]:
merged = total_transactions.merge(subscription_status, on=["client_id"], how="right")
merged = merged.merge(base_subscriptions, on=["client_id"], how="left")

merged.loc[merged["active_pay_subscription"].notnull(), 'pay_subscription'] = 1
merged.loc[merged["cancelled_pay_subscription"].notnull(), 'pay_subscription'] = -1
merged.loc[merged["pay_subscription"].isna(), 'pay_subscription'] = 0

merged.loc[merged["active_premium_subscription"].notnull(), 'premium_subscription'] = 1
merged.loc[merged["cancelled_premium_subscription"].notnull(), 'premium_subscription'] = -1
merged.loc[merged["premium_subscription"].isna(), 'premium_subscription'] = 0

columns = ["age_group", "total_number_transactions", "total_expenditures", "pay_subscription", "premium_subscription", "total_year_months_when_transactions", "months_between_first_last_transactions", "days_between_base_subscription_and_first_transaction"]
merged = merged.filter(items=columns)

merged.loc[merged["total_number_transactions"].isna(), 'total_number_transactions'] = 0
merged.loc[merged["total_expenditures"].isna(), 'total_expenditures'] = 0
merged.loc[merged["total_year_months_when_transactions"].isna(), 'total_year_months_when_transactions'] = 0
merged.loc[merged["months_between_first_last_transactions"].isna(), 'months_between_first_last_transactions'] = -1
merged.loc[merged["days_between_base_subscription_and_first_transaction"].isna(), 'days_between_base_subscription_and_first_transaction'] = -1

merged.head()
Out[9]:
age_group total_number_transactions total_expenditures pay_subscription premium_subscription total_year_months_when_transactions months_between_first_last_transactions days_between_base_subscription_and_first_transaction
0 1.0 0.0 0.000000 1.0 0.0 0.0 -1.0 -1.0
1 1.0 2.0 615.707998 1.0 -1.0 2.0 1.0 914.0
2 1.0 1.0 4.200000 0.0 1.0 1.0 0.0 2034.0
3 1.0 0.0 0.000000 0.0 0.0 0.0 -1.0 -1.0
4 1.0 1.0 5.000000 1.0 -1.0 1.0 0.0 176.0

Training the prediction models¶

Preparing the data¶

In [10]:
random_state = 42
max_iter = 1000
X, y = merged, quotes_issued["is_issued"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

Logistic Regression¶

In [11]:
logistic_regression = LogisticRegression(max_iter=max_iter)
logistic_regression.fit(X_train, y_train)
logistic_regression_result = logistic_regression.predict(X_test)
In [12]:
logistic_regression_cv = LogisticRegression(max_iter=max_iter)
logistic_regression_cv_result = cross_validate(logistic_regression_cv, X, y)

Decision trees¶

In [13]:
decision_tree_classifier = DecisionTreeClassifier(random_state=random_state)
decision_tree_classifier.fit(X_train, y_train)
decision_tree_classifier_result = decision_tree_classifier.predict(X_test)
In [14]:
decision_tree_classifier_cv = DecisionTreeClassifier(random_state=random_state)
decision_tree_classifier_cv_result = cross_validate(decision_tree_classifier_cv, X, y)

Random forests¶

In [15]:
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=random_state),
                            n_iter=10,
                            param_distributions={'n_estimators': randint(1, 5),
                                                 'max_depth': randint(5, 10)},
                            random_state=random_state)
search.fit(X_train, y_train)
random_forest_result = search.score(X_test, y_test)

Analyzing the results¶

Logistic regression¶

In [16]:
accuracy_score(y_test, logistic_regression_result)
Out[16]:
0.7261921216309606
In [17]:
logistic_regression_cv_result["test_score"].mean()
Out[17]:
np.float64(0.7242418403916543)
Decision tree¶

In [18]:
accuracy_score(y_test, decision_tree_classifier_result)
Out[18]:
0.7505183137525916
In [19]:
decision_tree_classifier_cv_result["test_score"].mean()
Out[19]:
np.float64(0.7529647317369276)
Random forest¶

In [20]:
random_forest_result
Out[20]:
0.15799849437938862

Reasoning behind the models¶

What was your approach to building the model? What variables did you include, and what guided your modeling choices?

My approach to building the models was to use the same set of variables in all of them, so that the accuracy of the models could be compared directly. From the data perspective, the goal was to analyze how app usage affects the decision to buy an insurance policy.

Variable selection started by mapping the transactions table into one row per client id. From the transactions I focused on the columns number_transactions, year_month and expenditures, and computed aggregates from them: the first and last month in which transactions occurred (and the number of months between them), the count of months with transactions, the total number of transactions, and the total amount of money spent. I also calculated the number of days between the client's base subscription to the Telepass service and the first month with transactions. This data was combined with the TelepassPay and Telepass Premium subscription states, because I wanted to see how having additional subscriptions affected the decision to take an insurance policy.
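The month-difference features rely on pandas period arithmetic: converting timestamps to monthly periods and casting them to integers gives ordinal month counts, so subtracting two of them yields whole months. A minimal sketch of the idea, using made-up dates rather than the Telepass data:

```python
import pandas as pd

# Two example transaction months (made up for illustration).
s = pd.to_datetime(pd.Series(["2019-09-01", "2020-05-01"]))

# to_period('M') maps each timestamp to its month; astype(int) yields
# the month's ordinal number, so differences count whole months.
months = s.dt.to_period('M').astype(int)
diff = int(months.iloc[1] - months.iloc[0])
print(diff)  # 8
```

The same pattern with `to_period('D')` produces day ordinals, which is how the days_between_base_subscription_and_first_transaction feature is computed.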

How did you evaluate the model? Which performance metric(s) guided your decisions, and why?

I evaluated the models in two ways: with a single train/test split, putting 20% of the data into testing and the rest into training, and, for comparison, by running the same data through 5-fold cross-validation. The only performance metric used was accuracy, i.e. how well the model predicted whether the user would take the insurance policy.
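When accuracy is the only metric, it is worth comparing it against the majority-class baseline: always predicting the most common class already scores the share of that class. A minimal sketch with a synthetic label vector (the 65/35 split here is made up, not computed from the Telepass data):

```python
import numpy as np

# Hypothetical labels: imagine 65% of quotes are not issued.
y = np.array([0] * 65 + [1] * 35)

# Accuracy of always predicting the majority class.
majority_baseline = max(y.mean(), 1 - y.mean())
print(majority_baseline)  # 0.65
```

A model's accuracy only adds value to the extent that it exceeds this baseline, so checking the class balance of is_issued would put the 72-75% figures below in context.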

How did the model perform on the data? What were its main strengths, weaknesses, and limitations?

Logistic regression and the decision tree performed quite similarly on the data; the difference between them is marginal. Both models achieved over 70% accuracy in predicting whether a user will take the insurance policy.

The main weakness of the models is that they do not adequately take age-related features into account; for example, older people may prefer more traditional ways of dealing with insurance policies.

The random forest model performed poorly. In the first rounds of testing it produced scores around 0.05; after adding more data to the selection process it improved to the current 0.16. Part of the gap, however, is that this model uses RandomForestRegressor, whose score method returns R² rather than accuracy, so the number is not directly comparable to the classification models above. In its current form I would ignore it and continue optimizing the other models presented.
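A like-for-like comparison would swap in RandomForestClassifier, whose score method returns accuracy. A minimal sketch of that setup with the same search space; the feature matrix and target below are synthetic stand-ins, not the Telepass data:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the 8 engineered features and binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                            n_iter=10,
                            param_distributions={'n_estimators': randint(1, 5),
                                                 'max_depth': randint(5, 10)},
                            random_state=42)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)  # classifier score is accuracy
print(accuracy)
```

With a classifier, the resulting score can be compared directly against the logistic regression and decision tree accuracies above.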

The limitations of the models come from the fairly narrow data selection process. Better analytics would require more transformation work before training the models.

Based on your analysis, which model performed best in predicting insurance purchase decisions? Explain why, and briefly discuss what your results imply for Telepass.

Based on this analysis, I would continue with the decision tree model, which showed the most accurate results. I see it as a starting point for further optimization, for example by adding more user background information. For Telepass, these results point in a promising direction: they would enable more targeted offers toward the app users.

Sources¶

The source code of the analysis is available at https://github.com/FinThunderstorm/product-analytics-t6-2 and the built notebook at https://product-analytics-t6-2.alanen.dev.