Predicting whether a customer will purchase an insurance policy after receiving a quote¶
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
import pytz
import datetime
timestamp = datetime.datetime.now(pytz.timezone('Europe/Helsinki'))
print("Notebook exported at", timestamp, "(Europe/Helsinki)")
Notebook exported at 2026-04-16 20:20:34.780711 (Europe/Helsinki)
insurance_quotes = pd.read_excel("input/telepass.xlsx", sheet_name="Insurance Quotes")
print("First 5 rows of insurance quotes")
insurance_quotes.head()
First 5 rows of insurance quotes
| client_id | quotation_id | driving_type | car_immatriculation_date | car_brand | car_model | insurance_expires_at | birth_date | gender | county | ... | natural_events | theft_fire | kasko | license_revoked | collision | vandalism | key_loss | price_sale | price_full | discount_percent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23789 | 3649 | more_than_26 | 2011-07-20 | FORD | C-MAX | 2020-06-12 | NaN | NaN | FC | ... | NaN | 22.70 | NaN | NaN | 121.83 | NaN | NaN | 325.48 | 325.48 | 0.000000 |
| 1 | 10133 | 28552 | more_than_26 | 2007-03-01 | FIAT - INNOCENTI | FIAT CROMA | 2020-01-22 | NaN | NaN | MS | ... | NaN | 22.70 | NaN | NaN | 89.99 | NaN | NaN | 271.25 | 306.80 | 0.115874 |
| 2 | 20785 | 31958 | more_than_26 | 2017-08-30 | NISSAN | NISSAN QASHQAI | 2020-10-02 | 1992-09-22 00:00:00 | M | SA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 640.84 | 684.01 | 0.063113 |
| 3 | 33892 | 27614 | more_than_26 | 2017-07-25 | FIAT - INNOCENTI | FIAT PANDA | 2020-07-25 | 1996-02-09 00:00:00 | M | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2185.38 | 2204.87 | 0.008840 |
| 4 | 34153 | 1529 | more_than_26 | 2006-12-19 | FIAT - INNOCENTI | FIAT PUNTO | 2020-10-13 | 2000-01-08 00:00:00 | M | CN | ... | NaN | 24.34 | NaN | NaN | 156.11 | NaN | NaN | 207.68 | 231.20 | 0.101730 |
5 rows × 40 columns
transactions = pd.read_excel("input/telepass.xlsx", sheet_name="Transactions")
print("First 5 rows of transactions")
transactions.head()
First 5 rows of transactions
| client_id | year_month | service_type | telepass_pay | number_transactions | expenditures | |
|---|---|---|---|---|---|---|
| 0 | 20785 | 2020-05-01 | PARCHEGGI/TRAGHETTI | 0 | 1 | 2.00 |
| 1 | 20785 | 2019-09-01 | PARCHEGGI/TRAGHETTI | 0 | 1 | 2.00 |
| 2 | 34153 | 2020-02-01 | PARCHEGGI/TRAGHETTI | 0 | 1 | 3.75 |
| 3 | 21141 | 2020-01-01 | PARCHEGGI/TRAGHETTI | 0 | 1 | 3.50 |
| 4 | 21141 | 2020-02-01 | PARCHEGGI/TRAGHETTI | 0 | 1 | 7.00 |
total_transactions = transactions[["client_id", "number_transactions", "year_month", "expenditures"]].sort_values(by=["client_id", "year_month"]).groupby(["client_id"]).agg(
first_month_when_transactions=('year_month', 'min'),
latest_month_when_transactions=('year_month', 'max'),
total_year_months_when_transactions=('year_month', 'count'),
total_number_transactions=('number_transactions', 'sum'),
total_expenditures=('expenditures', 'sum')
)
total_transactions.loc[total_transactions["total_year_months_when_transactions"].isna(), 'total_year_months_when_transactions'] = 0
total_transactions.head()
| first_month_when_transactions | latest_month_when_transactions | total_year_months_when_transactions | total_number_transactions | total_expenditures | |
|---|---|---|---|---|---|
| client_id | |||||
| 2 | 2019-12-01 | 2020-01-01 | 2 | 2 | 615.707998 |
| 3 | 2019-12-01 | 2019-12-01 | 1 | 1 | 4.200000 |
| 5 | 2020-01-01 | 2020-01-01 | 1 | 1 | 5.000000 |
| 8 | 2019-07-01 | 2019-12-01 | 3 | 5 | 27.000000 |
| 10 | 2020-01-01 | 2020-05-01 | 2 | 2 | 24.700000 |
base_subscriptions = insurance_quotes[["client_id", "base_subscription"]].merge(total_transactions, on="client_id", how="left")
base_subscriptions.loc[base_subscriptions["first_month_when_transactions"].notnull() & base_subscriptions["latest_month_when_transactions"].notnull(), 'months_between_first_last_transactions'] = base_subscriptions["latest_month_when_transactions"].dt.to_period('M').astype(int) - base_subscriptions["first_month_when_transactions"].dt.to_period('M').astype(int)
base_subscriptions.loc[base_subscriptions["months_between_first_last_transactions"].isna(), 'months_between_first_last_transactions'] = -1
base_subscriptions.loc[base_subscriptions["base_subscription"].notnull() & base_subscriptions["first_month_when_transactions"].notnull(), 'days_between_base_subscription_and_first_transaction'] = base_subscriptions["first_month_when_transactions"].dt.to_period('D').astype(int) - base_subscriptions["base_subscription"].dt.to_period('D').astype(int)
base_subscriptions.loc[base_subscriptions["days_between_base_subscription_and_first_transaction"].isna(), 'days_between_base_subscription_and_first_transaction'] = -1
base_subscriptions = base_subscriptions.filter(items=["client_id", "months_between_first_last_transactions", "days_between_base_subscription_and_first_transaction"])
base_subscriptions.head()
| client_id | months_between_first_last_transactions | days_between_base_subscription_and_first_transaction | |
|---|---|---|---|
| 0 | 23789 | -1.0 | -1.0 |
| 1 | 10133 | -1.0 | -1.0 |
| 2 | 20785 | 8.0 | 6.0 |
| 3 | 33892 | -1.0 | -1.0 |
| 4 | 34153 | 0.0 | 156.0 |
subscription_status = insurance_quotes[["client_id", "driving_type", "pay_subscription", "pay_cancellation", "premium_subscription", "premium_cancellation"]].copy()
subscription_status.loc[subscription_status["pay_subscription"].notnull(), 'has_pay_subscribed'] = True
subscription_status.loc[subscription_status["pay_cancellation"] == "9998-12-31", 'active_pay_subscription'] = True
subscription_status.loc[(subscription_status["pay_cancellation"] != "9998-12-31") & (subscription_status["pay_cancellation"].notnull()), 'cancelled_pay_subscription'] = True
subscription_status.loc[subscription_status["premium_subscription"].notnull(), 'has_premium_subscribed'] = True
subscription_status.loc[subscription_status["premium_cancellation"] == "9998-12-31", 'active_premium_subscription'] = True
subscription_status.loc[(subscription_status["premium_cancellation"] != "9998-12-31") & (subscription_status["premium_cancellation"].notnull()), 'cancelled_premium_subscription'] = True
subscription_status.loc[subscription_status["driving_type"] == "more_than_26", 'age_group'] = 1
subscription_status.loc[subscription_status["driving_type"] == "between_23_and_25", 'age_group'] = 2
subscription_status.loc[subscription_status["driving_type"] == "less_than_23", 'age_group'] = 3
subscription_status = subscription_status.sort_values(by=["client_id"])
subscription_status = subscription_status[["client_id", "age_group", "has_pay_subscribed", "active_pay_subscription", "cancelled_pay_subscription", "has_premium_subscribed", "active_premium_subscription", "cancelled_premium_subscription"]]
subscription_status.head()
| client_id | age_group | has_pay_subscribed | active_pay_subscription | cancelled_pay_subscription | has_premium_subscribed | active_premium_subscription | cancelled_premium_subscription | |
|---|---|---|---|---|---|---|---|---|
| 18155 | 1 | 1.0 | True | True | NaN | NaN | NaN | NaN |
| 10634 | 2 | 1.0 | True | True | NaN | True | NaN | True |
| 28159 | 3 | 1.0 | NaN | NaN | NaN | True | True | NaN |
| 31965 | 4 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 15937 | 5 | 1.0 | True | True | NaN | True | NaN | True |
quotes_issued = insurance_quotes[["client_id", "issued"]].copy()
quotes_issued = quotes_issued.sort_values(by=["client_id"])
quotes_issued.loc[quotes_issued["issued"] == True, 'is_issued'] = 1
quotes_issued.loc[quotes_issued["is_issued"].isna(), 'is_issued'] = 0
quotes_issued = quotes_issued.drop(["issued"], axis=1)
quotes_issued.head()
| client_id | is_issued | |
|---|---|---|
| 18155 | 1 | 1.0 |
| 10634 | 2 | 1.0 |
| 28159 | 3 | 0.0 |
| 31965 | 4 | 0.0 |
| 15937 | 5 | 0.0 |
merged = total_transactions.merge(subscription_status, on=["client_id"], how="right")
merged = merged.merge(base_subscriptions, on=["client_id"], how="left")
merged.loc[merged["active_pay_subscription"].notnull(), 'pay_subscription'] = 1
merged.loc[merged["cancelled_pay_subscription"].notnull(), 'pay_subscription'] = -1
merged.loc[merged["pay_subscription"].isna(), 'pay_subscription'] = 0
merged.loc[merged["active_premium_subscription"].notnull(), 'premium_subscription'] = 1
merged.loc[merged["cancelled_premium_subscription"].notnull(), 'premium_subscription'] = -1
merged.loc[merged["premium_subscription"].isna(), 'premium_subscription'] = 0
columns = ["age_group", "total_number_transactions", "total_expenditures", "pay_subscription", "premium_subscription", "total_year_months_when_transactions", "months_between_first_last_transactions", "days_between_base_subscription_and_first_transaction"]
merged = merged.filter(items=columns)
merged.loc[merged["total_number_transactions"].isna(), 'total_number_transactions'] = 0
merged.loc[merged["total_expenditures"].isna(), 'total_expenditures'] = 0
merged.loc[merged["total_year_months_when_transactions"].isna(), 'total_year_months_when_transactions'] = 0
merged.loc[merged["months_between_first_last_transactions"].isna(), 'months_between_first_last_transactions'] = -1
merged.loc[merged["days_between_base_subscription_and_first_transaction"].isna(), 'days_between_base_subscription_and_first_transaction'] = -1
merged.head()
| age_group | total_number_transactions | total_expenditures | pay_subscription | premium_subscription | total_year_months_when_transactions | months_between_first_last_transactions | days_between_base_subscription_and_first_transaction | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.000000 | 1.0 | 0.0 | 0.0 | -1.0 | -1.0 |
| 1 | 1.0 | 2.0 | 615.707998 | 1.0 | -1.0 | 2.0 | 1.0 | 914.0 |
| 2 | 1.0 | 1.0 | 4.200000 | 0.0 | 1.0 | 1.0 | 0.0 | 2034.0 |
| 3 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 |
| 4 | 1.0 | 1.0 | 5.000000 | 1.0 | -1.0 | 1.0 | 0.0 | 176.0 |
Training the prediction models¶
Preparing the data¶
random_state = 42
max_iter = 1000
X, y = merged, quotes_issued["is_issued"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
Logistic Regression¶
logistic_regression = LogisticRegression(max_iter=max_iter)
logistic_regression.fit(X_train, y_train)
logistic_regression_result = logistic_regression.predict(X_test)
logistic_regression_cv = LogisticRegression(max_iter=max_iter)
logistic_regression_cv_result = cross_validate(logistic_regression_cv, X, y)
Decision trees¶
decision_tree_classifier = DecisionTreeClassifier(random_state=random_state)
decision_tree_classifier.fit(X_train, y_train)
decision_tree_classifier_result = decision_tree_classifier.predict(X_test)
decision_tree_classifier_cv = DecisionTreeClassifier(random_state=random_state)
decision_tree_classifier_cv_result = cross_validate(decision_tree_classifier_cv, X, y)
Random forests¶
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=random_state),
n_iter=10,
param_distributions={'n_estimators': randint(1, 5),
'max_depth': randint(5, 10)},
random_state=random_state)
search.fit(X_train, y_train)
random_forest_result = search.score(X_test, y_test)
Analyzing the results¶
Logistic regression¶
accuracy_score(y_test, logistic_regression_result)
0.7261921216309606
logistic_regression_cv_result["test_score"].mean()
np.float64(0.7242418403916543)
Decision trees¶
accuracy_score(y_test, decision_tree_classifier_result)
0.7505183137525916
decision_tree_classifier_cv_result["test_score"].mean()
np.float64(0.7529647317369276)
Random forests¶
random_forest_result
0.15799849437938862
Reasoning behind the models¶
What was your approach to building the model? What variables did you include, and what guided your modeling choices?
My approach to building the models was to use the same variables in every model, because I wanted to see how the models' accuracies differ on identical inputs. From the data's perspective, the approach was to analyze how app usage affects the decision to buy an insurance policy.
Variable selection started by collapsing the transactions table into one row per client id. From the transactions I focused on the columns number_transactions, year_month and expenditures, and from these I calculated aggregates: the first and last month in which transactions occurred, the number of months between them, the count of months with transactions, the total number of transactions, and the total amount of money spent. I also calculated the number of days between the base subscription to the Telepass service and the first month with transactions. This data was combined with the TelepassPay and Telepass Premium subscription states, because I wanted to see how having additional subscriptions affects the decision to take an insurance policy.
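The month-difference aggregate described above relies on converting timestamps to monthly periods and subtracting them. A minimal, self-contained sketch on synthetic dates (not the Telepass data) shows the idea; this variant subtracts the periods directly and reads the offset's `n`, which is more robust across pandas versions than casting periods to integers:

```python
import pandas as pd

# Hypothetical first/last transaction months for two clients.
months = pd.DataFrame({
    "first_month": pd.to_datetime(["2019-12-01", "2020-01-01"]),
    "latest_month": pd.to_datetime(["2020-05-01", "2020-01-01"]),
})

# Subtracting monthly periods yields MonthEnd offsets; .n is the whole-month gap.
delta = months["latest_month"].dt.to_period("M") - months["first_month"].dt.to_period("M")
months["months_between"] = delta.apply(lambda offset: offset.n)
```

Here Dec 2019 to May 2020 gives 5 months, and an identical first and last month gives 0, matching the `months_between_first_last_transactions` values seen in the table above.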
How did you evaluate the model? Which performance metric(s) guided your decisions, and why?
I evaluated the models in two ways: a single holdout split, with 20% of the data reserved for testing and the rest used for training, and, for comparison, 5-fold cross-validation on the same data. The only performance metric was accuracy, i.e. how often the model correctly predicted whether the user would take the insurance policy.
How did the model perform on the data? What were its main strengths, weaknesses, and limitations?
Logistic regression and the decision tree performed very similarly on this data; the difference between them is marginal. Both models achieved over 70% accuracy in predicting whether a user takes the insurance policy.
The main weakness of the models is that they take age into account only through the coarse age_group variable. For example, older people may prefer more traditional ways of dealing with insurance policies.
The random forest model performed poorly. In the first rounds of testing it produced scores around 0.05, and adding more columns to the selection only improved it to the current 0.16. Part of the explanation is that the model is a RandomForestRegressor, whose score method returns R² rather than accuracy, so the number is not directly comparable with the other models; for this binary target a RandomForestClassifier would be the appropriate choice. As it stands, I would set this model aside and continue optimizing the other models presented.
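A hedged sketch of the likely fix, again on synthetic data rather than the notebook's features: a RandomForestClassifier's score method returns mean accuracy, which is directly comparable with the logistic regression and decision tree results above, unlike the R² produced by the regressor variant:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged feature table and the is_issued target.
X_demo, y_demo = make_classification(n_samples=400, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)
forest_accuracy = forest.score(X_te, y_te)  # mean accuracy on the held-out fold
```

Whether such a classifier would actually beat the decision tree on the Telepass data is an open question; the point is only that its score would be measured on the same accuracy scale.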
The limitations of the models come from the fairly narrow data selection process. Better analysis would require more feature transformation before training the models.
Based on your analysis, which model performed best in predicting insurance purchase decisions? Explain why, and briefly discuss what your results imply for Telepass.
Based on this analysis, I would continue with the decision tree model, as it produced the most accurate results. I see it as a starting point for further optimization, which I would continue by adding more user background information. For Telepass, I see these results as a promising direction: they would enable more targeted offers to app users.
Sources¶
The source code of the analysis is available at https://github.com/FinThunderstorm/product-analytics-t6-2 and the built notebook at https://product-analytics-t6-2.alanen.dev.