The dataset contains many binary features, each indicating whether a particular piece of extra equipment is present in the car. This makes the equipment fields well suited to association analysis.
Association analysis (also known as market basket analysis) is a data mining technique that identifies events or activities that frequently co-occur within a user group.
In our case, we will use the results to see which pairs of equipment features are found together most often. Three main measures describe the strength of an association rule:
Support:
$ supp(X) = {\text{# of listings in which }X \text{ appears} \over \text{Total # of listings}}$
The support of an itemset $ X $ is the proportion of transactions (here, listings) in the database that contain $ X $.
Confidence:
$conf(X \to Y) = {supp(X \cup Y)\over supp(X)}$
Confidence estimates the conditional probability that itemset $ Y $ occurs in a listing, given that itemset $ X $ occurs in it.
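To make the two definitions above concrete, here is a minimal sketch in plain Python. The `listings` data and the helper names `support` and `confidence` are hypothetical and only illustrate the formulas; they are not part of the analysis code used later.

```python
# Toy data (hypothetical): each set holds the equipment present in one listing.
listings = [
    {"ABS", "Airbag", "Radio"},
    {"ABS", "Airbag"},
    {"ABS", "Radio"},
    {"Airbag"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(X ∪ Y) / supp(X), an estimate of P(Y | X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# ABS appears in 3 of the 4 listings.
supp_abs = support({"ABS"}, listings)
# Of the 3 listings with ABS, 2 also have an airbag.
conf_abs_airbag = confidence({"ABS"}, {"Airbag"}, listings)
print(supp_abs, conf_abs_airbag)
```

Note that confidence is not symmetric: in this toy data, $conf(\text{ABS} \to \text{Airbag})$ and $conf(\text{Airbag} \to \text{ABS})$ share the same numerator but divide by different supports.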
Lift:
- $lift(X \to Y) = {supp(X \cup Y)\over supp(X) \times supp(Y)}$
- Lift measures the ratio of the observed support to that expected if $ X $ and $ Y $ were independent.
- If $ lift(X \to Y) = 1 $, itemsets $ X $ and $ Y $ occur independently of each other, so the rule shows no association between them.
- If $ lift(X \to Y) > 1 $, $ X $ and $ Y $ appear together more often than expected under independence; the larger the value, the stronger the dependence.
- If $ lift(X \to Y) < 1 $, $ X $ and $ Y $ appear together less often than expected, which suggests that the items are substitutes for each other.
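
The lift cases listed above can be illustrated with a small sketch in plain Python. The toy `listings` and the helpers `support` and `lift` are hypothetical and exist only to show how the ratio behaves.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(x, y, transactions):
    """Observed joint support divided by the support expected under independence."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

# Toy data (hypothetical): parking sensors only ever appear together with navigation.
listings = [
    {"Navigation", "Parking sensors"},
    {"Navigation", "Parking sensors"},
    {"Radio"},
    {"Radio", "Navigation"},
]

# Found together more often than independence would predict, so lift > 1.
lift_together = lift({"Navigation"}, {"Parking sensors"}, listings)
# Never found together, so lift < 1 (here exactly 0).
lift_apart = lift({"Radio"}, {"Parking sensors"}, listings)
print(lift_together, lift_apart)
```

Unlike confidence, lift is symmetric: swapping $ X $ and $ Y $ leaves the value unchanged, because both supports appear in the denominator.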
We sort the association table by lift, as it is the most elaborate of the three measures and the most useful one for our dataset.
import numpy as np
import pandas as pd
from pymongo import MongoClient
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
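# String values (such as '12 months') that are replaced with True below, i.e. treated as "feature present"
tbr = ['1','10 months','11 months','112 months','12 months','13 months','14 months','15 months','16 months','17 months','18 months',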
'19 months','2 months','20 months','21 months','22 months','23 months','24 months','25 months','26 months','27 months',
'28 months','29 months','3 months','30 months','31 months','32 months','33 months','34 months','35 months','36 months',
'38 months','4 months','40 months','41 months','42 months','43 months','44 months','45 months','46 months','47 months',
'48 months','5 months','50 months','52 months','53 months','54 months','55 months','56 months','58 months','59 months',
'6 months','60 months','7 months','72 months','8 months','84 months','88 months','9 months', '0 months','1 months']
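def readData():
    # Connect to the MongoDB cluster; <User> and <Pass> are placeholders for real credentials
    client = MongoClient('mongodb+srv://<User>:<Pass>@dwprojectcluster.lpqbf.mongodb.net/cars_database?retryWrites=true&w=majority')
    # Load every document of the cars collection into a DataFrame
    df_cars = pd.DataFrame(list(client.cars_database.cars.find({})))
    df_cars.drop('_id', axis=1, inplace=True)
    # Keep only the listings that have not been loaded into the data warehouse yet
    df_cars = df_cars[df_cars['Loaded_in_DW'].eq(False)]
    return df_cars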
df_cars = readData()
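# The equipment features start at column index 15
equipment = df_cars.iloc[:, 15:]
# Missing values mean the feature is absent
equipment = equipment.replace({np.nan: False})
# Numeric 1, the string '1' and the strings listed in tbr all mean the feature is present
equipment = equipment.replace({1: True})
equipment = equipment.replace({'1': True})
equipment = equipment.replace(tbr, True)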
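# Frequent itemsets: keep only equipment combinations present in at least 70% of the listings
ap = apriori(equipment, min_support=0.7, use_colnames=True)
# Derive rules from the frequent itemsets; a lift threshold of 0 keeps every rule
rules_ap = association_rules(ap, metric="lift", min_threshold=0)
# Top 20 rules by lift, then top 20 rules by confidence
rules_ap.sort_values(by='lift', ascending=False)[0:20]
rules_ap.sort_values(by='confidence', ascending=False)[0:20]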