In this part we expand the number of features in our dataset, first by joining the Sellers and Ads collections in our MongoDB database.
This yields a dataset that additionally includes the city where each listing is located.
Next this table is further expanded with a dataset downloaded from Kaggle which contains information about all the cities in Europe.
The columns of this dataset are as follows:
- Country
- City
- Latitude
- Longitude
We will join our dataset to this one on the City field. This way the resulting dataset will be enriched with the longitude and latitude of each city.
import numpy as np
import pandas as pd
from pymongo import MongoClient
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Registration-age buckets ("<n> months") as they appear in the raw data.
# NOTE(review): 'tbr' is not referenced anywhere in this chunk — presumably
# "to be removed/replaced" categories consumed by a later cell; confirm
# before deleting.
tbr = ['1','10 months','11 months','112 months','12 months','13 months','14 months','15 months','16 months','17 months','18 months',
'19 months','2 months','20 months','21 months','22 months','23 months','24 months','25 months','26 months','27 months',
'28 months','29 months','3 months','30 months','31 months','32 months','33 months','34 months','35 months','36 months',
'38 months','4 months','40 months','41 months','42 months','43 months','44 months','45 months','46 months','47 months',
'48 months','5 months','50 months','52 months','53 months','54 months','55 months','56 months','58 months','59 months',
'6 months','60 months','7 months','72 months','8 months','84 months','88 months','9 months', '0 months','1 months']
# Maps the lower-case ISO-3166 alpha-2 country codes used in the Kaggle
# cities file to full country names; used in merge() to fill in missing
# seller countries.
# NOTE(review): 'Luxemburg' (sic) and 'Great Britain' are non-standard
# names that will appear verbatim in the output data — confirm intended
# before changing, since downstream code may match on these strings.
countries_dict = {'it': 'Italy','de':'Germany','fr':'France','cz':'Czech Republic','tr':'Turkey','at':'Austria','lu':'Luxemburg',
'be':'Belgium','nl':'Netherlands','gb':'Great Britain','es':'Spain','bg':'Bulgaria','ru':'Russia','pl':'Poland',
'ro':'Romania','hu':'Hungary'}
def readData(uri='mongodb+srv://<User>:<Pass>@dwprojectcluster.lpqbf.mongodb.net/cars_database?retryWrites=true&w=majority'):
    """Read the cars, ads and sellers collections from MongoDB.

    Parameters
    ----------
    uri : str
        MongoDB connection string; defaults to the project cluster.
        SECURITY NOTE(review): credentials should come from an
        environment variable or config file, not from source code.

    Returns
    -------
    tuple of pandas.DataFrame
        (df_cars, df_ads, df_sellers), each with Mongo's internal
        '_id' column removed and de-duplicated on its key column.
        df_cars is restricted to rows not yet loaded into the DW.
    """
    client = MongoClient(uri)
    try:
        df_cars = pd.DataFrame(list(client.cars_database.cars.find({})))
        df_cars.drop('_id', axis=1, inplace=True)
        # Only process listings that have not been loaded into the DW yet.
        df_cars = df_cars[df_cars['Loaded_in_DW'].eq(False)]
        df_ads = pd.DataFrame(list(client.cars_database.ads.find({})))
        # Drop 'Price' here — presumably duplicated on the cars
        # collection; confirm against the schema.
        df_ads.drop(['_id', 'Price'], axis=1, inplace=True)
        df_sellers = pd.DataFrame(list(client.cars_database.sellers.find({})))
        df_sellers.drop('_id', axis=1, inplace=True)
        # Lower-case city names so they join cleanly with the Kaggle city list.
        df_sellers['City'] = df_sellers['City'].str.lower()
        df_ads.drop_duplicates(subset=['CarID'], inplace=True)
        df_cars.drop_duplicates(subset=['ID'], inplace=True)
        df_sellers.drop_duplicates(subset=['SellerID'], inplace=True)
    finally:
        # Fix: the original never released the connection pool.
        client.close()
    return df_cars, df_ads, df_sellers
def readCities(path='cities2.csv'):
    """Load the Kaggle world-cities file and keep only European countries.

    Parameters
    ----------
    path : str or file-like
        CSV source with at least 'Country' and 'City' columns, where
        country codes are lower-case ISO-3166 alpha-2 ('de', 'it', ...).
        Defaults to the original hard-coded filename.

    Returns
    -------
    pandas.DataFrame
        Rows for the selected countries, one row per distinct City
        (the first occurrence wins).
    """
    # Same country set as the original 17-term OR-chain (which listed
    # 'cz' twice); isin() makes the membership test obvious.
    europe = ['de', 'it', 'at', 'nl', 'be', 'lu', 'pl', 'ru', 'es', 'fr',
              'cz', 'ro', 'gb', 'hu', 'tr', 'bg']
    df_cities = pd.read_csv(path)
    # .copy() so the in-place drop_duplicates below operates on an
    # independent frame instead of a view (avoids SettingWithCopyWarning).
    df_cities_eu = df_cities[df_cities['Country'].isin(europe)].copy()
    df_cities_eu.drop_duplicates(subset=['City'], inplace=True)
    return df_cities_eu
def merge(df_cars, df_ads, df_sellers, df_cities_eu):
    """Join cars -> ads -> sellers and enrich sellers with city data.

    Parameters
    ----------
    df_cars, df_ads, df_sellers : pandas.DataFrame
        Output of readData(); joined on ID == CarID and SellerID.
    df_cities_eu : pandas.DataFrame
        Output of readCities(); provides Country/Latitude/Longitude per City.

    Returns
    -------
    pandas.DataFrame
        Inner join of the three frames, with seller rows carrying the
        city's coordinates and a filled-in 'Country' column
        ('Undefined' when no country could be determined).
    """
    # Fix: work on copies — the original converted the key columns
    # in place, silently mutating the caller's DataFrames.
    df_cars = df_cars.copy()
    df_ads = df_ads.copy()
    df_sellers = df_sellers.copy()
    # Normalise the join keys to numeric dtype.
    df_cars['ID'] = pd.to_numeric(df_cars['ID'])
    df_ads['CarID'] = pd.to_numeric(df_ads['CarID'])
    df_ads['SellerID'] = pd.to_numeric(df_ads['SellerID'])
    df_sellers['SellerID'] = pd.to_numeric(df_sellers['SellerID'])
    # Left-join city coordinates onto sellers; the city table's own
    # 'Country' column survives as 'Country_other'.
    sellers_extended = df_sellers.join(df_cities_eu.set_index('City'),
                                       on='City', rsuffix='_other')
    # Fill missing seller countries from the city table's country code,
    # translated to a full name via the module-level countries_dict.
    missing = sellers_extended['Country'].isnull()
    sellers_extended.loc[missing, 'Country'] = (
        sellers_extended['Country_other'].map(countries_dict))
    # Anything still unmapped becomes 'Undefined'.
    sellers_extended['Country'] = sellers_extended['Country'].replace({np.nan: 'Undefined'})
    join1 = df_cars.merge(df_ads, left_on='ID', right_on='CarID', how='inner')
    return join1.merge(sellers_extended, on='SellerID', how='inner')
# --- Build the enriched dataset and aggregate ad counts per city ---
df_cars, df_ads, df_sellers = readData()
df_cities_eu = readCities()
final = merge(df_cars, df_ads, df_sellers, df_cities_eu)

# Keep only rows that received coordinates from the city join.
lat_lon = final[['Latitude', 'Longitude', 'City']].dropna().reset_index(drop=True)

# Number of ads per (Latitude, Longitude, City) triple.
gr = (lat_lon
      .groupby(['Latitude', 'Longitude', 'City'])
      .agg(count=('Latitude', 'count'))
      .reset_index())
# String copy of the count, used later for plot hover text.
gr['count_str'] = gr['count'].apply(str)
gr = gr.sort_values('count', ascending=False)

# Top 20 cities by ad count (bare expression: rendered as notebook output).
gr[0:20]
# --- Visualise the number of ads per city on a map of Europe ---
import plotly.graph_objects as go
import plotly.tools as tls
from plotly.offline import plot, iplot, init_notebook_mode
from IPython.core.display import display, HTML
init_notebook_mode(connected = True)
# Hide plotly chrome (edit link / mode bar) in the exported figure.
config={'showLink': False, 'displayModeBar': False}
fig = go.Figure(data=go.Scattergeo(
lon = gr['Longitude'],
lat = gr['Latitude'],
# Hover text: city name plus its ad count.
text = 'City name: ' + gr['City'] + '<br> Number of ads: '+ gr['count_str'],
# Log-scale the colour so a few huge cities don't wash out the rest.
# NOTE(review): the [0:len(gr)-2] slice shortens the colour array only —
# lon/lat/text are unsliced — which looks unintentional; confirm.
marker = dict (color = np.log(gr['count'][0:len(gr)-2]), colorbar=dict(
title="Colorbar"
),
)
))
fig.update_layout(
# Restrict the base map to Europe.
geo_scope='europe',
)
# Write the interactive figure to a standalone HTML file (offline mode).
plot(fig, filename = 'fig.html', config = config)
#Binder
#iplot(fig4, filename = 'plotly_figures/fig4.html', config = config)
#ThebeLab
display(HTML('fig.html'))
# Quick sanity check of row counts before vs. after the joins.
df_cars.info()
final.info()
Conclusion
Seeing the results above, it is obvious that the transformations we made in order to add another feature (location) to each listing resulted in a shrinkage of the dataset. From the initial 265,099 listings we are down to 86,313, and of those only 55,260 have Latitude and Longitude information for their city. The initial reduction in size is due to the fact that we do not have information about every seller, and the majority of listings have an unknown seller. The missing Latitude and Longitude fields stem from the inability to find the corresponding city in the Kaggle dataset.
Knowing this, we decided it is best to stick with the initial dataset and keep as many entries as possible.