Recommending a product price for a seller
This is another take-home assignment that I did for a company as part of their interview process.
This blog post has the code and details of that project. They gave me a dataset with sales details of clothes and accessories from their website. The task was to suggest a price for a piece of wardrobe that a customer would like to sell on the company's platform.
!unzip /content/ds-take-home-dataset.zip
!pip install -U scikit-learn
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
from joblib import dump
import os
file_path = r'/content/drive/My Drive/poshmark_files'
np.random.seed(0)
There are missing values in some columns, so let's start by analysing them first.
data[data['sold_price'] > 0].shape[0]
data[data['sold_price'] == 0].shape[0]
There are 20000 listings with no price details; we might be better off removing them.
vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]
# get percentage of missing values
data[vars_with_na].isnull().mean()
There is a sizeable chunk of missing data in feature attr5. Let's see whether missing values influence the price of a listing.
def analyse_na_value(df, var):
    df = df.copy()
    # let's make a variable that indicates 1 if the observation was missing or zero otherwise
    df[var] = np.where(df[var].isnull(), 1, 0)
    # let's compare the median sold_price in the observations where data is missing
    # vs the observations where a value is available
    df.groupby(var)['sold_price'].median().plot.bar()
    plt.title(var)
    plt.show()
for var in vars_with_na:
    analyse_na_value(data, var)
The median sold_price changes when some attributes are empty, so we should try to capture this relationship when performing feature engineering.
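One way to keep that signal around is a binary "was missing" flag next to the original column. This is only a minimal sketch on made-up toy data; the column name `attr5_was_missing` is hypothetical and is not used elsewhere in this post.

```python
import pandas as pd

# Toy frame standing in for the listings data; the values are made up.
toy = pd.DataFrame({
    'attr5': ['a', None, 'b', None],
    'sold_price': [20.0, 55.0, 18.0, 60.0],
})

# Hypothetical helper column: 1 where attr5 was missing, 0 otherwise.
toy['attr5_was_missing'] = toy['attr5'].isnull().astype(int)

# Median price for rows with and without a value in attr5.
median_by_flag = toy.groupby('attr5_was_missing')['sold_price'].median()
print(median_by_flag)
```

A flag like this lets a model use "the seller left this field empty" as information in its own right, independently of how the gap is filled later.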
Almost all the features we have are categorical according to the instructions given, so let us analyze the only numeric variable, attr6.
data['attr6'].describe()
var = 'attr6'
data[var].hist(bins=range(1, 10))
plt.ylabel('Number of listings')
plt.xlabel(var)
plt.title(var)
plt.show()
We can see that the variable is not normally distributed. We should remember to deal with this.
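A common way to tame a right-skewed, non-negative variable is a `log1p` transform. The snippet below is just a sketch on made-up numbers to show the effect on skewness; it is one option, not necessarily what the final pipeline in this post does.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for attr6.
toy_attr6 = pd.Series([0, 1, 1, 2, 2, 3, 50], dtype=float)

# log1p(x) = log(1 + x), so zeros stay at zero and large values are compressed.
toy_log = np.log1p(toy_attr6)

skew_before = toy_attr6.skew()
skew_after = toy_log.skew()
print('skew before:', skew_before)
print('skew after: ', skew_after)
```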
The feature attr6 is an interesting one (a descriptive name would have clarified things) because 20% of its values are zero. This makes it unclear whether it is a categorical or a numerical variable. Here is a pie chart of the value counts.
data['attr6'].value_counts().plot(kind='pie')
Percentage of zeros in attr6
Percentage of non-zeros in attr6
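The two percentages can be computed with a boolean mask and `.mean()`. The series below is toy data, not the real attr6 column, so the printed numbers only illustrate the calculation.

```python
import pandas as pd

# Toy stand-in for data['attr6']; the values are made up.
toy_attr6 = pd.Series([0, 3, 0, 7, 1, 2, 6, 5, 4, 9])

# The mean of a boolean mask is the fraction of True values.
pct_zero = (toy_attr6 == 0).mean() * 100
pct_nonzero = (toy_attr6 != 0).mean() * 100

print('Percentage of zeros:', pct_zero)
print('Percentage of non-zeros:', pct_nonzero)
```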
As per the instructions the categorical variables are
cat_vars = ['attr1', 'attr2', 'attr3', 'attr4', 'attr5']
Let's check the cardinality
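Cardinality here just means the number of distinct labels per column, which `nunique()` gives directly. A small sketch on a toy frame (the real call would be on `data[cat_vars]`):

```python
import pandas as pd

# Toy frame with two categorical columns; the values are made up.
toy = pd.DataFrame({
    'attr1': ['x', 'y', 'x', 'x'],
    'attr3': ['a', 'b', 'c', None],
})
toy_cat_vars = ['attr1', 'attr3']

# Number of distinct labels per column; missing values are not counted by default.
cardinality = toy[toy_cat_vars].nunique()
print(cardinality)
```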
We do have a huge number of categories in attr3 and attr4, so let's look at the rare labels with respect to the sold prices of the listings.
def analyse_rare_labels(df, var, rare_perc):
    df = df.copy()
    # determine the % of observations per category
    tmp = df.groupby(var)['sold_price'].count() / len(df)
    # return categories that are rare
    return tmp[tmp < rare_perc]
# print categories that are present in less than
# .1 % of the observations
for var in cat_vars:
    print(analyse_rare_labels(data, var, 0.001))
    print()
attr4 has a lot of rare labels: 6510 (98% of its unique labels).
We should try relating each category with the sold_price
def analyse_discrete(df, var):
    df = df.copy()
    df.groupby(var)['sold_price'].median().plot.bar()
    plt.title(var)
    plt.ylabel('Median sold_price')
    plt.show()
    
for var in cat_vars:
    analyse_discrete(data, var)
# let's make boxplots to visualise outliers in the continuous variables
def find_outliers(df, var):
    df = df.copy()
    df[var] = np.log1p(df[var])
    df.boxplot(column=var)
    plt.title(var)
    plt.ylabel(var)
    plt.show()
for var in ['attr6']:
    find_outliers(data, var)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
Before we do the actual feature engineering, let's split the data into train and test sets.
We need to be able to evaluate the model on bins of sold_price such as 0-50, 50-100, 100-500, 500-1000 and 1000+, so let's stratify the split by binning the target variable.
Exclude the listings with a zero price.
zero_filter = data['sold_price'] > 0
bins = np.linspace(0, 1000, 10)
y_binned = np.digitize(data.loc[zero_filter, 'sold_price'], bins)
X_train, X_test, y_train, y_test = train_test_split(data[zero_filter].copy(),
                                                    data.loc[zero_filter, 'sold_price'].copy(),
                                                    test_size=0.3, stratify=y_binned)
bins
pd.Series(y_binned).value_counts()
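The split above uses evenly spaced bins. If you wanted the bin ids to line up with the evaluation bands mentioned earlier (0-50, 50-100, 100-500, 500-1000, 1000+), `np.digitize` can take those edges directly. This is a sketch on made-up prices, with `price_edges` and `band_labels` as hypothetical names:

```python
import numpy as np
import pandas as pd

# Inner edges of the price bands; digitize returns 0 below the first edge
# and len(price_edges) at or above the last one.
price_edges = [50, 100, 500, 1000]
band_labels = ['0-50', '50-100', '100-500', '500-1000', '1000+']

# Made-up prices covering every band, including the edge values.
prices = pd.Series([10, 49.99, 50, 75, 100, 450, 500, 999, 1000, 2500])

bin_ids = np.digitize(prices.to_numpy(), price_edges)
bands = [band_labels[i] for i in bin_ids]
print(pd.Series(bands).value_counts())
```

The resulting `bin_ids` array could be passed to `stratify=` in `train_test_split` in the same way as `y_binned`, provided every band has at least two listings.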
Let's check the percentage of missing values in the data
# make a list of the variables that contain missing values
vars_with_na = [var for var in X_train.columns if X_train[var].isnull().sum() > 0]
# get percentage of missing values
X_train[vars_with_na].isnull().mean()
# make a list of the variables that contain missing values
vars_with_na = [var for var in X_test.columns if X_test[var].isnull().sum() > 0]
# get percentage of missing values
X_test[vars_with_na].isnull().mean()
All the features with missing values are categorical/string variables, so let's just fill them with an arbitrary placeholder value.
# replace missing values in the categorical variables with an arbitrary number
# and missing titles with a placeholder string
miss_cat = ['attr3', 'attr4', 'attr5']
X_train[miss_cat] = X_train[miss_cat].fillna(-9999)
X_test[miss_cat] = X_test[miss_cat].fillna(-9999)
X_train['title'] = X_train['title'].fillna('No Value Present')
X_test['title'] = X_test['title'].fillna('No Value Present')
As per the instructions, the variable id is a kind of temporal variable indicating when the item was sold, so we can see whether it helps with the prediction.
Our only numeric variable as per the instructions is attr6, and it has no missing values, so we do not need to touch it for now.
def find_frequent_labels(df, var, rare_perc):
    
    # function finds the labels that are shared by more than
    # a certain % of the listings in the dataset
    df = df.copy()
    tmp = df.groupby(var)['sold_price'].count() / len(df)
    return tmp[tmp > rare_perc].index
for var in cat_vars:
    
    # find the frequent categories
    frequent_ls = find_frequent_labels(X_train, var, 0.001)
    # replace rare categories by an arbitrary number
    X_train[var] = np.where(X_train[var].isin(
        frequent_ls), X_train[var], -999)
    
    X_test[var] = np.where(X_test[var].isin(
        frequent_ls), X_test[var], -999)
# this function assigns discrete values to the categories of a variable,
# so that the smaller value corresponds to the category that shows the smaller
# mean sold_price
def replace_categories(train, test, var, target):
    # order the categories in a variable from that with the lowest
    # mean sold_price, to that with the highest
    ordered_labels = train.groupby([var])[target].mean().sort_values().index
    # create a dictionary of ordered categories to integer values
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    with open(f'{var}.json', 'w') as f: json.dump(ordinal_label, f)
    # use the dictionary to replace the categorical strings by integers
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)
cat_vars
for var in cat_vars:
    replace_categories(X_train, X_test, var, 'sold_price')
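One thing to watch with this kind of target-ordered encoding: `.map()` returns NaN for any label in the test set that was never seen in the training set. The sketch below reproduces the idea on toy data and fills such gaps with a hypothetical `fallback` code; the fallback value is my assumption, not part of the original pipeline.

```python
import pandas as pd

# Toy train/test frames; the values are made up.
train = pd.DataFrame({
    'attr1': ['a', 'a', 'b', 'c'],
    'sold_price': [10.0, 20.0, 50.0, 100.0],
})
test = pd.DataFrame({'attr1': ['b', 'd']})

# Order categories by mean sold_price in train and number them from 0.
ordered = train.groupby('attr1')['sold_price'].mean().sort_values().index
mapping = {k: i for i, k in enumerate(ordered)}

# 'd' never appeared in train, so it maps to NaN.
encoded = test['attr1'].map(mapping)

# Hypothetical fallback code for unseen labels.
fallback = -1
encoded_filled = encoded.fillna(fallback).astype(int)
print(mapping)
print(encoded_filled.tolist())
```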
# let's check the relationship between the encoded labels and the target
def analyse_vars(df, var):
    # function plots median sold price per encoded category
    
    df = df.copy()
    df.groupby(var)['sold_price'].median().plot.bar()
    plt.title(var)
    plt.ylabel('sold_price')
    plt.show()
    
for var in cat_vars:
    analyse_vars(X_train, var)
num_vars = ['attr6', 'id']
numeric_transformer = Pipeline(steps=[
    ('scaler', RobustScaler())]
    )
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
text_transformer = Pipeline(
    steps=[
           ('tfidf', TfidfVectorizer()),
           ('best', TruncatedSVD(n_components=5))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_vars),
        ('text', text_transformer, 'title')
        ],
        remainder='passthrough'
)
Let's look at the target data in a scatter plot
plt.scatter(x=range(len(y_train)), y=y_train)
The target has outliers, so transforming the target might help the regression model fit the data.
from sklearn.compose import TransformedTargetRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Lasso, LassoLarsIC, RidgeCV, LinearRegression, RANSACRegressor, SGDRegressor
from sklearn.preprocessing import QuantileTransformer, quantile_transform
# to evaluate the model
from sklearn.metrics import mean_squared_error, r2_score, median_absolute_error
from math import sqrt
The snippet is more or less copied from scikit-learn's example gallery. The graph shows the effect of transforming the target with quantiles.
density_param = {'density':True}
y = X_train.loc[:, 'sold_price'].squeeze()
y_trans = quantile_transform(X_train.loc[:, 'sold_price'].values.reshape(-1, 1),
                             n_quantiles=300,
                             output_distribution='normal',
                             copy=True).squeeze()
f, (ax0, ax1) = plt.subplots(1, 2)
ax0.hist(y, bins=100, **density_param)
ax0.set_ylabel('Probability')
ax0.set_xlabel('Target')
ax0.set_title('Target distribution')
ax1.hist(y_trans, bins=100, **density_param)
ax1.set_ylabel('Probability')
ax1.set_xlabel('Target')
ax1.set_title('Transformed target distribution')
f.suptitle("Listing data: distance to price centers", y=0.035)
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
Put together the feature engineering pipeline and model
# recent scikit-learn releases no longer accept `normalize`; their default is no normalization
regr_trans = TransformedTargetRegressor(regressor=LassoLarsIC(),
                                        transformer=QuantileTransformer(output_distribution='normal'))
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', regr_trans)],
                 memory='/content'
               )
Model using regularized linear regression
model.fit(X_train, y_train)
Try predicting on train and test data to evaluate fit
Let's have a look at the mean squared and median absolute errors of the train and test sets.
# We will evaluate performance using the mean squared error and
# the root of the mean squared error and r2
# make predictions for train set
pred = model.predict(X_train)
# determine mse and rmse
print('train mse: {}'.format(int(
    mean_squared_error(y_train, pred))))
print('train rmse: {}'.format(int(
    sqrt(mean_squared_error(y_train, pred)))))
print('train r2: {}'.format(
    r2_score(y_train, pred)))
print()
# make predictions for test set
pred = model.predict(X_test)
# determine mse and rmse
print('test mse: {}'.format(int(
    mean_squared_error(y_test, pred))))
print('test rmse: {}'.format(int(
    sqrt(mean_squared_error(y_test, pred)))))
print('test r2: {}'.format(
    r2_score(y_test, pred)))
print()
print('Median listing price: ', int(y_train.median()))
# We will evaluate performance using the median absolute error
# and r2
# make predictions for train set
pred = model.predict(X_train)
# determine median absolute error
print('train mae: {}'.format(int(
    median_absolute_error(y_train, pred))))
print('train r2: {}'.format(
    r2_score(y_train, pred)))
print()
# make predictions for test set
pred = model.predict(X_test)
# determine median absolute error
print('test mae: {}'.format(int(
    median_absolute_error(y_test, pred))))
print('test r2: {}'.format(
    r2_score(y_test, pred)))
print()
print('Median listing price: ', int(y_train.median()))
It is kinda evident that the model has not performed well (looking at the $r^2$ values). Maybe I should have used another modeling method to get better results.
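Since the task asked for evaluation per price band (0-50, 50-100, 100-500, 500-1000 and 1000+), a per-band error breakdown is more telling than a single number. Here is a minimal sketch on made-up actual and predicted prices; with the real model you would pass `y_test` and `model.predict(X_test)` instead.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted prices, for illustration only.
y_true = pd.Series([20.0, 40.0, 80.0, 300.0, 700.0, 1500.0])
y_pred = pd.Series([25.0, 30.0, 60.0, 200.0, 400.0, 900.0])

# Price bands; right=False makes each band include its lower edge.
edges = [0, 50, 100, 500, 1000, np.inf]
labels = ['0-50', '50-100', '100-500', '500-1000', '1000+']
band = pd.cut(y_true, bins=edges, labels=labels, right=False)

# Median absolute error within each band of the actual price.
abs_err = (y_true - y_pred).abs()
per_band = abs_err.groupby(band, observed=True).median()
print(per_band)
```

A breakdown like this shows whether the model is off mostly on the expensive listings or across the board.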
Let's evaluate our predictions with respect to the sold price. In the plot, the blue dots are the predicted values and the red ones are the actual values.
plt.scatter(range(0, len(y_test)), y_test, color='red')
plt.scatter(range(0, len(X_test)), model.predict(X_test), color='blue')
plt.xlabel('record number')
plt.ylabel('Actual and Predicted Listing Price')
plt.title('Evaluation of Lasso Predictions')
errors = y_test - model.predict(X_test)
errors.hist(bins=30)
The chart is not very informative, but the range of numbers on the x-axis gives an idea of the error range.
I also did another exercise for the same company, which deals with calculating feature importances in a model-agnostic way. You can have a look at that here.