Exercise setup

Let's use one of the sample datasets that Colab provides out of the box.

import pandas as pd
import numpy as np

data = pd.read_csv('/content/sample_data/california_housing_train.csv')

data.head()

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
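
The info() output shows 17000 non-null entries in every column, i.e. nothing is missing yet. An explicit check (not in the original notebook) would be:

# count missing values per column; every count is 0 for this dataset
data.isna().sum()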

Since no data is missing, we will randomly introduce missing entries into the dataset.

df = data.mask(np.random.random(data.shape) < .1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           15290 non-null  float64
 1   latitude            15376 non-null  float64
 2   housing_median_age  15281 non-null  float64
 3   total_rooms         15325 non-null  float64
 4   total_bedrooms      15246 non-null  float64
 5   population          15358 non-null  float64
 6   households          15351 non-null  float64
 7   median_income       15298 non-null  float64
 8   median_house_value  15275 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
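
DataFrame.mask replaces every entry where the condition is True with NaN, so roughly 10% of the values are blanked out. Because the draw above is not seeded, the non-null counts will differ slightly from run to run; a reproducible variant would look like this (the seed value is arbitrary):

# reproducible variant: a fixed seed masks the same ~10% of entries on every run
rng = np.random.default_rng(0)
df = data.mask(rng.random(data.shape) < 0.1)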

Questions

  • Delete the column with the most missing values.

  • Convert the preprocessed dataset to the tensor format.

Delete the column with the most missing values

Peek at the data, now containing NaNs, with df.head():

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 NaN NaN
1 -114.47 34.40 NaN 7650.0 1901.0 1129.0 NaN 1.8200 NaN
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 NaN
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Find the name of the column with the most NaNs:

# for each column, count its NaN entries and keep the name of the one with the most
column_with_most_na = max(df.columns,
                          key=lambda col: df[col].isna().sum())
column_with_most_na
'total_bedrooms'
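
The same column can be found with a shorter pandas idiom; an equivalent alternative (not in the original):

# sum NaNs per column, then take the label of the largest count
column_with_most_na = df.isna().sum().idxmax()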

Remove the column
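
The notebook only shows the result of the drop; the call that produces it would be something like:

# drop the column identified above and peek at the result
df = df.drop(columns=column_with_most_na)
df.head()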

longitude latitude housing_median_age total_rooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1015.0 472.0 NaN NaN
1 -114.47 34.40 NaN 7650.0 1129.0 NaN 1.8200 NaN
2 -114.56 33.69 17.0 720.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 515.0 226.0 3.1917 NaN
4 -114.57 33.57 20.0 1454.0 624.0 262.0 1.9250 65500.0

Conversion of the preprocessed dataset to the tensor format.

We can split the dataset into inputs and outputs, with median_house_value as the output.

inputs = df.iloc[:, :-1].copy()
outputs = df.iloc[:, -1].copy()
inputs.head()
longitude latitude housing_median_age total_rooms population households median_income
0 -114.31 34.19 15.0 5612.0 1015.0 472.0 NaN
1 -114.47 34.40 NaN 7650.0 1129.0 NaN 1.8200
2 -114.56 33.69 17.0 720.0 333.0 117.0 1.6509
3 -114.57 33.64 14.0 1501.0 515.0 226.0 3.1917
4 -114.57 33.57 20.0 1454.0 624.0 262.0 1.9250
outputs.head()
0        NaN
1        NaN
2    85700.0
3        NaN
4    65500.0
Name: median_house_value, dtype: float64
import tensorflow as tf

X, y = tf.constant(inputs.values), tf.constant(outputs.values)
X, y
(<tf.Tensor: shape=(17000, 7), dtype=float64, numpy=
 array([[-114.31  ,   34.19  ,   15.    , ..., 1015.    ,  472.    ,
               nan],
        [-114.47  ,   34.4   ,       nan, ..., 1129.    ,       nan,
            1.82  ],
        [-114.56  ,   33.69  ,   17.    , ...,  333.    ,  117.    ,
            1.6509],
        ...,
        [-124.3   ,   41.84  ,   17.    , ..., 1244.    ,  456.    ,
            3.0313],
        [-124.3   ,   41.8   ,   19.    , ..., 1298.    ,  478.    ,
            1.9797],
        [-124.35  ,   40.54  ,   52.    , ...,  806.    ,  270.    ,
            3.0147]])>,
 <tf.Tensor: shape=(17000,), dtype=float64, numpy=array([    nan,     nan,  85700., ..., 103600.,  85800.,  94600.])>)
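
Note that the NaNs introduced earlier are carried straight into the tensors, and tf.constant infers float64 from the underlying NumPy arrays. If a downstream model expects TensorFlow's usual float32, an explicit cast can be added; this sketch is optional and not part of the exercise:

# optional: build the tensors with TensorFlow's default float32 dtype
X = tf.constant(inputs.values, dtype=tf.float32)
y = tf.constant(outputs.values, dtype=tf.float32)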

This completes the second part of the preliminaries.