Exercise setup

Let's use one of the sample datasets that Colab provides out of the box.

import pandas as pd
import numpy as np

data = pd.read_csv('/content/sample_data/california_housing_train.csv')

data.head()

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
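
The info() output shows 17000 non-null entries in every column, i.e. nothing is missing yet. An explicit check (not in the original notebook) would be:

# count missing values per column; every count is 0 for this dataset
data.isna().sum()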

Since no data is missing, we will randomly introduce missing entries into the dataset.

df = data.mask(np.random.random(data.shape) < .1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           15290 non-null  float64
 1   latitude            15376 non-null  float64
 2   housing_median_age  15281 non-null  float64
 3   total_rooms         15325 non-null  float64
 4   total_bedrooms      15246 non-null  float64
 5   population          15358 non-null  float64
 6   households          15351 non-null  float64
 7   median_income       15298 non-null  float64
 8   median_house_value  15275 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
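
DataFrame.mask replaces every entry where the condition is True with NaN, so roughly 10% of the values are blanked out. Because the draw above is not seeded, the non-null counts will differ slightly from run to run; a reproducible variant would look like this (the seed value is arbitrary):

# reproducible variant: a fixed seed masks the same ~10% of entries on every run
rng = np.random.default_rng(0)
df = data.mask(rng.random(data.shape) < 0.1)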

Questions

  • Delete the column with the most missing values.

  • Convert the preprocessed dataset to the tensor format.

Delete the column with the most missing values

Peek at the data, now containing NaNs, with df.head():

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 NaN NaN
1 -114.47 34.40 NaN 7650.0 1901.0 1129.0 NaN 1.8200 NaN
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 NaN
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Find the name of the column with the most NaNs:

# for each column, count its NaN entries and keep the name of the one with the most
column_with_most_na = max(df.columns,
                          key=lambda col: df[col].isna().sum())
column_with_most_na
'total_bedrooms'
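
The same column can be found with a shorter pandas idiom; an equivalent alternative (not in the original):

# sum NaNs per column, then take the label of the largest count
column_with_most_na = df.isna().sum().idxmax()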

Remove the column
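
The notebook only shows the result of the drop; the call that produces it would be something like:

# drop the column identified above and peek at the result
df = df.drop(columns=column_with_most_na)
df.head()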

longitude latitude housing_median_age total_rooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1015.0 472.0 NaN NaN
1 -114.47 34.40 NaN 7650.0 1129.0 NaN 1.8200 NaN
2 -114.56 33.69 17.0 720.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 515.0 226.0 3.1917 NaN
4 -114.57 33.57 20.0 1454.0 624.0 262.0 1.9250 65500.0

Conversion of the preprocessed dataset to the tensor format.

We can split the dataset into inputs and outputs, with median_house_value as the output.

inputs = df.iloc[:, :-1].copy()
outputs = df.iloc[:, -1].copy()
inputs.head()
longitude latitude housing_median_age total_rooms population households median_income
0 -114.31 34.19 15.0 5612.0 1015.0 472.0 NaN
1 -114.47 34.40 NaN 7650.0 1129.0 NaN 1.8200
2 -114.56 33.69 17.0 720.0 333.0 117.0 1.6509
3 -114.57 33.64 14.0 1501.0 515.0 226.0 3.1917
4 -114.57 33.57 20.0 1454.0 624.0 262.0 1.9250
outputs.head()
0        NaN
1        NaN
2    85700.0
3        NaN
4    65500.0
Name: median_house_value, dtype: float64
import tensorflow as tf

X, y = tf.constant(inputs.values), tf.constant(outputs.values)
X, y
(<tf.Tensor: shape=(17000, 7), dtype=float64, numpy=
 array([[-114.31  ,   34.19  ,   15.    , ..., 1015.    ,  472.    ,
               nan],
        [-114.47  ,   34.4   ,       nan, ..., 1129.    ,       nan,
            1.82  ],
        [-114.56  ,   33.69  ,   17.    , ...,  333.    ,  117.    ,
            1.6509],
        ...,
        [-124.3   ,   41.84  ,   17.    , ..., 1244.    ,  456.    ,
            3.0313],
        [-124.3   ,   41.8   ,   19.    , ..., 1298.    ,  478.    ,
            1.9797],
        [-124.35  ,   40.54  ,   52.    , ...,  806.    ,  270.    ,
            3.0147]])>,
 <tf.Tensor: shape=(17000,), dtype=float64, numpy=array([    nan,     nan,  85700., ..., 103600.,  85800.,  94600.])>)
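
Note that the NaNs introduced earlier are carried straight into the tensors, and tf.constant infers float64 from the underlying NumPy arrays. If a downstream model expects TensorFlow's usual float32, an explicit cast can be added; this sketch is optional and not part of the exercise:

# optional: build the tensors with TensorFlow's default float32 dtype
X = tf.constant(inputs.values, dtype=tf.float32)
y = tf.constant(outputs.values, dtype=tf.float32)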

This completes the second part of the preliminaries.