Data Preprocessing - d2l.ai Exercises - Part 2
The second notebook in a series aiming to solve and understand exercises from the d2l.ai deep learning curriculum
Let's use the sample datasets offered directly in Colab.
import pandas as pd
import numpy as np
data = pd.read_csv('/content/sample_data/california_housing_train.csv')
data.head()
data.info()
Since there are no missing values, we will introduce random missing entries into the data
df = data.mask(np.random.random(data.shape) < .1)
df.info()
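To see what `mask` is doing here: every cell where an independent uniform draw falls below 0.1 is replaced with NaN, so roughly 10% of the entries go missing. A minimal sketch on a small synthetic frame (the column names and seed are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data: 1000 rows, 3 numeric columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])

# mask(cond) replaces entries where cond is True with NaN;
# each cell is masked independently with probability 0.1.
df = data.mask(rng.random(data.shape) < 0.1)

# The shape is unchanged; only values are replaced.
frac_missing = df.isna().to_numpy().mean()
print(f"fraction missing: {frac_missing:.3f}")  # typically near 0.1
```

Note that `mask` returns a new DataFrame with the same shape; the original `data` is left untouched.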
Delete the column with the most missing values.
Convert the preprocessed dataset to the tensor format.
Peek at the data with the NaNs introduced
Find the column name with the most NaNs
# For each column, count its NaN entries and keep the column with the most
column_with_most_na = max(df.columns, key=lambda c: df[c].isna().sum())
Remove the column
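The notebook states the removal but does not show the code; a minimal sketch of dropping the column found above, using a small hypothetical frame (the toy values and NaN placement are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the masked housing data.
df = pd.DataFrame({
    "longitude": [1.0, np.nan, 3.0],
    "total_rooms": [np.nan, np.nan, 5.0],
    "median_house_value": [100.0, 200.0, np.nan],
})

# Column with the most NaNs (here total_rooms, with 2).
column_with_most_na = max(df.columns, key=lambda c: df[c].isna().sum())

# drop(columns=...) returns a new frame without that column;
# rebind df so the rest of the notebook uses the reduced data.
df = df.drop(columns=column_with_most_na)
print(column_with_most_na)  # total_rooms
print(list(df.columns))     # ['longitude', 'median_house_value']
```

`df.isna().sum().idxmax()` is an equivalent one-liner for finding the column.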
We can split the dataset into inputs and outputs, with median_house_value
as the output
inputs = df.iloc[:, :-1].copy()
outputs = df.iloc[:, -1].copy()
inputs.head()
outputs.head()
import tensorflow as tf
X, y = tf.constant(inputs.values), tf.constant(outputs.values)
X, y
This completes the second part of the preliminaries.