Kayla Bollinger

Data Cleaning in SQL: Tennessee Housing Data

(SQL Code Available Here)

This portfolio project is a practice in cleaning data using SQL. The dataset was taken from Alex the Analyst's "Data Analyst Portfolio Project" series. The SQL code can be found on my GitHub here. While this work is my own, I am grateful for this series for being such a helpful guide.

Summary and notes on the data cleaning process:

To not alter the original data source, the cleaned data was put into a new table labeled "TNHousing_clean". The cleaned dataset can be found on my GitHub here.

For some reason while importing the original dataset into Microsoft SQL Server SMS, rows in the dataset were duplicated many times. Only the unique rows are stored in the cleaned dataset.

Alex the Analyst's guide for this project suggests using the attribute "ParcelID" to fill in missing/NULL values in the "PropertyAddress" column--the suggestion being that ParcelID is unique to each PropertyAddress. However, when I investigated this further I found that this was not the case (e.g., ParcelID 033 16 0 131.00 has two different addresses associated with it). This meant that there was no reliable way to find the missing addresses, and so the NULL values were not changed.

Full process:

Create a new table.

/*
Create new table to be the cleaned version of the old table (to no overwrite or lose original data).
*/

CREATE TABLE TNHousing_clean 
	(UniqueID int,
	ParcelID nvarchar(255),
	LandUse nvarchar(255),
	PropertyAddress nvarchar(255),
	SaleDate date,
	SalePrice int,
	LegalReference nvarchar(255),
	SoldAsVacant nvarchar(255), -- will update type to nchar (1) later
	OwnerName nvarchar(255),
	OwnerAddress nvarchar(255),
	Acreage float,
	TaxDistrict nvarchar(255),
	LandValue int,
	BuildingValue int,
	TotalValue int,
	YearBuilt int,
	Bedrooms int,
	FullBath int,
	HalfBath int,
	PRIMARY KEY (UniqueID)
	);

When loading the dataset, there seemed to be more datapoints than expected. We investigate this more closely here.

/*
Check total number of data points in original dataset--there should only be 56477.
*/

SELECT COUNT(*)
FROM dbo.TNHousing;
-- 1048575 rows--more than expected

Indeed there are many more rows than expected, so we check to see if any data points were duplicated (for some reason). "UniqueID" should be the primary key for this dataset, so they should be unique and could help indicate duplicates.

/*
UniqueID
- no null values
- non-negative integer values
- unique identifier for each row, total of 56477 rows desired
*/

-- check NULL or negative values
SELECT COUNT(*)
FROM dbo.TNHousing
WHERE UniqueID IS NULL OR UniqueID < 0;
-- none

-- check for duplicates 
SELECT UniqueID, COUNT(*)
FROM dbo.TNHousing
GROUP BY UniqueID
HAVING COUNT(*) > 1;
-- 56477 unique UniqueID values (as expected)
-- check that indeed each UniqueID corresponds to a unique row
SELECT UniqueID, COUNT(*)
FROM dbo.TNHousing
GROUP BY [UniqueID ]
      ,[ParcelID]
      ,[LandUse]
      ,[PropertyAddress]
      ,[SaleDate]
      ,[SalePrice]
      ,[LegalReference]
      ,[SoldAsVacant]
      ,[OwnerName]
      ,[OwnerAddress]
      ,[Acreage]
      ,[TaxDistrict]
      ,[LandValue]
      ,[BuildingValue]
      ,[TotalValue]
      ,[YearBuilt]
      ,[Bedrooms]
      ,[FullBath]
      ,[HalfBath]
HAVING COUNT(*) > 1;
-- 56477 unique rows--indeed UniqueID = unique row

-- add only unique rows to clean table
WITH CTE_dup AS(
   SELECT *,
       RN = ROW_NUMBER()OVER(PARTITION BY UniqueID ORDER BY UniqueID)
   FROM dbo.TNHousing
)
INSERT INTO dbo.TNHousing_clean
SELECT [UniqueID ]
      ,[ParcelID]
      ,[LandUse]
      ,[PropertyAddress]
      ,[SaleDate]
      ,[SalePrice]
      ,[LegalReference]
      ,[SoldAsVacant]
      ,[OwnerName]
      ,[OwnerAddress]
      ,[Acreage]
      ,[TaxDistrict]
      ,[LandValue]
      ,[BuildingValue]
      ,[TotalValue]
      ,[YearBuilt]
      ,[Bedrooms]
      ,[FullBath]
      ,[HalfBath]
FROM CTE_dup 
WHERE RN = 1;

Now that all duplicates are removed, we can proceed to check on different attribute properties. We will check each attribute one by one to look for any potential problems. Note that we will be checking for NULL values, but we will only be able to fill them in if there is a clear pattern that we can use from the available data (since this is just a practice dataset, there are no additional sources to use).

First up, "ParcelID". There are no particular expectations for this attribute.

/*
ParcelID
- check for null values
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE ParcelID IS NULL;
-- no NULL values

"LandUse" describes the type of property, e.g. "SINGLE FAMILY". There are only a handful of categories for this attribute, so we can check for any typos/issues directly.

/*
LandUse
- check for null values
- check distinct values for any irregularities
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE LandUse IS NULL;
-- no NULL values

SELECT DISTINCT(LandUse)
FROM dbo.TNHousing_clean;
-- no irregularities

For "PropertyAddress", ideally there would be no null values (as this is, in some sense, the main descriptor for each data point). However, we find that there are null values, and check to see if there are any other IDs that we could use to help fill these NULL values. The only potential candidate would be "ParcelID", but after further investigation it is clear that it cannot be used as these IDs are not unique to known "PropertyAddress" values (there are instances where two different addresses have the same "ParcelID"). We will also clean this data by removing extra white space and splitting the address into Street and City.

/*
PropertyAddress
- no null values
- if null values, check if can be filled using ParcelID
- split address into Street and City
*/

-- check for NULL values
SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE PropertyAddress IS NULL;
-- resulted in: 29 NULL values

-- remove extra white spaces from address
UPDATE dbo.TNHousing_clean
SET PropertyAddress = RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(PropertyAddress,' ','<>'),'><',''),'<>',' ')));
--

-- check to see if two different addresses can share the same ParcelID
SELECT sub.ParcelID, sub.PropertyAddress
FROM (
	SELECT ParcelID, PropertyAddress
	FROM dbo.TNHousing_clean
	WHERE ParcelID IN (
						SELECT ParcelID
						FROM dbo.TNHousing_clean
						GROUP BY ParcelID
						HAVING COUNT(*) > 1
						)
	) as sub
	GROUP BY sub.ParcelID, sub.PropertyAddress
	HAVING COUNT(*) = 1
ORDER BY ParcelID;
-- indeed they can, e.g. ParcelID = 033 16 0 131.00
-- cannot use ParcelID to fill in NULL values for PropertyAddress

-- split address into street and city
-- check streets
SELECT LEFT(PropertyAddress,CHARINDEX(',',PropertyAddress)-1)
FROM dbo.TNHousing_clean;
-- check cities
SELECT DISTINCT(RIGHT(PropertyAddress,LEN(PropertyAddress)-CHARINDEX(',',PropertyAddress)-1))
FROM dbo.TNHousing_clean;
-- no irregularities in city names; also shows that indeed original format was "[STREET], [CITY]"
-- add new columns
ALTER TABLE dbo.TNHousing_clean
ADD PropertyAddress_Street nvarchar(255), PropertyAddress_City nvarchar(255);

UPDATE dbo.TNHousing_clean
SET PropertyAddress_Street = LEFT(PropertyAddress,CHARINDEX(',',PropertyAddress)-1),
	PropertyAddress_City = RIGHT(PropertyAddress,LEN(PropertyAddress)-CHARINDEX(',',PropertyAddress)-1);

"SaleDate" is given in datetime format, but we do not need time for this attribute. We convert to date format instead.

/*
SaleDate
-- remove time
*/

-- test conversion
SELECT SaleDate, CONVERT(date,SaleDate)
FROM dbo.TNHousing_clean;

UPDATE dbo.TNHousing_clean
SET SaleDate = CONVERT(date,SaleDate);

"SalePrice" should not be negative, so we check for that.

/*
SalePrice
- non-negative values
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE SalePrice < 0;
-- no negative values

"LegalReference"--nothing to check.

/*
LegalReference
-- none
*/

"SoldAsVacant" should be labeled yes or no. First we check for any irregularities/typos. There was a mix of "Y/N" and "Yes/No" convention, so we convert all to "Y/N" convention.

/*
SoldAsVacant
-- check for irregularities
*/

SELECT DISTINCT(SoldAsVacant)
FROM dbo.TNHousing_clean;
-- mix of "Y/N" and "Yes/No" convention

-- convert to "Y/N" convention
UPDATE dbo.TNHousing_clean
SET SoldAsVacant = CASE
		WHEN SoldAsVacant = 'Yes' THEN 'Y'
		WHEN SoldAsVacant = 'No' THEN 'N'
		ELSE SoldAsVacant
		END;

-- alter column type to indicate single character description
ALTER TABLE dbo.TNHousing_clean
ALTER COLUMN SoldAsVacant nchar(1);

For "OwnerName", we remove any extra whitespace.

/*
OwnerName
- remove extra whitespace
*/

UPDATE dbo.TNHousing_clean
SET OwnerName = RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(OwnerName,' ','<>'),'><',''),'<>',' ')));

For "OwnerAddress", we remove any extra whitespace and split into Street, City, and State.

/*
OwnerAddress
- remove extra whitespace
- split into Street, City, and State
*/

UPDATE dbo.TNHousing_clean
SET OwnerAddress = RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(OwnerAddress,' ','<>'),'><',''),'<>',' ')));

-- split address into street, city, and state
-- check streets
SELECT LEFT(OwnerAddress,CHARINDEX(',',OwnerAddress)-1)
FROM dbo.TNHousing_clean;
-- check cities
SELECT DISTINCT(RIGHT(OwnerAddress,LEN(OwnerAddress)-CHARINDEX(',',OwnerAddress)-1))
FROM dbo.TNHousing_clean;
-- no irregularities
-- this also includes state info--will temporarily store as "_City", then remove state

-- add new columns
ALTER TABLE dbo.TNHousing_clean
ADD OwnerAddress_Street nvarchar(255), OwnerAddress_City nvarchar(255);

UPDATE dbo.TNHousing_clean
SET OwnerAddress_Street = LEFT(OwnerAddress,CHARINDEX(',',OwnerAddress)-1),
	OwnerAddress_City = RIGHT(OwnerAddress,LEN(OwnerAddress)-CHARINDEX(',',OwnerAddress)-1);

-- fix OwnerAddress_City and add State column
-- check cities
SELECT DISTINCT(LEFT(OwnerAddress_City,CHARINDEX(',',OwnerAddress_City)-1))
FROM dbo.TNHousing_clean;
-- check states
SELECT DISTINCT(RIGHT(OwnerAddress_City,LEN(OwnerAddress_City)-CHARINDEX(',',OwnerAddress_City)-1))
FROM dbo.TNHousing_clean;

-- add new columns
ALTER TABLE dbo.TNHousing_clean
ADD OwnerAddress_State nvarchar(255);

UPDATE dbo.TNHousing_clean
SET OwnerAddress_City = LEFT(OwnerAddress_City,CHARINDEX(',',OwnerAddress_City)-1),
	OwnerAddress_State = RIGHT(OwnerAddress_City,LEN(OwnerAddress_City)-CHARINDEX(',',OwnerAddress_City)-1);

"Acreage" should be nonnegative, so we check for that.

/*
Acreage
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE Acreage < 0;
-- no negative values

"TaxDistrict" has only a handful of categories, so we check for any typos/issues directly.

/*
TaxDistrict
- check for irregularities
*/

SELECT DISTINCT(TaxDistrict)
FROM dbo.TNHousing_clean;
-- no irregularities

-- remove extra whitespace (if any)
UPDATE dbo.TNHousing_clean
SET TaxDistrict = RTRIM(LTRIM(REPLACE(REPLACE(REPLACE(TaxDistrict,' ','<>'),'><',''),'<>',' ')));

"LandValue" should be nonnegative, so we check for that.

/*
LandValue
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE LandValue < 0;
-- no negative values

"BuildingValue" should be nonnegative, so we check for that.

/*
BuildingValue
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE BuildingValue < 0;
-- no negative values

"TotalValue" should be nonnegative, so we check for that.

/*
TotalValue
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE TotalValue < 0;
-- no negative values

"YearBuilt" should be nonnegative and have only four digits, so we check for that.

/*
YearBuilt
- nonnegative
- 4 digits
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE YearBuilt < 0;
-- no negative values

SELECT DISTINCT(LEN(YearBuilt))
FROM dbo.TNHousing_clean;
-- only 4 digits

"Bedrooms" should be nonnegative, so we check for that.

/*
Bedrooms
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE Bedrooms < 0;
-- no negative values

"FullBath" should be nonnegative, so we check for that.

/*
FullBath
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE FullBath < 0;
-- no negative values

"HalfBath" should be nonnegative, so we check for that.

/*
HalfBath
- nonnegative
*/

SELECT COUNT(*)
FROM dbo.TNHousing_clean
WHERE HalfBath < 0;
-- no negative values

Take final look at the dataset to see if any glaring issues stand out.

/*
Final Check
*/

SELECT *
FROM dbo.TNHousing_clean;