Cleaning and transforming data are crucial steps in deriving value from it. Data cleaning is the process of removing or correcting data that is unnecessary, empty, or incorrect. Data transformation is the process of converting data from one format or structure into another that is more suitable for analysis and organization.
Click on Projects, then click on the project named your-datalake-listings.
Scroll to the right to view more columns. Next, we will delete the columns that do not contain any data.
In the Source columns section, add columns host_response_rate and host_acceptance_rate.
Next, we will split the year and month information from the hosted_since column for future partitioning.
Set Starting position = 0 and Ending position = 4 to extract the first 4 characters, which hold the year.
Repeat steps 4 and 5 with Starting position = 5 and Ending position = 7 to extract the month column.
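For readers who want to see what these recipe steps do to the data, the following is a minimal Python sketch of the same logic, not the DataBrew implementation itself. It assumes that hosted_since holds dates in YYYY-MM-DD form and that the new columns are named year and month. The sample rows and the function name clean_and_transform are hypothetical and exist only for illustration.

```python
import csv
import io

# Columns the tutorial removes because they contain no data.
EMPTY_COLUMNS = ["host_response_rate", "host_acceptance_rate"]


def clean_and_transform(rows, date_column="hosted_since"):
    """Drop the empty columns and add year/month columns taken by position."""
    cleaned = []
    for row in rows:
        # Keep every column except the ones marked for deletion.
        new_row = {k: v for k, v in row.items() if k not in EMPTY_COLUMNS}
        date_text = new_row.get(date_column) or ""
        # Starting position 0, ending position 4: the first four characters.
        new_row["year"] = date_text[0:4]
        # Starting position 5, ending position 7: the two characters after
        # the first separator (assumes a YYYY-MM-DD date string).
        new_row["month"] = date_text[5:7]
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample data; the real listings file has many more columns.
SAMPLE_CSV = """id,hosted_since,host_response_rate,host_acceptance_rate
1,2015-03-21,,
2,2019-11-02,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = clean_and_transform(rows)
```

The ending position is treated as exclusive here, which matches the tutorial's statement that positions 0 to 4 return the first 4 characters. If the dates in your dataset use a different layout, the positions would need to change accordingly.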
Click Create job to set up the job that cleans and transforms the data.
Name the job airbnb-listings-cleantransform.
Scroll down, select the role AWSGlueDataBrewServiceRole-airbnb-dataset.
You can check the job status in the Job section.
After the job is completed, open the cleantransform/ folder in the S3 bucket yourname-0000-datalake to view the cleaned and transformed data.
Select the CSV file that has been cleaned and transformed.
Arrange the folder structure so that it matches the structure of our original dataset.
We will clean and transform the data for the listings table first, and then do the same for the reviews table.