Clean & Transform Data

Clean & Transform Data

Cleaning and transforming data are crucial steps in deriving value from data. Data cleaning is the process of removing unnecessary data. Data transformation is the process of converting data from one format or structure to a different format or structure that is more suitable and convenient for analysis and data organization.

  1. Click on Projects, then click on the project named your-datalake-listings.

    Datalake

  2. Scroll to the right to view more columns. We will proceed to delete columns that do not contain data.

    • Click to select the host_reponse_time column.
    • Click the COLUMN icon.
    • Click Delete.

    Datalake

  3. In the Source columns section, add columns host_response_rate and host_acceptance_rate.

    • Click Apply.

    Datalake

  4. Next, we will split the year and month information from the hosted_since column for future partitioning.

    • Click the icon of the hosted_since column.
    • Click Extract.
    • Click Value between position.

    Datalake

  5. Set Starting position = 0 and Ending position = 4 to get the first 4 characters.

    • Name the extracted data column Year.
    • Click Apply.

    Datalake

  6. Repeat steps 4 and 5 with Starting position = 5 and Ending position = 7 to extract the month column.

    Datalake

  7. Click Create job to start the job for cleaning and transforming data.

    Datalake

  8. Name the job airbnb-listings-cleantransform.

    • Choose to export the cleaned and transformed data to S3 at the following path: s3://yourname-0000-datalake/cleantransform/.

    Datalake

  9. Scroll down, select the role AWSGlueDataBrewServiceRole-airbnb-dataset.

    • Click Create and run job.

    Datalake

  10. You can check the job status in the Job section.

    Datalake

  11. After the job is completed, access the S3 bucket yourname-0000-datalake/cleantransform/ to view the cleaned and transformed data.

    Datalake

  12. Click to select the csv file that has been cleaned and transformed.

    • Click Download.
    • Name the file LOAD000001.csv.

    Datalake

  13. Arrange the folder structure similar to our original dataset.

    Datalake

We will perform cleaning and data transformation for the listings table, and then we will do the same for the reviews table.