Preparing the Next Table
In this step, we will repeat the process we used for the listings table, this time for the reviews table.
Create Project
- In the Glue DataBrew service interface, click Create project.
- Name the project your-datalake-reviews.
- The Recipe section is selected by default as Create new recipe, with the recipe name automatically set from the project name as your-datalake-reviews-recipe.
- Choose New dataset.
- Scroll down and name the dataset airbnb reviews.
- For the path to the S3 bucket, select the path to the reviews folder you uploaded earlier.
- Example: s3://yourname-0000-datalake/raw/reviews/
- Click Select the entire folder.
- Depending on your data, choose the format, the delimiter for data fields, and whether the first row contains column names.
- If you are using the dataset attached to this workshop (airbnb reviews), keep the default options as follows:

- Scroll down to the Permissions section.
- Choose the role AWSGlueDataBrewServiceRole-airbnb-dataset.
- Click Create Project.

- Click Skip on the DataBrew feature introduction screen.
- Wait a few minutes for your session to initialize.
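
If you prefer to script these console steps, here is a minimal boto3 sketch of the same dataset and project creation. The bucket, dataset name, and role ARN are placeholders taken from the workshop's examples, and the recipe caveat noted in the comments applies when running it outside the console.

```python
import boto3

databrew = boto3.client("databrew")

# Register the raw reviews files as a DataBrew dataset. The prefix key stands
# in for the console's Select the entire folder; the name is hyphenated here
# for API friendliness, mirroring the console's "airbnb reviews" dataset.
databrew.create_dataset(
    Name="airbnb-reviews",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
    Input={"S3InputDefinition": {
        "Bucket": "yourname-0000-datalake",
        "Key": "raw/reviews/",
    }},
)

# Unlike the console's Create new recipe option, the API expects the recipe
# named here to exist already, so create it beforehand if scripting end to end.
databrew.create_project(
    Name="your-datalake-reviews",
    DatasetName="airbnb-reviews",
    RecipeName="your-datalake-reviews-recipe",
    # Placeholder ARN for the service role chosen in the console.
    RoleArn="arn:aws:iam::<account-id>:role/service-role/AWSGlueDataBrewServiceRole-airbnb-dataset",
)
```
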
- In the Glue DataBrew service interface, click Projects, then click the project name your-datalake-reviews.
- We will extract year and month information from the date column for partitioning later.
- Click the … icon of the date column.
- Click Extract.
- Click Value between position.

- Set Starting position = 0 and Ending position = 4 to extract the first 4 characters (the year).
- Name the extracted column Year.
- Click Apply.
- Repeat the Extract steps with Starting position = 5 and Ending position = 7 to create the Month column; the sketch below shows the equivalent logic.
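
Both Extract steps are plain substring slices on the date string. Here is a minimal pandas sketch of the equivalent logic, assuming ISO-formatted dates like 2019-05-14:

```python
import pandas as pd

# Sample rows shaped like the reviews table; the date format is
# assumed to be ISO (YYYY-MM-DD), matching the Airbnb export.
df = pd.DataFrame({"date": ["2019-05-14", "2020-11-03"]})

# Positions 0-4 give the year, positions 5-7 the month,
# mirroring the two Extract steps above.
df["Year"] = df["date"].str[0:4]
df["Month"] = df["date"].str[5:7]

print(df)
#          date  Year Month
# 0  2019-05-14  2019    05
# 1  2020-11-03  2020    11
```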

- Click Create job to create a job that cleans and transforms the data.
- Name the job airbnb-reviews-cleantransform.
- Choose S3 as the output location for the cleaned and transformed data, with the path s3://yourname-0000-datalake/cleantransform/.

- Scroll down and choose the role AWSGlueDataBrewServiceRole-airbnb-dataset.
- Click Create and run job.
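
The same job can be created and started with boto3; a sketch under the same placeholder names, assuming the project above already exists:

```python
import boto3

databrew = boto3.client("databrew")

# Create the clean-and-transform job from the project; the account ID in the
# role ARN is a placeholder for the service role chosen in the console.
databrew.create_recipe_job(
    Name="airbnb-reviews-cleantransform",
    ProjectName="your-datalake-reviews",
    RoleArn="arn:aws:iam::<account-id>:role/service-role/AWSGlueDataBrewServiceRole-airbnb-dataset",
    Outputs=[{
        "Format": "CSV",
        "Location": {
            "Bucket": "yourname-0000-datalake",
            "Key": "cleantransform/",
        },
    }],
)

# Equivalent of clicking Create and run job.
run = databrew.start_job_run(Name="airbnb-reviews-cleantransform")
print("Started run:", run["RunId"])
```
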

- You can check the job status in the Jobs section.
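
To watch the run from a script rather than the console, a small polling sketch (job run states include STARTING, RUNNING, SUCCEEDED, and FAILED):

```python
import time

import boto3

databrew = boto3.client("databrew")

# Poll the newest run (DataBrew lists the most recent run first) until it
# reaches a terminal state such as SUCCEEDED or FAILED.
while True:
    runs = databrew.list_job_runs(Name="airbnb-reviews-cleantransform")["JobRuns"]
    state = runs[0]["State"]
    print("Job state:", state)
    if state not in ("STARTING", "RUNNING"):
        break
    time.sleep(30)
```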

- After the job completes, open the S3 bucket yourname-0000-datalake/cleantransform/ to view the cleaned and transformed data.

- Click to select the cleaned and transformed CSV file.
- Click Download.
- Name the file LOAD000001.csv.
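
The download and rename can also be done with boto3; this sketch assumes the job wrote a single CSV part file under the cleantransform/ prefix:

```python
import boto3

s3 = boto3.client("s3")
bucket = "yourname-0000-datalake"

# Find the CSV part file the job wrote and save it locally under the
# LOAD000001.csv name used in the next step.
objects = s3.list_objects_v2(Bucket=bucket, Prefix="cleantransform/")["Contents"]
csv_key = next(o["Key"] for o in objects if o["Key"].endswith(".csv"))
s3.download_file(bucket, csv_key, "LOAD000001.csv")
```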

- Recreate a folder structure similar to our original dataset.

We have now cleaned and transformed the data for the reviews table; next, we will upload the cleaned data to the S3 bucket.
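
As a preview of that upload, a boto3 sketch; the cleaned/reviews/ prefix is an assumption that mirrors the raw/reviews/ layout, so adjust it to whatever structure you used for the listings table:

```python
import boto3

s3 = boto3.client("s3")

# Upload the renamed file back to the bucket. The cleaned/reviews/ prefix is
# an assumed target that mirrors the raw/reviews/ layout; adjust it to match
# the structure you used for the listings table.
s3.upload_file(
    "LOAD000001.csv",
    "yourname-0000-datalake",
    "cleaned/reviews/LOAD000001.csv",
)
```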