Transform to Parquet

Convert to Parquet - listings table

  1. Go to the AWS Glue service.
  • Click ETL Jobs.
  • Click Notebooks.
  • Click Notebook.

  1. On the Configure the job properties page:
  • Set the Job name to yourname-datalake-csvtoparquet.
  • Select the IAM role AWSGlueServiceRole-yourname-datalake.

  1. Copy the Python script below into the edit script screen.
  • Note: rename the S3 bucket and Glue database to match your configuration.
import sys
import datetime
import re
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)

## DON'T FORGET TO PUT IN YOUR INPUT AND OUTPUT LOCATIONS BELOW.
your_database_name = "yourname-datalake-db"
your_table_name = "listings"
output_location = "s3://yourname-0000-datalake/parquet/listings/"

job.init("byod-workshop" + str(datetime.datetime.now().timestamp()))

# load our data from the catalog that we created with a crawler
dynamicF = glueContext.create_dynamic_frame.from_catalog(
    database = your_database_name,
    table_name = your_table_name,
    transformation_ctx = "dynamicF")

# invalid characters in column names are replaced by _ (Parquet rejects spaces, commas, etc.)
df = dynamicF.toDF()

def canonical(x):
    # lower-case the name and collapse each run of invalid characters into a single _
    return re.sub(r"[ ,;{}()\n\t=]+", "_", x.lower())

renamed_cols = [canonical(c) for c in df.columns]
df = df.toDF(*renamed_cols)

# write our dataframe in parquet format to an output s3 bucket
df.write.mode("overwrite").format("parquet").save(output_location)

job.commit()
  • Click Save.
  • Click Run job, then click Run job again to confirm.
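
If you prefer to start the run programmatically, here is a minimal sketch using boto3 (assumptions of this sketch: your credentials are allowed to start Glue job runs, and the job name matches the one chosen above):

import boto3

glue = boto3.client("glue")

# start the job we just saved; Glue returns a JobRunId that can be polled later
response = glue.start_job_run(JobName="yourname-datalake-csvtoparquet")
print(response["JobRunId"])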

  1. Go back to the AWS Glue service.
  • Click Jobs.
  • Select the yourname-datalake-csvtoparquet job.
  • Monitor the job until it completes successfully.
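
The run state can also be polled from code instead of the console. A minimal sketch with boto3, assuming the JobRunId printed by the start_job_run sketch above:

import time
import boto3

glue = boto3.client("glue")
job_run_id = "jr_..."  # placeholder: the JobRunId returned by start_job_run

# poll until the run reaches a terminal state
while True:
    run = glue.get_job_run(JobName="yourname-datalake-csvtoparquet", RunId=job_run_id)
    state = run["JobRun"]["JobRunState"]
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)

print(state)  # SUCCEEDED means the Parquet files were written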

  1. Check the data in the S3 bucket yourname-0000-datalake/parquet/listings/.
  • You will see that the data has been converted to Parquet format and is significantly smaller than the original CSV.
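
You can also confirm the output from code. A small sketch that lists the Parquet part files with boto3 (the bucket and prefix are the ones used in output_location above; list_objects_v2 returns at most 1000 keys per call, which is enough here):

import boto3

s3 = boto3.client("s3")

# list the part files Spark wrote under the output prefix
resp = s3.list_objects_v2(Bucket="yourname-0000-datalake", Prefix="parquet/listings/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])  # sizes should be well below the original CSV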

In this step we created a job that converts the data of the listings table from CSV to Parquet format. Next we will do the same for the reviews table.
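
As a preview of that next step, the same script can be reused by changing only two lines, assuming the reviews table was crawled into the same database and the same bucket layout is kept:

your_table_name = "reviews"
output_location = "s3://yourname-0000-datalake/parquet/reviews/"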