Transform to Parquet

Convert to Parquet - listings table

  1. Go to the AWS Glue service.
  • Click ETL Jobs.
  • Click Notebooks.
  • Click Notebook.

  1. On the Configure the job properties page:
  • Set the Job name to yourname-datalake-csvtoparquet.
  • Select the IAM role AWSGlueServiceRole-yourname-datalake.

  1. Copy the Python script below into the edit script screen.
  • Note: rename the S3 bucket and Glue database to match your configuration.
import sys
import datetime
import re
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)

## DON'T FORGET TO PUT IN YOUR INPUT AND OUTPUT LOCATIONS BELOW.
your_database_name = "yourname-datalake-db"
your_table_name = "listings"
output_location = "s3://yourname-0000-datalake/parquet/listings/"

job.init("byod-workshop" + str(datetime.datetime.now().timestamp()))

# load our data from the catalog that we created with a crawler
dynamicF = glueContext.create_dynamic_frame.from_catalog(
    database = your_database_name,
    table_name = your_table_name,
    transformation_ctx = "dynamicF")

# invalid characters in column names are replaced by _ (Parquet rejects spaces, commas, etc.)
df = dynamicF.toDF()

def canonical(x):
    # lower-case the name and collapse each run of invalid characters into a single _
    return re.sub(r"[ ,;{}()\n\t=]+", "_", x.lower())

renamed_cols = [canonical(c) for c in df.columns]
df = df.toDF(*renamed_cols)

# write our dataframe in parquet format to an output s3 bucket
df.write.mode("overwrite").format("parquet").save(output_location)

job.commit()
  • Click Save.
  • Click Run job, then click Run job again to confirm.
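
If you prefer to start the run programmatically, here is a minimal sketch using boto3 (assumptions of this sketch: your credentials are allowed to start Glue job runs, and the job name matches the one chosen above):

import boto3

glue = boto3.client("glue")

# start the job we just saved; Glue returns a JobRunId that can be polled later
response = glue.start_job_run(JobName="yourname-datalake-csvtoparquet")
print(response["JobRunId"])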

  1. Go back to the AWS Glue service.
  • Click Jobs.
  • Select the yourname-datalake-csvtoparquet job.
  • Monitor the job until it completes successfully.
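
The run state can also be polled from code instead of the console. A minimal sketch with boto3, assuming the JobRunId printed by the start_job_run sketch above:

import time
import boto3

glue = boto3.client("glue")
job_run_id = "jr_..."  # placeholder: the JobRunId returned by start_job_run

# poll until the run reaches a terminal state
while True:
    run = glue.get_job_run(JobName="yourname-datalake-csvtoparquet", RunId=job_run_id)
    state = run["JobRun"]["JobRunState"]
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)

print(state)  # SUCCEEDED means the Parquet files were written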

  1. Check the data in the S3 bucket yourname-0000-datalake/parquet/listings/.
  • You will see that the data has been converted to Parquet format and is significantly smaller than the original CSV.
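
You can also confirm the output from code. A small sketch that lists the Parquet part files with boto3 (the bucket and prefix are the ones used in output_location above; list_objects_v2 returns at most 1000 keys per call, which is enough here):

import boto3

s3 = boto3.client("s3")

# list the part files Spark wrote under the output prefix
resp = s3.list_objects_v2(Bucket="yourname-0000-datalake", Prefix="parquet/listings/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])  # sizes should be well below the original CSV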

In this step we created a job that converts the data of the listings table from CSV to Parquet format. Next we will do the same for the reviews table.
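
As a preview of that next step, the same script can be reused by changing only two lines, assuming the reviews table was crawled into the same database and the same bucket layout is kept:

your_table_name = "reviews"
output_location = "s3://yourname-0000-datalake/parquet/reviews/"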