Data Partition

Partition data in Athena

In this step we will query data with basic queries.

  1. Execute the query below to create a new table with data partitioned by year.

create table reviews_partition
WITH (
     format = 'PARQUET',
     external_location = 's3://yourname-0000-datalake/parquet/reviews_partition',
     partitioned_by = ARRAY['year']
)
AS SELECT reviews.listing_id,reviews.review_id,reviews.date,reviews.reviewer_id,reviews.year as year
FROM reviews;
  1. Execute the query below.
select reviews_partition.listing_id,reviews_partition.review_id
from reviews_partition
where reviews_partition.year = 2019;

We can see, the amount of scanned data is greatly reduced when using partition. That will help us optimize performance and costs. We can choose any column to perform data partitioning, but usually we often partition data by time.

DataLake

  1. Execute the query below and compare the amount of scanned data when querying a non-partitioned table.
select reviews.listing_id,reviews.review_id
from reviews
where reviews.year = 2019;

The amount of data that needs to be scanned for non-partitioned tables is ~ 3 times higher.

DataLake