In this step we will query data with basic queries.
create table reviews_partition
WITH (
format = 'PARQUET',
external_location = 's3://yourname-0000-datalake/parquet/reviews_partition',
partitioned_by = ARRAY['year']
)
AS SELECT reviews.listing_id,reviews.review_id,reviews.date,reviews.reviewer_id,reviews.year as year
FROM reviews;
select reviews_partition.listing_id,reviews_partition.review_id
from reviews_partition
where reviews_partition.year = 2019;
We can see, the amount of scanned data is greatly reduced when using partition. That will help us optimize performance and costs. We can choose any column to perform data partitioning, but usually we often partition data by time.
select reviews.listing_id,reviews.review_id
from reviews
where reviews.year = 2019;
The amount of data that needs to be scanned for non-partitioned tables is ~ 3 times higher.