Preparing Data

AWS Glue DataBrew is a visual data preparation tool that helps data analysts, data scientists, and non-technical users prepare data through an interactive visual interface, reducing the amount of code they need to write.

With Glue DataBrew, you can represent, clean, and normalize terabytes, and even petabytes, of data directly from your data lakes, data warehouses, and databases. You can create datasets from data sources such as Amazon S3, the AWS Glue Data Catalog (Amazon Redshift, Amazon Aurora, and Amazon RDS), and AWS Data Exchange. For more information about the data sources that DataBrew supports, see Creating and using AWS Glue DataBrew datasets.

The reference architecture is shown below. Our raw data is stored in Amazon S3 as CSV files. We will use Glue DataBrew to read and prepare the data, then write the transformed data to another Amazon S3 bucket.

[Figure: Datalake reference architecture]
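This workshop performs every step in the DataBrew console, but the first stage of the architecture (registering a CSV file in Amazon S3 as a DataBrew dataset) can also be sketched with boto3. The following is a minimal illustration, not the workshop's method: the dataset name, bucket name, and object key are hypothetical placeholders, and the comma delimiter and header row are assumptions about the CSV file. The helper only builds the request; sending it requires boto3 and valid AWS credentials.

```python
from typing import Any, Dict


def build_dataset_request(name: str, bucket: str, key: str) -> Dict[str, Any]:
    """Build keyword arguments for the DataBrew CreateDataset API call."""
    return {
        "Name": name,
        "Format": "CSV",
        # Assumption: comma-delimited file whose first row holds column names.
        "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def create_dataset(request: Dict[str, Any]) -> None:
    """Send the request to DataBrew (needs boto3 and AWS credentials)."""
    import boto3  # imported here so building the request works without boto3

    boto3.client("databrew").create_dataset(**request)


# Hypothetical names: replace them with your own dataset, bucket, and key.
request = build_dataset_request(
    name="raw-sales-csv",
    bucket="example-raw-bucket",
    key="raw/sales.csv",
)
print(request["Input"]["S3InputDefinition"])
```

Calling `create_dataset(request)` would register the dataset in your account. Keeping the request builder separate from the API call lets you inspect the parameters before anything is created.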

Contents

  1. Setting up DataBrew
  2. Data Profiling
  3. Clean & Transform
  4. Preparing the Next Table
  5. Upload Cleaned Dataset