We close Chapter 29 with a topic that becomes crucial once a company accumulates a lot of data: data governance. Having a data lake (subchapter 29.1) full of valuable information is great, but it raises serious questions: who can see which data? How do you protect sensitive information? How do you control access centrally when the lake holds data from the entire company? AWS's answer is Lake Formation: a service to build, secure, and govern your data lake centrally.

The problem: a data lake without control is a risk

A data lake gathers a lot of data from the entire company in one place (S3). That’s powerful, but also dangerous if you don’t control who accesses what:

In the data lake there is all kinds of data:
   - Public data (product catalog)
   - Internal data (sales)
   - SENSITIVE data (customer personal data, finances...)
   → NOT everyone should be able to see EVERYTHING

Without good access control:

  • Anyone with access to the lake could see sensitive data they shouldn't (a serious risk; remember privacy and compliance from Chapter 23).
  • Managing permissions “by hand” over millions of files in S3 would be unfeasible and error-prone.
  • It would be difficult to prove (to auditors, for regulations) that the data is well protected.

You need a centralized and granular way to govern who accesses which data. That’s Lake Formation.

What is Lake Formation

AWS Lake Formation is a service that makes it easier to build, secure, and govern a data lake centrally. Its most prominent feature is granular, centralized access control: from a single place, you define who can access which data, down to specific tables and columns.

   Lake Formation (centralized data lake governance):
   ├── build the data lake more easily
   ├── control access in a GRANULAR and centralized way
   │      "this team sees the sales table, but NOT the personal data column"
   └── audit who accesses what

Analogy: Lake Formation is like the access control and security system of a large library or national archive. It’s not enough to have all the documents stored (that’s the data lake); you need to control who can enter which section: the general public accesses the common room, accredited researchers access special archives, and only authorized personnel access confidential documents. Lake Formation is that system that, from a central point, decides and monitors who accesses each part of your data.

What Lake Formation brings you

  1. Build the data lake more easily

Lake Formation simplifies the steps to set up the lake: bringing in data, organizing it, and cataloging it (it works together with Glue, subchapter 29.1).

  2. Granular and centralized access control

This is the star feature. From a single place, you define who can access which data, in great detail:

Examples of granular permissions with Lake Formation:
   - "The marketing team can see the customers table,
      but NOT the email and phone columns" (column level)
   - "The finance team sees the complete sales data"
   - "Analysts only see aggregated data, not individual data"

Instead of managing permissions file by file in S3 (chaos), you define clear rules at the data level (databases, tables, columns), centrally. This connects with the least privilege we saw in IAM (subchapter 7.2): everyone accesses only the data they need.
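
As a rough illustration of the column-level rule above, the sketch below builds the arguments that boto3's Lake Formation `grant_permissions` call accepts for a grant that excludes specific columns. The role ARN, the database name (`crm`), and the helper name `column_grant` are assumptions made up for this example; the code only builds the request and does not call AWS.

```python
import json


def column_grant(principal_arn, database, table, excluded_columns):
    """Build keyword arguments for a SELECT grant that hides some columns.

    The shape follows boto3's lakeformation `grant_permissions` request;
    every concrete value passed in below is hypothetical.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the ones listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical marketing role and database name, for illustration only.
marketing_grant = column_grant(
    principal_arn="arn:aws:iam::111122223333:role/marketing-analysts",
    database="crm",
    table="customers",
    excluded_columns=["email", "phone"],
)

print(json.dumps(marketing_grant, indent=2))

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**marketing_grant)
```

The point of the design is that one rule per team and table replaces any per-file permission management: the excluded columns are named once, at the data level.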

  3. Protect sensitive data

Thanks to that granular control, you can protect sensitive information (personal, financial data) ensuring that only those who should see it can, while others access the rest. It’s key for privacy compliance.
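
To make the effect concrete, here is a toy model in plain Python of what column-level protection means for the person querying. It is not how Lake Formation is implemented (the service enforces permissions inside the integrated query engines, such as Athena); the `ALLOWED_COLUMNS` table, the team names, and the sample row are all invented for this example.

```python
# Toy policy: which columns of the customers table each team may read.
ALLOWED_COLUMNS = {
    "marketing": {"customer_id", "name", "country"},
    "support": {"customer_id", "name", "country", "email", "phone"},
}


def visible_rows(team, rows):
    """Return the rows with only the columns the team is allowed to see."""
    allowed = ALLOWED_COLUMNS.get(team, set())  # unknown team: sees nothing
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


customers = [
    {"customer_id": 1, "name": "Ana", "country": "ES",
     "email": "ana@example.com", "phone": "555-0100"},
]

print(visible_rows("marketing", customers))  # no email or phone
print(visible_rows("intern", customers))     # no columns at all
```

A team that is not in the policy gets no columns, which mirrors the deny-by-default stance of least privilege.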

  4. Auditing and compliance

Lake Formation lets you log who accesses which data and demonstrate it afterwards, which is essential for audits and for complying with regulations (this links to compliance in Chapter 23). You get a central view of your data security.
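
A minimal sketch of the kind of report an auditor might ask for: given access records, list which principals read which tables. The record format here is hypothetical and heavily simplified; in practice the raw events would come from a logging service such as AWS CloudTrail, whose real event structure is richer.

```python
from collections import defaultdict


def access_report(records):
    """Map each principal to the sorted list of tables it accessed."""
    tables_by_principal = defaultdict(set)
    for record in records:
        tables_by_principal[record["principal"]].add(record["table"])
    return {p: sorted(t) for p, t in sorted(tables_by_principal.items())}


# Invented sample records; the field names are assumptions for this sketch.
records = [
    {"principal": "role/finance", "table": "sales"},
    {"principal": "role/marketing", "table": "customers"},
    {"principal": "role/finance", "table": "customers"},
    {"principal": "role/finance", "table": "sales"},
]

for principal, tables in access_report(records).items():
    print(principal, "->", ", ".join(tables))
```

Repeated accesses collapse into a single entry, so the report answers "who accessed what" rather than "how often".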

Why it matters: from “data chaos” to “governed data lake”

The great value of Lake Formation is turning a potentially chaotic and insecure data lake into a governed one, where you know exactly who accesses what, you protect sensitive data, and you can prove it. Without governance, a data lake full of valuable data is also a ticking time bomb for security and compliance. With Lake Formation, it's a secure and well-controlled asset.

   Without governance:  data lake = lots of data + uncontrolled access = RISK
   With Lake Formation: data lake = lots of data + controlled access = SECURE ASSET

Real-world example: a healthcare company has a data lake with patient data (very sensitive), operational data, and public data. They use Lake Formation to govern it. They define, centrally: researchers access anonymized and aggregated data (without seeing identities), authorized medical staff access the full data of their patients, and the marketing team only accesses public data. Columns with identifiable personal data are protected and only visible to those with explicit authorization. When a data protection audit arrives, the company easily demonstrates who accesses what. Without governance, this lake would be a huge legal risk; with Lake Formation, it is a controlled, secure, and compliant system.

How Chapter 29 closes

Lake Formation completes the data platform we have built in this chapter:

S3 + Glue + Athena (29.1)  → store and query the data lake
Kinesis (29.2)             → ingest real-time data
Redshift (29.3)            → fast, large-scale analytics (data warehouse)
Lake Formation (this)      → GOVERN and SECURE everything (who accesses what)

The first pieces build and exploit the data; Lake Formation ensures that all of it is secure, controlled, and compliant. A complete data platform needs both: capability and governance.

What you should remember

  • A data lake gathers a lot of data (including sensitive data) from the entire company; without access control, it’s a serious security and compliance risk, and managing permissions “by hand” over millions of files is unfeasible.
  • AWS Lake Formation makes it easier to build, secure, and govern a data lake centrally. It works like the access control system of a large archive.
  • Its star feature is granular and centralized access control: you define from a single place who accesses which data, down to the level of tables and columns (in line with IAM's least privilege), instead of managing permissions file by file in S3.
  • It brings: easier lake setup, protection of sensitive data (key for privacy), and auditing/compliance (demonstrating who accesses what).
  • It turns a chaotic and insecure data lake into a governed and secure one: the difference between a risk and an asset. Capability (29.1-29.3) plus governance (Lake Formation) = complete data platform.

You’ve completed Chapter 29 and mastered data platforms in AWS: data lakes, streaming, data warehouse, and data governance! In Chapter 30 we’ll return to the realm of large-scale organization: how to structure multiple accounts and landing zones for large enterprises.

Cloud, AWS & Terraform — From Zero to Expert

© Copyright 2024. All rights reserved