
Data Lake - Introduction

Jul 17, 2021
By Sourabh Joshi

This article records a preliminary understanding of what a data lake is and why it is used.

Concept of Data Lake

What is a data lake? Generally speaking, the data generated by an organization is maintained in a single storage platform, which we call the "data lake".

I personally think that a data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis. To acquire and store everything, process it in multiple modes, and manage the full life cycle of data from any source, at any scale, and of any type, it must interact and integrate with various external heterogeneous source systems. That makes it the go-to place for any data in an organization.

The data sources of a lake are diverse. Some data may be structured, some unstructured, and some even binary. Data can arrive in batches or as streams. Because the lake accepts data from various sources, it can both preserve the original data and support lineage tracking for data transformations.
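To make the lineage idea concrete, here is a minimal sketch, assuming each derived dataset records which inputs it was built from. The `Step` record, the `trace_to_raw` helper, and all dataset names are hypothetical illustrations, not part of any particular data lake product, and the sketch assumes the recorded steps contain no cycles.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One recorded transformation (hypothetical structure)."""
    output: str               # dataset produced by this step
    inputs: Tuple[str, ...]   # datasets it was derived from
    transform: str            # free-text description of what was done


def trace_to_raw(dataset: str, steps: Dict[str, Step]) -> List[str]:
    """Return the raw datasets that `dataset` ultimately derives from."""
    if dataset not in steps:
        # No recorded step produced it, so treat it as original (raw) data.
        return [dataset]
    raw: List[str] = []
    for parent in steps[dataset].inputs:
        for name in trace_to_raw(parent, steps):
            if name not in raw:
                raw.append(name)
    return raw


# Hypothetical lineage: revenue <- clean_orders <- raw_orders, plus raw_fx_rates.
LINEAGE = {
    "clean_orders": Step("clean_orders", ("raw_orders",), "drop null rows"),
    "revenue": Step("revenue", ("clean_orders", "raw_fx_rates"), "join and sum"),
}

print(trace_to_raw("revenue", LINEAGE))
```

Because the original data stays in the lake, any derived dataset can be traced back to, and rebuilt from, the raw inputs listed by such a record.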

Data engineers or transformation engineers stand at the entrance of the lake, use their equipment to check the water quality, and pump water out of the lake. In other words, they check the quality of the data and extract it for downstream use.
The lake can also serve as a staging area for the data warehouse.

Data scientists and analysts use the lake for discovery and ideation. They extract value from the data lake through techniques such as machine learning.

In summary, the data lake has four main characteristics:

Store raw data

The sources of this raw data are varied.

Structured data

Semi-structured data

Unstructured data

Binary data (pictures, videos, etc.)
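As a rough illustration of storing raw data, the sketch below writes an incoming payload byte-for-byte into a "raw" zone, whatever its type, and keeps a small metadata file next to it. The `land_raw` function, the `source=.../ingest_date=...` directory layout, and the sidecar file name are assumptions made for this example; a local directory stands in for the lake's storage.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def land_raw(root: Path, source: str, payload: bytes, ext: str, day: date) -> Path:
    """Store `payload` unmodified in the raw zone and return the file path."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical layout: raw/source=<name>/ingest_date=<YYYY-MM-DD>/<hash>.<ext>
    target_dir = root / "raw" / f"source={source}" / f"ingest_date={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest[:16]}.{ext}"
    target.write_bytes(payload)  # no parsing or conversion: the original bytes are kept

    # Sidecar metadata so the file can be found and verified later.
    meta = {
        "source": source,
        "sha256": digest,
        "bytes": len(payload),
        "ingest_date": day.isoformat(),
    }
    target.with_name(target.name + ".meta.json").write_text(json.dumps(meta))
    return target
```

The same function accepts a JSON document, a CSV export, or an image, since it never inspects the payload. Interpretation is deferred to whoever reads the data later.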

Support multiple computing models

Batch processing

Stream computing

Interactive analysis

Machine learning
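The difference between two of these computing models can be sketched on a toy scale. Below, the same events are aggregated once in batch (all data at once, one final answer) and once as a stream (one event at a time, with an updated answer after each). The event data and function names are made up for illustration; real lakes would use engines built for this, which the sketch does not attempt to model.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical events: (user, amount)
Event = Tuple[str, int]

EVENTS: List[Event] = [("a", 10), ("b", 5), ("a", 7)]


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch model: read the whole dataset, then return one result."""
    totals: Dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals


def stream_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming model: emit an updated snapshot after every event."""
    totals: Dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
        yield dict(totals)  # copy, so earlier snapshots are not mutated


print(batch_totals(EVENTS))
for snapshot in stream_totals(EVENTS):
    print(snapshot)
```

Both models read the same stored data and end with the same totals. What differs is when results become available.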

Data Management capabilities

Connects to multiple data sources that are accessed at different times

Supports schema management

Supports permission management
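Schema management and permission management can be pictured with a very small sketch: a registry that says which fields a dataset should have, and a grant table that says which roles may read it. The `SCHEMAS` and `GRANTS` tables, the dataset and role names, and both helper functions are hypothetical; real lakes delegate this to a catalog or governance service.

```python
from typing import Any, Dict, List, Set

# Hypothetical schema registry: dataset name -> field name -> expected Python type
SCHEMAS: Dict[str, Dict[str, type]] = {
    "orders": {"order_id": int, "customer": str, "total": float},
}

# Hypothetical grants: role -> datasets that role may read
GRANTS: Dict[str, Set[str]] = {"analyst": {"orders"}}


def validate(dataset: str, record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems: List[str] = []
    for field, expected in SCHEMAS[dataset].items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def can_read(role: str, dataset: str) -> bool:
    """Permission check: is `dataset` among the datasets granted to `role`?"""
    return dataset in GRANTS.get(role, set())
```

For example, `validate("orders", {"order_id": "1", "customer": "x"})` reports a type problem for `order_id` and a missing `total`, while `can_read("guest", "orders")` is false because no grant exists for that role.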

Flexible underlying storage

Generally uses low-cost distributed or object storage such as S3, OSS, or HDFS

Supports the Parquet, Avro, and ORC file formats

Supports data cache acceleration
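One way to think about flexible storage with cache acceleration is as a small interface that any backend can implement, with a cache wrapped around it. The sketch below is only an illustration under that assumption: `ObjectStore`, `LocalStore`, and `CachedStore` are hypothetical names, a local directory stands in for S3, OSS, or HDFS, and the cache is a plain in-memory dictionary with no eviction.

```python
from pathlib import Path
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Minimal interface a storage backend is assumed to offer."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalStore:
    """Stand-in backend that keeps objects as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.reads = 0  # counts backend reads, to make the cache effect visible

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        self.reads += 1
        return (self.root / key).read_bytes()


class CachedStore:
    """Wraps any ObjectStore and serves repeated reads from memory."""

    def __init__(self, backend: ObjectStore) -> None:
        self.backend = backend
        self._cache: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.backend.put(key, data)
        self._cache[key] = data  # keep the cache consistent with the backend

    def get(self, key: str) -> bytes:
        if key not in self._cache:
            self._cache[key] = self.backend.get(key)
        return self._cache[key]
```

Code that reads and writes through `ObjectStore` does not need to change when the backend does, and wrapping a backend in `CachedStore` means a second `get` of the same key never reaches the underlying storage.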
