This article records a preliminary understanding of what a data lake is and why it is used.

Concept of Data Lake

What is a data lake? Generally speaking, the data generated by an organization is kept in a storage platform, which we call the "data lake".

I personally think that a data lake should be an evolving and scalable infrastructure for big data storage, processing and analysis. To achieve full acquisition, full storage, multi-mode processing and full life-cycle management of data from any source, at any scale, and of any type, it must interact and integrate with various external heterogeneous source systems, making it the go-to place for any data in an organization.

The data sources of a lake are diverse. Some may be structured data, some may be unstructured data, and some may even be binary data. The data can arrive in batch or streaming form. Because the lake accepts data from various sources, it can preserve the original data and also be used to track the lineage of data transformations.
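To make the idea of preserving original data and tracking lineage concrete, here is a minimal sketch in Python. It assumes a local directory stands in for the lake; the `raw/<source>/<name>` layout, the `lineage.jsonl` log, and the function name `land_raw` are hypothetical choices for illustration, not part of any particular product.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> dict:
    """Store payload unchanged in a raw zone and append a lineage record."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # original bytes, no parsing or cleaning

    record = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "derived_from": [],  # raw objects have no parent datasets
    }
    # Hypothetical append-only lineage log, one JSON object per line.
    with (lake_root / "lineage.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def demo() -> list:
    """Land one structured, one semi-structured, and one binary object."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        return [
            land_raw(root, "crm", "customers.csv", b"id,name\n1,Ada\n"),
            land_raw(root, "clickstream", "events.json", b'{"user": 1}\n'),
            land_raw(root, "camera", "frame.bin", bytes([0, 255, 16])),
        ]
```

A later transformation step could write its output with `derived_from` set to the paths it read, which is one simple way to keep a lineage trail. Real lakes usually delegate this to a catalog or metadata service.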

Data engineers or transformation engineers stand at the entrance of the lake, use equipment to check the water quality, and pump water out of the lake.
The lake can serve as a staging area for the data warehouse.

Data scientists or analysts use the lake for discovery and ideation. They extract value from the data lake through machine learning.

In summary, the data lake has four main characteristics:

Store raw data

The sources of this raw data are very rich.
  • Structured data
  • Semi-structured data
  • Unstructured data
  • Binary data (pictures, videos, etc.)

Support multiple computing models

  • Batch processing
  • Stream computing
  • Interactive analysis
  • Machine learning

Data management capabilities

  • Connect to multiple data sources with different access times
  • Support schema management
  • Support authority (permission) management

Flexible underlying storage

  • Generally uses cheap distributed storage such as S3, OSS, or HDFS
  • Supports the Parquet, Avro, and ORC file formats
  • Supports data cache acceleration
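Storing raw data while still supporting schema management is often done with a "schema-on-read" approach: records are kept as they arrived, and a schema is applied only when they are read. The sketch below illustrates that idea with the standard library only. The sample records, the `ORDER_SCHEMA` mapping, and the function names are made up for illustration; a real lake would more likely rely on a table format or catalog for this.

```python
import json

# Raw, semi-structured records exactly as they might have been landed.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.90", "channel": "web"}',
    '{"user_id": "7", "amount": "5", "extra": {"coupon": "X1"}}',
    '{"user_id": "oops", "amount": "1.00"}',
]

# Hypothetical read-time schema: field name -> (converter, default).
# A default of None marks the field as required.
ORDER_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, None),
    "channel": (str, "unknown"),
}


def apply_schema(line, schema):
    """Project one raw JSON line onto the schema; return None if it does not fit."""
    raw = json.loads(line)
    row = {}
    for field, (convert, default) in schema.items():
        if field in raw:
            try:
                row[field] = convert(raw[field])
            except (TypeError, ValueError):
                return None  # value cannot be converted to the declared type
        elif default is not None:
            row[field] = default
        else:
            return None  # required field is missing
    return row  # fields not named in the schema are ignored


def read_with_schema(lines, schema):
    """Split raw lines into typed rows and rejected originals."""
    rows, rejected = [], []
    for line in lines:
        row = apply_schema(line, schema)
        if row is None:
            rejected.append(line)
        else:
            rows.append(row)
    return rows, rejected
```

Calling `read_with_schema(RAW_LINES, ORDER_SCHEMA)` yields two typed rows and one rejected line. The raw lines themselves are never modified, so a different schema can be applied to the same data later.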