What is a Data Lake?

Austin Bauer
Chris Jennings
Zach Slayton

Pioneers are already realizing value from new technologies and approaches such as data lakes, data science, column store databases, Hadoop, NoSQL databases, cloud platforms and cloud infrastructure. Fast followers are yearning to understand how to apply these enablers to gain competitive advantage from their information. That yearning is the catalyst for this blog series, which will focus on demystifying the technologies and concepts that enable these solutions.

Up First:  The Data Lake

A data lake is a fancy term for an architectural concept similar to a staging area that allows access to a wide array of data in its native or raw format. Let's continue by describing some of the core concepts of a data lake.

•     Schema on Read vs. Schema on Write - In this context, a schema is a structured representation of a concept, typically a business concept. Think of a schema as your data model, where business concepts are well defined. In a traditional data warehousing environment, the schema is known before data is loaded or written to it. This is a schema on write model. In a data lake, data is placed in a loosely structured environment without a strong definition of how the data is organized. To interpret the data in a data lake, a schema is created when the data is read; hence, schema on read.

•     Raw Data - Data lakes apply minimal transformation or other pre-processing to the data beyond putting it in a place where it can be found.

•     Data Varieties - The data community historically thought of data as having a structure, e.g. flat files, XML, tables, etc. Such data has an order that is pre-defined and easily understood, hence structured data. Today's definition expands to include semi-structured and unstructured data as well. Text from tweets, forms, and manuals, along with photos, videos, and audio files, are all examples of data.

•     Technology - Data lakes are most frequently thought of as running on Hadoop or a Hadoop-based platform (e.g. Hortonworks, Cloudera, Azure Data Lake). Remember, though, that a data lake is an architectural concept and not a specific technology. Data lakes can be implemented in a wide variety of technologies, from a simple filesystem to a relational database that supports blobs.

•     Application - An application in the context of a data lake is a program or a tool that makes sense of the raw data by creating a schema at runtime. It then processes the data into a more useful form. Examples of an application are: an R program performing machine learning, a TensorFlow program performing image identification, and an ETL process feeding data into a traditional data warehouse.
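
To make the schema on read idea concrete, here is a minimal Python sketch. It assumes the raw data was landed as one JSON record per line; the field names, values, and function names are invented for illustration and do not come from any particular platform. Two small "applications" read the same untouched records, and each one applies its own schema only at the moment it reads.

```python
import json

# Raw records exactly as they were landed: no schema was enforced on write.
# The fields and values are hypothetical, chosen only for illustration.
RAW_LINES = [
    '{"user": "ann", "amount": "19.99", "ts": "2017-03-01"}',
    '{"user": "bob", "amount": "5", "note": "gift"}',
    '{"user": "cat"}',
]


def read_purchases(lines):
    """Apply a schema at read time: pick fields, cast types, fill defaults."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            "user": str(record["user"]),
            # Amounts arrive as strings or not at all; the reader decides
            # how to interpret them.
            "amount": float(record.get("amount", 0.0)),
        })
    return rows


def read_notes(lines):
    """A second application reads the same raw data with a different schema."""
    return [
        {"user": record["user"], "note": record["note"]}
        for record in map(json.loads, lines)
        if "note" in record
    ]


purchases = read_purchases(RAW_LINES)
total = sum(row["amount"] for row in purchases)
notes = read_notes(RAW_LINES)
```

In a schema on write model, the record without an amount would typically have been rejected or fixed before loading. Here that decision is deferred to each reader, which is where both the speed and the extra required data knowledge come from.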

Given those concepts, a clearer definition of a data lake is:

•     A way to store a wide variety of raw data in a loosely structured format that is,

•     Implemented in a technology that allows for highly performant processing of the data,

•     Using an application that applies a schema on read approach.

Are Low Volume Data Lakes Even a Thing?

Notice that volume was not used in the definition above, even though some definitions of a data lake treat volume as an important component. If your data volume is below the terabyte or petabyte scale, your organization can still apply a data lake architecture to unstructured data and achieve valuable results. Volume does affect the technology you choose to implement a data lake; your architects may weigh volume and determine that Hadoop is overkill and a more traditional tool is the most appropriate solution.
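
As a rough sketch of what a low volume data lake on a simple filesystem could look like, the Python below lands files untouched under a folder per source and landing date, then lists them back. The folder layout, the function names, and the sample files are assumptions made for illustration, not a standard.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, filename, payload, landed_on):
    """Write bytes untouched under <root>/raw/<source>/<YYYY-MM-DD>/<filename>."""
    target_dir = Path(lake_root) / "raw" / source / landed_on.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no transformation
    return target


def find_raw(lake_root, source):
    """List every landed file for a source so the data can be found later."""
    source_dir = Path(lake_root) / "raw" / source
    return sorted(p for p in source_dir.rglob("*") if p.is_file())


# A temporary directory stands in for the lake's root folder.
lake_root = tempfile.mkdtemp()
day = date(2017, 3, 1)

# Structured and unstructured data land side by side in their native format.
land_raw(lake_root, "crm", "contacts.csv", b"id,name\n1,Ann\n", day)
land_raw(lake_root, "crm", "notes.txt", b"Call Ann back on Friday.", day)

files = find_raw(lake_root, "crm")
```

Nothing here depends on Hadoop; the same layout could later be moved to a larger platform if volume grows.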

Ok, But Why Would I Go To All That Trouble? 

In short, speed to results. Traditional data warehouse environments place a large emphasis on conformity. This has tremendous value to a business but takes time to achieve. Data scientists and power users can access the data lake and interpret the data almost immediately after it lands. This leads to much faster outcomes than the traditional warehouse environment but requires a higher level of data knowledge than the conformed data warehouse does.

How to Avoid Data Swamps

You may have heard some disillusionment around the implementation of data lakes, or data swamps as naysayers often call them. If your business craves the flexibility to rapidly add new data to your environment, a data lake may be for you! We recommend taking an incremental approach to your first data lake in order to develop your lessons learned and grow your capabilities.  You should also choose a partner to help you architect and design the data lake so their lessons become yours and your organization avoids inadvertently creating a data swamp.