Big data architecture has evolved considerably over the last decade, and so have big data technologies and products. Lambda architecture is one such popular big data architecture: it offers data processing with high throughput and low latency, and it targets near real-time applications. Lambda architecture consists of a batch layer, a speed layer and a serving layer.
Referring to the image above, we may have an architecture where the speed layer for near real-time applications is implemented with Azure Stream Analytics, and the batch layer with Azure Data Lake Storage Gen2 plus Databricks for AI/ML modelling. For the serving layer, we may use PolyBase to run SQL queries.
Such an architecture has multiple advantages:
- Mix of speed and reliability because of two distinct layers
- Low latency and high throughput
- Scalable and fault tolerant
Enterprises, however, started to face challenges with such an architecture, mainly for the reasons below:
- It needed two channels to be maintained at the same time, which was a significant hit on OPEX or run-the-business (RTB) costs
- The Change Advisory Board (CAB) process was extensive for such an architecture, as it needed to go back to change schemas and update downstream systems
- Reprocessing of batch jobs
- Migrating to another technology was a challenge
- Compliance requirements that came in over time needed excessive rework
- Version management and metadata management
- Data governance and data quality challenges
In early 2019, Databricks announced the Delta Lake architecture at the Spark + AI Summit. Delta Lake was subsequently moved to the Linux Foundation for hosting. According to the Linux Foundation website, Delta Lake has been adopted by over 4,000 organisations and processes over two exabytes of data each month (just for a quick note: 1 exabyte = 1 billion GB).
Delta Lake sits as a storage layer on top of the core data lake and works with the Apache Spark APIs.
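As a minimal sketch of what that looks like in practice (the path and column name below are illustrative, and the Delta libraries are assumed to be available, as they are on a Databricks cluster), a Delta table is written and read with the familiar Spark DataFrame API:

```python
# Minimal sketch: Delta Lake as a storage layer accessed through standard Spark APIs.
# The path is illustrative; on Databricks the Delta libraries are available by default.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-quickstart").getOrCreate()

# Write a DataFrame to the data lake in Delta format
# (Parquet data files plus a _delta_log transaction log).
events = spark.range(0, 100).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back with the same Spark API used for any other data source.
spark.read.format("delta").load("/tmp/delta/events").show(5)
```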
According to the official documentation, Delta Lake provides:
- ACID (atomicity, consistency, isolation, durability) transactions on Spark
- Scalable metadata handling
- Streaming and batch unification
- Schema enforcement
- Time travel
- Upserts and deletes
As shown in the diagram, there is now stream and batch unification, unlike the Lambda architecture, which needed two separate channels to be maintained. We then have three quality levels, as shown in the table above. The purest form of the data is in the GOLD layer, which is used for business-level aggregates.
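To illustrate this unification, here is a hedged sketch (the paths, schema and transformations are assumptions for the example, and `spark` is the session a Databricks cluster provides): raw data is streamed into a bronze Delta table, and the very same table is then read in batch to build the silver and gold layers.

```python
# Illustrative sketch of stream/batch unification across bronze, silver and gold
# Delta tables. All paths, the schema and the transformations are assumptions.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

bronze_path = "/mnt/datalake/bronze/events"
silver_path = "/mnt/datalake/silver/events"
gold_path = "/mnt/datalake/gold/daily_totals"

# Speed path: continuously ingest raw JSON files into the bronze Delta table.
raw_stream = (spark.readStream
              .schema(event_schema)
              .json("/mnt/datalake/landing/events"))

(raw_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/datalake/_checkpoints/bronze_events")
 .outputMode("append")
 .start(bronze_path))

# Batch path: the same bronze table is read in batch to refine silver and gold.
bronze = spark.read.format("delta").load(bronze_path)
silver = bronze.filter("event_id IS NOT NULL").dropDuplicates(["event_id"])
silver.write.format("delta").mode("overwrite").save(silver_path)

gold = (silver.groupBy(F.to_date("event_time").alias("day"))
        .agg(F.sum("amount").alias("total_amount")))
gold.write.format("delta").mode("overwrite").save(gold_path)
```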
Delta Lake can be used with Databricks, and for a quick POC the DBFS can be used as the directory that holds the bronze, silver and gold layers, as shown in the figure above. To maintain data lineage, ACID transactions, scalable metadata handling and versioning (a.k.a. time travel in Delta Lake), the file-level allocation is shown below.
The transaction log acts as the single source of truth and is consulted by Spark whenever a new transaction is about to happen. The transaction log also provides the atomicity of the ACID properties, ensuring that INSERT and UPDATE operations on the data lake either complete fully or do not complete at all.
There is also the concept of checkpoint files, as shown above: every 10 commits to the transaction log, Delta Lake automatically saves a checkpoint file in Parquet format in the same _delta_log subdirectory. For time travel, Spark can skip to the latest checkpoint rather than replaying every commit to retrieve a version.
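A quick sketch of what time travel looks like to the user (the path, version number and timestamp below are assumptions for the example):

```python
# Illustrative time-travel queries against a Delta table; the path, version
# number and timestamp are assumptions for the example.
table_path = "/mnt/datalake/silver/events"

# Read the table as of a specific commit version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

# ...or as of a point in time.
as_of_date = (spark.read.format("delta")
              .option("timestampAsOf", "2020-08-27")
              .load(table_path))

# The commit history that makes this possible can be inspected directly.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)
```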
Additionally, to overcome the problem of concurrent writes, Delta Lake uses optimistic concurrency control.
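The upserts and deletes mentioned earlier are exposed through the DeltaTable API. The sketch below (the table path, join key and updates DataFrame are assumptions for the example) runs a MERGE as a single ACID transaction; if two writers commit conflicting changes at the same time, optimistic concurrency control detects the conflict and fails one of the transactions rather than corrupting the table.

```python
# Illustrative upsert (MERGE) and delete on a Delta table; the path, join key
# and the updates DataFrame are assumptions for the example.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates_df = (spark.createDataFrame(
    [("e-42", "2020-08-28 10:00:00", 99.0)],
    ["event_id", "event_time", "amount"])
    .withColumn("event_time", F.to_timestamp("event_time")))

target = DeltaTable.forPath(spark, "/mnt/datalake/silver/events")

(target.alias("t")
 .merge(updates_df.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()      # update rows that already exist
 .whenNotMatchedInsertAll()   # insert rows that are new
 .execute())

# Deletes are a single call as well.
target.delete("amount < 0")
```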
To sum up, by using the Delta Lake architecture enterprises can avoid the usual data lake reliability challenges and incorporate better data quality and governance. Delta Lake is open source and supported on Azure and AWS. Delta Lake also takes away the problem of maintaining two different channels for batch and stream.
Delta Lake was originally written to work with Apache Spark, but other open source initiatives are now being adopted around it. Companies like Intel and Alibaba have adopted Delta Lake with success, as per the Linux Foundation website.
“As a major cloud provider, Alibaba has been a leader, contributor, consumer, and supporter for various open source initiatives, especially in the big data and AI area. We have been working with Databricks on a native Hive connector for Delta Lake on the open source front, and we are thrilled to see the project joining the Linux Foundation. We will continue to foster and contribute to the open source community.”
– Yangqing Jia, VP of Big Data / AI at Alibaba
“Intel and Databricks have a long history of working together to advance Apache Spark technology with innovative data analytics and AI solutions and to enable enterprise readiness. Databricks Delta Lake contribution to the Linux Foundation is an important open source storage technology that can help the ecosystem improve reliability for data lakes. We look forward to joining in the Delta Lake project and continuing our collaboration with Databricks and the Apache community.”
– Wei Li, Vice President, Intel Architecture, Graphics and Software and General Manager, Machine Learning Performance
To know more:
Watch the Delta Lake webinars from Stay Ahead with StackRoute Webinars
- Common mistakes made with raw data lakes – Delta lake to the rescue!
- Why should you use Databricks for your next Healthcare analytics project?
About the Author:
Anirban Ghatak
- Senior Consultant, heading the AI and Data Science track at StackRoute
- 17 years of experience in enterprise transformation and analytics, and founder of a robotics ed-tech startup from Bengaluru
- Instrumental in leading a multi-million dollar UK utilities industry digital transformation program using data lake
- Azure data scientist who provides enterprise guidance in the areas of machine learning, big data and Azure
Posted on 28 August, 2020