Data Lake Powered BI Solution for Better Decision Making

A public transportation agency in the US has been operating for the last 30 years across 80 routes with 500 fleets and has an annual ridership of 2.5 billion. It was generating data on the order of terabytes (TB) each day.

Our Client

  • Multiple software systems for different operations and departments made a mess of the collected data.
  • The client lacked a clear understanding of the business because data was stored in different places and was prone to manual error.
  • Heavy manual effort led to human errors and data inconsistencies, which in turn led to misinformation.

Problem Statement

  • Creating a robust and scalable Business Intelligence/Analytics (BI) solution powered by a cloud-based data lake.
  • Providing a BI solution that will run across multiple departments and present different KPIs according to user roles.
  • Creating a data lake that stores the large amount of data generated by various sources in real time and feeds that information to real-time dashboards for KPI monitoring and decision-making.
  • Supplementing the data lake with individual data marts, so that each department’s data is segregated and has its own dashboards.
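
The department data marts and role-based KPIs described above can be illustrated with a minimal sketch. The department names, role names, and KPI names below are hypothetical and are not taken from the client's system; the sketch only shows the idea of segregating records by department and filtering KPIs by user role.

```python
from collections import defaultdict

# Hypothetical mapping of user roles to the KPIs each role may see.
ROLE_KPIS = {
    "operations_manager": ["on_time_rate", "trips_completed"],
    "finance_analyst": ["fare_revenue"],
    "executive": ["on_time_rate", "trips_completed", "fare_revenue"],
}


def build_data_marts(records):
    """Group data lake records into one list per department (one 'mart' each)."""
    marts = defaultdict(list)
    for record in records:
        marts[record["department"]].append(record)
    return dict(marts)


def kpis_for_role(role, kpi_values):
    """Return only the KPIs that the given role is allowed to see."""
    allowed = ROLE_KPIS.get(role, [])
    return {name: kpi_values[name] for name in allowed if name in kpi_values}
```

An unknown role receives an empty result, so a dashboard built on this would show nothing by default rather than everything.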

Challenges Involved

The client has been running its operations for many years, and at an enormous scale. As a result, the client had deployed multiple legacy systems. We had to implement new systems internally to handle the various processes.
  • Different systems generated different types and formats of data at different places.
  • Top management had difficulty understanding the vast amount of data at a combined level. This impeded decision-making and obscured opportunities for further improvement.
  • The client’s IT team was not up to date with the technology needed to provide access to the various sources, given how frequently the data was updated.
  • Data access was a significant problem under the PCI data security protocol.
  • Keeping the data lake and corresponding data marts uniform and updated with new data.
  • Efficient data models to store large volumes of data in an optimized format.
  • Keeping the overall solution cost-effective for the client.
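
The first challenge above, different systems producing different formats, can be illustrated with a small sketch that maps two source formats onto one common record shape. The field names (`route`, `boardings`, `routeId`, and so on) and the choice of CSV and JSON as source formats are assumptions made for illustration; the case study does not name the actual formats or schemas.

```python
import csv
import io
import json


def from_csv(text):
    """Convert rows from a hypothetical CSV export into the common record shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "route_id": row["route"],
            "event_time": row["timestamp"],
            "riders": int(row["boardings"]),
        }
        for row in reader
    ]


def from_json(text):
    """Convert items from a hypothetical JSON feed into the same record shape."""
    return [
        {
            "route_id": str(item["routeId"]),
            "event_time": item["time"],
            "riders": int(item["count"]),
        }
        for item in json.loads(text)
    ]
```

Once every source has an adapter like these, downstream storage and dashboards only need to understand one schema.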

Our Solution

The complete pipeline for creating the data lake, and for using it to build the data marts and their dashboards, followed the steps listed here:
  • Understanding the individual systems: how they generate data in different formats, how their storage structures work, and at what frequencies they produce data.
  • Narrowing down to a single format for each data stream captured from the client, while keeping in mind the one-time load, incremental load, and Change Data Capture (CDC).
  • Finalizing the data lake architecture and setting up individual data pipelines (ETL), considering the variety, velocity, and integrity of the data.
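
The three load modes named in the steps above differ in how much of the source they read. The following is a minimal in-memory sketch, not the pipeline that was actually built: the `id` and `updated_at` fields, the integer timestamps, and the event layout with an `op` key are all assumptions made for illustration.

```python
def full_load(source_rows):
    """One-time load: copy every source row into the target, keyed by id."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_load(target, source_rows, watermark):
    """Incremental load: copy only rows changed after the last watermark.

    Returns the new watermark to store for the next run.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


def apply_cdc(target, events):
    """CDC: replay insert, update, and delete events in the order received."""
    for event in events:
        row = event["row"]
        if event["op"] == "delete":
            target.pop(row["id"], None)
        else:  # "insert" and "update" are both treated as an upsert
            target[row["id"]] = dict(row)
```

The incremental load cannot see deleted rows because it only scans what still exists in the source, which is one reason a CDC stream is considered alongside it.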

Technical Architecture

Since cloud deployment was in scope, Amazon Web Services (AWS) was the natural choice for creating the data lake because of the supplementary services AWS provides.
The following is the optimized architecture created for the data lake:

Business Impact

This data lake BI solution helped the client manage its data and achieve all of its targets.
  • Time for data analysis and decision-making was reduced to minutes.
  • Automated orchestration of data from disparate sources eliminated manual intervention and the chance of error that came with it.
  • More timely, accurate, and less laborious access to high-value reporting and KPIs.
  • Analytics is delivered to the users with comprehensive options for better analysis.

Looking for a Similar Solution?
We Can Help