The explosion of new types of data in recent years – from inputs such as the web and connected devices, or just sheer volumes of records – has put tremendous pressure on the Enterprise Data Warehouse.
In response to this disruption, an increasing number of organizations have turned to a Hybrid Data Warehouse built around a Data Lake to help manage the enormous increase in data while keeping the data warehouse coherent. Complementing an existing enterprise data warehouse with a data lake, creating a hybrid architecture, can be a smart first step for many companies. It provides more flexibility and speed in processing data and in capturing unstructured, semi-structured and streaming data, and it frees up capacity in the data warehouse for well-defined, repeatable business intelligence activities. It’s also a use case that typically produces a measurable return on investment.
What is a Data Lake?
A Data Lake is a single data repository for storing (either physically or logically) all of the organization’s data, including data generated from internal transactions and interactions as well as data gathered from third-party and publicly available sources. The Hadoop Distributed File System (HDFS) is the preferred data lake platform because it provides a cost-effective, powerful, agile, scale-out environment for assembling, preparing, aligning, enriching, and analyzing diverse structured and unstructured data sources.
What is a hybrid Data Warehouse?
A hybrid Data Warehouse does two things:
- It simplifies the offload of “cold data” from the traditional Enterprise Data Warehouse to a low-cost storage solution like Hadoop, and it offers seamless access to all data (both warehoused and offloaded to Hadoop) by applying high-performance query federation.
- It extends the value of the traditional Enterprise Data Warehouse by logically combining it with unstructured data from the lake and from other corporate and public sources.
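The offload-and-federate idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: an in-memory SQLite database stands in for the Enterprise Data Warehouse, a CSV string stands in for a file in the lake, and the names `COLD_CUTOFF`, `offload_cold_rows` and `federated_orders` are hypothetical. A real deployment would rely on a query federation engine rather than hand-written merging.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical policy: orders dated before this are treated as "cold".
COLD_CUTOFF = date(2023, 1, 1)


def build_warehouse():
    """Create an in-memory SQLite database standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "2022-03-15", 120.0),
            (2, "2022-11-02", 75.5),
            (3, "2023-02-10", 310.0),
            (4, "2023-06-21", 42.0),
        ],
    )
    return conn


def offload_cold_rows(conn, cutoff):
    """Move rows older than the cutoff out of the warehouse into CSV text."""
    cutoff_text = cutoff.isoformat()
    cold = conn.execute(
        "SELECT order_id, order_date, amount FROM orders "
        "WHERE order_date < ? ORDER BY order_id",
        (cutoff_text,),
    ).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["order_id", "order_date", "amount"])
    writer.writerows(cold)
    # Free warehouse capacity once the cold rows are safely written out.
    conn.execute("DELETE FROM orders WHERE order_date < ?", (cutoff_text,))
    return buffer.getvalue()


def federated_orders(conn, lake_csv, start, end):
    """Return orders dated within [start, end] from both stores, sorted by date."""
    lo, hi = start.isoformat(), end.isoformat()
    rows = [
        {"order_id": r[0], "order_date": r[1], "amount": r[2], "source": "warehouse"}
        for r in conn.execute(
            "SELECT order_id, order_date, amount FROM orders "
            "WHERE order_date BETWEEN ? AND ?",
            (lo, hi),
        )
    ]
    # ISO dates compare correctly as plain strings, so no parsing is needed.
    for rec in csv.DictReader(io.StringIO(lake_csv)):
        if lo <= rec["order_date"] <= hi:
            rows.append(
                {
                    "order_id": int(rec["order_id"]),
                    "order_date": rec["order_date"],
                    "amount": float(rec["amount"]),
                    "source": "lake",
                }
            )
    return sorted(rows, key=lambda r: r["order_date"])


if __name__ == "__main__":
    warehouse = build_warehouse()
    lake_file = offload_cold_rows(warehouse, COLD_CUTOFF)
    for row in federated_orders(
        warehouse, lake_file, date(2022, 10, 1), date(2023, 3, 1)
    ):
        print(row)
```

The caller asks one question over a date range and does not need to know which rows were offloaded. That is the "seamless access" property described above, reduced to its simplest form.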
A hybrid Data Warehouse provides the following benefits to an enterprise:
- New efficiencies for data architecture through a significantly lower cost of storage, and through optimization of data processing workloads such as data transformation and integration.
- New opportunities for business through flexible “Schema-on-Read” access to all enterprise data, and through multi-use and multi-workload data processing on the same sets of data, from batch to real-time.
- A data lake’s flexible architecture enables faster loading of data and parallel processing, resulting in faster time to insight. The data lake is also much more effective than the data warehouse for processing the increasing amount of unstructured and semi-structured data that’s important for analytics today.
- Augmenting a data warehouse with a data lake makes it possible for the enterprise to use its assets more strategically.
- The Hybrid Data Warehouse is designed to deliver strong query performance, and it introduces agility and a sound information architecture by enabling reuse of access and integration logic across applications.
- Reduces development costs and time to solution. Reuse reduces maintenance costs and creates
- The Hybrid Data Warehouse supports more data sources (specialized NoSQL data stores, public web data, cloud apps, etc.) and can expose data to more consumers (linked data through RESTful APIs, for instance), improving the overall Return on Assets of your data solutions.
- Extends the application of “schema-on-read” to specialized NoSQL operational data stores outside the Hybrid Data Warehouse, such as document, graph or hierarchical databases.
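The “schema-on-read” idea mentioned in the benefits above can be made concrete with a small sketch. It is an illustration under stated assumptions, not a description of any particular product: the raw events, the two schemas and the `read_with_schema` helper are all hypothetical. Raw JSON records are stored exactly as they arrive, and each consumer applies its own schema only when it reads them.

```python
import json

# Raw events as they might land in the lake: no schema is enforced on write,
# so fields vary between records and "amount" is sometimes a string.
RAW_EVENTS = """\
{"user": "alice", "action": "login", "ts": "2023-05-01T08:00:00"}
{"user": "bob", "action": "purchase", "ts": "2023-05-01T08:05:00", "amount": "19.99"}
{"user": "alice", "action": "purchase", "ts": "2023-05-01T09:30:00", "amount": 5, "coupon": "SPRING"}
"""

# Each hypothetical schema maps an output field to (source key, converter, default).
PURCHASE_SCHEMA = {
    "user": ("user", str, None),
    "amount": ("amount", float, 0.0),
}
ACTIVITY_SCHEMA = {
    "user": ("user", str, None),
    "action": ("action", str, "unknown"),
    "timestamp": ("ts", str, None),
}


def read_with_schema(raw_text, schema, predicate=None):
    """Parse raw JSON lines and shape each record with the given schema."""
    records = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        if predicate is not None and not predicate(doc):
            continue
        row = {}
        for field, (key, convert, default) in schema.items():
            value = doc.get(key)
            # Missing keys fall back to the default instead of failing the read.
            row[field] = convert(value) if value is not None else default
        records.append(row)
    return records


if __name__ == "__main__":
    purchases = read_with_schema(
        RAW_EVENTS, PURCHASE_SCHEMA, lambda d: d.get("action") == "purchase"
    )
    activity = read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA)
    print(purchases)
    print(activity)
```

The same stored records serve both a purchases view and an activity view. Adding a third view means writing another schema, not reloading or restructuring the data.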