Look Before You Leap
Data lakes are not a good fit for every insurer.
- Bill Jenkins
- May 2020
Insurance carriers can quickly become enamored with new technologies. In some instances, a CEO reads about the benefits of an emerging technology and wants the company to use it, and the IT organization scrambles to adopt the technology to appease the CEO. As this scenario unfolds, other projects are dropped, delayed, reduced in scope or simply stall. The rush leaves no time for the due diligence a new technology requires: understanding its ramifications for processes, technologies, human resources and organizational structure.
This scenario is playing out in many organizations today as senior management examines their company's use of big data. To store unstructured data, many company executives believe they must create a data lake. This solution is intriguing to many CIOs because the organization can now (potentially) bypass the investment and implementation of the building blocks needed to manage and control the data being put into the company's data warehouse.
However, if unmanaged raw data is dumped into the data lake, it can create a “data swamp.” The lake becomes overwhelmed by data that is of poor quality, lacks proper synchronization and cannot be easily identified or located. Again, the company has wasted significant time and money and incurred the opportunity cost of dropping other needed projects.
What goes unrecognized by the company is that data lakes also need to have similar data governance, data management and data quality processes and practices as a data warehouse.
The question then becomes whether an insurance carrier should invest in maturing their existing data warehouse or invest in developing and implementing a data lake.
Both data warehouses and data lakes are designed to store large volumes of data. However, each serves a different purpose and requires different skills to use successfully. The major similarities and differences between the two are explained below.
The Data Warehouse
The data warehouse became fashionable in the 1970s and 1980s. Data warehouses were populated with transactional/operational data to be used primarily for managers to view the financial and operational health of the organization. The warehouses contained the data used to report to regulatory and financial rating agencies, as well as to support “business intelligence” with “data marts” that are part of the overall data warehouse.
Data is generally based on pre-established management reporting requirements and is filtered, cleansed, structured and standardized prior to being placed into the warehouse. This upfront data preparation process—called “schema on write”—involves significant effort and investment. It ensures that this stored data is of high fidelity for regulatory reporting and possesses the reliability needed to manage the company.
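The schema-on-write idea can be sketched in a few lines of Python. This is an illustrative example only, not the author's method: the schema, field names and sample record are hypothetical, and a plain list stands in for the warehouse table. The point is that every record is filtered, cleansed, standardized and validated before it is written.

```python
from datetime import date

# Hypothetical warehouse schema: every row must conform to this
# structure BEFORE it is written ("schema on write").
POLICY_SCHEMA = {"policy_id": str, "premium": float, "effective": date}

def prepare_row(raw: dict) -> dict:
    """Filter, cleanse and standardize one raw record before loading."""
    row = {
        "policy_id": raw["policy_id"].strip().upper(),
        "premium": round(float(raw["premium"]), 2),
        "effective": date.fromisoformat(raw["effective"]),
    }
    # Validate against the schema; reject the row if any field is off.
    for field, ftype in POLICY_SCHEMA.items():
        if not isinstance(row[field], ftype):
            raise ValueError(f"bad field {field!r} in record {raw}")
    return row

warehouse_table = []  # stands in for a relational warehouse table
for raw in [{"policy_id": " ab-123 ", "premium": "850.5", "effective": "2020-05-01"}]:
    warehouse_table.append(prepare_row(raw))  # transform happens at load time
```

The cost the article describes is visible even here: the schema and the cleansing rules must exist, and be agreed upon, before a single row can be stored.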
The Data Lake
The concept of a data lake was introduced about 10 years ago as a way to capture and store unstructured, raw or big data. Such data can be used to obtain deeper insights into customers and products/services by gathering customers' sentiments, buying habits and lifestyles and aligning their needs to the organization's products and services.
Another driver for a data lake is the ability to include data from multiple sources so the company gets a complete picture of its operations while undertaking data exploration and discovery via advanced analytics.
Raw data can be captured with little upfront preparation before it is stored. It is kept in raw form and transformed only when it is used. This transformation process is known as “schema on read.”
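By contrast, schema on read can be sketched as follows. Again this is a hypothetical illustration (the field names and sample records are invented): raw records land in the lake exactly as received, and each consumer applies its own schema only at query time.

```python
import json

# Raw events are stored verbatim -- no upfront schema is imposed.
data_lake = [
    '{"policy_id": "AB-123", "channel": "web", "sentiment": "positive"}',
    '{"policy_id": "CD-456", "premium": "1200"}',  # a different shape is fine
]

def read_with_schema(stored: list, fields: tuple) -> list:
    """Apply a schema only at read time ("schema on read"):
    parse each raw record and project just the fields this analysis needs."""
    rows = []
    for raw in stored:
        record = json.loads(raw)
        rows.append({f: record.get(f) for f in fields})  # missing fields -> None
    return rows

# Each consumer chooses its own schema over the same raw data.
sentiment_view = read_with_schema(data_lake, ("policy_id", "sentiment"))
```

Note the trade-off: loading is trivial, but every reader must cope with records that lack the fields it expects, which is exactly where the governance burden discussed below comes in.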
Areas of Commonality
Both data warehouses and data lakes have commonalities, including:
- The data extracted, positioned and used from each should address a specific business question or need and be driven by the organization's business and data strategy, thereby providing business value.
- Both can capture and store structured and unstructured data (although storing unstructured data is not a common function of the data warehouse).
- Both are business-driven, not IT-driven.
- Both require that the data used be properly managed and structured. As data is loaded into the data lake, it must be indexed and tagged with identifiers (metadata that provides context) so it can be understood, and certain data quality processes must be followed. As data in the lake becomes repetitive in structure and use, it should be modeled to show the relationships among the captured data.
- Development and administration of both repositories should be overseen by an enterprise data governance process and committee.
- Both afford the user a data source from which to run analyses and make business decisions.
- Both require that operational staff have the proper skill sets to manage, control and administer data.
Pros and Cons
Listing each repository's pros and cons is a good place to begin analyzing which direction an organization might pursue in developing either a data warehouse or a data lake.
Pros of the Data Warehouse
- It is well-established and a proven solution using mature tool sets and technologies.
- The architecture is based on data models that provide metadata to enable understanding.
- Data is stored in accordance with data model design, using normalization for quality and traceability, and star or snowflake schema for data mart aggregation and analytics.
- Due to its structured schema on write, it traditionally delivers good performance.
- The data feeds numerous BI reporting applications and satisfies needed regulatory and financial rating reporting needs.
- It is well-suited for business users who need to work with pre-aggregated and pre-integrated information targeted for historical analytics applications.
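The star schema mentioned above can be illustrated with a minimal sketch: a fact table whose rows reference small dimension tables by key, supporting the kind of pre-aggregated data mart the last bullet describes. Table and field names here are hypothetical.

```python
# Dimension tables: small lookup tables keyed by surrogate key.
dim_product = {1: {"line": "auto"}, 2: {"line": "home"}}
dim_date = {20200501: {"year": 2020, "month": 5}}

# Fact table: one row per measured event, referencing dimensions by key.
fact_premiums = [
    {"product_key": 1, "date_key": 20200501, "premium": 850.50},
    {"product_key": 2, "date_key": 20200501, "premium": 1200.00},
]

# A typical data-mart aggregation: total premium by line of business.
totals = {}
for row in fact_premiums:
    line = dim_product[row["product_key"]]["line"]
    totals[line] = totals.get(line, 0.0) + row["premium"]
```

In a real warehouse the same shape is expressed in SQL with joins and GROUP BY; the structure is what makes such aggregations fast and predictable.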
Cons of the Data Warehouse
- Storage costs are generally higher, because a data warehouse is less likely than a data lake to run in a virtualized or cloud storage environment.
- It is often built on specialized hardware, which is generally more expensive and more difficult to scale for large data sets.
- The time needed to build out each business component in order for the business to receive value from the data can be quite lengthy.
- It is not designed for big data, thereby limiting its effectiveness and restricting the capture of certain types of data.
- Data is limited to descriptive or historic analytics.
- The data captured is for predefined purposes and is limited structurally.
- Given the data warehouse's highly structured architecture, adding new data is time consuming and complex, thus reducing its flexibility and agility.
Pros of the Data Lake
- It handles many types of data from social media, census and other public data related to insurance lines of business, thereby providing value to a broader set of users.
- Storage is generally commoditized, is built on less expensive hardware and generally exists in the cloud.
- As more diverse data is loaded into the data lake, proper governance and quality processes and practices must be applied.
- All types of data can be loaded, allowing the data to be explored for new and unforeseen purposes and to support investigation into defined areas or niches.
- Data can be used to build predictive models and perform prescriptive analytics.
- Experiments can lead to artificial intelligence and machine learning applications.
- Data is transformed only when it is to be used, giving the lake agility and flexibility, though certain management, governance and quality processes must still be employed.
- Less expense and a shorter start-up time are needed, as the data structure and data requirements are not defined until the data is used.
Cons of the Data Lake
- Security and privacy processes and technologies are immature.
- Skill sets are lacking for this type of technology among current IT staffs.
- Data lake applications tend not to support partial or incremental loading.
- Changes will be needed in existing governance, management and strategies of data to properly manage the data lake.
- The captured data must be monitored to prevent it from becoming obsolete or unusable—i.e., a “swamp.”
- It is difficult to build a business case for a data lake since much of the stored data is used for “discovery” and “what if” scenarios.
Which Works Better?
In deciding which data repository to pursue, a company should undertake a review of its data plans, processes, technologies, skill sets and organizational structure, as well as how data savvy the organization is overall.
It's hard to avoid the magnetic pull of new technologies. And there are certainly some benefits to using a data lake over a data warehouse (specifically in the area of analytics and business intelligence). However, due to the insurance industry's lack of effective data management, control and use of data, most carriers are not positioned to entertain data lake structures. In fact, very few carriers have incorporated big data into their decision-making or analytics efforts today.
A 2018 survey conducted by the data solutions firm Syncsort found insurers were most concerned with improving their data governance and data quality processes (due primarily to regulatory challenges). These findings suggest that before carriers embark on building a data lake, they should become more proficient in the management and governance of their data, because as additional regulations concerning data management and data quality come into play, the pressure on carriers to address their practices will intensify.
As carriers move along this data management curve, and as the tools and technologies used for data lakes mature, the company's data warehouse can become a resource for data lake exploration and implementation. The data lake should not be seen as a replacement for the data warehouse. A data lake, however, can become a powerful tool for business intelligence and analytics … the kinds of things that make insurers competitive in the marketplace.