Technology & Innovation

Modernize Your Data Architectures Through Data Virtualization

BY: Emil Capino • Jan 10, 2023

The Global Datasphere, the total sum of the world’s data, is expected to hit 175 zettabytes by 2025; a zettabyte is equal to about a thousand exabytes, or a billion terabytes. According to IDC’s “Data Age 2025” whitepaper, data is growing at a compound annual rate of 61%. In 2018, the world’s total data volume reached 33 zettabytes, and it will keep growing exponentially.

In this article, we look at how organizations can cope with the data explosion by modernizing data architectures through Data Virtualization.

At the rate data is growing, physically moving and storing the same data in multiple repositories will increase costs and slow down the delivery of insights because of the migration process. Organizations must find new ways to leverage data instead of staying stuck in the traditional cycle of extracting, transforming, loading, storing, and managing it.

What is Data Virtualization? 

Data Virtualization is the ability to provide real-time access to integrated data across the organization without replicating any of it. Unlike traditional data integration and data warehousing approaches, where data is extracted and copied into a single repository and thereby replicated, data virtualization uses advanced query optimization to provide a logical data abstraction layer. Data stays where it is captured and stored, in the transactional systems, and the virtualization layer uses declarative SQL to combine, join, filter, and apply security to the organization’s data without moving or storing copies.

Techopedia defines Data Virtualization as the process of aggregating data from different sources of information to develop a single, logical and virtual view of information so that it can be accessed by front-end solutions such as applications, dashboards, and portals without having to know the exact storage location of the data.

As enterprises continue to generate data from various internal and external systems, they face challenges in data integration and storage of huge amounts of data. These challenges must be addressed to enable business users to access growing data volumes across the enterprise. With data virtualization, business users can get real-time and reliable information regardless of the data source. 

Data Virtualization integrates data from disparate sources, locations, and formats, without replicating it, to create a single “virtual” data layer that delivers unified data services to multiple applications and users. It involves abstracting, transforming, federating, and delivering data from various data sources. The result is faster access to all data, less replication and lower cost, and greater agility to change.

The main goal of data virtualization is to enable enterprise users to leverage data quickly and securely from a virtual data layer. This makes it possible for users to access data from any source for consumption by any application at any time and location. 

Data Virtualization modernizes and performs many of the same transformation and quality functions as traditional data integration, data replication, and data federation. By employing modern technology to deliver real-time data integration at lower cost, with more speed and agility, data virtualization can be applied to implement the following data management systems:

Virtual Data Integration 

Traditional data integration requires a dedicated physical repository to store data from two or more sources. Data is usually extracted from the different source systems and combined in a target database. Once the data is available, users access the combined data for operational and/or analytical reporting applications.

With modern data virtualization, data from the various source systems is no longer extracted and stored in a database repository. Instead, it is accessed directly from two or more sources. Using advanced query optimization technologies, which include artificial intelligence to improve performance, the data is joined, sorted, filtered, grouped, and delivered on the fly. This approach provides real-time information faster, without replicating the same data, thereby saving on storage and integration overhead.
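To make the idea concrete, here is a minimal sketch in Python of a federated query, using two SQLite files to stand in for separate source systems. The file names, tables, and sample rows are illustrative assumptions for this example only, not a reference to any particular virtualization product.

import sqlite3

# A minimal sketch of virtual data integration: two independent sources
# (modelled here as two SQLite files) are joined on the fly, with no
# combined target database ever built. Names and rows are hypothetical.

# --- set up two stand-in "source systems" --------------------------------
with sqlite3.connect("crm.db") as crm:
    crm.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, name TEXT)")
    crm.execute("DELETE FROM customers")
    crm.executemany("INSERT INTO customers VALUES (?, ?)",
                    [(1, "Acme"), (2, "Globex")])

with sqlite3.connect("billing.db") as billing:
    billing.execute("CREATE TABLE IF NOT EXISTS invoices (customer_id INTEGER, amount REAL)")
    billing.execute("DELETE FROM invoices")
    billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                        [(1, 1200.0), (1, 300.0), (2, 950.0)])

# --- federate a query across both sources at read time -------------------
con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'billing.db' AS billing")

query = """
SELECT c.customer_id, c.name, SUM(i.amount) AS total_billed
FROM customers AS c
JOIN billing.invoices AS i ON i.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
"""

for row in con.execute(query):
    print(row)   # joined, aggregated result produced on the fly

con.close()

The join and aggregation are expressed once, declaratively, and computed only when the query runs; no combined table is ever materialized or stored.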

Logical Data Warehouse 

Physical Data Warehouse systems provide a central repository of data that is extracted from disparate data sources, transformed, and loaded into a fact and dimensional structure or star schema for analytical purposes. The data warehouse needs to be maintained and updated regularly which involves batch ETL (extract, transform, and load) processing. This approach has worked for several decades but may not be able to cope with today’s increasing data volumes.

With a logical data warehouse, there is no need to extract, transform, and load data into a physical database. Data is queried in real time, as and when users need it, for analysis. Data is no longer replicated or staged in a separate repository. Transformations are done on the fly, and the query results are delivered immediately to the end-user application or reporting platform. This approach may not have worked in the past; however, with technological advances in query optimization and artificial intelligence, a data virtualization platform can perform better than a physical data warehouse system.
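Continuing the same hypothetical two-source sketch from above, the snippet below shows the logical-warehouse idea in miniature: a virtual “fact” view is defined over the sources, so the join and any transformations run only when an analyst queries the view, with nothing extracted, transformed, or loaded in advance.

import sqlite3

# Continues the earlier sketch: rather than batch ETL into a physical
# star schema, a temporary view plays the role of a "fact table" in a
# logical data warehouse. The crm.db / billing.db files and their tables
# are the same illustrative assumptions as before.

con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'billing.db' AS billing")

# Virtual "fact" view: the join and transformations are defined once,
# but executed only when the view is queried; nothing is loaded or stored.
con.execute("""
CREATE TEMP VIEW fact_billing AS
SELECT c.customer_id,
       c.name   AS customer_name,
       i.amount AS invoice_amount
FROM   customers AS c
JOIN   billing.invoices AS i ON i.customer_id = c.customer_id
""")

# Analysts query the view as if it were a warehouse table.
for row in con.execute(
        "SELECT customer_name, SUM(invoice_amount) FROM fact_billing GROUP BY customer_name"):
    print(row)

con.close()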

Virtual Data Lake 

To support big data analytics, all enterprise data, including structured, semi-structured, and unstructured data, must be combined in a physical data lake. Usually, the data lake is implemented on a Hadoop platform, which consists of a distributed file system, a NoSQL database, and a streaming data engine. While the data lake provides highly scalable compute and storage capacities for all data formats, it introduces a lot of complexity from the new big data technologies involved.

This approach is similar to a physical data warehouse, with parallel compute and distributed storage added to handle demanding data volumes, variety, and velocity. Physical data lakes serve specific advanced analytical use cases but require huge investments in hardware infrastructure, whether on-premises or in the cloud, and in the manpower to build the lake.

With a virtual data lake, a logical big data repository is created virtually, which eliminates the need to move and copy data from source systems into a Hadoop cluster. This approach saves a lot of cost in physical infrastructure, particularly storage, and the considerable manpower no longer required to physically build the lake. Although separate compute and storage are still needed to perform advanced analytics, the cost of the data preparation infrastructure is minimized. Enterprises can then redirect the effort of building the data lake toward more valuable big data analytics and data science activities.
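As a rough illustration of querying lake data in place, the sketch below uses DuckDB purely as a stand-in query engine; the article does not prescribe any specific tool, and the file name and columns are assumptions made for the example. The point is that SQL runs directly against a raw file where it sits, rather than after the file has been copied into a cluster.

import csv
import duckdb  # stand-in engine for this sketch; not mandated by the article

# A minimal illustration of the "virtual data lake" idea: a semi-structured
# file is queried where it sits instead of being copied into a Hadoop
# cluster first. The file name and columns are illustrative assumptions.

# Stand-in raw file, as it might land from an upstream system.
with open("clicks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "page", "ms_on_page"])
    writer.writerows([(1, "home", 1200), (1, "pricing", 5400), (2, "home", 800)])

con = duckdb.connect()  # in-memory; no data is loaded into a repository

# The file is scanned at query time; only the result leaves the source.
result = con.execute("""
    SELECT page, COUNT(*) AS views, AVG(ms_on_page) AS avg_ms
    FROM read_csv_auto('clicks.csv')
    GROUP BY page
    ORDER BY views DESC
""").fetchall()

print(result)
con.close()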

Unified Data Governance 

Most enterprises recognize the value of data but lack a data governance strategy that defines how data is created, collected, retained, used, and archived throughout its life cycle, with clearly defined owners or data stewards and formal processes. Data governance initiatives, which should capture the enterprise’s policies on information collection, use, and management, are often not fully executed because there is no unified platform and because capturing metadata from various data platforms and processes is complex.

With data virtualization, metadata is automatically captured and collected while the virtual data layer is being built. This allows data governance policies to be executed and enforced in one place, helping enterprises address regulatory compliance, customer trust and satisfaction, and better decision-making, the top three drivers of data governance.

Data virtualization brings more benefits than just modern data integration. Among its many benefits, data virtualization empowers organizations to implement the following data management functions: 

Data Lineage – provides visibility into how data is used within a specific data flow. It describes what happens to the data from its origin, through the many processes it goes through, to the final destination where it resides. Data lineage is important in tracing how information is derived, particularly in business intelligence and analytics, and it can be represented visually to help users discover the data flow. Data virtualization captures metadata at every stage of the data life cycle, allowing users to trace where the data came from, how it was processed, what calculations were applied, and where it is stored.

Data Catalog – enables users to search for, discover, and learn about the datasets available within an organization. This functionality is becoming a must-have at today’s data volumes. It helps users understand and gain value from the enterprise’s data assets.

By automatically capturing metadata while building the virtual data layer, data virtualization enables users to find data for business intelligence, application development, data science, or any other task where data is required. A data catalog allows users to quickly find data that matches their needs and evaluate its fitness for consumption.

Data Audit – refers to the assessment of data to determine its quality or utility for a specific purpose. It involves profiling and investigating data to identify problem areas and understand why they exist. Data audits are usually performed to assess regulatory compliance in highly regulated industries such as banking and finance.

With data virtualization, enterprises can perform data audits by leveraging the metadata captured in the virtual data layer. This makes it easier to assess which steps in the data life cycle do not adhere to regulatory policies or to agreed-upon data governance rules and processes.

The challenges brought by the data explosion we are experiencing also present opportunities to rethink and revamp traditional approaches to enterprise data architecture. Data Virtualization promises to simplify data management and lower its cost by abstracting the underlying data sources to create a virtual data fabric. Organizations must consider this logical approach to harnessing the power of data.

____________________________________________________________________________________

Note: This article was previously published in the printed issue of The Corporate, Guide and Style for Professionals magazine.

About the Author: The author, Emil Capino, is the Founder and CEO of Info Alchemy Corporation.
