The data lake was conceptualized in 2011. With big data in full swing today, our primary challenge is leveraging big data technologies to mass-produce industry services.
As enterprises speed up their digital transformation, their data grows exponentially. In addition, cross-system analysis makes data usage even more expensive. Huawei serves its enterprise customers with the data lake solution, a one-stop answer to the following issues:
Petabyte-level storage
Centralized data management covers both existing structured data and the unstructured data generated by digital transformation: user behavior logs, images, videos, and documents. Big data applications will be embedded into more and more business scenarios.
Terabyte-level computation
Compute power is in demand for large-scale processing before and during data input. Orders, contracts, and user profiles in ultra-large wide tables of over a thousand dimensions require aggregation, processing, and calculation. Scanned barcodes add further terabytes.
Access to same-source heterogeneous data
Diversified storage means Oracle GoldenGate (OGG) tables are kept in the Oracle database, while barcodes that need quick key-value lookups are kept in HBase. Cross-database analysis requires query engines such as Spark and Hive to access local metadata directly, even though the actual data is spread across environments such as HDFS, HBase, and Oracle.
Large-throughput data pipes
Massive volumes of service data require rapid aggregation for downstream big data analysis, computing, and modeling. Predictive models are useless if data access cannot keep up with analysis.
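The cross-database access challenge can be illustrated with a minimal federation sketch. The `OracleSource` and `HBaseSource` classes below are hypothetical in-memory stand-ins, not the actual engine APIs; a real deployment would use Spark or Hive connectors to reach each store.

```python
# Minimal sketch of federated access over heterogeneous stores.
# OracleSource/HBaseSource are hypothetical stand-ins for real
# connectors (JDBC, HBase client); the point is the unified scan API.

class OracleSource:
    """Stand-in for a relational table of order rows."""
    def __init__(self, rows):
        self.rows = rows
    def scan(self):
        return iter(self.rows)

class HBaseSource:
    """Stand-in for a key-value store of barcode rows."""
    def __init__(self, kv):
        self.kv = kv
    def scan(self):
        return ({"order_id": k, **v} for k, v in self.kv.items())

def federated_join(sources, key):
    """Merge rows from all sources that share the same join key."""
    merged = {}
    for src in sources:
        for row in src.scan():
            merged.setdefault(row[key], {}).update(row)
    return merged

orders = OracleSource([{"order_id": "A1", "amount": 900}])
codes = HBaseSource({"A1": {"barcode": "889-001"}})
result = federated_join([orders, codes], key="order_id")
print(result["A1"])  # {'order_id': 'A1', 'amount': 900, 'barcode': '889-001'}
```

The design point is that callers see one scan interface regardless of where a table physically lives, which is what the query-engine layer provides at scale.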
What is a Data Lake?
The data lake is a converged data platform built on a hybrid architecture of traditional Oracle and Huawei's FusionInsight HD&LibrA. This central platform integrates data from Huawei R&D, manufacturing, supply, storage, installation, and delivery. Such integration enables richer interaction and digital twin functionality, and these automated and intelligent capabilities make overall operations more efficient.
The following figure shows the scenario overview of the data lake.
The data lake platform consists of three logical modules: access, computing, and storage.
How Do You Build a Data Lake?
1. Defining Data Access
Applications have the highest demand for access, particularly high-value digital twin projects.
Data entering the lake must be verified against asset standards and mapped to its corresponding owners.
Data is modeled using a bottom-up standard: raw data, cleansed and consolidated data, third normal form (3NF), and service-oriented wide tables.
Data lakes must be highly available, support linear capacity expansion, and remain viable for three to five years of development.
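The bottom-up modeling standard above can be sketched as successive layers, from raw records to a service-oriented wide table. The field names and cleansing rules here are illustrative assumptions, not Huawei's actual schema.

```python
# Sketch of the layered modeling standard: raw -> cleansed -> 3NF -> wide table.
# Field names and cleansing rules are illustrative assumptions.

raw = [
    {"order_id": " A1 ", "customer": "acme", "amount": "900"},
    {"order_id": "A1",   "customer": "acme", "amount": "900"},  # duplicate
]

# Layer 2: cleanse and consolidate (trim, type-cast, de-duplicate).
cleansed = {}
for r in raw:
    oid = r["order_id"].strip()
    cleansed[oid] = {"order_id": oid,
                     "customer": r["customer"].strip(),
                     "amount": int(r["amount"])}

# Layer 3: third normal form -- split into entity tables keyed by ID.
orders = {oid: {"amount": row["amount"]} for oid, row in cleansed.items()}
customers = {row["customer"]: {"name": row["customer"].title()}
             for row in cleansed.values()}

# Layer 4: service-oriented wide table -- re-join for consumption.
wide = [{"order_id": oid,
         "customer_name": customers[row["customer"]]["name"],
         "amount": orders[oid]["amount"]}
        for oid, row in cleansed.items()]
print(wide)  # [{'order_id': 'A1', 'customer_name': 'Acme', 'amount': 900}]
```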
2. Applying Data to Scenarios
The data lake is a one-stop platform for data collection, computing, and other services. Data in the IT system is classified into structured (row-based) and unstructured data. The following figure shows how different data types are processed and applied to different scenarios.
Structured data (in green frames) is stored in Hive after batch processing and virtual mirroring. Kylin then preprocesses it into Cubes, which are encapsulated into REST API services for high-concurrency, subsecond querying and quality monitoring.
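The Kylin preprocessing step works by precomputing aggregates for every dimension combination, so queries become lookups against a ready-made Cube rather than scans of raw rows. A toy version of that idea, with hypothetical dimensions and measures:

```python
from itertools import combinations
from collections import defaultdict

# Toy cube pre-aggregation: precompute SUM(amount) for every combination
# of dimensions so later queries are dictionary lookups.
# Dimension and measure names are hypothetical.

rows = [
    {"region": "CN", "product": "router", "amount": 100},
    {"region": "CN", "product": "switch", "amount": 50},
    {"region": "EU", "product": "router", "amount": 70},
]
dims = ("region", "product")

cube = defaultdict(int)
for row in rows:
    for n in range(len(dims) + 1):
        for combo in combinations(dims, n):
            key = tuple(sorted((d, row[d]) for d in combo))
            cube[key] += row["amount"]

def query(**filters):
    """Subsecond lookup against the precomputed cube."""
    return cube[tuple(sorted(filters.items()))]

print(query(region="CN"))                    # 150
print(query(region="CN", product="router"))  # 100
print(query())                               # 220 (grand total)
```

The trade-off is classic space-for-time: the cube grows with the number of dimension combinations, which is why wide tables are curated before cubing.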
IoT data (in red frames) is collected by sensors and reported to the MQS, and then sorted immediately by Storm into HBase. After algorithm processing, this data is used for prewarning and monitoring.
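The Storm-into-HBase flow amounts to keyed stream processing with threshold checks per sensor. A minimal stand-in for that topology; the sensor IDs, window size, and threshold are illustrative assumptions:

```python
from collections import deque

# Minimal stand-in for the Storm topology: keep a sliding window per
# sensor and raise a prewarning on sustained high readings.
# Sensor IDs, window size, and threshold are illustrative assumptions.

WINDOW = 3
THRESHOLD = 80.0

windows = {}   # sensor_id -> recent readings
alerts = []

def on_reading(sensor_id, value):
    """Process one MQS message; emit a prewarning on sustained highs."""
    win = windows.setdefault(sensor_id, deque(maxlen=WINDOW))
    win.append(value)
    if len(win) == WINDOW and min(win) > THRESHOLD:
        alerts.append((sensor_id, sum(win) / WINDOW))

for v in (75.0, 82.0, 85.0, 90.0):
    on_reading("temp-01", v)

print(alerts)  # one alert for temp-01 once the whole window is above 80
```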
Barcode data (in yellow frames) enters the IQ columnar data lake through the ETL loader. After cleansing, this data supports scanning of hundreds of billions of barcodes.
What Are the Ways Data Can Be Stored?
Currently, the data lake stores data on the FusionInsight HD&LibrA and Oracle platforms. Where and how data is stored depends on which of two categories it belongs to:
1. Data that is of high value and in demand, such as FIN (financial) data, is mainly stored on the FusionInsight LibrA or Oracle platforms.
2. Data that is novel and unstructured, such as images, videos, and maps, is mainly stored on the FusionInsight HD platform.
3. Each source data system connects to its corresponding platform. For example, a relational database connects to the Oracle database while a Hadoop system connects to FusionInsight HD.
4. Data to be stored is prioritized by service digitalization maturity. For example, the IT data, manufacturing data, and R&D code come from business domains of different digitalization maturity and therefore are stored differently.
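The placement rules above can be read as a routing function from data characteristics to target platform. A sketch under those stated rules; the category labels are simplifications of the text, not a real configuration API:

```python
# Sketch of the storage-placement rules as a routing function.
# Category labels are simplified from the rules in the text.

def route(data_kind, high_value=False):
    """Pick a target platform for a dataset (illustrative only)."""
    if data_kind in ("image", "video", "map"):  # novel, unstructured data
        return "FusionInsight HD"
    if high_value:                              # e.g. FIN (financial) data
        return "FusionInsight LibrA"            # or Oracle
    if data_kind == "relational":               # relational sources stay relational
        return "Oracle"
    return "FusionInsight HD"                   # default for Hadoop-side data

print(route("video"))                        # FusionInsight HD
print(route("relational", high_value=True))  # FusionInsight LibrA
print(route("relational"))                   # Oracle
```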
The following table shows recommended combinations of data type, specification, and scenario.
What is the Process of Data Entry?
Data is processed by type, because unstructured data has varying formats and standards, and is more difficult to standardize and understand than structured information. Unstructured data processing involves massive storage, intelligent retrieval, knowledge mining, content protection, and value-added development and utilization of information. Structured and unstructured data is therefore stored and managed by different databases:
Structured data is logically expressed and implemented as a 2D table structure. It complies with format and length specifications and is stored and managed in relational databases.
Unstructured data has an irregular or incomplete structure and lacks a predefined data model. Such data includes documents, texts, images, XML, HTML, reports, audio, and video files. Instead of using the 2D logic table, unstructured data is stored in databases using multi-value field, subfield, and variable-length field mechanisms to create and manage data items.
Modeling: Unstructured data is indexed centrally for retrieval and analysis, with optional object description fields such as maintenance personnel and update time. When unstructured data is stored, the process catalogs object modes and digital attributes, customizes metadata, and associates large amounts of heterogeneous unstructured data with unified file metadata. Models are then created in which each piece of metadata is a dimension, and the engine indexes each metadata attribute across these dimensions. Data of various types (text, image, audio, video, and hypermedia) can thus be associated and processed, and widely used for full-text retrieval and multimedia information processing.
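The metadata-driven modeling described above can be sketched as an inverted index in which each metadata attribute is one dimension. The attribute names (maintainer, media) are hypothetical examples of the customized metadata, not a fixed schema:

```python
from collections import defaultdict

# Sketch of multi-dimensional metadata indexing for unstructured objects.
# Each metadata attribute becomes a dimension with its own inverted index.
# Attribute names (media, maintainer, updated) are illustrative.

index = defaultdict(lambda: defaultdict(set))  # attr -> value -> object ids

def ingest(obj_id, metadata):
    """Catalog one unstructured object by its unified metadata."""
    for attr, value in metadata.items():
        index[attr][value].add(obj_id)

def search(**criteria):
    """Intersect per-dimension postings to answer a multi-attribute query."""
    hits = [index[a][v] for a, v in criteria.items()]
    return set.intersection(*hits) if hits else set()

ingest("doc-1", {"media": "image", "maintainer": "li", "updated": "2020-01"})
ingest("doc-2", {"media": "video", "maintainer": "li", "updated": "2020-02"})

print(search(maintainer="li", media="image"))  # {'doc-1'}
```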
Storage platforms: HBase, MongoDB, and HDFS
Increment: Push and pull policies are supported. If HBase is selected for storage, the number of retained versions must be considered so that historical versions remain easy to view.
When Do We Use Data Lakes?
1. Freight Logistics
The data lake solution offers a real-time view of logistics operations using GPS data. Any perceived risk is reported immediately: associating shipping orders with Huawei internal orders enables five-minute prewarnings that identify at-risk shipments. The next challenge is associating freight, warehouse, and order data while pushing the update frequency below five minutes.
2. Site Delivery
The data lake solution now serves a dozen site scenarios, from planning, survey, and configuration to delivery and acceptance. Data is aggregated across these scenarios into full update reports every 15 minutes, enabling personnel to quickly check key delivery points. The next challenge is synchronizing data aggregated from millions of sites and tens of millions of detailed data entries, while ensuring the consistency of all this data.
3. Order Fulfillment
Data from the entire lifecycle (production, planning, shipment, delivery, and acceptance) is integrated. Real-time computing applies user-defined rules and detects exceptions, providing overall inspections, risk follow-up, and flexible rule configuration. Thanks to distributed computing engines, adding rules does not affect computing efficiency.
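The user-defined rules over lifecycle events can be sketched as a pluggable rule registry, where new rules are added without touching the inspection loop. Rule names and event fields below are hypothetical:

```python
# Sketch of flexible, user-defined exception rules over order events.
# Rule names and event fields are illustrative assumptions.

RULES = {}

def rule(name):
    """Register a user-defined rule under a name."""
    def wrap(fn):
        RULES[name] = fn
        return fn
    return wrap

@rule("late_shipment")
def late_shipment(event):
    # Flag shipments delayed by more than two days.
    return event["stage"] == "shipment" and event["delay_days"] > 2

@rule("qty_mismatch")
def qty_mismatch(event):
    # Flag orders whose shipped quantity differs from the ordered quantity.
    return event.get("shipped_qty") != event.get("ordered_qty")

def inspect(event):
    """Return the names of all rules the event violates."""
    return [name for name, fn in RULES.items() if fn(event)]

evt = {"stage": "shipment", "delay_days": 5,
       "ordered_qty": 10, "shipped_qty": 8}
print(inspect(evt))  # ['late_shipment', 'qty_mismatch']
```

Because each rule is an independent function, rule expansion scales out naturally when the inspection loop is distributed across compute nodes.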
4. Manufacturing
The messaging mode collects manufacturing data from surface-mount technology (SMT) lines for automated optical inspection in real time. These messages grow by 50 million a day, a volume that supports second-level monitoring of the production line. This IoT pilot project has the potential to showcase the data lake solution.
5. Project Operations
The data lake solution integrates five sources, including iBuy, iGo, and iResource, for real-time querying of budget grants and threshold warnings. Reasonable purchase orders (POs) are now executed more easily, and HBase enables querying of massive detailed data in seconds.
These success stories show how digital transformation enables real-time visualization and monitoring of services. Thanks to data lakes, even massive volumes of data can be queried efficiently, project control is enhanced, and greater value is derived from data.