Inexhaustible Data Lake Enables City Intelligence
A Smart City is a Big Data ecosystem centered on and driven by data. Big Data forms only when a large amount of data is fully converged, and, according to Big Data experts, data aggregation is difficult to achieve during governmental Big Data development.
Constructing a city’s Big Data center raises data aggregation challenges such as growing data sources, expanding data types, and exploding data volume. Big Data is useful only when it is technologically manageable.
In addition to the typically limited data from traditional application systems, the Big Data center will aggregate new types of Big Data, such as electronic documents and multimedia files generated by daily work; streaming data generated by city-wide video security and Internet of Things (IoT) sensors; and even the social data resources of enterprises, institutions, and the Internet. How to collect, store, and manage such data becomes a problem for the Big Data center. Enterprises must obtain, store, manage, and understand data before they can achieve ‘aggregate once, share many times.’
Simply aggregating massive amounts of disordered data will turn a Big Data center into a data swamp, because much of that data is unstructured, whereas a government’s information resource directory and information exchange systems process only structured data. As government informatization has evolved over the past 10 years, these technology and management limitations have become increasingly prominent.
Big Data technologies originated in Internet enterprises. However, governmental Big Data is distinct from Internet Big Data in that it is heterogeneous, scattered, and disordered, and its storage is neither highly centralized nor completely homogeneous. Manual cataloging cannot keep up with the metadata annotation workload for mass data, so the system architecture must be upgraded to a Big Data architecture. Governmental Big Data is publicly owned: its external value is greater than its internal value, and external use comes before internal use. Therefore, the focus is on public data-set development and resource-based services. Data aggregation problems cannot be solved by simply replicating the experience of Internet enterprises while ignoring the scattered distribution, diversity, and value of governmental Big Data.
Generally, Big Data development adopts type-A application modes that focus on data analysis results (application object: analytical applications). There are also type-D application modes that focus on data content (application object: public data sets). If data resources are not yet fully concentrated at scale, most governments should use type-D modes instead of pursuing rapid development with type-A.
James Dixon, CTO of Pentaho, proposed the concept of the data lake in 2010. A data lake is intended to overcome two major limitations of a data warehouse: 1) a warehouse can only be used to answer pre-determined questions; and 2) the data stored in a warehouse has been filtered and packaged, obscuring its initial state.
“If you think of a datamart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state,” said Dixon in his original blog post. “The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The core principle of the data lake is to centrally store original and unchanged data, and then process it only after extraction. Data lakes store various types of data, primarily unstructured and semi-structured, and then allow access to the data through a unified view. The data lake must have powerful metadata management capabilities to ensure the semantic consistency of stored data resources, which is the prerequisite for Big Data analytics.
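The principle of storing data unchanged and processing it only after extraction can be illustrated with a small sketch. This is a toy in-memory model, not any vendor's implementation: the `DataLake` class, its `ingest`, `find`, and `extract` methods, and the sample records are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory lake: raw bytes are kept untouched, metadata lives in a catalog."""
    objects: dict = field(default_factory=dict)  # object_id -> original bytes
    catalog: dict = field(default_factory=dict)  # object_id -> metadata record

    def ingest(self, raw: bytes, source: str, fmt: str) -> str:
        # The content hash doubles as the object ID, so the stored copy can be
        # verified as identical to what the source delivered.
        object_id = hashlib.sha256(raw).hexdigest()
        self.objects[object_id] = raw
        self.catalog[object_id] = {"source": source, "format": fmt, "size": len(raw)}
        return object_id

    def find(self, **criteria) -> list:
        # Unified view: every object, whatever its type, is located via the catalog.
        return [oid for oid, meta in self.catalog.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

    def extract(self, object_id: str, parser):
        # Parsing happens only here, at extraction time (schema-on-read).
        return parser(self.objects[object_id])


lake = DataLake()
sensor_id = lake.ingest(b'{"sensor": "S-17", "pm25": 42}', source="iot", fmt="json")
lake.ingest(b"Meeting minutes: road maintenance budget", source="office", fmt="text")

reading = lake.extract(sensor_id, json.loads)
print(reading["pm25"])               # 42
print(len(lake.find(source="iot")))  # 1
```

Because the parser is supplied by the consumer at extraction time, the same stored bytes can later serve questions that were not anticipated when the data was ingested.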
Governmental Big Data value chains include convergence, aggregation, management, calculation, and use. The data lake is an upstream part of the chain related to data collection, aggregation, and storage — and functions as the source of Data-as-a-Service (DaaS) and analytical applications. In a narrow sense, the data lake corresponds to the aggregation phase. In a broad sense, the data lake corresponds to convergence, aggregation, and management.
• For data aggregation, enterprises aim to build a unified data collection system and a unified Big Data resource pool to optimize data processing on the Big Data supply side.
• For data analysis, enterprises aim to establish a channel between Big Data analytics and the data lake so that data can be extracted for immediate analysis.
• For data management, enterprises aim to establish a unified metadata management system and a unified raw data warehouse for Big Data analytics to fully meet Big Data demands.
Huawei has proposed the ‘One Cloud, One Lake, One Platform’ Solution, which draws on the company’s extensive experience in Smart City construction and data asset management transformation, as well as its accumulated expertise in Big Data and Artificial Intelligence (AI). ‘One Cloud’ refers to an eGovernment cloud, ‘One Lake’ to a data lake, and ‘One Platform’ to a Big Data platform. The data lake consists of a metadata management platform, a data lake warehouse, and a data lake service. The metadata management platform registers, counts, evaluates, and disposes of data assets; the data lake warehouse stores native data in a manageable and scalable manner; and the data lake service provides data discovery, preparation, and extraction for external systems.
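The division of responsibilities among the three data lake components might be sketched as follows. This is an illustrative model only and does not reflect Huawei's actual interfaces: the class names, method names, asset IDs, and the use of a read counter as the "evaluation" signal are all assumptions made for the example.

```python
class MetadataPlatform:
    """Registers, counts, evaluates, and disposes of data assets."""

    def __init__(self):
        self.assets = {}  # asset_id -> {"kind": ..., "owner": ..., "reads": ...}

    def register(self, asset_id, kind, owner):
        self.assets[asset_id] = {"kind": kind, "owner": owner, "reads": 0}

    def count(self, kind=None):
        return sum(1 for meta in self.assets.values() if kind in (None, meta["kind"]))

    def evaluate(self, asset_id):
        # Assumption: the read count serves as a crude proxy for asset value.
        return self.assets[asset_id]["reads"]

    def dispose(self, asset_id):
        del self.assets[asset_id]


class LakeWarehouse:
    """Stores native data as-is, keyed by asset ID."""

    def __init__(self):
        self.blobs = {}

    def put(self, asset_id, raw):
        self.blobs[asset_id] = raw

    def get(self, asset_id):
        return self.blobs[asset_id]


class LakeService:
    """Discovery, preparation, and extraction for external systems."""

    def __init__(self, metadata, warehouse):
        self.metadata = metadata
        self.warehouse = warehouse

    def discover(self, kind):
        return sorted(a for a, m in self.metadata.assets.items() if m["kind"] == kind)

    def extract(self, asset_id, prepare=lambda raw: raw):
        # Preparation runs on the way out; the stored data is not modified.
        self.metadata.assets[asset_id]["reads"] += 1
        return prepare(self.warehouse.get(asset_id))


metadata, warehouse = MetadataPlatform(), LakeWarehouse()
service = LakeService(metadata, warehouse)

for asset_id, kind, raw in [("cam-001", "video", b"\x00\x01"),
                            ("doc-001", "document", b"Permit 88")]:
    warehouse.put(asset_id, raw)
    metadata.register(asset_id, kind, owner="city-ops")

text = service.extract("doc-001", prepare=lambda raw: raw.decode("utf-8"))
print(service.discover("document"), text, metadata.evaluate("doc-001"))
```

Keeping the metadata records separate from the stored bytes means assets can be counted, evaluated, or retired without reading the data itself.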
Huawei launched the Smart City Data Lake Solution to provide an inexhaustible source for the Big Data ecosystem. Huawei’s Big Data solution, with a data lake as its core, has the following three capabilities:
• Early practice and pilot exploration: With 180,000 employees, Huawei has accumulated extremely complex information systems with massive data resources. Cross-domain data acquisition can be difficult, or even impossible, because of missing permissions or the loss of large amounts of intermediate data, which in turn prevents legacy systems from meeting the requirements of digital operations and Big Data analytics applications. In 2017, Huawei launched its data asset management transformation project, initiated data lake construction in the product field, successfully applied the Integrated Product Development (IPD) data lake solution, and constructed a unified database to store theme data for centralized data asset management. In this way, Huawei removed data barriers, connected data, and enabled proactive services.
• Targeting the future with leading architecture: In the future, all data will be migrated to the unified eGovernment cloud. In the early stages, to keep the system both advanced and practical, organizations can run the small-data architecture of traditional databases alongside the new Big Data architecture of the data lake, with unified metadata management applied to both. When conditions mature, those organizations can then integrate the traditional data architecture with the new Big Data architecture.
• Automation and high efficiency: AI technologies enable metadata to be annotated automatically. The existing directory system applies only to structured data and relies primarily on manual cataloging, which suffers from a heavy workload, high complexity, and low quality. Once the system stores unstructured and semi-structured data, manual cataloging becomes unworkable at that scale, and mature AI technologies such as image recognition, voice recognition, and natural language processing must be introduced to handle video, voice, and electronic documents. These AI tools interpret unstructured data and automatically extract subject words, keywords, and labels, and machine learning continuously improves annotation quality.
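As a rough illustration of automatic metadata annotation for electronic documents, the sketch below extracts keywords by simple term frequency. This is a deliberately simplified stand-in for the natural language processing models the article refers to; the `annotate` function, the stopword list, and the sample text are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "for",
                       "on", "is", "are", "with", "near"})


def annotate(text: str, top_n: int = 3) -> dict:
    """Return a metadata record with the most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"keywords": [word for word, _ in ranked[:top_n]],
            "word_count": len(words)}


report = ("Traffic cameras report traffic congestion on the ring road; "
          "congestion peaks near the road bridge.")
record = annotate(report)
print(record["keywords"])  # ['congestion', 'road', 'traffic']
```

The resulting record could be written to the metadata catalog in place of a manually entered entry; video and voice would need recognition models to produce text or labels first.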
China has taken the lead in Big Data development by driving the emergence of Smart City data lakes. In Big Data engineering projects such as Smart Gaoqing, Beijing’s secondary municipal center, and Lanzhou’s new district, Huawei replicated China’s approach to IPD data lake construction and accelerated solution implementation to make breakthroughs in governmental Big Data aggregation, helping local governments set sail toward the new horizon of Smart Cities.