Invited Talks of the 35th National Database Conference of China (NDBC 2018)



1. Invited Keynote Title: Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond

Invited Keynote Abstract: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data," i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, and (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result of these activities was Apache Flink, an open-source big data analytics platform, and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community still remains: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics, which involves specialized engines for various aspects of an end-to-end data analytics pipeline (including, among others, graph-based, linear algebra-based, and relational algorithms) and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.

Invited Keynote Speaker Biography:
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin) and an Adjunct Full Professor at the University of Toronto. At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 20 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on "Implicit Parallelism Through Deep Language Embedding."

Website: http://www.dima.tu-berlin.de

2. Invited Keynote Title: Medical Treatment Support by Data Engineering Technologies

Invited Keynote Abstract: Our daily lives have been greatly impacted by information technology, which has become one of our most important infrastructures. For example, information technology has introduced significant changes in the medical field, such as medical image recognition, medical sensor data processing, computational drug design, and electronic medical record (EMR) systems. Focusing on EMR systems, data engineering technologies have high potential for supporting them. EMR systems not only reduce the cost of managing medical treatment histories, but can also improve medical processes through the secondary use of these records. To expedite such secondary use, the Japanese government has started a project to collect EMRs from a large number of hospitals in Japan. The clinical pathway service is a good instance of the secondary use of EMRs. Medical workers, including doctors, nurses, and technicians, generally use clinical pathways as guidelines for typical sequences of medical treatments. Clinical pathways have traditionally been created by the medical workers themselves, based on their experience and with great effort. Candidate clinical pathways can instead be extracted by applying sequential pattern mining techniques to the medical orders in the EMR. Comparing existing clinical pathways with the extracted frequent sequential patterns helps medical workers verify their correctness or modify them. To provide proper patterns as useful information to medical workers, a number of technical issues must be considered. First, considering the time intervals between medical treatments is essential. Moreover, the frequent sequential patterns extracted from the EMR contain many branches; visualizing these branches is important for choosing appropriate patterns. The issues of cost, safety, and reasoning related to these branches should also be considered.

Invited Keynote Speaker Biography:
Haruo Yokota received his B.E., M.E., and Dr.Eng. degrees from Tokyo Institute of Technology in 1980, 1982, and 1991, respectively. He joined Fujitsu Ltd. in 1982, and was a researcher at ICOT for the 5th Generation Computer Project from 1982 to 1986, and at Fujitsu Laboratories Ltd. from 1986 to 1992. From 1992 to 1998, he was an Associate Professor at the Japan Advanced Institute of Science and Technology (JAIST). He moved to Tokyo Institute of Technology in 1998, and has been a Full Professor in the Department of Computer Science since 2001. He is currently the Dean of the School of Computing at Tokyo Institute of Technology. His research interests include the general areas of data engineering, information storage systems, and dependable computing. He was a vice president of DBSJ, a chair of the ACM SIGMOD Japan Chapter, a trustee board member of IPSJ, the Editor-in-Chief of the Journal of Information Processing, and an associate editor of the VLDB Journal. He is currently a board member of DBSJ, a fellow of IEICE and IPSJ, a senior member of IEEE, and a member of IFIP-WG10.4, JSAI, ACM, and ACM-SIGMOD.

3. Invited Keynote Title: Data Analytics as a Service for Data Scientists

Invited Keynote Abstract: Data scientists and domain experts often face challenges when dealing with large amounts of data, especially due to the scale of the data and their limited IT knowledge and infrastructure maintenance skills. In this talk, I will present several software solutions we are developing to support data analytics as a service for these users. These solutions include Apache AsterixDB, an open-source parallel database; Cloudberry, a middleware system to support data visualization; and Texera, a system that enables browser-based text analytics using declarative workflows. These solutions can be integrated to support data ingestion, storage, indexing, querying, visualization, and analytics. As an example, I will report our experiences using these solutions to support the management of large-scale social media data (e.g., billions of tweets amounting to terabytes) as a service for researchers from several schools and universities in disciplines such as social science and public health.

Invited Keynote Speaker Biography:
Chen Li is a professor in the Department of Computer Science at UC Irvine. He received his Ph.D. degree in Computer Science from Stanford University, and his M.S. and B.S. in Computer Science from Tsinghua University, China. His research interests are in the field of data management, including data-intensive computing, query processing and optimization, visualization, and text analytics. His current focus is building open-source systems for data management and analytics. He is a recipient of an NSF CAREER Award, several test-of-time publication awards, and many grants and industry gifts. He was a part-time Visiting Research Scientist at Google, and he founded a company to commercialize university research.

4. Invited Keynote Title: Building Scalable Machine Learning Solutions for Data Curation

Invited Keynote Abstract: Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details of configuring and deploying ML techniques are the biggest hurdle. In this talk, I discuss why leveraging data semantics and domain-specific knowledge is key to delivering the optimizations necessary for truly scalable ML curation solutions. The talk focuses on two main problems: (1) entity consolidation, which is arguably the most difficult data curation challenge because it is notoriously complex and hard to scale; and (2) using probabilistic inference to suggest data repairs for identified errors and anomalies, using our new system called HoloClean. Both problems have challenged researchers and practitioners for decades due to the fundamentally combinatorial explosion in the space of solutions and the lack of ground truth. There is a large body of work on these problems in both academia and industry. Techniques have included human curation, rule-based systems, and automatic discovery of clusters using predefined thresholds on record similarity. Unfortunately, none of these techniques alone has been able to provide sufficient accuracy and scalability. The talk aims to provide deeper insight into the entity consolidation and data repair problems and discusses how machine learning, human expertise, and problem semantics together can deliver a scalable, high-accuracy solution.
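As a toy illustration of the threshold-based clustering baseline the abstract mentions (not HoloClean or Tamr's actual method), the sketch below groups records whose token-set Jaccard similarity exceeds a fixed threshold, using union-find; the record strings and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two record strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def consolidate(records, threshold=0.5):
    """Group records whose pairwise similarity meets a fixed threshold,
    merging groups with union-find. The O(n^2) comparison loop is exactly
    the scaling bottleneck the talk refers to; real systems prune it with
    blocking and learn similarity instead of fixing a global threshold."""
    parent = list(range(len(records)))

    def find(i):  # find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())

records = [
    "international business machines corp",
    "international business machines",
    "apache flink",
    "apache flink project",
]
groups = consolidate(records)
```

Here the four records collapse into two clusters; picking the threshold, and knowing it will not transfer across domains, is precisely why the talk argues for combining learning, human expertise, and semantics.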

Invited Keynote Speaker Biography:
Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, the elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Website: https://cs.uwaterloo.ca/~ilyas/

5. Invited Keynote Title: Big Data 2.0: The Future of Data Computing

Invited Keynote Abstract: After more than 40 years of development, our information society is undergoing a transition from the IT era to the DT (data technology) era, and big data technology is profoundly reshaping society and the world. This talk first reviews the current state of big data computing and its major technical advances, including batch-processing and stream-computing platforms. Looking toward next-generation big data computing, it then analyzes the characteristics and challenges of next-generation big data computing systems, and outlines future trends in data computing along three directions: unified batch/stream processing, cross-domain processing, and edge computing.

Invited Keynote Speaker Biography:
Guoren Wang is a professor and Ph.D. supervisor at Beijing Institute of Technology, a member of the Discipline Appraisal Group of the State Council, a Changjiang Scholar Distinguished Professor, a recipient of the National Science Fund for Distinguished Young Scholars, and Vice Chair of the Database Technical Committee of the China Computer Federation (CCF). He was selected for the national-level tier of the National Hundred-Thousand-Ten Thousand Talents Project and was honored as a "Young and Middle-aged Expert with Outstanding Contributions." He has led more than 20 projects funded by the National Natural Science Foundation of China and the National 863 Program, and has published over 100 academic papers. His main research interests include graph data management, big data computing, and bioinformatics.

Website: http://cs.bit.edu.cn/szdw/jsml/js/wgr_2018/index.htm

6. Invited Keynote Title: Approximate Algorithms for Big Data

Invited Keynote Abstract: Traditional approximation algorithms target NP-complete problems, introducing approximation to obtain polynomial-time algorithms. In the big data era, as data volumes keep growing, many problems that are solvable in polynomial time have also become hard to compute in practice. In recent years, using approximate algorithms to address the time, space, and communication efficiency problems caused by growing data scale has become a hot research topic. The core idea of big data approximation algorithms is to introduce a controllable error so as to transform big data into "small data" that is statistically very similar to the original data. In big data analytics, people usually do not care about any particular record; what matters more is a macroscopic, statistically meaningful characterization of the data. The small data produced by approximation algorithms is several orders of magnitude smaller than the original data, making real-time system response and real-time data analysis possible. In this talk, we will discuss techniques commonly used in big data approximation algorithms, such as sampling, sketches, and summaries, as well as their applications to structured data, streaming data, matrix data, and graph data.
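As a small illustration of the "sketch" technique the abstract names (a generic example, not from the talk), here is a minimal Count-Min sketch: a sublinear-space summary that answers frequency queries over a stream with a controllable, one-sided overestimation error. The width/depth values are arbitrary choices for the example.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: depth hash rows of width counters.
    Estimates are never below the true count; collisions only inflate."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-ish bucket per row, derived from a keyed hash.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Taking the minimum over rows limits the damage from collisions.
        return min(self.table[row][col] for row, col in self._buckets(item))

# Summarize a hypothetical stream: the sketch size is independent of
# the stream length, which is the whole point of the technique.
sketch = CountMinSketch()
for item in ["a"] * 50 + ["b"] * 5:
    sketch.add(item)
```

`sketch.estimate("a")` is at least the true count of 50 and at most the stream length of 55, matching the "small data that is statistically similar to the original" idea: exact records are gone, but frequencies survive with bounded error.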

Invited Keynote Speaker Biography:
Jirong Wen received his B.S. and M.S. degrees from Renmin University of China in 1994 and 1996, respectively, and his Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, in 1999. He then joined Microsoft Research Asia, where from 2008 he served as a Senior Researcher and manager of the Web Search and Mining Group. In September 2013 he joined Renmin University of China, where he is currently a professor and Dean of the School of Information, Director of the Beijing Key Laboratory of Big Data Management and Analysis Methods, and a distinguished expert of the national "Thousand Talents Program." His main research interests are big data management and analytics, information retrieval, data mining, and machine learning, with particular strength in cross-disciplinary research and the development of large-scale data systems. He has published more than 170 papers in leading international conferences and journals, with over 12,000 citations and an H-index of 49, and serves on the editorial boards of ACM TOIS and IEEE TKDE. He has also done extensive work on system and product development: he holds 49 US patents, led the development of systems such as Microsoft Academic Search and Renlifang, and was deeply involved in the discussion, design, and implementation of Microsoft's Bing search engine. These systems serve hundreds of millions of users and have broad influence.

7. Invited Keynote Title: Research and Practice on Cloud Database Architectures for Massive-Data, High-Throughput OLTP Scenarios

Invited Keynote Abstract: With the arrival of the ABC (AI, Big Data, Cloud) era, traditional relational databases inevitably face challenges from new business scenarios and industry transformation. On one hand, they must handle exponentially growing, PB-scale data storage; on the other hand, they cannot compromise the high-throughput, low-latency performance required in OLTP scenarios; at the same time, under enterprise cost pressure, the database architecture itself must strike a balance between performance and cost. To address these challenges, the Baidu Cloud database team, drawing on years of accumulated database development and operations experience, has pursued a roadmap of elastic architecture, intelligent operations, and cache fusion. The research and practical progress achieved so far is summarized below; this talk focuses on the first part, cloud database architecture research and practice:
1. Elastic architecture: This part introduces the evolution of Baidu's cloud database architecture, from the classic three-tier database architecture, to a horizontally partitioned distributed database architecture, to a next-generation cloud database architecture that separates the compute engine from the storage engine. Through technical innovations such as physical replication, MVCC for pages, and smart storage, combined with hardware upgrades such as distributed block storage, Optane, and RDMA, this architecture elegantly solves the I/O contention and smooth-scaling problems typical of traditional database systems, guaranteeing high throughput and high availability while breaking the database "silo" model and reducing cost.
2. Intelligent operations: By building an operations data warehouse and knowledge base and applying the strategies and algorithms of an "operations brain," the team has created a closed-loop perception-feedback-decision framework for operations robots, solving the end-to-end problem of automatically perceiving, deciding, and executing loss mitigation for large-scale database clusters under data center failures.
3. Cache fusion: For flash-sale-style workloads with high QPS requirements, the business pain points are maintaining multiple data sources and keeping the cache consistent with the database. The Baidu Cloud database team proposes a hot/cold data separation architecture with unified multi-protocol access and a fused architecture, using middleware to standardize the communication interface and to handle data synchronization between the database and the cache. The approach is transparent to business systems, improves overall throughput by 10x, and reduces latency by 20%, fully meeting business requirements for a high-throughput, low-latency storage system.

Invited Keynote Speaker Biography:
Wanchuan Zhang graduated from the University of Wisconsin-Madison with a Ph.D. in Mathematics and an M.S. in Computer Science. After graduation he joined IBM, where he worked for many years on IBM DB2 database development. After returning to China in 2008, he helped establish the database R&D team at IBM's China development center and led the team's DB2 development work. He has rich development experience in database storage, compression, replication, and columnar in-memory databases. In 2018 he joined the Baidu Cloud database team, where he is responsible for developing Baidu's next-generation cloud database.