Welcome to Waim 2013

WAIM 2013 / MSRA Summer School

COURSE INTRODUCTION

Course 1: A Tutorial on Probabilistic Databases
Lecturer: Prof. Dan Suciu, University of Washington

A major challenge in modern data management is how to cope with uncertainty in the data, such as in data extracted from text, in physical or RFID data, or in fuzzy data integration. In a probabilistic database, uncertainty is modeled using probabilities, and data management techniques are extended to cope with probabilistic data. This tutorial will discuss the main challenge in probabilistic databases, which is query evaluation. Each answer to a SQL query has a degree of certainty, defined as the probability that the answer is present. This problem is equivalent to computing the probability of a Boolean formula, or to the model counting problem, which has been extensively studied in the AI and model checking literature, and is known to be intractable in general (#P-complete). The approach taken in probabilistic databases, however, is entirely novel, since here we can separate the query from the data. By a careful static analysis of the SQL query we can identify many cases in which the probabilistic inference problem is in PTIME, an approach that led to the discovery of entirely new classes of tractable Boolean formulas. Even when the query is #P-hard, we can approximate the query answer by evaluating a dissociated version of the query, which can be done in PTIME. In all cases, the SQL query is rewritten into a (more complex) SQL query that manipulates probabilities directly, and which can be computed in a standard relational database system.
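
As a rough illustration of "the probability that the answer is present", the sketch below enumerates all possible worlds of a tiny database and sums the weights of the worlds in which a Boolean query holds. The tables R and S, their probabilities, the query, and the assumption that tuples are independent are all hypothetical choices made for this example; they are not taken from the tutorial. The enumeration is exponential in the number of tuples, which is why it only serves as a definition and not as an evaluation method.

```python
from itertools import product

# Hypothetical tuple-independent tables: each tuple maps to its marginal probability.
R = {("a",): 0.5, ("b",): 0.4}
S = {("a", 1): 0.8, ("a", 2): 0.5, ("b", 1): 0.3}


def query(r_world, s_world):
    """Boolean query: there exist x, y such that R(x) and S(x, y) both hold."""
    return any((x,) in r_world for (x, _y) in s_world)


def probability_by_possible_worlds(R, S):
    """Sum the probabilities of all possible worlds in which the query is true."""
    tuples = [("R", t, p) for t, p in R.items()] + [("S", t, p) for t, p in S.items()]
    total = 0.0
    # Each world chooses, for every tuple, whether it is present or absent.
    for choices in product([False, True], repeat=len(tuples)):
        weight = 1.0
        r_world, s_world = set(), set()
        for (rel, t, p), present in zip(tuples, choices):
            # Assumed independence: the world's weight is a product over tuples.
            weight *= p if present else 1.0 - p
            if present:
                (r_world if rel == "R" else s_world).add(t)
        if query(r_world, s_world):
            total += weight
    return total


print(round(probability_by_possible_worlds(R, S), 6))  # 0.516
```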

The tutorial has four parts:
1. Motivation and Basic Definitions
-- sample applications
-- tuple/attribute level uncertainty
-- the possible worlds semantics
-- the query evaluation problem and its complexity (#P-hard)
2. Extensional Query Plans and Safe Plans
-- join, group-by (projection), selection, union, summation
-- safe and unsafe plans
-- converting safe plans back into SQL; demonstration in postgres
-- the dissociation theorem for approximate query evaluation
3. Extensional Query Evaluation
-- Conjunctive queries without self-joins; hierarchical queries
-- General queries and the inclusion/exclusion formula
-- the Möbius function in a lattice
-- query shattering and ranking
-- the dichotomy theorem
4. Intensional Query Evaluation (advanced) -- Lineage
-- the DPLL class of algorithms for model counting
-- approaches to model counting: read-once formulas, OBDDs, FBDDs, d-DNNFs
-- query compilation and the characterization theorems
-- open problems
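
The extensional evaluation named in parts 2 and 3 can be sketched for one small hierarchical query. The example below is a minimal sketch, assuming tuple-independent tables and the hypothetical Boolean query "there exist x, y with R(x) and S(x, y)"; the table contents and function names are invented for illustration. It computes the probability with operators that manipulate probabilities directly (an independent project and an independent join) instead of enumerating possible worlds.

```python
from collections import defaultdict

# Hypothetical tuple-independent tables: each tuple maps to its marginal probability.
R = {("a",): 0.5, ("b",): 0.4}
S = {("a", 1): 0.8, ("a", 2): 0.5, ("b", 1): 0.3}


def independent_project(probs):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def safe_plan_probability(R, S):
    """Evaluate 'exists x, y: R(x) and S(x, y)' extensionally.

    Plan: project y out of S for each x, join with R on x (multiply the
    probabilities), then project x out. Each step is valid here because the
    events it combines depend on disjoint sets of independent tuples.
    """
    s_by_x = defaultdict(list)
    for (x, _y), p in S.items():
        s_by_x[x].append(p)
    per_x = [
        R[(x,)] * independent_project(ps)  # independent join of R(x) with "some S(x, y)"
        for x, ps in s_by_x.items()
        if (x,) in R
    ]
    return independent_project(per_x)  # independent project over x


print(round(safe_plan_probability(R, S), 6))  # 0.516
```

The running time is linear in the size of the tables. This shortcut is not sound for every query; applying the same operators to a non-hierarchical query can give a wrong probability.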

The tutorial assumes basic familiarity with probability theory and with simple SQL queries. No background is needed in database internals, graphical models, or model counting. Most of the material covered in the tutorial is also available in: Suciu, Olteanu, Ré, Koch: Probabilistic Databases. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2011,
http://www.morganclaypool.com/doi/abs/10.2200/S00362ED1V01Y201105DTM016

Prof. Dan Suciu
Computer Science & Engineering
University of Washington
Box 352350
suciu@cs.washington.edu

Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books: Data on the Web: from Relations to Semi-structured Data and XML, 1999, and Probabilistic Databases, 2011. He is a Fellow of the ACM, holds twelve US patents, received the ACM SIGMOD Best Paper Award in 2000 and the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, is an associate editor for the VLDB Journal, ACM TOIS, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS. Suciu's PhD students Gerome Miklau and Christopher Ré received the ACM SIGMOD Best Dissertation Award in 2006 and 2010 respectively, and Nilesh Dalvi was a runner-up in 2008.

Course 2: SQL, NoSQL, NewSQL and Other Interesting Ways to Process Big Data
Lecturer: Prof. Michael Franklin, UC Berkeley
In this four-hour mini-course, he will cover various techniques for analyzing Big Data at scale. He will give a bit of background and history about massively parallel query processing in database systems (a topic he first started working on over 25 years ago) and then cover more recent massively parallel data processing infrastructure such as MapReduce and Hadoop. The goal here will be to compare and contrast these approaches while avoiding sounding too much like a grumpy old database guy. Then he will describe the data analytics stack being built in the Berkeley AMPLab (called BDAS, the Berkeley Data Analytics Stack), including the popular Spark and Shark systems as well as more recent efforts to extend these to support advanced techniques such as stream processing, graph processing, and machine learning. The overall goal is to give a broad survey of this incredibly active area of research, with some ideas and pointers for areas of research opportunity. He apologizes in advance that this will not be a comprehensive survey - there's just way too much going on!

Prof. Michael Franklin
Professor and Director of AMPLab
Dept of Computer Science
The University of California at Berkeley
USA

Michael Franklin is the Thomas M. Siebel Professor of Computer Science and Director of the Algorithms, Machines and People Lab (AMPLab) at UC Berkeley. His research focuses on new approaches for data management and data analysis, including data stream processing and continuous analytics, scalable query processing, large-scale sensing environments, data integration, and hybrid human/computer data processing systems. He started his research career (pre-Ph.D.) as a programmer on the Bubba massively parallel database system at MCC in Austin, Texas, and is happy to see the 1000-node systems underlying the design of that system finally becoming mainstream - a couple of decades later. In 2006 he founded (with Sailesh Krishnamurthy) Truviso, Inc., a real-time data analytics company that was acquired by Cisco Systems last year. Truviso pioneered a unified approach to processing streaming data and stored data that is particularly well suited for processing fast-moving data in networks. He is an ACM Fellow and winner of the ACM SIGMOD Test of Time Award. He has recently won Best Paper Awards at ICDE 2013 and NSDI 2012, a "Best of VLDB 2012" selection, Best Demo awards at SIGMOD 2012 and VLDB 2011, and the Outstanding Advisor Award from the Computer Science Graduate Student Association at Berkeley. He is currently serving as a committee member on the U.S. National Academy of Sciences study on Analysis of Massive Data. Prof. Franklin is currently on sabbatical at the Center for Big Data and Cloud Computing at East China Normal University in Shanghai.

Course 3: Usability in Database Systems
Lecturer: H V Jagadish
While usability has long been recognized as an important virtue for a database system, there has been a strong recent push to improve database systems in this regard. In this tutorial, I will present an overview of recent work in this area, organized according to a recently developed framework for classifying this research, based on both the data life-cycle and the steps of user interaction.

H V Jagadish
Bernard A Galler Collegiate Professor of
Elec. Engg. and Computer Science.
University of Michigan
jag at eecs . umich . edu

H. V. Jagadish is Bernard A Galler Collegiate Professor of Electrical Engineering and Computer Science, and Director of the Software Systems Research Laboratory, at the University of Michigan in Ann Arbor. After earning his PhD from Stanford in 1985, he spent over a decade at AT&T Bell Laboratories in Murray Hill, N.J., eventually becoming head of the AT&T Labs database research department at the Shannon Laboratory in Florham Park, N.J.
Professor Jagadish is well known for his broad-ranging research on information management, and has approximately 200 major papers and 37 patents. He is a fellow of the ACM ("The First Society in Computing"), serves on the board of the Computing Research Association, and is the Founding Editor-in-Chief of the Proceedings of the VLDB Endowment (since 2008).

Download Summer School Application Form & Summer School Accommodation Form

	The 14th International Conference on Web-Age Information Management June 14-16, 2013, Beidaihe, China