Steven Euijong Whang (황의종)
Title: Research Scientist
Google Research
Big data and AI are two of the most important technologies of the Fourth Industrial Revolution and are inevitably becoming integrated. Traditionally, Big data research focused on scaling to large amounts of data, while AI research focused on intelligent processing, but not necessarily on large data. Recently, practical systems with real-life applications have come to need integrated technology from both AI and Big data, introducing new challenges that change the way we think about traditional AI and Big data problems. In this talk, I will describe my research, starting from the Big data work I did at Stanford and moving to Big data-AI integration at Google Research.
At Stanford, I worked on Big data analytics (in particular, information integration and privacy), which has become an extremely important and challenging problem in disciplines such as computer science, biology, and medicine. As massive amounts of data become available for analysis, scalable integration techniques that provide a unified view over heterogeneous information from various sources are increasingly important for data analytics. Within information integration, I will focus on the problem of entity resolution (ER), which identifies records that refer to the same real-world entity. In practice, ER is not a one-time process; it is constantly improved as the data, schema, and application become better understood. In this talk I will address the problem of keeping the ER result up to date when the ER logic "evolves" frequently through changing rules. A naive approach that re-runs ER from scratch may be prohibitively expensive on large datasets. I will show when and how we can instead exploit previously "materialized" ER results to avoid redundant work under the evolved rules. I will also briefly explain how I used crowdsourcing techniques to enhance ER.
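To make the idea of reusing materialized results concrete, here is a minimal Python sketch; it is not the actual algorithm from the talk, and the record fields, rules, and reuse condition are all illustrative assumptions. When a rule evolves by becoming strictly stronger (every pair that matches the new rule also matched the old one), the previously materialized match pairs are the only candidates that need re-checking.

```python
# Toy sketch of rule evolution in entity resolution (illustrative only).
# Records are dicts; a "rule" is a boolean predicate over record pairs.
from itertools import combinations

def run_er(records, rule):
    """Naive ER: compare all pairs and materialize the matching ones."""
    return {(a, b) for a, b in combinations(sorted(records), 2)
            if rule(records[a], records[b])}

def incremental_er(records, new_rule, materialized):
    """Re-check only previously matched pairs; valid when new_rule is
    strictly stronger than the rule that produced `materialized`."""
    return {(a, b) for (a, b) in materialized
            if new_rule(records[a], records[b])}

records = {
    "r1": {"name": "IBM",   "city": "Armonk"},
    "r2": {"name": "I.B.M", "city": "Armonk"},
    "r3": {"name": "IBM",   "city": "Austin"},
}

def norm(s):
    return s.replace(".", "").lower()

old_rule = lambda x, y: norm(x["name"]) == norm(y["name"])
new_rule = lambda x, y: old_rule(x, y) and x["city"] == y["city"]

matches = run_er(records, old_rule)
# {('r1', 'r2'), ('r1', 'r3'), ('r2', 'r3')}
updated = incremental_er(records, new_rule, matches)
# {('r1', 'r2')}
print(matches, updated)
```

The naive run compares all O(n^2) pairs, while the incremental run touches only the previously matched pairs; the talk addresses the general question of when and how such reuse is safe as rules evolve.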
At Google Research, I worked on AI systems that use Big data. First, I will describe the Biperpedia project (a state-of-the-art knowledge base for search applications), which I developed as the technical lead. Using Biperpedia, a search engine can take a query like "brazil coffee production 2017" and understand that the user is asking for a numeric attribute (coffee production) of the country Brazil in the year 2017. While the attributes of existing knowledge bases like Freebase are manually curated, Biperpedia automatically extracts long-tail attributes (thousands per class) from Search queries and Web text using machine learning and natural language processing techniques. Next, I will briefly describe my current work on developing Big data management infrastructure for large-scale machine learning systems. Unlike conventional software, the success of a machine learning system heavily depends on the quality of the data used to train its models; in fact, most significant outages in production-scale machine learning systems involve some problem in the data. Hence, Big data management is critical at every step of large-scale machine learning.
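As a rough illustration of the query-understanding step, a lightweight interpreter might look like the sketch below; the entity list, attribute vocabulary, and matching logic are invented for the example and are not Biperpedia's actual implementation.

```python
# Hypothetical sketch: interpreting an attribute query against a small
# knowledge base of (class, attribute) pairs. Illustrative only.
import re

ENTITIES = {"brazil": "country", "google": "company"}   # entity -> class
ATTRIBUTES = {                                          # class -> attributes
    "country": {"coffee production", "population", "gdp"},
    "company": {"revenue", "ceo"},
}

def interpret(query):
    tokens = query.lower().split()
    year = next((t for t in tokens if re.fullmatch(r"\d{4}", t)), None)
    rest = [t for t in tokens if t != year]
    for entity, cls in ENTITIES.items():
        if entity in rest:
            remainder = " ".join(t for t in rest if t != entity)
            if remainder in ATTRIBUTES[cls]:
                return {"entity": entity, "class": cls,
                        "attribute": remainder, "year": year}
    return None

print(interpret("brazil coffee production 2017"))
# {'entity': 'brazil', 'class': 'country',
#  'attribute': 'coffee production', 'year': '2017'}
```

Similarly, here is a hedged sketch of the kind of training-data check the data management work targets; the statistics, threshold, and helper names are assumptions, not the production infrastructure. The idea is to compare a new batch of training data against statistics from the previous batch and flag anomalies before they reach the model.

```python
# Hypothetical data check: flag features whose missing-value rate jumps
# relative to the previous batch of training data. Illustrative only.
def validate_batch(prev_stats, batch):
    errors = []
    for feature, prev_missing in prev_stats.items():
        values = [row.get(feature) for row in batch]
        missing = sum(v is None for v in values) / len(values)
        if missing > prev_missing + 0.1:  # threshold is an arbitrary example
            errors.append(f"{feature}: missing rate {missing:.0%} "
                          f"(was {prev_missing:.0%})")
    return errors

prev_stats = {"age": 0.01, "country": 0.02}
batch = [{"age": 34, "country": "BR"}, {"age": None, "country": None},
         {"age": None, "country": "KR"}, {"age": 41, "country": None}]
print(validate_batch(prev_stats, batch))
# ['age: missing rate 50% (was 1%)', 'country: missing rate 50% (was 2%)']
```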
Steven Euijong Whang (황의종) is a Research Scientist at Google Research. His research interests include Big Data-Artificial Intelligence Integration, Big Data Analytics, Information Integration, Knowledge Systems, and Machine Learning. The goal of his work is to integrate Big Data and Artificial Intelligence techniques to build scalable AI systems that handle large amounts of data. Dr. Whang received his Ph.D. in Computer Science from Stanford University in 2012, working with Prof. Hector Garcia-Molina. He received his M.S. in Computer Science from Stanford in 2007 and his B.S. in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 2003. In his dissertation, he made significant contributions to information integration (in particular, entity resolution), proposing a general framework for entity resolution and comprehensively working out a series of related issues. At Google Research, he was the technical lead of the Biperpedia project (a scalable, state-of-the-art knowledge base) and is currently working on Big data management challenges in large-scale machine learning systems. His work has led to numerous publications in top venues, including VLDB (5), ACM SIGMOD (3, including a tutorial), IEEE ICDE (2), ACM SIGKDD (1), EMNLP (1), WWW (1), ACM SIGIR (1), ACM CIKM (1), CIDR (1), VLDB Journal (4), IEEE TKDE (1), and a chapter in a textbook. He received the Best Paper Award at WebDB 2015 and is a recipient of the IBM PhD Fellowship, the KFAS (Korea Foundation for Advanced Studies) Fellowship, and the KAIST Presidential Prize at commencement. His CV is available at http://infolab.stanford.edu/~euijong