报告题目:集合包含关系查询处理的相关技术
报告时间:2019年4月27日上午10:00
报告地点:校本部计算机楼三楼308会议室
报告人:杨建业博士
摘要:在数据库领域,集合包含关系连接是一个经典的基础集合运算,当前的主流算法分为两大类:集合交集导向和集合并集导向,二者各有优劣。前者利用倒排索引结构,通过对倒排索引链表进行集合的交集运算得到结果,整个运算过程无需再验证,然而其缺点在于建立倒排索引时,需要对S中的每个集合保存多个副本,因而随着每个集合的增大,其计算开销急剧增长;后者利用标签技术进行候选集生成和验证两个阶段,其优点是索引结构小,候选集生成效率高,缺点是需要进行验证,对于结果集比较大的数据集,这个方法比较耗时。我们提出一种新的集合并集导向的方法,TT-Join,该方法融合了集合交集导向的优点,并且利用前缀树结构分别对R和S建立索引结构,在此基础上,巧妙的在两颗前缀树之间同步进行树游历操作, 得到最终结果。我们在20个benchmark数据集上与7个现有方法进行实验比较,实验结果显示TT-Join在绝大部分数据集上要显著优于现有方法,可达两个数量级。同时,为了能支持更大规模的数据集,我们将TT-Join扩展到分布式计算框架(mapreduce)上,实验结果显示,我们的数据划分策略要显著优于随机划分和现有的基于集合相似性的划分。
报告人简介:蚂蚁金服高级算法工程师,负责集团风控相关模型与算法落地。2017年8月博士毕业于澳大利亚新南威尔士大学,师从数据库领域专家Xuemin Lin教授,本科硕士毕业于西安电子科技大学,主要研究方向为文本数据库及图数据库查询技术与算法。相关工作发表于数据库及数据挖掘领域顶级国际期刊和会议上,包括VLDBJ, ICDE, TKDE, KAIS, WWWJ, 其中CCF A类国际期刊和会议长文5篇, CCF B类国际期刊长文2篇,曾担任多个国际期刊和会议的审稿人,ACM/IEEE会员。
Title:Efficient set containment join
Report time:10:00 am, April 27, 2019
Report location:conference room 308, 3rd floor, computer building, campus campus
Reporter:Dr. Yang jianye
Digest:Set containment join is a fundamental operation on massive collections of set values.Recent research focuses on the in-memory set containment join algorithms, and several techniques have been developed following intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that two computing paradigms have their limits due to the nature of the intersection and union operators. Particularly, intersection-oriented method relies on the intersection of the relevant inverted lists. A nice property of this method is that the join computation is verification free. However, the number of records explored during the join process may by large because there are multiple replicas for each record in S. On the other hand, the union-oriented method follows a candidate generation-and-verification paradigm by utilizing effective signatures. Unfortunately, union-oriented method needs to verify the candidate pairs, which may be cost expensive especially when the join result size is large. In this work, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the goodness of intersection-oriented methods by imposing a variant of prefix tree structure. We conduct extensive experiments on 20 real-life datasets and synthetic datasets by comparing our method with 7 existing methods. The experiment results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets, and can achieve up to two orders of magnitude speedup. Furthermore, to support large scale of datasets, we extend our techniques to distributed systems on top of MapReduce framework. With the help of careful designed load-aware distribution mechanisms, our distributed join algorithm can achieve up to an order of magnitude speedup than the baselines methods.
Speaker profile:Dr Jianye Yang is currently a senior data engineer in Ant Financial Servieces Group. He received the PhD degree in Computer Science from the University of New South Wales, Australia, under the supervision of professor Xuemin Lin, in 2017. He got his Bachelor and Master degree in Computer Science from Xidian University, China, in 2010 and 2013 respectively. His research interests include text data query processing and graph data analysis. He has published multiple papers in top-tier international conference and journals, such as VLDBJ, ICDE, TKDE, KAIS, WWWJ. He served as peer reviewer for top-tier international conferences and journals.