
WELCOME TO THE DATA SYSTEMS LAB
Our vision is to grow data systems research through education, innovation, and discovery, and to bridge the gap between the data systems community and other scientific and business domains that may benefit from data management solutions. Our mission is to design experimental data systems that enable emerging applications, incorporate human-awareness in the system design, process non-traditional data types, or support emerging software platforms or hardware paradigms.
PEOPLE

KANCHAN CHOWDHURY
Research Assistant (PhD Student)

ZISHAN FU
Research Assistant (MS Student)

SETU SHAH
Research Assistant (MS Student)


SELECTED PUBLICATIONS
Members of the Data Systems Lab have contributed to more than 30 publications in major database systems and spatial computing venues. Below is a sample:
GEOSPARK: A CLUSTER COMPUTING FRAMEWORK FOR PROCESSING LARGE-SCALE SPATIAL DATA
ACM SIGSPATIAL International Conference on Geographic Information Systems, GIS 2015
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionalities that include loading data from and storing data to disk as well as regular RDD operations. The Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
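The idea of a spatial range query that runs against a local index in each partition can be illustrated with a small, self-contained sketch. This is not GeoSpark's API or implementation: it is plain Python with no Spark dependency, the names `GridIndex` and `partitioned_range_query` are hypothetical, and a uniform grid is swapped in for the R-tree and Quad-tree indexes named in the abstract purely to keep the example short.

```python
from collections import defaultdict


class GridIndex:
    """Hypothetical local index: buckets 2-D points into square grid cells."""

    def __init__(self, points, cell=1.0):
        self.cell = cell
        self.cells = defaultdict(list)
        for x, y in points:
            self.cells[(int(x // cell), int(y // cell))].append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        """Return the points inside the closed query window."""
        c = self.cell
        out = []
        # Visit only the grid cells that overlap the window.
        for cx in range(int(xmin // c), int(xmax // c) + 1):
            for cy in range(int(ymin // c), int(ymax // c) + 1):
                for x, y in self.cells.get((cx, cy), ()):
                    # Cells on the window border may hold points outside it.
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        out.append((x, y))
        return out


def partitioned_range_query(partitions, window, cell=1.0):
    """Build one local index per partition, query each, and merge the results.

    In a cluster setting each partition would be handled by a separate
    worker; here the partitions are simply processed in a loop.
    """
    results = []
    for part in partitions:
        index = GridIndex(part, cell)
        results.extend(index.range_query(*window))
    return results


partitions = [[(0.5, 0.5), (3.2, 3.9)], [(1.5, 1.5), (9.0, 9.0)]]
print(partitioned_range_query(partitions, (0.0, 0.0, 2.0, 2.0)))
```

In this sketch the index is rebuilt on every call; a real system would build it once per partition and reuse it across queries.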
TWO BIRDS, ONE STONE: A FAST, YET LIGHTWEIGHT, INDEXING SCHEME FOR MODERN DATABASE SYSTEMS
Proceedings of the VLDB Endowment, PVLDB 2016
This paper proposes Hippo, a fast, yet scalable, database indexing approach. It significantly shrinks the index storage and mitigates maintenance overhead without compromising much on the query execution performance. Hippo stores disk page ranges instead of tuple pointers in the indexed table to reduce the storage space occupied by the index. It maintains simplified histograms that represent the data distribution and adopts a page grouping technique that groups contiguous pages into page ranges based on the similarity of their index key attribute distributions. When a query is issued, Hippo leverages the page ranges and histogram-based page summaries to recognize those pages whose tuples are guaranteed not to satisfy the query predicates and inspects only the remaining pages. Experiments based on real and synthetic datasets show that Hippo occupies up to two orders of magnitude less storage space than the B+-Tree while still achieving query execution performance comparable to that of the B+-Tree for 0.1% - 1% selectivity factors. Also, the experiments show that Hippo outperforms BRIN (Block Range Index) in executing queries with various selectivity factors. Furthermore, Hippo achieves up to three orders of magnitude less maintenance overhead and up to an order of magnitude higher throughput (for hybrid query/update workloads) than its counterparts.
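The page-range and histogram idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: pages are modeled as in-memory lists of keys, the histogram is given as a list of bucket split points (`bounds`), each index entry stores the set of buckets its page range touches, and a range is closed once that set covers a fraction `max_density` of all buckets. The function names and the `max_density` parameter are hypothetical.

```python
from bisect import bisect_right


def bucket_of(key, bounds):
    """Histogram bucket of key; bounds are sorted split points between buckets."""
    return bisect_right(bounds, key)


def build_index(pages, bounds, max_density=0.5):
    """Group contiguous pages into (first_page, last_page, buckets) entries."""
    n_buckets = len(bounds) + 1
    entries = []
    start, buckets = 0, set()
    for page_no, page in enumerate(pages):
        buckets |= {bucket_of(k, bounds) for k in page}
        # Close the range once its summary covers enough of the histogram.
        if len(buckets) / n_buckets >= max_density:
            entries.append((start, page_no, frozenset(buckets)))
            start, buckets = page_no + 1, set()
    if start < len(pages):
        entries.append((start, len(pages) - 1, frozenset(buckets)))
    return entries


def range_query(pages, entries, bounds, lo, hi):
    """Return keys in [lo, hi] and the number of pages actually inspected."""
    wanted = set(range(bucket_of(lo, bounds), bucket_of(hi, bounds) + 1))
    hits, inspected = [], 0
    for first, last, buckets in entries:
        # No shared bucket: no tuple in this page range can match, so skip it.
        if buckets.isdisjoint(wanted):
            continue
        for page in pages[first:last + 1]:
            inspected += 1
            hits.extend(k for k in page if lo <= k <= hi)
    return hits, inspected


pages = [[1, 2, 3], [4, 5, 6], [50, 51, 52], [90, 95, 99]]
bounds = [25, 50, 75]  # four buckets: <25, 25-49, 50-74, >=75
entries = build_index(pages, bounds)
print(entries)
print(range_query(pages, entries, bounds, 90, 100))
```

Each entry costs one page range plus a small bucket summary regardless of how many tuples the range holds, which is where the storage saving over per-tuple pointers comes from. The trade-off is that every page in a matching range must still be inspected.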
LARS*: AN EFFICIENT AND SCALABLE LOCATION-AWARE RECOMMENDER SYSTEM
IEEE Transactions on Knowledge and Data Engineering, TKDE 2014
This paper proposes LARS*, a location-aware recommender system that uses location-based ratings to produce recommendations. Traditional recommender systems do not consider the spatial properties of users or items; LARS*, on the other hand, supports a taxonomy of three novel classes of location-based ratings, namely, spatial ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings for spatial items. LARS* exploits user rating locations through user partitioning, a technique that influences recommendations with ratings spatially close to querying users in a manner that maximizes system scalability while not sacrificing recommendation quality. LARS* exploits item locations using travel penalty, a technique that favors recommendation candidates closer in travel distance to querying users in a way that avoids exhaustive access to all spatial items. LARS* can apply these techniques separately, or together, depending on the type of location-based rating available. Experimental evidence using large-scale real-world data from both the Foursquare location-based social network and the MovieLens movie recommendation system reveals that LARS* is efficient, scalable, and capable of producing recommendations twice as accurate as those of existing recommendation approaches.
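One way to picture the travel penalty technique is a top-k search that visits items in order of increasing distance and stops once no farther item could enter the result. The sketch below is an illustration under assumptions, not the LARS* implementation: Euclidean distance stands in for travel distance, the score is taken to be the predicted rating minus a linear penalty, and `recommend`, `predict`, `penalty_per_unit`, and `MAX_RATING` are hypothetical names. The up-front sort stands in for an incremental nearest-neighbor scan over a spatial index, so the saving shown here is in the number of rating predictions.

```python
import heapq
import math

MAX_RATING = 5.0  # assumed upper bound on any predicted rating


def recommend(user_xy, items, predict, k, penalty_per_unit=1.0):
    """Top-k items by (predicted rating - travel penalty).

    items:   list of (item_id, (x, y)) pairs
    predict: function mapping item_id to a predicted rating <= MAX_RATING
    Returns (ranked list of (score, item_id), number of items scored).
    """
    def penalty(xy):
        return penalty_per_unit * math.dist(user_xy, xy)

    # Stand-in for an incremental nearest-neighbor scan of a spatial index.
    by_distance = sorted(items, key=lambda it: penalty(it[1]))

    top = []  # min-heap holding the k best (score, item_id) pairs so far
    scored = 0
    for item_id, xy in by_distance:
        p = penalty(xy)
        # Penalties only grow from here on, so if even a perfect rating
        # cannot beat the current k-th best score, the search can stop.
        if len(top) == k and MAX_RATING - p <= top[0][0]:
            break
        scored += 1
        score = predict(item_id) - p
        if len(top) < k:
            heapq.heappush(top, (score, item_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, item_id))
    return sorted(top, reverse=True), scored


ratings = {"a": 4.0, "b": 4.5, "c": 5.0, "d": 5.0}
items = [("a", (1.0, 0.0)), ("b", (2.0, 0.0)), ("c", (10.0, 0.0)), ("d", (20.0, 0.0))]
print(recommend((0.0, 0.0), items, ratings.get, k=2))
```

In this toy data set the two distant items are never scored, because their penalty alone rules them out.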