Hosted by the University of Toronto, this event focuses on high-performance geoprocessing and big data.
Our deep gratitude to Dr. Angela Demke Brown and the University of Toronto's Computer Systems and Network Group for hosting this event. The event will take place in room 1190 on the ground floor of the Bahen Centre for Information Technology (40 St. George Street).
Brief Welcome: Andrew Ross & Dr. Angela Demke Brown
Fast, Distributed Geoprocessing with Scala and GeoTrellis
by Robert Cheetham, CEO, Azavea
What got you hooked on geospatial? For me it was more than just being able to see stuff on a map – it was the ability to transform geographic data in ways that enabled me to see something new, make a better decision or shed new light on some aspect of my environment.
Whether the tool is GDAL, ArcGIS ModelBuilder, GRASS or IDRISI, this type of data transformation has usually been done with desktop software. So why have these types of capabilities been relatively rare in web and mobile applications? Speed and scalability. It has generally required too much time to calculate a viewshed, combine a pile of raster data into a weighted overlay, compute a watershed or generate slope and aspect from elevation data.
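To make the kind of transformation in question concrete, here is a minimal single-machine sketch of a weighted overlay in plain Python. The layer names, values, and weights are invented for illustration; this is the concept, not the code of any of the tools named above:

```python
# Weighted overlay: combine several raster layers into one score surface.
# Each raster is a 2-D grid of cell values; weights express relative importance.

def weighted_overlay(rasters, weights):
    """Return a raster whose cells are the weighted sum of the input cells."""
    assert len(rasters) == len(weights)
    rows, cols = len(rasters[0]), len(rasters[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for raster, w in zip(rasters, weights):
        for r in range(rows):
            for c in range(cols):
                out[r][c] += w * raster[r][c]
    return out

slope    = [[1, 2], [3, 4]]   # hypothetical input layers
land_use = [[4, 3], [2, 1]]
score = weighted_overlay([slope, land_use], [0.6, 0.4])
print(score)  # approximately [[2.2, 2.4], [2.6, 2.8]]
```

On a real dataset the grids have millions of cells per layer, which is why doing this at web-request latency is the hard part.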
Azavea has been working on this problem – fast, scalable geoprocessing for the web – for the past few years. In 2012 we released a new open source project called GeoTrellis (http://geotrellis.io/), an open source framework for high-performance (low latency), distributed geoprocessing. Built using the Scala programming language and based on the Akka and Spark frameworks, GeoTrellis is designed to create scalable, fast geoprocessing applications as well as parallelize geoprocessing operations for large geospatial datasets in order to take full advantage of distributed, multi-core architectures.
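The core idea of parallelizing a raster operation over tiles can be sketched independently of GeoTrellis itself (which is Scala, built on Akka and Spark). The hedged Python illustration below uses an invented reclassification operation and a thread pool as a stand-in for a distributed cluster: tile the raster, map the operation over tiles in parallel, reassemble:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_tile(tile, threshold):
    """Per-tile operation: flag cells above a threshold (a stand-in for
    slope, viewshed, etc. -- the kinds of operations being parallelized)."""
    return [[1 if v > threshold else 0 for v in row] for row in tile]

def split_rows(raster, n):
    """Split a raster into up to n horizontal strips (a simple tiling scheme)."""
    step = (len(raster) + n - 1) // n
    return [raster[i:i + step] for i in range(0, len(raster), step)]

def parallel_classify(raster, threshold, workers=4):
    tiles = split_rows(raster, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: classify_tile(t, threshold), tiles)
    return [row for tile in results for row in tile]  # reassemble in order

elevation = [[100, 220], [180, 300], [90, 150], [260, 210]]
print(parallel_classify(elevation, 200))  # [[0, 1], [0, 1], [0, 0], [1, 1]]
```

Because the per-tile operation is independent, the same map-over-tiles structure shards across cores or cluster nodes; the framework's job is to manage that distribution and keep latency low.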
This talk will give an overview of the GeoTrellis framework; how it leverages features of Scala, Akka, Spark and other frameworks; and how it can be integrated with conventional web mapping tools to create apps that are more than just dots on a map. The talk will also give an overview of applications for online geoprocessing in several domains including: stormwater modeling, education games, infrastructure prioritization, climate change and transportation.
Spatial Data Processing with Hadoop
by Ahmed Eldawy, PhD Candidate at the University of Minnesota
This talk describes GeoJinni, formerly known as SpatialHadoop, an open source full-fledged MapReduce framework with native support for spatial data [http://spatialhadoop.cs.umn.edu]. GeoJinni handles large-scale spatial data by injecting spatial data awareness into each layer of Hadoop, namely the language, storage, MapReduce, and operations layers. In the language layer, a new high-level language, termed Pigeon, is proposed to work with standard spatial data types and operations. In the storage layer, GeoJinni supports standard spatial indexes, namely Grid File, R-tree and R+-tree, adapted to work in a distributed environment. The MapReduce layer contains new components that utilize the spatial indexes. The operations layer encapsulates many spatial operations such as range query, spatial join, computational geometry, and visualization operations. The extensibility and efficiency of GeoJinni have allowed it to be used as a backbone in three real systems: SHAHED, a system for satellite data analysis and visualization; TAREEG, a MapReduce extractor for OpenStreetMap data; and MNTG, a web-based traffic generator.
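To give a flavor of what spatial awareness in the storage layer buys, here is a hedged, single-machine Python sketch of a grid index (the simplest of the index families mentioned above, and not GeoJinni's actual Java implementation): points are bucketed into fixed-size cells, so a range query scans only the cells its rectangle overlaps rather than the whole dataset:

```python
from collections import defaultdict

class GridIndex:
    """Toy grid file: hash points into square cells so a range query
    inspects only overlapping cells instead of scanning every point."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y):
        self.cells[self._cell(x, y)].append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Only points in overlapping cells are tested exactly.
                for (x, y) in self.cells.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y))
        return hits

idx = GridIndex(cell_size=10)
for p in [(3, 4), (12, 7), (25, 25), (14, 14)]:
    idx.insert(*p)
print(sorted(idx.range_query(0, 0, 15, 15)))  # [(3, 4), (12, 7), (14, 14)]
```

In a distributed setting the same idea maps cells to HDFS blocks, so a MapReduce job can skip blocks whose cells fall entirely outside the query region.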
Parallel Spatial Join Query Processing: Challenges and Opportunities
by Suprio Ray, PhD Candidate at the University of Toronto
Spatial join is a crucial operation in many spatial analysis applications in scientific and geographical information systems. Due to the compute-intensive nature of spatial predicate evaluation, spatial join queries can be slow even with a moderate sized dataset. Efficient parallelization of spatial join is therefore essential to achieve acceptable performance for many spatial applications. Technological trends, such as the rising core count through multicore machines, Cloud computing and growing main memory capacity, hold great promise in this regard.
A key problem with spatial join queries is processing skew, which significantly limits the achievable parallel performance. Previous parallel spatial join approaches focused only on the filter step. However, when the more compute-intensive refinement step is included, significant processing skew may arise due to the uneven size of the objects. Another issue with spatial join processing, particularly in the Cloud, is performance heterogeneity. Unfortunately, traditional parallel spatial join approaches are ill-equipped to deal with the performance heterogeneity that is common in the Cloud.
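The filter/refinement distinction behind the skew problem can be made concrete. In the hedged Python sketch below (circles stand in for complex geometries; the data and names are invented, not the authors' code), the filter step compares cheap bounding boxes and the refinement step runs the exact predicate only on surviving pairs. With real polygons, refinement cost grows with object complexity, which is what makes work per partition uneven:

```python
import math

# Each object: (id, center_x, center_y, radius). Circles stand in for
# complex geometries: the bounding-box test is the cheap filter step, the
# exact distance test plays the role of the expensive refinement step.

def bbox(o):
    _, x, y, r = o
    return (x - r, y - r, x + r, y + r)

def bbox_overlap(a, b):
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def exact_intersect(a, b):
    _, ax, ay, ar = a
    _, bx, by, br = b
    return math.hypot(ax - bx, ay - by) <= ar + br

def spatial_join(left, right):
    candidates = [(a, b) for a in left for b in right if bbox_overlap(a, b)]  # filter
    return [(a[0], b[0]) for a, b in candidates if exact_intersect(a, b)]     # refine

L = [("a", 0, 0, 1), ("b", 10, 10, 1)]
R = [("x", 1.4, 1.4, 1), ("y", 1.9, 1.9, 1), ("z", 10, 12, 1)]
print(spatial_join(L, R))  # [('a', 'x'), ('b', 'z')]
```

Note that the pair ("a", "y") survives the filter but is rejected by refinement; a parallel scheme that balances only candidate counts can still leave one worker with most of the refinement work.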
In this talk I describe two systems that we developed to address the above-mentioned problems. The first system, Niharika, is a parallel spatial data analysis infrastructure for the Cloud. Niharika adapts to performance heterogeneity and processing skew in the spatial dataset using spatial declustering and dynamic load-balancing. We evaluate Niharika with three load-balancing algorithms and two different spatial datasets (both from TIGER) using Amazon EC2 instances. Niharika adapts to the performance heterogeneity in the EC2 nodes, thereby achieving excellent speedups.
The second system, SPINOJA, is a skew-resistant parallel in-memory spatial join infrastructure. SPINOJA introduces MOD-Quadtree declustering, which partitions the spatial dataset such that the amount of computation demanded by each partition is equalized and the processing skew is minimized. We compare three work metrics used to create the partitions and three load-balancing strategies to assign the partitions to multiple cores. Our evaluation shows that SPINOJA outperforms in-memory implementations of previous spatial join approaches by a significant margin.
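The idea behind work-equalizing quadtree declustering can be sketched in a few lines. The Python below uses an invented cost model (object "size" standing in for refinement cost) and is only an illustration of the principle, not SPINOJA's MOD-Quadtree implementation: keep splitting a region four ways until the estimated work in each partition falls under a budget, so no single partition dominates the join:

```python
def estimated_work(objects):
    """Toy cost model: refinement cost grows with object size, which is
    what makes partitions balanced by object COUNT alone skewed."""
    return sum(size for _, size in objects)

def decluster(region, objects, budget):
    """Recursively quadtree-split region = (xmin, ymin, xmax, ymax) until
    the estimated work per partition is under the budget."""
    if estimated_work(objects) <= budget or len(objects) <= 1:
        return [(region, objects)]
    xmin, ymin, xmax, ymax = region
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    quads = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
             (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
    parts = []
    for q in quads:
        # Assign each object (point, size) to the quadrant holding its point.
        inside = [o for o in objects
                  if q[0] <= o[0][0] < q[2] and q[1] <= o[0][1] < q[3]]
        if inside:
            parts.extend(decluster(q, inside, budget))
    return parts

objects = [((1, 1), 8), ((2, 2), 8), ((9, 9), 2)]   # two "heavy", one "light"
parts = decluster((0, 0, 16, 16), objects, budget=10)
print([(region, estimated_work(objs)) for region, objs in parts])
```

The two heavy objects end up in separate partitions while the light one keeps a large region to itself: partition boundaries follow the distribution of work rather than a uniform grid.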
About LocationTech
LocationTech is a vendor neutral community for individuals and organizations who wish to collaborate on commercially-friendly open source software that is location aware.
LocationTech hosts technology projects and helps cultivate both an open source community and an ecosystem of complementary products and services.