Storage patterns for data-local processing of georeferenced data in BigData systems

Type of thesis: Master's thesis / Location: Leipzig / Status of thesis: Finished

With more and more geolocation-aware devices, large georeferenced data sets are no longer uncommon. When such data sets are processed, the georeferences make it possible to discover location-related patterns and connections between records that would otherwise remain hidden. To enable global processing and aggregation, the georeferenced data tracked by sensors in mobile devices is transmitted to and stored in large data storage systems. Mobile systems such as smartphones, tablets, and cars are often permanently online, which allows them to tag their sensor values with a location immediately and send them to the central storage system. Depending on the area of application, this can lead to a significant amount of collected data.

To store and process such vast quantities of data, BigData solutions like the Hadoop framework with its integral components HDFS and MapReduce have gained popularity during the last decade [1] [2] [3]. BigData solutions can store huge amounts of data while preserving the ability to consider all data for aggregations. With georeferenced data, however, it is often sufficient to consider only local regions for aggregations, e.g. cities or districts. Dividing georeferenced data into regions has already been proposed with the Geohash [4] concept. As a next step, storing all data that belongs to a region on the same cluster node may speed up computations by exploiting the resulting data locality.
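The idea of region-based placement can be illustrated with a minimal sketch. It encodes a coordinate as a Geohash, treats a Geohash prefix as the region, and maps each region to one node. The function names, the node names, and the hash-based assignment are assumptions made for this illustration only; they are not part of HDFS or of the approaches developed in the thesis.

```python
import hashlib

# Geohash base-32 alphabet (digits and lowercase letters without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between the axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits form one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def node_for_region(geohash, region_precision, nodes):
    """Pick a node for the region given by a Geohash prefix (hypothetical scheme)."""
    region = geohash[:region_precision]
    digest = hashlib.sha256(region.encode("ascii")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


nodes = ["node-0", "node-1", "node-2"]  # hypothetical cluster node names
code = geohash_encode(57.64911, 10.40744)
print(code, "->", node_for_region(code, 4, nodes))
```

All records whose Geohash shares the same four-character prefix end up on the same node in this sketch. A plain hash assignment like this ignores region sizes and load, so densely populated regions could still create hot spots on single nodes.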

Aim

The aim of the thesis is to design, implement, and compare approaches for region-based storage of georeferenced data within the Hadoop Distributed File System (HDFS). The comparison of the implemented strategies must take into account at least the variability of region sizes, the performance gain achieved, and hot spot mitigation.

References

[1] A. Eldawy and M. F. Mokbel, "SpatialHadoop – A MapReduce Framework for Spatial Data," University of Minnesota. [Online]. Available: http://spatialhadoop.cs.umn.edu/.

[2] A. Aji et al., "Hadoop GIS: A high performance spatial data warehousing system over MapReduce," in Proceedings of the VLDB Endowment 6.11, 2013, pp. 1009-1020.

[3] A. Eldawy and M. F. Mokbel, "A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data," in Proceedings of the VLDB Endowment 6.12, 2013, pp. 1230-1233.

[4] "Geohash," [Online]. Available: http://en.wikipedia.org/wiki/Geohash. [Accessed 01 08 2014].

Contact person

Dr. Eric Peukert

Administration Director

Department of computer science

Universität Leipzig
