Status: finished / Type of Thesis: Master thesis / Location: Leipzig
Time-series data has become increasingly important for Industry 4.0, IoT and data-driven companies. As data volumes grow, NoSQL databases such as Apache Accumulo, Cassandra and HBase provide extensions for working with time-series data.
Unfortunately, these extensions are either immature or do not provide exact results for aggregations (min, max, sum, avg, standard deviation, percentile) over large data sets.
This thesis aims to define a performant schema for exact aggregations, using either Apache Accumulo with server-side iterators or Apache Flink as a distributed computation framework.
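To illustrate why aggregations such as min, max, sum, avg and standard deviation suit server-side or distributed evaluation, here is a minimal Python sketch of a mergeable partial aggregate. It is not the schema or implementation developed in the thesis. The names `PartialAggregate` and `aggregate` are hypothetical, and the variance merge uses the well-known pairwise update formula as an assumption about how partitions could be combined. Floating-point results remain subject to rounding, and percentiles are deliberately left out because they cannot be merged from such a small fixed-size state.

```python
from dataclasses import dataclass, replace
import math


@dataclass
class PartialAggregate:
    """Hypothetical fixed-size state that one partition could compute locally."""
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def add(self, x: float) -> None:
        """Fold one value into the state (single-pass mean/variance update)."""
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "PartialAggregate") -> "PartialAggregate":
        """Combine two partial states without revisiting the raw values."""
        if self.count == 0:
            return replace(other)
        if other.count == 0:
            return replace(self)
        n = self.count + other.count
        delta = other.mean - self.mean
        return PartialAggregate(
            count=n,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
            total=self.total + other.total,
            mean=self.mean + delta * other.count / n,
            m2=self.m2 + other.m2 + delta * delta * self.count * other.count / n,
        )

    def std_dev(self) -> float:
        """Population standard deviation; NaN for an empty state."""
        return math.sqrt(self.m2 / self.count) if self.count else math.nan


def aggregate(values) -> PartialAggregate:
    """Build the partial state for one partition of values."""
    state = PartialAggregate()
    for value in values:
        state.add(value)
    return state
```

In such a setup each partition (for example a tablet or a parallel task) would call `aggregate` on its own rows, and only the small states would be sent over the network and combined with `merge`.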
[1] Knuth, Donald Ervin: The Art of Computer Programming. Volume 2, Seminumerical Algorithms. p. 216, 1998.
[2] Menne, M.J.; Durre, I.; Korzeniewski, B.; McNeal, S.; Thomas, K.; Yin, X.; Anthony, S.; Ray, R.; Vose, R.S.; Gleason, B.E.; Houston, T.G.: Global Historical Climatology Network-Daily (GHCN-Daily), Version 3.22. NOAA National Climatic Data Center, 2012. http://doi.org/10.7289/V5D21VHZ, accessed 18.10.2016.
[3] Saukas, Einar L.G.; Song, Siang W.: Efficient selection algorithms on distributed memory computers. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, pp. 1–26, 1998.