Two new cloud-based data processing papers published

Two of my latest research papers about cloud-based data processing have just been published. The first paper is entitled “Capability-based Scheduling of Scientific Workflows in the Cloud” and deals with the scheduling algorithm I implemented in Steep. I presented this paper at the 9th International Conference on Data Science, Technology and Applications (DATA), which was held as a virtual conference due to COVID-19.

The other paper, entitled “Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization”, was a collaboration with Ralf Gutbell, Hendrik M. Würz, and Jannis Weil, in which we implemented an approach to distributed triangulation of digital terrain models with Apache Spark and GeoTrellis. This journal article has been published in the AGILE GIScience Series.

Please find details about the papers, the conference presentation of the first one, as well as the full references below.

Capability-based Scheduling of Scientific Workflows in the Cloud

In this paper, I presented a distributed task scheduling algorithm and a software architecture for a system executing scientific workflows in the Cloud. The main challenges I addressed are (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the Cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance because in the Cloud, faults can happen at any time.

My software architecture consists of loosely coupled components communicating with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently.
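To give a rough idea of what converting a workflow graph into process chains can look like, here is a minimal Python sketch. It is not taken from the paper or from Steep. It assumes that a process chain is simply a linear run of tasks in a directed acyclic graph, and the function name `to_process_chains` as well as the input format are made up for illustration. The actual conversion rules may differ.

```python
from collections import defaultdict


def to_process_chains(tasks, edges):
    """Split a DAG into linear chains (illustrative assumption, not Steep's code).

    tasks: list of task ids
    edges: list of (predecessor, successor) pairs
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)

    def starts_chain(task):
        # A task starts a new chain unless it merely continues a linear run,
        # i.e. it has exactly one predecessor that has exactly one successor.
        return len(pred[task]) != 1 or len(succ[pred[task][0]]) != 1

    chains = []
    for task in tasks:
        if not starts_chain(task):
            continue
        chain = [task]
        # Extend the chain while the graph stays linear.
        while (len(succ[chain[-1]]) == 1
               and len(pred[succ[chain[-1]][0]]) == 1):
            chain.append(succ[chain[-1]][0])
        chains.append(chain)
    return chains


# A small diamond-shaped workflow: a -> b, b -> c, b -> d, c -> e, d -> e
example = to_process_chains(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")],
)
print(example)
```

In this example, `a` and `b` end up in one chain, while `c`, `d`, and `e` each form their own chain because the graph forks after `b` and joins again before `e`.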

My scheduling algorithm collects distinct required capability sets for the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly.
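The three steps described above can be sketched in a few lines of Python. This is a simplified, single-process illustration under several assumptions: agents are modelled as plain sets of capabilities instead of remote components reachable over an event bus, every agent takes at most one process chain per scheduling round, and the rule of preferring the most demanding capability set is my own choice for this example. The names `schedule`, `process_chains`, and `agents` are hypothetical.

```python
def schedule(process_chains, agents):
    """One scheduling round (illustrative sketch, not Steep's implementation).

    process_chains: list of (chain_id, set of required capabilities)
    agents: dict mapping agent_id -> set of provided capabilities
    Returns a dict mapping chain_id -> agent_id.
    """
    # Step 1: collect the distinct required capability sets and remember
    # which pending process chains belong to each of them.
    pending = {}
    for chain_id, required in process_chains:
        pending.setdefault(frozenset(required), []).append(chain_id)

    assignments = {}
    for agent_id, provided in agents.items():
        # Step 2: "ask" the agent which of the distinct sets it can manage.
        # Here this is just a subset check against its capabilities.
        supported = [rc for rc, chains in pending.items()
                     if chains and rc <= set(provided)]
        if not supported:
            continue
        # Step 3: assign a process chain. Assumption: prefer the largest
        # capability set so that specialized agents are not wasted.
        chosen = max(supported, key=len)
        assignments[pending[chosen].pop(0)] = agent_id
    return assignments


chains = [("pc1", {"docker"}), ("pc2", {"docker", "gpu"}), ("pc3", {"docker"})]
agents = {"agent-a": {"docker"}, "agent-b": {"docker", "gpu"}}
result = schedule(chains, agents)
print(result)
```

With these inputs, `pc1` goes to `agent-a` and `pc2` goes to `agent-b`, while `pc3` stays pending until an agent becomes available in a later round. Grouping by distinct capability sets means the agents are asked once per set rather than once per process chain, which matters when a workflow consists of a very large number of tasks.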

I presented the results of four experiments I conducted to evaluate whether my approach meets the aforementioned challenges. The paper finishes with a discussion, conclusions, and future research opportunities.

An implementation of my algorithm and software architecture is publicly available as part of the open-source workflow management system Steep.


Here are the slides of the presentation I gave at the DATA conference:


Krämer, M. (2020). Capability-based Scheduling of Scientific Workflows in the Cloud. Proceedings of the 9th International Conference on Data Science, Technology, and Applications DATA, 43–54. https://doi.org/10.5220/0009805400430054


The paper has been published under the CC BY-NC-ND 4.0 license. You may download the final manuscript here.

Scalable processing of massive geodata in the cloud

In this paper, we described a cloud-based approach to transform arbitrarily large terrain data into a hierarchical level-of-detail structure that is optimized for web visualization. Our approach is based on a divide-and-conquer strategy. The input data is split into tiles that are distributed to individual workers in the cloud. These workers apply a Delaunay triangulation with a maximum number of points and a maximum geometric error. They merge the results and triangulate them again to generate less detailed tiles. The process repeats until a hierarchical tree of different levels of detail has been created. This tree can be used to stream the data to the web browser.
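The bottom-up structure of this process can be illustrated with a small Python sketch. It runs sequentially and does not use Apache Spark or GeoTrellis. More importantly, it replaces the Delaunay triangulation with a naive subsampling step that only enforces a maximum number of points and ignores the geometric error, so it shows the shape of the divide-and-conquer loop and nothing more. The tile layout (a quadtree where four neighbouring tiles merge into one parent) and all names are assumptions made for this example.

```python
def simplify(points, max_points):
    # Placeholder for the triangulation step: keep an evenly spaced subset
    # so that a tile never holds more than max_points points.
    if len(points) <= max_points:
        return list(points)
    step = len(points) / max_points
    return [points[int(i * step)] for i in range(max_points)]


def build_lod_pyramid(leaf_tiles, max_level, max_points):
    """Build a level-of-detail pyramid from the most detailed tiles.

    leaf_tiles: dict mapping (x, y) tile coordinates at max_level
                to a list of (px, py, pz) points
    Returns a dict mapping level -> {(x, y): simplified points}.
    """
    # Most detailed level: simplify every input tile independently.
    pyramid = {
        max_level: {key: simplify(pts, max_points)
                    for key, pts in leaf_tiles.items()}
    }
    # Repeatedly merge up to four child tiles into their parent tile and
    # simplify again until only the root level is left.
    for level in range(max_level - 1, -1, -1):
        merged = {}
        for (x, y), pts in pyramid[level + 1].items():
            merged.setdefault((x // 2, y // 2), []).extend(pts)
        pyramid[level] = {key: simplify(pts, max_points)
                          for key, pts in merged.items()}
    return pyramid


# A 4x4 grid of tiles at level 2 with ten dummy points each
leaves = {(x, y): [(x + i / 10, y + i / 10, 0.0) for i in range(10)]
          for x in range(4) for y in range(4)}
pyramid = build_lod_pyramid(leaves, max_level=2, max_points=8)
print({level: len(tiles) for level, tiles in pyramid.items()})
```

Each level depends only on the tiles of the level below it, and every tile within a level can be processed independently. This independence is what makes the approach a natural fit for distributing the work across many workers.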

We have implemented this approach using the frameworks Apache Spark and GeoTrellis. Our paper includes an evaluation of our approach and the implementation. We focus on scalability and runtime but also investigate bottlenecks, possible reasons for them, as well as options for mitigation. The results of our evaluation show that our approach and implementation are scalable and that we are able to process massive terrain data.


Krämer, M., Gutbell, R., Würz, H. M., & Weil, J. (2020). Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization. AGILE: GIScience Series, 1. https://doi.org/10.5194/agile-giss-1-10-2020


The paper has been published under the CC BY 4.0 license. You may download the final manuscript here.

Posted by Michel Krämer
on July 16th, 2020.