Steep 5.6.0

The new version of my scientific workflow management system Steep has just been released. New features include automatic retrying, the ability to deploy multiple agents per instance, and an optimised scheduling algorithm.

Steep 5.6.0 also comes with a few other im­prove­ments and fixes (see com­plete list be­low). The ver­sion has been thor­oughly tested in prac­tice over the last couple of months.

Steep is a scientific workflow management system that can execute data-driven workflows in the Cloud. It is well suited to harnessing distributed computing in order to parallelise work and to speed up your data processing workflows, no matter how complex they are and regardless of how much data you need to process. Steep is open-source software developed at Fraunhofer IGD. You can download the binaries and the source code of Steep from its GitHub repository.

Automatic retrying

It is now pos­sible to spe­cify retry policies to define how of­ten a cer­tain ser­vice or work­flow ac­tion should be re-ex­ecuted in case of an er­ror. The fea­ture is best ex­plained with an ex­ample:

- type: execute
  service: cp
  inputs:
    - id: input_file
      var: input_file
  outputs:
    - id: output_file
      var: output_file
  retries:
    maxAttempts: 5
    delay: 1s
    exponentialBackoff: 2
    maxDelay: 10s

The example shows an execute action that copies an input_file to an output_file. The retry policy (attribute retries) specifies that the action should be re-executed if the copy process fails. The maxAttempts attribute defines the maximum number of executions, including the initial attempt. In this example, Steep will run the action once and then retry it up to four times (1 + 4 = 5). Between attempts, Steep will wait at least 1 second, as specified by the delay parameter. Since the exponentialBackoff factor is set to 2, this delay will double on each attempt. The maxDelay attribute caps the delay between two attempts at 10 seconds in this example.
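To make the arithmetic concrete, here is a minimal Python sketch of the resulting waiting times. It is not Steep's code: the function name retry_delays is made up for illustration, and it assumes that the wait before the n-th retry is delay × exponentialBackoff^(n − 1), capped at maxDelay.

```python
def retry_delays(max_attempts, delay, exponential_backoff=1.0, max_delay=None):
    """Return the waiting time (in seconds) before each retry.

    Assumption (not taken from Steep's source code): the wait before
    retry n (n = 1, 2, ...) is delay * exponential_backoff ** (n - 1),
    capped at max_delay if one is given.
    """
    delays = []
    # max_attempts includes the initial attempt, which has no delay
    for retry in range(1, max_attempts):
        wait = delay * exponential_backoff ** (retry - 1)
        if max_delay is not None:
            wait = min(wait, max_delay)
        delays.append(wait)
    return delays


# Values from the retry policy above:
# maxAttempts 5, delay 1s, exponentialBackoff 2, maxDelay 10s
print(retry_delays(5, 1, 2, 10))  # [1, 2, 4, 8]
```

Under this assumption, the 10-second cap would only take effect from the fifth retry onwards, which the example never reaches because maxAttempts is 5.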

You can spe­cify retry policies in the ser­vice metadata or in an ex­ecute ac­tion. Read more about this fea­ture in Steep’s doc­u­ment­a­tion.

Deploy multiple agents

In pre­vi­ous ver­sions, each Steep in­stance con­tained ex­actly one agent. This meant that if you wanted to run mul­tiple pro­cess chains in par­al­lel on the same ma­chine, you had to start Steep more than once.

The new ver­sion 5.6.0 al­lows you to spe­cify how many agents should be de­ployed per Steep in­stance through the new steep.agent.instances con­fig­ur­a­tion item. This can help you make use of full par­al­lel­ism without wast­ing re­sources.

Optimised scheduling algorithm

I’ve invested a lot of work into optimising Steep’s scheduling algorithm to make it much more scalable. The new version caches responses from remote agents in order to avoid having to send too many messages over the event bus. Agents that are known to be busy, as well as those that definitely do not support a given set of required capabilities, will be skipped during scheduling.

In addition, the scheduling algorithm no longer stops if it cannot assign a process chain to an agent. Instead, it continues with the other process chains and agents. This improves the scheduling throughput (i.e. it increases the number of process chains assigned to agents in one scheduling step).
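The idea of skipping unsuitable agents and continuing past process chains that cannot be assigned can be illustrated with a toy scheduling step in Python. This is only a sketch and not Steep's implementation: the Agent class, the schedule_step function, and the capability names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    # Hypothetical, simplified model of an agent
    name: str
    capabilities: frozenset
    busy: bool = False


def schedule_step(process_chains, agents):
    """Assign as many process chains as possible in one scheduling step.

    process_chains is a list of (chain_id, required_capabilities) tuples.
    Returns a dict mapping chain_id to the name of the assigned agent.
    """
    assignments = {}
    for chain_id, required in process_chains:
        # Skip agents that are busy or lack the required capabilities
        candidate = next(
            (a for a in agents if not a.busy and required <= a.capabilities),
            None,
        )
        if candidate is None:
            # No suitable agent: do not abort the step, just move on
            # to the next process chain
            continue
        candidate.busy = True
        assignments[chain_id] = candidate.name
    return assignments


agents = [
    Agent("a1", frozenset({"docker"})),
    Agent("a2", frozenset({"docker", "gpu"})),
]
chains = [
    ("c1", frozenset({"fpga"})),    # no agent supports this
    ("c2", frozenset({"gpu"})),
    ("c3", frozenset({"docker"})),
]
print(schedule_step(chains, agents))  # {'c2': 'a2', 'c3': 'a1'}
```

A scheduler that stopped at c1 would have assigned nothing in this step. Continuing lets c2 and c3 be assigned right away.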

Steep also no longer runs multiple scheduling lookups in parallel unnecessarily. This saves resources and further improves scalability. I’ve also added a MongoDB compound index to speed up fetching process chains.

Fi­nally, Steep now se­lects the best re­quired cap­ab­il­it­ies by the total count of re­main­ing pro­cess chains. This makes sure pro­cess chains are dis­trib­uted more evenly to agents sup­port­ing sim­ilar cap­ab­il­it­ies, which can re­duce the over­all runtime of work­flows.
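As a rough illustration of ranking required capability sets by the number of remaining process chains, consider the following Python sketch. The function order_capability_sets is hypothetical, and the assumption that the set with the most remaining process chains comes first is mine; Steep's actual selection logic may differ.

```python
from collections import Counter


def order_capability_sets(remaining_chains):
    """Order required capability sets by number of remaining process chains.

    remaining_chains is an iterable of frozensets, one per waiting
    process chain. Assumption: the set with the most remaining process
    chains is tried first.
    """
    counts = Counter(remaining_chains)
    return [caps for caps, _ in counts.most_common()]


remaining = (
    [frozenset({"docker"})] * 3
    + [frozenset({"gpu"})] * 5
    + [frozenset({"docker", "gpu"})] * 1
)
ordered = order_capability_sets(remaining)
print([sorted(caps) for caps in ordered])
# [['gpu'], ['docker'], ['docker', 'gpu']]
```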

The fol­low­ing fig­ures taken from my up­com­ing pa­per on the im­proved schedul­ing al­gorithm show the im­prove­ments.

Figure: Old algorithm
Figure: New algorithm

The new al­gorithm re­duces the total runtime of this ex­ample work­flow by about one and a half minutes.

The improved scheduler is very scalable. It is able to successfully execute 300,000 process chains on 1,000 agents.

Im­age source: Krämer, M. (2020). Ef­fi­cient schedul­ing of sci­entific work­flow ac­tions in the Cloud based on re­quired cap­ab­il­it­ies. (Sub­mit­ted to Data Man­age­ment Tech­no­lo­gies and Ap­plic­a­tions)

Other new features

Be­sides the fea­tures men­tioned above, the new ver­sion con­tains the fol­low­ing im­prove­ments:

  • The duration of each scheduling step is now logged
  • Multiple orphaned VMs are now deleted in parallel
  • The default value of the store flag is no longer included in serialised workflows
  • Ad­di­tional Pro­meth­eus met­rics are now ex­posed through the HTTP in­ter­face:
    • Num­ber of pro­cess chains ex­ecuted by the sched­uler
    • Num­ber of re­tries per­formed by the local agent (per ser­vice ID)

Maintenance

  • Up­date UI de­pend­en­cies
  • Re­move un­ne­ces­sary log mes­sages

Bug fixes

  • Pull alpine docker im­age be­fore run­ning unit tests

Posted by Michel Krämer
on November 14th, 2020.