Abstract | Workflow allows e-Scientists to express their experimental processes in a structured way and provides a glue to integrate remote applications. Since Grid provides an enormously large amount of data and computational resources, executing workflows on the Grid results in significant performance improvement. Several workflow management systems, which are widely used by different scientific communities, were developed for various purposes. Therefore, they differ in several aspects. This thesis outlines two major problems of existing workflow systems: workflow interoperability and data access. On the one hand, existing workflow systems are based on different technologies. Therefore, to achieve interoperability between their workflows at any level is a challenging task. In spite of the fact that there is a clear demand for interoperable workflows, for example, to enable scientists to share workflows, to leverage existing work of others, and to create multi-disciplinary workflows; currently, there are only limited, ad-hoc workflow interoperability solutions available for scientists. Existing solutions only realise workflow interoperability between a small set of workflow systems and do not consider performance issues that arise in the case of large-scale (computational and/or data intensive) scientific workflows. Scientific workflows are typically computation and/or data intensive and are executed in a distributed environment to speed up their execution time. Therefore, their performance is a key issue. Existing interoperability solutions bottleneck the communication between workflows in most scenarios dramatically increasing execution time. On the other hand, many scientific computational experiments are based on data that reside in data resources which can be of different types and vendors. Many workflow systems support access to limited subsets of such data resources preventing data level workflow interoperation between different systems. Therefore, there is a demand for a general solution that provides access to a wide range of data resources of different types and vendors. If such a solution is general, in the sense that it can be adopted by several workflow systems, then it also enables workflows of different systems to access the same data resources and therefore interoperate at data level. Note that data semantics are out of the scope of this work. For the same reasons as described above, the performance characteristics of such a solution are inevitably important. Although in terms of functionality, there are solutions which could be adopted by workflow systems for this purpose, they provide poor performance. For that reason, they did not gain wide acceptance by the scientific workflow community. Addressing these issues, a set of architectures is proposed to realise heterogeneous data access and heterogeneous workflow execution solutions. The primary goal was to investigate how such solutions can be implemented and integrated with workflow systems. The secondary aim was to analyse how such solutions can be implemented and utilised by single applications. |
---|