Abstract

In recent years, distributed computing has become ubiquitous in many areas, especially science and industry, where processing power, even that of supercomputers, never seems to be enough. Grid systems were born out of necessity and had to grow quickly to meet evolving requirements, becoming the complex systems of today. Even the simplest distributed system is now expected to provide basic functionality such as resource and execution management, security and optimization features, and data control. The complexity of Grid applications is further accentuated by their distributed nature, making them among the most elaborate systems to date. These intricate systems are all too easily affected by failures, whether caused by a software bug or by simple human error. When such a failure occurs, the system cannot always recover from it, which can mean hours of wasted computational power.

This thesis addresses some of the problems at the core of developing and maintaining Grid software applications by introducing novel, well-founded approaches to their solution. In particular, it identifies the difficulty Grid systems have in dealing with unforeseen circumstances arising from dynamic reconfiguration. Such problems often stem from the fact that Grid applications are large, distributed and prone to resource failures. This research has produced a methodology for solving this problem by analysing the structure of distributed systems and their reliance on the underlying environment, a dependency that is often overlooked in such scenarios. The thesis concludes that the way Grid applications interact with the infrastructure is not sufficiently addressed, and develops a novel approach in which formal verification methods are integrated with the development and deployment of distributed applications in a way that takes the environment into account.

This approach allows reconfiguration in distributed applications to proceed in a safe and controlled way, as demonstrated through the development of a prototype application.