My research focuses on the dependability of software that drives computers. I'm especially interested in the design of systems software to achieve high reliability and security. Both properties are growing in importance as computers become more pervasive in our daily lives.
My thesis introduces CuriOS, a new operating system that restructures OS service state in order to minimize intra-component error propagation and to allow state persistence across service restarts [3]. This is accomplished by lightweight distribution, isolation and persistence of the application-specific state information used by OS services. A microkernel OS component that provides a service is commonly referred to as a server. Application-specific state is stored in memory that is associated with the application but inaccessible to it, called Server State Regions (SSRs), and a server is granted access to an application's SSR only while it is processing a request from that application. An error in a server can therefore corrupt at most the state of the application being served at the time, rather than the state of every application. Because SSRs are not associated with the server, servers can be transparently restarted without loss of this state. This distribution of state is illustrated in Figure 1. CuriOS provides a state management framework that OS services can use to allocate and manage SSRs. SSR access control is lightweight and is implemented using virtual memory maps. Many traditional OS services can be written to use SSRs for storing application-related state. Fault-injection experiments with several OS services show that it is possible to recover from 87% to 100% of all manifested errors.
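The state separation described above can be illustrated with a small user-level model. This is only a sketch and not CuriOS code: the names StateManager, ServerStateRegion, grant and CounterServer are hypothetical, and a Python dictionary stands in for memory that the real system protects with virtual memory maps. Python cannot enforce the isolation, so the model only shows the ownership structure: state lives with the framework, the server sees one client's region per request, and a restarted server finds the state intact.

```python
from contextlib import contextmanager


class ServerStateRegion:
    """Per-client state owned by the framework, not by the server (hypothetical)."""

    def __init__(self):
        self.data = {}


class StateManager:
    """Stands in for a state management framework; all names are illustrative."""

    def __init__(self):
        self._regions = {}  # client id -> ServerStateRegion

    def allocate(self, client_id):
        # Create the client's region once; later calls leave it untouched.
        self._regions.setdefault(client_id, ServerStateRegion())

    @contextmanager
    def grant(self, client_id):
        # Models mapping exactly one client's region for the duration of
        # one request. A real system would unmap it on exit; here the
        # boundary is only a convention.
        yield self._regions[client_id].data


class CounterServer:
    """A toy service that keeps no client state of its own."""

    def __init__(self, manager):
        self.manager = manager

    def handle(self, client_id):
        with self.manager.grant(client_id) as state:
            state["count"] = state.get("count", 0) + 1
            return state["count"]


manager = StateManager()
manager.allocate("app_a")
manager.allocate("app_b")

server = CounterServer(manager)
server.handle("app_a")
server.handle("app_a")

# Simulate a server restart: the old instance is discarded, the regions are not.
server = CounterServer(manager)
after_restart = server.handle("app_a")
other_client = server.handle("app_b")
```

In this model the restarted server continues app_a's count from where the failed instance left off, and app_b's state is never visible while app_a's request is being handled.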
CuriOS incorporates several other novel research ideas that were developed during my PhD research. It includes a unified exception handling framework that allows developers to use similar constructs to handle both software and hardware exceptions [5]. Self-healing properties are achieved through the isolation and micro-rebooting of system components [2]. Hardware watchdog timers can be used to detect lockup errors within the OS, and we have demonstrated that it is possible to recover from the processor resets generated by such hardware in both CuriOS and Linux [4]. This recovery exploits the fact that, on many systems, main memory contents are preserved across a processor reset.
I am also actively involved in several projects exploring OS security. Cloaker is a sophisticated rootkit that illustrates the weaknesses of existing malware detection techniques [1]. It survives by compromising the integrity of hardware state and thus evades detection by tools that only ensure the integrity of software state. We advocate hardware-specific checks, rather than generic ones, to detect and neutralize such threats.
We are currently investigating low-level memory analysis techniques to automatically identify program structures. This has applications in post-intrusion forensic analysis and data structure recovery. We are also studying a potential attack against computer systems that exploits the fact that RAM contents are not cleared after a processor reset. While we have used this property to recover the system and improve reliability [4], the same property may allow an attacker to access sensitive information in memory. We expect to publish our research on these topics soon.
Reliability: We have seen a rapid growth of parallelism in processors as a result of the move towards multi-core architectures. With a large number of threads expected to be available on processors in the future, I see great potential in exploiting this parallelism to provide significant improvements in reliability. I believe that it would be possible to re-create expensive and highly reliable systems like the Tandem computers on inexpensive modern desktops with multiple cores. The state separation approach advocated in my thesis can also exploit parallel threads to potentially speed up OS services. If a request cannot be processed by an OS server on one core, it may be dispatched to a duplicate server on another core. This is possible for stateless servers since any required state is dispatched along with the request.
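The dispatch idea in the paragraph above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation from the thesis: make_server, dispatch and ServerUnavailable are hypothetical names, plain functions stand in for servers pinned to different cores, and a dictionary stands in for the state that travels with each request.

```python
class ServerUnavailable(Exception):
    """Raised by a replica that cannot process a request (hypothetical)."""


def make_server(name, healthy=True):
    """Build a stateless server: everything it needs arrives with the request."""

    def serve(request, state):
        if not healthy:
            raise ServerUnavailable(name)
        # Work on a copy so a failure part-way through cannot corrupt
        # the caller's state.
        new_state = dict(state)
        new_state["requests"] = state.get("requests", 0) + 1
        new_state["handled_by"] = name
        return new_state

    return serve


def dispatch(request, state, replicas):
    """Try each replica in order, handing over the same state every time."""
    for serve in replicas:
        try:
            return serve(request, state)
        except ServerUnavailable:
            continue  # fall through to the duplicate on the next core
    raise ServerUnavailable("no replica could process the request")


replicas = [make_server("core0", healthy=False), make_server("core1")]
result = dispatch("read", {"requests": 4}, replicas)
```

Because no replica holds state of its own, the second server can pick up the request without any handover from the first; the only thing that moves is the state passed in with the call.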
I would also like to work further on techniques for evaluating the reliability of system software. There is a need for a freely available tool that can evaluate system software through automatic injection of complex faults such as software bugs. We have built a simple tool that uses the standard gdb remote protocol to inject arbitrary faults into virtual machine environments. However, a significant number of open research questions must still be addressed before meaningful fault injection experiments can be performed.
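To make the mechanism concrete, the sketch below builds the gdb remote protocol packets needed for one of the simplest faults, a single bit flip in memory. It is not the tool mentioned above and makes no claim about how that tool works; the helper names are hypothetical. It relies only on the documented packet format: a payload is framed as `$payload#checksum`, where the checksum is the sum of the payload bytes modulo 256 written as two hex digits, `m addr,length` reads memory, and `M addr,length:data` writes it. Sending the packets to a target and parsing replies is left out.

```python
def frame(payload: str) -> bytes:
    """Wrap a payload in gdb remote protocol framing: $payload#checksum."""
    checksum = sum(payload.encode("ascii")) % 256
    return f"${payload}#{checksum:02x}".encode("ascii")


def read_byte_packet(addr: int) -> bytes:
    """Packet asking the stub for one byte at addr ('m addr,length')."""
    return frame(f"m{addr:x},1")


def write_byte_packet(addr: int, value: int) -> bytes:
    """Packet writing one byte at addr ('M addr,length:data')."""
    return frame(f"M{addr:x},1:{value:02x}")


def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit (0-7) inverted."""
    return value ^ (1 << bit)


# Illustrative injection: the address and the original byte are made up.
# A real run would obtain `original` from the reply to read_byte_packet.
target_addr = 0x1000
original = 0x0F
faulty = flip_bit(original, 7)
packets = [read_byte_packet(target_addr), write_byte_packet(target_addr, faulty)]
```

Injecting realistic software bugs, as opposed to bit flips, requires deciding which instructions or data to alter and how, which is where the open questions lie.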
Security: I would like to continue exploring the boundary between system software and architecture. In particular, I want to investigate security failures that occur when the techniques that protect system software are designed without careful consideration of architectural features. The Cloaker rootkit work illustrates one such vulnerability, and RAM that is left uncleared after a reset is another. I am working in close collaboration with several researchers to identify and fix such vulnerabilities in current systems.