(CLOSED) Department of Energy DE-FOA-0002902: Distributed Resilient Systems

Slots: Closed.

Deadlines

Internal Deadline: Contact RII.

LOI: February 9, 2023

External Deadline: March 30, 2023

Award Information

Award Type: Grants or Cooperative Agreements

Estimated Number of Awards: The exact number of awards will depend on the number of meritorious applications and the availability of appropriated funds.

Anticipated Award Amount: DOE anticipates that, subject to the availability of future year appropriations, up to $45 million in current and future fiscal year funds will be used to support awards under this FOA.

Who May Serve as PI: Individuals with the skills, knowledge, and resources necessary to carry out the proposed research as a Principal Investigator (PI) are invited to work with their organizations to develop an application. Individuals from underrepresented groups as well as individuals with disabilities are always encouraged to apply.

Link to Award: https://science.osti.gov/ascr/-/media/grants/pdf/foas/2023/SC_FOA_0002902.pdf

Process for Limited Submissions

PIs must submit their application as a Limited Submission through the Research Initiatives and Infrastructure Application Portal: https://rii.usc.edu/oor-portal/.

Materials to submit include:

(1) Single Page Proposal Summary (0.5” margins; single-spaced; font type: Arial, Helvetica, or Georgia typeface; font size: 11 pt). Page limit includes references and illustrations. Pages that exceed the 1-page limit will be excluded from review.
(2) CV (5 pages maximum)

Note: The portal requires information about the PIs and Co-PIs in addition to department and contact information, including the 10-digit USC ID#, Gender, and Ethnicity. Please have this material prepared before beginning this application.

Purpose

The DOE SC program in Advanced Scientific Computing Research (ASCR) hereby announces its interest in receiving applications focusing on basic research in computer science that explores innovative approaches to creating distributed resilient systems for science. Such systems might be national or global in scale, linking geographically distributed computing systems and scientific instruments, and might involve a large number of edge devices or sensors, but regardless, must manage computation and data in a scalable and fault-tolerant manner. Important research challenges involve techniques for advanced middleware and operating and runtime systems, with this FOA targeting two research areas: 1) scalable system modeling, and 2) adaptive management and partitioning of resources. Advances in these areas will contribute to scaling up our increasingly complex and interconnected scientific enterprise.

Research Areas

Each pre-application and application must propose research that addresses one or both of the following research areas:

1. Scalable system modeling: The computational modeling of large, distributed systems is an important tool in understanding their behavior, both in normal operations and under anomalous conditions. Scaling these models on state-of-the-art computing systems to capture the scale of envisioned distributed scientific workflows and infrastructure is challenging [8]. Moreover, the kinds of system properties, including performance and resilience properties, and thus the associated requirements on system models, may be different for different scientific workflows. Pre-applications and applications addressing research on system modeling must describe the unknown system properties on which the model will provide insight, outline specific metrics and targets for those metrics reasonably necessary to provide the desired insight, and explain why it is reasonable to believe that the proposed approach, if successful, will reach or exceed those targets.

2. Adaptive management and partitioning of resources: Scheduling complex workflows on a single system, especially when the resources required for the workflow vary with time and/or the system provides heterogeneous resources, is a challenging endeavor. The scheduling and managing of resources across many systems for complex workflows provides even greater challenges. Moreover, the middleware and runtime and operating systems managing the execution, data movement, and other aspects of both the running and waiting workflows must simultaneously scale to support multiple large systems and be resilient to failures. The protocols and algorithms may include a combination of self-monitoring and self-healing capabilities, but regardless, must have well-characterized stability properties in the face of uncertain operating conditions. Applications proposing research in this area must describe specific unknowns in the construction of middleware and/or runtime or operating systems achieving these goals and explain why it is reasonable to believe that the proposed approach, if successful, will shed light on those unknowns.

Visit our Institutionally Limited Submission webpage for more updates and other announcements.