Our Challenges
Software tools play an essential role in optimizing scientific application performance and maximizing resource efficiency. In addition to enabling supercomputer performance (a decisive determinant of scientific discovery), these tools provide feedback to users, operations staff, and software developers that amplifies the long-term impact of investments in scientific computing. A unified tools project not only supports individual software teams but also fosters a healthy ecosystem for collaboration. STEP will require a disciplined approach to multiple challenges that would otherwise limit scientific computing.
The rapid pace of hardware innovation poses a looming challenge to the long-term health of software tools. Hardware heterogeneity is increasing in both processors and memory architectures. The maturity of performance tools varies widely as applications and architectures increasingly rely on GPUs, accelerators, and other specialized technologies to continue performance scaling, and as designs move from discrete specialized accelerators to specialized accelerators dispersed throughout the architecture.
The software complexity of leadership-class computers is also increasing steadily. Growing software divergence can be seen in OS versions, programming languages, environments, schedulers, and libraries.
Application teams face numerous challenges that may prevent them from adopting and using tools effectively, including the need to prioritize certain inputs, a lack of awareness of pertinent tools, and the absence of sustainable communication channels with the tools community.
To summarize, we face the following urgent, domain-specific barriers, which complicate sustainability:
- Exploding hardware complexity: The rapid pace of increasing hardware complexity and heterogeneity greatly expands tools’ targets and forces HPC tool developers to respond in a reactive manner.
- Exploding use cases: New and emerging application paradigms, including AI/ML, edge, and embedded instrumentation, are shifting the uses that tools need to support. Additionally, there are new opportunities for tools in traditional HPC areas, such as feedback-driven dynamic resource management.
- The coordination challenge: Tools themselves are uniquely and closely tied to design decisions across different layers of the execution stack, including hardware, system software, middleware, and applications.
- The management challenge: Building a sustainable tools ecosystem will require plans for organization, operating budgets, community standards, technology tracking, and the promotion of technical leadership.

A Community-Based Approach
The challenges STEP faces are multifaceted, so our approach to addressing them is multifaceted too.
An effective communication nexus for key segments of the HPC landscape: Community partnerships are a vital component of STEP’s approach for realizing a healthy tools ecosystem. STEP actively engages vendors, facilities, applications, hardware designers, and tool developers to help establish desired requirements, guidelines, testing methods, and early system access. These collaborative efforts benefit all parties involved, fostering a culture of proactive collaboration and continuous improvement. We also collaborate with other communities that share similar concerns, including cloud and data center providers, as well as AI vendors, to combine our efforts and maximize our impact. Our community partnerships create closer and more frequent communication between HPC tool developers and scientific applications, which reduces redundant effort in both domains. Application developers are able to focus on their application’s requirements, rather than trying to develop ad-hoc tools to analyze resource usage or performance. At the same time, tool developers receive better and more direct feedback by deploying their tools in real-world applications, rather than trying to approximate these conditions in simulation or through proxy applications.
An efficient source for next-generation performance tools: Establishing clear correlations between root causes of problems, workflow patterns, and performance analysis within the broader context of network, storage, power, and other system-level metrics is more challenging than ever. New hardware and software heterogeneity raises load-balancing issues with more sophisticated tradeoffs. It is crucial to engage early with vendor and application teams to effectively sort out root-cause questions. Today we face numerous challenges related to interoperability, including incompatible file formats and redundant implementations of similar capabilities across different tools. Relying solely on formal standards is not a cure-all and can, in fact, hinder productivity. STEP encourages the production and adoption of tool portability layers that enable tool developers to share common functionality, drawing inspiration from successful examples in other communities, such as LLVM. STEP also encourages the development of metrics and evaluation methods that ensure accountability while maximizing community sustainability impact, and supports the curation of mini-apps representing a broad cross section of use cases to serve as a proving ground for sustainable tool technology. Further, STEP encourages improvements in security practices across the tools community as a whole, and we distill secure development guidance into more concise, actionable best practices whenever possible.
A productive spark for the broad HPC community: The benefit derived from STEP community engagement also extends beyond improvements to the tools themselves. A healthy tools ecosystem significantly broadens the use of tools in scientific applications. For example, standard and interoperable interfaces among tools will make it easier for developers to adopt and combine tools with different purposes. Moreover, developers have better assurances that the capabilities their applications rely upon will not disappear if one of the tools in the ecosystem is discontinued. Improving the efficiency of the software on HPC platforms not only provides faster time-to-solution for the scientist, but also effectively decreases the cost per computation. Additionally, STEP focuses on improving on-ramps for new tool usage by streamlining the onboarding process and providing immediately actionable feedback. Finally, STEP actively supports opportunities to boost US workforce expertise through multiple strategies including training, internships and technology introductions.