AI-Integrated Workflows

National Data Platform (NDP)

The National Data Platform (NDP) is a federated and extensible data ecosystem that promotes and enables collaboration, innovation, and the equitable use of data atop existing cyberinfrastructure capabilities. Since the inception of the NDP project, our team at the San Diego Supercomputer Center (SDSC) has worked with our partners at the University of Utah, the University of Colorado Boulder (CU Boulder), and the EarthScope Consortium to build out a platform that provides its users with capabilities including:

Land Cover Classification at the Wildland Urban Interface Using High-Resolution Satellite Imagery and Deep Learning

Land cover classification from satellite imagery is important for monitoring ecosystem change and urban growth over time. However, the land cover classifications that are widely available in the United States are generated at low spatial and temporal resolution, making the spatial distribution of vegetation and urban areas in the wildland-urban interface difficult to measure. High spatial and temporal resolution analysis is essential for understanding and managing changing environments in these regions.
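
The abstract does not include implementation details; as a rough illustration of the kind of patch-based deep learning classifier such a study typically uses, the sketch below trains a small convolutional network on labeled satellite image patches. All names (LandCoverCNN, the number of classes, band count, patch size) are hypothetical placeholders, not the model from the paper.

```python
# Minimal, hypothetical sketch of a patch-based land cover classifier.
# Assumes labeled satellite image patches (e.g., 4-band, 64x64 pixels) and
# integer class labels; not the model or data pipeline from the paper.
import torch
import torch.nn as nn

NUM_CLASSES = 5      # e.g., vegetation, impervious, soil, water, shadow (assumed)
IN_BANDS = 4         # e.g., R, G, B, NIR (assumed)

class LandCoverCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(IN_BANDS, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, NUM_CLASSES)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def train_step(model, optimizer, patches, labels):
    """One gradient step on a batch of patches shaped (N, IN_BANDS, H, W)."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(patches), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data:
model = LandCoverCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, IN_BANDS, 64, 64)        # fake batch of patches
y = torch.randint(0, NUM_CLASSES, (8,))     # fake labels
print(train_step(model, opt, x, y))
```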

Workflow-Driven Distributed Machine Learning in CHASE-CI: A Cognitive Hardware and Software Ecosystem Community Infrastructure

Advances in data, computing, and networking over the last two decades have led many application domains to incorporate machine learning on big data into the scientific process, requiring new capabilities for integrated and distributed hardware and software infrastructure. This paper contributes a workflow-driven approach for dynamic data-driven application development on top of a new kind of networked cyberinfrastructure called CHASE-CI.

Toward a Methodology and Framework for Workflow-Driven Team Science

Scientific workflows are powerful tools for the management of scalable experiments, often composed of complex tasks running on distributed resources. Existing cyberinfrastructure provides components that can be utilized within repeatable workflows. However, data and computing advances continuously change the way scientific workflows are developed and executed, pushing scientific activity to be more data-driven, heterogeneous, and collaborative.

Ten Simple Rules for Writing and Sharing Computational Analyses in Jupyter Notebooks

As studies grow in scale and complexity, it has become increasingly difficult to provide clear descriptions and open access to the methods and data needed to understand and reproduce computational research. Numerous papers, including several in the Ten Simple Rules collection, have highlighted the need for robust and reproducible analyses in computational research, described the difficulty of achieving these standards, and enumerated best practices.

Sharing and Archiving Data Science Course Projects to Support Pedagogy for Future Cohorts

Founded in 2018, the Halıcıoğlu Data Science Institute (HDSI) is a significant new organization on the UC San Diego campus. As part of their pedagogical processes, HDSI faculty desired a way to store and share student capstone projects with future cohorts, so students could easily access reusable, raw datasets and analytical workflows, with the potential to expand on work done by previous cohorts. The UC San Diego Library has been managing an institutional data repository for over a decade, with established ingest workflows and tools.

PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data

Discovering Bayesian network (BN) structure from big datasets containing rich causal relationships is becoming increasingly valuable for modeling and reasoning under uncertainty in many areas where big data are gathered from sensors at high volume and velocity. Most current BN structure learning algorithms have shortcomings when facing big data. First, learning a BN structure from an entire big dataset is an expensive task that often ends in failure due to memory constraints.
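
As a loose illustration of the layered-ensemble idea (not the PEnBayes algorithm itself), the sketch below partitions a large dataset, learns a candidate edge set on each partition with a placeholder local learner, and keeps only the edges that a majority of partitions agree on. The learn_local_structure function is a hypothetical stand-in for a real BN structure learner.

```python
# Hypothetical sketch of an ensemble approach to BN structure learning:
# learn on data partitions, then majority-vote the edges. This illustrates
# the general idea only; it is not the PEnBayes algorithm.
from collections import Counter
import numpy as np

def learn_local_structure(block, columns, corr_threshold=0.3):
    """Placeholder local learner: connects column pairs whose absolute
    correlation in this partition exceeds a threshold. A real implementation
    would use a proper BN structure learner (e.g., score-based search)."""
    corr = np.corrcoef(block, rowvar=False)
    edges = set()
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(corr[i, j]) >= corr_threshold:
                edges.add((columns[i], columns[j]))
    return edges

def ensemble_structure(data, columns, n_partitions=10, vote_threshold=0.5):
    """Split the data row-wise, learn a structure per partition, and keep
    edges selected by at least `vote_threshold` of the partitions."""
    votes = Counter()
    for block in np.array_split(data, n_partitions):
        votes.update(learn_local_structure(block, columns))
    needed = vote_threshold * n_partitions
    return {edge for edge, count in votes.items() if count >= needed}

# Example with synthetic data in which B mostly copies A:
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=100_000)
b = np.where(rng.random(a.size) < 0.9, a, 1 - a)   # A and B are dependent
c = rng.integers(0, 2, size=a.size)
d = rng.integers(0, 2, size=a.size)
data = np.column_stack([a, b, c, d])
print(ensemble_structure(data, ["A", "B", "C", "D"]))   # expect {("A", "B")}
```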

Modeling Wildfire Behavior at the Continuum of Computing

This talk will review some of our recent work on building dynamic data-driven cyberinfrastructure and impactful application solution architectures that showcase the integration of a variety of existing technologies and collaborative expertise. The lessons learned from the development of the NSF WIFIRE cyberinfrastructure will be summarized. Open data issues, the use of edge and cloud computing on top of high-speed networks, and reproducibility through containerization and automated workflow provenance will also be discussed in the context of WIFIRE.

HydroFrame: A Software Framework to Enable Continental Scale Hydrologic Simulation

The goal of the HydroFrame project is to provide a community framework for sophisticated, high-resolution hydrologic simulation across the entire continental US. To accomplish this, we are building an integrated software framework for continental-scale hydrologic simulation and data analysis, composed of multi-scale configurable components. The multi-scale requirements of this domain drive the design of the proposed framework.

The Evolution of Bits and Bottlenecks in a Scientific Workflow Trying to Keep Up with Technology: Accelerating 4D Image Segmentation Applied to NASA Data

In 2016, a team of earth scientists directly engaged a team of computer scientists to identify cyberinfrastructure (CI) approaches that would speed up an earth science workflow. This paper describes the evolution of that workflow as the two teams bridged CI and an image segmentation algorithm to do large-scale earth science research. The Pacific Research Platform (PRP) and the Cognitive Hardware and Software Ecosystem Community Infrastructure (CHASE-CI) resources were used to decrease the earth science workflow's wall-clock time significantly, from 19.5 days to 53 minutes.
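
For context, the reported improvement corresponds to a speedup of roughly 530x, as the quick calculation below shows.

```python
# Speedup implied by the reported wall-clock times (19.5 days -> 53 minutes).
before_minutes = 19.5 * 24 * 60   # 28,080 minutes
after_minutes = 53
print(f"speedup ~ {before_minutes / after_minutes:.0f}x")   # ~ 530x
```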

End-to-End Workflow-Driven Hydrologic Analysis for Different User Groups in HydroFrame

We present the initial progress on the HydroFrame community platform using an automated Kepler workflow that performs end-to-end hydrology simulations involving data ingestion, preprocessing, analysis, modeling, and visualization. We will demonstrate how different modules of the workflow can be reused and repurposed for the three target user groups. Moreover, the Kepler workflow ensures complete reproducibility through a built-in provenance framework that collects workflow-specific parameters, software versions, and hardware system configurations.
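
The Kepler workflow itself is not reproduced here; the sketch below only illustrates the end-to-end structure described above (ingestion, preprocessing, modeling, visualization) as composable Python stages that can be reused or swapped per user group. Every function, class, and parameter name is a hypothetical placeholder.

```python
# Hypothetical sketch of an end-to-end hydrology pipeline organized as
# reusable stages, mirroring the structure described above. It is not the
# actual Kepler workflow; every name here is a placeholder.
from dataclasses import dataclass

@dataclass
class Config:
    watershed_id: str
    start_date: str
    end_date: str

def ingest(cfg: Config) -> dict:
    """Fetch forcing and domain data for the watershed (stubbed)."""
    return {"forcing": [...], "domain": {"id": cfg.watershed_id}}

def preprocess(raw: dict) -> dict:
    """Regrid and clean inputs for the model (stubbed)."""
    return {"model_inputs": raw}

def run_model(inputs: dict) -> dict:
    """Run the hydrologic simulation (stubbed)."""
    return {"streamflow": [0.0, 0.1, 0.2]}

def visualize(results: dict) -> None:
    """Produce plots or summaries for the target user group (stubbed)."""
    print("simulated streamflow:", results["streamflow"])

def run_pipeline(cfg: Config, stages=(ingest, preprocess, run_model)):
    """Chain the stages; different user groups can pass different stages."""
    data = cfg
    for stage in stages:
        data = stage(data)
    return data

if __name__ == "__main__":
    visualize(run_pipeline(Config("HUC-01234567", "2020-01-01", "2020-12-31")))
```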

Enabling FAIR Research in Earth Science Through Research Objects

Data-intensive science communities are progressively adopting FAIR practices that enhance the visibility of scientific breakthroughs and enable reuse. At the core of this movement, research objects contain and describe scientific information and resources in a way compliant with the FAIR principles and sustain the development of key infrastructure and tools. This paper provides an account of the challenges, experiences and solutions involved in the adoption of FAIR around research objects over several Earth Science disciplines.

A Demonstration of Modularity, Reuse, Reproducibility, Portability and Scalability for Modeling and Simulation of Cardiac Electrophysiology Using Kepler Workflows

Multi-scale computational modeling is a major branch of computational biology as evidenced by the US federal interagency Multi-Scale Modeling Consortium and major international projects. It invariably involves specific and detailed sequences of data analysis and simulation, often with multiple tools and datasets, and the community recognizes improved modularity, reuse, reproducibility, portability and scalability as critical unmet needs in this area. Scientific workflows are a well-recognized strategy for addressing these needs in scientific computing.

Using Dynamic Data Driven Cyberinfrastructure for Next Generation Disaster Intelligence

Wildland fires and related hazards are increasing globally. A common observation across these large events is that fire behavior is changing and becoming more destructive, making applied fire research more important and time critical. Significant improvements in modeling the extent and dynamics of the evolving plethora of fire-related environmental hazards, and their socio-economic and human impacts, can be made through intelligent integration of modern data and computing technologies with techniques for data management, machine learning, and fire modeling.

Scalable Workflow-Driven Hydrologic Analysis in HydroFrame

The HydroFrame project is a community platform designed to facilitate integrated hydrologic modeling across the US. As a part of HydroFrame, we seek to design innovative workflow solutions that create pathways to enable hydrologic analysis for three target user groups: the modeler, the analyzer, and the domain science educator. We present the initial progress on the HydroFrame community platform using an automated Kepler workflow. This workflow performs end-to-end hydrology simulations involving data ingestion, preprocessing, analysis, modeling, and visualization.

NeuroKube: An Automated and Autoscaling Neuroimaging Reconstruction Framework using Cloud Native Computing and A.I.

The neuroscience domain stands out among the sciences for its dependence on the study and characterization of complex, intertwining structures. Understanding the complexity of the brain has led to widespread advances in the structure of large-scale computing resources and the design of artificially intelligent analysis systems. However, the scale of the problems and the data generated continues to grow and outpace the standards and practices of neuroscience.

Cloud Software for Enabling Community-Oriented Integrated Hydrologic Modeling

In previous work, we provided static domain and parameter datasets for the National Water Model (NWM) and ParFlow (PF-CONUS) on demand, at regional watershed scales. We extend this functionality by connecting existing cloud applications and tools into a virtual ecosystem that supports extraction of domain and parameter datasets, execution of NWM and PF-CONUS models, and collaboration.

Workflows Community Summit: Bringing the Scientific Workflows Community Together

Scientific workflows have been used almost universally across scientific domains, and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure.

Workflows Community Summit: Advancing the State-of-the-Art of Scientific Workflows Management Systems Research and Development

Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms.

TemPredict: A Big Data Analytical Platform for Scalable Exploration and Monitoring of Personalized Multimodal Data for COVID-19

A key takeaway from the COVID-19 crisis is the need for scalable methods and systems for ingestion of big data related to the disease, such as models of the virus, health surveys, and social data, and the ability to integrate and analyze the ingested data rapidly. One specific example is the use of the Internet of Things and wearables (i.e., the Oura ring) to collect large-scale individualized data (e.g., temperature and heart rate) continuously and to create personalized baselines for detection of disease symptoms.
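
As an illustration of the personalized-baseline idea mentioned above (not the TemPredict analysis itself), the sketch below builds a per-user baseline from wearable temperature readings and flags later readings that deviate beyond a threshold. The column names, baseline window, and threshold are all assumptions.

```python
# Hypothetical sketch: per-user temperature baselines and simple deviation
# flags for wearable data. Column names, window, and threshold are assumed;
# this is not the TemPredict pipeline.
import pandas as pd

def flag_deviations(df: pd.DataFrame, baseline_days: int = 14,
                    z_threshold: float = 2.0) -> pd.DataFrame:
    """df columns: user_id, date, temperature (one nightly reading per row).
    The first `baseline_days` readings per user define that user's baseline;
    later readings more than z_threshold standard deviations above the
    baseline mean are flagged."""
    flagged = []
    for _, g in df.sort_values("date").groupby("user_id"):
        base = g.head(baseline_days)["temperature"]
        mu, sigma = base.mean(), base.std()
        rest = g.iloc[baseline_days:].copy()
        rest["z"] = (rest["temperature"] - mu) / sigma
        rest["flagged"] = rest["z"] > z_threshold
        flagged.append(rest)
    return pd.concat(flagged)

# Example with toy data: seven normal nights, then an elevated reading.
toy = pd.DataFrame({
    "user_id": ["u1"] * 8,
    "date": pd.date_range("2020-04-01", periods=8),
    "temperature": [36.5, 36.6, 36.4, 36.5, 36.4, 36.6, 36.5, 37.8],
})
print(flag_deviations(toy, baseline_days=7)[["date", "temperature", "z", "flagged"]])
```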

Quantum Data Hub: A Collaborative Data and Analysis Platform for Quantum Material Science

Quantum materials research is a rapidly growing domain of materials research, seeking novel compounds whose electronic properties are born from the uniquely quantum aspects of their constituent electrons. The data from this rapidly evolving area requires a new community-driven approach to collaboration and to sharing data across the end-to-end quantum materials process.

Perspectives on Automated Composition of Workflows in the Life Sciences

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences.

Modular Performance Prediction for Scientific Workflows Using Machine Learning

Scientific workflows provide an opportunity for declarative computational experiment design in an intuitive and efficient way. A distributed workflow is typically executed on a variety of resources, and it uses a variety of computational algorithms or tools to achieve the desired outcomes. Such variety imposes additional complexity in scheduling these workflows on large-scale computers. As computation becomes more distributed, insights into the expected workload that a workflow presents become critical for effective resource allocation.
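
As a rough sketch of modular performance prediction (not the paper's actual models, modules, or features), the example below trains one regression model per workflow module on synthetic execution history and then sums the per-module predictions to estimate a candidate workflow's runtime.

```python
# Hypothetical sketch of modular runtime prediction: one regressor per
# workflow module, trained on historical executions. The features, module
# names, and data are synthetic placeholders, not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

def synthetic_history(n=500):
    """Fake execution history: [input_size_gb, n_cores] -> runtime_seconds."""
    X = np.column_stack([rng.uniform(1, 100, n), rng.integers(1, 65, n)])
    y = 30 + 5 * X[:, 0] / X[:, 1] + rng.normal(0, 2, n)
    return X, y

# Train a separate model for each module of the workflow.
modules = ["ingest", "preprocess", "train", "visualize"]
models = {}
for name in modules:
    X, y = synthetic_history()
    models[name] = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def predict_workflow_runtime(input_size_gb, n_cores):
    """Sum per-module predictions for a candidate execution plan."""
    features = np.array([[input_size_gb, n_cores]])
    return sum(float(m.predict(features)[0]) for m in models.values())

print(f"predicted runtime: {predict_workflow_runtime(50, 16):.1f} s")
```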

Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD)

This report details the accomplishments of the ASCR-funded project “Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows” under award numbers FWP-66406 and DE-SC0012630, with a focus on the UC San Diego portion of the work (Award No. DE-SC0012630). We refer to the project as IPPD.

Expanse: Computing without Boundaries - Architecture, Deployment, and Early Operations Experiences of a Supercomputer Designed for the Rapid Evolution in Science and Engineering

We describe the design motivation, architecture, deployment, and early operations of Expanse, a 5 Petaflop, heterogeneous HPC system that entered production as an NSF-funded resource in December 2020 and will be operated on behalf of the national community for five years. Expanse will serve a broad range of computational science and engineering through a combination of standard batch-oriented services and by extending the system to the broader CI ecosystem through science gateways, public cloud integration, support for high-throughput computing, and composable systems.

A Community Roadmap for Scientific Workflows Research and Development

The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows community together. This paper reports on discussions and findings from two virtual “Workflows Community Summits” (January and April, 2021).

Building Cyberinfrastructure for Translational Impact: The WIFIRE Example

This paper overviews the enablers and phases for translational cyberinfrastructure for data-driven applications. In particular, it summarizes the translational process of and the lessons learned from the development of the NSF WIFIRE cyberinfrastructure. WIFIRE is an end-to-end cyberinfrastructure for real-time data fusion and data-driven simulation, prediction, and visualization of wildfire behavior. WIFIRE’s real-time data products and modeling services are routinely accessed by fire research and emergency response communities for modeling as well as the public for situational awareness.

Autonomous Provenance to Drive Reproducibility in Computational Hydrology

The Kepler-driven provenance framework provides an autonomous provenance collection capability for hydrologic research. The framework scales to capture model parameters, user actions, and hardware specifications, and it facilitates quick retrieval for actionable insights, whether the scientist is handling a small watershed simulation or a large continental-scale problem.
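
The sketch below is only a simplified illustration of the kind of information such a provenance record might capture (model parameters, software versions, hardware details); it is not the Kepler provenance framework, and the record format and field names are assumptions.

```python
# Hypothetical sketch of an automated provenance record: capture model
# parameters, software versions, and basic hardware details alongside a run.
# This illustrates the idea only; it is not the Kepler provenance framework.
import json
import platform
import sys
import time

def capture_provenance(model_params: dict, outfile: str = "provenance.json"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "model_parameters": model_params,
        "software": {
            "python_version": sys.version,
            "platform": platform.platform(),
        },
        "hardware": {
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
    }
    with open(outfile, "w") as fh:
        json.dump(record, fh, indent=2)
    return record

# Example: record the parameters of a small watershed simulation run.
capture_provenance({"watershed": "toy-basin", "dx_m": 1000, "timestep_h": 1.0})
```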

Towards a Dynamic Composability Approach for Using Heterogeneous Systems in Remote Sensing

Influenced by advances in data and computing, scientific practice increasingly involves machine learning and artificial intelligence driven methods, which require specialized capabilities at the system, science, and service level in addition to conventional large-capacity supercomputing approaches. The latest distributed architectures built around the composability of data-centric applications have led to the emergence of a new ecosystem for container coordination and integration.

Smart Connected Worker Edge Platform for Smart Manufacturing: Part 2—Implementation and On-Site Deployment Case Study

In this paper, we describe specific deployments of the Smart Connected Worker (SCW) Edge Platform for Smart Manufacturing through the implementation of four instructive real-world use cases. These cases illustrate the role of people in a Smart Manufacturing paradigm in which affordable, scalable, accessible, and portable (ASAP) information technology (IT) acquires and contextualizes data into information for transmission to operational technologies (OT).

Smart Connected Worker Edge Platform for Smart Manufacturing: Part 1—Architecture and Platform Design

The challenge of sustainably producing goods and services for healthy living on a healthy planet requires simultaneous consideration of the economic, societal, and environmental dimensions of manufacturing. Enabling technology for data-driven manufacturing paradigms like Smart Manufacturing (a.k.a. Industry 4.0) serves as the technological backbone from which sustainable approaches to manufacturing can be implemented.

Responding to Emerging Wildfires through Integration of NOAA Satellites with Real-Time Ground Intelligence

This presentation discusses the process of delivering fire behavior forecasts on initial attack using the earliest detections of fire from geostationary satellite data. The current GOES-16 and GOES-17 satellites deliver rapid detections, and the future GeoXO will increase the speed and accuracy of the earliest alerts. GeoXO will also deliver important information such as the radiative power of a detected fire, providing insight into fire intensity.

Machine Learning for Improved Post-Fire Debris Flow Likelihood Prediction

Timely prediction of debris flow probabilities in areas impacted by wildfires is crucial to mitigate public exposure to this hazard during post-fire rainstorms. This paper presents a machine learning approach to amend an existing dataset of post-fire debris flow events with additional features reflecting existing vegetation type and geology, and train traditional and deep learning methods on a randomly selected subset of the data.
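
As a generic illustration of the approach described (training a classifier on a feature-amended dataset of post-fire debris flow events), the sketch below fits a random forest on synthetic terrain, burn-severity, rainfall, vegetation, and geology features and evaluates it on a held-out split. The features and data are invented placeholders, not the paper's dataset or models.

```python
# Hypothetical sketch: classify post-fire debris flow likelihood from
# synthetic terrain, burn, rainfall, vegetation, and geology features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([
    rng.uniform(0, 45, n),     # slope (degrees)
    rng.uniform(0, 1, n),      # burned-area fraction
    rng.uniform(0, 60, n),     # peak 15-min rainfall intensity (mm/h)
    rng.integers(0, 5, n),     # vegetation type code (assumed categorical)
    rng.integers(0, 8, n),     # geology class code (assumed categorical)
])
# Synthetic label: debris flow more likely on steep, badly burned, wet slopes.
logit = 0.08 * X[:, 0] + 2.0 * X[:, 1] + 0.05 * X[:, 2] - 5.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print(f"held-out ROC AUC: {roc_auc_score(y_te, proba):.2f}")
```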

HydroFrame Infrastructure: Developments in the Software Behind a National Hydrologic Modeling Framework

The HydroFrame project combines cutting-edge environmental modeling approaches with modern software principles to build an end-to-end workflow for regional and continental-scale scientific applications, enabling modelers to extract static datasets from continental datasets and run simulations on them using high-performance computing hardware hosted at Princeton University. In prior work we provided the capability for users to extract domain data for the ParFlow model at local scales and run the model using freely accessible cloud computing services (i.e., MyBinder.org).

IPPD

IPPD: Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows

Scientific workflows execute on a loosely connected set of distributed and heterogeneous computational resources. The Integrated End-to-End Performance Prediction and Diagnosis (IPPD) project contributes to a clear understanding of the factors that influence the performance and potential optimization of scientific workflows. IPPD addressed three core issues in order to provide insights that can be used to both explain and optimize workflow execution:

HydroFrame

HydroFrame is a community platform that facilitates integrated hydrologic modeling across the United States. We design innovative workflow solutions that create pathways to enable hydrologic analysis for three target user groups: modelers, analyzers, and domain science educators. As part of our contribution to HydroFrame, we run HydroFrame workflows in the Kepler system, utilizing its automated workflow capabilities to perform end-to-end hydrology simulations involving data ingestion, preprocessing, analysis, modeling, and visualization.

NRP

National Research Platform

The National Research Platform (NRP), formerly known as the Pacific Research Platform (PRP), is a collaborative, multi-institutional effort to create a shared national infrastructure for data-driven research. Backed by the National Science Foundation (NSF) and the Department of Energy (DOE), the NRP provides high-performance computing resources, data storage and management services, and network connectivity capabilities to researchers across various disciplines (e.g., the earth sciences and health sciences).

The Kepler Project

The Kepler Project supports the use and development of the free, open source Kepler Scientific Workflow System. This system helps scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. The Kepler Scientific Workflow System can operate on data stored locally and over the internet.

CESMII

Clean Energy Smart Manufacturing Innovation Institute (CESMII)

The Clean Energy Smart Manufacturing Innovation Institute (CESMII) is a non-profit organization driving the transformation of the manufacturing industry toward a cleaner, more sustainable future. The US Department of Energy's (DOE) Clean Energy Manufacturing Initiative designated CESMII a Manufacturing Innovation Institute in 2016.

WIFIRE

WIFIRE: Workflows Integrating Collaborative Hazard Sciences

The WIFIRE CI (cyberinfrastructure) builds an integrated system for wildfire analysis by combining satellite and remote sensor data with computational techniques to monitor weather patterns and predict wildfire spread in real time. The WIFIRE Lab, powered by this CI and housed at the San Diego Supercomputer Center at UC San Diego, was founded in 2013 and is composed of various platforms and efforts, including:

Sage

Sage: A Software-Defined Sensor Network

The Sage project is a National Science Foundation (NSF)-backed endeavor, led by Northwestern University since 2019. The project focuses on harnessing the latest edge computing technologies and methods to create a programmable, reusable network of smart, AI-based sensors at the edge for various applications, e.g., tracking smoke plume dispersion during wildfires. Leveraging our expertise in cyberinfrastructure development and data architecture, we have been working towards the robust development of several pieces of the Sage sensor network, including: