NIMO
NonInvasive Modeling for Optimization

Computer Science Department
Duke University


Introduction

NIMO is a system that proactively and automatically builds end-to-end application performance models to enable informed assignment of resources in networked utilities: distributed collections of compute and storage resources. A self-managing system must pursue performance goals, e.g., meeting SLAs, optimizing application performance, and maximizing resource utilization, in an automated fashion. To do so, it must capture the impact on application performance of all the relevant factors: (i) the workload, e.g., the request arrival pattern of an Internet service such as Amazon; (ii) the resources assigned to the application, e.g., the amount of CPU, memory, storage, and network resources, covering both their provisioning (how much) and their placement (where); and (iii) the data the application processes, e.g., the dataset size of a scientific application.

NIMO enables automated management of these performance goals. It has three objectives:

  • End-to-End. Learn performance models that predict performance measures taking into account an application's workload, the resources assigned to the application such as CPU, memory, and network, and the data processed by it.
  • NonInvasive. Gather the training data for models from passive instrumentation streams readily available with common tools, with no changes to application or system software.
  • Active. Proactively deploy and monitor applications on heterogeneous resource assignments to collect sufficient training data for learning accurate models quickly and automatically.

Overview

NIMO's overall architecture consists of: (i) a scheduler that enumerates, selects, and requests resource assignments (e.g., CPU, memory, and network) for applications from the utility resource infrastructure (e.g., from Shirako); (ii) a modeling engine, comprising an application profiler, a resource profiler, and a data profiler, that learns performance models for applications; and (iii) a workbench where NIMO conducts active (or proactive) application runs to automatically collect samples for learning performance models. Active learning with acceleration seeks to reduce the time before a reasonably accurate model is available.
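As a rough sketch of how these three components fit together (the class and method names below are hypothetical illustrations, not NIMO's actual API), the top-level flow might look like this in Python:

    # Illustrative top-level flow through NIMO's components; names are
    # hypothetical, and each step is elaborated in the sections below.

    def nimo_flow(application, candidates, workbench, modeling_engine, scheduler):
        # 1. Active runs in the workbench produce training samples.
        samples = [(a, workbench.run(application, a))
                   for a in modeling_engine.select_runs(candidates)]
        # 2. The modeling engine learns a performance model from the samples.
        model = modeling_engine.learn(application, samples)
        # 3. The scheduler uses the model to choose a resource assignment.
        return scheduler.choose(application, candidates, model)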

We now summarize the components of NIMO in the context of a computational-science workflow G (see the VLDB paper below for details).


Scheduler
NIMO's scheduler is responsible for generating and executing an effective plan for a given workflow G. The scheduler enumerates candidate plans for G, estimates the performance of each plan, and chooses the execution plan with the best estimated performance. A plan P for workflow G is an execution strategy that specifies a resource assignment for each task in G. In addition to the tasks in G, a plan may interpose staging tasks that move data between pairs of tasks in G. For example, a staging task Gij between tasks Gi and Gj in the workflow DAG copies the parts of Gj's input dataset produced by Gi from Gi's storage resource to Gj's.
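A minimal sketch of this plan-selection loop, assuming assignments are simple dictionaries and plan cost is additive over tasks and staging steps; the function names and the predict_runtime/staging_cost interfaces are assumptions for illustration:

    # Hypothetical plan enumeration and selection; not NIMO's actual code.
    # A plan maps each task in workflow G to one candidate resource assignment.
    from itertools import product

    def enumerate_plans(tasks, candidate_assignments):
        # Yield every plan: one resource assignment per task.
        for choice in product(candidate_assignments, repeat=len(tasks)):
            yield dict(zip(tasks, choice))

    def estimate_plan_cost(plan, predict_runtime, staging_cost):
        # Sum predicted per-task runtimes, plus the cost of staging data
        # between consecutive tasks placed on different storage resources.
        # (Assumes a linear workflow for brevity; a DAG would sum over edges.)
        tasks = list(plan)
        cost = sum(predict_runtime(t, plan[t]) for t in tasks)
        for ti, tj in zip(tasks, tasks[1:]):
            if plan[ti]["storage"] != plan[tj]["storage"]:
                cost += staging_cost(ti, tj, plan[ti], plan[tj])
        return cost

    def choose_plan(tasks, candidate_assignments, predict_runtime, staging_cost):
        # Pick the plan with the lowest estimated cost.
        return min(enumerate_plans(tasks, candidate_assignments),
                   key=lambda p: estimate_plan_cost(p, predict_runtime, staging_cost))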

Modeling Engine
The scheduler uses a performance model M(G, I, R) to estimate the performance of G with input dataset I on a resource assignment R. NIMO builds profiles of resources and of frequently executed applications by analyzing instrumentation data gathered from previous runs using common, noninvasive tools (e.g., sar, tcpdump, and nfsdump). A performance model M for an application G predicts the performance of a plan for G from three inputs: (i) G's application profile, (ii) the resource profiles of the resources the plan assigns, and (iii) the data profile of the input dataset.

Intuitively, the application profile captures how an application uses its input dataset and the resources assigned to it. Resource profiles specify attributes that characterize the function and power of those resources in an application-independent way. For example, a resource profile might represent a compute server with a fixed number of CPUs defined by attributes such as clock rate and cache sizes, with an attached memory of a given size. Similarly, storage resources can be approximated by attributes such as capacity, spindle count, seek time, and transfer speed. The data profile comprises the data characteristics of G's input dataset, e.g., the input data size. The profiles are described in our ICAC paper.
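As a concrete illustration, the three profiles and the model interface M(G, I, R) could be represented as follows; the field names echo the attributes above, but the structures and the toy prediction rule are assumptions for illustration only:

    # Hypothetical representation of NIMO's three profile types.
    from dataclasses import dataclass

    @dataclass
    class ResourceProfile:            # application-independent resource attributes
        cpu_clock_ghz: float
        cache_mb: float
        memory_gb: float
        disk_seek_ms: float
        disk_transfer_mbps: float

    @dataclass
    class DataProfile:                # characteristics of the input dataset
        input_size_gb: float

    @dataclass
    class ApplicationProfile:         # how the application uses data and resources
        cpu_seconds_per_gb: float     # compute demand per GB of input data
        io_gb_per_gb: float           # I/O volume per GB of input data

    def predict_runtime(a: ApplicationProfile, r: ResourceProfile,
                        d: DataProfile) -> float:
        # A toy stand-in for M(G, I, R): scale the application's compute and
        # I/O demands by the power of the assigned resources, and let the
        # slower of the two phases dominate.
        cpu_time = a.cpu_seconds_per_gb * d.input_size_gb / r.cpu_clock_ghz
        io_time = (a.io_gb_per_gb * d.input_size_gb * 8e3) / r.disk_transfer_mbps
        return max(cpu_time, io_time)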

Active Learning of Models
NIMO's modeling engine automatically learns the performance model for G from instrumentation data samples obtained by deploying G on selected resource assignments, either to serve a real request or proactively on idle or dedicated resources (a "workbench"). The engine actively initiates new runs of G on selected resource assignments in the workbench, guided by techniques from the design of experiments and from active learning in machine learning. The goal is to obtain sufficient training data for learning an accurate performance model for G in the shortest possible time.
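A schematic of such a loop in Python, using one common active-learning heuristic (query the assignment where an ensemble of fitted models disagrees most); the selection rule and stopping test here are illustrative, not necessarily the policies NIMO derives from design of experiments:

    # Illustrative active-learning loop; fit_models is assumed to return a
    # small ensemble of candidate models (e.g., bootstrap-fitted regressors).
    import statistics

    def learn_model_actively(candidates, run_in_workbench, fit_models,
                             budget, accuracy_target):
        # Seed with one run so the first ensemble has data to fit.
        samples = [(candidates[0], run_in_workbench(candidates[0]))]
        for _ in range(budget - 1):
            models = fit_models(samples)
            if ensemble_error(models, samples) <= accuracy_target:
                break                 # model is accurate enough; stop early
            # Query the assignment where the ensemble's predictions vary most.
            next_assignment = max(
                candidates,
                key=lambda a: statistics.pvariance([m(a) for m in models]))
            samples.append((next_assignment, run_in_workbench(next_assignment)))
        return fit_models(samples)

    def ensemble_error(models, samples):
        # Mean relative error of the ensemble's mean prediction on the samples
        # collected so far (a crude stand-in for cross-validation).
        errs = [abs(sum(m(a) for m in models) / len(models) - y) / y
                for a, y in samples]
        return sum(errs) / len(errs)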

Instrumentation data is collected during a run and aggregated into a sample data point as soon as the run completes. In keeping with NIMO's objective of being noninvasive, collecting this data requires no changes to the workflow or the underlying system; NIMO relies only on high-level metrics gathered by commonly available monitoring tools.
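The aggregation step might look like the following sketch, which averages per-interval metrics (as a tool like sar reports them) into a single training point per run; the metric names are examples, not a fixed NIMO schema:

    # Illustrative aggregation of passive instrumentation into one sample.
    # Each record holds the metrics for one sampling interval of a run,
    # e.g. parsed from sar output; the metric names are placeholders.

    def aggregate_run(records, runtime_seconds, assignment, data_size_gb):
        # Collapse per-interval metrics from one run into a single data point:
        # (features describing the run, observed runtime as the target).
        n = len(records)
        features = {
            "cpu_util": sum(r["cpu_util"] for r in records) / n,
            "io_wait": sum(r["io_wait"] for r in records) / n,
            "net_mbps": sum(r["net_mbps"] for r in records) / n,
            "data_size_gb": data_size_gb,
            **assignment,              # attributes of the assigned resources
        }
        return features, runtime_seconds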

Model-guided Resource Planning
Once NIMO has learned an application's performance model, it uses the model to make informed resource assignments: ranking the available assignments in order of predicted application performance, identifying assignments that meet a target performance, and performing what-if analysis. Details of model use are in our ICAC paper.
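Each of these three uses reduces to a simple query against the learned model. In the illustrative sketch below, predict_runtime stands for the learned model and cost is a hypothetical pricing function over assignments:

    # Illustrative model-guided planning queries; not NIMO's actual API.

    def rank_assignments(assignments, predict_runtime):
        # Order candidate assignments from best to worst predicted performance.
        return sorted(assignments, key=predict_runtime)

    def cheapest_meeting_target(assignments, predict_runtime, target_seconds, cost):
        # Among assignments predicted to meet the target, pick the cheapest.
        feasible = [a for a in assignments if predict_runtime(a) <= target_seconds]
        return min(feasible, key=cost) if feasible else None

    def what_if(assignment, predict_runtime, tweak):
        # What-if analysis: predicted change in runtime under a modified
        # assignment (e.g., doubled memory), without re-running the application.
        return predict_runtime(tweak(assignment)) - predict_runtime(assignment)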


Publications

  • "Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications", Piyush Shivam, Shivnath Babu, Jeff Chase. International Conference on Very Large Data Bases (VLDB), September 2006. [pdf]
  • "Active Sampling for Accelerated Learning of Performance Models", Piyush Shivam, Shivnath Babu, Jeff Chase. First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), June 2006. [pdf]
  • "Learning Application Models for Utility Resource Planning", Piyush Shivam, Shivnath Babu, Jeff Chase. IEEE International Conference on Autonomic Computing (ICAC), June 2006. [pdf]
Project Members