Overview of Starfish

The ability to perform timely and scalable analytical processing of large datasets is now a critical ingredient for the success of most organizations. The Hadoop software stack (which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces) is a popular choice for big data analytics. Unfortunately, Hadoop's power and flexibility come at a high cost because it relies heavily on the user or system administrator to optimize MapReduce jobs and the overall system at various levels. For example:

[Figure: TeraSort and WordCount response surfaces. Starfish finds good settings automatically for MapReduce jobs.]
  • A number of tuning parameters must be set even to run a single job in Hadoop. These parameters include the number of mappers and reducers, memory allocation settings, controls for I/O and network usage, and others. As our study shows, many of these parameters have a major impact on job performance.
  • Operations like joins, declarative queries, and large workflows are specified in Hadoop using higher-level languages like HiveQL and Pig Latin. These language compilers do not support cost-based optimization when converting declarative specifications to MapReduce jobs, relying instead on hard-coded rules and user-specified hints.
  • A well-tuned data layout is often the key to good performance in parallel data processing since it is critical to move the computation to the data. The data layout has multiple dimensions such as data block placement, partitioning, replication, compression, and choice of storage and retrieval engine. Hadoop has little support for tuning the data layout in response to workload and resource changes.
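To make the first point concrete, here is a sketch of a few job-level knobs as they appear in a Hadoop 0.20-style job configuration. The parameter names are real Hadoop configuration properties of that era, but the values shown are hypothetical and workload-dependent, not recommendations:

```xml
<!-- Sketch of job-level tuning knobs in a Hadoop 0.20-style configuration.
     The values are hypothetical; good settings depend on the job, the data,
     and the cluster resources. -->
<configuration>
  <!-- Number of reduce tasks for the job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>20</value>
  </property>
  <!-- Size (in MB) of the in-memory buffer used to sort map output -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- JVM heap settings for map and reduce task processes -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Compress intermediate map output, trading CPU for I/O and network -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>
```

The same properties can also be overridden per invocation (e.g., via `-D` options on the `hadoop jar` command line). Hadoop exposes dozens of such interacting knobs, and choosing good values for all of them by hand is exactly the burden that Starfish aims to remove.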

The Starfish project is addressing these challenges using a combination of techniques from cost-based database query optimization, robust and adaptive query processing, static and dynamic program analysis, dynamic data sampling and run-time profiling, and statistical machine learning applied to streams of system instrumentation data. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. The novelty in Starfish's approach comes from how it focuses simultaneously on different workload granularities—overall workload, workflows, and jobs (procedural and declarative)—as well as across various decision points—provisioning, optimization, scheduling, and data layout.


If you would like to stay up to date with the latest Starfish news or to ask questions, please subscribe to the Starfish group.

You are also welcome to visit our blog.

Current Project Members

  • Shivnath Babu, Assistant Professor, Duke Computer Science
  • Harold Lim, Ph.D. Candidate, Duke Computer Science
  • Jie Li, Ph.D. Candidate, Duke Computer Science
  • Kunal Agarwal, Duke University

Past Project Members


We are extremely grateful to Amazon Web Services for multiple grants that enable us to test Starfish on the Amazon Cloud Platform.