Video demo of Cyclops

Timely analysis of activity and operational data is critical for companies to stay competitive. Activity data from a company’s website includes page and content views, searches, and advertisements shown and clicked. Operational data includes monitoring data collected from web applications (e.g., request latency) and cluster resources (e.g., CPU usage).

The vast majority of analysis over activity and operational data involves continuous queries. A continuous query Q is a query that is issued once over data D that is constantly updated. Q runs continuously over D and delivers new results as D changes, so users do not have to issue the same query repeatedly. Continuous queries arise naturally over activity and operational data for two reasons: (i) the data is generated continuously in the form of append-only streams; (ii) the data has a time component, so recent data is usually more relevant than older data.
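To make the definition concrete, here is a minimal sketch in Python of a continuous query over an append-only stream. It is only an illustration, not how Cyclops or any of the systems mentioned on this page implement continuous queries. The class name SlidingWindowCount, the 60-second window, and the page-view events are all assumptions made for the example, and the sketch assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowCount:
    """Hypothetical continuous query: count events per key over the
    most recent `window_seconds` of an append-only stream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs in arrival order
        self.counts = {}       # current result: key -> count inside the window

    def on_event(self, timestamp, key):
        """Append one event to D and return the refreshed result of Q."""
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1

        # Recent data matters more: drop events that have left the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return dict(self.counts)


# The query is created once; results are refreshed as data arrives.
query = SlidingWindowCount(window_seconds=60)
stream = [(0, "/home"), (10, "/search"), (30, "/home"), (65, "/search")]
results = [query.on_event(t, page) for t, page in stream]
```

The query object is created once and then updated on every arrival, which mirrors issuing Q a single time over a changing D. When the event at time 65 arrives, the event from time 0 falls outside the window and no longer contributes to the result.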

The growing interest in continuous queries is reflected in the engineering resources that companies have recently invested in building continuous query execution platforms. Esper, Storm, and Hadoop are some recent examples of systems that can run continuous queries. Each of these systems is usually designed to work well for a particular type of workload, so no single system outperforms all others across all types of workload.

The variety of systems that can run continuous queries poses several challenges for application developers and system administrators.

  • Each system has its own, possibly different, interface for writing continuous queries, which makes developing continuous queries tedious.
  • The most suitable execution plan and system to use is not always clear-cut.
  • Managing many systems is nontrivial.

The Cyclops project addresses the challenges above. Cyclops is a management system for executing and optimizing continuous queries. It abstracts away the underlying systems by giving users a common interface for creating and running continuous queries. Its optimizer selects the most appropriate execution plan for a given query, which includes both the algorithm to use and the system to run it on (see the picture below, which compares the optimizer's estimated latency for processing an example continuous query under different execution plans against the actual latency). A high-level overview of Cyclops' system architecture can be found here.
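As a rough illustration of the idea of choosing among execution plans by estimated latency, the Python sketch below picks the plan whose cost model predicts the lowest latency for a given workload. It is not Cyclops' optimizer: the ExecutionPlan type, the choose_plan function, the plan labels, and the cost formulas are all hypothetical, and the constants are made-up numbers rather than measurements of any real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ExecutionPlan:
    """Hypothetical plan: an algorithm paired with the system that runs it."""
    system: str
    algorithm: str
    # Cost model: (window_seconds, events_per_second) -> estimated latency in seconds.
    estimate_latency: Callable[[float, float], float]


def choose_plan(plans: List[ExecutionPlan], window_seconds: float,
                events_per_second: float) -> ExecutionPlan:
    """Return the plan with the lowest estimated latency for this workload."""
    return min(plans,
               key=lambda p: p.estimate_latency(window_seconds, events_per_second))


# Made-up cost models: a fixed startup overhead plus a per-event cost.
PLANS = [
    ExecutionPlan("single_node_engine", "incremental",
                  lambda w, r: 0.00001 * w * r),
    ExecutionPlan("distributed_stream_engine", "incremental",
                  lambda w, r: 0.5 + 0.000001 * w * r),
    ExecutionPlan("batch_engine", "recompute",
                  lambda w, r: 30.0 + 0.0000001 * w * r),
]

small = choose_plan(PLANS, window_seconds=60, events_per_second=100)
medium = choose_plan(PLANS, window_seconds=60, events_per_second=10_000)
large = choose_plan(PLANS, window_seconds=3600, events_per_second=100_000)
print(small.system, medium.system, large.system)
```

Under these assumed cost models, the low-overhead plan wins for small workloads and the high-overhead, low-per-event-cost plan wins for large ones, which is one way to see why no single system is best for every workload. A real optimizer would need cost models calibrated against measured latencies.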

Shivnath Babu, Associate Professor, Duke Computer Science

Harold Lim, Ph.D., Duke Computer Science

Howard Chung, Undergraduate, Duke Computer Science