"The important thing in life is to have a great aim, and the determination to attain it" - Goethe.
DISCO: Distributed, Sustainable, and Cloud Computing Systems Lab
The DISCO Lab aims to develop an in-depth understanding of distributed, sustainable, and big data cloud computing and augmented services,
and to build open-source technologies that enhance system performance, dependability, scalability, and sustainability. The research has been supported in part by funding from the National Science Foundation.
The DISCO Lab is located in the new science and engineering building. The server room is furnished
with a cutting-edge HP data center blade facility that has three racks of HP ProLiant BL460c G6 blade server modules
and a 40 TB HP EVA storage area network with 10 Gbps Ethernet and 8 Gbps Fibre/iSCSI
dual channels. It has three APC InRow RP air-cooled units and UPS equipment for a maximum of 40 kW in an N+1 redundancy design.
- Shaoqi's paper "Scalable Distributed DL Training: Batching Communication and Computation" was accepted by AAAI 2019 (acceptance rate 16.2%).
- Shaoqi's paper "Aggressive Synchronization with Partial Processing for Iterative ML Jobs on Clusters" was accepted by ACM Middleware 2018 (acceptance rate 23%).
- Eddie's paper "Profiling Distributed Systems in Light-weight Virtualized Environments with Logs and Resource Metrics" was accepted by ACM HPDC 2018 (acceptance rate 19.5%).
- Tiago's paper "Reference-distance Eviction and Prefetching for Cache Management in Spark" was accepted by IEEE ICPP 2018 (acceptance rate 28%).
- Wei's paper "Characterizing Scheduling Delay for Low-latency Data Analytic Workloads" was accepted by IEEE IPDPS 2018 (acceptance rate 24.5%).
- A joint paper with Dr. Palden Lama "Performance Isolation of Data-intensive Scale-out Applications in Multi-tenant Clouds" was accepted by IEEE IPDPS 2018 (acceptance rate 24.5%).
- Wei's paper "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" was accepted by USENIX ATC 2017 (acceptance rate 21%).
- Wei's paper "Addressing Memory Pressure in Data-Intensive Parallel Programs via Container based Virtualization" was accepted by IEEE ICAC 2017.
- Wei's paper "Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks" was accepted by IEEE IPDPS 2017 (acceptance rate 23%).
- Shaoqi's paper "Network-Adaptive Scheduling of Data-Intensive Parallel Jobs in Clusters" was accepted by IEEE ICAC 2017.
- Dazhao's paper "Adaptive Scheduling of Parallel Jobs in Spark Streaming" was accepted by IEEE INFOCOM 2017 (acceptance rate 21%).
- A joint paper with Dr. Bo Wu "FLEP: Enabling Flexible and Efficient Preemption on GPUs" was accepted by ACM ASPLOS 2017 (acceptance rate 17%).
- Yanfei's paper "Fault Tolerant MapReduce-MPI for HPC Clusters" was accepted by ACM/IEEE SC 2015 (acceptance rate 21%).
- Dazhao's paper "Towards Energy Efficiency in Heterogeneous Hadoop Clusters by Adaptive Task Assignment" was accepted by IEEE ICDCS 2015 (acceptance rate 12.8%).
- Dazhao's paper "Resource and Deadline-aware Job Scheduling in Dynamic Hadoop Clusters" was accepted by IEEE IPDPS 2015 (acceptance rate 22%).
- Yanfei's paper "StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters" was accepted by IEEE INFOCOM 2015 (acceptance rate 19%).
- Beaulah Navaman, PhD student, 2012 - present
- Ben Albernathy, PhD student, 2012 - present
- Tiago Perez, PhD student, 2014 - present
- Wei Chen, PhD student, 2014 - present
- Shaoqi Wang, PhD student, 2015 - present
- AiDi Pi, PhD student, 2016 - present
- Oluwatobi Akanbi, 2016 - present
- Tina Rose, 2017 - present
- Kathir Palaiappan, 2018 - present
- Amy Oh, 2018 - present
Graduated PhD Students
- Palden Lama, PhD in May 2013 (Assistant Professor, UT San Antonio)
- Sireesha Muppala, PhD in May 2013 (Postdoc, UCCS)
- Dennis Ippoliti, PhD in Dec 2013 (Project Manager, Microsoft)
- Yanfei Guo, PhD in May 2015 (Postdoc, Argonne National Lab)
- Dazhao Cheng, PhD in May 2016 (Assistant Professor, UNC Charlotte)
- Jason Upchurch, PhD in May 2016 (Research Fellow, Intel)
SHF: Small: Lightweight Virtualization Driven Elastic Memory Management and Cluster Scheduling (Sponsor: NSF SHF-1816850, PI: X. Zhou. 07/2018 - 06/2021)
Data centers are evolving to host heterogeneous workloads on shared clusters to reduce operational cost and achieve high resource utilization.
However, it is challenging to schedule heterogeneous workloads with diverse resource requirements and performance constraints on heterogeneous hardware.
Data parallel processing often suffers from interference and significant memory pressure, resulting in excessive garbage collection and out-of-memory errors that harm application performance and reliability.
Cluster memory management and scheduling are still inefficient, leading to low utilization and poor multi-service support.
Existing approaches focus on either application awareness or operating system awareness, and thus are not well positioned to address the semantic gap between application run-times and the operating system.
This project aims to improve application performance and cluster efficiency via lightweight virtualization-enabled elastic memory management and cluster scheduling.
It combines system experimentation with rigorous design and analysis to improve performance and efficiency, and to tackle the memory pressure of data-parallel processing.
Developed system software will be open-sourced, providing opportunities to foster a large ecosystem that spans system software providers and customers.
MapReduce, a parallel and distributed programming model on clusters of commodity hardware, has emerged as the de facto standard for processing
large data sets. Although MapReduce provides a simple and generic interface for parallel programming, it incurs several problems including
low cluster resource utilization, suboptimal scalability and poor multi-tenancy support. This project explores and designs new techniques
that let MapReduce fully exploit the benefits of flexible and elastic resource allocations in the cloud while addressing the overhead and issues
caused by server virtualization. It broadens impact by allowing a flexible and cost-effective way to perform big data analytics.
This project also involves industry collaboration and curriculum development, and provides more avenues to bring women, minority,
and underrepresented students into research and graduate programs.
SFS: A Security-Integrated Computer Science Curriculum for Intensive Capacity Building (Sponsor: NSF DGE-1438935, PI: C. Yue, CoPIs: X. Zhou, E. Chow, and T. Boult. 09/2014 - 08/2017)
This project will advance the state of the art and practice in cybersecurity education by systematically exploring a novel security-integrated computer science curriculum approach.
CSR: System and Middleware Approaches to Predictable Services in Multi-Tenant Clouds (Sponsor: NSF CNS-1320122, PI: J. Rao, CoPI: X. Zhou. 09/2013 - 12/2017)
Datacenter-based cloud services exhibit unpredictable performance variations due to multi-tenant interference and heterogeneity in datacenter hardware.
The investigators attribute this unpredictability to the absence of two important service guarantees from existing cloud providers: resource capacity and application agility.
To provide guaranteed resource capacity and enhanced application agility, this project develops independent but complementary approaches at the system and middleware levels to reduce performance variations of in-cloud applications without compromising other objectives such as high datacenter utilization and good average performance.
The deliverables are new system support in cloud resource management that accounts for interference and hardware heterogeneity in shared infrastructures, and middleware approaches that perform agile, non-invasive, and application-centric resource provisioning.
The research methodology combines architectural knowledge of the complex interplay between simultaneous multi-threading, multicore, and non-uniform memory access architectures with statistical learning algorithms to quantify interference and heterogeneity, and integrates the strengths of self-optimizing learning and control techniques to automate resource provisioning under dynamic workloads.
This project broadens impact by exploring interdisciplinary techniques in computer system design and enhancing cloud services with predictability guarantees.
The success will guide resource management and metering in future cloud systems.
Modern data centers hosting popular Internet services face significant and multifaceted challenges in performance and power control.
The challenges are mainly due to the complex interaction of highly dynamic and heterogeneous workloads in complex virtualized computing systems.
In this research project, the investigators take an organized approach to autonomic performance and power control on virtualized servers.
The project designs and develops automated, agile and scalable techniques for server parameter tuning, virtual machine capacity planning,
non-invasive energy-efficient performance isolation, and elastic power-aware resource provisioning. The deliverables are innovative and practical
approaches and mechanisms that provide performance assurance of applications, maximize effective system throughput of data centers under resource
and power budgets, mitigate performance interference among heterogeneous applications, and achieve performance and power targets with
flexible tradeoffs while assuring control accuracy and system stability. The research methodology integrates the strengths of reinforcement learning,
fast online learning neural networks, fuzzy logic control, model predictive control, and distributed and coordinated control.
The project broadens impact by developing a testbed in a university prototype data center to demonstrate the orchestration of developed
approaches and mechanisms for autonomous management of virtualized computing systems, middleware, and services.
The success will guide autonomous resource management for sustainable computing in next-generation data centers.
Due to the dynamic nature and unprecedented scale of the Internet, Internet services pose challenges including scalability, reliability, and availability
to underlying networked systems. This CAREER project concentrates on building Internet services that are resilient to those challenges with machine
learning and control techniques. Internet services build upon cluster-based computer systems that keep growing in scale and complexity.
Such systems have become so complicated that even understanding the dynamic behavior of the entire system is a major challenge.
The investigators take an analytical and organized approach to design an autonomous software infrastructure on networked systems for building
resilient Internet services. The project builds empirical models using statistical learning to help overcome the challenges of scale and complexity
in networked systems. It designs coordinated admission control and capacity planning algorithms with end-to-end quality-of-service on multi-tier clusters.
Model-independent control techniques are used with empirical models to allocate resources and to dynamically reconfigure the system for performance
optimization needs. It develops performance differentiation, isolation, and self-adaptive reconfiguration capabilities for enhancing system reliability
and availability. It broadens the research impact by developing a testbed in a data center lab to demonstrate the orchestration of designed techniques
for automated arrangement, coordination, and management of complex computer systems, middleware, and services.
CSR: Resource Allocation Optimization for Quantitative Service Differentiation on Multi-Tier Server Clusters (Sponsor: NSF CNS-0720524, Sole PI: X. Zhou. 08/2007 - 07/2011)
Internet services have become an important class of driving applications for scalable and
quality-aware distributed computing technologies. Service differentiation provides different
quality levels to satisfy the requirements of Internet services while maintaining resource availability.
It is demanded not only due to the diversity of users' access devices and networks, but also because
it can enhance the scalability and dependability of the computing technologies.
In this research project, the investigators take an analytical and organized approach to examine
resource management techniques for quantitative service differentiation in popular multi-tier server clusters.
The broad impact of the research will be on quality control for system scalability and dependability enhancement.
This project will help society develop quality-aware applications and scalable computing technologies for popular Internet services.