Buffer and Cache Management in Scalable Network Servers

Abstract

Introduction - Network use is growing rapidly, so network servers
    must be scalable. Moreover, they must scale cheaply, since cost
    matters. The goals of this work are to drive performance at
    three levels - application, operating system, and cluster - by
    improving the caching behavior of the systems involved.

Part I: Application-level
 
  Problem - Servers are expected to handle simultaneous connections
      from many clients and provide good performance across a wide
      range of workloads.  However, existing concurrency architectures
      allow either good caching or good disk behavior, but not both.

  Processing Steps - Request processing consists of a series of steps,
      some of which involve waiting on the network, the client, or the
      disk. When handling multiple clients, it makes sense to
      interleave the processing steps for different connections. 
      Several of the steps present opportunities for caching, since
      many requests will be for a small set of files.
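The interleaving idea can be sketched as a per-connection state
machine driven by an event loop (a simplified illustration, not the
actual server code; the step names are invented):

```python
# Sketch: each connection is a small state machine, and an event loop
# advances whichever connections are ready, interleaving the steps of
# many requests instead of running each request to completion.

STEPS = ["accept", "read_request", "find_file", "send_header", "send_data", "done"]

class Conn:
    def __init__(self, cid):
        self.cid = cid
        self.step = 0            # index into STEPS

    def advance(self):
        # Perform one step; a real server would return to the event
        # loop whenever the next step would block on network or disk.
        if STEPS[self.step] != "done":
            self.step += 1

def event_loop(conns):
    order = []                   # record the interleaving for illustration
    while any(STEPS[c.step] != "done" for c in conns):
        for c in conns:
            if STEPS[c.step] != "done":
                c.advance()
                order.append((c.cid, STEPS[c.step]))
    return order

order = event_loop([Conn(0), Conn(1)])
# The recorded order alternates between the two connections: the steps
# of different requests are interleaved rather than run back to back.
```

Steps such as find_file are where caching pays off: with many
requests for a small set of files, the lookup usually hits.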

  Architectures - Existing architectures handle interleaving the steps
      in different ways. Discuss the architectures, covering their
      strengths and weaknesses.  We've developed a new concurrency
      architecture, designed to combine the strengths of existing
      approaches.

  Implementation - Discuss the various mechanisms used to give Flash
      its performance.

  Experiments - Show how various architectures perform, and where
      Flash gets its benefits.

Part II: Operating system

  Problem - Separate buffering and caching systems within the
      operating system result in multiple copies of the same data
      within a single machine. This redundant copying has two costs -
      it wastes CPU performing the copies, and it wastes memory
      storing them.
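A toy model of the problem (all names invented for illustration): the
same bytes are copied from the file-system cache into an application
buffer, and again into the socket buffer:

```python
# Toy model: with separate buffering systems, serving a cached file
# copies the same data twice within one machine.

copies = 0

def copy(src):
    global copies
    copies += 1
    return list(src)           # simulate a byte-for-byte copy

fs_cache = [104, 105]          # file data held in the file-system cache
user_buf = copy(fs_cache)      # read(): kernel -> application buffer
sock_buf = copy(user_buf)      # write(): application -> socket buffer

# Three copies of the same data now occupy memory, and two
# memcpy-equivalent operations burned CPU; a unified buffering
# scheme could share a single copy instead.
```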

  Approach - Discuss design space of copy avoidance approaches. For
      each, mention benefits and drawbacks. Conclude with design of
      IO-Lite, explaining the rationale behind the decisions. Mention
      technology trends and why they favor IO-Lite's design.
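One way to picture the copy-avoidance idea is a buffer aggregate (a
simplified sketch with assumed semantics, not IO-Lite's actual
interface): data lives in immutable buffers, an aggregate is a list
of (buffer, offset, length) slices, and subsystems pass aggregates by
reference, so concatenation and forwarding never copy the bytes.

```python
# Sketch of a buffer-aggregate abstraction (assumed semantics).

class Buffer:
    """Immutable backing store, shared read-only by all holders."""
    def __init__(self, data):
        self.data = bytes(data)

class Aggregate:
    def __init__(self, slices=None):
        self.slices = list(slices or [])   # (Buffer, offset, length) triples

    def concat(self, other):
        # Combining two aggregates is list concatenation, not memcpy.
        return Aggregate(self.slices + other.slices)

    def tobytes(self):
        # Materialize only when the data is finally consumed.
        return b"".join(b.data[o:o + n] for b, o, n in self.slices)

header = Buffer(b"HTTP/1.0 200 OK\r\n\r\n")
body = Buffer(b"hello")
resp = Aggregate([(header, 0, len(header.data))]).concat(
    Aggregate([(body, 0, len(body.data))]))
# resp shares the original buffers; only slice metadata was created.
```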

  Implementation - Describe IO-Lite's implementation in FreeBSD. 
      Discuss how it interacts with various parts of the operating
      system.  Discuss internal memory consumption and fragmentation
      issues, and what steps are taken to reduce the problem.

  Performance - Present microbenchmarks and trace-based workloads that
      show IO-Lite's performance. Also show the effects on a WAN
      workload.

Part III: Cluster-wide

  Problem - Once the expected workload exceeds the capabilities of a
      uniprocessor, the range of possible alternatives is large. 
      Clusters provide a potentially cheap entry point, and provide
      the ability to scale beyond the limits of single machines. 
      However, existing approaches to distributing requests in
      clusters either impose undesirable administrative/flexibility
      costs, or they make potentially poor use of cluster resources.

  Options - Discuss the various options for handling greater load,
      along with their benefits and drawbacks. Examples include
      namespace partitioning, cheap SMPs, real SMPs, NUMA machines,
      and clusters. Also discuss existing partitioning approaches for
      the non-SMP cases: DNS RR, IP sprayer, redirects, content-based.

  Approach - Discuss the LARD algorithm. In particular, describe the
      intuitive motivation, but also discuss how it handles hysteresis,
      avoids oscillation, replicates hot targets automatically, etc.
      Also mention the LARD/R approach.
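The basic dispatch policy can be sketched as follows (a simplified
illustration; the threshold values are invented, and replication is
omitted): each target stays on its assigned node for locality, and
moves to a lightly loaded node only past load thresholds, which
provides the hysteresis that avoids oscillation.

```python
# Sketch of locality-aware request distribution (simplified).

T_LOW, T_HIGH = 2, 4            # illustrative load thresholds

class Lard:
    def __init__(self, n):
        self.load = [0] * n     # active connections per back-end node
        self.server = {}        # target (e.g. URL) -> assigned node

    def dispatch(self, target):
        node = self.server.get(target)
        least = min(range(len(self.load)), key=lambda i: self.load[i])
        if node is None:
            node = least        # first request for target: least-loaded node
        elif (self.load[node] > T_HIGH and self.load[least] < T_LOW) \
                or self.load[node] >= 2 * T_HIGH:
            node = least        # reassign only past thresholds (hysteresis)
        self.server[target] = node
        self.load[node] += 1
        return node

    def finish(self, node):
        self.load[node] -= 1
```

Because repeated requests for a target land on the same node while
loads stay balanced, each node's working set stays small enough to
cache.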

  Evaluation - Show results of simulations and of prototype cluster
      performance. Also simulate SMP performance and run sensitivity
      studies.
