2011 Barrelfish workshop

20-21 October, Cambridge University Computer Laboratory

Thursday 20 October

09.30

Timothy Roscoe (ETH Zurich)

Welcome, last year's progress, workshop goals

10.00

Werner Haas (Intel)

System-level implications of non-volatile, random-access memory

10.15

Matt Horsnell (ARM)

OS support in ARMv7A

10.30

Andrew Baumann (Microsoft)

Drawbridge on Barrelfish

11.30

Pravin Shinde (ETH Zurich)

Scalable and adaptive network stack architecture

12.00

Zach Anderson (ETH Zurich)

Fine-grained, language-level, hierarchical resource management

12.30

Stefan Kästle (ETH Zurich)

Message-passing co-processor

14.00

Adrian Schüpbach (ETH Zurich)

A declarative language approach to device configuration

14.30

Ross McIlroy (Microsoft)

Calico: rethinking the language / runtime-system boundary

15.00

Marcin Orczyk & Calum McCall (U. Glasgow)

GHC for a multi-kernel architecture

16.00

Georgios Varisteas (KTH)

Dynamic inter-core scheduling in Barrelfish

16.30

Robert Watson (CUCL)

BERI: an open source platform for research into the h/w-s/w interface

16.45

Mikel Lujan (U. Manchester)

Teraflux: A Manchester Perspective

 

Friday 21 October

09.30

Zeus Gómez Marmolejo (BSC)

GCC cross compiler and Gasnet

10.00

Jana Giceva (ETH Zurich)

Database-OS co-design

10.30

Tim Harris (Microsoft)

Flexible hardware support for message passing

Session Transcripts

Session 1 - Thursday

Timothy Roscoe - Keynote

Mothy gave an overview of the state of the Barrelfish community, which is noticeably growing. A new Barrelfish version has also recently been released, with a new build system, support for the Intel Single-Chip Cloud Computer and ARM, boot scripts, new IDC support, a bulk transport facility, POSIX and VFS support, and a basic AHCI driver. This is enough to run full-blown POSIX applications, like PostgreSQL. Finally, the copyright now belongs solely to ETH.

At ETH, a large number of Master's students are now working on Barrelfish, which should extend its feature list even further in the near future.

The goals of Barrelfish haven't changed: to build a solid platform for good OS research, as well as a well-respected OS. Finally, Mothy is seeking feedback from the workshop participants. What are we doing right, and what are we doing wrong? Is a yearly workshop the right frequency, or should it change?

Matt Horsnell - OS Support in ARMv7A

Matt introduced the ARMv7-A architecture. ARM is now pervasive in the market and thus an important architecture to watch. ARMv7-A is a 32-bit RISC ISA with the Thumb instruction set, implemented by the Cortex-A processor series. Thumb offers fewer registers than the full RISC ISA, but has a smaller memory footprint. The architecture also supports SIMD instructions for multimedia and LPAE (similar to x86 PAE) to support large physical address spaces.

He demonstrated weak ordering in the ARM ISA: there is no defined ordering or consistency of memory accesses between ARM processors. Instead, a memory barrier instruction is provided to order memory accesses, but it is expensive.

Richard Black (RB): How expensive are barriers?
Matt: On the order of 100 cycles.
RB: Because you have to ensure the L1 is clean?
Matt: Have to write back memory to point of synchronization.
RB: If I have 2 barriers close to each other, is the second one as expensive as the first?
Matt: The cost is on the same order as the first, but it depends on the architecture.
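
To make the discussion concrete, here is a minimal sketch of a barrier-protected publish/consume pattern on ARMv7-A, assuming GCC-style inline assembly; it is illustrative only, not code from the talk.

    #include <stdint.h>

    volatile uint32_t data;
    volatile uint32_t flag;

    void publish(uint32_t value)
    {
        data = value;
        __asm__ volatile("dmb" ::: "memory");  /* order the data store before the flag store */
        flag = 1;
    }

    uint32_t consume(void)
    {
        while (flag == 0)
            ;                                  /* spin until the flag is published */
        __asm__ volatile("dmb" ::: "memory");  /* order the flag read before the data read */
        return data;
    }

Each dmb is a full data memory barrier, which is why back-to-back barriers each cost on the order of 100 cycles, as discussed above.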

Next, Matt talked about ARM's heterogeneous big.LITTLE design. This pairs a cluster of Cortex-A7 cores with Cortex-A15 cores. The A7 cores have low power requirements and high efficiency, while the A15 is more powerful. The idea is to migrate a program seamlessly, within 10k cycles (20us), from an A7 core to an A15 core when it needs more performance.

Mothy: Is the OS in control of this?
Matt: Yes.
Mothy: Does the OS see multiple cores?
Matt: Yes. We'd probably harness DVFS to switch between CPUs.
Orion Hodson: Where are we today from the press announcement?
Matt: Product will come in 2013. The models are developed. In research, you should use the GEM5 simulator, instead of the actual product.

Werner Haas - System-level implications of non-volatile RAM

Werner started by demonstrating that charge-based memory (DRAM, Flash) has scalability problems. Industry's focus today is on using resistance as the information carrier (NVRAM). As this is non-volatile, we have an opportunity for universal memory inside the machine. Current NVRAM prototypes are PCM (phase-change memory, the leading technology), STT-RAM (spin-transfer torque RAM), and ReRAM (resistive RAM). ReRAM scales best (has the highest density) and has good endurance.

Steve Hand: I'd like to buy this ReRAM. When can I do so?
Werner: HP press announcement says it's a product in 2013. First there's going to be Flash and then ReRAM for mobile devices in 2015/16.

There is a trend to replace DRAM with NVRAM; caches are the last thing that is still volatile. Should the new NVRAM be software-managed? For example, we could make the address space block-accessible, so NVRAM looks more like a disk, or we could make it byte-addressable. The address space is huge, though, so byte addressing would mean a lot of pages and big page tables.

We should also think about separating translation from protection. In terms of persistent memory, we could use Barrelfish's capabilities and match them to hardware protection directly.

Kornilios Kourtis: Are there any access patterns that get better performance in this setting? E.g. disk drivers prefer sequential data access.
Werner: You should rewrite your filesystem drivers to just access data randomly; you get better performance. There are papers about non-volatile filesystems.
Steve Hand: There are papers at FAST and SOSP'09 about this. All those reinvent the interface between CPUs and NVRAM. They either ignore it or make something up.
Werner: The memory barriers from the first talk could especially help here. You want to ensure memory reaches the NV backing store. Maybe transactional memory can also help here.
Mothy: These papers ignore that this is RAM and thus cache-line addressable. They shouldn't treat it like a disk.
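
As an illustration of Mothy's point, a cache-line-addressable NVRAM could be updated with plain stores plus explicit write-back, rather than through a block interface. The sketch below assumes x86 clflush/sfence as the write-back mechanism; real NVRAM platforms may expose different persistence primitives.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Write back every cache line covering [p, p+len) and order the
       flushes before subsequent stores. */
    static inline void flush_range(const void *p, size_t len)
    {
        const char *cl = (const char *)((uintptr_t)p & ~(uintptr_t)63);
        for (; cl < (const char *)p + len; cl += 64)
            __asm__ volatile("clflush (%0)" :: "r"(cl));
        __asm__ volatile("sfence" ::: "memory");
    }

    /* Update a record in place, then push it to the NV backing store. */
    void nvram_update(char *record, const char *src, size_t len)
    {
        memcpy(record, src, len);   /* ordinary byte-addressable stores */
        flush_range(record, len);   /* make the update durable */
    }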

Andrew Baumann - Drawbridge on Barrelfish

Drawbridge is a light-weight virtualization technology for Windows applications that presents a middle ground between virtualization and porting from scratch. It reuses two ideas from a recent ASPLOS paper: picoprocesses and library OSes. We have implemented Windows 7 as a library OS, which is documented in that paper.

Picoprocesses are light-weight, secure isolation containers. There is no access to system calls, only a much smaller API (45 vs. 1200+ calls). It's higher-level than a VM, though. Inside a picoprocess, you can run a library OS. This is much smaller than a fully virtualized OS, because it only drives one application and is self-contained.

We put this on top of Barrelfish for three reasons: first, it is a proof of self-containedness; second, it ensures OS independence for Drawbridge; and third, it allows real applications to run on Barrelfish. A related research agenda is to investigate OS support for heterogeneous hardware.

On Barrelfish, there is no security monitor, which is present when Windows is used as the host OS. Instead, we convert Win7 libOS calls to Barrelfish libOS calls and run as if we're a normal Barrelfish application. The platform adaptation layer (PAL) runs in-process.

So far, the project has contributed several improvements to Drawbridge: it has forced us to nail down an ABI (not just an API), such as thread stack ownership, the TLS contract, and relocations. There are also improvements to Barrelfish: it now allows mappings below 512GB, and has gained timers, user control over segment registers, user-mode trap handling, a user-mode debug stub, and an mmap() implementation.
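
For the last item, a minimal mmap()-style anonymous mapping can be layered on Barrelfish capabilities. This sketch assumes the frame_alloc() and vspace_map_one_frame() calls from libbarrelfish and simplified error handling; it is not the actual Drawbridge implementation.

    #include <stddef.h>
    #include <barrelfish/barrelfish.h>

    /* Allocate a RAM frame capability and map it into our vspace,
       roughly what an anonymous mmap() needs to do. */
    void *mmap_anon(size_t len)
    {
        struct capref frame;
        size_t retbytes;
        errval_t err = frame_alloc(&frame, len, &retbytes);
        if (err_is_fail(err))
            return NULL;

        void *buf;
        err = vspace_map_one_frame(&buf, retbytes, frame, NULL, NULL);
        if (err_is_fail(err))
            return NULL;
        return buf;   /* memory backed by the new frame */
    }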

The limitations of the current system include terrible I/O performance, unsupported file-sharing modes, no child processes, and missing multicore support.

Andrew then presented several models for OS support for accelerators. The driver model has essentially no direct OS integration with the accelerator and treats it as a peripheral that is programmed via a device-driver API. The co-processor model runs applications on the accelerator, but forwards all system calls to the host. Finally, the multikernel model uses the library OS only for compatibility and runs a full OS on the accelerator. Andrew is currently looking at Knights Ferry as a possible accelerator to exploit.

Andrew Moore: Why did it not work for this PowerPoint example?
Andrew: Bugs. Timers are hard, and we're not bug-for-bug compatible. In this case, we were not quite doing the right thing and PowerPoint would stall. You have to look at call frames to see what it's doing. There was an implicit assumption that memory frames are 64k-aligned that we didn't spot.

Steve Hand: What's different about getting this working on Hyper-V vs. Barrelfish, as shown in the ASPLOS paper?
Andrew: They were cheating. They ran the guest in Hyper-V and did an RPC to the host machine for each call. The issues for Hyper-V were providing threads, synchronization, etc. on top of bare metal.

Werner Haas: Accelerators are usually operating in physical address space. How would you like to have that handled?
Andrew: I think accelerators are getting increasingly general-purpose. We need enough support to run a small microkernel like Barrelfish. Look at MIC and AMD's Fusion architecture; they all have MMUs.

Werner: Then you run into TLB consistency issues.
Andrew: Yes, we know how to deal with those.

Someone: Accelerators are usually used in a different way. Do you really want to run PowerPoint there?
Andrew: No! My goal is to lower the barrier for developers to use accelerators; for example, I can't invoke the OS on an accelerator right now. I'm not going to transparently migrate an app to an accelerator and do all the magic for the developer.

Kornilios: What kind of OS support would code on an accelerator want?
Andrew: No good handle yet. It's driven by programming models today. The ideal app would be rich, invoke OS services, and turn a parallel loop into something that works on an accelerator just by adding an annotation.

Session 1 - Friday

Zeus Gómez Marmolejo - Porting the Nanos++ runtime to Barrelfish

Zeus wants to run OmpSs (OpenMP/StarSs) on Barrelfish to compare it across different architectures. He has a compiler (Mercurium) that converts OmpSs to C++, creates a dependency graph, and schedules tasks. The entire system is called Nanos.

Nanos requires dynamic libraries, but they use static linking for now. TLS (now available) is also required. Furthermore, Barrelfish still leaks memory, and a lot of the C++ runtime is not in Barrelfish. Hake was also not easy to use for Nanos: it is inflexible and not compatible with autoconf, which makes existing apps difficult to port.

Zeus proposes a GCC cross-compiler and replacing libc with newlib, so that GNU libstdc++ and libgcc_eh work. He specifies three targets (x86_64, i386 & scc).

To show that newlib already works, he demonstrates bash running on Barrelfish. He notes that there are still many missing features, such as fork(), wait(), kill() & execve(), which prevent the shell from starting anything yet. There are also circular dependencies between libbarrelfish and libc (now newlib) that are not easy to work around, and other small implementation problems, like a bug in Barrelfish's memory allocation that leaks memory.
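
For context, porting newlib means supplying its low-level system-call hooks. The sketch below shows what two such stubs might look like in a hypothetical Barrelfish port; terminal_write() stands in for whatever console primitive the port actually uses, and the static heap is purely illustrative.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>

    extern size_t terminal_write(const char *buf, size_t len); /* assumed console hook */

    /* newlib calls _write() for stdout/stderr output. */
    ssize_t _write(int fd, const void *buf, size_t len)
    {
        (void)fd;   /* this sketch only handles console output */
        return (ssize_t)terminal_write(buf, len);
    }

    /* newlib's malloc() obtains memory through _sbrk(). */
    static char heap[1 << 20];   /* tiny static heap for illustration */
    static size_t brk_off;

    void *_sbrk(ptrdiff_t incr)
    {
        if (incr < 0 || brk_off + (size_t)incr > sizeof(heap)) {
            errno = ENOMEM;      /* shrinking is ignored for brevity */
            return (void *)-1;
        }
        void *ret = &heap[brk_off];
        brk_off += (size_t)incr;
        return ret;
    }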

GASNet uses common API code plus conduit code specific to the network hardware. There is a conduit for Barrelfish, which creates a fully connected network across all CPUs.

Andrew: What do you want to run on top of GASNet?
Zeus: OpenMP and other projects are using it.

Werner: <Couldn't understand the question>?
Zeus: Bash shell is very simple. Had to solve simple things. Need to implement newlib syscalls.

Someone: Any rough ideas about performance? 4 threads sounds like overhead?
Zeus: Not very good performance yet. We used spinlocks everywhere, but will replace them with mutexes (hoping to reduce overhead). We expect to end up with better performance than UDP.

Jana Giceva - Database-OS co-design

Jana started off by showing the diversity in current hardware resources and how it is not handled well when operating systems and databases do not work together on this problem. New hardware trends and diverse workloads already influence database research, and appliances are an example where tying a database closely to the operating system has tangible benefits.

She demonstrates NUMA awareness as one example where DB/OS co-design can provide benefits. For her project, she uses Barrelfish and the CSCS engine, which is a column store. There are several initial questions to answer: what will the initial performance be, are there hot-spots or bottlenecks, and how does it scale? She answers these by comparing the Barrelfish/CSCS combination to CSCS running on Linux, using the Amadeus flight-bookings trace as the workload. The single-core results look alike, and varying the number of updates does not degrade performance. Throughput is also proportional to the datastore size.

On Barrelfish, the database can submit a manifest to the OS, which describes its requirements. The OS can also reply to the manifest, giving two-way information flow.
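
A hypothetical illustration of this two-way exchange is sketched below: the database declares its resource requirements and the OS replies with what it can actually provide. The structures and field names are invented for the sketch, not taken from the talk.

    #include <stddef.h>

    /* What the database asks for. */
    struct db_manifest {
        int    min_cores;          /* worker threads needed */
        size_t memory_bytes;       /* expected datastore footprint */
        int    want_numa_local;    /* request NUMA-local allocation */
    };

    /* What the OS answers. */
    struct os_reply {
        int    granted_cores;
        size_t granted_bytes;
        int    numa_nodes;         /* how many nodes the allocation spans */
    };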

Steve Hand: Is data in memory?
Jana: Yes.

Barrelfish is faster than Linux, due to DRAM bandwidth, and ultimately due to the C++ library implementation, which has faster support for strings.

Richard: If bottlenecked on DRAM then this is inconsistent?
Jana: No, we're CPU bound.
Richard: You should use a hash, instead of strcmp().

NUMA analysis: with NUMA awareness, data is well distributed over the nodes; without it, it is not. However, this does not impact performance.

Someone: How big are the error bars?
Jana: Not big.

To conclude, the CSCS engine works on Barrelfish. Future work: when do we hit a scalability bottleneck, and what is it?

Richard: Summarize how worker threads work? How often do they check? How are they synced?
Jana: No interaction with OS, except for NUMA alloc.
Richard: How are cores talking?
Jana: Shared-nothing DB, partitioned over cores, enqueue queries over all cores, cores scan their partition.
Richard: What's the unit of batching?
Jana: One rotation.
Richard: How does latency differ between Linux vs. BF?
Jana: The goal of the DB design is to bound latency while scaling the number of simultaneous queries. We partition the dataset such that latency stays below the bound, and admit as many queries as possible while the workload is still CPU bound. Updates and reads are separated. The goal is predictability and scalable throughput. Scans and updates are synchronized.
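
A sketch of the shared-nothing scan loop Jana describes: each core owns a partition and repeatedly scans it, evaluating every query batched for the current rotation against every record. All structures and names are invented for illustration.

    #include <stddef.h>

    struct record { char payload[64]; };                 /* one column-store row */
    struct query  { int id; /* predicate, aggregate state, ... */ };

    extern void evaluate(struct query *q, const struct record *r);

    #define MAXQ 128
    struct core_state {
        struct record *partition;      /* this core's share of the data */
        size_t         nrecords;
        struct query  *queue[MAXQ];    /* queries batched for one rotation */
        int            nqueries;
    };

    /* One rotation (the unit of batching): scan the whole partition
       once, applying every queued query to every record. */
    void rotation(struct core_state *cs)
    {
        for (size_t r = 0; r < cs->nrecords; r++)
            for (int q = 0; q < cs->nqueries; q++)
                evaluate(cs->queue[q], &cs->partition[r]);
    }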

Someone: Seems like we need a better libc?
Jana: Yes! We plan to use BSD's libc.

Steve Hand: Why do you want to run this on Barrelfish?
Jana: We're doing it because it's easier to modify Barrelfish.

Tim Harris - Flexible hardware support for message passing

Tim looks at hardware support to accelerate message passing between protection domains. Currently, Barrelfish has several layers (AC, Flounder, UMP, CC-interconnect). Tim adds an AMP interconnect driver and new hardware features to this. Related hardware work deals with communication within processes, not between them.

Higher-level abstractions are hard (flow control, no direct MPB access with protection). Related software work focuses on shared FIFO queues, but this relies on cache-coherent memory, and IPIs are heavy-weight.

Tim wants to add just enough hardware to avoid cache-coherent-memory reliance, and to avoid the need for polling or going into the kernel on notification.

In this work, Tim prepares messages in registers and introduces a hardware send(vspace, thread) mechanism. Both processes map a shared page to exchange more data.

Steve Hand: How is a thread named?
Tim: 64-bit integer.

Andrew: Can I use the thread ID as a channel ID?
Tim: Nothing would prevent multiple people from mapping in the same channel.

The hardware then maps virtual->physical, thread->core, sends on interconnect, and updates the cache line on the target core.

Steve: What is state of that cacheline?
Tim: Treat like write from target core.
Someone: Similar to injecting data into the cache directly from the NIC.

On the slow path, we speculatively allocate if the line is not present. If the receiver is not running, we can still inject. We introduce a Thread Translation Buffer (TTB) to map software thread IDs to core IDs.
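
The send path can be summarized in C-style pseudocode as below. All structures and helper functions are invented for illustration; the actual mechanism is modelled in hardware (see the GEM5 implementation mentioned later).

    #include <stdbool.h>
    #include <stdint.h>

    extern uintptr_t translate(uintptr_t vaddr);            /* existing VM translation */
    extern void cacheline_inject(int core, uintptr_t paddr,
                                 const uint64_t *regs);     /* cache-to-cache transfer */

    struct ttb_entry { uint64_t thread_id; int core; bool valid; };

    #define TTB_SIZE 64
    static struct ttb_entry ttb[TTB_SIZE];

    /* send(vspace, thread): look up the destination core in the TTB,
       translate the channel address, and deliver the register payload
       into the target core's cache line. */
    bool hw_send(uint64_t dest_thread, uintptr_t vaddr, const uint64_t *regs)
    {
        struct ttb_entry *e = &ttb[dest_thread % TTB_SIZE];
        if (!e->valid || e->thread_id != dest_thread)
            return false;                    /* TTB miss: take the slow path */
        cacheline_inject(e->core, translate(vaddr), regs);
        return true;
    }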

The key ideas are to use existing VM mechanisms for naming and protection and in the common case to have cache-to-cache transfers.

As a notification mechanism, Tim proposes a notify channel going through the kernel. On the receive side, the kernel periodically watches a bitmap of active notifications.
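
A minimal sketch of that receive-side scan, with invented names: the kernel atomically grabs the bitmap of pending notifications and wakes the receiver behind each set bit.

    #include <stdint.h>

    static volatile uint64_t notify_bitmap;   /* one bit per channel, set on notify */

    extern void wake_receiver(int chan);      /* assumed kernel hook */

    void poll_notifications(void)
    {
        /* Grab and clear all pending bits in one atomic step. */
        uint64_t pending = __atomic_exchange_n(&notify_bitmap, 0,
                                               __ATOMIC_ACQ_REL);
        while (pending) {
            int chan = __builtin_ctzll(pending);  /* lowest set bit */
            pending &= pending - 1;               /* clear it */
            wake_receiver(chan);
        }
    }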

The new hardware mechanisms are implemented on the GEM5 simulator, including the TTB, a connection whitelist, rx and tx message queues, as well as flow control.

Preliminary results show that application performance using the new mechanisms is much better, because messages go directly to the receiver rather than via the cache-coherence protocol. Tim used a two-core two-phase commit as the evaluation workload. There are several ways to schedule messages; a fair scheduling strategy behaves best, as it is fast with short polling intervals and only a bit worse with long ones.

The status is that an initial implementation exists for the Beehive computer, used for numerical programs, as well as a port to GEM5 to model non-cache-coherent shared memory. Tim can also run it on x86-64, which ignores most of the protection questions, but can be used to run longer experiments.

Someone: How do you write right into someone else's cacheline?
Tim: We use regular CC techniques. It's UMP with notifications. Maybe we can use this for work-stealing systems; we have to investigate.

Kornilios: Is gang scheduling effective?
Tim: No numbers for that. Not different to other systems.

Someone: What about remote store programming?
Tim: Have to have a look.

Someone: If you have cache-coherence protocol with special store instruction, would that be the same?
Tim: It could be adapted. Here, the sender is explicit about where the send is going.

Steve Hand: How does kernel watching work?
Tim: MONITOR/MWAIT.
Steve: Could you do this in user space?
Tim: Have the receiver thread do MONITOR/MWAIT. It's already done.
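
For reference, a sketch of a MONITOR/MWAIT wait loop on a notification line, assuming an environment where the instructions are usable (they are normally privileged; hardware or OS support is needed to issue them from user space):

    #include <stdint.h>

    /* Block until *line becomes non-zero, sleeping via MWAIT instead of
       spinning. Re-check after MONITOR to avoid a lost wakeup. */
    static inline void wait_on_line(volatile uint64_t *line)
    {
        while (*line == 0) {
            __asm__ volatile("monitor" :: "a"(line), "c"(0), "d"(0));
            if (*line != 0)
                break;
            __asm__ volatile("mwait" :: "a"(0), "c"(0));
        }
    }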

Someone: Is power considered?
Tim: No.
Someone: You could power down cores and wait for memory?
Tim: That's MONITOR/MWAIT.