2011 Barrelfish workshop
20-21 October, Cambridge University Computer Laboratory
Thursday 20 October
Timothy Roscoe (ETH Zurich)
Werner Haas (Intel)
Matt Horsnell (ARM)
Andrew Baumann (Microsoft)
Drawbridge on Barrelfish
Pravin Shinde (ETH Zurich)
Zach Anderson (ETH Zurich)
Stefan Kästle (ETH Zurich)
Adrian Schüpbach (ETH Zurich)
Ross McIlroy (Microsoft)
Marcin Orczyk & Calum McCall (U. Glasgow)
Georgios Varisteas (KTH)
Robert Watson (CUCL)
BERI: an open source platform for research into the h/w-s/w interface
Mikel Lujan (U. Manchester)
Teraflux: A Manchester Perspective
Friday 21 October
Zeus Gómez Marmolejo (BSC)
Jana Giceva (ETH Zurich)
Tim Harris (Microsoft)
Session 1 - Thursday
Timothy Roscoe - Keynote
Mothy gave an overview of the state of the Barrelfish community, which is noticably growing. A new Barrelfish version has also recently been released, with a new build system, support for the Intel Single-Chip Cloud COmputer and ARM, bootscripts, new IDC support, a bulk transport facility, POSIX and VFS support, and a basic AHCI driver. This is enough to run full-blown POSIX applications, like PostgreSQL. Finally, the copyright now belongs solely to ETH.
At ETH, a large number of Master's students is now doing work on Barrelfish, which should enhance its feature-list even further in the near future.
The goals of Barrelfish haven't changed and are still to build a solid platform for good OS research, as well as a well-respected OS. Finally, Mothy is seeking feedback from participants of the workshop. What are we doing right, what wrong? Is the frequency of yearly workshop meetings okay or should it change?
Matt Horsnell - OS Support in ARMv7A
Matt introduced the ARMv7-A architecture. ARM is now market-pervasive and thus an important architecture to look out for. The ARMv7-A is a 32bit RISC + Thumb ISA chip, based on the Cortex A processor. Thumb supports less registers than the RISC ISA, but has a smaller memory footprint. Furthermore, it supports SIMD instructions for multimedia and LPAE (like x86 PAE) to support large physical address spaces.
He demonstrated weak ordering in the ARM ISA: There is no defined ordering or consistency of instructions between ARM processors. Instead, a memory barrier instruction is supported to order memory accesses, but this is expensive.
Richard Black (RB): How expensive are barriers? Matt: On the order of 100 cycles. RB: Because you have to ensure the L1 is clean? Matt: Have to write back memory to point of synchronization. RB: If I have 2 barriers close to each other, is the second one as expensive as the first? Matt: The cost is on the same order as the first, but it depends on the architecture.
Next, Matt talked about the heterogeneous ARM Big Little. This is based on a cluster of Cortex A7 and and some Cortex 15 chips. The A7 have low power requirements and high efficiency, while the 15 is more beefy. Their idea is to migrate a program seamlessly within 10k cycles (20us) from an A7 core to the A15 if it needs more performance.
Mothy: Is the OS in control of this? Matt: Yes. Mothy: Does the OS see multiple cores? Matt: Yes. We'd probably harness DVFS to switch between CPUs. Orion Hodson: Where are we today from the press announcement? Matt: Product will come in 2013. The models are developed. In research, you should use the GEM5 simulator, instead of the actual product.
Werner Haas - System-level implementations of non-volatile RAM
Werner started by demonstrating that charge-based memory (DRAM, Flash) has scalability problems. Industry's focus today is on using resistance as information carrier (NVRAM). As this is non-volatile, we have an opportunity for universal memory inside the machine. Current NVRAM prototypes are PCM (phase-change memory, the leading technology), STTRAM (Spin-Torque Transfer RAM), and ReRAM (Resistive RAM). ReRAM scales best (has highest density) and has good endurance.
Steve Hand: I'd like to buy this ReRAM. When can I do so? Werner: HP press announcement says it's a product in 2013. First there's going to be Flash and then ReRAM for mobile devices in 2015/16.
There is a trend to replace DRAM with NVRAM. Caches are the last thing that is still volatile. Should the new NVRAM be software-managed? For example, we can make address space be block accessable to make them look more like a disk, or should we rather make them byte addressable? The address space is rather huge, so there would be a lot of pages and big page tables.
We should also think about separating translation from protection. In terms of persistent memory, we could use Barrelfish's capabilities and match them to hardware protection directly.
Kornilios Kourtis: Are there any access patterns that get better performance in this setting? E.g. disk drivers prefer sequential data access. Werner: You should rewrite your filesystem drivers and just start accessing data randomly and you get better performance. There are papers about non-volatile filesystems. Steve Hand: There are papers at FAST and SOSP'09 about this. All those reinvent the interface between CPUs and NVRAM. They either ignore it or make something up. Werner: Especially memory barriers from first talk could help here. You want to ensure memory is in NV backing sotre. Maybe transactional memory can also help here. Mothy: These papers ignore that this is RAM and thus cache-line addressable. They shouldn't treat it like a disk.
Andrew Baumann - Drawbridge on Barrelfish
Drawbridge is a light-weight virtualization technology for Windows applications that presents a middle-ground between virtualization and porting from scratch. It reuses two ideas: Picoprocesses and library OSes from a recent ASPLOS paper. We have implemented Windows 7 as a library OS, which is also documented in that paper.
Picoprocesses are light-weight, secure isolation containers. There is no access to system calls, but a much smaller API (45 vs. 1200+ calls). It's higher-level than a VM, though. Inside a pricoprocess, you can run a library OS. This is much smaller than a fully virtualized OS, because it only drives one application and is self-contained.
We put this on top of Barrelfish for three reasons: Firstly, it's a proof of self-containedness. Secondly, it ensures OS independence for Drawbridge and thirdly, it allows real applications to run on Barrelfish. An attached research agenda is to investigate OS support for heterogeneous hardware.
For Barrelfish, there is no security monitor, which is present if Windows is used as a host OS. Instead, we convert Win7 libOS calls to Barrelfish libOS calls and run as if we're a normal Barrelfish application. The platform adaptation layer (PAL) runs in-process.
So far, the project has contributed several improvements to Drawbridge: It has forced us to nail down an ABI (not just an API), such as thread stack ownership, the TLS contract, and relocations. There are also improvements to Barrelfish: We now allows mappings below 512GB, timers, user control over segment registers, user-mode trap handling, a user-mode debug stub and an mmap() implementation.
The limitations of the current system include terrible IO performance, unsupported file sharing modes, child processes, and missing multicore support.
Andrew then presented several models for OS support for accelerators. The driver model essentially has no direct OS integration with the accelerator and treats it as a peripheral that is programmed via a device driver API. The co-processor model runs applications on the accelerator, but forwards all system calls to the host. Finally, the Multikernel uses the library OS only for compatibility and runs a full OS on the accelerator. Andrew is currently looking at Knight's Ferry as a possible accelerator to exploit.
Andrew Moore: Why did it not work for this powerpoint example? Andrew: Bugs. Timers are hard. We're not bug-for-bug compatible. In this case, we were not quite doing the right thing. In this case, PowerPoint would stall. Have to look at callframes to see what it's doing. Implicit assumption that memory frames are 64k aligned that we didnt spot.
Steve Hand: WHat's different to getting this working on Hyper-V vs. Barrelfish, as shown in the ASPLOS paper? Andrew: They were cheating. They ran the guest in Hyper-V. They did RPC for each call to host machine. Issues for Hyper-V were providing threads, sync, etc on top of bare-metal.
Werner Haas: Accelerators are usually operating in physical address space. How would you like to have that handled? Andrew: I think accels are getting increasingly general purpose. We need enough support to run a small Microkernel like Barrelfish. Look at MIC, AMD's fusion arch. They all have MMUs.
Werner: Then you run into TLB consistency issues. Andrew: Yes, we know how to deal with those.
Someone: Accelerators are usually used in different way. Do you really want to run PowerPoint there? Andrew: No! My goal is to reduce the barrier to developers being able to use accelerators, like I can't invoke the OS on an accelerator right now. I'm not transparently going to migrate an app to an accelerator and do all the magic for the developer.
Kornilios: What kind of OS support would code on an accelerator want? Andrew: No good handle yet. It's driven by programming models today. The ideal app would be rich, invokes OS services, and wants to turn a parallel loop into something to work on an accelerator, by just needing an annotation.
Session 3 - Thursday
Speaker: Adrian Schüpbach (ETH Zurich)
Topic: A declarative language approach to device configuration
Adrian talks about how hard is hardware resource configuration. He explains this problem by talking in details about the PCI address space configuration.
He explains how PCI address space works and how allocations happens in most simple case. Then he proceeds with explaining how this problem becomes incrementally complex as various constraints in PCI configuration come into pictures.
Question: How you find the devices? Do you have the list of them with you?
Answer: We don't have the list, We discover them.
Question: How you discover them? Do you have to configure all the bridges first and then discover the devices?
Answer: No, you can query hardware without configuring the bridges. By querying the devices, you get the list of devices.
Adrian then explains the complexities introduced by special cases hardwares which needs some fixed address in PCI address space.
Chris Adeniyi-Jones: Why is there a hole?
Answer (Andrew): Some legacy PCI device may always decode a fixed address, irrespective of what is assigned to it. So, we treat it these address ranges as holes to avoid allocating them to some other device.
Adrian explains the quirks in PCI address space and how they complicate the PCI address space allocation problem. He explains how Linux and Windows deals with this problem and the limitations.
He explains the approach taken by him which involves expressing the problem in high level language and then separating the device access logic from PCI allocation logic. He explains reasoning behind using CLP and eclipse. The steps involved in this solution are explained. He shows one example of how the problem can be expressed into CLP problem and another example of how to handle the quirks using this approach.
He explains the implementation of this solution inside Barrelfish and how it works within the boot sequence of Barrelfish.
He continues with evaluation which shows that this solution works correctly and is maintainable due to small code-base. He shows that how compactly the PCI allocation problem can be shown and also that handling additional quirks is easy. He compares the LOC count for Linux solution and his solution.
He concludes his session by stating that PCI configuration problem is hard declarative language is promising approach for dealing with large, diverse and evolving hardware base.
Question: Don't you have the garbage collector in the eclipse solver? You are consuming 60MB of memory in 36ms, so you are consuming the memory at the rate more than 1GB/s, which looks like running without garbage collector.
Answer: The garbage collector is active in eclipse. The 60MB memory is statically allocated during the initialization for the eclipse solver heap and this size does not change during the execution.
Question: Do the eclipse really need 60MB of memory?
Yes. Eclipse creates its dynamic structures on the heap, and then garbage collector runs and removes these dynamic structures. But we still keep the 60MB of memory. So, 60MB is the high mark for memory usage. It is not that bad as we use eclipse for many other services and hence the cost gets amortized.
Question: Is it 60MB per core? Does each core need it's own copy?
Answer: No, it is central service. So, it is allocated only once.
Question: Can it solve the optimization solved incrementally? And is PCI hot-plugging supported in Barrelfish?
Answer: Not implemented in Barrelfish, so we don't know for sure. But in theory it should be doable with this approach.
Speaker: Ross McIlroy (Microsoft)
Topic: Calico: rethinking the language / runtime-system boundary
Ross explains Calico which is a parallel runtime system. He explains the motivation behind this work by giving some context about history of parallelism for performance improvement. He then explains how all the assumptions in HPC systems do not hold on recent multi-core commodity systems.
He then gives the overview of Calico system, which challenges are tackled in by this system. He then explains the task based, fork-join programming model used inside Calico and the advantages of this approach.
He explains the Calico programming model and the usage of "async", "par" and "finish" directives using small code sample.
Andrew: Do you really want to mix 'async' and 'par' in same block? (slide-4)
Answer: It depends on the application requirements. Sometimes you want to mix them, sometimes you definitely don't want to mix them. So, it seems it can work together.
He takes the problem of FluidAnimate as an example which is a benchmark from parsec benchmark suit and explains it. He then explains the approach of static partitioning which is taken by openMP and Intel Parallel building block and their limitations.
He then explains the dynamic partitioning or work-stealing approach which is used in Clik-5 and Wool, He explains the spawn/sync overhead on these implementations.
Question: Have you measured the context switch costs? Based on if the work is stolen or not stolen, won't the cost change? (slide-17)
Answer: The cost was taken from the paper, and was not measured directly by authors. And author agrees to the observation that cost changes based on the work is stolen or not.
Richard black: How is compared to the minimum cycles to do the computation? (slide-17)
Answer: 10 cycles (on slides)
Ross then explains disciplined partitioning which is attempt to take benefits from above two approaches and is used inside Calico. He then shows the code for FluidAnimate which uses Calico and explains how this deals with limitations of above two approaches.
He then shows the evaluations comparing Calico with Parsec Native when there is no competition for CPU-time and when there is a competition for CPU-time.
Question: What architecture you are running on? (slide-44)
Answer: This was 4 core Intel workstation. We never got it running on large CPU server.
Question: What is the graph normalized to? (slide-44)
Answer: This graph is wall-clock execution time and is normalized to the sequential (no. of cores == 1) execution time.
Question: what is the other domain competing for CPU? (slide-45)
Answer: Its a very simple dummy application which is takes up the CPU.
Question: Are you running this on a Barrelfish?
Answer: Yes. There is no clever thread migration going on in these runs.
Kornilios Kourtis: What exactly is parsec using for running threads? (slide-45)
Answer: Various Parsec version can differ but this version is using pthreads.
Kornilios Kourtis: And what type of partitioning is used here? (slide-45)
Answer: This experiment is using static partitioning. The number of threads == no. of cores
Simon Peter: Do you know if parsec is pinned on the core?
Answer: Yes, in Barrelfish they will be pinned to core.
Simon Peter: Can linux load-balancing can be observed in Linux for this benchmark?
Answer: Linux does not migrate threads for this benchmark.
Ross then discusses the current status of the work and the future directions for OS/Runtime system co-design.
Kornilios Kourtis: About space-time continuum, my thinking was that if workload is somewhere in middle you shouldn't be too much away from the optimal results. I was wandering if you have results which contradicts that? For example, if you have partitions with chunks large enough then you won't have dispatch overhead, so that should be good for any case. I was wondering if you have any results which shows against it.
Answer: Often this is the case. The problem is in working out what is the proper chunk size that does not cost you lot in dispatch cost.
Question: Have you heard about Habaneru-C or Habaneru-Java?
Speaker: Marcin Orczyk & Calum McCall (U. Glasgow)
Topic: GHC for a multi-kernel architecture
Marcin start with explaining Islands RTS which is a Haskell runtime system for machines with partial cache coherence. He justifies the reasoning behind assumption of partial cache coherence and systems composed of a number of islands.
He then gives the overview of Islands runtime system and its inspiration from GHC7 and GUM. He explains how Islands RTS works between islands and how it works within Island.
He then explains evil static thunks and the issues caused by it in the developments.
He then explains the current status and the results observed.
Calum then takes over the presentation and explains his future plan to port Islands GHC runtime on Barrelfish.
Andrew Baumann: What was bad with putting one island in one process?
Answer: There is nothing wrong. It was bad idea from authors side.
Question: virtual machine runtimes are like OS, then what happens if you merge language runtime mix with OS?
Answer: OS may not give the features needed by GHC language (eg. light-weight threads), thats why GHC runtime is needed as additional layer.
Timothy Roscoe: If that is the case, what is the problem with OS interface provided by barrelfish? If not, what is stopping GHC from merging with OS?
Answer: Author is not well aware of barrelfish internals. But what will help is to know common denominator of what GHC needs and what Barrelfish provides.
Werner Hass: what about configurable cache coherency where you can pick which cores should be cache coherency?
Answer: This would be a great proposition. Currently the system can only validate if particular cores can go in same island by checking with SKB. But if one can configure cores to selectively become cache-coherent then you definitely have much better flexibility.
Question: Will it help to move the core to different island?
Answer: The island is actually a heap, so even thought core migrates to new island it will not take all the local data with it in current the current implementation. So, for current implementation, it does not make sense.
Session 4 - Thursday
Speaker: Georgios Varisteas (KTH)
Topic: Dynamic inter-core scheduling in Barrelfish
The goal of this work is to increase throughput in Barrelfish by using dynamic inter-core scheduling. The scheduling is separated in two levels, a kernel and a user-level scheduler. The latter one runs as a user-level library. This user-level scheduler is aware of the application's requirement, whereas the kernel level scheduler knows about the available resources and manages the global state.
Questions during the presentations:
Andrew Baumann: What do you mean by Kernel scheduler?
> The scheduler is not running in the kernel, but on top of the monitors
Andrew Baumann: So it is more a system service?
> Yes, a distributed system service
On slide 12:
Simon Peter: So the processes gather statistics?
> Yes, to minimize traffic, to prevent bottlenecks and sustain scalability
Simon Peter: So the green processes send information to the responsible scheduler for section four?
Questions at the end of the talk:
Simon Peter: Information flow from user-level application to the kernel-level schedulers, at which time granularity?
> Less than the interval for scheduling. Because still in implementation phase, no concrete values.
Simon Peter: What would you like it to be? Like milliseconds?
> This is not simple to answer. Maybe big and small samples. Scheduling different for different kind of programs. Some more concrete answer after some experimenting.
Referring to slide 12:
Simon Peter: Scheduler responsible for entire domain, which might also span over other sections. Wouldn't it be more straight forward to have schedulers responsible for one section.
> Two schedulers. User-level run-time and low-level scheduler. For example: yellow program spans over several sections. For each segment only one instance of a low-level scheduler.
Simon Peter: And that is the one that has primary control over the section?!
> Over some processes, not all of them.
Simon Peter: Which one control the yellow area again?
> The naive criteria is: which process has more cores.
Simon Peter: My model would have been that each scheduler is responsible for those threads of a domain within a section and re-allocations in this sections, which might be less flexible. Hunch: One primary for each section, responsible for all threads in a section scales better. The questions is what would you like to scale with.
> Yes, it is still work in progress.
Karl-Filip Faxén: A scheduler is responsible for a set of cores, but also responsible for decisions affecting programs. Scheduler has therefore two roles. I could imagine that the scheduler is responsible for getting the statistics for cores in one section as you propose, but that there is still some need to aggregate on program-by-program basis. The schedulers in a section would gather and aggregate measurements from cores for applications spanning sections. The other schedulers would just gather measurements to this scheduler, which would then decide what needs to be done in the application.
Simon Peter: That sound more like the model I had in mind. To have local decisions and then communication with other section.
Speaker: Robert Watson
Topic: BERI: an open source platform for research into h/w-s/w interface
Robert Watson was talking about BERI, a system to help hardware and operating-system developers working together again. The platform is ment to be used to experiment and investigate the impact of even fine grained changes to hardware components along with their parameters. This will help to develop operating system across several platforms, architectures and applications.
Timothy Roscoe: As I understand it, most processor actually have many many many registers. 32 mains, the compiler knows about this and reuses these.
> Absolutely. One question for our capability model is how many capability register we should have and the cost of context-switches and all those kinds of things. We try to see what makes sense. For example: Caches and the size of these caches. We can change assembler instructions, op-codes but also the number of registers and there implications.
Timothy Roscoe: SW & HW people need to cooperate.
Timothy Roscoe: We use our capability system not so much for security but resource allocation. For us, the attraction was in easily figuring out who owns what and we thought that this would be absolutely true. Do you think that caps for authentication leads to problems?
> That's a good questions, I don't know. Capabilities are not as cheap as everyone thinks. Some things came up. We have two types of capabilities: a) essentially segments, b) object capabilities. our notion of capabilities are really really cheap. But there are questions like: garbage collection, address spaces, language run-times. Address spaces being cheap/ almost free, we never reuse address-spaces except for certain very specific contexts, in particular the stack because it is very hard to change assumptions about the stack, e.g. running existing software side-by-side with new stuff.
Timothy Roscoe: Yeah, tell us about it. Basic problem we hit: Contention, really don't want to share bit-representation of capabilities between cores. Synchronization done asynchronously. Lazily. Refer to an address space, which is shared. Transferring capabilities between cores nearly impossible business.
> We eliminated replication. No synchronization requirements. Delicate things across cores.
Speaker: Mikel Lujan
Topic: Teraflux: A Manchester Perspective
Mikel Lujan was talking about the teraflux project. The project is about to improve parallelism for machines having several thousand cores. These systems cannot be exploited with todays programming models and applications. The project seeks to find a complete solution to leverage the full power of theses systems by rethinking the design of applications, programming models, the architecture and compiler techniques. Then, data-flow principals can be applied efficiently throughout the system.
Timothy Roscoe: This is a slightly naughty question, so you can attack me afterward. The way you draw that diagram, it really looks like kind of a waterfall. At the end you get to: "Yes, we have developed some hardware and we can see whether it works". Is there are danger there that you end up doing the same thing than the Manchester data-flow machine did, which is that you develop a bunch of useful techniques but in the end if you put them together the improvement is not so large after all.
> I think that does not happen, we got it right. We handled absolutely everything. It's not a waterfall. We integrate as much as we can. For the other project, we had like 150 people, not allowed to talk to each other, which was difficult to handle.
Following: A discussion of simulators and the licensing models. Also: Comparison of the simulators.
Kornilios Kourtis: How dynamic is the graph during execution. Can it change?
Kornilios Kourtis: Is there a difference between task-parallel programming and data-flow programming.
Karl-Filip Faxén: As a kind of author of task-based systems: When people are talking about task-based parallelism basically talking about nested fork/join kind of parallelism. and that is a more limited view of parallelism. Computation have a certain shape, e.g. like a sequential loop you would need a nested fork/join framework to have a barrier between sequential loop iterations. With the data-flow model some iterations depend on some iterations in the previous iteration of the sequential loop. That would buy you a lot of flexibility.
Session 1 - Friday
Zeus Gomez Marmolejo - Porting Nanos++ runtime to Barrelfish
Zeus wants to run OpenMP (StarSs) on Barrelfish to compare it on different architectures. He has a compiler (Mercurium) that converts OmpSs to C++, creates a dependency graph and and schedules tasks. The entire system is called Nanos.
Nanos requires dynamic libraries, but they use static linking for now. Also, TLS (now available) is required. Finally, Barrelfish still leaks memory, and a lot of runtime from C++ is not in Barrelfish. Hake was also not easy to use for Nanos. It is inflexible and not compatible with autoconf, which makes existing apps difficult to port.
Zeus proposes a GCC cross-compiler and to use newlib to replace libc, so GNU libstdc++ and libgcc_eh work. He specifies 3 targets (x86_64, i386 & scc).
To show that newlib already works, he shows an example of bash running on Barrelfish. He notes that there are still many missing features, such as fork(), wait(), kill() & execve(), which prevent the shell from starting anything yet. There are also problems with loop dependencies between libbarrelfish and libc (now newlib) that are not easy to work around and other small implementation problems, like a bug in memory allocation on Barrelfish that leaks memory.
GASnet uses common API code and conduit code specific to network hardware. There is a conduit for Barrelfish, which creates a complete network to all CPUs.
Andrew: What do you want to run on top of gasnet? Zeus: OpenMP and other projects are using it.
Werner: <Couldn't understand the question>? Zeus: Bash shell is very simple. Had to solve simple things. Need to implement newlib syscalls.
Someone: Any rough ideas about performance? 4 threads sounds like overhead? Zeus: Not very good performance yet. We used spinlocks everywhere, but will replace by mutexes (hoping to reduce overhead). Expect to have more performance than UDP in the end.
Jana Giceva - Database-OS co-design
Jana started off by showing the diversity in current HW resources and how this is not well handled if operating systems and databases are not working together on this problem. New hardware trends and diverse workloads already influence database research and appliances are an example where tying a database closely with the operating system has tangible benefits.
She demonstrates NUMA awareness as one example where DB/OS co-design can provide benefits. For her project, she uses Barrelfish and the CSCS engine, which is a column store. There are several initial questions to answer, such as what is the initial performance going to be, are there hot-spots or bottlenecks, what is its scalability. She answers these by comparing the Barrelfish/CSCS combination to CSCS running on Linux. As a workload, she uses the Amadeus flight bookings trace. The single core results look alike and varying the number of updates does not impose performance degradation. The datastore size is also proportional to its throughput.
On Barrelfish, the database can submit a manifest to the OS, which describes its requirements. The OS is also allowed to reply back to the manifest for 2-way information flow.
Steve Hand: Is data in memory? Jana: Yes.
Barrelfish is faster than Linux. Due to DRAM bandwidth. Ultimately due to the C++ library implementation, which has faster support for strings.
Richard: If bottlenecked on DRAM then this is inconsistent? Jana: No, we're CPU bound. Richard: You should use a hash, instead of strcmp().
NUMA analysis: NUMA awareness makes everything well-distributed over nodes, not so without awareness. But doesn't impact performance.
Someone: How big are the error bars? Jana: Not big.
To conclude, the CSCS engine works on BF. Future work: When do we hit a scalability bottleneck? And what is it?
Richard: Summarize how worker threads work? How often do they check? How are they synced? Jana: No interaction with OS, except for NUMA alloc. Richard: How are cores talking? Jana: Shared-nothing DB, partitioned over cores, enqueue queries over all cores, cores scan their partition. Richard: What's the unit of batching? Jana: One rotation. Richard: How does latency differ between Linux vs. BF? Jana: Goal of DB design is to bound latency time, but scale number of simultaneous queries. Partition dataset, such that latency is less than bound. Put as many queries as possible such that workload is still CPU bound. Separate updates and reads. Goal is predictability, scalable throughput. Scans and updates are synchronized.
Someone: Seems like we need a better libc? Jana: Yes! We plan to use BSD's libc.
Steve Hand: Why do you want to run this on Barrelfish? Jana: We're doing it because it's easier to modify Barrelfish.
Tim Harris - Flexible hardware support for message passing
Tim looks at hardware support to accelerate message passing between protection domains. Currently, Barrelfish has several layers (AC, Flounder, UMP, CC-interconnect). Tim adds AMP interconnect driver and new hardware features to this. Related hardware work is dealing with communication within processes, not between them.
Higher-level abstractions are hard (flow control, no direct MPB access with protection). Related software work is focusing on shared FIFO queues. This relies on CC-memory and IPIs are heavy-weight.
Tim wants to add just enough hardware to avoid cache-coherent-memory reliance, and to avoid the need for polling or going into the kernel on notification.
In this work, Tim prepares messages in regs and introduces a hardware send(vspace, thread) mechanis. Both processes map a shared page to exchange more data.
Steve Hand: How is a thread named? Tim: 64-bit integer.
Andrew: Can I use the thread ID as a channel ID? Tim: Nothing would prevent multiple people to map in the same channel.
The hardware then maps virtual->physical, thread->core, sends on interconnect, and updates the cache line on the target core.
Steve: What is state of that cacheline? Tim: Treat like write from target core. Someone: Similar to injecting data to cache right from NIC.
On the slow path, we speculatively allocate, if line not present. If receiver not running, we can always inject. We introduce a Thread Translation Buffer (TTB) to map software IDs to core IDs.
The key ideas are to use existing VM mechanisms for naming and protection and in the common case to have cache-to-cache transfers.
As notification mechansism, Tim proposes a notify channel going through the kernel. On the receive side, the kernel periodically watches a bitmap of active notifications.
The new hardware mechanisms are implemented on the GEM5 simulator, including the TTB, a connection white list, rx and tx message queues, as well as flow control.
Preliminary results show that application performance using the new mechanisms is much better, due to sending right to the receiver and not via the cache coherence protocol. Tim used a 2-core two-phase-commit as evaluation workload. There are several ways to schedule messages and a fair scheduling strategy is most well behaved, as it's fast with short polling times and only a bit worse with long polling times.
The status is that an initial implementation exists for the Beehive computer, used for numerical programs, as well as a port to GEM5 to model non-cache-coherent shared memory. Tim can also run it on x86-64, which ignores most of the protection questions, but can be used to run longer experiments.
Someone: How do you write right into someone else's cacheline? Tim: We use regular CC techniques. It's UMP with notifications. Maybe we can use this for work-stealing systems. Have to investigate.
Kornilios: Is gang scheduling effective? Tim: No numbers for that. Not different to other systems.
Someone: What about remote store programming? Tim: Have to have a look.
Someone: If you have cache-coherence protocol with special store instruction, would that be the same? Tim: Could be adapted. Here, the sender is explicit about where the send is going to.
Steve Hand: How does kernel watching work? Tim: MONITOR/MWAIT. Steve: Do in user-space? Tim: Have receiver thread to MONITOR/MWAIT. It's already done.
Someone: Is power considered? Tim: No. Someone: You could power down cores and wait for memory? Tim: That's MONITOR/MWAIT.