2011 Barrelfish workshop

20-21 October, Cambridge University Computer Laboratory

Thursday 20 October

09.30

Timothy Roscoe (ETH Zurich)

Welcome, last year's progress, workshop goals

10.00

Werner Haas (Intel)

System-level implications of non-volatile, random-access memory

10.15

Matt Horsnell (ARM)

OS support in ARMv7A

10.30

Andrew Baumann (Microsoft)

Drawbridge on Barrelfish

11.30

Pravin Shinde (ETH Zurich)

Scalable and adaptive network stack architecture

12.00

Zach Anderson (ETH Zurich)

Fine-grained, language-level, hierarchical resource management

12.30

Stefan Kästle (ETH Zurich)

Message-passing co-processor

14.00

Adrian Schüpbach (ETH Zurich)

A declarative language approach to device configuration

14.30

Ross McIlroy (Microsoft)

Calico: rethinking the language / runtime-system boundary

15.00

Marcin Orczyk & Calum McCall (U. Glasgow)

GHC for a multi-kernel architecture

16.00

Georgios Varisteas (KTH)

Dynamic inter-core scheduling in Barrelfish

16.30

Robert Watson (CUCL)

BERI: an open source platform for research into the h/w-s/w interface

16.45

Mikel Lujan (U. Manchester)

Teraflux: A Manchester Perspective

 

Friday 21 October

09.30

Zeus Gómez Marmolejo (BSC)

GCC cross compiler and Gasnet

10.00

Jana Giceva (ETH Zurich)

Database-OS co-design

10.30

Tim Harris (Microsoft)

Flexible hardware support for message passing

Session Transcripts

Session 1 - Thursday

Timothy Roscoe - Keynote

Mothy gave an overview of the state of the Barrelfish community, which is noticeably growing. A new Barrelfish version has also recently been released, with a new build system, support for the Intel Single-Chip Cloud Computer and ARM, boot scripts, new IDC support, a bulk transport facility, POSIX and VFS support, and a basic AHCI driver. This is enough to run full-blown POSIX applications, like PostgreSQL. Finally, the copyright now belongs solely to ETH.

At ETH, a large number of Master's students is now doing work on Barrelfish, which should enhance its feature-list even further in the near future.

The goals of Barrelfish haven't changed and are still to build a solid platform for good OS research, as well as a well-respected OS. Finally, Mothy is seeking feedback from participants of the workshop. What are we doing right, what wrong? Is the frequency of yearly workshop meetings okay or should it change?

Matt Horsnell - OS Support in ARMv7A

Matt introduced the ARMv7-A architecture. ARM is now market-pervasive and thus an important architecture to look out for. ARMv7-A is a 32-bit RISC + Thumb ISA, implemented by the Cortex-A processors. Thumb supports fewer registers than the full RISC ISA, but has a smaller memory footprint. Furthermore, the architecture supports SIMD instructions for multimedia and LPAE (like x86 PAE) to support large physical address spaces.

He demonstrated weak ordering in the ARM ISA: there is no defined ordering or consistency of memory accesses between ARM processors. Instead, memory barrier instructions are provided to order memory accesses, but they are expensive.

Richard Black (RB): How expensive are barriers?
Matt: On the order of 100 cycles.
RB: Because you have to ensure the L1 is clean?
Matt: Have to write back memory to point of synchronization.
RB: If I have 2 barriers close to each other, is the second one as expensive as the first?
Matt: The cost is on the same order as the first, but it depends on the architecture.

Next, Matt talked about the heterogeneous ARM big.LITTLE design. This is based on a cluster of Cortex-A7 and Cortex-A15 cores. The A7 has low power requirements and high efficiency, while the A15 is more powerful. Their idea is to migrate a program seamlessly within 10k cycles (20us) from an A7 core to an A15 core if it needs more performance.

Mothy: Is the OS in control of this?
Matt: Yes.
Mothy: Does the OS see multiple cores?
Matt: Yes. We'd probably harness DVFS to switch between CPUs.
Orion Hodson: Where are we today from the press announcement?
Matt: Product will come in 2013. The models are developed. In research, you should use the GEM5 simulator, instead of the actual product.

Werner Haas - System-level implications of non-volatile RAM

Werner started by demonstrating that charge-based memory (DRAM, Flash) has scalability problems. Industry's focus today is on using resistance as the information carrier (NVRAM). As this is non-volatile, we have an opportunity for universal memory inside the machine. Current NVRAM prototypes are PCM (phase-change memory, the leading technology), STT-RAM (Spin-Torque Transfer RAM), and ReRAM (Resistive RAM). ReRAM scales best (has the highest density) and has good endurance.

Steve Hand: I'd like to buy this ReRAM. When can I do so?
Werner: HP press announcement says it's a product in 2013. First there's going to be Flash and then ReRAM for mobile devices in 2015/16.

There is a trend to replace DRAM with NVRAM; caches are the last thing that is still volatile. Should the new NVRAM be software-managed? For example, we can make the address space block-accessible to make it look more like a disk, or should we rather make it byte-addressable? The address space is rather huge, so there would be a lot of pages and big page tables.

We should also think about separating translation from protection. In terms of persistent memory, we could use Barrelfish's capabilities and match them to hardware protection directly.

Kornilios Kourtis: Are there any access patterns that get better performance in this setting? E.g. disk drivers prefer sequential data access.
Werner: You should rewrite your filesystem drivers and just start accessing data randomly and you get better performance. There are papers about non-volatile filesystems.
Steve Hand: There are papers at FAST and SOSP'09 about this. All those reinvent the interface between CPUs and NVRAM. They either ignore it or make something up.
Werner: Especially the memory barriers from the first talk could help here. You want to ensure memory is in the NV backing store. Maybe transactional memory can also help here.
Mothy: These papers ignore that this is RAM and thus cache-line addressable. They shouldn't treat it like a disk.

Andrew Baumann - Drawbridge on Barrelfish

Drawbridge is a light-weight virtualization technology for Windows applications that presents a middle-ground between virtualization and porting from scratch. It reuses two ideas: Picoprocesses and library OSes from a recent ASPLOS paper. We have implemented Windows 7 as a library OS, which is also documented in that paper.

Picoprocesses are light-weight, secure isolation containers. There is no access to system calls, but a much smaller API (45 vs. 1200+ calls). It's higher-level than a VM, though. Inside a picoprocess, you can run a library OS. This is much smaller than a fully virtualized OS, because it only drives one application and is self-contained.

We put this on top of Barrelfish for three reasons: Firstly, it's a proof of self-containedness. Secondly, it ensures OS independence for Drawbridge and thirdly, it allows real applications to run on Barrelfish. An attached research agenda is to investigate OS support for heterogeneous hardware.

For Barrelfish, there is no security monitor, which is present if Windows is used as a host OS. Instead, we convert Win7 libOS calls to Barrelfish libOS calls and run as if we're a normal Barrelfish application. The platform adaptation layer (PAL) runs in-process.

So far, the project has contributed several improvements to Drawbridge: It has forced us to nail down an ABI (not just an API), such as thread stack ownership, the TLS contract, and relocations. There are also improvements to Barrelfish: we now allow mappings below 512GB, timers, user control over segment registers, user-mode trap handling, a user-mode debug stub and an mmap() implementation.

The limitations of the current system include terrible IO performance, unsupported file-sharing modes, missing child-process support, and missing multicore support.

Andrew then presented several models for OS support for accelerators. The driver model essentially has no direct OS integration with the accelerator and treats it as a peripheral that is programmed via a device driver API. The co-processor model runs applications on the accelerator, but forwards all system calls to the host. Finally, the Multikernel uses the library OS only for compatibility and runs a full OS on the accelerator. Andrew is currently looking at Knight's Ferry as a possible accelerator to exploit.

Andrew Moore: Why did it not work for this powerpoint example?
Andrew: Bugs. Timers are hard. We're not bug-for-bug compatible. In this case, we were not quite doing the right thing and PowerPoint would stall. You have to look at call frames to see what it's doing. There was an implicit assumption that memory frames are 64k-aligned that we didn't spot.

Steve Hand: What's different about getting this working on Hyper-V vs. Barrelfish, as shown in the ASPLOS paper?
Andrew: They were cheating. They ran the guest in Hyper-V. They did RPC for each call to host machine. Issues for Hyper-V were providing threads, sync, etc on top of bare-metal.

Werner Haas: Accelerators are usually operating in physical address space. How would you like to have that handled?
Andrew: I think accels are getting increasingly general purpose. We need enough support to run a small Microkernel like Barrelfish. Look at MIC, AMD's fusion arch. They all have MMUs.

Werner: Then you run into TLB consistency issues.
Andrew: Yes, we know how to deal with those.

Someone: Accelerators are usually used in a different way. Do you really want to run PowerPoint there?
Andrew: No! My goal is to reduce the barrier to developers being able to use accelerators, like I can't invoke the OS on an accelerator right now. I'm not transparently going to migrate an app to an accelerator and do all the magic for the developer.

Kornilios: What kind of OS support would code on an accelerator want?
Andrew: No good handle yet. It's driven by programming models today. The ideal app would be rich, invokes OS services, and wants to turn a parallel loop into something to work on an accelerator, by just needing an annotation.

Session 4 - Thursday


Speaker: Georgios Varisteas (KTH)
Topic: Dynamic inter-core scheduling in Barrelfish
The goal of this work is to increase throughput in Barrelfish by using dynamic inter-core scheduling. The scheduling is separated into two levels, a kernel-level and a user-level scheduler. The latter runs as a user-level library. This user-level scheduler is aware of the application's requirements, whereas the kernel-level scheduler knows about the available resources and manages the global state.


Questions during the presentations:

Andrew Baumann: What do you mean by Kernel scheduler?
> The scheduler is not running in the kernel, but on top of the monitors

Andrew Baumann: So it is more a system service?
> Yes, a distributed system service

On slide 12:

Simon Peter: So the processes gather statistics?
> Yes, to minimize traffic, to prevent bottlenecks and sustain scalability

Simon Peter: So the green processes send information to the responsible scheduler for section four?
> Yes

Questions at the end of the talk:

Simon Peter: Information flow from user-level application to the kernel-level schedulers, at which time granularity?
> Less than the interval for scheduling. Because still in implementation phase, no concrete values.

Simon Peter: What would you like it to be? Like milliseconds?
> This is not simple to answer. Maybe big and small samples. Scheduling is different for different kinds of programs. Some more concrete answer after some experimenting.

Referring to slide 12:

Simon Peter: The scheduler is responsible for an entire domain, which might also span other sections. Wouldn't it be more straightforward to have schedulers responsible for one section?
> Two schedulers. User-level run-time and low-level scheduler. For example: yellow program spans over several sections. For each segment only one instance of a low-level scheduler.

Simon Peter: And that is the one that has primary control over the section?!
> Over some processes, not all of them.

Simon Peter: Which one controls the yellow area again?
> The naive criterion is: which process has more cores.

Simon Peter: My model would have been that each scheduler is responsible for those threads of a domain within a section and re-allocations in these sections, which might be less flexible. Hunch: one primary scheduler for each section, responsible for all threads in a section, scales better. The question is what you would like to scale with.
> Yes, it is still work in progress.

Karl-Filip Faxén: A scheduler is responsible for a set of cores, but also responsible for decisions affecting programs. The scheduler therefore has two roles. I could imagine that the scheduler is responsible for getting the statistics for cores in one section as you propose, but that there is still some need to aggregate on a program-by-program basis. The schedulers in a section would gather and aggregate measurements from cores for applications spanning sections. The other schedulers would just forward measurements to this scheduler, which would then decide what needs to be done in the application.

Simon Peter: That sounds more like the model I had in mind: to have local decisions and then communication with other sections.


Speaker: Robert Watson
Topic: BERI: an open source platform for research into h/w-s/w interface
Robert Watson talked about BERI, a system to help hardware and operating-system developers work together again. The platform is meant to be used to experiment with and investigate the impact of even fine-grained changes to hardware components along with their parameters. This will help to develop operating systems across several platforms, architectures and applications.


Timothy Roscoe: As I understand it, most processors actually have many, many registers; 32 main ones, which the compiler knows about and reuses.
> Absolutely. One question for our capability model is how many capability registers we should have and the cost of context switches and all those kinds of things. We try to see what makes sense. For example: caches and the size of these caches. We can change assembler instructions and op-codes, but also the number of registers, and see their implications.

Timothy Roscoe: SW & HW people need to cooperate.

Timothy Roscoe: We use our capability system not so much for security but resource allocation. For us, the attraction was in easily figuring out who owns what and we thought that this would be absolutely true. Do you think that caps for authentication lead to problems?
> That's a good question, I don't know. Capabilities are not as cheap as everyone thinks. Some things came up. We have two types of capabilities: a) essentially segments, b) object capabilities. Our notion of capabilities is really cheap. But there are questions like garbage collection, address spaces, language run-times. Address spaces being cheap/almost free, we never reuse address spaces except in certain very specific contexts, in particular the stack, because it is very hard to change assumptions about the stack, e.g. when running existing software side-by-side with new stuff.

Timothy Roscoe: Yeah, tell us about it. Basic problem we hit: Contention, really don't want to share bit-representation of capabilities between cores. Synchronization done asynchronously. Lazily. Refer to an address space, which is shared. Transferring capabilities between cores nearly impossible business.
> We eliminated replication. No synchronization requirements. Delicate things across cores.


Speaker: Mikel Lujan
Topic: Teraflux: A Manchester Perspective
Mikel Lujan talked about the Teraflux project. The project aims to improve parallelism for machines with several thousand cores. These systems cannot be exploited with today's programming models and applications. The project seeks to find a complete solution to leverage the full power of these systems by rethinking the design of applications, programming models, the architecture and compiler techniques. Then, data-flow principles can be applied efficiently throughout the system.


Timothy Roscoe: This is a slightly naughty question, so you can attack me afterward. The way you draw that diagram, it really looks like kind of a waterfall. At the end you get to: "Yes, we have developed some hardware and we can see whether it works". Is there a danger that you end up doing the same thing as the Manchester data-flow machine did, which is that you develop a bunch of useful techniques, but in the end, if you put them together, the improvement is not so large after all?
> I think that does not happen, we got it right. We handled absolutely everything. It's not a waterfall. We integrate as much as we can. For the other project, we had like 150 people, not allowed to talk to each other, which was difficult to handle.

Following: A discussion of simulators and the licensing models. Also: Comparison of the simulators.

Kornilios Kourtis: How dynamic is the graph during execution? Can it change?
> Yes.

Kornilios Kourtis: Is there a difference between task-parallel programming and data-flow programming?

Karl-Filip Faxén: As a kind of author of task-based systems: when people talk about task-based parallelism, they are basically talking about nested fork/join parallelism, and that is a more limited view of parallelism. Computations have a certain shape; e.g., for something like a sequential loop you would need a nested fork/join framework to have a barrier between sequential loop iterations. With the data-flow model, some iterations depend only on some iterations in the previous iteration of the sequential loop. That would buy you a lot of flexibility.

Session 1 - Friday

Zeus Gomez Marmolejo - Porting Nanos++ runtime to Barrelfish

Zeus wants to run OpenMP (StarSs) on Barrelfish to compare it on different architectures. He has a compiler (Mercurium) that converts OmpSs to C++, creates a dependency graph and schedules tasks. The entire system is called Nanos.

Nanos requires dynamic libraries, but they use static linking for now. Also, TLS (now available) is required. Finally, Barrelfish still leaks memory, and a lot of the C++ runtime is not in Barrelfish. Hake was also not easy to use for Nanos. It is inflexible and not compatible with autoconf, which makes existing apps difficult to port.

Zeus proposes a GCC cross-compiler and to use newlib to replace libc, so GNU libstdc++ and libgcc_eh work. He specifies 3 targets (x86_64, i386 & scc).

To show that newlib already works, he shows an example of bash running on Barrelfish. He notes that there are still many missing features, such as fork(), wait(), kill() & execve(), which prevent the shell from starting anything yet. There are also problems with circular dependencies between libbarrelfish and libc (now newlib) that are not easy to work around, and other small implementation problems, like a bug in memory allocation on Barrelfish that leaks memory.

GASnet uses common API code and conduit code specific to network hardware. There is a conduit for Barrelfish, which creates a complete network to all CPUs.

Andrew: What do you want to run on top of gasnet?
Zeus: OpenMP and other projects are using it.

Werner: <Couldn't understand the question>?
Zeus: Bash shell is very simple. Had to solve simple things. Need to implement newlib syscalls.

Someone: Any rough ideas about performance? 4 threads sounds like overhead?
Zeus: Not very good performance yet. We used spinlocks everywhere, but will replace by mutexes (hoping to reduce overhead). Expect to have more performance than UDP in the end.

Jana Giceva - Database-OS co-design

Jana started off by showing the diversity in current HW resources and how this is not well handled if operating systems and databases are not working together on this problem. New hardware trends and diverse workloads already influence database research and appliances are an example where tying a database closely with the operating system has tangible benefits.

She demonstrates NUMA awareness as one example where DB/OS co-design can provide benefits. For her project, she uses Barrelfish and the CSCS engine, which is a column store. There are several initial questions to answer, such as: what is the initial performance going to be, are there hot-spots or bottlenecks, and what is its scalability? She answers these by comparing the Barrelfish/CSCS combination to CSCS running on Linux. As a workload, she uses the Amadeus flight-bookings trace. The single-core results look alike, and varying the number of updates does not impose performance degradation. Throughput is also proportional to the datastore size.

On Barrelfish, the database can submit a manifest to the OS, which describes its requirements. The OS is also allowed to reply back to the manifest for 2-way information flow.

Steve Hand: Is data in memory?
Jana: Yes.

Barrelfish is faster than Linux, due to DRAM bandwidth, and ultimately due to the C++ library implementation, which has faster support for strings.

Richard: If bottlenecked on DRAM then this is inconsistent?
Jana: No, we're CPU bound.
Richard: You should use a hash, instead of strcmp().

NUMA analysis: NUMA awareness makes everything well-distributed over nodes; without awareness it is not. But it doesn't impact performance.

Someone: How big are the error bars?
Jana: Not big.

To conclude, the CSCS engine works on BF. Future work: When do we hit a scalability bottleneck? And what is it?

Richard: Summarize how worker threads work? How often do they check? How are they synced?
Jana: No interaction with OS, except for NUMA alloc.
Richard: How are cores talking?
Jana: Shared-nothing DB, partitioned over cores, enqueue queries over all cores, cores scan their partition.
Richard: What's the unit of batching?
Jana: One rotation.
Richard: How does latency differ between Linux vs. BF?
Jana: Goal of DB design is to bound latency time, but scale number of simultaneous queries. Partition dataset, such that latency is less than bound. Put as many queries as possible such that workload is still CPU bound. Separate updates and reads. Goal is predictability, scalable throughput. Scans and updates are synchronized.

Someone: Seems like we need a better libc?
Jana: Yes! We plan to use BSD's libc.

Steve Hand: Why do you want to run this on Barrelfish?
Jana: We're doing it because it's easier to modify Barrelfish.

Tim Harris - Flexible hardware support for message passing

Tim looks at hardware support to accelerate message passing between protection domains. Currently, Barrelfish has several layers (AC, Flounder, UMP, CC-interconnect). Tim adds an AMP interconnect driver and new hardware features to this. Related hardware work is dealing with communication within processes, not between them.

Higher-level abstractions are hard (flow control, no direct MPB access with protection). Related software work is focusing on shared FIFO queues. This relies on CC-memory and IPIs are heavy-weight.

Tim wants to add just enough hardware to avoid cache-coherent-memory reliance, and to avoid the need for polling or going into the kernel on notification.

In this work, Tim prepares messages in registers and introduces a hardware send(vspace, thread) mechanism. Both processes map a shared page to exchange more data.

Steve Hand: How is a thread named?
Tim: 64-bit integer.

Andrew: Can I use the thread ID as a channel ID?
Tim: Nothing would prevent multiple people from mapping in the same channel.

The hardware then maps virtual->physical, thread->core, sends on interconnect, and updates the cache line on the target core.

Steve: What is state of that cacheline?
Tim: Treat like write from target core.
Someone: Similar to injecting data to cache right from NIC.

On the slow path, we speculatively allocate if the line is not present. If the receiver is not running, we can always inject. We introduce a Thread Translation Buffer (TTB) to map software IDs to core IDs.

The key ideas are to use existing VM mechanisms for naming and protection and in the common case to have cache-to-cache transfers.

As the notification mechanism, Tim proposes a notify channel going through the kernel. On the receive side, the kernel periodically watches a bitmap of active notifications.

The new hardware mechanisms are implemented on the GEM5 simulator, including the TTB, a connection white list, rx and tx message queues, as well as flow control.

Preliminary results show that application performance using the new mechanisms is much better, due to sending right to the receiver and not via the cache coherence protocol. Tim used a 2-core two-phase-commit as evaluation workload. There are several ways to schedule messages and a fair scheduling strategy is most well behaved, as it's fast with short polling times and only a bit worse with long polling times.

The status is that an initial implementation exists for the Beehive computer, used for numerical programs, as well as a port to GEM5 to model non-cache-coherent shared memory. Tim can also run it on x86-64, which ignores most of the protection questions, but can be used to run longer experiments.

Someone: How do you write right into someone else's cacheline?
Tim: We use regular CC techniques. It's UMP with notifications. Maybe
we can use this for work-stealing systems. Have to investigate.

Kornilios: Is gang scheduling effective?
Tim: No numbers for that. Not different to other systems.

Someone: What about remote store programming?
Tim: Have to have a look.

Someone: If you have cache-coherence protocol with special store instruction, would that be the same?
Tim: Could be adapted. Here, the sender is explicit about where the send is going to.

Steve Hand: How does kernel watching work?
Tim: MONITOR/MWAIT.
Steve: Do in user-space?
Tim: Have receiver thread to MONITOR/MWAIT. It's already done.

Someone: Is power considered?
Tim: No.
Someone: You could power down cores and wait for memory?
Tim: That's MONITOR/MWAIT.

BarrelfishWiki: CambridgeWorkshop2011 (last edited 2011-12-12 15:00:41 by shindep)