= 2011 Barrelfish workshop =
== 20-21 October, Cambridge University Computer Laboratory ==
||||<(>''Thursday 20 October''||||
|| 09.30|| Timothy Roscoe (ETH Zurich) || [[attachment:2011-roscoe.pdf|Welcome, last year's progress, workshop goals]]||
|| 10.00|| Werner Haas (Intel) ||[[attachment:2011-haas.pdf|System-level implications of non-volatile, random-access memory]] ||
|| 10.15|| Matt Horsnell (ARM) ||[[attachment:2011-horsnell.pdf|OS support in ARMv7A]]||
|| 10.30|| Andrew Baumann (Microsoft) ||Drawbridge on Barrelfish||
|| 11.30|| Pravin Shinde (ETH Zurich) ||[[attachment:2011-shinde.pdf|Scalable and adaptive network stack architecture]]||
|| 12.00|| Zach Anderson (ETH Zurich) || [[attachment:2011-anderson.pdf|Fine-grained, language-level, hierarchical resource management]]||
|| 12.30|| Stefan Kästle (ETH Zurich) || [[attachment:2011-kaestle.pdf|Message-passing co-processor]] ||
|| 14.00|| Adrian Schüpbach (ETH Zurich) ||[[attachment:2011-schupbach.pdf|A declarative language approach to device configuration]]||
|| 14.30|| Ross !McIlroy (Microsoft) ||[[attachment:2011-mcilroy.pdf|Calico: rethinking the language / runtime-system boundary]] ||
|| 15.00|| Marcin Orczyk & Calum !McCall (U. Glasgow) ||[[attachment:2011-mccall.pdf|GHC for a multi-kernel architecture]]||
|| 16.00|| Georgios Varisteas (KTH) ||[[attachment:2011-varisteas.pdf|Dynamic inter-core scheduling in Barrelfish]]||
|| 16.30|| Robert Watson (CUCL) ||BERI: an open source platform for research into the h/w-s/w interface||
|| 16.45|| Mikel Lujan (U. Manchester) ||Teraflux: A Manchester Perspective||
|||| ||||
||||<(>''Friday 21 October''||||
|| 09.30|| Zeus Gómez Marmolejo (BSC) ||[[attachment:2011-gomez-marmolejo.pdf|GCC cross compiler and Gasnet]]||
|| 10.00|| Jana Giceva (ETH Zurich) ||[[attachment:2011-giceva.pdf|Database-OS co-design]]||
|| 10.30|| Tim Harris (Microsoft) ||[[attachment:2011-harris.pdf|Flexible hardware support for message passing]]||
== Session Transcripts ==
=== Session 1 - Thursday ===
==== Timothy Roscoe - Keynote ====
Mothy gave an overview of the state of the Barrelfish community, which
is noticeably growing. A new Barrelfish version has also recently been
released, with a new build system, support for the Intel Single-Chip
Cloud Computer and ARM, bootscripts, new IDC support, a bulk transport
facility, POSIX and VFS support, and a basic AHCI driver. This is
enough to run full-blown POSIX applications, like PostgreSQL. Finally,
the copyright now belongs solely to ETH.
At ETH, a large number of Master's students are now working on
Barrelfish, which should extend its feature list even further in the
near future.
The goals of Barrelfish haven't changed: to build a solid platform for
good OS research, as well as a well-respected OS. Finally, Mothy is
seeking feedback from the workshop participants. What are we doing
right, and what are we doing wrong? Is a yearly workshop the right
frequency, or should it change?
==== Matt Horsnell - OS Support in ARMv7A ====
Matt introduced the ARMv7-A architecture. ARM is now market-pervasive
and thus an important architecture to watch. ARMv7-A is a 32-bit
architecture offering both the ARM (RISC) and Thumb ISAs, implemented
by the Cortex-A processors. Thumb supports fewer registers than the
ARM ISA, but has a smaller memory footprint. The architecture also
provides SIMD instructions for multimedia and LPAE (like x86 PAE) to
support large physical address spaces.
He demonstrated weak ordering in the ARM ISA: there is no defined
ordering or consistency of memory accesses as observed between ARM
processors. Instead, a memory barrier instruction is provided to
order memory accesses, but it is expensive.
{{{
Richard Black (RB): How expensive are barriers?
Matt: On the order of 100 cycles.
RB: Because you have to ensure the L1 is clean?
Matt: Have to write back memory to point of synchronization.
RB: If I have 2 barriers close to each other, is the second one as expensive as the first?
Matt: The cost is on the same order as the first, but it depends on the architecture.
}}}
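To make the weak-ordering point concrete, here is a minimal sketch (ours, not from the talk) of the classic message-passing idiom on ARMv7 in C: without the DMB barriers, another core may observe the flag before the data it guards.
{{{
/* Minimal sketch of the ARMv7 message-passing idiom (illustration, not
 * from the talk). Without the DMB barriers, a reader on another core
 * may observe 'flag' set while still seeing a stale value of 'data'. */
volatile int data;
volatile int flag;

void producer(void)
{
    data = 42;
    __asm__ volatile("dmb" ::: "memory"); /* order data write before flag write */
    flag = 1;
}

int consumer(void)
{
    while (flag == 0)
        ;                                 /* spin until the flag becomes visible */
    __asm__ volatile("dmb" ::: "memory"); /* order flag read before data read */
    return data;                          /* now guaranteed to see 42 */
}
}}}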
Next, Matt talked about the heterogeneous ARM big.LITTLE design. This
is based on a cluster of Cortex-A7 and some Cortex-A15 chips. The A7
cores have low power requirements and high efficiency, while the A15
is beefier. The idea is to migrate a program seamlessly, within 10k
cycles (20us), from an A7 core to an A15 core if it needs more
performance.
{{{
Mothy: Is the OS in control of this?
Matt: Yes.
Mothy: Does the OS see multiple cores?
Matt: Yes. We'd probably harness DVFS to switch between CPUs.
Orion Hodson: Where are we today from the press announcement?
Matt: Product will come in 2013. The models are developed. In research, you should use the GEM5 simulator, instead of the actual product.
}}}
==== Werner Haas - System-level implications of non-volatile RAM ====
Werner started by demonstrating that charge-based memory (DRAM, Flash)
has scalability problems. Industry's focus today is on using
resistance as the information carrier (NVRAM). As this is non-volatile,
we have an opportunity for universal memory inside the machine. Current
NVRAM prototypes are PCM (phase-change memory, the leading
technology), STT-RAM (Spin-Torque Transfer RAM), and ReRAM (Resistive
RAM). ReRAM scales best (has the highest density) and has good endurance.
{{{
Steve Hand: I'd like to buy this ReRAM. When can I do so?
Werner: HP press announcement says it's a product in 2013. First there's going to be Flash and then ReRAM for mobile devices in 2015/16.
}}}
There is a trend to replace DRAM with NVRAM; caches are then the last
thing that is still volatile. Should the new NVRAM be software-managed?
For example, should its address space be block-accessible, to make
NVRAM look more like a disk, or should it rather be byte-addressable?
The address space is rather huge, so there would be a lot of pages and
big page tables.
We should also think about separating translation from protection. In
terms of persistent memory, we could use Barrelfish's capabilities and
match them to hardware protection directly.
{{{
Kornilios Kourtis: Are there any access patterns that get better performance in this setting? E.g. disk drivers prefer sequential data access.
Werner: You should rewrite your filesystem drivers and just start accessing data randomly, and you get better performance. There are papers about non-volatile filesystems.
Steve Hand: There are papers at FAST and SOSP'09 about this. All of those reinvent the interface between CPUs and NVRAM. They either ignore it or make something up.
Werner: Especially the memory barriers from the first talk could help here. You want to ensure memory is in the NV backing store. Maybe transactional memory can also help here.
Mothy: These papers ignore that this is RAM and thus cache-line addressable. They shouldn't treat it like a disk.
}}}
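To illustrate Werner's point about ensuring data reaches the NV backing store, here is a minimal sketch (ours), using today's x86 cache-management instructions as a stand-in for whatever primitives NVRAM hardware will eventually provide:
{{{
/* Hypothetical persistent-write sketch. A plain store may linger in the
 * (volatile) cache; CLFLUSH pushes the line out towards the NV backing
 * store and SFENCE orders the flush before subsequent stores. */
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence */

void persistent_store(int *p, int value)
{
    *p = value;         /* the store hits the volatile cache first */
    _mm_clflush(p);     /* write the cache line back to (NV)RAM */
    _mm_sfence();       /* ensure the flush is ordered before continuing */
}
}}}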
==== Andrew Baumann - Drawbridge on Barrelfish ====
Drawbridge is a light-weight virtualization technology for Windows
applications that presents a middle-ground between virtualization and
porting from scratch. It reuses two ideas: Picoprocesses and library
OSes from a recent ASPLOS paper. We have implemented Windows 7 as a
library OS, which is also documented in that paper.
Picoprocesses are light-weight, secure isolation containers. There is
no access to system calls, but a much smaller API (45 vs. 1200+
calls). It's higher-level than a VM, though. Inside a picoprocess,
you can run a library OS. This is much smaller than a fully
virtualized OS, because it only drives one application and is
self-contained.
We put this on top of Barrelfish for three reasons: Firstly, it's a
proof of self-containedness. Secondly, it ensures OS independence for
Drawbridge and thirdly, it allows real applications to run on
Barrelfish. An attached research agenda is to investigate OS support
for heterogeneous hardware.
For Barrelfish, there is no security monitor, which is present if
Windows is used as a host OS. Instead, we convert Win7 libOS calls to
Barrelfish libOS calls and run as if we're a normal Barrelfish
application. The platform adaptation layer (PAL) runs in-process.
So far, the project has contributed several improvements to
Drawbridge: it has forced us to nail down an ABI (not just an API),
covering things such as thread stack ownership, the TLS contract, and
relocations. There are also improvements to Barrelfish: it now allows
mappings below 512GB, and has gained timers, user control over segment
registers, user-mode trap handling, a user-mode debug stub and an
mmap() implementation.
The limitations of the current system include terrible I/O performance,
unsupported file-sharing modes, missing child processes, and missing
multicore support.
Andrew then presented several models for OS support for
accelerators. The driver model essentially has no direct OS
integration with the accelerator and treats it as a peripheral that is
programmed via a device driver API. The co-processor model runs
applications on the accelerator, but forwards all system calls to the
host. Finally, the Multikernel uses the library OS only for
compatibility and runs a full OS on the accelerator. Andrew is
currently looking at Knights Ferry as a possible accelerator to
exploit.
{{{
Andrew Moore: Why did it not work for this PowerPoint example?
Andrew: Bugs. Timers are hard. We're not bug-for-bug compatible. In this case, we were not quite doing the right thing and PowerPoint would stall. You have to look at call frames to see what it's doing. There was an implicit assumption that memory frames are 64k-aligned that we didn't spot.
}}}
{{{
Steve Hand: What's different about getting this working on Hyper-V vs. Barrelfish, as shown in the ASPLOS paper?
Andrew: They were cheating. They ran the guest in Hyper-V. They did RPC for each call to host machine. Issues for Hyper-V were providing threads, sync, etc on top of bare-metal.
}}}
{{{
Werner Haas: Accelerators are usually operating in physical address space. How would you like to have that handled?
Andrew: I think accelerators are getting increasingly general-purpose. We need enough support to run a small microkernel like Barrelfish. Look at MIC and AMD's Fusion architecture. They all have MMUs.
}}}
{{{
Werner: Then you run into TLB consistency issues.
Andrew: Yes, we know how to deal with those.
}}}
{{{
Someone: Accelerators are usually used in a different way. Do you really want to run PowerPoint there?
Andrew: No! My goal is to reduce the barrier to developers being able to use accelerators, like I can't invoke the OS on an accelerator right now. I'm not transparently going to migrate an app to an accelerator and do all the magic for the developer.
}}}
{{{
Kornilios: What kind of OS support would code on an accelerator want?
Andrew: No good handle yet. It's driven by programming models today. The ideal app would be rich, invokes OS services, and wants to turn a parallel loop into something to work on an accelerator, by just needing an annotation.
}}}
=== Session 3 - Thursday ===
--------------------------------------------------
''Speaker:'' '''Adrian Schüpbach (ETH Zurich)'''<
>
''Topic:'' '''A declarative language approach to device configuration'''<
>
Adrian talks about how hard hardware resource configuration is. He
explains this problem by talking in detail about PCI address space
configuration: how the PCI address space works, how allocation happens
in the simplest case, and how the problem becomes incrementally more
complex as various constraints in the PCI configuration come into the
picture.
Question: How do you find the devices? Do you have a list of them?<
>
Answer: We don't have a list; we discover them.
Question: How do you discover them? Do you have to configure all the bridges
first and then discover the devices?<
>
Answer: No, you can query the hardware without configuring the bridges. By
querying the devices, you get the list of devices.
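For context, a minimal sketch (ours, not Adrian's Barrelfish code) of how such discovery works on x86: each bus/device is probed through the legacy 0xCF8/0xCFC configuration ports, and a vendor ID of 0xFFFF means nothing is there.
{{{
/* Sketch of PCI device discovery via the legacy x86 configuration
 * mechanism (ports 0xCF8/0xCFC); illustration only, not Adrian's code.
 * Needs I/O privilege, e.g. iopl(3) on Linux, running as root. */
#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>

static uint32_t pci_cfg_read(int bus, int dev, int fn, int reg)
{
    uint32_t addr = 0x80000000u | (bus << 16) | (dev << 11)
                  | (fn << 8) | (reg & 0xfc);
    outl(addr, 0xCF8);   /* select bus/device/function/register */
    return inl(0xCFC);   /* read the 32-bit configuration word */
}

int main(void)
{
    iopl(3);             /* gain port I/O access (must be root) */
    for (int bus = 0; bus < 256; bus++)
        for (int dev = 0; dev < 32; dev++) {
            uint32_t id = pci_cfg_read(bus, dev, 0, 0);
            if ((id & 0xffff) != 0xffff)  /* 0xFFFF vendor = no device */
                printf("%02x:%02x vendor %04x device %04x\n",
                       bus, dev, id & 0xffff, id >> 16);
        }
    return 0;
}
}}}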
Adrian then explains the complexities introduced by special-case
hardware which needs fixed addresses in the PCI address space.
Chris Adeniyi-Jones: Why is there a hole?<
>
Answer (Andrew): Some legacy PCI devices may always decode a fixed address,
irrespective of what is assigned to them. So we treat these address ranges
as holes, to avoid allocating them to some other device.
Adrian explains the quirks in the PCI address space and how they complicate
the PCI address space allocation problem, how Linux and Windows deal with
this problem, and their limitations.
He explains his approach, which involves expressing the problem in a
high-level language and separating the device-access logic from the
PCI allocation logic, and the reasoning behind using CLP and ECLiPSe.
He walks through the steps involved in this solution, showing one example
of how the problem can be expressed as a CLP problem and another example
of how to handle the quirks using this approach.
He explains the implementation of this solution inside Barrelfish and how
it fits into the Barrelfish boot sequence.
He continues with an evaluation, which shows that this solution works
correctly and is maintainable thanks to its small code base. He shows how
compactly the PCI allocation problem can be expressed, and that handling
additional quirks is easy. He compares the LOC count of the Linux solution
with that of his own.
He concludes his session by stating that the PCI configuration problem is
hard, and that a declarative language is a promising approach for dealing
with a large, diverse and evolving hardware base.
Question: Don't you have a garbage collector in the ECLiPSe solver? You are
consuming 60MB of memory in 36ms, i.e. at a rate of more than 1GB/s, which
looks like running without a garbage collector.<
>
Answer: The garbage collector is active in ECLiPSe. The 60MB of memory is
statically allocated during initialization for the ECLiPSe solver heap, and
this size does not change during execution.
Question: Does ECLiPSe really need 60MB of memory?<
>
Answer: Yes. ECLiPSe creates its dynamic structures on the heap, and the
garbage collector then removes these dynamic structures, but we still keep
the 60MB of memory. So 60MB is the high-water mark for memory usage. It is
not that bad, as we use ECLiPSe for many other services and hence the cost
gets amortized.
Question: Is it 60MB per core? Does each core need its own copy?<
>
Answer: No, it is a central service, so it is allocated only once.
Question: Can it solve the optimization problem incrementally?
And is PCI hot-plugging supported in Barrelfish?<
>
Answer: Not implemented in Barrelfish, so we don't know for sure,
but in theory it should be doable with this approach.
--------------------------------------------------
''Speaker:'' '''Ross !McIlroy (Microsoft)'''<
>
''Topic:'' '''Calico: rethinking the language / runtime-system boundary'''<
>
Ross explains Calico, a parallel runtime system. He motivates this work
by giving some context about the history of using parallelism for
performance improvement, and explains how the assumptions made by HPC
systems do not hold on recent multi-core commodity systems.
He then gives an overview of the Calico system and the challenges it
tackles, and explains the task-based, fork-join programming model used
inside Calico and the advantages of this approach.
He explains the Calico programming model and the usage of the "async",
"par" and "finish" directives using a small code sample.
Andrew: Do you really want to mix 'async' and 'par' in the same block? (slide 4)<
>
Answer: It depends on the application requirements. Sometimes you want to mix
them, sometimes you definitely don't want to mix them. So it seems they can
work together.
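For readers unfamiliar with the model, here is a generic fork-join sketch in C with pthreads (our illustration; Calico's actual async/par/finish syntax is on the slides): a finish scope forks tasks and joins them all before continuing.
{{{
/* Generic fork-join sketch in C/pthreads, mimicking an async/finish
 * block; illustrative only, not Calico's actual syntax. */
#include <pthread.h>
#include <stdio.h>

#define NTASKS 4

static void *task(void *arg)
{
    int id = *(int *)arg;
    printf("task %d running\n", id);  /* the 'async' body */
    return NULL;
}

int main(void)
{
    pthread_t t[NTASKS];
    int ids[NTASKS];

    /* 'finish {' : fork all tasks ... */
    for (int i = 0; i < NTASKS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, task, &ids[i]);
    }
    /* '}' : ... and join them all before continuing. */
    for (int i = 0; i < NTASKS; i++)
        pthread_join(t[i], NULL);

    puts("all tasks finished");
    return 0;
}
}}}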
He takes FluidAnimate, a benchmark from the PARSEC benchmark suite, as an
example and explains it. He then explains the static partitioning approach
taken by OpenMP and Intel Parallel Building Blocks, and its limitations.
He then explains the dynamic partitioning, or work-stealing, approach used
in Cilk-5 and Wool, and the spawn/sync overhead of these implementations.
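To illustrate what work-stealing means here, a simplified, mutex-based deque sketch (ours; Cilk-5 and Wool use lock-free fast paths, which is exactly where the spawn/sync overhead discussion applies): the owner pushes and pops at the bottom, while idle workers steal from the top.
{{{
/* Simplified work-stealing deque sketch (not Cilk-5 or Wool code).
 * Owner pushes/pops at the bottom; thieves steal from the top.
 * A fixed ring of 1024 slots; no overflow check, for brevity. */
#include <pthread.h>

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task_t;

typedef struct {
    task_t buf[1024];
    int top, bottom;        /* steal at top, push/pop at bottom */
    pthread_mutex_t lock;   /* simplification: one lock per deque */
} deque_t;

void push(deque_t *d, task_t t)      /* owner: spawn */
{
    pthread_mutex_lock(&d->lock);
    d->buf[d->bottom++ % 1024] = t;
    pthread_mutex_unlock(&d->lock);
}

int pop(deque_t *d, task_t *t)       /* owner: run local work / sync */
{
    pthread_mutex_lock(&d->lock);
    int ok = d->bottom > d->top;
    if (ok) *t = d->buf[--d->bottom % 1024];
    pthread_mutex_unlock(&d->lock);
    return ok;
}

int steal(deque_t *d, task_t *t)     /* thief: take from the other end */
{
    pthread_mutex_lock(&d->lock);
    int ok = d->bottom > d->top;
    if (ok) *t = d->buf[d->top++ % 1024];
    pthread_mutex_unlock(&d->lock);
    return ok;
}
}}}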
Question: Have you measured the context-switch costs? Won't the cost change
depending on whether the work is stolen or not? (slide 17)<
>
Answer: The cost was taken from the paper and was not measured directly by
the authors. Ross agrees with the observation that the cost changes
depending on whether the work is stolen.
Richard Black: How does that compare to the minimum number of cycles needed
to do the computation? (slide 17)<
>
Answer: 10 cycles (on the slides).
Ross then explains disciplined partitioning, which attempts to combine the
benefits of the two approaches above and is used inside Calico. He then
shows the Calico version of the FluidAnimate code and explains how it deals
with the limitations of those two approaches.
He then shows evaluations comparing Calico with the PARSEC native version,
both without and with competition for CPU time.
Question: What architecture are you running on? (slide 44)<
>
Answer: This was a 4-core Intel workstation. We never got it running on a
large CPU server.
Question: What is the graph normalized to? (slide 44)<
>
Answer: The graph shows wall-clock execution time, normalized to the
sequential (no. of cores == 1) execution time.
Question: What is the other domain competing for the CPU? (slide 45)<
>
Answer: It's a very simple dummy application that takes up the CPU.
Question: Are you running this on Barrelfish?<
>
Answer: Yes. There is no clever thread migration going on in these runs.
Kornilios Kourtis: What exactly is PARSEC using for running threads? (slide 45)<
>
Answer: PARSEC versions can differ, but this version is using pthreads.
Kornilios Kourtis: And what type of partitioning is used here? (slide 45)<
>
Answer: This experiment uses static partitioning, with the number of
threads equal to the number of cores.
Simon Peter: Do you know if the PARSEC threads are pinned to cores?<
>
Answer: Yes, on Barrelfish they will be pinned to cores.
Simon Peter: Can Linux load-balancing be observed for this
benchmark?<
>
Answer: Linux does not migrate threads for this benchmark.
Ross then discusses the current status of the work and the future directions
for OS/Runtime system co-design.
Kornilios Kourtis: About the space-time continuum: my thinking was that if
the workload is somewhere in the middle, you shouldn't be too far from the
optimal results. For example, if you have partitions with chunks large
enough, then you won't have dispatch overhead, so that should be good in
any case. I was wondering if you have any results that contradict that?<
>
Answer:
Often this is the case. The problem is working out the proper chunk
size that does not cost you a lot in dispatch overhead.
Question: Have you heard about Habanero-C or Habanero-Java?<
>
Answer: No.
--------------------------------------------------
''Speaker:'' '''Marcin Orczyk & Calum !McCall (U. Glasgow)'''<
>
''Topic:'' '''GHC for a multi-kernel architecture'''<
>
Marcin starts by explaining the Islands RTS, a Haskell runtime system
for machines with partial cache coherence. He justifies the reasoning
behind the assumption of partial cache coherence and of systems composed
of a number of islands.
He then gives an overview of the Islands runtime system and its inspiration
from GHC7 and GUM, explaining how the Islands RTS works between islands and
how it works within an island.
He then explains evil static thunks and the issues they caused during
development.
He then presents the current status and the results observed.
Calum then takes over the presentation and explains his future plan to port
the Islands GHC runtime to Barrelfish.
Andrew Baumann: What was bad about putting one island in one process?<
>
Answer: Nothing in principle; it was a bad idea on the authors' side.
Question: Virtual machine runtimes are like an OS, so what happens if you
merge the language runtime with the OS?<
>
Answer: The OS may not provide the features needed by GHC (e.g.
light-weight threads); that is why the GHC runtime is needed as an
additional layer.
Timothy Roscoe: If that is the case, what is the problem with the OS
interface provided by Barrelfish? If nothing, what is stopping GHC from
merging with the OS?<
>
Answer: The speaker is not familiar with Barrelfish internals, but what
would help is to know the common denominator of what GHC needs and what
Barrelfish provides.
Werner Haas: What about configurable cache coherency, where you can pick
which cores should be cache-coherent?<
>
Answer: That would be a great proposition. Currently the system can only
validate whether particular cores can go into the same island by checking
with the SKB. But if one could configure cores to selectively become
cache-coherent, then you definitely have much better flexibility.
Question: Would it help to move a core to a different island?<
>
Answer: The island is actually a heap, so even though a core migrates to a
new island, it will not take all its local data with it in the current
implementation. So, for the current implementation, it does not make sense.
=== Session 4 - Thursday ===
--------------------------------------------------
'''Speaker:''' Georgios Varisteas (KTH)<
>
'''Topic:''' Dynamic inter-core scheduling in Barrelfish<
>
The goal of this work is to increase throughput in Barrelfish by using
dynamic inter-core scheduling. Scheduling is separated into two levels,
a kernel-level and a user-level scheduler, the latter running as a
user-level library. The user-level scheduler is aware of the
application's requirements, whereas the kernel-level scheduler knows
about the available resources and manages the global state.
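A hypothetical sketch of the two-level split described in the talk (all names invented for illustration; not the actual KTH code): the user-level scheduler derives a resource request from application state, and the system-wide scheduler service, which knows the global resource state, replies with a grant.
{{{
/* Hypothetical sketch of the two-level scheduling split; every name
 * here is invented for illustration, not the real API. */
#include <stdio.h>

struct core_request {        /* what the user-level scheduler knows */
    int domain_id;
    int ready_tasks;         /* application requirement: runnable tasks */
};

struct core_grant { int ncores; };  /* what the system service decides */

static int free_cores = 8;   /* global state held by the system service */

/* System-wide scheduler service: sees global state, grants resources. */
struct core_grant system_scheduler(struct core_request req)
{
    struct core_grant g;
    g.ncores = req.ready_tasks < free_cores ? req.ready_tasks : free_cores;
    free_cores -= g.ncores;
    return g;
}

/* User-level scheduler (library side): derives its request from
 * application state and asks the service for cores. */
int main(void)
{
    struct core_request req = { .domain_id = 1, .ready_tasks = 5 };
    struct core_grant g = system_scheduler(req);
    printf("domain %d granted %d cores\n", req.domain_id, g.ncores);
    return 0;
}
}}}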
--------------------------------------------------
Questions during the presentation:
Andrew Baumann: What do you mean by Kernel scheduler? <
>
> The scheduler is not running in the kernel, but on top of the monitors
Andrew Baumann: So it is more a system service? <
>
> Yes, a distributed system service
On slide 12:
Simon Peter: So the processes gather statistics?<
>
> Yes, to minimize traffic, to prevent bottlenecks and sustain scalability
Simon Peter: So the green processes send information to the responsible scheduler for section four?<
>
> Yes
Questions at the end of the talk:
Simon Peter: Information flows from the user-level application to the
kernel-level schedulers; at what time granularity?<
>
> Less than the scheduling interval. As this is still in the
implementation phase, there are no concrete values yet.
Simon Peter: What would you like it to be? Milliseconds?<
>
> That is not simple to answer; maybe big and small samples, since
scheduling differs for different kinds of programs. There will be a more
concrete answer after some experimenting.
Referring to slide 12:
Simon Peter: A scheduler is responsible for an entire domain, which might
also span other sections. Wouldn't it be more straightforward to have
schedulers responsible for one section?<
>
> There are two schedulers: the user-level run-time and the low-level
scheduler. For example, the yellow program spans several sections, and
for each section there is only one instance of a low-level scheduler.
Simon Peter: And that is the one that has primary control over the
section?!<
>
> Over some processes, not all of them.
Simon Peter: Which one controls the yellow area again?<
>
> The naive criterion is: whichever process has more cores.
Simon Peter: My model would have been that each scheduler is
responsible for those threads of a domain within a section and for
re-allocations in this section, which might be less flexible. My hunch:
one primary scheduler per section, responsible for all threads in that
section, scales better. The question is what you would like to scale with.<
>
> Yes, it is still work in progress.
Karl-Filip Faxén: A scheduler is responsible for a set of cores, but
also for decisions affecting programs; the scheduler therefore has two
roles. I could imagine that the scheduler is responsible for gathering
the statistics for the cores in one section, as you propose, but that
there is still some need to aggregate on a program-by-program basis.
The scheduler in a section would gather and aggregate measurements from cores for applications spanning sections.
The other schedulers would just forward their measurements to this scheduler, which would then decide what needs to be done in the application.
Simon Peter: That sounds more like the model I had in mind: local
decisions, and then communication with other sections.
--------------------------------------------------
'''Speaker:''' Robert Watson<
>
'''Topic: BERI:''' an open source platform for research into h/w-s/w
interface<
>
Robert Watson talked about BERI, a system to help hardware and
operating-system developers work together again. The platform is
meant to be used to experiment with and investigate the impact of even
fine-grained changes to hardware components, along with their
parameters. This will help to develop operating systems across several
platforms, architectures and applications.
--------------------------------------------------
Timothy Roscoe:
As I understand it, most processors actually have many, many
registers; 32 main ones, which the compiler knows about and reuses.<
>
> Absolutely. One question for our capability model is how many
capability registers we should have, the cost of
context switches, and all those kinds of things. We try to see what
makes sense. For example: caches and the size of these caches. We
can change assembler instructions and op-codes, but also the number of
registers, and study their implications.
Timothy Roscoe: SW & HW people need to cooperate.
Timothy Roscoe: We use our capability system not so much for security,
but for resource allocation. For us, the attraction was in easily
figuring out who owns what, and we thought that this would be absolutely
true. Do you think that capabilities for authentication lead to problems?<
>
> That's a good question; I don't know. Capabilities are not as cheap
as everyone thinks. Some things came up. We have two types of
capabilities: a) essentially segments, b) object capabilities. Our
notion of capabilities is really, really cheap. But there are
questions like: garbage collection, address spaces, language
run-times. Address spaces being cheap, almost free, we never reuse
address spaces except in certain very specific contexts, in
particular the stack, because it is very hard to change assumptions
about the stack, e.g. when running existing software side-by-side with
new stuff.
Timothy Roscoe: Yeah, tell us about it. The basic problem we hit was
contention: you really don't want to share the bit representation of
capabilities between cores. Synchronization is done asynchronously,
lazily. Capabilities refer to an address space, which is shared.
Transferring capabilities between cores is a nearly impossible
business.<
>
> We eliminated replication, so there are no synchronization
requirements. Things across cores are delicate.
--------------------------------------------------
'''Speaker:''' Mikel Lujan<
>
'''Topic:''' Teraflux: A Manchester Perspective<
>
Mikel Lujan talked about the Teraflux project, which aims to improve
parallelism on machines with several thousand cores. These systems
cannot be exploited with today's programming models and applications.
The project seeks a complete solution that leverages the full power of
these systems by rethinking the design of applications, programming
models, the architecture and compiler techniques. Then, data-flow
principles can be applied efficiently throughout the system.
--------------------------------------------------
Timothy Roscoe: This is a slightly naughty question, so you can attack
me afterwards. The way you draw that diagram, it really looks like a
kind of waterfall. At the end you get to: "Yes, we have developed some
hardware and we can see whether it works". Is there a danger that you
end up doing the same thing as the Manchester data-flow machine did,
which is that you develop a bunch of useful techniques, but in the end,
if you put them together, the improvement is not so large after all?<
>
> I think that will not happen; we got it right. We handled absolutely
everything. It's not a waterfall; we integrate as much as we can. On
the other project, we had about 150 people who were not allowed to
talk to each other, which was difficult to handle.
This was followed by a discussion of the simulators and their licensing
models, and a comparison of the simulators.
Kornilios Kourtis: How dynamic is the graph during execution? Can it
change?<
>
> Yes.
Kornilios Kourtis: Is there a difference between task-parallel
programming and data-flow programming?
Karl-Filip Faxén: As a kind of author of task-based systems: when people
talk about task-based parallelism, they are basically talking about
nested fork/join parallelism, and that is a more limited view of
parallelism. Computations have a certain shape: for example, with a
sequential loop you would need a nested fork/join framework to have a
barrier between sequential loop iterations. With the data-flow model,
some iterations depend only on particular iterations of the previous
round of the sequential loop. That would buy you a lot of flexibility.
=== Session 1 - Friday ===
==== Zeus Gomez Marmolejo - Porting Nanos++ runtime to Barrelfish ====
Zeus wants to run OmpSs (OpenMP + StarSs) on Barrelfish to compare it on
different architectures. He has a compiler (Mercurium) that converts
OmpSs to C++, creates a dependency graph and schedules tasks. The
entire system is called Nanos.
Nanos requires dynamic libraries, but they use static linking for now.
TLS (now available) is also required. Furthermore, Barrelfish still
leaks memory, and a lot of the C++ runtime is missing from Barrelfish.
Hake was also not easy to use for Nanos: it is inflexible and not
compatible with autoconf, which makes existing apps difficult to port.
Zeus proposes a GCC cross-compiler and to use newlib to replace libc,
so GNU libstdc++ and libgcc_eh work. He specifies 3 targets (x86_64,
i386 & scc).
To show that newlib already works, he demos bash running on
Barrelfish. He notes that there are still many missing features,
such as fork(), wait(), kill() & execve(), which prevent the shell
from starting anything yet. There are also dependency loops between
libbarrelfish and libc (now newlib) that are not easy to work around,
and other small implementation problems, like a bug in memory
allocation on Barrelfish that leaks memory.
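Porting newlib to a new OS means providing the small set of system-call stubs it expects; a minimal sketch (ours, not Zeus's actual glue code) of what the missing pieces look like:
{{{
/* Sketch of the OS-specific stubs newlib expects from a port
 * (illustration, not the actual Barrelfish glue code). */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#undef errno
extern int errno;

/* fork()/execve()/kill()/wait() are among the calls bash needs but
 * which are not implemented yet, so the stubs just fail cleanly. */
pid_t _fork(void)               { errno = ENOSYS; return -1; }
int   _execve(const char *path, char *const argv[], char *const envp[])
                                { errno = ENOSYS; return -1; }
int   _kill(pid_t pid, int sig) { errno = ENOSYS; return -1; }
pid_t _wait(int *status)        { errno = ENOSYS; return -1; }

/* _sbrk() must really work, since it backs malloc(). */
void *_sbrk(ptrdiff_t incr)
{
    extern char _end;           /* linker-provided: end of BSS */
    static char *heap_end = &_end;
    char *prev = heap_end;
    heap_end += incr;           /* no bounds check, for brevity */
    return prev;
}
}}}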
GASNet uses common API code plus conduit code specific to the network
hardware. There is a conduit for Barrelfish, which creates a fully
connected network between all CPUs.
{{{
Andrew: What do you want to run on top of GASNet?
Zeus: OpenMP and other projects are using it.
}}}
{{{
Werner: ?
Zeus: Bash shell is very simple. Had to solve simple things. Need to implement newlib syscalls.
}}}
{{{
Someone: Any rough ideas about performance? 4 threads sounds like overhead?
Zeus: Not very good performance yet. We used spinlocks everywhere, but will replace them with mutexes (hoping to reduce overhead). We expect to end up with better performance than UDP.
}}}
==== Jana Giceva - Database-OS co-design ====
Jana started off by showing the diversity in current HW resources and
how this is not well handled if operating systems and databases are
not working together on this problem. New hardware trends and diverse
workloads already influence database research and appliances are an
example where tying a database closely with the operating system has
tangible benefits.
She demonstrates NUMA awareness as one example where DB/OS co-design
can provide benefits. For her project, she uses Barrelfish and the
CSCS engine, which is a column store. There are several initial
questions to answer: what is the initial performance going to be, are
there hot-spots or bottlenecks, and what is its scalability? She
answers these by comparing the Barrelfish/CSCS combination to CSCS
running on Linux, using the Amadeus flight-bookings trace as the
workload. The single-core results look alike, and varying the number
of updates does not degrade performance. Throughput is also
proportional to the datastore size.
On Barrelfish, the database can submit a manifest to the OS, which
describes its requirements. The OS is also allowed to reply to the
manifest, enabling two-way information flow.
{{{
Steve Hand: Is data in memory?
Jana: Yes.
}}}
Barrelfish is faster than Linux, due to DRAM bandwidth, and ultimately
due to the C++ library implementation, which has faster support for
strings.
{{{
Richard: If bottlenecked on DRAM then this is inconsistent?
Jana: No, we're CPU bound.
Richard: You should use a hash, instead of strcmp().
}}}
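Richard's suggestion in concrete form (a generic sketch, not the CSCS engine code): precompute a cheap hash per string so that most comparisons are a single integer compare, falling back to strcmp() only when the hashes match.
{{{
/* Sketch: compare hashes first, strcmp() only on a hash match
 * (generic 64-bit FNV-1a; not the actual CSCS engine code). */
#include <stdint.h>
#include <string.h>

static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

typedef struct { uint64_t hash; const char *str; } hstr_t;

static int hstr_equal(hstr_t a, hstr_t b)
{
    /* Different hashes => different strings; only equal hashes
     * (almost always a real match) pay for the strcmp(). */
    return a.hash == b.hash && strcmp(a.str, b.str) == 0;
}
}}}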
NUMA analysis: with NUMA awareness everything is well-distributed over
the nodes, which is not the case without it. But it doesn't impact
performance.
{{{
Someone: How big are the error bars?
Jana: Not big.
}}}
To conclude, the CSCS engine works on Barrelfish. Future work: when do
we hit a scalability bottleneck, and what is it?
{{{
Richard: Summarize how worker threads work? How often do they check? How are they synced?
Jana: No interaction with OS, except for NUMA alloc.
Richard: How are cores talking?
Jana: Shared-nothing DB, partitioned over cores, enqueue queries over all cores, cores scan their partition.
Richard: What's the unit of batching?
Jana: One rotation.
Richard: How does latency differ between Linux vs. BF?
Jana: Goal of DB design is to bound latency time, but scale number of simultaneous queries. Partition dataset, such that latency is less than bound. Put as many queries as possible such that workload is still CPU bound. Separate updates and reads. Goal is predictability, scalable throughput. Scans and updates are synchronized.
}}}
{{{
Someone: Seems like we need a better libc?
Jana: Yes! We plan to use BSD's libc.
}}}
{{{
Steve Hand: Why do you want to run this on Barrelfish?
Jana: We're doing it because it's easier to modify Barrelfish.
}}}
==== Tim Harris - Flexible hardware support for message passing ====
Tim looks at hardware support to accelerate message passing between
protection domains. Currently, Barrelfish has several layers (AC,
Flounder, UMP, CC-interconnect); Tim adds an AMP interconnect driver
and new hardware features to this. Related hardware work deals with
communication within processes, not between them.
Higher-level abstractions are hard (flow control, no direct MPB access
with protection). Related software work focuses on shared FIFO
queues; this relies on cache-coherent memory, and IPIs are
heavy-weight.
Tim wants to add just enough hardware to avoid relying on
cache-coherent memory, and to avoid the need for polling or for going
into the kernel on notification.
In this work, Tim prepares messages in registers and introduces a
hardware send(vspace, thread) mechanism. Both processes map a shared
page to exchange more data.
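A hypothetical sketch of what using such a primitive could look like from C (the hw_send() wrapper and all names here are our invention for illustration):
{{{
/* Hypothetical use of the proposed hardware send primitive. hw_send()
 * stands in for the new instruction: the message is assembled in
 * registers and delivered into a cache line of the target
 * (vspace, thread). All names are invented for illustration. */
#include <stdint.h>

#define MSG_REGS 4   /* assumption: a message fits in a few registers */

/* Placeholder for the proposed instruction; on real hardware this
 * would be a single register-to-register 'send', no kernel involved. */
static void hw_send(uint64_t vspace, uint64_t thread,
                    const uint64_t msg[MSG_REGS])
{
    (void)vspace; (void)thread; (void)msg;  /* modelled in GEM5, not in C */
}

/* Larger payloads go through a shared page both sides have mapped. */
void send_request(uint64_t vspace, uint64_t thread,
                  volatile uint64_t *shared_page, uint64_t payload)
{
    shared_page[0] = payload;              /* bulk data via shared mapping */
    uint64_t msg[MSG_REGS] = { 1 /* opcode */, 0 /* offset */, 0, 0 };
    hw_send(vspace, thread, msg);          /* register-sized message */
}
}}}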
{{{
Steve Hand: How is a thread named?
Tim: 64-bit integer.
}}}
{{{
Andrew: Can I use the thread ID as a channel ID?
Tim: Nothing would prevent multiple people to map in the same channel.
}}}
The hardware then maps virtual->physical, thread->core, sends on
interconnect, and updates the cache line on the target core.
{{{
Steve: What is state of that cacheline?
Tim: Treat like write from target core.
Someone: Similar to injecting data to cache right from NIC.
}}}
On the slow path, we speculatively allocate if the line is not present;
if the receiver is not running, we can always inject. A Thread
Translation Buffer (TTB) is introduced to map software IDs to core IDs.
The key ideas are to use existing VM mechanisms for naming and
protection and in the common case to have cache-to-cache transfers.
As the notification mechanism, Tim proposes a notify channel going
through the kernel. On the receive side, the kernel periodically
watches a bitmap of active notifications.
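A sketch of the receive-side scan over such a notification bitmap (our illustration of the mechanism, not Tim's implementation): find-first-set locates the next active channel in a single instruction.
{{{
/* Sketch of scanning a notification bitmap on the receive side
 * (illustration of the mechanism, not Tim's implementation). */
#include <stdint.h>

void poll_notifications(volatile uint64_t *bitmap, void (*deliver)(int chan))
{
    /* Atomically grab and clear the pending set. */
    uint64_t pending = __atomic_exchange_n(bitmap, 0, __ATOMIC_ACQUIRE);
    while (pending) {
        int chan = __builtin_ctzll(pending);  /* lowest set bit = channel id */
        deliver(chan);                        /* hand the event to that channel */
        pending &= pending - 1;               /* clear the bit, continue */
    }
}
}}}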
The new hardware mechanisms are implemented on the GEM5 simulator,
including the TTB, a connection whitelist, rx and tx message queues,
as well as flow control.
Preliminary results show that application performance using the new
mechanisms is much better, due to sending straight to the receiver and
not via the cache-coherence protocol. Tim used a 2-core
two-phase-commit as the evaluation workload. There are several ways to
schedule messages, and a fair scheduling strategy is the best behaved,
as it's fast with short polling times and only a bit worse with long
polling times.
The status is that an initial implementation exists for the Beehive
computer, used for numerical programs, as well as a port to GEM5 to
model non-cache-coherent shared memory. Tim can also run it on x86-64,
which ignores most of the protection questions, but can be used to run
longer experiments.
{{{
Someone: How do you write right into someone else's cacheline?
Tim: We use regular CC techniques. It's UMP with notifications. Maybe
we can use this for work-stealing systems. Have to investigate.
}}}
{{{
Kornilios: Is gang scheduling effective?
Tim: No numbers for that. Not different to other systems.
}}}
{{{
Someone: What about remote store programming?
Tim: Have to have a look.
}}}
{{{
Someone: If you have cache-coherence protocol with special store instruction, would that be the same?
Tim: Could be adapted. Here, the sender is explicit about where the send is going to.
}}}
{{{
Steve Hand: How does kernel watching work?
Tim: MONITOR/MWAIT.
Steve: Do in user-space?
Tim: Have receiver thread to MONITOR/MWAIT. It's already done.
}}}
{{{
Someone: Is power considered?
Tim: No.
Someone: You could power down cores and wait for memory?
Tim: That's MONITOR/MWAIT.
}}}