Windyland serving my blog

LLVM 6.0.0 Release

LLVM 6.0.0 is finally released. Although I keep updating its and Clang’s docsets for Dash, I haven’t had time to follow the changes closely due to my busy work. After all, it has been a long time since I last contributed to LLVM, back when it was still on the 3.5 release. In this post I try to pick it up again and walk through the new features and changes of recent years.


Let’s celebrate LLVM’s 6.0.0 release. From the announcement email, it contains many new features and changes:

  • Significantly improved quality of CodeView debug info for Windows
  • [X86] Retpoline Spectre variant 2 mitigation
  • [X86] Improved scheduling on several x86 micro-architectures
  • [AArch64] GlobalISel by default for AArch64 at -O0

Clang has many changes as well:

  • Clang supports the -mretpoline flag to enable retpolines.
  • Clang defaults to -std=gnu++14 instead of -std=gnu++98
  • Clang supports some upcoming C++2a features
  • Improved optimizations, new compiler warnings, many bug fixes, and more

Windows Support

Another milestone that the email doesn’t mention is that Clang is now used to build Chrome for Windows.

It is a great time for Clang: it is now used in production on all major operating systems. Looking back to 2015, we were still struggling to support the MSVC ABI and language extensions, something GCC never finished. I remember Microsoft helped by providing license-compatible headers and documentation for the PDB file format, which Clang uses to produce and read PDB files. Although Clang/C2 development is halted, clang-cl can now produce binaries accepted by most of the Microsoft toolchain.

This blog explains very well the details that users want to know about the Windows support.


The most interesting part of this update is Retpoline support in Clang, a compiler technique that isolates indirect branches from speculative execution, making them immune to Spectre Variant 2.

Back in 2017, three CPU bugs were found in Intel’s and other x86 CPUs, notably Meltdown, Spectre Variant 1, and Spectre Variant 2. On the affected processors, Meltdown exposes kernel memory to unauthorized programs. Meltdown can be fixed with a CPU microcode update and an operating system patch, while the Spectre issues are much larger and harder to deal with. According to the [FAQ][spectre-fix], patching software with compiler support works for Spectre.

Before Clang 6.0 came out, the only production compiler supporting Retpoline was GCC 7.3.0.

Learning Vulkan - Part 1

Recently I have been writing some graphics programs and found Vulkan’s API really interesting, so I decided to spend some time on it.

Download the SDK and get it compiled

Let’s start with Vulkan’s SDK, which you can download from LunarG.

Some might choose to use Linux. Intel, Nvidia, and AMD all support Vulkan in their drivers, so Linux is quite reasonable, especially for Nvidia cards, whose Linux support is great. However, since Vulkan is still quite new, I recommend using Windows if you are on an Intel card.

The code below is written in C++:

int main() {
  SampleApplication app;
  auto res = app.CreateInstance();
  if (res) {
    printf("error, failed to create instance\n");
    return -1;
  }
  printf("Vulkan instance created\n");
  return 0;
}

Since all you want at this point is to load Vulkan’s driver, it is cleaner to have only one “CreateInstance” API on SampleApplication.

In fact, Vulkan’s C API is designed in the same way. The definition of SampleApplication:

class SampleApplication {
 public:
  SampleApplication() {
    app_info.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app_info.pApplicationName = "Vulkan Example";
    app_info.applicationVersion = 1;
    app_info.pEngineName = "VV";
    app_info.engineVersion = 1;
    app_info.apiVersion = VK_API_VERSION_1_0;
  }
  ~SampleApplication() {
    if (inst) {
      vkDestroyInstance(inst, nullptr);
      inst = nullptr;
    }
  }
  VkResult CreateInstance() {
    VkInstanceCreateInfo inst_info = {};
    inst_info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    inst_info.flags = 0;
    inst_info.pApplicationInfo = &app_info;
    inst_info.enabledLayerCount = 0;
    inst_info.ppEnabledLayerNames = nullptr;
    inst_info.enabledExtensionCount = 0;
    inst_info.ppEnabledExtensionNames = nullptr;
    return vkCreateInstance(&inst_info, nullptr, &inst);
  }

  VkApplicationInfo app_info = {};
  VkInstance inst = nullptr;
};

Vulkan’s API vkCreateInstance is called to create the Vulkan instance in SampleApplication::CreateInstance, and vkDestroyInstance is called in its dtor.

The program is now complete enough to find out whether the system has Vulkan support.

So let’s get it compiled. You need to include Vulkan’s header from the SDK directory:

#include <vulkan/vulkan.h>

And link it with the Vulkan loader library, named vulkan-1.dll under Windows (libvulkan.so under Linux).

Once you compile this program and run it successfully, you will see the “Vulkan instance created” message on the console.

What’s next

Vulkan has a built-in queue architecture, and command buffer pools are managed by the application rather than the driver.

Before you can send a command such as vkCmdSetLineWidth() to the GPU, you need to allocate a command buffer. Before you can allocate a command buffer, you need to set up a command buffer pool. And before you can set up a command buffer pool, you need to specify the queue.

These steps are different from OpenGL, which handles command buffers and queues for you. Vulkan is designed to expose more details of the graphics hardware, making it possible to unleash the hardware’s full power; meanwhile, it makes Vulkan harder to program with.

What we are doing is similar to the vulkaninfo program, which queries the Vulkan driver’s extensions, every physical device’s properties, their queues, and their own extensions.

  std::vector<VkPhysicalDevice> pdevices;
  res = app.EnumeratePhysicalDevices(&pdevices);
  if (res || pdevices.empty()) {
    printf("no gpu found, res %d\n", res);
    return -1;
  }
  printf("found %zu physical devices\n", pdevices.size());

  // investigate Physical Device
  for (auto i = 0U; i < pdevices.size(); ++i) {
    auto pdevice = pdevices[i];
    VkPhysicalDeviceProperties properties;
    vkGetPhysicalDeviceProperties(pdevice, &properties);
    printf("\t PhysicalDevice%u. %s api %u.%u.%u\n", i, properties.deviceName,
           VK_VERSION_MAJOR(properties.apiVersion),
           VK_VERSION_MINOR(properties.apiVersion),
           VK_VERSION_PATCH(properties.apiVersion));
  }
  printf("using physical device 0\n");

  // investigate Physical Device 0's Queues
  auto pdevice = pdevices[0];
  uint32_t queue_properties_count;
  std::vector<VkQueueFamilyProperties> queue_properties;
  // first call queries the count, second call fills the array
  vkGetPhysicalDeviceQueueFamilyProperties(pdevice, &queue_properties_count,
                                           nullptr);
  queue_properties.resize(queue_properties_count);
  vkGetPhysicalDeviceQueueFamilyProperties(pdevice, &queue_properties_count,
                                           queue_properties.data());
  uint32_t queue_family_index = UINT32_MAX;
  for (auto idx = 0U; idx < queue_properties.size(); ++idx) {
    auto queue_property = queue_properties[idx];
    printf("\t QueueIdx%u. ", idx);
    if (queue_property.queueFlags & VK_QUEUE_GRAPHICS_BIT) {
      queue_family_index = idx;
      printf("Graphics ");
    }
    if (queue_property.queueFlags & VK_QUEUE_COMPUTE_BIT)
      printf("Compute ");
    if (queue_property.queueFlags & VK_QUEUE_TRANSFER_BIT)
      printf("Transfer ");
    if (queue_property.queueFlags & VK_QUEUE_SPARSE_BINDING_BIT)
      printf("SparseBinding ");
    printf("has %u depth\n", queue_property.queueCount);
  }

Here is the output on my computer:

vulkan instance created
found 1 physical devices
         PhysicalDevice0. GeForce GTX 1080 Ti api 1.0.42
using physical device 0
         QueueIdx0. Graphics Compute Transfer SparseBinding has 16 depth
         QueueIdx1. Transfer has 1 depth
         QueueIdx2. Compute has 8 depth

Create Logical Device

Once you have the queue family index, you can specify the physical device and the selected queues to create a logical device:

  VkResult CreateDevice(VkPhysicalDevice pdevice, uint32_t queue_family_index) {
    assert(inst && !device);
    VkDeviceQueueCreateInfo queue_info = {};
    queue_info.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    float queue_priorities[1] = {0.0};
    queue_info.queueFamilyIndex = queue_family_index;
    queue_info.queueCount = 1;
    queue_info.pQueuePriorities = queue_priorities;

    VkDeviceCreateInfo device_info = {};
    device_info.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    device_info.queueCreateInfoCount = 1;
    device_info.pQueueCreateInfos = &queue_info;
    graphics_queue_family_index = queue_family_index;
    return vkCreateDevice(pdevice, &device_info, nullptr, &device);
  }

This example only requests one queue, but you can do more with Vulkan’s API. You can also pass the needed extensions at this point.


To summarize those steps, you need to:

  • Start the Vulkan Instance
  • Find the right physical device, usually the first one
  • Find the right queue which is able to do graphics work
  • Create logical device from the selected queues of physical device
  • Create command buffer pool
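
The last step, creating the command buffer pool, is not shown above; a minimal sketch (assuming the `device` and `graphics_queue_family_index` members that CreateDevice sets up, and a hypothetical `cmd_pool` member) could look like:

```cpp
  VkResult CreateCommandPool() {
    assert(device);
    VkCommandPoolCreateInfo pool_info = {};
    pool_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    // Command buffers from this pool may only be submitted to queues of
    // the family selected when the logical device was created.
    pool_info.queueFamilyIndex = graphics_queue_family_index;
    return vkCreateCommandPool(device, &pool_info, nullptr, &cmd_pool);
  }

  VkCommandPool cmd_pool = VK_NULL_HANDLE;
```

From this pool you would then allocate command buffers with vkAllocateCommandBuffers, which is where the next part picks up.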

Code is available at github

Refer to VulkanSamples

Why writing software is hard.

As DHH says in his blog, writing software is hard, and I couldn’t agree more.

First of all, I should admit that I am 28, which is obviously not the age of a senior software engineer. But I am one of those people who started using software very early, at age 10 in 1999.

Coding is hard

The main thing I want to talk about is coding. DHH makes it clear that no super-duper easiness will save you from coding. Some problems do get solved by new coding techniques such as new languages, frameworks, or libraries, but not all of them.

Coding is still hard. Writing good code is not simply a matter of applying cool techniques or some super cool “best practices”.

If the code doesn’t solve the users’ problems, it means nothing.

If the code causes more problems after merged, it means nothing.

If the code breaks the system it belongs to, it means nothing.

You really want to produce a good piece of code. You learn many different things: computer science theory, algorithms, language rules, different libraries, other programmers’ code, and much more. But none of it will make you great if your code breaks things.

Coding is something about the language

Even if you work hard to learn many techniques and apply them in your code, that alone won’t get it right. So there is something between your code and good code.

Let’s think a bit further. You write your code in computer languages, and you think the language exists for you and computers only. But you are WRONG.

In fact, coding is not just coding for the computer. The final audience you write for and communicate with is not the computer, but human beings. I am not talking only about the customers, but everyone involved, including you and your team.

Maybe you are confused by this, so let me make it simpler. When you begin to write the code and ship it to customers, your team designs the product, writes the code, and reviews and verifies it before delivery. Then your customers use your product and give feedback about it. You all get involved. Ideas and thoughts are communicated via the software, or more simply, the code.

The Final Goal

So that’s what the computer stands for: an invention to compute for human beings.

PS: Happy New Year!

The Null Block Device

The Linux kernel has provided the null_blk module since 3.13; it can be used for benchmarking various block-layer implementations. It can also serve as a dummy device for diagnosing storage system issues.

Check that your kernel has the null_blk module

  • grep CONFIG_BLK_DEV_NULL /boot/config-$(uname -r)
  • or zgrep CONFIG_BLK_DEV_NULL /proc/config.gz

For example, CentOS 7 has:


Use Null blk

The usage is much like zram; just load the module:

modprobe null_blk

You will find two 250 GB block devices added:

nullb0      251:0    0   250G  0 disk
nullb1      251:1    0   250G  0 disk

and run some benchmark on them:

  • dd if=/dev/zero of=/dev/nullb0 bs=4k oflag=direct
  • dd if=/dev/nullb0 of=/dev/null bs=4k iflag=direct
  • hdparm -tT /dev/nullb0
  • aio-stress -O -s 64m -r 256k -i 1024 -b 1024 /dev/nullb0
  • fio --ioengine=libaio --readwrite=randread --bs=4k --filename /dev/nullb0 --name journal_test --thread --norandommap --numjobs=200 --iodepth=64 --runtime=30 --time_based --group_reporting
  • fio --ioengine=libaio --readwrite=randwrite --bs=4k --filename /dev/nullb0 --name journal_test --thread --norandommap --numjobs=200 --iodepth=64 --runtime=30 --time_based --group_reporting

If you turn on scsi_mq, you will see incredible IOPS in your fio benchmarks, such as 5M IOPS for randread on my machine.

Advanced Usage

The null_blk module provides many parameters users can tune:

  • submit_queues: the number of submission queues
  • gb: size in GiB
  • bs: block size in bytes
  • nr_devices: number of devices to register
  • completion_nsec: time in ns to complete a request in hardware
  • hw_queue_depth: queue depth for each hardware queue

For example, on the Debian backported 4.6.0 kernel:

sudo modinfo null_blk
filename: /lib/modules/4.6.0-0.bpo.1-amd64/kernel/drivers/block/null_blk.ko
license:        GPL
author:         Jens Axboe <>
intree:         Y
vermagic:       4.6.0-0.bpo.1-amd64 SMP mod_unload modversions
parm:           submit_queues:Number of submission queues (int)
parm:           home_node:Home node for the device (int)
parm:           queue_mode:Block interface to use (0=bio,1=rq,2=multiqueue)
parm:           gb:Size in GB (int)
parm:           bs:Block size (in bytes) (int)
parm:           nr_devices:Number of devices to register (int)
parm:           use_lightnvm:Register as a LightNVM device (bool)
parm:           irqmode:IRQ completion handler. 0-none, 1-softirq, 2-timer
parm:           completion_nsec:Time in ns to complete a request in hardware.  Default: 10,000ns (ulong)
parm:           hw_queue_depth:Queue depth for each hardware queue. Default: 64
parm:           use_per_node_hctx:Use per-node allocation for hardware context queues. Default: false (bool)

These parameters are also described in the null_blk kernel documentation.


Some notes on atomic operations

Atomic operations are quite useful in concurrent programming, notably in implementations of lock-free algorithms. Often, when a locking algorithm becomes the performance bottleneck, atomic operations come to the rescue.

Atomic ordering

LLVM defines the following atomic orderings, from weakest to strongest:

  • NotAtomic (regular load and store)
  • Unordered (to match the Java safe-language memory model)
  • Monotonic (or memory_order_relaxed)
  • Acquire (or memory_order_acquire and memory_order_consume)
  • Release (or memory_order_release)
  • AcquireRelease (or memory_order_acq_rel)
  • SequentiallyConsistent (or memory_order_seq_cst)

Platform implementations


on X86

All atomic loads generate a MOV; SequentiallyConsistent stores generate an XCHG, while other stores generate a MOV.

on ARM (before v8) and MIPS

Acquire, Release, and SequentiallyConsistent require barrier instructions for every such operation. Loads and stores generate normal instructions.

Language Standard and Compiler, library implementations

  • The new C++11 atomic header and C11 stdatomic.h header
  • Java-style volatile variables (match SequentiallyConsistent)
  • gcc __sync_* builtins (match SequentiallyConsistent)

If you want to use atomic operations before committing to the new standards, there are compiler builtin functions and libraries available that emit the proper asm instructions for these atomic operations.

gcc atomic operations Built-in Functions:


A Small Benchmark

I wrote a small program to benchmark the performance of atomic operations against mutexes and spinlocks. It is hosted on GitHub. You can clone the repository and execute make. It requires a modern C++ compiler with C++11 support.

This sample output is from my 4-core late-2013 MacBook Pro:

jobs:  1 total time:  614932094 ns average time:  6149 ns (Mutex)
jobs:  1 total time:    5132226 ns average time:    51 ns (SpinLock)
jobs:  1 total time:    4172785 ns average time:    41 ns (Atomic)
jobs:  2 total time:  746401303 ns average time:  7464 ns (Mutex)
jobs:  2 total time:   15983439 ns average time:   159 ns (SpinLock)
jobs:  2 total time:    9356120 ns average time:    93 ns (Atomic)
jobs:  4 total time:   42609222 ns average time:   426 ns (SpinLock)
jobs:  4 total time:   13734551 ns average time:   137 ns (Atomic)
jobs:  8 total time:  107958834 ns average time:  1079 ns (SpinLock)
jobs:  8 total time:   30681228 ns average time:   306 ns (Atomic)
jobs: 16 total time:  213277915 ns average time:  2132 ns (SpinLock)
jobs: 16 total time:   52189737 ns average time:   521 ns (Atomic)

Further Reading