Array-like C++ containers: Four steps of trading speed

In C++ programming, the STL provides std::vector as a sequential container with a contiguous storage. It is implemented as a dynamic array that grows automatically when inserting elements.

In this blog article we first discuss performance aspects of std::vector and then look on three other array-like contiguous-storage sequential containers that can improve upon std::vector in terms of performance, which is often driven by cache locality and memory allocation operations.

std::vector

Memory layout. A std::vector is a sequential container that can hold an arbitrary number of elements. It is implemented¹ as a dynamic array with a specific growth factor, which is often 2, but not always. Whenever its capacity is increased then the std::vector has to allocate a memory for the new capacity and copy² over the old elements. A std::vector behaves just like an STL-enriched³ dynamic array and has the following memory layout:

Memory layout of std::vector

In an object-oriented software design we often implicitly obtain std::vector of std::vector structures due to twofold composition with a 0..* cardinality. For instance, a train has 0..* cars, a car has 0..* seats, and each of the compositions it is naturally modeled by a std::vector. Note that memory-wise a std::vector of std::vector is not an STL-enriched version of a multi-dimensional C++ array. The latter is stored in a contiguous chunk of memory, but the former is not:⁴

Memory layout of cascaded std::vector

Performance considerations. A contiguous storage is beneficial to cache locality, which makes std::vector more efficient than the non-contiguous sequential container std::deque for operations that do not change the capacity.⁵

A cascaded std::vector structure has a memory layout similar to std::deque, which is is sub-optimal for cache locality. Moreover, and probably more severe, the creation of a cascaded std::vector requires multiple⁶ memory allocations where a multi-dimensional array requires a single one.

In high-performance computing memory allocation is a performance killer. In cases where the size of a std::vector can be guessed it is good practice to call reserve() in order to eliminate repetitive reallocations on the way to a vector’s final size.

But even a single allocation can be a computational burden if it happens, say, during a frequently called sub-routine. On real-time systems dynamic memory allocation is prohibited or at least bad practice: A standard memory allocator does not provide constant time complexity and may break real-time guarantees. But even if a real-time memory allocator, like TLSF, is used the runtime costs for a single memory allocation are high.

Also note that a single invocation of the new operator or malloc() may cause a system call to increase heap size⁷, which then involves an entire switch to kernel space. A real-time memory allocator, of course, cannot do that.

boost small_vector

Memory layout. From memory point of view, a std::string is something similar to a std::vector. Many implementations of std::string (e.g. gcc-5 and later) implement small string optimization (SSO)⁸ (aka short string optimization), which works as follows: A string object needs to store a data pointer to the string buffer. A pointer of a 64-bit address could store 8 characters itself without allocating memory. A typical string implementation actually stores size, capacity and a pointer to the buffer. The latter two could be used directly for an internal string buffer if the size fits:

class string_with_sso {
    size_t _size;
    union {
        struct {
            size_t _capacity;
            char* _buffer;
        } heapdata;
        char smallstring [sizeof(heapdata)];
    };
    bool is_small() const {
        return _size < sizeof(heapdata) - 1;    // If we also store '\0'
    }
};

Boost 1.58 introduced a container type boost::container::small_vector which — similar to SSO — contains in-place space for a fixed number N of elements. If the size exceeds N then dynamic memory allocation takes place. For vectors this technique is also called small buffer optimization.

Performance considerations. A car may comprise an arbitrary number of wheels but in almost all cases it does not exceed four. If we implement this composition using a small_vector we effectively eliminate memory allocation, which is especially beneficial for real-time applications.

Moreover, a train — std::vector of cars — points to a contiguous chunk of memory that contains all cars with the wheels embedded (for cars with at most four wheels). That is, a std::vector over small_vector is (typically) memory-wise contiguous just as a multi-dimensional C++ array. This again improves cache locality:

Memory layout of cascaded small_vector

In case that a std::vector is created temporarily on the stack — e.g., a local variable of a function holding intermediate results of a computation — then a small_vector may hold its elements entirely on the stack with no memory allocation involved at all, just as static C++ arrays do.⁹

Using small_vector comes with an additional cost for the in-place buffer.¹⁰ Indications that a small_vector may not help are: (i) There is no reasonable “small” number N, (ii) in many cases the size is much smaller than N, or (iii) the size of an element is very large. In all theses cases the original motivation — improving cache locality — may not be achieved by small_vector. The cost reduction by eliminating memory allocations may be nullified by a larger memory footprint. Never optimize without measuring/profiling; check your hypotheses even if they appear obviously correct.

boost static_vector

Memory layout. The small buffer optimization requires additional logic to distinguish the two cases (i) data is held in-place and (ii) data is held externally. If we know that the size of a small_vector will not exceed N then we can turn this fact into a design decision and use static_vector instead. Boost 1.56 introduced static_vector as a contiguous-storage sequential container that can hold a number of elements between zero and a fixed maximum capacity N.

A static_vector only needs to store the size and requires a fixed buffer for capacity-many elements.¹¹

Performance considerations. In comparison to small_vector, a static_vector does not need the above mentioned case distinction at runtime, which eliminates costly control flow branches. But also the memory footprint is reduced because the capacity does not need to be stored. Note that in some sense space is time due to the memory hierarchy.

std::array

A static_vector is still more flexible than a C++ array over N elements. The latter has exactly N elements, which are initially instantiated, whereas the former has a variable number of elements, which can be initially zero. So we can strip a static_vector further down and we receive the std::array introduced in C++11. A std::array is only an STL-enriched static C++ array. The size of a std::array<char, N> is just min(1, N).

Conclusion

In a certain sense we drew a line from std::vector to std::array on which we trade a gain of speed for reduced flexibility (or a higher memory footprint). Very often the full flexibility of a std::vector is not required but it is merely used as a convenient standard choice that gives access to the entire STL arsenal. It is also a convenient standard choice for one-to-many relationships in an object oriented design.

By taking deliberate reductions of flexibility into account significant performance improvements can be achieved. Nicholas Ormrod reports a 1% performance gain of the entire Facebook C++ code base by switching to a fbstring by Andrei Alexandrescu with small string optimization!⁸

In my industry job at B&R I am working on a software that does motion control for a hundred of moving elements in a transport system. The software runs on a real-time operating system on a PLC. By replacing various occurrences of std::vector by boost small_vector and static_vector we could achieve significant performance improvements and heavily reduce runtime jitter.

Strictly speaking, the STL does not define the implementation. However, it defines time complexities on operations, guarantees on iterators, and the like that basically leave no other reasonable choice. ↩
Actually, one could leave the old storage and copy a constant number of old elements whenever an insertion takes place. This way the amortized constant time push_back() could become a worst-case constant time operation. However, this strategy violates STL-defined properties of std::vector, like element access via pointer arithmetic, which is possible for a single contiguous storage. ↩
By “STL-enriched” we mean that it conforms to the iterator concept, which again makes is usable for STL algorithms or container adaptors like std::priority_queue. ↩
Also a std::vector of std::vector can model more than a two-dimensional array can, as each sub-vector could have different length. Different cars of a train may have different number of seats.

The syntax of the C# programming language distinguishes between a jagged array (array of arrays) and a multi-dimensional array, e.g., int[][] versus int[,]. See also the indexing of multi-dimensional numpy arrays in Python. (Thanks for the remark, Gerhard). ↩
Generalities concerning performance are dangerous. A std::deque is typically implemented as an array of arrays. This, for instance, allows push_back to be in constant time time, whereas for std::vector it is only amortized constant time. The key difference, which gives std::deque the name, is that it also provides a push_front(), and it is constant time, too. The internal two-level structure, however, comes at a cost that is not covered by the Big-O notation, but still relevant for high-performance or real-time code. ↩
In our example, we need the number of cars plus one allocation operations. For a deeper composition cascade we need to consider a product of cardinalities, which grows exponentially with the composition depth. ↩
Under Unix-type systems this system call is brk. ↩
See also the CppCon 2016 talk The strange details of std::string at Facebook by Nicholas Ormrod. ↩ ↩²
Note that the stack size has a limit defined by the operating system. Under Linux the command ulimit -s prints the stack size in kB, which is 8192 on my system. ↩
On my 64-bit Gentoo Linux with boost-1.65 and gcc-7.3.0 a std::vector<char> has a size of 24 bytes and a small_vector<char, N> has a size of min(32, 24 + N) bytes, round down to multiples of 8 bytes. ↩
On my 64-bit system a static_vector<char, N> has a size of min(16, 8 + N) bytes, round up to multiples of 8 bytes. ↩