Overview

Recently, I began exploring the performance of concurrent filesystem tree walkers across various programming languages. I assumed that my C++ implementation would match or even exceed the speed of my Rust version. To my surprise, C++ lagged behind all my other implementations, including a single-threaded JavaScript version. Even with page caching, C++ was noticeably slower. This unexpected outcome prompted me to delve deeper into the underlying reasons for the performance disparity.

The Algorithm

The algorithm starts at the root of a storage device and traverses the filesystem in a depth-first manner, examining every directory and file. When a directory is encountered, a task is dispatched to a work-stealing thread pool, allowing for concurrent traversal of multiple directories. At the end of the process, the total number of files and directories, as well as the combined size of all files, is calculated. Symlinks are deliberately excluded from traversal, ensuring a more straightforward analysis.

The C++ Implementation

The C++ implementation is pretty straightforward. I used std::filesystem to walk through the directories and files. I am also using a threadpool library called BS::thread_pool which is a work-stealing thread pool implemented by Barak Shoshany.

#include <iostream>
#include <utility>
#include <filesystem>
#include <mutex>
#include <vector>
#include "mimalloc-new-delete.h" // enable mimalloc new/delete overloads
#include "BS_thread_pool.hpp"

namespace fs = std::filesystem;
namespace chrono = std::chrono;

class file_metadata {
public:
    std::string name;
    long size;
    chrono::time_point<chrono::file_clock> creation_time;
    chrono::time_point<chrono::file_clock> modification_time;
    chrono::time_point<chrono::file_clock> access_time;
    bool is_directory;
    bool is_file;

    file_metadata(std::string name, long size,
                  chrono::time_point<chrono::file_clock> creation_time,
                  chrono::time_point<chrono::file_clock> modification_time,
                  chrono::time_point<chrono::file_clock> access_time,
                  bool is_directory, bool is_file)
            : name(std::move(name)), size(size), creation_time(creation_time),
              modification_time(modification_time), access_time(access_time),
              is_directory(is_directory), is_file(is_file) {}
};

class dirstat {
private:
    std::vector<file_metadata> entries;
    std::mutex entries_mutex;
    BS::thread_pool pool;

    auto read_dir(fs::path const& directory_path) -> void {
        for(const auto &entry: fs::directory_iterator(directory_path)) {
            auto& entry_path = entry.path();

            // Skip symlinks
            if (fs::is_symlink(entry_path)) {
                continue;
            }

            if (entry.is_directory()) {
                std::scoped_lock lock(entries_mutex);
                entries.emplace_back(entry_path.string(), 0, fs::last_write_time(entry_path),
                                     fs::last_write_time(entry_path), fs::last_write_time(entry_path),
                                     true, false);

                pool.detach_task([this, entry_path] {
                    return read_dir(entry_path);
                });
            } else if (entry.is_regular_file()) {
                auto file_size = fs::file_size(entry_path);

                std::scoped_lock lock(entries_mutex);
                entries.emplace_back(entry_path.string(), file_size, fs::last_write_time(entry_path),
                                     fs::last_write_time(entry_path), fs::last_write_time(entry_path),
                                     false, true);
            }
        }
    }

public:
    auto walk_storage(fs::path const& path) -> std::vector<file_metadata> const& {
        read_dir(path);
        pool.wait();
        return entries;
    }
};

auto main() -> int {
    std::string root = "/mnt/sn850x";

    dirstat ds;
    auto& result = ds.walk_storage(root);

    long total_size = 0;

    for (const auto &entry: result) {
        total_size += entry.size;
    }

    double total_size_mb = (double)total_size / (1024.0 * 1024.0);
    double total_size_gb = (double)total_size / (1024.0 * 1024.0 * 1024.0);

    std::cout << "[+] Total size of all files: " << total_size << " B\n";
    std::cout << "[+] Total size of all files: " << std::format("{}", total_size_mb) << " MB\n";
    std::cout << "[+] Total size of all files: " << std::format("{}", total_size_gb) << " GB\n";
    std::cout << "[+] Number of files: " << result.size() << "\n";

    return 0;
}

Note: You may notice that fs::last_write_time(entry_path) is called multiple times during construction of the file_metadata class. This is not a mistake but a workaround for the missing fs::creation_time and fs::access_time functions in the std::filesystem library. If you need similar functions you can use the boost::filesystem library.

The Java Implementation

The Java implementation is pretty straightforward as well. I used the java.nio.file package to walk through the directories and files. I also used the ExecutorService to create a work-stealing thread pool to walk through the directories concurrently.

package sg.edu.ntu;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {
    record FileMetadata(
            String name,
            long sz,
            FileTime creationTime,
            FileTime modificationTime,
            FileTime lastAccessTime,
            boolean isDirectory,
            boolean isFile
    ) {
        static FileMetadata tryFromPath(Path path) {
            try {
                BasicFileAttributes attributes = Files.readAttributes(path, BasicFileAttributes.class);
                boolean isDirectory = attributes.isDirectory();

                return new FileMetadata(
                        path.toString(),
                        isDirectory ? 0 : attributes.size(),
                        attributes.creationTime(),
                        attributes.lastModifiedTime(),
                        attributes.lastAccessTime(),
                        isDirectory,
                        attributes.isRegularFile()
                );
            } catch (IOException e) {
                System.out.printf("[-] Got IOException: %s\n", e);
            }

            return null;
        }
    }

    public static void main(String[] args) {
        List<FileMetadata> files = walkStorage("/mnt/sn850x");

        long totalSz = files.stream().mapToLong(FileMetadata::sz).sum();
        double totalSzMb = totalSz / (1024.0 * 1024.0);
        double totalSzGb = totalSz / (1024.0 * 1024.0 * 1024.0);

        System.out.printf("[+] Total size of all files: %d B\n", totalSz);
        System.out.printf("[+] Total size of all files: %f MB\n", totalSzMb);
        System.out.printf("[+] Total size of all files: %f GB\n", totalSzGb);
        System.out.printf("[+] Number of files: %d\n", files.size());
    }

    private static List<FileMetadata> walkStorage(String path) {
        try(ExecutorService pool = Executors.newWorkStealingPool()) {
            List<FileMetadata> fileMetadata = Collections.synchronizedList(new ArrayList<>());

            readDir(path, fileMetadata, pool);
            return fileMetadata;
        } catch (Exception e) {
            return List.of();
        }
    }

    private static void readDir(String path, List<FileMetadata> fileMetadata, ExecutorService pool) {
        File baseFile = new File(path);
        File[] files = baseFile.listFiles(file -> !Files.isSymbolicLink(file.toPath()));

        if(files != null) {
            for(File file: files) {
                if(file.isDirectory()) {
                    fileMetadata.add(FileMetadata.tryFromPath(file.toPath()));
                    pool.submit(() -> readDir(file.getAbsolutePath(), fileMetadata, pool));
                } else {
                    fileMetadata.add(FileMetadata.tryFromPath(file.toPath()));
                }
            }
        }
    }
}

The benchmark

All benchmarks were done on the following machine. time command was used to measure the time taken.

  • CPU: AMD Ryzen 7 5800X3D @ 4.5GHz
  • RAM: 32GB DDR4 3200MHz
  • Storage: WD_BLACK SN850X 1TB NVMe SSD
  • OS: CachyOS Linux x86_64 Rolling Release
  • Kernel: 6.8.7-3-cachyos
  • C++ Compiler: clang version 17.0.6 and GCC 13.2.1
  • Java Version: OpenJDK 22 2024-03-19

Dropping page cache

sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

With Page Cache (Java)

[+] Total size of all files: 56254073733 B
[+] Total size of all files: 53648.065312 MB
[+] Total size of all files: 52.390689 GB
[+] Number of files: 414615

________________________________________________________
Executed in  458.40 millis    fish           external
   usr time    2.14 secs    177.00 micros    2.14 secs
   sys time    2.98 secs     95.00 micros    2.98 secs

With Page Cache (C++)

[+] Total size of all files: 56254068454 B
[+] Total size of all files: 53648.06027793884 MB
[+] Total size of all files: 52.39068386517465 GB
[+] Number of files: 414615

________________________________________________________
Executed in    3.74 secs    fish           external
   usr time    0.76 secs  177.00 micros    0.76 secs
   sys time    4.74 secs  109.00 micros    4.74 secs

Without Page Cache (Java)

[+] Total size of all files: 56254073734 B
[+] Total size of all files: 53648.065313 MB
[+] Total size of all files: 52.390689 GB
[+] Number of files: 414615

________________________________________________________
Executed in  718.45 millis    fish           external
   usr time    1.86 secs    196.00 micros    1.86 secs
   sys time    4.90 secs    110.00 micros    4.90 secs

Without Page Cache (C++)

[+] Total size of all files: 56254074259 B
[+] Total size of all files: 53648.06581401825 MB
[+] Total size of all files: 52.3906892715022 GB
[+] Number of files: 414615

________________________________________________________
Executed in    3.15 secs    fish           external
   usr time    0.86 secs  182.00 micros    0.86 secs
   sys time    6.51 secs  105.00 micros    6.51 secs

Looking at the benchmark numbers, C++ is faster without page cache LMAO. Which is weird since page cache is supposed to help with performance!

What in the world? Why is Java faster than C++?

I was surprised to find that Java was faster than C++ in this case. I was expecting C++ to be faster due to the fact that it is a systems programming language and is closer to the hardware. However, it seems that the std::filesystem library is not as optimized as the java.nio.file package. I suspect that the reason for this is that std::filesystem is a wrapper around the POSIX filesystem API which makes system calls for every operation. This is evident when we look at the strace output of the C++ program. For brevity, I will only show the 1 part of the output.

Command to get strace output

strace /mnt/sn850x/dirstat/impl/cpp/cmake-build-release/cpp &> cpp.txt
strace -f java -jar /mnt/sn850x/dirstat/impl/java/out/artifacts/java_jar/java.jar &> java.txt

C++ strace output at /mnt/sn850x/.Trash-1000 (directory)

newfstatat(AT_FDCWD, "/mnt/sn850x/.Trash-1000", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/.Trash-1000", {st_mode=S_IFDIR|0700, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/.Trash-1000", {st_mode=S_IFDIR|0700, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/.Trash-1000", {st_mode=S_IFDIR|0700, st_size=4096, ...}, 0) = 0

C++ strace output at /mnt/sn850x/backup.zip (file)

newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip", {st_mode=S_IFREG|0644, st_size=12613917413, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip", {st_mode=S_IFREG|0644, st_size=12613917413, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip", {st_mode=S_IFREG|0644, st_size=12613917413, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip", {st_mode=S_IFREG|0644, st_size=12613917413, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip", {st_mode=S_IFREG|0644, st_size=12613917413, ...}, 0) = 0

Java strace output at /mnt/sn850x/.Trash-1000 (directory)

[pid 2575295] newfstatat(AT_FDCWD, "/mnt/sn850x/.Trash-1000/files",  <unfinished ...>

Java strace output at /mnt/sn850x/backup.zip (file)

[pid 2575259] newfstatat(AT_FDCWD, "/mnt/sn850x/backup.zip",  <unfinished ...>

As we can see from the strace output, the C++ program is making multiple system calls for every operation while the Java program is making only 1 system call for every operation. This corresponds exactly to the number of times fs functions are called in the C++ program!

Why are system calls slow?

System calls are inherently slow because they require a context switch from user space to kernel space. This transition involves saving the current process state, switching to kernel mode, executing the system call, and then reverting to user mode. This series of operations is resource-intensive and introduces significant overhead, which can impact performance.

Moreover, each system call disrupts the CPU pipeline, leading to additional performance loss as the pipeline must be flushed and rebuilt. This interruption also tends to invalidate the CPU cache, resulting in cache misses that further slow down the program. Overall, these factors contribute to the high cost of system calls and can significantly affect the efficiency of a process.

How to fix this?

We can call sys/stat.h functions directly to avoid the overhead of the std::filesystem library. This will reduce the number of system calls made by the C++ program and improve its performance. Here is the modified C++ program. For brevity, main function is not shown.

class file_metadata {
public:
    std::string name;
    long size;
    chrono::time_point<chrono::system_clock> creation_time;
    chrono::time_point<chrono::system_clock> modification_time;
    chrono::time_point<chrono::system_clock> access_time;
    bool is_directory;
    bool is_file;
    bool is_symlink;

    file_metadata(std::string name, long size,
                  chrono::time_point<chrono::system_clock> creation_time,
                  chrono::time_point<chrono::system_clock> modification_time,
                  chrono::time_point<chrono::system_clock> access_time,
                  bool is_directory, bool is_file, bool is_symlink)
            : name(std::move(name)), size(size), creation_time(creation_time),
              modification_time(modification_time), access_time(access_time),
              is_directory(is_directory), is_file(is_file), is_symlink(is_symlink) {}

    static auto from_path(const fs::path &path) -> file_metadata {
        // call stat on the path
        struct stat file_stat{};
        lstat(path.c_str(), &file_stat);

        chrono::time_point<std::chrono::system_clock> access_time(
                chrono::seconds(file_stat.st_atim.tv_sec) +
                chrono::nanoseconds(file_stat.st_atim.tv_nsec)
        );

        chrono::time_point<std::chrono::system_clock> modification_time(
                chrono::seconds(file_stat.st_mtim.tv_sec) +
                chrono::nanoseconds(file_stat.st_mtim.tv_nsec)
        );

        chrono::time_point<std::chrono::system_clock> creation_time(
                chrono::seconds(file_stat.st_ctim.tv_sec) +
                chrono::nanoseconds(file_stat.st_ctim.tv_nsec)
        );

        return {
                path.string(),
                S_ISDIR(file_stat.st_mode) ? 0 : file_stat.st_size,
                creation_time,
                modification_time,
                access_time,
                S_ISDIR(file_stat.st_mode),
                S_ISREG(file_stat.st_mode),
                S_ISLNK(file_stat.st_mode)
        };
    }
};

class dirstat {
private:
    std::vector<file_metadata> entries;
    std::mutex entries_mutex;
    BS::thread_pool pool;

    auto read_dir(const fs::path &directory_path) -> void {

        for(const auto &entry_path: fs::directory_iterator(directory_path)) {
            auto file_metadata = file_metadata::from_path(entry_path);

            if (file_metadata.is_directory) {
                std::scoped_lock lock(entries_mutex);
                entries.emplace_back(file_metadata);

                pool.detach_task([this, entry_path] {
                    return read_dir(entry_path);
                });
            } else if (file_metadata.is_file) {
                std::scoped_lock lock(entries_mutex);
                entries.emplace_back(file_metadata);
            }
        }
    }

public:
    auto walk_storage(const fs::path &path) -> std::vector<file_metadata>& {
        read_dir(path);
        pool.wait();
        return entries;
    }
};

Notice that we are calling lstat directly on the path to get the file metadata. This reduces the number of system calls made by the C++ program and improves its performance. Here is the benchmark result with the modified C++ program.

With Page Cache (C++)

[+] Total size of all files: 56254074259 B
[+] Total size of all files: 53648.06581401825 MB
[+] Total size of all files: 52.3906892715022 GB
[+] Number of files: 414615

________________________________________________________
Executed in  217.61 millis    fish           external
   usr time    0.56 secs    176.00 micros    0.56 secs
   sys time    1.61 secs    159.00 micros    1.61 secs

Without Page Cache (C++)

[+] Total size of all files: 56254074259 B
[+] Total size of all files: 53648.06581401825 MB
[+] Total size of all files: 52.3906892715022 GB
[+] Number of files: 414615

________________________________________________________
Executed in  519.53 millis    fish           external
   usr time    0.56 secs      0.00 micros    0.56 secs
   sys time    3.43 secs    392.00 micros    3.43 secs

As we can see, the modified C++ program is now faster than the Java program. This is because we are now making fewer system calls and reducing the overhead of the std::filesystem library.

Conclusion

This article explored why Java outperformed C++ in a filesystem tree-walking benchmark. The analysis revealed that the C++ std::filesystem library was causing performance bottlenecks by making multiple system calls for each operation, leading to slower execution. To address this, we optimized the C++ program by directly using lstat to retrieve file metadata, significantly reducing the number of system calls and enhancing performance.

Additionally, Java's superior speed was attributed to the highly optimized java.nio.file package. These results demonstrate that high-level languages, with their well-optimized libraries, can sometimes surpass the performance of low-level languages, challenging common assumptions about programming language efficiency.