We've investigated this, but it isn't efficient in practice. The problem is that the child nodes of a prefetch path depend on the data of their parent, so a prefetch path is effectively fetched sequentially. Merging the data on separate threads sounds tempting, but it adds a lot of overhead: a parent can't safely be shared among multiple threads, so it can't be the merge target for multiple children at the same time, and that is exactly the scenario you run into. It therefore requires locking, which ends up being slower.
On top of that, many applications are web applications, where you want to handle multiple requests at once instead of clogging the server with a single request that uses parallel threads.
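
To make the sequential dependency concrete, here is a minimal sketch in Python. It is not the actual prefetch path implementation; the in-memory tables and the names `fetch_children`, `merge` and `fetch_graph` are hypothetical and exist only for illustration. It shows that each level's query needs the keys of the level above it, and that the merge writes into the parent objects.

```python
from collections import defaultdict

# Hypothetical in-memory "tables" standing in for a database.
CUSTOMERS = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
ORDERS = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 1},
    {"id": 12, "customer_id": 2},
]
ORDER_LINES = [
    {"id": 100, "order_id": 10},
    {"id": 101, "order_id": 12},
    {"id": 102, "order_id": 12},
]


def fetch_children(table, fk, parent_ids):
    """Simulate 'SELECT * FROM table WHERE fk IN (parent_ids)'."""
    return [dict(row) for row in table if row[fk] in parent_ids]


def merge(parents, children, fk, attr):
    """Attach each child to its parent, matching on the foreign key."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[fk]].append(child)
    for parent in parents:
        # This write mutates the parent object; with several threads
        # merging different child sets into the same parents, it is the
        # step that would need a lock.
        parent[attr] = by_parent.get(parent["id"], [])


def fetch_graph():
    # Level 1: the root fetch has no dependencies.
    customers = [dict(row) for row in CUSTOMERS]
    # Level 2: cannot start until the customer ids are known.
    orders = fetch_children(ORDERS, "customer_id", {c["id"] for c in customers})
    merge(customers, orders, "customer_id", "orders")
    # Level 3: cannot start until the order ids are known.
    lines = fetch_children(ORDER_LINES, "order_id", {o["id"] for o in orders})
    merge(orders, lines, "order_id", "lines")
    return customers
```

Calling `fetch_graph()` returns the customers with their orders and order lines attached. Only sibling branches under the same parent could run side by side, and those are precisely the ones that merge into the same parent objects.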