These days I have worked on the implementation of the interface to the OpenMP-powered assembly offered by FEniCS. Despite being potentially a one-line intervention, it proved quite tricky: indeed, with the needed addition, the fem-fenics function for system assembly broke with a huge number of run time errors, probably due to a change in the underlying data structure that is transparent to the library users, but does not go unnoticed if you need to access it directly, as fem-fenics does. This led me to leave this functionality behind.
My choice is backed by some computational experiments. They show that the approach enacted by the FEniCS library is quite effective, with times required for assembly reduced by half using four threads instead of just one. However, they are negligible compared to the linear system solve phase, even when the OpenMP parallelisation is disabled. I used a great number of mesh nodes in order to have meaningful timings: even if linear systems took some minutes for resolution, the assembly phase lasted as much as a couple of hundredth of a second in serial code. If we add to these findings that the fem-fenics package requires a just-in-time compilation lasting around half a minute, we understand that there is no point in devoting effort for the implementation of this feature.