Multi-core systems are now found in many electronic devices. But does current software design fully leverage their capabilities? The complexity of the hardware and software stacks in these platforms requires software optimization with end-to-end knowledge of the system. To optimize software performance, we must have accurate information about system behavior and time losses. Standard monitoring engines impose tradeoffs on profiling tools, making it impossible to reconcile all the expected requirements: accurate hardware views, fine-grain measurements, speed, and so on. Subsequently, new approaches have to be examined. In this article, we propose a non-intrusive, accurate tool chain, which can reveal and quantify slowdowns in low-level software mechanisms. Based on emulation, this tool chain extracts behavioral information (time, contention) through hardware side channels, without distorting the software execution flow. This tool consists of two parts. (1) An online acquisition part that dumps hardware platform signals. (2) An offline processing part that consolidates meaningful behavioral information from the dumped data. Using our tool chain, we studied and propose optimizations to MultiProcessor System on Chip (MPSoC) support in the Linux kernel, saving about 60% of the time required for the release phase of the GNU OpenMP synchronization barrier when running on a 64-core MPSoC.