Edit: Turns out Nameful's friend worked for EA optimization team and the issue is now resolved. Thanks!
Monitoring other events in addition to clockticks and instructions retired can often reveal potential causes of what appear to be pipeline stalls. Several of these events map directly to common coding pitfalls: 64K aliasing conflicts, split loads retired, MOB load replays retired (blocked store-forwards retired), SSE input assists, x87 input assists, and x87 output assists. Each of these events indicates that the source code contains instruction sequences that are potentially unfriendly to the microarchitecture in one way or another.

MOB (memory order buffer) load replays retired, for example, indicates that store-to-load forwarding restrictions are not being observed. The Pentium 4 processor and Intel Xeon® processor use a store-to-load forwarding technique that lets certain memory loads (loads from an address whose data has just been modified by a preceding store) complete without waiting for the data to be written to the cache. Store-to-load forwarding succeeds only when certain size and alignment restrictions are met; when a restriction is not observed, the memory load operation stalls.
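To make the size restriction concrete, here is a minimal sketch in C of a store/load pair that is likely to block forwarding, next to one that is likely to allow it. The function names `combine_blocked` and `combine_friendly` are hypothetical and not part of any Intel tool or library. The `volatile` qualifier is an assumption used only to stop the compiler from removing the memory accesses, and whether a particular build actually triggers MOB load replays depends on the compiler and the processor.

```c
#include <stdint.h>

/*
 * Hypothetical illustration of store-to-load forwarding patterns.
 * Both functions combine two 16-bit halves into one 32-bit word;
 * they differ only in how the data passes through memory.
 */

/* Likely to block forwarding: two narrow (16-bit) stores are followed
 * by a wider (32-bit) load that spans both of them, so no single
 * preceding store can supply all of the loaded bytes.
 * Note: the byte order of the result follows the host's endianness;
 * on little-endian x86, `lo` lands in the low half of the word. */
uint32_t combine_blocked(uint16_t lo, uint16_t hi)
{
    union {
        volatile uint16_t half[2];
        volatile uint32_t word;
    } u;

    u.half[0] = lo;   /* 16-bit store */
    u.half[1] = hi;   /* 16-bit store */
    return u.word;    /* 32-bit load covering both stores */
}

/* Likely to forward: the value is assembled in a register first, then
 * written with one 32-bit store and read back with a load of the same
 * size from the same address. */
uint32_t combine_friendly(uint16_t lo, uint16_t hi)
{
    volatile uint32_t word;

    word = (uint32_t)lo | ((uint32_t)hi << 16);  /* one 32-bit store */
    return word;                                 /* same-size load   */
}
```

On a little-endian machine both functions return the same value, so swapping one pattern for the other changes only how the load is serviced, not the result. Sampling the MOB load replays retired event before and after such a change is one way to check whether it had any effect.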
Evaluating when coding pitfalls are causing performance hits is difficult to do without the help of the analyzer. To determine whether there is something in the implementation that can be done to help speed things up, we profile the application with all the events we want to monitor. Though the VTune analyzer is capable of collecting data on multiple events simultaneously, it usually runs more than one sampling session to collect all the data because certain events cannot be monitored at the same time. After the sampling sessions are complete, we are presented with a graph similar to Figure 1.
Looking at the modules
After running a sampling session, the VTune Performance Analyzer displays the following graph. Each colored bar represents a different sampled event, and the pop-up box summarizes all the data.