Tag Archives: perf

Debugging Assembly loading

Does a referenced assembly get loaded if none of its types are “used”? The term “used” is very subjective. For a developer it probably means never creating an instance of a type or calling a method on it. But this does not cover the whole story. Instead, consider the reasons an assembly load can occur. Suzanne’s blog on assembly loading failures gives a good understanding of failures, if that is what you are interested in. This post focuses on how to identify exactly what is causing an assembly to load.

We in the WCF team are very cautious about introducing assembly dependencies and about how our code paths can cause assembly loads, since this impacts the reference set of your process. Images that get loaded during a WCF call can become the cause of slow startup: every assembly is a potential disk lookup, and the larger the number, the higher the impact on startup. As guidance for quick app startup, refactoring types properly can eliminate a lot of unnecessary assembly loads.
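One way to see what triggers a load from inside the process is to subscribe to the AppDomain.AssemblyLoad event and print the stack at that moment. The sketch below is only an illustration of that idea and not necessarily the technique the full post uses; the class name AssemblyLoadLogger and the choice of System.Xml as the "late" assembly are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class AssemblyLoadLogger
{
    // Simple names of assemblies that loaded after Attach() was called.
    public static readonly List<string> Loaded = new List<string>();

    public static void Attach()
    {
        AppDomain.CurrentDomain.AssemblyLoad += (sender, args) =>
        {
            lock (Loaded) { Loaded.Add(args.LoadedAssembly.GetName().Name); }
            Console.WriteLine("Loaded: {0}", args.LoadedAssembly.FullName);
            // The current stack includes the frames that triggered the load.
            Console.WriteLine(Environment.StackTrace);
        };
    }

    // Kept in its own non-inlined method: the JIT resolves the types a method
    // references when it compiles that method, so referencing XmlDocument
    // directly in Main could load the assembly before Attach() has run.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void TouchXml()
    {
        var doc = new System.Xml.XmlDocument(); // example type only (assumption)
        doc.LoadXml("<root/>");
    }

    public static void Main()
    {
        Attach();
        TouchXml();
        Console.WriteLine("Assemblies loaded after Attach: {0}", Loaded.Count);
    }
}
```

Note how the placement of a single type reference decides when the load happens; this is the kind of refactoring lever the startup guidance refers to.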

How do I find all the ETW sessions on the machine?

logman is your tool for this. Here is how you can query for all the sessions and also how to see values from a particular session.

c:\> logman -ets

Data Collector Set                      Type                          Status
-------------------------------------------------------------------------------
AITEventLog                             Trace                         Running
Audio                                   Trace                         Running
DiagLog                                 Trace                         Running
EventLog-Application                    Trace                         Running
EventLog-System                         Trace                         Running
NtfsLog                                 Trace                         Running
SQMLogger                               Trace                         Running
UBPM                                    Trace                         Running
WdiContextLog                           Trace                         Running
MpWppTracing                            Trace                         Running
FSysAgentTrace                          Trace                         Running
MSMQ                                    Trace                         Running
MSDTC_TRACE_SESSION                     Trace                         Running
test_trace                              Trace                         Running

The command completed successfully.

c:\> logman test_trace -ets

Name:                 test_trace
Status:               Running
Root Path:            C:\
Segment:              Off
Schedules:            On
Segment Max Size:     500 MB

Name:                 test_trace\test_trace
Type:                 Trace
Output Location:      C:9_19_44.etl
Append:               Off
Circular:             On
Overwrite:            Off
Buffer Size:          8
Buffers Lost:         0
Buffers Written:      1
Buffer Flush Timer:   0
Clock Type:           Performance
File Mode:            File

Provider:
Name:                 Microsoft-Windows-Application Server-Applications
Provider Guid:        {C651F5F6-1C0D-492E-8AE1-B4EFD7C9D503}
Level:                5
KeywordsAll:          0x0
KeywordsAny:          0xffffffff
Properties:           0
Filter Type:          0

The command completed successfully.

How to synchronize multiple threads?

In certain load tests you want to make sure a bunch of threads reach a particular state before they proceed with the rest of the work. You cannot make sure that all threads execute a given point simultaneously, since CPU scheduling determines this. However, you can bring all of these threads to the Ready state. A ready thread is a thread that can be scheduled for execution on a particular core – http://msdn.microsoft.com/en-us/library/dd627187%28VS.85%29.aspx

WaitForMultipleObjects helps synchronize multiple user mode threads.

“The WaitForMultipleObjects function determines whether the wait criteria have been met. If the criteria have not been met, the calling thread enters the wait state until the conditions of the wait criteria have been met or the time-out interval elapses.”

Here is a small example of how to start multiple threads and then let them proceed after all of them have reached a particular point in execution.

using System;
using System.Threading;

namespace TestThreading
{
    class Program
    {
        const int ThreadCount = 10;

        // One event per thread; a thread sets its own event when it reaches the sync point.
        static ManualResetEvent[] events = new ManualResetEvent[ThreadCount];
        static ThreadStart onStart = new ThreadStart(Start);
        static int locked = -1;

        static void Main(string[] args)
        {
            Thread[] threads = new Thread[ThreadCount];
            for (int i = 0; i < ThreadCount; i++)
            {
                threads[i] = new Thread(onStart);
                events[i] = new ManualResetEvent(false);
            }
            for (int i = 0; i < ThreadCount; i++)
            {
                threads[i].Start();
            }
            Console.ReadLine();
        }

        private static void Start()
        {
            // Each thread claims a unique index into the events array.
            int threadCount = Interlocked.Increment(ref locked);
            Console.WriteLine("Thread {0} started & waiting", threadCount);
            Thread.Sleep(3000); // Simulate some work before setting.
            events[threadCount].Set();
            // Block until every thread has set its event.
            ManualResetEvent.WaitAll(events);
            Console.WriteLine("Thread {0} proceeded", threadCount);
        }
    }
}

How to collect stacks during context switches?

With xperf being adopted more and more, and with its rich stack-walking capabilities, it's only natural to use it to find bottlenecks and the causes of context switches.
One way to determine the offending stack that causes other threads to switch out is to collect the ready-thread information: what causes threads to switch out, and the stack that readied a thread when it switches back in. This helps us identify potential hot locks, really expensive locks, or issues due to false sharing of data.

You can run the following command to capture stack traces with ready thread information.

xperf -on base+cswitch+dispatcher -stackwalk cswitch+readythread

How to throttle callbacks or completions?

WCF enables throttling the execution of operations but not their completions. This becomes an issue when a large number of outstanding operations complete almost simultaneously, overwhelming the callback on the client with completions. Generally we don’t expect the client to issue an unbounded number of pending operations, but if you do end up with very high CPU usage and suspect that all your operations are stuck in a callback method that takes a lock, then you need to throttle the callbacks yourself.

You could try setting minThreads, but this affects the whole app domain. The issue is due to the large number of callbacks that come in concurrently. The attached sample throttles the callbacks so that only one thread executes completions, while 20 threads start the operations and all of them complete almost simultaneously. The idea is to wrap the AsyncResult of your operation and complete only the required number of results in parallel; this throttles the service operation End calls automatically.

Sample Source: AsyncEndThrottling
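The sketch below is not the AsyncEndThrottling sample itself but a simplified variation of the same idea: instead of wrapping the AsyncResult, completions are queued and drained by a bounded number of workers. The names CompletionThrottle and ThrottleDemo are hypothetical, and the Thread.Sleep call stands in for the End call and result handling.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Runs queued completions with at most maxConcurrency executing at once.
public sealed class CompletionThrottle
{
    private readonly object sync = new object();
    private readonly Queue<Action> pending = new Queue<Action>();
    private readonly int maxConcurrency;
    private int running; // number of active drain workers

    public CompletionThrottle(int maxConcurrency)
    {
        if (maxConcurrency < 1) throw new ArgumentOutOfRangeException("maxConcurrency");
        this.maxConcurrency = maxConcurrency;
    }

    // Call this from the AsyncCallback instead of doing the work inline.
    public void Enqueue(Action completion)
    {
        lock (sync)
        {
            pending.Enqueue(completion);
            if (running >= maxConcurrency) return; // an existing worker will pick it up
            running++;
        }
        ThreadPool.QueueUserWorkItem(state => Drain());
    }

    private void Drain()
    {
        while (true)
        {
            Action next;
            lock (sync)
            {
                if (pending.Count == 0) { running--; return; }
                next = pending.Dequeue();
            }
            try { next(); }
            catch (Exception e) { Console.Error.WriteLine(e); }
        }
    }
}

public static class ThrottleDemo
{
    // Returns the highest number of completions observed running at once.
    public static int Run(int operations, int maxConcurrency)
    {
        var throttle = new CompletionThrottle(maxConcurrency);
        var gate = new object();
        var done = new CountdownEvent(operations);
        int active = 0, peak = 0;
        for (int i = 0; i < operations; i++)
        {
            throttle.Enqueue(() =>
            {
                lock (gate) { active++; if (active > peak) peak = active; }
                Thread.Sleep(5); // stand-in for EndXxx(ar) and result handling
                lock (gate) { active--; }
                done.Signal();
            });
        }
        done.Wait();
        lock (gate) { return peak; }
    }

    public static void Main()
    {
        Console.WriteLine("Peak concurrent completions: {0}", Run(20, 1));
    }
}
```

In a real client the lambda passed to Enqueue would call the proxy's End method with the IAsyncResult received by the callback. Unlike minThreads, the limit here applies only to the callbacks routed through this one throttle instance.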

How to optimize Message Copy using CreateBufferedCopy?

Problem Statement

Some broker implementations require creating a copy of the message before forwarding it to the backend. The broker might also slightly modify things like addressing headers on the message for proper routing within the DMZ. The problem is that we see a very high CPU cost in creating this message copy, which also results in lower throughput. Note: Streamed transfer mode is not in scope for this article.


For all performance issues we need to measure and profile. To investigate this issue we first simulate the pattern of the broker by just copying the message and then forwarding it to a dummy backend service. We then profile this to understand the actual cost of copying.

Simple Copy
Calls   % Incl   % Excl   Function
System.ServiceModel.Channels.BufferedMessageBuffer::CreateMessage • System.ServiceModel.Channels.Message()
System.ServiceModel.Channels.Message::CreateBufferedCopy • System.ServiceModel.Channels.MessageBuffer(int32)

An 8% cost for copying seems acceptable considering the value of having a copy and being able to do other things with it if required. But this was not what was being observed. In the profiles from the actual broker we notice about a 40% cost for creating a copy. This means that almost half the time is spent creating a message copy, so effectively your throughput would drop to almost half when the broker is configured to create a copy of the message. This excludes costs like logging.

Evidently our simulation is not accurate, so we need to isolate this further. We take in more functionality from the broker so that we hit this expensive path. One of the key observations was that the message is copied just before it is forwarded. This means that a bunch of manipulations have already been done on the message, whereas our simulation performed none. So to get closer we probably need to change some things on the message.

To keep it simple, we removed a header and added another header to the message, since most brokers modify headers before forwarding.


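int headerIndex = input.Headers.FindHeader(header.Name, header.Namespace);
if (headerIndex >= 0)
{
    // Remove the existing header, as described above.
    input.Headers.RemoveAt(headerIndex);
}
// Add the replacement header.
input.Headers.Add(header);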

 Eureka!! We observed our throughput went down and this was in line with what we were seeing in our broker. So we can see that CreateMessage and CreateBufferedCopy have increased in cost quite a bit.

With Single Header update
Calls   % Incl   % Excl   Function
System.ServiceModel.Channels.DefaultMessageBuffer::CreateMessage • System.ServiceModel.Channels.Message()
System.ServiceModel.Channels.Message::CreateBufferedCopy • System.ServiceModel.Channels.MessageBuffer(int32)

This is the performance data we collected.

Scenario                          CPU Utilization   Throughput
Copy and Forward                  98.6 %
Copy forwarding with new header   98.7 %


 Now that we have identified the root cause we also need to identify the solution so that the broker can achieve the functionality without taking up so much CPU.


The solution is actually quite simple: modify your message after you create the buffered copy. I wanted to give the solution before the detailed analysis, since most of you would probably not be interested in the analysis, but if you are, then the rest should be interesting.
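A minimal sketch of that ordering follows. The header name and namespace (RouteHint, urn:example:broker), the class name, and the header value are hypothetical. The input message is built in memory only so the example is self-contained, so it illustrates the order of the calls rather than the performance difference, which depends on the message having arrived through a buffered encoder.

```csharp
using System;
using System.ServiceModel.Channels;

public static class CopyThenModify
{
    // Hypothetical header used only for this illustration.
    public const string HeaderName = "RouteHint";
    public const string HeaderNs = "urn:example:broker";

    // Copies first, then edits headers on the message that will be forwarded.
    public static Message CopyAndPrepare(Message input, out MessageBuffer buffer)
    {
        // 1. Take the buffered copy while the headers are still untouched.
        buffer = input.CreateBufferedCopy(int.MaxValue);

        // 2. Create the message to forward from the buffer.
        Message outgoing = buffer.CreateMessage();

        // 3. Only now replace the header, on the copy.
        MessageHeader header = MessageHeader.CreateHeader(HeaderName, HeaderNs, "backend-1");
        int headerIndex = outgoing.Headers.FindHeader(header.Name, header.Namespace);
        if (headerIndex >= 0)
        {
            outgoing.Headers.RemoveAt(headerIndex);
        }
        outgoing.Headers.Add(header);
        return outgoing;
    }

    public static void Main()
    {
        Message input = Message.CreateMessage(
            MessageVersion.Soap12WSAddressing10, "urn:example:action", "payload");

        MessageBuffer buffer;
        Message outgoing = CopyAndPrepare(input, out buffer);
        Console.WriteLine("Header present on outgoing message: {0}",
            outgoing.Headers.FindHeader(HeaderName, HeaderNs) >= 0);

        outgoing.Close();
        buffer.Close();
    }
}
```

The same buffer can produce further messages with buffer.CreateMessage(), for example one for logging on the broker and one to forward.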


The most common way to create a copy of your message is using Message.CreateBufferedCopy(int).

  1. The default case of creating a buffered copy creates a BufferedMessage from an underlying BufferedMessageData
  2. This is optimal in the following cases
    1. Message headers are not modified (no update in buffered header values)
    2. BufferedMessage headers haven’t been captured (this happens, e.g., when the user inserts a header in the first location)

If headers have been modified then CreateBufferedCopy takes an alternative path using the DefaultMessageBuffer. The reason is that a full copy of the message has to be created if any buffered header has been modified. An internal property called headers.ContainsOnlyBufferedMessageHeaders is used to decide whether the faster BufferedMessageBuffer can be used to create the buffered copy. If there are any modified headers, we need to ensure that the message is fully marshaled over and the buffer itself cannot be copied (e.g. the user can add a reference type to the header), so we fall back to a path that fully reparses the message and creates a fully deserialized copy of the modified message.

The main point here is that a copy should always be a deep copy, and no kind of modification should result in a message with shallow-copied message parts. When you copy and create a message from the original, your message object gets its own copy of the headers, which it can play around with without affecting the original incoming message. Message copy by itself is a fast operation, as you can see from the above profile, but copying a modified message can be very CPU intensive when using buffered transfer mode.

Router Implementation – Message Forwarding – Copy/Pass through

For greater flexibility our router can be something like a pass through router. If we are just calling a backend service then we can use a generic contract to receive and forward messages to the back end service as shown below.


Here we create a copy of the message to consume locally on the broker in case we want to validate some parts of the message, log it, etc. Ideally the fastest approach would be to forward it directly, but applications sometimes require all incoming messages to be logged or validated at the entry point of the DMZ.
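A pass-through router of this kind could be sketched as below. This is an assumption-laden illustration rather than the original listing: the names IPassThroughRouter and PassThroughRouter, the constructor shape, and the inspect hook are hypothetical. In a real router the backend would be a channel created from a ChannelFactory, and a service with constructor parameters would need to be hosted as a pre-built instance.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

// Generic contract: Action = "*" matches any incoming action, so a single
// operation receives every message regardless of the backend's typed contract.
[ServiceContract]
public interface IPassThroughRouter
{
    [OperationContract(Action = "*", ReplyAction = "*")]
    Message ProcessMessage(Message request);
}

public class PassThroughRouter : IPassThroughRouter
{
    private readonly IPassThroughRouter backend; // channel to the backend (assumed)
    private readonly Action<Message> inspect;    // local logging/validation hook (assumed)

    public PassThroughRouter(IPassThroughRouter backend, Action<Message> inspect)
    {
        this.backend = backend;
        this.inspect = inspect;
    }

    public Message ProcessMessage(Message request)
    {
        // Copy before touching the message so both consumers get their own instance.
        MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
        try
        {
            // One copy is consumed locally on the broker.
            using (Message local = buffer.CreateMessage())
            {
                inspect(local);
            }
            // A second copy is forwarded; the reply is passed back unchanged.
            return backend.ProcessMessage(buffer.CreateMessage());
        }
        finally
        {
            buffer.Close();
        }
    }
}
```

Because the contract deals only in Message, the router never deserializes the body into typed objects.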


Advantages

  1. Loose coupling.
  2. Potentially avoids a lot of serialization and deserialization cost. (Encoders need to match for this.)
  3. Changes in the backend usually do not require changes in the broker, provided the broker uses generic client contracts.

Disadvantages

  1. The synchronous pattern has its overhead and is not scalable.
  2. The router has to be of very high capacity to support high loads even though backend machines may block on IO.

Best Practices

  1. Match your encoders – Encoder mismatches can cause a heavy serialization and deserialization on the router. This is because any change between the backend and the frontend encoders on the binding would require re-encoding of the message and hence a full read and write at the encoder layer.
  2. Avoid having to create a message copy. If the backend can create the copy or validate the message, you save the router's CPU for more messages per second.
  3. Make sure you use fast message copy  – Buffered messages have a very optimized message copy path that WCF would take provided you haven’t changed parts in the message. I will talk about how to hit this fast path next.

  Next – How to Optimize Message Copying using CreateBufferedCopy?

Router Implementation – Strong Typed with Message Forwarding

I use the term broker and router very loosely here since they follow very similar guidelines as described here – WCF Broker Overview. Apologies for not being very rigid with these terms.

I will dive into best practices for building a router by progressing from a very simple implementation to a robust one, through different scenarios and varying degrees of complexity.

The easiest implementation is by using a strongly typed contract with message forwarding as shown below.

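[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    Order[] GetOrders(int numOrders);
}

class OrderService : IOrderService
{
    // backendProxy is a typed client proxy to the backend service (creation not shown).
    public Order[] GetOrders(int numOrders)
    {
        // The router deserializes the request and re-serializes the typed reply.
        return backendProxy.GetOrders(numOrders);
    }
}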


Advantages

  1. Very easy to implement.

Disadvantages

  1. Tight coupling between router and backend.
  2. The router needs to serialize and deserialize the whole message and is hence very inefficient.
  3. Any change to the backend requires changes to the router as well.
  4. Not a scalable solution, since higher loads would choke the server due to the heavy serialization cost.


Next – Router Implementation – Message Forwarding – Copy/Pass through

WCF Broker Overview

A broker is usually a central point for message forwarding and pass-through between clients and backend services. Many types of brokers come to mind:

  1. Security broker – Usually a gate check for incoming messages that transitions from one security protocol to another, most of the time without mucking with the message.
  2. Protocol transition broker (http to tcp)
  3. Central work load balancer – Routing of some form
  4. Message transformers and service buses
    1. Aggregators/ Scatter gather/Composition & other patterns
  5. Encoding transition – SOAP/XML/Rest/Binary etc.
  6. Relaying persistent brokers (Persistent stores that holds on the message and later forwards it)
    1. Glorified queuing
  7. The list goes on… and you can derive a ton when using a combination of Message exchange patterns

But they mostly fall into the below basic model unless the client manages to bypass the broker with a P2P communication with the backend itself.


Router Implementation – Strong Typed with Message Forwarding




Here are 2 really good articles on MSDN that I would recommend for the functional aspects of architecting your router.

  1. Building a WCF Router, Part 1 (Addressing semantics)
  2. Building A WCF Router, Part 2 ( Scenarios and Message Exchange patterns)