Something about .NET string interning

Posted on December 11, 2007

Look at the following piece of code:

        private void Test()
        {
            string str = "Hello" + " " + "World";   /* line 1 */
            MessageBox.Show(str);                   /* line 2 */
        }

Note how the string is written as the concatenation of three literals: “Hello”, “ ”, and “World”. The C# compiler generates the following IL code for our method named Test.

.method private hidebysig instance void  Test() cil managed
{
// Code size       14 (0xe)
.maxstack  1
.locals init ([0] string str)
IL_0000:  ldstr      "Hello World"
IL_0005:  stloc.0
IL_0006:  ldloc.0
IL_0007:  call       valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult
[System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
IL_000c:  pop
IL_000d:  ret
} // end of method Form1::Test

From the IL you can see that the compiler did some optimization. Every operand is a string literal, so the whole expression is a compile-time constant. The compiler therefore folded it into a single ldstr instruction that loads “Hello World” (IL_0000) instead of generating code to concatenate “Hello”, “ ”, and “World” at run time.
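You can check this folding without reading IL. The sketch below is a small console program, not the WinForms code from this post, and the class and field names are made up for illustration. It relies on the rule that the initializer of a const must be a compile-time constant: if the concatenation were left for run time, the declaration would not compile.

```csharp
using System;

static class ConstantFoldingSketch
{
    // A const initializer must be a compile-time constant, so the fact that
    // this declaration compiles shows the compiler evaluates the concatenation itself.
    public const string Folded = "Hello" + " " + "World";

    static void Main()
    {
        Console.WriteLine(Folded);                  // writes the folded text
        Console.WriteLine(Folded == "Hello World"); // compares contents, not references
    }
}
```

Replacing any one of the three literals with a non-const variable should make the compiler reject the declaration, which is the same distinction the next example in this post runs into.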

Now look at the following piece of code:

        private void Test2()
        {
            string str = "Hello";         /* line 1 */
            str += " ";                   /* line 2 */
            str += "World";               /* line 3 */
            MessageBox.Show(str);         /* line 4 */
        }

The C# compiler generates the following IL code for this method named Test2.

.method private hidebysig instance void  Test2() cil managed
{
// Code size       38 (0x26)
.maxstack  2
.locals init ([0] string str)
IL_0000:  ldstr      "Hello"
IL_0005:  stloc.0
IL_0006:  ldloc.0
IL_0007:  ldstr      " "
IL_000c:  call       string [mscorlib]System.String::Concat(string,
string)
IL_0011:  stloc.0
IL_0012:  ldloc.0
IL_0013:  ldstr      "World"
IL_0018:  call       string [mscorlib]System.String::Concat(string,
string)
IL_001d:  stloc.0
IL_001e:  ldloc.0
IL_001f:  call       valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult
[System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
IL_0024:  pop
IL_0025:  ret
} // end of method Form1::Test2

From the IL you can see that this time the compiler didn’t do a similar optimization. The compiler generated code to first initialize the string to “Hello” (IL_0000), then append “ ” (IL_000c), and then append “World” (IL_0018). This might seem surprising to some. Why didn’t the compiler do the optimization this time?

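The reason lies in what the C# language treats as a constant expression. Let us look at the original C# code again:

        private void Test2()
        {
            string str = "Hello";        /* line 1 */
            str += " ";                  /* line 2 */
            str += "World";              /* line 3 */
            MessageBox.Show(str);        /* line 4 */
        }

In Test, the expression “Hello” + “ ” + “World” is built entirely from literals, so it is a constant expression and the compiler evaluates it at compile time. In Test2, str is an ordinary local variable, so the concatenations on line 2 and line 3 are not constant expressions, and the compiler translates each += into a String.Concat call. The C# compiler does not track the value of a variable from one statement to the next in order to merge such calls. Threading has nothing to do with it: str is a local variable that no lambda or anonymous method captures, so no other thread could modify it between line 1 and line 2, or between line 2 and line 3. A const string, on the other hand, can take part in a constant expression.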
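A quick way to see that what matters is whether the expression is a compile-time constant is to put a const local next to an ordinary local. This is only a sketch: it is a console program rather than the form from this post, and the class and method names are hypothetical.

```csharp
using System;

static class ConstVersusVariableSketch
{
    // s is a const local, so the whole return expression is a constant expression.
    public static string FromConstLocal()
    {
        const string s = "Hello";
        return s + " " + "World";
    }

    // s is an ordinary local, so each += is compiled to a String.Concat call.
    public static string FromVariable()
    {
        string s = "Hello";
        s += " ";
        s += "World";
        return s;
    }

    static void Main()
    {
        string literal = "Hello World";

        // Both methods produce the same contents.
        Console.WriteLine(FromConstLocal() == FromVariable());

        // Reference comparisons against the literal.
        Console.WriteLine(Object.ReferenceEquals(literal, FromConstLocal()));
        Console.WriteLine(Object.ReferenceEquals(literal, FromVariable()));
    }
}
```

Only the first line of output is guaranteed by the language to be True. On the usual runtimes the second line should also be True and the third False, because the const version is just another copy of the literal in the same assembly while the variable version is allocated by String.Concat. Treat those two results as typical behavior rather than a promise.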

Incidentally that also explains why the following piece of code will display these two messages:

  1. str1 and str2 point to the same thing.
  2. str1 and str3 point to different things.

Here’s the code:

        private void Test3()
        {
            string str1 = "Hello World";
            string str2 = "Hello" + " " + "World";
            string str3 = "";

            str3 += "Hello";
            str3 += " ";
            str3 += "World";

            if (Object.ReferenceEquals(str1, str2))
                MessageBox.Show("str1 and str2 point to the same thing");
            else
                MessageBox.Show("str1 and str2 point to different things");

            if (Object.ReferenceEquals(str1, str3))
                MessageBox.Show("str1 and str3 point to the same thing");
            else
                MessageBox.Show("str1 and str3 point to different things");
        }

And how does everything I have said so far explain the output?

Here are two hints:

  1. String literals are interned in .NET. Essentially that means that if you assign the literal “Hello” to s1 and the literal “Hello” also to s2, both s1 and s2 would actually be pointing to the same object.
  2. String.Concat doesn’t intern return values. It returns a new string object even if an equal string is already interned.

Okay. So here is what happens.

  • I just showed how the compiler would optimize “Hello” + “ ” + “World” as if you wrote str2 = “Hello World”.
  • String literals are interned, and that’s why str1 and str2 point to the same interned string, so Object.ReferenceEquals(str1, str2) returns true.
  • From the IL of the previous example, you know that the “+” operator is translated by the compiler to String.Concat calls when an operand is not a constant. I just said String.Concat doesn’t intern return values. So str1 and str3 point to two different string objects that happen to have the same contents, and that’s why Object.ReferenceEquals(str1, str3) returns false.
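If you do want a string built at run time to share the pooled object, String.Intern will look it up in the intern pool (adding it if necessary) and return the pooled reference. The sketch below is again a console program with hypothetical names; BuildAtRunTime mirrors the way Test3 builds str3.

```csharp
using System;

static class InternSketch
{
    // Builds "Hello World" at run time, the same way Test3 builds str3.
    public static string BuildAtRunTime()
    {
        string s = "";
        s += "Hello";
        s += " ";
        s += "World";
        return s;
    }

    static void Main()
    {
        string a = BuildAtRunTime();
        string b = BuildAtRunTime();

        // Equal contents.
        Console.WriteLine(a == b);

        // Two separate calls, two separate results of String.Concat.
        Console.WriteLine(Object.ReferenceEquals(a, b));

        // String.Intern returns one pooled reference for equal contents.
        Console.WriteLine(Object.ReferenceEquals(String.Intern(a), String.Intern(b)));

        // Comparison of the pooled reference with the literal.
        Console.WriteLine(Object.ReferenceEquals("Hello World", String.Intern(a)));
    }
}
```

The first and third lines are True by definition of string equality and String.Intern. The second line should be False for the reason given in the bullets above. The last line is normally True as well, since the literal lives in the same pool, but the documentation for String.Intern notes that an assembly can opt out of literal interning, so I would not build program logic on it.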

Posted in: .NET, General, Tech/Hacks