Harmful collection transformations. Part 1: string to array of chars

Published on Thursday, September 23, 2021

An array - the simplest collection. It's just a memory block with elements of one type. Arrays are often used when we need to iterate a collection, to search elements or transform them. C# and .net evolved, many new collections were introduced, interfaces allowed to create generalized algorithms, LINQ simplified the work of programmers, but we continue to use arrays in inappropriate places and perform excess and sometimes harmful transformations. This post begins a series about collections, their features, performance and design. All posts will contain benchmarks, code examples, my opinion and recommendations.

I want to start with a string, some specific collection of .net.

String

Any large enough project operates with strings. Usually strings store a user input and are used as data for other types. For example, we can read text input and convert it to a number type:

static void Main()
{
    var input = Console.ReadLine();
    var value = int.Parse(input);
}

All we know, user input must be validated. Here we need to check that all chars are digits. It's a very common check. And often the following code is used:

private bool IsDigitString(string str)
{
   return str.ToCharArray().All(char.IsDigit);
}

or

string bik = ...;
if (bik.ToArray().Any(x => !char.IsDigit(x)))
{
   message = "BIK must contain only digits";
   return false;
}

Looks simple: get a collection of chars and call LINQ-method to check every digit. The only unnecessary and harmful thing is transformation of string to array of chars with methods ToCharArray() or ToArray(). We used to use LINQ-methods with collections, but forget that the string is an immutable collection that implements IEnumerable<char> interface:

[Serializable]
public sealed partial class String : IComparable, IEnumerable, IConvertible, IEnumerable<char>, IComparable<string>, IEquatable<string>, ICloneable
{
...
}

So, we can modify our examples and remove excessive calls:

private bool IsDigitString(string str)
{
   return str.All(char.IsDigit);
}

and

string bik = ...;
if (bik.Any(x => !char.IsDigit(x)))
{
   message = "BIK must contain only digits";
   return false;
}

Ok, our code became simpler and more readable. But what do we get besides code quality benefits? Benchmarks can help us. Let's compare three methods: with ToArray(), with ToCharArray() and without additional calls for different string lengths and frameworks:

  • Benchmark code. Click to expand.
  • namespace ToCharArrayBenchmark
    {
        using System.Linq;
        using BenchmarkDotNet.Attributes;
        using BenchmarkDotNet.Jobs;
    
        [SimpleJob(RuntimeMoniker.Net48, baseline: true)]
        [SimpleJob(RuntimeMoniker.NetCoreApp31)]
        [SimpleJob(RuntimeMoniker.Net50)]
        [DisassemblyDiagnoser]
        [MemoryDiagnoser]
        public class IsDigitBenchmark
        {
            [Params(10, 100, 1000)]
            public int N { get; set; }
    
            private string value;
    
            [GlobalSetup]
            public void SetUp()
            {
                var chars = new char[N];
                for (int i = 0; i < N; i++)
                {
                    chars[i] = (char)(i % 10 + '0');
                }
    
                value = new string(chars);
            }
    
            [Benchmark]
            public bool IsDigitString_ToCharArray()
            {
                return value.ToCharArray().All(char.IsDigit);
            }
    
            [Benchmark]
            public bool IsDigitString_ToArray()
            {
                return value.ToArray().All(char.IsDigit);
            }
    
            [Benchmark(Baseline = true)]
            public bool IsDigitString()
            {
                return value.All(char.IsDigit);
            }
        }
    }
    
  • Full results in table format. Click to expand.
  • BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1237 (21H1/May2021Update)
    AMD Ryzen 7 4700U with Radeon Graphics, 1 CPU, 8 logical and 8 physical cores
      [Host]             : .NET Framework 4.8 (4.8.4400.0), X64 RyuJIT
      .NET 5.0           : .NET 5.0.7 (5.0.721.25508), X64 RyuJIT
      .NET Core 3.1      : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
      .NET Framework 4.8 : .NET Framework 4.8 (4.8.4400.0), X64 RyuJIT
    
    Method Runtime N Mean Error StdDev Ratio RatioSD Gen 0 Allocated
    IsDigitString .NET 5.0 10 94.54 ns 1.924 ns 3.267 ns 1.01 0.04 0.0459 96 B
    IsDigitString_ToCharArray .NET 5.0 10 97.01 ns 0.602 ns 0.503 ns 1.02 0.02 0.0688 144 B
    IsDigitString_ToArray .NET 5.0 10 250.71 ns 3.346 ns 2.794 ns 2.64 0.04 0.1376 288 B
    IsDigitString .NET Core 3.1 10 98.74 ns 1.595 ns 1.492 ns 1.04 0.02 0.0459 96 B
    IsDigitString_ToCharArray .NET Core 3.1 10 105.02 ns 1.217 ns 1.079 ns 1.10 0.02 0.0688 144 B
    IsDigitString_ToArray .NET Core 3.1 10 294.50 ns 0.838 ns 0.700 ns 3.11 0.05 0.1373 288 B
    IsDigitString .NET Framework 4.8 10 94.96 ns 1.863 ns 1.913 ns 1.00 0.00 0.0459 96 B
    IsDigitString_ToCharArray .NET Framework 4.8 10 90.48 ns 1.345 ns 1.258 ns 0.95 0.02 0.0688 144 B
    IsDigitString_ToArray .NET Framework 4.8 10 249.17 ns 0.617 ns 0.482 ns 2.63 0.04 0.1452 305 B
    IsDigitString .NET 5.0 100 617.01 ns 2.901 ns 2.714 ns 0.91 0.02 0.0458 96 B
    IsDigitString_ToCharArray .NET 5.0 100 730.71 ns 6.392 ns 5.666 ns 1.08 0.02 0.1526 320 B
    IsDigitString_ToArray .NET 5.0 100 1,471.00 ns 25.149 ns 62.630 ns 2.21 0.16 0.3891 816 B
    IsDigitString .NET Core 3.1 100 695.67 ns 13.740 ns 22.576 ns 1.03 0.03 0.0458 96 B
    IsDigitString_ToCharArray .NET Core 3.1 100 707.81 ns 14.127 ns 40.760 ns 1.12 0.05 0.1526 320 B
    IsDigitString_ToArray .NET Core 3.1 100 1,638.57 ns 18.990 ns 16.834 ns 2.42 0.05 0.3891 816 B
    IsDigitString .NET Framework 4.8 100 677.60 ns 13.072 ns 12.839 ns 1.00 0.00 0.0458 96 B
    IsDigitString_ToCharArray .NET Framework 4.8 100 692.45 ns 10.962 ns 9.717 ns 1.02 0.02 0.1526 321 B
    IsDigitString_ToArray .NET Framework 4.8 100 1,696.70 ns 26.955 ns 22.509 ns 2.51 0.05 0.4768 1,003 B
    IsDigitString .NET 5.0 1000 8,340.66 ns 159.513 ns 141.404 ns 1.26 0.02 0.0458 96 B
    IsDigitString_ToCharArray .NET 5.0 1000 7,902.88 ns 153.280 ns 209.812 ns 1.20 0.04 1.0071 2,120 B
    IsDigitString_ToArray .NET 5.0 1000 12,526.66 ns 215.331 ns 201.421 ns 1.90 0.03 2.1820 4,568 B
    IsDigitString .NET Core 3.1 1000 6,771.54 ns 42.026 ns 35.093 ns 1.03 0.01 0.0458 96 B
    IsDigitString_ToCharArray .NET Core 3.1 1000 6,820.17 ns 77.269 ns 60.326 ns 1.03 0.01 1.0071 2,120 B
    IsDigitString_ToArray .NET Core 3.1 1000 14,993.39 ns 249.113 ns 255.821 ns 2.27 0.04 2.1820 4,568 B
    IsDigitString .NET Framework 4.8 1000 6,609.99 ns 51.380 ns 48.061 ns 1.00 0.00 0.0458 96 B
    IsDigitString_ToCharArray .NET Framework 4.8 1000 6,922.85 ns 102.382 ns 90.759 ns 1.05 0.01 1.0071 2,127 B
    IsDigitString_ToArray .NET Framework 4.8 1000 15,401.52 ns 117.030 ns 103.744 ns 2.33 0.02 3.0823 6,475 B
    // * Legends *
      N         : Value of the 'N' parameter
      Mean      : Arithmetic mean of all measurements
      Error     : Half of 99.9% confidence interval
      StdDev    : Standard deviation of all measurements
      Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
      RatioSD   : Standard deviation of the ratio distribution ([Current]/[Baseline])
      Gen 0     : GC Generation 0 collects per 1000 operations
      Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
      1 us      : 1 Microsecond (0.000001 sec)
    

    Let's examine a diagram with execution times (in microseconds) for strings with 100 chars:

    Time diagram

    As we see, method IsDigitString is about 5% faster than calling ToCharArray(). ToArray() is up to 3 times slower. Both ToCharArray() and ToArray() do the same, but ToCharArray() is a string class method and very optimized with unsafe code. ToArray() is an extension method for IEnumerable<T> and is not optimized so much.

    So, we know, there is no need to transform string to array to use LINQ-methods, this transformation slightly hurts performance and there is one more thing - GC. Every time when we call ToArray() or ToCharArray() we create at least one new object - a new array. We no longer need the source string and GC will collect at some time.

    The next diagram shows us memory allocations (in bytes) for strings with 100 chars:

    Allocated memory diagram

    Cons

    Should we always always remove ToArray() and ToCharArray() calls? It depends on context. ToArray() generates a lot of overhead, so I can't find a situation where to use it. But as we saw before, ToCharArray() is very fast, and almost does not affect performance. Can you imagine that calling ToCharArray() will save time? But it does! Look at this example:

    var convertParameterSymbols = attributes["ConvertParameterSymbols"];
    if (convertParameterSymbols != null)
    	ConvertParameterSymbols = new List<char>(convertParameterSymbols.ToCharArray());
    

    It looks like we need to remove ToCharrArray(). But before doing this, we need to go back to the string class definition. As we remember, it implements IEnumerable<char>, but not ICollection<char>. ICollection<T> interface describes the contract for mutable collections (Add, Clear, Remove) that is not suitable for strings. And look at List<T> constructor:

    public List(IEnumerable<T> collection)
    {
      if (collection == null)
    	ThrowHelper.ThrowArgumentNullException(ExceptionArgument.collection);
      if (collection is ICollection<T> objs)
      {
    	int count = objs.Count;
    	if (count == 0)
    	{
    	  this._items = List<T>._emptyArray;
    	}
    	else
    	{
    	  this._items = new T[count];
    	  objs.CopyTo(this._items, 0);
    	  this._size = count;
    	}
      }
      else
      {
    	this._size = 0;
    	this._items = List<T>._emptyArray;
    	foreach (T obj in collection)
    	  this.Add(obj);
      }
    }
    

    It's optimized for ICollection<T> objects. Elements are just copied as a block with CopyTo instead of adding one by one that causes resizes of list. What types do implement ICollection<T> interface? You will be surprised, but any array (T[]) can be converted to ICollection<T>. And using new List<char>(convertParameterSymbols.ToCharArray()) is faster than new List<char>(convertParameterSymbols). Benchmarks can prove this.

  • Benchmark code. Click to expand.
  • namespace ToCharArrayBenchmark
    {
        using System.Collections.Generic;
        using System.Linq;
        using BenchmarkDotNet.Attributes;
        using BenchmarkDotNet.Jobs;
    
        [SimpleJob(RuntimeMoniker.Net48, baseline: true)]
        [SimpleJob(RuntimeMoniker.NetCoreApp31)]
        [SimpleJob(RuntimeMoniker.Net50)]
        [DisassemblyDiagnoser]
        [MemoryDiagnoser]
        public class ConstructorBenchmark
        {
            [Params(10, 100, 1000)]
            public int N { get; set; }
    
            private string value;
    
            [GlobalSetup]
            public void SetUp()
            {
                var chars = new char[N];
                for (int i = 0; i < N; i++)
                {
                    chars[i] = (char)(i % 10 + '0');
                }
                
                value = new string(chars);
            }
            
            [Benchmark(Baseline = true)]
            public List<char> Constructor()
            {
                return new List<char>(value);
            }
    
            [Benchmark]
            public List<char> Constructor_ToCharArray()
            {
                return new List<char>(value.ToCharArray());
            }
    
            [Benchmark]
            public List<char> Constructor_ToArray()
            {
                return new List<char>(value.ToArray());
            }
        }
    }
    
  • Full results in table format. Click to expand.
  • BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1288 (21H1/May2021Update)
    AMD Ryzen 7 4700U with Radeon Graphics, 1 CPU, 8 logical and 8 physical cores
      [Host]             : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT
      .NET 5.0           : .NET 5.0.7 (5.0.721.25508), X64 RyuJIT
      .NET Core 3.1      : .NET Core 3.1.20 (CoreCLR 4.700.21.47003, CoreFX 4.700.21.47101), X64 RyuJIT
      .NET Framework 4.8 : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT
    
    
    Method Runtime N Mean Error StdDev Ratio RatioSD Gen 0 Allocated
    Constructor .NET 5.0 10 115.76 ns 1.517 ns 1.345 ns 0.74 0.01 0.0917 192 B
    Constructor_ToCharArray .NET 5.0 10 35.95 ns 0.698 ns 0.653 ns 0.23 0.00 0.0612 128 B
    Constructor_ToArray .NET 5.0 10 194.70 ns 2.219 ns 1.967 ns 1.25 0.01 0.1299 272 B
    Constructor .NET Core 3.1 10 138.96 ns 2.804 ns 2.880 ns 0.89 0.02 0.0918 192 B
    Constructor_ToCharArray .NET Core 3.1 10 52.31 ns 0.897 ns 0.839 ns 0.34 0.01 0.0612 128 B
    Constructor_ToArray .NET Core 3.1 10 228.65 ns 0.703 ns 0.657 ns 1.47 0.00 0.1299 272 B
    Constructor .NET Framework 4.8 10 155.57 ns 0.166 ns 0.147 ns 1.00 0.00 0.0956 201 B
    Constructor_ToCharArray .NET Framework 4.8 10 71.37 ns 1.098 ns 1.027 ns 0.46 0.01 0.0650 136 B
    Constructor_ToArray .NET Framework 4.8 10 232.85 ns 1.850 ns 1.730 ns 1.50 0.01 0.1414 297 B
    Constructor .NET 5.0 100 648.65 ns 7.217 ns 6.751 ns 0.80 0.02 0.3405 712 B
    Constructor_ToCharArray .NET 5.0 100 51.29 ns 0.921 ns 0.817 ns 0.06 0.00 0.2295 480 B
    Constructor_ToArray .NET 5.0 100 791.75 ns 15.823 ns 20.012 ns 0.99 0.03 0.4663 976 B
    Constructor .NET Core 3.1 100 684.45 ns 4.868 ns 4.065 ns 0.85 0.02 0.3405 712 B
    Constructor_ToCharArray .NET Core 3.1 100 70.10 ns 0.527 ns 0.493 ns 0.09 0.00 0.2294 480 B
    Constructor_ToArray .NET Core 3.1 100 962.36 ns 4.630 ns 3.867 ns 1.19 0.03 0.4654 976 B
    Constructor .NET Framework 4.8 100 800.13 ns 14.134 ns 17.875 ns 1.00 0.00 0.3443 722 B
    Constructor_ToCharArray .NET Framework 4.8 100 95.32 ns 1.880 ns 1.931 ns 0.12 0.00 0.2333 489 B
    Constructor_ToArray .NET Framework 4.8 100 960.90 ns 6.956 ns 5.431 ns 1.19 0.03 0.5569 1,171 B
    Constructor .NET 5.0 1000 5,289.46 ns 104.687 ns 153.449 ns 0.83 0.03 2.0828 4,368 B
    Constructor_ToCharArray .NET 5.0 1000 235.86 ns 4.589 ns 4.507 ns 0.04 0.00 1.9505 4,080 B
    Constructor_ToArray .NET 5.0 1000 5,943.57 ns 111.670 ns 204.195 ns 0.95 0.04 3.1128 6,528 B
    Constructor .NET Core 3.1 1000 5,196.81 ns 70.120 ns 65.590 ns 0.82 0.02 2.0828 4,368 B
    Constructor_ToCharArray .NET Core 3.1 1000 247.25 ns 1.114 ns 0.987 ns 0.04 0.00 1.9503 4,080 B
    Constructor_ToArray .NET Core 3.1 1000 7,100.14 ns 85.990 ns 80.435 ns 1.13 0.01 3.1128 6,528 B
    Constructor .NET Framework 4.8 1000 6,306.12 ns 69.763 ns 65.257 ns 1.00 0.00 2.0905 4,389 B
    Constructor_ToCharArray .NET Framework 4.8 1000 275.66 ns 2.318 ns 2.055 ns 0.04 0.00 1.9541 4,101 B
    Constructor_ToArray .NET Framework 4.8 1000 7,507.12 ns 20.504 ns 17.121 ns 1.19 0.01 4.0131 8,453 B
    // * Legends *
      N         : Value of the 'N' parameter
      Mean      : Arithmetic mean of all measurements
      Error     : Half of 99.9% confidence interval
      StdDev    : Standard deviation of all measurements
      Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
      RatioSD   : Standard deviation of the ratio distribution ([Current]/[Baseline])
      Gen 0     : GC Generation 0 collects per 1000 operations
      Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
      1 us      : 1 Microsecond (0.000001 sec)
    

    Time diagram (in microseconds) for strings with 100 chars:

    Time diagram

    Incredible! A list created from an array is about 10 times faster than created from a source string. And memory usage is more optimal (in bytes):

    Memory diagram

    But we should remember, it's a very specific case.

    Conclusion

    In common we should be be more attentive and avoid unnecessary transformations for reasons:

    • Performance. Such small optimizations can positively influence the processing of big data.
    • Additional memory traffic - GC has to collect more data, GC pauses can be longer.
    • Code quality. We can reduce code but not lose functionality.

    Some cases require special attention and a deeper understanding of the .net platform.

    If you really need to convert a string to an array use ToCharArray() instead of ToArray().

    More examples

    Here are more examples I found.

    Lowercase string

    There is a very popular way to lowercase the first letter of a string. We remember - strings are immutable and name[0] = char.ToLowerInvariant(name[0]) won't compile. So, we need to create a new string from a char array.

    public override string Format(string name)
    {
    	if (string.IsNullOrEmpty(name) || char.IsLower(name[0])) return name;
    	var chars = name.ToCharArray();
    	chars[0] = char.ToLowerInvariant(chars[0]);
    	return new string(chars);
    }
    

    Reverse string

    There is no built-in method to reverse a string, so we have to create our own using Array.Reverse method, so we transform a source string to an array.

    public static string ReverseString(this string input)
    {
    	var charArray = input.ToCharArray();
    	Array.Reverse(charArray);
    	return new string(charArray);
    }
    

    Interate string

    public static byte[] ToBytes(string value)
    {
    	char[] chars = value.ToCharArray();
    	byte[] res = new byte[chars.Length / 2];
    	for (int i = 0, j = 0; i < chars.Length; i += 2, j++)
    		res[j] = (byte)(FromHex((byte)chars[i]) << 4 ^ FromHex((byte)chars[i + 1]));
    	return res;
    }
    

    We iterate through a string, and no need to create a new array. The code can be simplified:

    public static byte[] ToBytes(string value)
    {
    	byte[] res = new byte[value.Length / 2];
    	for (int i = 0, j = 0; i < value.Length; i += 2, j++)
    		res[j] = (byte)(FromHex((byte)value[i]) << 4 ^ FromHex((byte)value[i + 1]));
    	return res;
    }