Saturday, October 28, 2017

C# Filter duplicates from a collection quickly!

Today I did some testing to determine which of two methods was faster at finding duplicates in a collection: a LINQ query or a Dictionary.


In C#, there is always more than one way to do things. For example: I have a list of items that may or may not contain some duplicate items, and I wanted to put just the duplicates in a separate list. I found two pieces of code able to do that.

For the hasty ones: a Dictionary is faster when the source contains no duplicates. As soon as there are duplicates, the Dictionary becomes MUCH slower, while the query stays fast.

The first method is a LINQ query, written directly in the C# source code.
In this program, "source" is a List<string>. Let's assume it contains: "aaa", "bbb", "ccc", "aaa". The query groups identical strings together and then counts how many items are in each group. In this case, executing the query produces three groups: one group with the two strings "aaa", and two groups with one string each, "bbb" and "ccc" respectively. Only groups with more than one string are selected.
(This function was written just for the test. To measure a meaningful amount of time, the work is repeated a number of times.)

// Requires: using System.Linq;
// Assumes fields: List<string> source, List<string> duplicates, int cycles.
public void LINQ()
{
    for (int i = 0; i < cycles; i++)
    {
        duplicates.Clear();

        // Group identical strings; only groups with more than
        // one member contain duplicates.
        var duplicatefilenames = from string item in source
                                 group item by item into grp
                                 where grp.Count() > 1
                                 select grp;

        foreach (var items in duplicatefilenames)
        {
            duplicates.Add(items.First());
        }
    }
}


The second method uses a Dictionary. The idea here is that a Dictionary can only contain unique keys; if you try to add a duplicate key, an ArgumentException is thrown:


// Assumes the same fields: List<string> source, List<string> duplicates, int cycles.
public void Dictionary()
{
    Dictionary<string, int> UniqueList = new Dictionary<string, int>();

    for (int i = 0; i < cycles; i++)
    {
        duplicates.Clear();
        UniqueList.Clear();

        foreach (string item in source)
        {
            try
            {
                // Add() throws when the key is already present.
                UniqueList.Add(item, i);
            }
            catch (ArgumentException) // item already exists in UniqueList
            {
                duplicates.Add(item);
            }
        }
    }
}


Now, my assumption was that using a Dictionary might be faster than using a LINQ query. So I set up a test with a Stopwatch (not a manual one, the .NET System.Diagnostics.Stopwatch class).
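A minimal sketch of such a test harness could look like this (DuplicateTest is an assumed name for the class holding the two methods above; the messages match the output further down):

using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        // DuplicateTest is an assumed name for the class holding
        // the LINQ() and Dictionary() methods shown above.
        var test = new DuplicateTest();
        var stopwatch = new Stopwatch();

        Console.WriteLine("Starting Linq test");
        stopwatch.Start();
        test.LINQ();
        stopwatch.Stop();
        Console.WriteLine("Linq test finished in " + stopwatch.ElapsedMilliseconds + " ms.");

        Console.WriteLine("Starting Dictionary test. If there are a lot of collisions this might take a while...");
        stopwatch.Restart();
        test.Dictionary();
        stopwatch.Stop();
        Console.WriteLine("Dictionary test finished in " + stopwatch.ElapsedMilliseconds + " ms.");
    }
}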

Conclusion:

As it turns out, I was wrong. When the source contains no duplicates, the dictionary is indeed faster than the query; but as soon as there are duplicates, the dictionary is MUCH slower.

For these tests, the list "source" is filled with 1000 unique strings, and between 0 and 100% of those are inserted into the source a second time.
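A sketch of one way to build such a list (the method name BuildSource and the "item" prefix are just for illustration, not the original code):

using System.Collections.Generic;

// Builds 'count' unique strings, then re-inserts the given percentage
// of them a second time to create duplicates.
static List<string> BuildSource(int count, int duplicatePercentage)
{
    var source = new List<string>();
    for (int i = 0; i < count; i++)
    {
        source.Add("item" + i);
    }

    int duplicateCount = count * duplicatePercentage / 100;
    for (int i = 0; i < duplicateCount; i++)
    {
        source.Add("item" + i);
    }
    return source;
}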
Below is the output of the program:

Starting speed test.
Repetitions: 10.000, percentage of duplicates: 0

Starting Linq test
Linq test finished in 2438 ms.

Starting Dictionary test. If there are a lot of collisions this might take a while...
Dictionary test finished in 716 ms.

So far, so good, until we add some duplicates. The first time I ran the program I shut it down, because I thought it wasn't working right. As it turns out, catching exceptions just takes a really, really long time. Note that the output below uses just one hundred cycles:

Starting speed test.
Repetitions: 100, percentage of duplicates: 1

Starting Linq test
Linq test finished in 35 ms.

Starting Dictionary test. If there are a lot of collisions this might take a while...
Dictionary test finished in 5210 ms.

WOW. The dictionary takes almost 149 times as long!
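The real cost here is the throw/catch, not the Dictionary lookups themselves. A variant that avoids exceptions entirely, for example with a HashSet whose Add method returns false instead of throwing when the item is already present, could look like this sketch (it was not part of the measured test):

using System.Collections.Generic;

// Assumes the same fields as above: List<string> source,
// List<string> duplicates, int cycles.
public void HashSetFilter()
{
    var unique = new HashSet<string>();

    for (int i = 0; i < cycles; i++)
    {
        duplicates.Clear();
        unique.Clear();

        foreach (string item in source)
        {
            // Add() returns false when the item is already in the set,
            // so duplicates are detected without any exception.
            if (!unique.Add(item))
            {
                duplicates.Add(item);
            }
        }
    }
}

For more information on how to use LINQ, there are plenty of good examples online. I hope this helps you out, and have a great day coding!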
