IList.Contains

Posts   
 
    
mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 30-Dec-2005 19:25:19   

I've been performance profiling our code and noticed that the Contains function of the IList class that EntityCollectionBase2 derives from is very slow (it uses an iterative search). So, everytime you add to an entity collection, it has to search through all entities to make sure the entity is not already there. Would it be bad for the EntityCollectionBase2 to internally use a Dictionary or Hashtable object (which have O(1) lookup complexity) to do Contains lookups? The only drawback I see is that you would have to maintain this other object and sync it with the List object, but the performance gains for large collections could be very big I think.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39932
Joined: 17-Aug-2003
# Posted on: 30-Dec-2005 20:23:17   

Set DoNotPerformAddIfPresent to false, before adding an entity. This prevents the check and adds the entity directly.

A hashtable could be an option, but that fails if the entity is new as the PK isn't defined and IndexOf (which is internally used by CollectionBase to check if the entity exists in the collection) is using Equals on the entity which checks the PK, and if the entity is new, it does an instance compare.

Frans Bouma | Lead developer LLBLGen Pro
mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 30-Dec-2005 23:14:03   

Otis wrote:

Set DoNotPerformAddIfPresent to false, before adding an entity. This prevents the check and adds the entity directly.

A hashtable could be an option, but that fails if the entity is new as the PK isn't defined and IndexOf (which is internally used by CollectionBase to check if the entity exists in the collection) is using Equals on the entity which checks the PK, and if the entity is new, it does an instance compare.

I'm not sure why the Dictionary approach, using Dictionary.ContainsKey, would fail as it would use Equals just as List.Contains does, but List.Contains does an Equals call for every entity in the collection which is where the performance hit comes from.

Right now I'm deriving off of EntityCollectionBase2 and I implement this dictionary check myself (On add) with no obvious problems. It makes pretty much everything faster.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39932
Joined: 17-Aug-2003
# Posted on: 31-Dec-2005 00:28:33   

That won't work with two instances with the same values, as that would result in 2 keys but it should be the same key. ContainsKey doesn't use Equals, it calculates a hash from the value passed in and compares that to the hashes already available.

So you could use the Gethashcode() value but that can return the same value for different Pk's (as it's an int). The object fetch routine uses a hashtable with overflow buckets to check if an entity is already fetched, which is faster than equals.

You could argue to keep the set of hashtables in memory for the add routine, though this isn't done for memory usage reasons.

Frans Bouma | Lead developer LLBLGen Pro
mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 31-Dec-2005 19:49:53   

Otis wrote:

That won't work with two instances with the same values, as that would result in 2 keys but it should be the same key. ContainsKey doesn't use Equals, it calculates a hash from the value passed in and compares that to the hashes already available.

So you could use the Gethashcode() value but that can return the same value for different Pk's (as it's an int). The object fetch routine uses a hashtable with overflow buckets to check if an entity is already fetched, which is faster than equals.

You could argue to keep the set of hashtables in memory for the add routine, though this isn't done for memory usage reasons.

Hmm...why would two instances with the same value result in 2 keys? I'm on vacation and can't see the code right now sunglasses but I'm curious to know...

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39932
Joined: 17-Aug-2003
# Posted on: 02-Jan-2006 10:56:51   

mikeg22 wrote:

Otis wrote:

That won't work with two instances with the same values, as that would result in 2 keys but it should be the same key. ContainsKey doesn't use Equals, it calculates a hash from the value passed in and compares that to the hashes already available.

So you could use the Gethashcode() value but that can return the same value for different Pk's (as it's an int). The object fetch routine uses a hashtable with overflow buckets to check if an entity is already fetched, which is faster than equals.

You could argue to keep the set of hashtables in memory for the add routine, though this isn't done for memory usage reasons.

Hmm...why would two instances with the same value result in 2 keys? I'm on vacation and can't see the code right now sunglasses but I'm curious to know...

Check this code:


public class Foo
{
    public override int GetHashCode()
    {
        return 1;
    }
}

//...
Foo f = new Foo();
Foo g = new Foo();
Hashtable h = new Hashtable();
h.Add(f, null);
Console.WriteLine("Already there: {0}", h.ContainsKey(g));

Answer:

Already there: False

While you would expect 'true' because both have the same hash value: 1, according to your reasoning. But that's not how a hashtable works simple_smile . The hash value, is used to calculate the bucket to store the value. From there it's a linear search using the key. Which means that a hashtable can't be used to keep track of the existing values alone, you have to use a combination of hashtables and also, to find back an existing entity, although a different instance, requires PK matching, e.g.: hashvalue has to be the same and if so, the values have to be the same.

So you need a hashtable with all the hashvalues of the entities in the collection, and per hashvalue you need to store the entity itself so you can do an equals match so you can be sure the entity is indeed already in there, as multiple PK values can result to same hashvalue.

Frans Bouma | Lead developer LLBLGen Pro
mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 03-Jan-2006 18:32:16   

Otis wrote:

Check this code:


public class Foo
{
    public override int GetHashCode()
    {
        return 1;
    }
}

//...
Foo f = new Foo();
Foo g = new Foo();
Hashtable h = new Hashtable();
h.Add(f, null);
Console.WriteLine("Already there: {0}", h.ContainsKey(g));

Answer:

Already there: False

While you would expect 'true' because both have the same hash value: 1, according to your reasoning. But that's not how a hashtable works simple_smile . The hash value, is used to calculate the bucket to store the value. From there it's a linear search using the key. Which means that a hashtable can't be used to keep track of the existing values alone, you have to use a combination of hashtables and also, to find back an existing entity, although a different instance, requires PK matching, e.g.: hashvalue has to be the same and if so, the values have to be the same.

So you need a hashtable with all the hashvalues of the entities in the collection, and per hashvalue you need to store the entity itself so you can do an equals match so you can be sure the entity is indeed already in there, as multiple PK values can result to same hashvalue.

Sorry if I'm wasting your time with this, but those O(N^2) adds are really adding up to performance loss in our application. disappointed I'm sure this has been well thought out, but I need a little help understanding...

I'm still not sure I understand. I know why the example you gave will result with 2 Foo instances in 2 different buckets, but how does that relate to how it would work in your ORMSupportClasses? I don't see anything wrong with adding two empty entities into a hashtable ending up in 2 different buckets like in your example...Adding one empty entity twice would go into the same bucket, which would be the desired behavior.

Earlier you said:

That won't work with two instances with the same values, as that would result in 2 keys but it should be the same key. ContainsKey doesn't use Equals, it calculates a hash from the value passed in and compares that to the hashes already available.

when I first brought it up. Why would this result in 2 keys? The hashtable class calls GetHashCode on the key object to calculate the initial hash. The GetHashCode in the EntityFields class uses the PK values to create the hashcode...so why would


Dim A as new Entity
Dim B as new Entity
A.PK = 1
B.PK = 1
hashtable.Add A, true
If Not hashtable.ContainsKey(B)
    hashtable.Add B, true
End If

result in 2 filled buckets? ContainsKey would calculate the same value for GetHashCode, and the Equals in the EntityFields would say that A.Equals(B) is true...wouldn't it? ContainsKey does use Equals, but only for matching hashes (and A hashes the same as B).

You mentioned earlier that keeping the hashtable in memory would be a problem because of the memory loss. The O(1) memory loss would be much better than the O(N^2) performance loss for large entity collections, at least in our case.

Thanks for helping me understand smile

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39932
Joined: 17-Aug-2003
# Posted on: 04-Jan-2006 12:20:11   

mikeg22 wrote:

(...)

Sorry if I'm wasting your time with this, but those O(N^2) adds are really adding up to performance loss in our application. disappointed I'm sure this has been well thought out, but I need a little help understanding...

As explained earlier: set DoNotPerformAddIfPresent to false, which will make the add very fast. It's not always wanted, but you can control it in the situations where it's eating too much speed.

I'm still not sure I understand. I know why the example you gave will result with 2 Foo instances in 2 different buckets, but how does that relate to how it would work in your ORMSupportClasses? I don't see anything wrong with adding two empty entities into a hashtable ending up in 2 different buckets like in your example...Adding one empty entity twice would go into the same bucket, which would be the desired behavior.

The buckets aren't important, those are internals of the hashtable. what's important is that entities are identifyable through a datastore. An entity is data, it's not the container the data is in. This means that 2 instances of 'CustomerEntity' with the same data, represent teh same entity. To identify them as the same, the PK has to be checked. If the PK is a 2-3 field PK with all ints, you'll get duplicates (with strings as well). Each entity instances GetHashCode() calculates a hashcode from the PK value. This is to help multi-field PKs be usable fast.

As you can get duplicates, you can't simply do: myHashtable.Add(pkHash, entity);

you have to store it like this: ArrayList values = (ArrayList)myHashtable[pkHash]; values.Add(entity);

and to find it back, you first have to calculate the hash, then grab the arraylist, then traverse the arraylist by doing a Contains(entity).

This mechanism is used by the object fetch code (check ExecuteMultiRowRetrievalQuery() in DataAccessAdapterBase for example) to weed out duplicates fast. The downside is that it only works on existing entities.

NEW entities can cause problems, if they for example have an Identity PK: their PK is always 0 before they're saved. So Equals won't find them back.

But this isn't the reason why it's not implemented. The real reason is that it eats memory. If you have an entity collection with a lot of entities, it really eats memory. Add to that a large index of PK hashes, you're in for a lot of fun, hence the reason not to do it.

Earlier you said:

That won't work with two instances with the same values, as that would result in 2 keys but it should be the same key. ContainsKey doesn't use Equals, it calculates a hash from the value passed in and compares that to the hashes already available.

when I first brought it up. Why would this result in 2 keys? The hashtable class calls GetHashCode on the key object to calculate the initial hash. The GetHashCode in the EntityFields class uses the PK values to create the hashcode...so why would


Dim A as new Entity
Dim B as new Entity
A.PK = 1
B.PK = 1
hashtable.Add A, true
If Not hashtable.ContainsKey(B)
    hashtable.Add B, true
End If

result in 2 filled buckets? ContainsKey would calculate the same value for GetHashCode, and the Equals in the EntityFields would say that A.Equals(B) is true...wouldn't it? ContainsKey does use Equals, but only for matching hashes (and A hashes the same as B).

You mentioned earlier that keeping the hashtable in memory would be a problem because of the memory loss. The O(1) memory loss would be much better than the O(N^2) performance loss for large entity collections, at least in our case.

Thanks for helping me understand smile

Hashvalue != key! Hashvalue is used internally inside the hashtable to calculate the bucket (a general hashtable uses buckets which gets filled with every value which results in the same hashvalue, it's thus required that the hashvalue producing mechanism is good enough that all the buckets are filled equally as each bucket is searched linear).

The key is the identifying element, and that's used to retrieve teh value. The hashvalue calculated from the key is to determine the bucket where the actual key value is stored. That bucket is then traversed linear. This is faster than simply do a linear traversal of the complete value space, but the downside is that you have to keep the hashtable around with the collection. This can lead to memory consumption you don't want in a lot of cases. If you have 50,000 entities in memory and all have 2 entity collections inside themselves with 10 entities, you have a lot of extra memory eaten away by 100,000 hashtables.

As the core reason to add the hashtables is to get more performance, you can also skip the check. In v2.0 the check will be optional, and switched off by default. In .v1.0 it's switched on by default.

Frans Bouma | Lead developer LLBLGen Pro
mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 04-Jan-2006 17:15:27   

Ok, I think there's a small miscommunication. What I'm talking about doing is using the entity itself for the key. So Hashtable.Add(Entity1, True). The "value" is unimportant, I just gave it a value of True because it needs a value. When the Add happens, the entity is placed in the "key" field of the bucket determined by Entity.GetHashCode. GetHashCode is called automatically by the Hashtable class's Add routine. When ContainsKey(Entity1) is called, GetHashCode is called again on Entity1 to find a starting bucket to look in. For each possible bucket, bucket[location].key.Equals(Entity1) is called to see if the Entity1 "key" exists in the hashtable. When bucket[location].key.Equals(Entity1) returns true, ContainsKey returns true. So, the only thing that needs to be ensured is that Equals returns true when we want it to. The only thing that would need to be done is the EntityFields2.Equals would want to do an instance check to determine equivalence. This would be necessary for entities without PKs set.

By the way, I'm using: http://dotnet.di.unipi.it/Content/sscli/docs/doxygen/fx/bcl/hashtable_8cs-source.html

to see Microsoft's implementation of the Hashtable.

I see your point about memory usage, so I won't press the point anymore...I just wanted to be clear that I wasn't talking about using Hashtable.Add(Entity.GetHashCode, Entity). simple_smile

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39932
Joined: 17-Aug-2003
# Posted on: 05-Jan-2006 22:07:44   

I see we talk about the same things indeed smile . Sorry for the confusion simple_smile

Frans Bouma | Lead developer LLBLGen Pro