New search engine on its way!

Posts   
 
    
Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 08-May-2005 12:10:47   

Yesterday I implemented a great deal of the new search engine for this forum. It will be using wordlists created from the messages' text (only the text you add, no quotes, code etc simple_smile ). I've to add some more code to other parts of the forum so new messages also get indexed immediately, but after that it's good to go simple_smile .

Searches are very very fast, searching through 1000 messages (the forum currently has over 15,000 messages, I indexed a 1000 for testing) is instant and the queries are very compact simple_smile

I had to implement this search engine because my ISP doesn't support MS full text search, but it turned out to be not that hard simple_smile . The message parser returns a parse tree in XML, so I could grab the text a user added to a message with a couple of simple XPath queries, then jam that through a routine which creates a wordlist, using a common-words ignore list and some rules, and use a cached search word list to determine the word id's. Selfmaintaining simple_smile .

As the core part of this forum is made with LLBLGen 1.x, thus uses stored procedures and classes calling them, I have some porting to do (of those parts which are affected by the new search engine). The code is also written in hungarian coding (it's pretty old, I used to love hungarian coding style sunglasses ) so most of the work is just maintenance work: port more and more code over to llblgen pro code and clean up the hungarian coding style.

I hope to have it up and running by next weekend! simple_smile

Frans Bouma | Lead developer LLBLGen Pro
mdissel
User
Posts: 92
Joined: 16-Sep-2003
# Posted on: 08-May-2005 21:27:32   

Hello

You should have a look at dotLucene (based on the java Lucene implementation).. It has a very good searchengine ala google and the best part: it's opensource simple_smile

http://www.dotlucene.net

http://jakarta.apache.org/lucene/docs/index.html

Thanks

Marco

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 09-May-2005 09:29:13   

mdissel wrote:

Hello

You should have a look at dotLucene (based on the java Lucene implementation).. It has a very good searchengine ala google and the best part: it's opensource simple_smile

http://www.dotlucene.net

http://jakarta.apache.org/lucene/docs/index.html

Thanks Marco

Looks ok, but the main part of the search engine is embedding the code in the forum software, as per message, not all text has to be parsed. Looking at this searchengine I see that what it gains for me is writing the search code, but that's easy simple_smile

Frans Bouma | Lead developer LLBLGen Pro
mdissel
User
Posts: 92
Joined: 16-Sep-2003
# Posted on: 09-May-2005 10:02:07   

Otis wrote:

software, as per message, not all text has to be parsed. Looking at this searchengine I see that what it gains for me is writing the search code, but that's easy simple_smile

Removing the code can be included into a custom analyzer. Writing the searchcode with the same option isn't that easy, what about complex and/or, 'like queries, fuzzy queries (levenstein algorithm), multilanguage support, etc.. and not to forget the ranking algorithm.. I've played around with Sql Server FT search, but you have to write huge sql statements to have the same result..

Thanks

Marco

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 09-May-2005 10:48:39   

mdissel wrote:

Otis wrote:

software, as per message, not all text has to be parsed. Looking at this searchengine I see that what it gains for me is writing the search code, but that's easy simple_smile

Removing the code can be included into a custom analyzer.

The problem is that this search engine uses 'documents'. But the forum uses 'messages'. A full message is not interesting, as it contains for example quotes from a previous message, pieces of code etc. So the only parts interesting are the literal text elements added by the author of the message. This is known right after message parsing and right before the XML is transfered to HTML (using an XSLT). Creating wordlists from that for indexation is the part which is the most work and that's what I now have written already wink . So what it now gains me is not that much. Searching texts is not hard, you just have to create a good index. To get a good index from forum messages, you have to filter out what's not interesting, and that's the core part of what's the actual work. After that, you have a string of words for a given message to be indexed. It takes 2 routines, a wordlist and a set of rules which words should be ignored, to get the words which are interesting for the message.

You then take that wordlist and you check it with the list of cached searchwords. If a new word is seen, add it to the list. You return a set of wordids. Then you insert per wordid a message-word combination. That's it.

The engine I wrote isn't meant to be top notch, as that would indeed require a lot of work. What I required was an engine which was very flexible to meet the forum's setup and usage.

I'm not sure if the engine you suggested fits in that scenario. The thing is that if I spend a day trying to get something to work, I could also write the search indexer (took me 4 hours).

The main thing is that a search should be fast and stay fast. Ranking and other stuff, it's nice, but what I want is a search engine which searches very fast on a set of terms, as that's what users do, fill in a set of terms and they want results: a set of threads containing the terms. This engine you suggested stores paths, I'm not sure how that would result in messageid's etc, perhaps it works, I've to check.

Writing the searchcode with the same option isn't that easy, what about complex and/or, 'like queries, fuzzy queries (levenstein algorithm), multilanguage support, etc.. and not to forget the ranking algorithm.. I've played around with Sql Server FT search, but you have to write huge sql statements to have the same result..

The forum uses 1 language, so that's easy, the and/or queries aren't that hard, as that's mostly an input string parsing thing, ranking is not doable now (see below) and fuzzy queries... with indexing of texts you've to store a lot of data if you want to be able to do real fuzzy queries.

I store the text indexes in 2 tables: searchword and wordmessage. searchword has 2 fields: wordid (PK) and word (with an UC). wordmessage contains 2 fields: wordid and messageid (forming both the pk, separate indexes).

ranking is not doable because I don't store the wordcount of a given word in a message. I could do that, but would it bring me much? I'm not sure, as how to determine what the ranking is? Wordcount? perhaps the message at place 10 is more interesting..

(I could do a ranking based on the # of messages with the word in a thread)

Like queries are easy, as the wordlist can be used to retrieve messages based on like %word%, as it's done now, but then using an indexed wordlist (though a like %...% query isn't using an index) I'm not familiar with the levenstein algorithm, and there isn't any 'near' or 'far' information stored in the db (as in: on which index(es) is the word located in the message. )

The last thing I want to do is to spend a couple of days on getting a searchengine to work which doesn't fit the forum structure and more importantly: the ISP setup. I also didn't think the more hard-core search engine designs would be necessary (bit-voodoo to pack information into chains so retrieval of documents is fast) as this forum is meant for just one task: Q/A for LLBLGen Pro, and that makes things a bit easier to work with. (no multi-language, no babble-search, just technical terms etc.).

I'll see what the engine you suggested can do, but I doubt it will b easy to fit into the system, if possible at all, because of its structure to scan documents, which isn't what this forum deals with (i.e.: everything is stored in a db, not in files)

(edit): the documentation is pretty weak, but I now see the way I can add documents using wordlists I pull from the message parser. My concerns are: 1) what about multiple users hammering the search and also adding messages? 2) is the searcher safe? (i.e.: hack-proof? )

Frans Bouma | Lead developer LLBLGen Pro
netclectic avatar
netclectic
User
Posts: 255
Joined: 28-Jan-2004
# Posted on: 09-May-2005 12:04:14   

Are you updateing the searchwords when users edit messages? I've worked with phpBB a lot and your method is very similar to the method used there and with very large forums (admittedly much larger than here) there was quite a performance hit when you had users editing messages.

Just something to watch out for wink

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 09-May-2005 12:32:46   

netclectic wrote:

Are you updateing the searchwords when users edit messages? I've worked with phpBB a lot and your method is very similar to the method used there and with very large forums (admittedly much larger than here) there was quite a performance hit when you had users editing messages. Just something to watch out for wink

Yes, when you edit a message the message is re-indexed. It follows the same path as when a new message is added, with one exception: it first removes the words for the message, then it re-adds the new ones. That's the only difference.

Search words are never removed. That indeed would be a performance hit, as you have to re-scan the complete word-message table (which can be very very large) to see if there are searchwords which can be removed. It's not a real issue, as the amount of search words will likely be stable around 8000 words or so. The average english person knows about 6000 words (I read somewhere wink )

For 1000 messages (of the 15,000), the wordlist is about 4000 words long, but I have a lot of words to add to the ignore list of words (it already ignores all words wikipedia ignores too and the 1000 most used words in english) and some more filtering on forbidden characters (like the english pound sign etc.) for 1000 messages, the messageword table contains 19000 records, so on average the table will max out on 300,000 records at most (with 8 bytes per record). With proper ignore word list tuning, that amount can be brought down dramatically. I'm still in doubt if I'll add the wordcount per word in a message to the word-message row or not. It might be come in handy, but it is also a size increaser.

Frans Bouma | Lead developer LLBLGen Pro
Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 09-May-2005 17:03:17   

Otis wrote:

Yesterday I implemented a great deal of the new search engine for this forum. It will be using wordlists created from the messages' text (only the text you add, no quotes, code etc simple_smile ).

Frans,

This is great as the forum search was getting slower by the day wink

However... Why are you not including code? I often find what I am looking for by using code keywords. Have I missunderstood you?

Marcus

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 09-May-2005 19:25:40   

Marcus wrote:

Otis wrote:

Yesterday I implemented a great deal of the new search engine for this forum. It will be using wordlists created from the messages' text (only the text you add, no quotes, code etc simple_smile ).

Frans,

This is great as the forum search was getting slower by the day wink

Yeah, it's a real pain now simple_smile . The search will soon also be used to fast-link popular topics, so you can directly query the forum for 'prefetch paths' 'paging' and what not simple_smile

However... Why are you not including code? I often find what I am looking for by using code keywords. Have I missunderstood you?

I excluded code sections because it will produce a truck load of words which will pollute the word list a lot. At least that's what I think they will do.

Frans Bouma | Lead developer LLBLGen Pro
Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 09-May-2005 22:17:55   

Otis wrote:

I excluded code sections because it will produce a truck load of words which will pollute the word list a lot. At least that's what I think they will do.

mmm... I do feel the search quality will suffer as a result... disappointed

I know you know the API like the back of your hand, but others ("me" simple_smile ) benefit from the code snippets that you (especially) post in answer to a question which is often not explicitly stated in the docs...

Just my 2 cents (as the american would say!)

wink

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 09-May-2005 22:29:33   

simple_smile

It's one line of code to include them, so it's not such a big deal. I'll see if gives a lot of problems. simple_smile

Frans Bouma | Lead developer LLBLGen Pro
Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 10-May-2005 10:34:19   

With all this excitement about whether or not code sections would be included in your new search, I completely forgot how I found myself on this thread!!! simple_smile

I am also implementing a search engine for a site with large columns of data, the data being standard SQL rows and not handy nText fields... disappointed

Otis wrote:

Creating wordlists from that for indexation is the part which is the most work and that's what I now have written already wink . So what it now gains me is not that much. Searching texts is not hard, you just have to create a good index. To get a good index from forum messages, you have to filter out what's not interesting, and that's the core part of what's the actual work. After that, you have a string of words for a given message to be indexed. It takes 2 routines, a wordlist and a set of rules which words should be ignored, to get the words which are interesting for the message.

You then take that wordlist and you check it with the list of cached searchwords. If a new word is seen, add it to the list. You return a set of wordids. Then you insert per wordid a message-word combination. That's it.

I've got this same structure also but had planned to include some fuzziness...

The new/updated entities can be analysed for unique words that do not appear on the "ignore list" (which is an in memory hashtable). Once a unique word is found, it and all it's "similar" words are added to the DB's WorldList table (if they do not already appear there). When I say "similar" I mean, the word's plural / singlular, a soundex maybe even some Levenshtein (edit distance) variations (I would need a very clever algorithm to work out common misspellings). These similar words are generated at insert / update time and are added along with actual words that appeared in the row.

The WordList table has a unique contraint as you have, and will grow very large. But I figure SQL Server handles very large (narrow) table well and by adding variations on the words I can overcome the "LIKE" table scan requirement.

Again like you I store the EntityID / WordID in an intermediary table which also contains a score and count. The score is 0 for the original word and >0 for words which were inserted as "similar" words and the score depends on similarity.

The problem I'm finding as you correctly point out is the storage requirements to hold all of these combinations. The theory was always that DBs are FAST at reading vast amounts of narrow, exact match, indexed data.

I'm now thinking that there may be a compromise to be had....

If I keep only the original WordList in the DB, but also employ an in memory WordList (Singleton in the middle tier) which holds these words, their ID and generates the additional "similars" with scores at runtime... I can have SQL fetch slightly more data than we require using just the IDs for a given page of data thus eliminating the last JOIN with the WordList table.

To do this I get a list of WordIDs that match the user's entered keywords and all combinations of similar words. These IDs then form the WHERE clause of the fetch query. The SQL query fetches all possible Entities which match the original words OR all possible similar words.

On the middle tier then we have to run the length of the resultset (which should only be small) and assign a score based on the in memory score list and re-sort in order to pull the top most relevant results.

What about multiple users hammering the search and also adding messages?

The WordList table is not updated in real time... As user's insert or update rows which cause changes in the search, a job is created which can be run at a later time or on a thread with lower priority. As a consequence, this will also serialize updates to search tables. The plan is to use the Entity's timestamp to handle multiple updates to the EntityWordList table. If a Entity is updated again before it's associated WordLists are updated, earlier update jobs are thrown away as there must be a subsequent job with a more recent entry in the queue.

None of this has been implemented yet simple_smile as I'm just at the "thinking" stage.

But I'm curious, in your approach, what do you think the overhead of using "LIKE %word%" is on the WordList table?

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39774
Joined: 17-Aug-2003
# Posted on: 11-May-2005 12:11:44   

Marcus wrote:

With all this excitement about whether or not code sections would be included in your new search, I completely forgot how I found myself on this thread!!! simple_smile

I am also implementing a search engine for a site with large columns of data, the data being standard SQL rows and not handy nText fields... disappointed

Hmm. Well, ntext fields aren't more useful per se for a search wink .

Otis wrote:

Creating wordlists from that for indexation is the part which is the most work and that's what I now have written already wink . So what it now gains me is not that much. Searching texts is not hard, you just have to create a good index. To get a good index from forum messages, you have to filter out what's not interesting, and that's the core part of what's the actual work. After that, you have a string of words for a given message to be indexed. It takes 2 routines, a wordlist and a set of rules which words should be ignored, to get the words which are interesting for the message.

You then take that wordlist and you check it with the list of cached searchwords. If a new word is seen, add it to the list. You return a set of wordids. Then you insert per wordid a message-word combination. That's it.

I've got this same structure also but had planned to include some fuzziness...

The new/updated entities can be analysed for unique words that do not appear on the "ignore list" (which is an in memory hashtable). Once a unique word is found, it and all it's "similar" words are added to the DB's WorldList table (if they do not already appear there). When I say "similar" I mean, the word's plural / singlular, a soundex maybe even some Levenshtein (edit distance) variations (I would need a very clever algorithm to work out common misspellings). These similar words are generated at insert / update time and are added along with actual words that appeared in the row.

How do you determine all the 'similar' words? I think that requires some sort of dictionary ? (Sheep -> Sheep comes to mind wink )

The WordList table has a unique contraint as you have, and will grow very large. But I figure SQL Server handles very large (narrow) table well and by adding variations on the words I can overcome the "LIKE" table scan requirement.

If you also define an index, you can use LIKE 'word%' as well, which will use the index. So that could avoid the necessity for similar words, as a wildcard search will be using an index as well.

Again like you I store the EntityID / WordID in an intermediary table which also contains a score and count. The score is 0 for the original word and >0 for words which were inserted as "similar" words and the score depends on similarity.

that's clever! simple_smile

The problem I'm finding as you correctly point out is the storage requirements to hold all of these combinations. The theory was always that DBs are FAST at reading vast amounts of narrow, exact match, indexed data.

I'm now thinking that there may be a compromise to be had....

If I keep only the original WordList in the DB, but also employ an in memory WordList (Singleton in the middle tier) which holds these words, their ID and generates the additional "similars" with scores at runtime... I can have SQL fetch slightly more data than we require using just the IDs for a given page of data thus eliminating the last JOIN with the WordList table.

To do this I get a list of WordIDs that match the user's entered keywords and all combinations of similar words. These IDs then form the WHERE clause of the fetch query. The SQL query fetches all possible Entities which match the original words OR all possible similar words.

That indeed will save a join (or a subquery. I think a subquery is faster, but I have to test that)

What about multiple users hammering the search and also adding messages?

The WordList table is not updated in real time... As user's insert or update rows which cause changes in the search, a job is created which can be run at a later time or on a thread with lower priority. As a consequence, this will also serialize updates to search tables. The plan is to use the Entity's timestamp to handle multiple updates to the EntityWordList table. If a Entity is updated again before it's associated WordLists are updated, earlier update jobs are thrown away as there must be a subsequent job with a more recent entry in the queue.

You can do it in real time, though it has to be seen if that's efficient enough of course. (i.e.: depending on the amount of changes per minute or so). I push per message hte new words to the db and refetch the list, if there are new words found. This doesn't have to be slow, it's a dyn. list fetch which is then migrated into a hashtable.

But I'm curious, in your approach, what do you think the overhead of using "LIKE %word%" is on the WordList table?

That will be slower as that will scan the complete index. It still is faster than scanning the field itself. If users opt for a true wildcard search, they have to wait for the results somewhat longer. But I think that's not that bad, after all, a true wildcard search isn't solveable without a full index scan.

Frans Bouma | Lead developer LLBLGen Pro