Friday, September 01, 2006 8:04 AM
richard
Sql Server 2005 Full-text Thesaurus: Abyss of failure and paucity of imagination
Ryan introduced me to the MS idea of "Pit of Success" where you just fall into using things the right way from the obvious way their API indicates use (form shows function). I'm simplifying, I'm sure, but that's the gist.
In trying to get the Sql Server 2005 thesaurus to work, I have stumbled into the opposite: a pit of failure. We have all felt ourselves fall into this pit at one time or another with a product or framework or API we've used. One where we curse under our breath and promise ourselves that we will never force consumers of our APIs to go through what we are now.
Such is the thesaurus feature of Sql Server 2005's full-text search functionality. Among the attributes of this particular feature that single it out for 'constructive criticism':
- The naming scheme of the files is based in 3-digit language codes. This means the tsEng.xml is NOT the thesaurus you edit, Americans (that's me, btw) it's tsEnU.xml - okay, I'm just an idiot on this one. But it's still, imo, a pit of success violation.
- The xml files are written in unicode with byte order marks and they cease to work if the byte order marks aren't written with the text file is. TextPad, which, oh, a few developers use here and there ;) doesn't write the byte order marks by default. Maybe that's TextPad's fault, but frankly, IE can render lots of web pages with some pretty effed-up html and make them human readable, why can't they load a stupid xml file for an English-localized db without byte marks?
- The thesaurus files are read into memory after a Sql Server service restart (well, technically at the first catalog query after a restart) and can't be changed without the server being restarted!
- Certain benign-looking data arrangements actually cause the entire thesaurus to fail to work. I'm not certain of all of them, but it appears that having an <expansion> tag where one of the substitution words is a contained in another word is one case. So if you have <expansion><sub>cover</sub><sub>recover</sub></expansion> it appears to me your entire thesaurus will fail to work. Combine that with having to re-start sql sever to test the fixes to your problems and you get a lot of wasted tiem.
- When something in the file fails, causing the whole thing to fail, it writes an unhelpful entry to the windows application log (not the sql server logs).
- The thesaurus file is per instance (AFAICT) meaning if you want a diff. thesaurus for a different application's full-text catalog, you have to install another (named) instance on your server - double ug.
I don't know about you, but, frankly, I could have written a nice tokenizer and synonym injection API in the time I took trying to figure out the crappyness of the thesaurus xml.
It's too bad, because in Sql Server 2005 there are some pretty stunning pits of success. The SMO, for me, was like this. Objects to manage your database that made a lot of sense. Anyway, I'm done with the thesaurus. I can whole-heartedly NOT recommend it.