I'd like to give a big "hats off" to all the work Jon Udell is doing to make the "semantic web" a reality, both by looking at ways to extract information from existing (X)HTML content and by proposing new ways of adding semantic information to the markup of new content. If you don't already read his blog, I can't recommend it highly enough. Anyone who's involved in the reading or writing of technical blogs must surely have at least a passing interest in this sort of stuff.

[BTW, I'm surprised that his work does not seem to have received much attention from the "big" blogging sites and/or engines. Unless I've missed it, of course...]

My humble contribution to these efforts is to point out a tool that I first heard about some time ago but which doesn't often get a mention (maybe its slightly unintuitive name belies just how useful it is): Simon Mourier's "HTML Agility Pack". I've not used it myself (only so many hours in the day, after all :), but it purports to let you read malformed HTML content just as easily as XML.
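
[To make "reading malformed HTML just as easily as XML" a bit more concrete, here is a rough sketch of the general idea. To be clear, this is not the HTML Agility Pack or its API (that's a .NET library I haven't tried); it leans on Python's standard-library `html.parser` purely as a stand-in for a forgiving parser, and the names `LinkCollector`, `extract_links`, `parses_as_xml` and `MESSY` are my own inventions for illustration.]

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link target found in the (possibly malformed) HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def parses_as_xml(text):
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample: unclosed tags, an unquoted attribute, a stray </b>.
MESSY = '<p>Unclosed paragraph <a href="/one">first<br><a href=/two>second</p></b>'

if __name__ == "__main__":
    print(parses_as_xml(MESSY))   # strict XML parsing gives up
    print(extract_links(MESSY))   # the lenient parser still finds the links
```

[The strict XML parser rejects the sample outright, while the lenient one shrugs and hands back the links anyway. That, as I understand it, is the sort of convenience the Agility Pack is aiming for, though it apparently goes further and gives you a proper document tree to query.]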

[Another aside: apologies for the dodgy title, but I just couldn't resist. As with much of my attempted humour, YMMV (your mirth may vary)]