Sunday, February 16, 2003

Mail aggregator

My mail aggregator is working fine - thanks Sam for the 2 pointers. The big problem is of course the "standard" XML/RSS that is used. Just like in almost all other places where XML is used - the benefit of a standard syntax is countered by the complete random and obfuscated ( and countless) schemas and variations.



First problem: many feeds don't include the date - I added few regexp to extract common patterns ( Updated: ..., Posted: ... ) from the content. If I can't get it - I'll assume the time of the collection - which would be wrong for older news, but it'll be close enough as I update.



Second problem: Since most of the time you can't tell if a "permalink" has been updated - I have to cache the MD5, so I get new mail when the link content changes.



What's next ? One think I allways liked is the multipart mime - with HTML and images in the same message. I don't remember the details ( the links have special syntax ), but it shouldn't be difficult to fecth the images and save them with the entry, for full off-line reading. And the other side - using mail to update my own

weblog.



I expect more tweaks as I read more weblogs - each weblog I add has its own (standard :-) XML style.



BTW, I (re)discovered the "trick" to include the images in the mail - the program will need to grab all , add the content as mime parts and use "cid:" magic protocol, and Content-ID header in the parts. It shouldn't be very hard to code - but for now it's not a big priority, the mailer can get the images from the web when online and only few weblogs have images.

No comments: