Extending Ruby’s RSS Parser

If you’re doing what I’m doing, and need to parse an RSS feed that has lots of fun little tags in other namespaces you want to slurp up along with all the normal things, here’s something you can do.

We’re going to use the example I’ve been working on, a) because it allows me to point out an interesting problem, b) because it allows me to brag about what I’m working on, and c) because at this point I’m too tired to think through the logic of making an example work.

I’m writing a command-line feed reader in Ruby called bulletin. bulletin uses NewsGator to sync online feeds. Here’s an enticing, exciting pre-release preview screenshot!

[Screenshot: bulletin, the Ruby RSS feed reader for Linux]

In any event, there’s lots of cool metadata in NewsGator’s RSS feeds. The one piece I was interested in was whether or not an item in a feed has been read by the user. It appears in the feed as this element:

<ng:read>True</ng:read>

Awesome. So how do we go about getting this item and parsing it like it ain’t no thang? By extending Ruby’s RSS parser, like so.

First, we extend the Item class for RSS feed items to add an extra attribute:

require 'rss'

# Reopen the RSS 2.0 item class and register the extra NewsGator element.
module RSS; class Rss; class Channel; class Item
  install_text_element "ng:read", "http://newsgator.com/schema/extensions", '?', "read", :boolean, "ng:read"
end; end; end; end

Here’s what this means: We want a new element that looks like ng:read. It comes from this schema: http://newsgator.com/schema/extensions. The '?' is an occurrence marker: as far as I can tell from the source, it says the element is optional and may appear at most once in an item. The name of the attribute we will access it with is read. It’s a :boolean type. If we write an RSS feed back out, it will appear as ng:read in that feed.

That is, I think that’s all true. This is a lot of experimenting and diving through source.

Next, we tell the parser to look for another element:

RSS::BaseListener.install_get_text_element "http://newsgator.com/schema/extensions", "read", "read="

This says: Install this element into the parser. It comes from this schema: http://newsgator.com/schema/extensions. Its accessor method is read. Its setter method is read=.

And then you’re good! Well, except for one thing.

The name of this particular element, less its namespace, is read. The Listener needs to know what to call its accessor and setter methods, which means some reflection magic is being done behind the curtains. Yes! So you have to be extra careful with this particular Item, because its original read method has now been overwritten. All three places above where read appears as a parameter (the attribute name given to install_text_element, and the element name and setter given to install_get_text_element) have to match. I haven’t gotten it to work any other way.
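To make the clobbering concrete, here is a toy sketch of what happens when accessors are generated from an element name. It is not the RSS library’s code, and I have not confirmed the library works this way internally: ToyItem and install_flag are hypothetical names invented for this illustration, and the pre-existing read method is simply assumed.

```ruby
# A toy stand-in for a feed item. This is NOT the RSS library's Item class;
# ToyItem and install_flag are hypothetical names used only for illustration.
class ToyItem
  # Pretend this method existed before we extended the class.
  def read
    "original read method"
  end

  # A tiny class macro in the spirit of install_text_element: it builds a
  # reader and a writer whose names come straight from the element name.
  def self.install_flag(element_name)
    # define_method silently replaces any instance method of the same name.
    define_method(element_name) do
      instance_variable_get("@#{element_name}")
    end

    define_method("#{element_name}=") do |raw|
      # Store a real boolean, accepting "True"/"true" as well as true/false.
      instance_variable_set("@#{element_name}", raw.to_s.strip.casecmp?("true"))
    end
  end
end

ToyItem.new.read               # => "original read method"

ToyItem.install_flag("read")   # generates #read and #read=

item = ToyItem.new
item.read = "True"
item.read                      # => true (the original method is gone)
```

Nothing warns you when the old method disappears, which is why a generated accessor that shares a name with an existing method deserves extra care.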

The implications:

  • I haven’t found a way to give an element getters and setters whose names differ from its element name without the namespace.
  • Printing the item back out with to_s doesn’t appear to bring the new element with it, although from the looks of it my method above doesn’t provide for that no matter what the element is named.

I’d love to talk to someone who knows the internals a bit more — or at least someone who could help me write some documentation for Ruby’s RSS parser. This is a pretty important thing and it would be awesomely useful.

In the meantime, have fun with your newfound knowledge! We now have an Item#read method that gives us true or false, depending on what was parsed.
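One way to double-check what Item#read reports is to pull the same element out without the RSS parser at all. The sketch below uses REXML instead, so it shows a cross-check and not the extension technique itself. The sample feed is made up, the False value is only my assumption about what an unread item looks like, and read_flags is a hypothetical helper name. Only the namespace URI and the True value come from the feed described above.

```ruby
require "rexml/document"

# Namespace URI taken from the post; the feed below is a made-up sample.
NG_URI = "http://newsgator.com/schema/extensions"

SAMPLE_FEED = <<~XML
  <?xml version="1.0"?>
  <rss version="2.0" xmlns:ng="http://newsgator.com/schema/extensions">
    <channel>
      <title>Sample</title>
      <link>http://example.com/</link>
      <description>Made-up feed for illustration</description>
      <item>
        <title>First post</title>
        <ng:read>True</ng:read>
      </item>
      <item>
        <title>Second post</title>
        <ng:read>False</ng:read>
      </item>
    </channel>
  </rss>
XML

# Returns a hash of item title => read flag (true or false, or nil when
# the item has no ng:read element at all).
def read_flags(xml)
  doc = REXML::Document.new(xml)
  flags = {}
  REXML::XPath.each(doc, "//item") do |item|
    title = item.elements["title"]&.text
    # Map the "ng" prefix to the NewsGator namespace for this lookup.
    node = REXML::XPath.first(item, "ng:read", { "ng" => NG_URI })
    # Compare case-insensitively, since the feed writes "True".
    flags[title] = node && node.text.to_s.strip.casecmp?("true")
  end
  flags
end

FLAGS = read_flags(SAMPLE_FEED)
# FLAGS["First post"]  => true
# FLAGS["Second post"] => false
```

If the values here disagree with what Item#read gives you for the same feed, the capitalised True is the first thing I would look at.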

Let me know if you make any progress in figuring this beast out.

Three Reasons Why NewsGator Should Release An API

Observations | Ardekantur @ 10:53 pm
  1. It wouldn’t be any work for them, since they clearly already have an internal one they use for both FeedDemon and NetNewsWire.
  2. Handing the API out is a simple way to have people write programs for your service.
  3. Users, like me, may want to use NewsGator on their Linux machines, but don’t like the web interface.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2008 Ardekantur