Tuesday, 30 December 2008

Joining two tables / feeds using Yahoo! Pipes

(posted originally on 20 May 2008 in our Lulu blog)


I've noticed in the Yahoo! Pipes forums (http://discuss.pipes.yahoo.com/) a steady stream of people asking essentially the same question:- how to combine information from two feeds so that:

a) Fields are taken from both feeds, and combined into a single output feed.
b) Items are retained from the first feed only if they have a corresponding matching item in the second feed.

In a nutshell, what they want to do is to join the two feeds in the same way that they would join two tables using a SQL statement like this:

SELECT * FROM Table1, Table2
WHERE Table1.keyValue = Table2.keyValue

Here I present a Pipes example where I do just that.

The tables

In this example I have a PUBLISHERS table stored in a CSV file at http://www.btinternet.com/~lotontech/pipes/data/Publishers.csv, and a BOOKS table stored as a CSV file at http://www.btinternet.com/~lotontech/pipes/data/Books.csv.

The PUBLISHERS table contains:

PUBLISHER,URL
LOTONtech,http://www.lotontech.com
Wrox,http://www.wrox.com
Wiley,http://www.wiley.com
SAMS,http://www.sams.com

The BOOKS table contains:

PUBLISHER,BOOK,URL
LOTONtech,Mashup Case Studies with Yahoo! Pipes,http://www.lulu.com/content/2403205
LOTONtech,Creating Google Mashups...,http://search.barnesandnoble.com...
LOTONtech,Working with Yahoo! Pipes,http://www.amazon.com...
LOTONtech,Introduction to Microsoft Popfly,http://www.amazon.com...
Wrox,Professional Visual Studio 2005 Team System,http://www.amazon.com...
Wrox,Professional UML with Visual Studio .NET,http://www.amazon.com...
Wiley,Web Content Mining with Java,http://www.amazon.com...
Apress,Foundations of Popfly,http://www.amazon.com...

The results

My Pipe at http://pipes.yahoo.com/lotontech/cs_selectandjoin shows the result, in which:

  • The item Title is taken from BOOKS.Book
  • The hyperlink (if you click on the title) is the book's URL taken from BOOKS.URL
  • The description is the publisher's web site URL taken from PUBLISHERS.URL

Items are only included where the key values BOOKS.PUBLISHER and PUBLISHERS.PUBLISHER match, so the results exclude publisher SAMS (which appears in PUBLISHERS but not in BOOKS) and Apress (which appears in BOOKS but not in PUBLISHERS).

The Pipes

The basic pattern is to have a Pipe which traverses one of the feeds / tables, which — for each entry — invokes a sub Pipe that traverses the other feed / table.

My Pipe LOTONtech-CS-SelectPublishersJoinBooks at
http://pipes.yahoo.com/pipes/pipe.edit?_id=ehxw9rcl3RGkb6_InkartA reads the PUBLISHERS CSV file and, for each entry, passes the two fields PUBLISHER and URL into the sub Pipe LOTONtech-CS-SelectBooksJoinPublisher.

My Pipe LOTONtech-CS-SelectBooksJoinPublisher at
http://pipes.yahoo.com/pipes/pipe.edit?_id=gPCyKrcl3RG9ih2gyp1_DQ reads the BOOKS CSV file and combines into every entry the two fields (from PUBLISHERS) that were passed in; thus forming a cross-product of all possible combinations. Since we don't want all possible combinations, it then filters to retain only those where the PUBLISHER field in BOOKS matches the PUBLISHER field passed in from PUBLISHERS.

Other Considerations

Since the first Pipe (what I call the Outer Loop) reads its file only once, and the second Pipe (what I call the Inner Loop) reads its file each time it is invoked with an entry from the Outer Loop, it should be more efficient to read the longer file (BOOKS) in the Outer Loop and the shorter file (PUBLISHERS) in the Inner Loop. But I didn't think it was so intuitive for my example.

0 comments: