Discovr: a flickr experiment gone wrong
Posted by Jorge Bernal November 08, 2009
I need help with this. I had a dream… Well, not so much a dream as an “It’d be cool to…”
I thought it’d be nice to discover new photos on flickr by starting from your favorite photos, finding the people who also favorited them, and then looking at those people’s other favorites. Still with me?
It’s actually quite simple code (about 500 lines, check it on github: discovr), but it’s terribly slow. Some possible reasons:
- Way too much data. I’ve found people with more than 18000 favorites, and there are photos with more than 2k fans. After limiting to the last 50 favorites, the numbers are still creepy. Following from my personal favorites (366), I discovered 1268 users and 52632 photos.
- Too complicated for an API. This is the kind of feature that wouldn’t be so hard to implement if you had access to the flickr database directly, but having to do so many requests adds a lot of time to the process.
- Inefficient library. I had to make some modifications to the flickr ruby library just to make it work, but it’s still quite inefficient in some cases. Want to know the URL of a picture (knowing the picture id)? 4 (completely unnecessary) API calls.
- My code is bad. OK, I know it’s ugly to start blaming everyone else. I know my code is not very good, as it’s a quick prototype. Still, I’m not sure that making my code/libraries better would be enough of an improvement given the network/API bottleneck.
The simplified algorithm goes like this.
    # method from class User
    def similar_pictures
      similar = {}
      favorites.each do |favorite|
        favorite.favorited_by.each do |user|
          user.favorites.each do |picture|
            # Key by picture id (assumed accessor) so a picture reached
            # through several users accumulates weight.
            similar[picture.id] ||= {:weight => 0, :picture => picture}
            similar[picture.id][:weight] += 1
          end
        end
      end
      # Most shared pictures first; drop anything seen only once.
      similar.values.sort {|a,b| b[:weight] <=> a[:weight]}.select {|v| v[:weight] > 1}
    end
So I’ve created a github repository and uploaded the code: discovr at github. Feel free to clone, test and improve.


Some ideas (too lazy for a patch):
1) taking very few favourites / users randomly
2) saving meta-data in a temporary cache: the first request will always be slow, but the next ones will be fast.
3) Hey, you could download Flickr… Not so crazy, after all google downloaded the internet.
1) could be nice. Switching from *most* favorited to *latest* favorited could be faster. I wonder if the results would be worth it.
2) Already keeps caches. But they won’t help most users, unless…
3) …it keeps running for a while and mirrors flickr
But that would take a huge database, and many users willing to wait 30 minutes to see some pics.
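For what it’s worth, ideas 1) and 2) could look roughly like the sketch below. It is not taken from discovr: `CachedFetcher`, `FAKE_FAVORITES` and the block passed to the constructor are made-up names that stand in for the real (slow) flickr calls, and the cache is just an in-memory hash.

```ruby
# Hypothetical wrapper: memoizes a slow lookup and samples a few items from it.
class CachedFetcher
  attr_reader :misses

  # The block stands in for a slow API call; it receives the key to fetch.
  def initialize(&fetch)
    @fetch = fetch
    @cache = {}
    @misses = 0
  end

  # Returns the cached value for key, calling the slow block only on a miss.
  def get(key)
    return @cache[key] if @cache.key?(key)
    @misses += 1
    @cache[key] = @fetch.call(key)
  end

  # Picks at most `limit` random items from the (cached) list for key.
  def sample(key, limit, random: Random.new)
    get(key).sample(limit, random: random)
  end
end

# Fake "API" data: user id => favorite photo ids (invented for the example).
FAKE_FAVORITES = {
  "alice" => [1, 2, 3, 4, 5],
  "bob"   => [2, 3]
}

fetcher = CachedFetcher.new { |user_id| FAKE_FAVORITES.fetch(user_id, []) }
fetcher.get("alice")                 # slow path: calls the block
fetcher.get("alice")                 # fast path: served from the hash
picked = fetcher.sample("alice", 2)  # two random favorites, no extra call
puts "misses: #{fetcher.misses}, picked: #{picked.inspect}"
```

Sampling keeps the number of follow-up requests bounded, and the cache only pays off when the same user or photo shows up again, which is the limitation mentioned above.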
1) Choosing at random would still show the most favorited photos more often, as they are linked to more people. They are more likely to be chosen XD.
That’s kinda like a friend-of-a-friend type problem, or at least it’s a graph.
See http://openquery.com/graph
Emulating a graph is never going to be fast/pretty/scalable, that might be the prob.
One way of limiting the number of users (although not sure it would actually make it faster) would be to go the last.fm way.
Last.fm makes a list of your “neighbors”, that is, the people who like the same kind of music you do. First find people who not only favorited one picture you favorited, but actually several. Make a list of the people who have the most in common with you, keep 10 to 20 of them, and find the images that most of them added as favorites. You will get more relevance this way imo.
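A rough sketch of that neighbor idea, assuming the favorites are already available as a plain hash of user id => photo ids. The method name `neighbor_recommendations`, its parameters and the sample data are all invented for illustration; this is not how last.fm or discovr actually does it.

```ruby
# favorites: hash of user id => array of favorite photo ids (hypothetical data).
# Returns photo ids `me` has not favorited yet, ranked by how many of
# the closest neighbors favorited them.
def neighbor_recommendations(favorites, me, neighbor_count: 10, min_shared: 2)
  mine = favorites.fetch(me)

  # Score every other user by the number of favorites shared with `me`.
  overlap = {}
  favorites.each do |user, photos|
    next if user == me
    shared = (photos & mine).size
    overlap[user] = shared if shared >= min_shared
  end

  # Keep only the users with the most favorites in common.
  neighbors = overlap.sort_by { |_user, shared| -shared }
                     .first(neighbor_count)
                     .map(&:first)

  # Count how many neighbors favorited each photo that is new to `me`.
  votes = Hash.new(0)
  neighbors.each do |user|
    (favorites[user] - mine).uniq.each { |photo| votes[photo] += 1 }
  end

  # Most voted first; ties broken by photo id to keep the order stable.
  votes.sort_by { |photo, count| [-count, photo] }.map(&:first)
end

favorites = {
  "me"   => [1, 2, 3],
  "ann"  => [1, 2, 10, 11],
  "ben"  => [2, 3, 10],
  "carl" => [3, 99]   # only one photo in common, so not a neighbor
}

p neighbor_recommendations(favorites, "me")   # prints [10, 11]
```

Requiring at least `min_shared` photos in common is what cuts the user list down: someone who shares a single popular favorite never becomes a neighbor, so their other favorites are never fetched.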