This may or may not have come up recently due to my circumstances; I don't think blogging about my efforts will significantly improve anyone else's chances when their moment comes.
The idea in principle is dead simple: given a page on Wikipedia, how many steps/links/pages does it take to get to a given second page? We don't care how we get there; there's no need to record the route, just count the steps. The only other rule is that you have to go from A -> B: you can't start at both ends and converge somewhere in the middle, i.e. A -> C <- B. As a concrete example (it's the test case I use later): the Ruby page links to Diamond, which links to Metastable, so Ruby to Metastable counts as one step.
I really thought I'd have this nailed: I've done so much with Ruby and Mechanize in the past, and this was an ideal fit. But I found it quite a humbling experience. For whatever reason (it could perhaps have been the cat relentlessly pawing at the door) I lost my cool in the last thirty minutes of the ninety allotted, shut down a bit, stopped thinking and communicating out loud, and just couldn't get my brain to work: I knew what I was trying to do, but couldn't even articulate it to myself, let alone anyone else.
Here’s exactly what I had when time was up:
#Want take url
#Destination url
#Both urls from wikipedia
#Shortest route from one to the next
#https://en.wikipedia.org/wiki/Ruby
#

@queue = {}

def findlinksbetween(url1, url2)
  #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
  steps = 0
  found = false
  A = Mechanize.new
  page = A.get(url1)
  until found
    links = page.links_with(:href => /\/wiki\// )
    found, steps = checkfor(page, url2, steps, found)
    #Get next link to try.
    #Next link in list, but don't want to wipe out top list, etc
    page = A.get(link.href)
    #Then follow those links and try all the links on that page
    #Then try each in turn
  end
  put steps
end

def checkfor(page, url, steps, found)
  steps+=1 #this isn't right
  links = page.links_with(:href => /\/wiki\// )
  links.each do |link|
    #Need to stop if we find the url
    if link == url and not found
      #then we want to stop
      found = true
      break
    end
  end
  [found, steps]
end

#Could search through links
#Need to keep a cache of links checked
#follow justthe
#Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize?
Not pretty, is it? It doesn't work at all; it's more like a collection of random thoughts. Like I said: a humbling experience. But how it nagged at me. I couldn't leave it like that when I knew I had the right idea somewhere in my brain.
This is what I had a few hours later:
#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route
require 'mechanize'

@previous_level = []
@current_level = []
@checked = {}
@a = Mechanize.new

def findlinksbetween(url1, url2)
  #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
  step = 0
  found = false
  page = @a.get(url1)
  #Initialise level
  addlinks(page, step)
  loop do #Horrors
    #Check current level links for a match
    check(url2)
    #Not found so carry on
    @previous_level = @current_level
    @current_level = []
    step += 1
    @previous_level.each do |item|
      unless @checked[item[0]]
        begin
          page = @a.get(item[0])
          puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item[0]}\n"
          addlinks(page, step)
          @checked[item[0]] = true
          #TODO: Only problem by collating all the links for a level before checking is it takes a long time
        rescue
          #Could have been an image, etc
        end
      end
    end
    #but what happens
  end
end

def addlinks(page, step)
  #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
  links = page.links_with(:href => /^\/wiki\// )
  links.each do |link|
    #Further filter out nonsense, colons should get most of the irrelevant stuff
    unless link.href =~ /:/ or link.href =~ /Main_Page/
      @current_level << [link.href, step]
    end
  end
  #Might have duplicates at current level, but that is ok as won't visit it twice
end

def check(url)
  @current_level.each do |item|
    if item[0] == url
      puts "Found #{url} after #{item[1]} step(s)\n"
      exit
    end
  end
end
Which worked in principle, but was extremely slow and memory hungry: an article can easily link out to hundreds of others, so the number of URLs per level explodes. I had to kill it before I got a result as it was using too much of the 1GB total I had on Linode.
You can see (maybe) what I was trying to do with my initial attempt, i.e.:
- Keeping track of the previous level's URLs
- Keeping track of the current level's URLs
- Keeping track of URLs already checked so time isn't wasted checking them again
The current level becomes the next previous level when we reach the end of a level, and so on.
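In other words, a breadth-first search. Stripped of all the scraping, the bookkeeping I was aiming for looks something like this minimal sketch (an in-memory Hash stands in for Wikipedia, and steps_between is just my name for it here, not anything from the code above):

require 'set'

#Minimal breadth-first search sketch: graph is a Hash of node => [linked nodes]
#Returns the number of intermediate nodes between start and target,
#matching the step counting used in this post (a direct link is 0 steps)
def steps_between(graph, start, target)
  current_level = graph[start] || []
  checked = Set[start]
  step = 0
  until current_level.include?(target)
    return nil if current_level.empty? #target unreachable
    step += 1
    previous_level = current_level
    current_level = []
    previous_level.each do |node|
      next if checked.include?(node) #don't fetch the same node twice
      checked.add(node)
      current_level.concat(graph[node] || [])
    end
  end
  step
end

graph = { "Ruby" => ["Diamond"], "Diamond" => ["Metastable"] }
puts steps_between(graph, "Ruby", "Metastable") #=> 1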
You can also see the bits that I knew were problems or inefficiencies, but that I didn't really care about because I was sure I could figure them out later if need be (i.e. I didn't see them as the main problem):
- Filtering out irrelevant links, i.e. all the header and footer links on a Wikipedia page which have nothing to do with the article itself (see the sketch after this list).
- Checking all the links at the end of a step, which is slower than checking on the fly.
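For what it's worth, the bodyContent filtering I kept deferring in the TODO comments looks doable without leaving Mechanize: a Mechanize page exposes its parsed Nokogiri document as page.parser, so a sketch along these lines should work (assuming Wikipedia still wraps the article text in a div with id bodyContent, which I haven't verified):

#Sketch: only collect /wiki/ links from inside the article body,
#skipping the header/footer/sidebar noise entirely
def article_links(page)
  page.parser.css('#bodyContent a[href^="/wiki/"]')
      .map { |a| a["href"] }
      .reject { |href| href =~ /:/ or href =~ /Main_Page/ }
      .uniq
end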
For my final attempt of the day (by that point it was purely an effort for my own self-esteem) I decided to use Redis as a more memory-efficient alternative to Ruby arrays. Redis is really awesome, by the way:
#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route
require 'mechanize'
require 'redis'

#TODO: Maybe should make previous_level a redis database as well, could flip/flop between current/previous
@previous_level = []
@redis = Redis.new
#current level in 0, checked in 1
@a = Mechanize.new
#Flush just to be safe
@redis.flushall

def findlinksbetween(url1, url2)
  #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
  step = 0
  found = false
  page = @a.get(url1)
  #Initialise level
  addlinks(page, step)
  loop do #Horrors
    #Check current level links for a match
    check(url2)
    #Not found so carry on
    @redis.select(0)
    @previous_level = @redis.keys
    @redis.flushdb
    step += 1
    @previous_level.each do |item|
      @redis.select(1)
      unless @redis.exists(item)
        begin
          page = @a.get(item)
          puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item}\n"
          addlinks(page, step)
          @redis.select(1)
          @redis.set(item, true)
          #TODO: Only problem by collating all the links for a level before checking is it takes a long time
        rescue
          #Could have been an image, etc
        end
      end
    end
    #but what happens
  end
end

def addlinks(page, step)
  @redis.select(0)
  #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
  links = page.links_with(:href => /^\/wiki\// )
  links.each do |link|
    #Further filter out nonsense, colons should get most of the irrelevant stuff
    unless link.href =~ /:/ or link.href =~ /Main_Page/
      @redis.set(link.href, step)
    end
  end
end

def check(url)
  @redis.select(0)
  if @redis.exists(url)
    puts "Found #{url} after #{@redis.get(url)} step(s)\n"
    exit
  end
end
This was good enough to get it running reasonably okay on the resources I had, but I knew improvements could still be made. Those would have to wait for another day though, as it was bedtime; I hadn't worked on this constantly that evening, I'd had other stuff to do.
Next, I figured out what I'd said to myself in a comment and managed to use Redis's built-in databases for both the previous and current levels, meaning I could easily switch between them (Redis gives you sixteen numbered databases out of the box, selected with SELECT, so flipping an index between 0 and 1 is all it takes). I also made it callable from the command line to make it a bit easier to use.
#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route
require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
  opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
  opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"
@a = Mechanize.new
#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
  step = 0
  found = false
  page = @a.get(Wikipedia+page1)
  #All other links should resolve relatively
  #Initialise level
  addlinks(page, step)
  loop do #Horrors
    #Check current level links for a match
    check(page2)
    #Not found so carry on
    #Switch previous and current
    @previous ^= 1
    @current ^= 1
    @redis.select(@current)
    @redis.flushdb
    step += 1
    @redis.select(@previous)
    keys = @redis.keys
    #No way not to avoid an array at some point
    keys.each do |item|
      @redis.select(Checked)
      unless @redis.exists(item)
        begin
          page = @a.get(item)
          puts "Level: #{step}, #{keys.length} urls: fetching page #{item}\n"
          addlinks(page, step)
          @redis.select(Checked)
          @redis.set(item, true)
          #TODO: Only problem by collating all the links for a level before checking is it takes a long time
        rescue
          #Could have been an image, etc
        end
      end
    end
  end
end

def addlinks(page, step)
  @redis.select(@current)
  #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
  links = page.links_with(:href => /^\/wiki\// )
  links.each do |link|
    #Further filter out nonsense, colons should get most of the irrelevant stuff
    unless link.href =~ /:/ or link.href =~ /Main_Page/
      @redis.set(link.href, step)
    end
  end
end

def check(page)
  @redis.select(@current)
  if @redis.exists("/wiki/"+page)
    puts "Found #{page} after #{@redis.get("/wiki/"+page)} step(s)\n"
    exit
  end
end

if defined?(Page1) and defined?(Page2)
  findlinksbetween(Page1, Page2)
end
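For reference, running this version looks something like the following (wikilinks.rb is just a placeholder for whatever you save the script as; page names are the part of the URL after /wiki/):

ruby wikilinks.rb -p Ruby -P Metastable

which, per the check function above, should eventually print Found Metastable after 1 step(s).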
The next attempt was to use threading. I realised that, with concurrent threads, I couldn't use multiple Redis databases: each thread could override the database selection made in another thread and things would get very messy. I decided to use Redis sets in a single database instead. The Mechanize bit became a separate function to make threading it easier, i.e. I could easily extend this to three instances/threads. I also added better feedback (counting down the URLs remaining).
In making the threaded version I discovered the need for better error handling, e.g. checks for images and checks for a successful page load. On a single thread the program never failed to work; however, after adding threads (before I put in the page-load check) it would blitz through but not always find the result, i.e. some page loads must have failed and been silently swallowed by the begin/rescue. This approach makes sure to re-add a page to the list to be checked if it fails to load.
I also realised I had lied to myself when I'd commented "No way not to avoid an array at some point": SPOP lets each worker pop URLs straight off the previous level's set, so the array disappears entirely.
#Find shortest route between urls on Wikipedia
#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock
require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
  opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
  opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"
#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false
#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
  step = 0
  @redis.set("step", step)
  page = @a.get(Wikipedia+page1)
  #All other links should resolve relatively
  #Initialise level
  addlinks(page)
  #TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
  #Nah, i think things are failing and not repeating
  until $found or step == 6
    #Check current level links for a match
    check(page2)
    #TODO: Only problem by collating all the links for a level before checking is it takes a long time
    #Not found so carry on
    #Switch previous and current
    @previous ^= 1
    @current ^= 1
    step += 1
    @redis.set("step", step)
    @remaining = @redis.smembers(@previous).length
    puts "***Starting*** Level: #{step}, #{@remaining} urls"
    t1 = Thread.new { mech(@a) }
    t2 = Thread.new { mech(@b) }
    Thread.new do
      until @remaining == 0
        puts "Level: #{step}, #{@remaining} urls"
        sleep 1
      end
    end
    #Makes sure both have finished though
    t1.join
    t2.join
  end
end

def addlinks(page)
  #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
  links = page.links_with(:href => /^\/wiki\// )
  links.each do |link|
    #Further filter out nonsense, colons should get most of the irrelevant stuff
    unless link.href =~ /:/ or link.href =~ /Main_Page/
      @redis.sadd(@current, link.href)
    end
  end
end

def check(page)
  if @redis.sismember(@current, "/wiki/"+page)
    puts "Found #{page} after #{@redis.get("step")} step(s)\n"
    $found = true
    exit
  end
end

def mech(m)
  url = ""
  until url == nil
    url = @redis.spop(@previous)
    break if url.nil? #Set exhausted, nothing left for this worker
    @remaining = @redis.smembers(@previous).length
    unless @redis.sismember(Checked, url)
      #Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
      page = m.get("https://en.wikipedia.org#{url}")
      #Skip images
      unless page.response["content-type"].include?("image")
        addlinks(page)
      end
      if page.code == "200"
        @redis.sadd(Checked, url)
      else
        #re add and check later
        @redis.sadd(@previous, url)
      end
    end
  end
  @remaining = 0
end

if defined?(Page1) and defined?(Page2)
  findlinksbetween(Page1, Page2)
end
However, the GIL (MRI's Global Interpreter Lock) is still a thing, so threading wasn't providing any real performance benefit, which prompted a fairly trivial switch from threads to processes. This represents the final state of the program; obviously further improvements are possible, but I was happy enough with everything in principle. One side effect of forking worth noting: each child gets its own copy of the interpreter, so instance variables like @remaining no longer make it back to the parent, which is why the progress printing moved into mech itself and why all shared state has to live in Redis.
#Find shortest route between urls on Wikipedia
#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock
require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
  opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
  opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"
#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false
#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
  step = 0
  @redis.set("step", step)
  page = @a.get(Wikipedia+page1)
  #All other links should resolve relatively
  #Initialise level
  addlinks(page)
  #TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
  #Nah, i think things are failing and not repeating
  until $found or step == 6
    #Check current level links for a match
    check(page2)
    #TODO: Only problem by collating all the links for a level before checking is it takes a long time
    #Not found so carry on
    #Switch previous and current
    @previous ^= 1
    @current ^= 1
    step += 1
    @redis.set("step", step)
    @remaining = @redis.smembers(@previous).length
    puts "***Starting*** Level: #{step}, #{@remaining} urls"
    fork { mech(@a, step) }
    fork { mech(@b, step) }
    #Makes sure both have finished though
    Process.waitall
  end
end

def addlinks(page)
  #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
  links = page.links_with(:href => /^\/wiki\// )
  links.each do |link|
    #Further filter out nonsense, colons should get most of the irrelevant stuff
    unless link.href =~ /:/ or link.href =~ /Main_Page/
      @redis.sadd(@current, link.href)
    end
  end
end

def check(page)
  if @redis.sismember(@current, "/wiki/"+page)
    puts "Found #{page} after #{@redis.get("step")} step(s)\n"
    $found = true
    exit
  end
end

def mech(m, step)
  url = ""
  until url == nil
    url = @redis.spop(@previous)
    break if url.nil? #Set exhausted, nothing left for this worker
    @remaining = @redis.smembers(@previous).length
    puts "Level: #{step}, #{@remaining} urls"
    unless @redis.sismember(Checked, url)
      #Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
      page = m.get("https://en.wikipedia.org#{url}")
      #Skip images
      unless page.response["content-type"].include?("image")
        addlinks(page)
      end
      if page.code == "200"
        @redis.sadd(Checked, url)
      else
        #re add and check later
        @redis.sadd(@previous, url)
      end
    end
  end
  @remaining = 0
end

if defined?(Page1) and defined?(Page2)
  findlinksbetween(Page1, Page2)
end
I decided to chuck some resources at the processes version and see if I could speed things up. The original task I'd been set was to go from the Wikipedia Ruby page to the David Bowie page, but whilst developing this program I tested against a known single step, Ruby to Metastable (the Ruby page links to Diamond, which links to Metastable).
On a two-core system (real/user/sys, in seconds):
- No threads: 22.993 / 15.360 / 2.923
- Two threads: 21.142 / 15.867 / 3.780
- Two processes: 13.149 / 14.493 / 2.553
- Three processes: 10.132 / 13.620 / 2.423
On a four-core system:
- No threads: 24.845 / 16.360 / 3.197
- Two threads: 21.355 / 16.693 / 3.833
- Two processes: 12.955 / 16.457 / 3.370
- Four processes: 9.043 / 20.060 / 5.027
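The real/user/sys figures come from the shell's time, i.e. invocations along the lines of (same placeholder filename as before):

time ruby wikilinks.rb -p Ruby -P Metastable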
So the additional processes are definitely helping. Mind you, going six levels deep would be crazy on a single machine: it turns out David Bowie is only two levels deep, and on a four-core system with four processes that still took 108m6.189s real / 173m5.299s user / 7m24.900s sys.