This exercise may or may not have come up recently due to my circumstances; I don’t think my blogging about my efforts is going to significantly improve anyone else’s chances when their moment comes.

The idea in principle is dead simple: given a page on Wikipedia, how many steps/links/pages does it take to get to a given second page?

We don’t care how we get there; there is no need to record the route, just to count the steps. The only other rule is that you have to go from A -> B; you can’t start at both ends and converge somewhere in the middle, i.e. A -> C <- B.

I really thought I’d have this nailed. I’ve done so much with Ruby and Mechanize in the past, and this was an ideal fit. But I found it quite a humbling experience. For whatever reason (it could perhaps have been the cat relentlessly pawing at the door), I lost my cool in the last thirty minutes of the ninety allotted, shut down a bit, stopped thinking and communicating out loud, and just couldn’t get my brain to work: I knew what I was trying to do, but couldn’t even articulate it to myself, let alone anyone else.

Here’s exactly what I had when time was up:

#Want take url
#Destination url
#Both urls from wikipedia
#Shortest route from one to the next
#https://en.wikipedia.org/wiki/Ruby
#

@queue = {}

def findlinksbetween(url1, url2)
	#For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
	steps = 0
	found = false
	A = Mechanize.new
	page = A.get(url1)
	until found
		links = page.links_with(:href => /\/wiki\// )
		found, steps = checkfor(page, url2, steps, found)
		#Get next link to try.

		#Next link in list, but don't want to wipe out top list, etc
		page = A.get(link.href)
		#Then follow those links and try all the links on that page
		#Then try each in turn
	end
	put steps
end


def checkfor(page, url, steps, found)
	steps+=1 #this isn't right
	links = page.links_with(:href => /\/wiki\// )
	links.each do |link|
		#Need to stop if we find the url
		if link == url and not found
			#then we want to stop
			found = true
			break
		end
	end
	[found, steps]
end
		

	#Could search through links 
	#Need to keep a cache of links checked
	#follow justthe 
	#Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize?

Not pretty, is it? It doesn’t work at all; it’s more a collection of random thoughts. Like I said: a humbling experience. But how it nagged me. I couldn’t leave it like that when I knew I had the right idea somewhere in my brain.

This is what I had a few hours later:

#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'mechanize'

@previous_level = []
@current_level = []
@checked = {}
@a = Mechanize.new


def findlinksbetween(url1, url2)
	#For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
	step = 0
	found = false
	page = @a.get(url1)
	#Initialise level
	addlinks(page, step)
	loop do #Horrors
		#Check current level links for a match
		check(url2)
		#Not found so carry on
		@previous_level = @current_level
		@current_level = []
		step += 1
		@previous_level.each do |item|
			unless @checked[item[0]]
				begin
					page = @a.get(item[0])
					puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item[0]}\n"
					addlinks(page, step)
					@checked[item[0]] = true
					#TODO: Only problem by collating all the links for a level before checking is it takes a long time
				rescue
					#Could have been an image, etc
				end
			end
		end
		#but what happens
	end
end


def addlinks(page, step)
	#TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
	links = page.links_with(:href => /^\/wiki\// )
	links.each do |link|
		#Further filter out nonsense, colons should get most of the irrelevant stuff
		unless link.href =~ /:/ or link.href =~ /Main_Page/
			@current_level << [link.href, step]
		end
	end
	#Might have duplicates at current level, but that is ok as won't visit it twice
end


def check(url)
	@current_level.each do |item|
		if item[0] == url
			puts "Found #{url} after #{item[1]} step(s)\n"
			exit
		end
	end
end

This worked in principle, but was extremely slow and memory-inefficient. I had to kill it before I got a result, as it was using too much of the 1GB total I had on Linode.

You can see (maybe) what I was trying to do with my initial attempt, i.e.:

  1. Keeping track of the previous level’s URLs
  2. Keeping track of the current level’s URLs
  3. Keeping track of URLs already checked, so time isn’t wasted checking them again

The current level becomes the new previous level when we reach the end of a step, and so on.
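
That’s just a breadth-first search, level by level. Here’s a minimal sketch of the same idea in plain Ruby, where links_on(url) is a hypothetical stand-in for fetching a page with Mechanize and extracting its /wiki/ links (a direct link counts as 0 steps, matching the script’s output):

require 'set'

#Minimal BFS sketch. Note it will loop forever if the target is unreachable.
def shortest_steps(start_url, target)
	checked = Set.new
	current_level = Set.new(links_on(start_url))
	step = 0
	until current_level.include?(target)
		step += 1
		previous_level, current_level = current_level, Set.new
		previous_level.each do |url|
			next if checked.include?(url)
			current_level.merge(links_on(url))
			checked << url
		end
	end
	step
end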

You can also see bits that I knew were a problem, or inefficient, but that I didn’t really care about because I was sure I could figure them out later if need be (i.e. I didn’t see them as the main problem):

  • Filtering out irrelevant links, i.e. all the header and footer links on a Wikipedia page that have nothing to do with the article itself (see the sketch after this list).
  • Checking all the links at the end of a step - slower than checking on the fly.
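
On the first point, the repeated TODO about only taking links from bodyContent never got done, but it needn’t involve a round trip back through Mechanize: a Mechanize page exposes its parsed Nokogiri document via page.parser, so the extraction could plausibly be scoped like this (a sketch, assuming Wikipedia still wraps the article in a div with id bodyContent):

#Hypothetical alternative to page.links_with: scope the search to the
#article body and keep only /wiki/ hrefs, skipping the site furniture.
def article_links(page)
	page.parser.css('#bodyContent a[href^="/wiki/"]').map { |a| a['href'] }
end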

I decided to use Redis as a more memory-efficient alternative to Ruby arrays for my final attempt of the day (by that point it was purely an effort for my own self-esteem). Redis is really awesome, by the way:


#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'mechanize'
require 'redis'

#TODO: Maybe should make previous_level a redis database as well, could flip/flop between current/previous
@previous_level = []
@redis = Redis.new
#current level in 0, checked in 1
@a = Mechanize.new

#Flush just to be safe
@redis.flushall

def findlinksbetween(url1, url2)
	#For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
	step = 0
	found = false
	page = @a.get(url1)
	#Initialise level
	addlinks(page, step)
	loop do #Horrors
		#Check current level links for a match
		check(url2)
		#Not found so carry on
		@redis.select(0)
		@previous_level = @redis.keys
		@redis.flushdb
		step += 1
		@previous_level.each do |item|
			@redis.select(1)
			unless @redis.exists(item)
				begin
					page = @a.get(item)
					puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item}\n"
					addlinks(page, step)
					@redis.select(1)
					@redis.set(item, true)
					#TODO: Only problem by collating all the links for a level before checking is it takes a long time
				rescue
					#Could have been an image, etc
				end
			end
		end
		#but what happens
	end
end


def addlinks(page, step)
	@redis.select(0)
	#TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
	links = page.links_with(:href => /^\/wiki\// )
	links.each do |link|
		#Further filter out nonsense, colons should get most of the irrelevant stuff
		unless link.href =~ /:/ or link.href =~ /Main_Page/
			@redis.set(link.href, step)
		end
	end
end


def check(url)
	@redis.select(0)
	if @redis.exists(url)
		puts "Found #{url} after #{@redis.get(url)} step(s)\n"
		exit
	end
end

This was good enough to get it to run reasonably OK on the resources I had, but I knew improvements could still be made. They would have to wait for another day, though, as it was bedtime; I hadn’t worked on this constantly that evening, as I’d had other stuff to do.

Next, I acted on what I’d said to myself in a TODO comment and managed to use Redis’s built-in databases for both the previous and current levels, meaning I could easily switch between them. I also made the script callable from the command line to make it a bit easier to use.


#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
	opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
	opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"
@a = Mechanize.new

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
	step = 0
	found = false
	page = @a.get(Wikipedia+page1)
	#All other links should resolve relatively
	#Initialise level
	addlinks(page, step)
	loop do #Horrors
		#Check current level links for a match
		check(page2)
		#Not found so carry on
		#Switch previous and current
		@previous ^= 1
		@current ^= 1
		@redis.select(@current)
		@redis.flushdb
		step += 1
		@redis.select(@previous)
		keys = @redis.keys
		#No way not to avoid an array at some point
		keys.each do |item|
			@redis.select(Checked)
			unless @redis.exists(item)
				begin
					page = @a.get(item)
					puts "Level: #{step}, #{keys.length} urls: fetching page #{item}\n"
					addlinks(page, step)
					@redis.select(Checked)
					@redis.set(item, true)
					#TODO: Only problem by collating all the links for a level before checking is it takes a long time
				rescue
					#Could have been an image, etc
				end
			end
		end
	end
end


def addlinks(page, step)
	@redis.select(@current)
	#TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
	links = page.links_with(:href => /^\/wiki\// )
	links.each do |link|
		#Further filter out nonsense, colons should get most of the irrelevant stuff
		unless link.href =~ /:/ or link.href =~ /Main_Page/
			@redis.set(link.href, step)
		end
	end
end


def check(page)
	@redis.select(@current)
	if @redis.exists("/wiki/"+page)
		puts "Found #{page} after #{@redis.get("/wiki/"+page)} step(s)\n"
		exit
	end
end

if defined?(Page1) and defined?(Page2)
	findlinksbetween(Page1, Page2)
end

The next attempt was to use threading. I realised that, since I would have concurrent threads, I couldn’t use multiple Redis databases: each thread could override the database selection made by another, and things would get very messy. I decided instead to use Redis sets in just one database. The Mechanize bit became a separate function to make threading easier, i.e. I could easily extend this to three or more instances/threads. I also added better feedback (counting down the URLs remaining).
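
To make the SELECT problem concrete: the database selection lives on the shared connection, not on the thread, so one thread’s select can silently redirect another thread’s commands. A contrived sketch of the hazard (not code from the real script, and it assumes a local Redis):

require 'redis'

redis = Redis.new
#Each thread means to write to its own database, but select applies to
#the shared connection, so either write can land in either database
#depending on how the threads interleave.
t1 = Thread.new { redis.select(0); sleep 0.1; redis.set("a", 1) }
t2 = Thread.new { redis.select(1); sleep 0.1; redis.set("b", 2) }
[t1, t2].each(&:join)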

In making the threaded version I discovered the need for better error handling, e.g. checks for images and checks for a successful page load. On a single thread the program never failed to work; however, after adding threads (and before I put in the page-load check) it would blitz through but not always find the result, i.e. some page loads must have failed and been silently swallowed by the begin/rescue. This version makes sure to re-add a page to the list to be checked if it fails to load.

I also realised I had lied to myself when I’d commented “No way not to avoid an array at some point”.
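
What replaced the array was SPOP, which atomically removes and returns a random member of a set (nil once the set is empty), so each worker can pull work straight from Redis without anyone materialising the whole level. The pattern, with process(url) as a hypothetical stand-in for the per-url work:

while (url = @redis.spop(@previous))
	process(url)
end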


#Find shortest route between urls on Wikipedia

#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
	opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
	opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"

#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
	step = 0
	@redis.set("step", step)
	page = @a.get(Wikipedia+page1)
	#All other links should resolve relatively
	#Initialise level
	addlinks(page)
	#TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
	#Nah, i think things are failing and not repeating
	until $found or step == 6
		#Check current level links for a match
		check(page2)
		#TODO: Only problem by collating all the links for a level before checking is it takes a long time
		#Not found so carry on
		#Switch previous and current
		@previous ^= 1
		@current ^= 1
		step += 1
		@redis.set("step", step)
		@remaining = @redis.scard(@previous) #Count without fetching every member
		puts "***Starting*** Level: #{step}, #{@remaining} urls"
		t1 = Thread.new { mech(@a) }
		t2 = Thread.new { mech(@b) }
		Thread.new do
			until @remaining == 0
				puts "Level: #{step}, #{@remaining} urls"
				sleep 1
			end
		end
		#Makes sure both have finished though
		t1.join
		t2.join
	end
end


def addlinks(page)
	#TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
	links = page.links_with(:href => /^\/wiki\// )
	links.each do |link|
		#Further filter out nonsense, colons should get most of the irrelevant stuff
		unless link.href =~ /:/ or link.href =~ /Main_Page/
			@redis.sadd(@current, link.href)
		end
	end
end


def check(page)
	if @redis.sismember(@current, "/wiki/"+page)
		puts "Found #{page} after #{@redis.get("step")} step(s)\n"
		$found = true
		exit
	end
end

def mech(m)
	url = ""
	until url == nil
		url = @redis.spop(@previous)
		@remaining = @redis.smembers(@previous).length
		unless @redis.sismember(Checked, url)
			#Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
			page = m.get("https://en.wikipedia.org#{url}")
			#Skip images
			unless page.response["content-type"].include?("image")
				addlinks(page)
			end
			if page.code == "200"
				@redis.sadd(Checked, url)
			else
				#re add and check later
				@redis.sadd(@previous, url)
			end
		end
	end
	@remaining = 0
end


if defined?(Page1) and defined?(Page2)
	findlinksbetween(Page1, Page2)
end

However, the GIL (Ruby’s global interpreter lock) is still a thing, so threading wasn’t providing any real performance benefit, which prompted a fairly trivial switch from threads to processes instead. This represents the final state of the program; obviously further improvements are possible, but I was happy enough with everything in principle.


#Find shortest route between urls on Wikipedia

#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
	opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
	opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"

#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
	step = 0
	@redis.set("step", step)
	page = @a.get(Wikipedia+page1)
	#All other links should resolve relatively
	#Initialise level
	addlinks(page)
	#TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
	#Nah, i think things are failing and not repeating
	until $found or step == 6
		#Check current level links for a match
		check(page2)
		#TODO: Only problem by collating all the links for a level before checking is it takes a long time
		#Not found so carry on
		#Switch previous and current
		@previous ^= 1
		@current ^= 1
		step += 1
		@redis.set("step", step)
		@remaining = @redis.scard(@previous) #Count without fetching every member
		puts "***Starting*** Level: #{step}, #{@remaining} urls"
		fork { mech(@a, step) }
		fork { mech(@b, step) }
		#Makes sure both have finished though
		Process.waitall
	end
end


def addlinks(page)
	#TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
	links = page.links_with(:href => /^\/wiki\// )
	links.each do |link|
		#Further filter out nonsense, colons should get most of the irrelevant stuff
		unless link.href =~ /:/ or link.href =~ /Main_Page/
			@redis.sadd(@current, link.href)
		end
	end
end


def check(page)
	if @redis.sismember(@current, "/wiki/"+page)
		puts "Found #{page} after #{@redis.get("step")} step(s)\n"
		$found = true
		exit
	end
end

def mech(m, step)
	url = ""
	until url == nil
		url = @redis.spop(@previous)
		@remaining = @redis.smembers(@previous).length
		puts "Level: #{step}, #{@remaining} urls"
		unless @redis.sismember(Checked, url)
			#Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
			page = m.get("https://en.wikipedia.org#{url}")
			#Skip images
			unless page.response["content-type"].include?("image")
				addlinks(page)
			end
			if page.code == "200"
				@redis.sadd(Checked, url)
			else
				#re add and check later
				@redis.sadd(@previous, url)
			end
		end
	end
	@remaining = 0
end


if defined?(Page1) and defined?(Page2)
	findlinksbetween(Page1, Page2)
end

I decided to chuck some resources at the processes version and see if I could speed things up. The original task I had been set was to go from the Wikipedia Ruby page to the David Bowie page, but whilst developing this program I tested against a known single-step route: Ruby to Metastable (the Ruby page links to Diamond, which links to Metastable).

On a two-core system:

                 Real / User / Sys (seconds)

No threads:      22.993 / 15.360 / 2.923
Two threads:     21.142 / 15.867 / 3.780
Two processes:   13.149 / 14.493 / 2.553
Three processes: 10.132 / 13.620 / 2.423

On a four-core system:

                 Real / User / Sys (seconds)

No threads:      24.845 / 16.360 / 3.197
Two threads:     21.355 / 16.693 / 3.833
Two processes:   12.955 / 16.457 / 3.370
Four processes:   9.043 / 20.060 / 5.027

So the additional processes are definitely helping. Mind you, going six levels deep would be crazy on a single machine: it turns out David Bowie is only two levels deep, and even that took 108m6.189s real / 173m5.299s user / 7m24.900s sys on a four-core machine with four processes.