atomicules

Push propelled program tinkerer and picture maker.

Finally Switched NetBSD From XEN To KVM On Linode

Finally got around to it. Thanks to some tips from a fellow Linode/NetBSDer I was able to avoid almost all of the pitfalls:

  1. Ahead of the migration, edit /etc/fstab and change all xbd1.* to wd0.*. They were "1"s because XEN required a boot disk, but since that was going to be deleted with KVM I knew the disk would change to "0".
  2. Also before migration, edit rc.conf and change xennet0 to wm0.
  3. And ideally before migration, but I forgot and only did it afterwards: edit /etc/cgd/cgd.conf and change xbd1.* to wd0.*, and rename /etc/cgd/xbd1e to /etc/cgd/wd0e.
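Those three edits boil down to something like the following sketch (run as root before shutting down the XEN configuration; this assumes your sed supports -i, otherwise just edit the files by hand):

# 1. fstab: the disk becomes wd0 instead of xbd1
sed -i.bak 's/xbd1/wd0/g' /etc/fstab
# 2. rc.conf: the network interface becomes wm0 instead of xennet0
sed -i.bak 's/xennet0/wm0/g' /etc/rc.conf
# 3. cgd: same disk rename in the config, plus the params file itself
sed -i.bak 's/xbd1/wd0/g' /etc/cgd/cgd.conf
mv /etc/cgd/xbd1e /etc/cgd/wd0e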

And that's it. It was much less painful than I was anticipating. At the moment I don't have serial bootblocks so I don't have Lish access, only Glish, but come the next NetBSD update I'll correct that.

NetBSD Under KVM On Linode

In a way this really doesn't need a blog post. It's not as fiddly as XEN was and, if you are willing to just do everything through Glish, it pretty much proceeds as expected. But my concern was being reliant on Glish, as until recently it didn't work in Firefox on NetBSD (it does now) and connecting via Lish would give you a blank screen (not fun). But you can get both to work:

  1. Create three disks, 1x 1024 MB (ext) called "Rescue", 1x 1024 MB (raw) called "Install" and 1x the remainder (raw) called "NetBSD". An extra Rescue disk is required owing to the size of the image and unzipping it.
  2. Create two configuration profiles, using Direct Disk, Full Virtualisation and with the FileSystem/Boot helpers turned off. One has the NetBSD and Install disks mounted and boots from the Install disk in order to install to the NetBSD disk; the second is the final one, which just boots NetBSD.
  3. Boot into Rescue mode with the Rescue disk first, Install disk second, etc.
  4. mount /dev/sda /media/sda and then cd /media/sda.
  5. Get the USB image, wget http://cdn.netbsd.org/pub/NetBSD/NetBSD-7.1/images/NetBSD-7.1-amd64-install.img.gz.
  6. gunzip it and copy it to the Install disk: dd if=NetBSD-7.1-amd64-install.img of=/dev/sdb.
  7. Boot the install configuration and within 30 seconds get the Lish console open and press space to drop out of the menu. Select 4 for the boot prompt, enter consdev auto, then menu, then press enter to boot. (If you want to read about consdev it's in man 8 boot_console.)
  8. For whatever reason it fails to automatically launch the install, so log in as root and then type sysinst to run the installer.
  9. Go through the install; on the bootblocks screen accept the default of "Use serial port com0" and then under "Set serial baud rate" select 115200.
  10. Shut down the install configuration and then boot the NetBSD one. Hey presto! Glish and Lish both work.
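For copy-paste purposes, steps 4 to 6 boil down to roughly this (assuming the Rescue disk shows up as /dev/sda and the Install disk as /dev/sdb, as above):

# Mount the Rescue disk and work from there
mount /dev/sda /media/sda
cd /media/sda
# Fetch and unzip the USB install image
wget http://cdn.netbsd.org/pub/NetBSD/NetBSD-7.1/images/NetBSD-7.1-amd64-install.img.gz
gunzip NetBSD-7.1-amd64-install.img.gz
# Write it to the Install disk
dd if=NetBSD-7.1-amd64-install.img of=/dev/sdb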

I haven't converted my XEN install to KVM yet though, but I might soon.


[EDIT: 2017-06-26] Wow, if you want to play with NetBSD 8.0 Beta then you might as well double the size of the Rescue and Install disks.

tmux: Kill all sessions except these

I use tmux an awful lot at work since it allows for a workflow where each session name references a separate piece of work that may need to be returned to (it's impossible to know upfront); if it does, it's far easier to pull up the previous session than to start from scratch.

However, the problem with this approach is that it's really easy to rapidly accumulate sessions and have no idea which ones need to be kept hanging around, which is how I regularly end up with fifty sessions; in fact it's not unusual for me to end up with over a hundred.

This means every few weeks I need to purge my sessions. Until I've figured out a way to automatically look up the work id and see if it's closed (there is an API, but I've not figured out whether that kind of query is possible yet) I either have to close sessions one at a time or hope there is just one session I want to keep so I can use tmux kill-session -a -t theoneIwanttokeep. Which is great if there is just one I want to keep, but invariably I know for a fact there are four or five I want.

So I finally wrote a simple script to do just that:

#!/bin/bash
# Kill every tmux session except those named as arguments
for i in $(tmux list-sessions -F '#S'); do
  if [[ ! $@ =~ $i ]]; then
    tmux kill-session -t "$i"
  fi
done

Then I can call this as ./tkse sessiona sessionb sessionc sessiond, etc and it'll kill everything except those sessions.

This won't work if a session name has a space in it, but what kind of heathen does that?

Making use of Haskerdeux

I honestly thought that after I got Haskerdeux working again I'd not be doing much else to improve it, but lo and behold the whole "fix your own frustrations" thing kicked in and I ended up improving it so I could use it to work on todos of any date. Why? Because I wanted to be able to easily script adding a whole bunch of todos on various dates in the future.

I actually ended up doing some Ruby to (technically) Haskell scripting as follows:

require 'helpscout'
helpscout = HelpScout::Client.new("<api key>")

# Read in an array of "conversations" I had from a previous action
# I.e. a text file with one of something like this per line: https://api.helpscout.net/v1/conversations/326542115.json
conversations = File.readlines("conversations.txt").map(&:strip)

conversations.each do |conv|
    # conv[43..-6] strips the URL prefix and trailing ".json", leaving just the conversation id
    ticket = helpscout.conversation(conv[43..-6])
    # DateTime.parse is clever enough to pick out date from a string like "something something needs to happen on 2017-03-05 so here's some text"
    todo_date = (DateTime.parse(ticket.preview)-1).to_date.iso8601
    `./haskerdeux #{todo_date} new "Need to do this before tomorrow [#{ticket.number}](#{ticket.url})"`
end

(This Helpscout Ruby gem is really good, by the way).

I.e. I had a list of HelpScout conversations that had been generated previously and I had actions based on these I needed to do on a certain date. So I iterated through each conversation, parsed the date from the ticket preview text and then added them to my Teuxdeux list via a Ruby shell/system call to Haskerdeux. Simple, hacky, nifty.

NPF so far

I am late to the NPF party, but let's face it: I don't use NetBSD because I want the latest and greatest, I use NetBSD because I want something I can (mostly) make work wherever I put it.

Since setting up a proper desktop machine with NetBSD I decided to use it as a playground for NPF with the ultimate goal of switching to NPF from IPF (IPFilter) where it actually matters to me most; IPF still works for me, but it seems to become more quirky with each NetBSD release.

It was not obvious - at all - to me from the documentation how to get NPF working with both IPv6 and IPv4 (I have both at home, thanks excellent broadband provider!). With IPF there are two separate .conf files that are mostly independent of each other. Anyway, in case anyone else is struggling similarly, I ended up doing the following and couldn't figure out a way to completely avoid duplication:

# Simple npf.conf for Desktop with wired connection and IPv4 and IPv6

$ext4_if = inet4(bge0)
$ext6_if = inet6(bge0)

$services_in_tcp = domain
$services_in_udp = domain

procedure "log" {
    log: npflog0
}

group "external" on $ext4_if {
    pass stateful out final all

    pass stateful in final family inet4 proto tcp to $ext4_if port ssh apply "log"
    pass stateful in final proto tcp to $ext4_if port $services_in_tcp
    pass stateful in final proto udp to $ext4_if port $services_in_udp

    # Passive FTP
    pass stateful in final proto tcp to $ext4_if port 49151-65535
    # Traceroute
    pass stateful in final proto udp to $ext4_if port 33434-33600
}

group "external6" on $ext6_if {
    pass stateful out final all

    pass stateful in final proto tcp to $ext6_if port $services_in_tcp
    pass stateful in final proto udp to $ext6_if port $services_in_udp

    # Passive FTP
    pass stateful in final proto tcp to $ext6_if port 49151-65535
    # Traceroute
    pass stateful in final proto udp to $ext6_if port 33434-33600
}
group default {
    pass final on lo0 all
    block all
}

Attempts to have one group that would work on both IPv6 and IPv4 failed. Maybe it is possible somehow, but I sure as hell couldn't figure it out and the above does work.

There are some good tips at the bottom of the npfctl man page that you can use to test the rules are loaded and working:

  1. Use npfctl reload, not load, otherwise you get a weird error of "npfctl_config_load: no such file or directory".
  2. Then use npfctl start.
  3. Then use npfctl show, which should list the rules.
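In shell form, that's just (straight from the list above):

npfctl reload   # reload, not load, to avoid the error above
npfctl start
npfctl show     # should list the rules if everything worked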

Since I seemingly had it working on the desktop I gave it a whirl on XEN, which requires building your own kernel and then publishing the kernel somewhere you can download it from so you can boot it. This is more fiddly than it sounds when your web host is also your XEN NetBSD install. However, after doing all that it turns out NPF is broken on XEN on 7.0.2. I could fight it, but I've decided to just wait until 7.1.

Coming back to the desktop, I was surprised to find that just having npf=YES in /etc/rc.conf wasn't enough to actually load the rules and start npf on boot - or at least that is what I thought. I started playing about with explicitly calling /etc/rc.d/npf reload and /etc/rc.d/npf start in rc.local, but found that produced an error because the interfaces didn't seem to be ready yet. After a bit more searching I found this: npf startup failure when using dhcpcd inet4 and inet6. The exact issue I was seeing.

Guess I'm waiting for 7.1 on the desktop as well!

Definitely feel fine with not rushing towards NPF.

Gradle and Java SSL debug

I hate Java and I hate Java build systems even more. This is why for the most part I always try to keep things simple and just use javac and java directly where I can. However, invariably if you are going to touch other people's Java you can't avoid things like Gradle and Maven. Gradle isn't too bad I suppose, but I spent hours and hours trying to figure out why I couldn't do this:

gradle run -Djavax.net.debug=SSL,trustmanager

This was with 3.2.1 and seemingly it should work. I made my default assumption: it must be me doing something wrong. But after finally running:

gradle run -Djavax.net.debug=SSL,trustmanager --debug

and inspecting the output I discovered the option wasn't being passed through at all. However, this also gave me the solution to the problem: copy the command from the --debug output (search for "Starting process 'command"), paste it directly into the terminal, insert -Djavax.net.debug=SSL,trustmanager and run it. Hey presto, another example of build managers just getting in the way. I wonder if Gradle has a command like gradle show to print the command it would run?
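A rough sketch of that workaround for next time (the log file name is my own; the grep pattern is just the string mentioned above):

# Capture the debug output, then pull out the command Gradle actually runs
gradle run --debug > gradle-debug.log 2>&1
grep "Starting process 'command" gradle-debug.log
# Copy the printed command, add -Djavax.net.debug=SSL,trustmanager, and run it directly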

Haskerdeux Neux

I always said that had I had a smartphone I would have stuck with Teuxdeux all along. There is something about it that just sits right with me, hence all my efforts to emulate it with Taskwarrior.

So it was no surprise I switched back when the opportunity finally arose. But I'm still keen to be able to accomplish as much as I can via the Terminal on NetBSD - I'll accept platform agnosticism over owning every byte of my own data. In the meantime (the wilderness years) I'd stumbled across dmi3/teuxdeux, which was really interesting to me as I'd spent a huge amount of time playing with curl trying to figure out the neux API when it was first (not actually) released and could never get it to work. Thanks to dmi3/teuxdeux I realised it must actually be possible to work with the new non-API.

The first mistake

After playing much more I finally realised that the whole (well, 98%) reason I couldn't get things to work was because I'd been misspelling "authenticity". I checked. Four years ago in my curl calls I was spelling it "authencity". Doh! I'm pretty sure I would have persevered had I realised.

The second mistake

What the hell, things still didn't work?! After much more confusion I then realised the PUT request had "vi" and not "v1" in the URL. Typos have screwed me up a lot, but this takes the biscuit.

The third mistake

I'll accept this one. The next reason I failed to get this to work was that I'd not used the "-L" parameter (follow location) on curl when logging on. That was significant, and I only ended up trying it after stumbling across this old blog post: Using curl with a web site secured by Rails Authenticity Token.

The last mistake

Since in my attempts to get this to work I'd long since dropped back from Haskell to straight use of curl, the last mistake I'd made was a misunderstanding of the curl man page. I thought you could just use --cookie-jar for writing and reading cookies, but you can't: you need to use it in conjunction with --cookie for reading and sending cookies; --cookie-jar is just for saving them.
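Putting those last two mistakes together, the logon and a follow-up request look roughly like this (a sketch only; the URLs and form fields here are placeholders rather than the real site's, and the authenticity token has to be scraped from the login page first):

# Log on, following redirects (-L) and saving the session cookies (--cookie-jar)
curl -L --cookie-jar cookies.txt \
  --data "email=me@example.com&password=secret&authenticity_token=TOKEN" \
  https://teuxdeux.com/login
# Later calls must send the saved cookies back (--cookie) as well as keep the jar updated
curl --cookie cookies.txt --cookie-jar cookies.txt \
  https://teuxdeux.com/api/v1/todos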

Mistakes I realised I'd made last time

After finally getting to the point where I could logon on make simple api calls on the command line I started changing the Haskell app to suit, still using Network.Curl even though very out of date, because it still (seemed to) work and the state of other options seemed incomplete and confusing as it was a few years ago. I also realised I'd done a lot of bad things originally, like recreating the curl object anew on each method so I improved that and passed about a curl object instead.

I had things "working" again with Network.Curl, apart from the fact that I couldn't figure out what it was doing with cookies. As far as I could tell it seemed to be handling them in memory. The CurlCookie options worked and were necessary, but it didn't actually seem to be saving the cookies to a file anywhere, so I ended up having to log on again every time.

Not a mistake, but a change

Even after that workaround, though, I ran into a roadblock with PUT requests. They seemingly should have worked via Network.Curl, but I eventually had to conclude they didn't.

I admitted defeat and went for straight system calls instead since I knew the command line worked. It doesn't make it spectacularly Haskelly, but it does mean it works.

Same, but different

So after a few months I'm finally back to the exact same functionality I had a few years ago. The only real differences are:

  • Uses boring system calls instead of a nice Haskelly way (but it is all the same deep under the surface, right?) and so doesn't have as nice a way of checking return values as it did with 200 status codes when using Network.Curl.
  • You (and by that I mean "I") can only logon with netrc (which suits me).
  • It doesn't pass credentials all the time (because of the change in the API) which is a lot better, but it might mean when the cookies eventually expire you'd (I'd) have to reset the login by removing the cached auth token.
  • I can't be bothered to make it better (I had faint hope the first time, but no, it does what I want it to do).

Anyway, it's back alive again here: Haskerdeux.

There is no flashy fancy code to show off here, I just made it work again - that was achievement enough. The ironic thing about officially working in the Software/Technology industry is that I now have much less time for writing code than I did when I was "officially" a Mechanical Engineer.

cabal-install notes for NetBSD

Just a quick little post so that in another three years' time I'll be able to figure out how to do this again a bit more quickly.

If you are trying to use cabal with headers and libraries on non-standard paths (e.g. pkgsrc) then you need to do:

cabal install curl --extra-include-dirs=/usr/pkg/include/ --extra-lib-dirs=/usr/pkg/lib --configure-option=CPPFLAGS=-I/usr/pkg/include/ --configure-option=LDFLAGS=-L/usr/pkg/lib

The important bit is the --configure-option flags, as the --extra ones are documented and thus a bit more obvious.

Writing crap Elixir code from the point of view of someone who writes crap Erlang code

Woo, quite a long title, but I think that effectively explains what this is going to be about. There are already lots of good posts on Elixir from proper Erlang programmers. This isn't meant to be another one of them. It's just a collection of things I've noticed.

It is pretty much the same

Superficially there is no real difference (or benefit to be gained) between Erlang and Elixir code. Here's some Erlang code:

pad_to(Length, Binary_string) when length(Binary_string) < Length ->
    Padded_binary_string = "0"++Binary_string,
    pad_to(Length, Padded_binary_string);
pad_to(Length, Binary_string) when length(Binary_string) == Length ->
    Binary_string.

And the equivalent in Elixir:

def pad_to(length, binary_string) when length(binary_string) < length do
  padded_binary_string = '0'++binary_string
  pad_to(length, padded_binary_string)
end
def pad_to(length, binary_string) when length(binary_string) == length do
  binary_string
end

Apart from the wrapping of the code in def and end it looks pretty much the same.

You have to still use Erlang

Quite a bit. I was surprised how quickly I came across the need to do this, but not all Erlang things have Elixir equivalents so you have to use Erlang code. However, at least this is seamless and painless.

Original Erlang:

Binary_string = hd(io_lib:format("~.2B", [Binary_number])),

Elixir port:

binary_string = hd(:io_lib.format("~.2B", [binary_number]))

defp is really neat

In Erlang you export only the functions you want:

-module(polyline).
-export([six_bit_chunks/1]).

six_bit_chunks(Encoded_polyline) ->
    six_bit_chunks_(Encoded_polyline, []).
six_bit_chunks_([Head | Rest], Chunks_list) ->
    Six_bit_chunk = six_bit_chunk(Head),
    %Add to Reversed_chunks
    six_bit_chunks_(Rest, [Six_bit_chunk]++Chunks_list);
six_bit_chunks_([], Chunks_list) ->
    lists:reverse(Chunks_list).

In Elixir everything that uses def is exported automatically. If you don't want something exported you use defp. This is really neat.

def six_bit_chunks(encoded_polyline) do
  six_bit_chunks_(encoded_polyline, [])
end
defp six_bit_chunks_([head | rest], chunks_list) do
  six_bit_chunk = six_bit_chunk(head)
  #Add to Reversed_chunks
  six_bit_chunks_(rest, [six_bit_chunk]++chunks_list)
end
defp six_bit_chunks_([], chunks_list) do
  Enum.reverse(chunks_list)
end

Indices are different

I don't know why, but indices are different. I don't think there is a clever reason. It just is.

In Erlang lists:sublist("words", 2, 3) returns "ord", while in Elixir Enum.slice('words', 2, 3) returns 'rds'. That is something else to be aware of: the difference between single and double quotes; single quotes are charlists as per Erlang, whereas double quotes are Elixir strings (something different).

Flipping order of args... and pipes.

In Elixir some functions have the order of the arguments flipped from how they are in Erlang. This threw me to start with and I thought it was to make Ruby programmers happy, but actually it's because of pipes.

In Erlang:

Eight_bit_chunks = lists:map(
    fun(Group_of_chunks) ->
        eight_bit_chunks(Group_of_chunks)
    end,
Five_bit_chunks).

In Elixir the list, etc is always first:

eight_bit_chunks = Enum.map(
  five_bit_chunks,
  fn(group_of_chunks) ->
    eight_bit_chunks(group_of_chunks)
  end)

This means you can then do things like this:

five_bit_chunks
|>  Enum.map(
      fn(group_of_chunks) ->
        eight_bit_chunks(group_of_chunks)
      end)

I.e. pull five_bit_chunks out and pipe it into the Enum.map. Which doesn't look that impressive on its own, but it means you can chain things together. Pipes are neat. My only quibble is that it is a step towards Haskell's esoteric symbols (I liked Erlang over Haskell just because it was a little easier to understand).

Phoenix

I am mid-way through porting my Erlang command line application to Elixir. My plan is to then use Phoenix to turn this into a web application so other people (ha!) can use it. I also have plans to improve it. Slow, long term plans. I mention this only to say that you can write Ruby without going near Rails and the same is true of Elixir and Phoenix; and both are something you should do.

Six Degrees of Wikipedia

This may or may not have come up recently due to my circumstances; I don't think me blogging about my efforts is going to significantly improve anyone else's chances when it comes to the moment.

The idea in principle is dead simple: given a page on Wikipedia, how many steps/links/pages does it take to get to a given second page?

We don't care how we get there; there is no need to record the route, just count how many steps. The only other rule is you have to go from A -> B; you can't start at both ends and converge somewhere in the middle, i.e. A -> C <- B.

I really thought I'd have this nailed. I've done so much with Ruby and Mechanize in the past and this was an ideal fit. But I found it quite a humbling experience. For whatever reason (it could perhaps have been the cat relentlessly pawing at the door) I lost my cool in the last thirty minutes of the ninety minutes of allotted time, shut down a bit, stopped thinking/communicating out loud, and just couldn't get my brain to work: I knew what I was trying to do, but couldn't even communicate that to myself, let alone anyone else.

Here's exactly what I had when time was up:

#Want take url
#Destination url
#Both urls from wikipedia
#Shortest route from one to the next
#https://en.wikipedia.org/wiki/Ruby
#

@queue = {}

def findlinksbetween(url1, url2)
    #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
    steps = 0
    found = false
    A = Mechanize.new
    page = A.get(url1)
    until found
        links = page.links_with(:href => /\/wiki\// )
        found, steps = checkfor(page, url2, steps, found)
        #Get next link to try.

        #Next link in list, but don't want to wipe out top list, etc
        page = A.get(link.href)
        #Then follow those links and try all the links on that page
        #Then try each in turn
    end
    put steps
end


def checkfor(page, url, steps, found)
    steps+=1 #this isn't right
    links = page.links_with(:href => /\/wiki\// )
    links.each do |link|
        #Need to stop if we find the url
        if link == url and not found
            #then we want to stop
            found = true
            break
        end
    end
    [found, steps]
end


    #Could search through links 
    #Need to keep a cache of links checked
    #follow justthe 
    #Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize?

Not pretty, is it? It doesn't work at all. It's more like a collection of random thoughts. Like I said: a humbling experience. But how it nagged me. I couldn't leave it like that when I knew I had the right idea somewhere in my brain.

This is what I had a few hours later:

#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'mechanize'

@previous_level = []
@current_level = []
@checked = {}
@a = Mechanize.new


def findlinksbetween(url1, url2)
    #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
    step = 0
    found = false
    page = @a.get(url1)
    #Initialise level
    addlinks(page, step)
    loop do #Horrors
        #Check current level links for a match
        check(url2)
        #Not found so carry on
        @previous_level = @current_level
        @current_level = []
        step += 1
        @previous_level.each do |item|
            unless @checked[item[0]]
                begin
                    page = @a.get(item[0])
                    puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item[0]}\n"
                    addlinks(page, step)
                    @checked[item[0]] = true
                    #TODO: Only problem by collating all the links for a level before checking is it takes a long time
                rescue
                    #Could have been an image, etc
                end
            end
        end
        #but what happens
    end
end


def addlinks(page, step)
    #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
    links = page.links_with(:href => /^\/wiki\// )
    links.each do |link|
        #Further filter out nonsense, colons should get most of the irrelevant stuff
        unless link.href =~ /:/ or link.href =~ /Main_Page/
            @current_level << [link.href, step]
        end
    end
    #Might have duplicates at current level, but that is ok as won't visit it twice
end


def check(url)
    @current_level.each do |item|
        if item[0] == url
            puts "Found #{url} after #{item[1]} step(s)\n"
            exit
        end
    end
end

Which worked in principle, but was extremely slow and memory inefficient. I had to kill it before I got a result as it was using too much of the 1GB total I had on Linode.

You can see (maybe) what I was trying to do with my initial attempt, i.e.:

  1. Keeping track of the previous level's URLs
  2. Keeping track of the current level's URLs
  3. Keeping track of URLs already checked so time isn't wasted checking them again

The current level becomes the previous level for the next round when we reach the end of it, and so on.

You can also see bits that I knew were a problem and inefficient, but that I didn't really care about because I was sure I could figure them out later if need be (i.e. I didn't see them as the main problem):

  • Filtering out irrelevant links. I.e. all the header and footer links on a Wikipedia page which aren't anything to do with the article itself.
  • Checking all the links at the end of a step - slower than checking on the fly.

I decided to use Redis as a more memory-efficient alternative to Ruby arrays for my final attempt of the day (by that point it was purely an effort for my own self-esteem). Redis is really awesome, by the way:

#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'mechanize'
require 'redis'

#TODO: Maybe should make previous_level a redis database as well, could flip/flop between current/previous
@previous_level = []
@redis = Redis.new
#current level in 0, checked in 1
@a = Mechanize.new

#Flush just to be safe
@redis.flushall

def findlinksbetween(url1, url2)
    #For now assume that url2 is just the /wiki/whatever bit <- Need to fix that
    step = 0
    found = false
    page = @a.get(url1)
    #Initialise level
    addlinks(page, step)
    loop do #Horrors
        #Check current level links for a match
        check(url2)
        #Not found so carry on
        @redis.select(0)
        @previous_level = @redis.keys
        @redis.flushdb
        step += 1
        @previous_level.each do |item|
            @redis.select(1)
            unless @redis.exists(item)
                begin
                    page = @a.get(item)
                    puts "Level: #{step}, #{@previous_level.length} urls: fetching page #{item}\n"
                    addlinks(page, step)
                    @redis.select(1)
                    @redis.set(item, true)
                    #TODO: Only problem by collating all the links for a level before checking is it takes a long time
                rescue
                    #Could have been an image, etc
                end
            end
        end
        #but what happens
    end
end


def addlinks(page, step)
    @redis.select(0)
    #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
    links = page.links_with(:href => /^\/wiki\// )
    links.each do |link|
        #Further filter out nonsense, colons should get most of the irrelevant stuff
        unless link.href =~ /:/ or link.href =~ /Main_Page/
            @redis.set(link.href, step)
        end
    end
end


def check(url)
    @redis.select(0)
    if @redis.exists(url)
        puts "Found #{url} after #{@redis.get(url)} step(s)\n"
        exit
    end
end

This was good enough to get it to run reasonably ok on the resources I had, but I knew improvements could still be made. That would have to wait for another day though as it was bedtime; I hadn't worked on this constantly that evening, I'd had other stuff to do.

Next, I figured out what I'd said to myself in a comment and managed to use Redis's built-in databases for both the previous and current levels, meaning I could easily switch between them. I also made it callable from the command line to make it a bit easier to use.

#Find shortest route between urls on Wikipedia
#A depth first search is easier, but slower/doesn't find shortest route

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
    opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
    opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"
@a = Mechanize.new

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
    step = 0
    found = false
    page = @a.get(Wikipedia+page1)
    #All other links should resolve relatively
    #Initialise level
    addlinks(page, step)
    loop do #Horrors
        #Check current level links for a match
        check(page2)
        #Not found so carry on
        #Switch previous and current
        @previous ^= 1
        @current ^= 1
        @redis.select(@current)
        @redis.flushdb
        step += 1
        @redis.select(@previous)
        keys = @redis.keys
        #No way not to avoid an array at some point
        keys.each do |item|
            @redis.select(Checked)
            unless @redis.exists(item)
                begin
                    page = @a.get(item)
                    puts "Level: #{step}, #{keys.length} urls: fetching page #{item}\n"
                    addlinks(page, step)
                    @redis.select(Checked)
                    @redis.set(item, true)
                    #TODO: Only problem by collating all the links for a level before checking is it takes a long time
                rescue
                    #Could have been an image, etc
                end
            end
        end
    end
end


def addlinks(page, step)
    @redis.select(@current)
    #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
    links = page.links_with(:href => /^\/wiki\// )
    links.each do |link|
        #Further filter out nonsense, colons should get most of the irrelevant stuff
        unless link.href =~ /:/ or link.href =~ /Main_Page/
            @redis.set(link.href, step)
        end
    end
end


def check(page)
    @redis.select(@current)
    if @redis.exists("/wiki/"+page)
        puts "Found #{page} after #{@redis.get("/wiki/"+page)} step(s)\n"
        exit
    end
end

if defined?(Page1) and defined?(Page2)
    findlinksbetween(Page1, Page2)
end

The next attempt was to use threading. I realised that, since I would have concurrent threads going on, I couldn't use multiple Redis databases, as each thread could override the database selection made in another thread and things would get very messy. I decided to use Redis sets instead, in just one database. The Mechanize bit became a separate function to make threading it easier, i.e. I could extend this to three instances/threads easily. I also added better feedback (counting down the URLs remaining).

In making the threaded version I discovered the need for better error handling, e.g. checks for images and checks for a successful page load. On a single thread the program never failed to work; however, after adding threads (before I put in the page load check) it would blitz through, but not always find the result, i.e. some page loads must have failed but been lost in the begin/rescue. The new approach made sure to re-add the page to the list to be checked if it failed to load.

I also realised I had lied to myself when I'd commented "No way not to avoid an array at some point".

#Find shortest route between urls on Wikipedia

#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
    opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
    opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"

#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
    step = 0
    @redis.set("step", step)
    page = @a.get(Wikipedia+page1)
    #All other links should resolve relatively
    #Initialise level
    addlinks(page)
    #TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
    #Nah, i think things are failing and not repeating
    until $found or step == 6
        #Check current level links for a match
        check(page2)
        #TODO: Only problem by collating all the links for a level before checking is it takes a long time
        #Not found so carry on
        #Switch previous and current
        @previous ^= 1
        @current ^= 1
        step += 1
        @redis.set("step", step)
        @remaining = @redis.smembers(@previous).length
        puts "***Starting*** Level: #{step}, #{@remaining} urls"
        t1 = Thread.new { mech(@a) }
        t2 = Thread.new { mech(@b) }
        Thread.new do
            until @remaining == 0
                puts "Level: #{step}, #{@remaining} urls"
                sleep 1
            end
        end
        #Makes sure both have finished though
        t1.join
        t2.join
    end
end


def addlinks(page)
    #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
    links = page.links_with(:href => /^\/wiki\// )
    links.each do |link|
        #Further filter out nonsense, colons should get most of the irrelevant stuff
        unless link.href =~ /:/ or link.href =~ /Main_Page/
            @redis.sadd(@current, link.href)
        end
    end
end


def check(page)
    if @redis.sismember(@current, "/wiki/"+page)
        puts "Found #{page} after #{@redis.get("step")} step(s)\n"
        $found = true
        exit
    end
end

def mech(m)
    url = ""
    until url == nil
        url = @redis.spop(@previous)
        @remaining = @redis.smembers(@previous).length
        unless @redis.sismember(Checked, url)
            #Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
            page = m.get("https://en.wikipedia.org#{url}")
            #Skip images
            unless page.response["content-type"].include?("image")
                addlinks(page)
            end
            if page.code == "200"
                @redis.sadd(Checked, url)
            else
                #re add and check later
                @redis.sadd(@previous, url)
            end
        end
    end
    @remaining = 0
end


if defined?(Page1) and defined?(Page2)
    findlinksbetween(Page1, Page2)
end

However, the GIL is still a thing, so threading was not providing any performance benefit, which prompted a fairly trivial switch from threads to processes instead. This represents the final state of the program; obviously further improvements are possible, but I was happy enough with everything in principle.

#Find shortest route between urls on Wikipedia

#Notes:
# - A depth first search is easier, but slower/doesn't find shortest route
# - Have to be careful threads don't interfere, i.e. select wrong database in other thread. Options:
#   - Redis sets
#   - Different Redis instances
#   - Implement some kind of global lock

require 'optparse'
require 'mechanize'
require 'redis'

optparse = OptionParser.new do |opts|
    opts.on('-p', "--page1 PAGE", "1st Wikipedia page") { |p| Page1 = p }
    opts.on('-P', "--page2 PAGE", "2nd Wikipedia page") { |q| Page2 = q }
end
optparse.parse!

@redis = Redis.new
@current = 0
@previous = 1
Checked = 2
Wikipedia = "https://en.wikipedia.org/wiki/"

#Use two instances to speed things up a bit
@a = Mechanize.new
@b = Mechanize.new
@remaining = 1
$found = false

#Flush just to be safe
@redis.flushall

def findlinksbetween(page1, page2)
    step = 0
    @redis.set("step", step)
    page = @a.get(Wikipedia+page1)
    #All other links should resolve relatively
    #Initialise level
    addlinks(page)
    #TODO: Does this have to be a global var and not a class one to work? I.e. otherwise becomes localised in thread?
    #Nah, i think things are failing and not repeating
    until $found or step == 6
        #Check current level links for a match
        check(page2)
        #TODO: Only problem by collating all the links for a level before checking is it takes a long time
        #Not found so carry on
        #Switch previous and current
        @previous ^= 1
        @current ^= 1
        step += 1
        @redis.set("step", step)
        @remaining = @redis.smembers(@previous).length
        puts "***Starting*** Level: #{step}, #{@remaining} urls"
        fork { mech(@a, step) }
        fork { mech(@b, step) }
        #Makes sure both have finished though
        Process.waitall
    end
end


def addlinks(page)
    #TODO: Probably best way is to get links from just bodyContent. Might need Nokogiri for that. <- not then sure how to get the links in a nice format. Back through mechanize? Until then just get all /wiki/ links
    links = page.links_with(:href => /^\/wiki\// )
    links.each do |link|
        #Further filter out nonsense, colons should get most of the irrelevant stuff
        unless link.href =~ /:/ or link.href =~ /Main_Page/
            @redis.sadd(@current, link.href)
        end
    end
end


def check(page)
    if @redis.sismember(@current, "/wiki/"+page)
        puts "Found #{page} after #{@redis.get("step")} step(s)\n"
        $found = true
        exit
    end
end

def mech(m, step)
    url = ""
    until url == nil
        url = @redis.spop(@previous)
        @remaining = @redis.smembers(@previous).length
        puts "Level: #{step}, #{@remaining} urls"
        unless @redis.sismember(Checked, url)
            #Be careful, might normally follow relative links fine, but re-added ones seem to throw it off?
            page = m.get("https://en.wikipedia.org#{url}")
            #Skip images
            unless page.response["content-type"].include?("image")
                addlinks(page)
            end
            if page.code == "200"
                @redis.sadd(Checked, url)
            else
                #re add and check later
                @redis.sadd(@previous, url)
            end
        end
    end
    @remaining = 0
end


if defined?(Page1) and defined?(Page2)
    findlinksbetween(Page1, Page2)
end

I decided to chuck some resources at the processes version and see if I could speed things up. The original task I had been set was to go from the Wikipedia Ruby page to the David Bowie page, but whilst developing this program I checked against a known single step, Ruby to Metastable (the Ruby page links to Diamond, which links to Metastable).
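The timings below are real/user/sys times from running the script under time, along these lines (the script file name is my own placeholder):

# Benchmark the known single-step route (Ruby -> Diamond -> Metastable)
time ruby six_degrees.rb -p Ruby -P Metastable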

On a two-core system:

                  Real / User / Sys 

No threads:      22.993/15.360/2.923
Two threads:     21.142/15.867/3.780
Two processes:   13.149/14.493/2.553
Three processes: 10.132/13.620/2.423

On a four-core system:

                  Real / User / Sys

No threads:      24.845/16.360/3.197
Two threads:     21.355/16.693/3.833
Two processes:   12.955/16.457/3.370
Four processes:   9.043/20.060/5.027

So the additional processes are definitely helping. Mind you, going six levels deep would be crazy on a single machine: it turns out David Bowie is only two levels deep, and on a four-core machine with four processes that still took 108m6.189/173m5.299/7m24.900.

These are the ten most recent posts; for older posts see the Archive.