Seven Languages in Seven Weeks Part 7: Scala Day 3

Day three got into concurrency, XML, and pattern matching in Scala. The concurrency uses the Actor model, and it's easy enough to work with. I don't see a huge improvement in usability or simplicity over the other threading models I've worked with, but I haven't done much concurrent programming (just Android's AsyncTask and plain Java threads using the Thread class with receivers), so I'm probably just not experienced enough to appreciate it.

The pattern matching is what you'd expect. It works well, it handles regular expressions, and so on. I like that you can write a regular expression as a plain string, call the .r method on it to turn it into a Regex, and chain from there to do the matching. That feels more like what I'm used to than Java's drawn-out creation of objects for regular expressions. The syntax for pulling specific matched items out of a string is also nice: it assigns each one to a local variable with a proper name rather than $1, $2, and so on. It's not obvious what it's doing, though, if you look at it without already knowing the syntax.
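
To make that concrete, here is a minimal sketch of both ideas: building a regex with .r, and using it as an extractor so the capture groups land in named local variables. The DatePattern regex, the RegexDemo object, and the sample strings are made up for illustration; they are not from the book.

```scala
object RegexDemo {
  // .r turns a plain String into a scala.util.matching.Regex
  val DatePattern = """(\d{4})-(\d{2})-(\d{2})""".r

  // Using the regex as an extractor: the whole string must match,
  // and each capture group is bound to a named variable, in order
  def parse(s: String): Option[(String, String, String)] = s match {
    case DatePattern(year, month, day) => Some((year, month, day))
    case _ => None
  }

  // findAllIn returns an iterator over every match in the text
  def allDates(text: String): List[String] =
    DatePattern.findAllIn(text).toList

  def main(args: Array[String]): Unit = {
    // The val form uses the same extractor; it throws a MatchError
    // if the string does not match the pattern
    val DatePattern(y, m, d) = "2010-11-05"
    println(y + " / " + m + " / " + d)
    println(parse("not a date"))
    println(allDates("from 2010-11-05 to 2010-12-01"))
  }
}
```

The val form is the one that looks odd at first glance: the names in the parentheses are new variables being declared, not arguments being passed.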

Now for the exercises. The book started with a simple program that iterates over a list of URLs, makes an HTTP request to each, and reports the size of the downloaded response, with both sequential and concurrent versions. The first goal was to modify it to return the number of links in the HTML along with the size. The second was to follow each of those links and add the size of each linked page to the original page's size before returning it. I also added one more variation, which I'll get to at the end.

I should note that the way the author used Actors here is a bit different from how I've seen them used everywhere else. The book just passes a block of code to the actor method, while everyone else seems to subclass Actor to create a new class and object to handle the concurrency.

On to the code! First, returning the number of links. For this exercise I counted any <a> tag whose href starts with http:// as a link.

import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

object PageLoader {

  def getPageSize(url: String): (Int, Int) = {
    val html = Source.fromURL(url).mkString
    //Scala 2.7.7 does not deal with quotes inside """ strings correctly in some circumstances, 2.8 does
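    //[^>] keeps each match inside a single tag, so several links on one line are counted separately
    val reg = new Regex("<a\\s[^>]*href=\"(http://[^\"]+)\"[^>]*>")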

    //Scala 2.7.7 does not have length method on scala.util.matching.Regex.MatchIterator
    //which appears to have been added in 2.8, so have to convert to list and then get the length
    val links = reg.findAllIn(html).toList.length
    return (html.length, links)
  }
}

val urls = List("http://www.amazon.com",
                "http://www.cnn.com/",
                "http://www.twitter.com",
                "http://www.google.com"
                )

def timeMethod(method: () => Unit) {
  val start = System.nanoTime
  method()
  val end = System.nanoTime
  println("Method took " + (end - start)/1000000000.0 + " seconds.")
}

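// Why the = here?  It was not used in other examples in the book
// or explained, except when a return type is specified, which it is not here.
// (With =, the result type is inferred from the body; without it, the method
// is a procedure that returns Unit. The for loop below is Unit either way.)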
def getPageSizeSequentially() = {
  for(url <- urls) {
    val (size, links) = PageLoader.getPageSize(url)
    println("Size for " + url + ": " + size + " with " + links + " links")
  }
}

//again, why the =
def getPageSizeConcurrently() = {
  val caller = self

  for(url <- urls) {
    actor {caller ! (url, PageLoader.getPageSize(url))}
  }

  for(i <- 1 to urls.size) {
    receive {
      case (url, (size, links)) =>
        println("Size for " + url + ": " + size + " with " + links + " links")
    }
  }
}

//no obvious difference between putting the method to call
//in () or {} here.  The book used {}, I have done one each way.
//(Scala accepts either when a single argument is passed.)
println("Sequential:")
timeMethod(getPageSizeSequentially)

println("Concurrent:")
timeMethod{getPageSizeConcurrently}
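
One thing worth checking without touching the network is how a link regex behaves when a line holds more than one anchor tag. The sketch below runs two patterns against a made-up HTML snippet (sampleHtml and the LinkCountDemo object are hypothetical, not from the book). One uses .+ and .* wildcards, which are greedy and can swallow several links on a line into a single match. The other uses [^>] so a match cannot run past the end of its own tag.

```scala
import scala.util.matching.Regex

object LinkCountDemo {
  // Greedy wildcards: .+ and .* can run across several tags on one line
  val greedy = new Regex("<a.+href=\"(http://[^\"]+)\".*>")
  // [^>] cannot cross the closing > of the tag, so each <a ...> matches separately
  val perTag = new Regex("<a\\s[^>]*href=\"(http://[^\"]+)\"[^>]*>")

  // Made-up HTML: two links on the first line, one on the second
  val sampleHtml =
    "<p><a href=\"http://example.com/a\">A</a> <a href=\"http://example.com/b\">B</a></p>\n" +
    "<a class=\"x\" href=\"http://example.com/c\">C</a>"

  // Same counting approach as getPageSize: collect the matches, take the length
  def countLinks(reg: Regex, html: String): Int =
    reg.findAllIn(html).toList.length

  // Pull out just the captured URL (group 1) from each match
  def linkUrls(reg: Regex, html: String): List[String] =
    reg.findAllIn(html).matchData.map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println("greedy:  " + countLinks(greedy, sampleHtml))
    println("per tag: " + countLinks(perTag, sampleHtml))
    println(linkUrls(perTag, sampleHtml))
  }
}
```

On this sample the greedy pattern finds 2 matches and the per-tag pattern finds 3, one for each link.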

Next up is the same thing, but each link is followed and its size is added to the page's total. There are occasional syntax differences from the code above where I was experimenting with other ways to do the same thing.

import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

object PageLoader {
  def getPageSize(url: String) = {
    val html = Source.fromURL(url).mkString
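    //[^>] keeps each match inside a single tag, so several links on one line are counted separately
    val reg = new Regex("<a\\s[^>]*href=\"(http://[^\"]+)\"[^>]*>")
    val linkList = reg.findAllIn(html).toList
    //Scala 2.8 has a way to ignore exceptions that you don't care about
    //but still might get thrown.  That would be useful here.  HTTP errors throw exceptions
    //Fold over the links, starting from this page's size and adding each linked page's size
    val size = (html.length /: linkList) { (sum, link) =>
      //use the regex as an extractor to pull the URL out of the matched tag
      val reg(linkUrl) = link
      try {
        sum + Source.fromURL(linkUrl).mkString.length
      } catch {
        //a link that fails to load adds nothing to the total
        case e: Exception => sum
      }
    }

    val links = linkList.length
    (size, links)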
  }
}

val urls = List("http://www.amazon.com",
                "http://www.cnn.com/",
                "http://www.twitter.com",
                "http://www.google.com"
                )

def timeMethod(method: () => Unit) {
  val start = System.nanoTime
  method()
  val end = System.nanoTime
  println("Method took " + (end - start)/1000000000.0 + " seconds.")
}

def getPageSizeSequentially() = {
  for(url <- urls) {
    val (size, links) = PageLoader.getPageSize(url)
    println("Size for " + url + ": " + size + " with " + links + " links")
  }
}

def getPageSizeConcurrently() {
  val caller = self

  for(url <- urls) {
    actor {caller ! (url, PageLoader.getPageSize(url))}
  }

  for(i <- 1 to urls.size) {
    receive {
      case (url, (size, links)) =>
        println("Size for " + url + ": " + size + " with " + links + " links")
    }
  }
}

println("Sequential:")
timeMethod{getPageSizeSequentially}

println("Concurrent:")
timeMethod{getPageSizeConcurrently}
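
The fold in getPageSize does two things at once: it walks the list of links while carrying a running total, and it skips any link whose download throws. Here is a minimal sketch of that shape that runs without the network. fakeFetch and its table of sizes are made-up stand-ins for Source.fromURL(...).mkString.length, and the catching(...).opt call from scala.util.control.Exception (available from Scala 2.8) is one way to turn an exception into None. It may or may not be the exception helper the code comment is referring to.

```scala
import java.io.IOException
import scala.util.control.Exception.catching

object FoldDemo {
  // Hypothetical stand-in for a real download: known URLs return a size,
  // anything else throws the way a failed HTTP request would
  val sizes = Map("http://example.com/a" -> 100, "http://example.com/b" -> 250)
  def fakeFetch(url: String): Int =
    sizes.getOrElse(url, throw new IOException("HTTP error for " + url))

  // (start /: list)(f) is the symbolic form of list.foldLeft(start)(f)
  def totalSize(pageSize: Int, links: List[String]): Int =
    links.foldLeft(pageSize) { (sum, link) =>
      // opt gives Some(size) on success and None if an IOException is thrown
      sum + catching(classOf[IOException]).opt(fakeFetch(link)).getOrElse(0)
    }

  def main(args: Array[String]): Unit = {
    val links = List("http://example.com/a", "http://example.com/broken", "http://example.com/b")
    println(totalSize(1000, links))
  }
}
```

Naming the exception class keeps the catch narrow, so a programming mistake elsewhere in the fold still surfaces instead of being swallowed.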

As a final exercise I wanted to take this code and keep following every link on each page, then every link on those pages, and so on forever, like a very simple search engine spider. Each result is that one page's own URL and size rather than a total rolled up into the parent page; otherwise we'd never get any output, since this runs basically forever. To do this, each Actor kicks off another Actor for every link it finds in the HTML. There is no wait time between page loads or anything like that, so it's probably not something to run often as is; it's a bit rude to hammer the servers like that for no reason.

import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

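object PageLoader {
  def getPageSize(url: String): Int = {
    //[^>] keeps each match inside a single tag, so several links on one line are found separately
    val reg = new Regex("<a\\s[^>]*href=\"(http://[^\"]+)\"[^>]*>")
    val html = Source.fromURL(url).mkString
    //start a new actor for every link on this page; each one reports its own
    //page's size back to caller, the main script's actor defined further down
    reg.findAllIn(html).foreach { link =>
      val reg(linkUrl) = link
      actor {caller ! (linkUrl, PageLoader.getPageSize(linkUrl))}
    }

    return html.length
  }
}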
val urls = List("http://www.amazon.com",
                "http://www.cnn.com/",
                "http://www.twitter.com",
                "http://www.google.com"
                )

val caller = self
for(url <- urls) {
  actor {caller ! (url, PageLoader.getPageSize(url))}
}

while(true) {
  receive {
    case (url, size) =>
      println("Size for " + url + ": " + size)
  }
}
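
If you wanted to make the spider a little more polite, two small changes go a long way: pause between page loads and stop after a fixed number of pages. The sketch below shows the idea on a made-up in-memory link graph, with no actors and no network, so linkGraph, crawl, maxPages, and delayMillis are all hypothetical names rather than anything from the book. It also remembers which URLs it has already seen, which the version above does not.

```scala
import scala.collection.mutable

object PoliteCrawlDemo {
  // Made-up link graph standing in for real pages: url -> links found on that page
  val linkGraph = Map(
    "http://example.com/"  -> List("http://example.com/a", "http://example.com/b"),
    "http://example.com/a" -> List("http://example.com/", "http://example.com/b"),
    "http://example.com/b" -> List("http://example.com/a")
  )

  // Breadth-first crawl that visits each URL once, sleeps between loads,
  // and gives up after maxPages pages. Returns the URLs in visit order.
  def crawl(start: String, maxPages: Int, delayMillis: Long): List[String] = {
    val seen = mutable.Set(start)
    val queue = mutable.Queue(start)
    val visited = mutable.ListBuffer[String]()

    while (queue.nonEmpty && visited.size < maxPages) {
      val url = queue.dequeue()
      visited += url
      // Queue only the links that have not been seen before
      for (link <- linkGraph.getOrElse(url, Nil) if !seen.contains(link)) {
        seen += link
        queue.enqueue(link)
      }
      // Wait before the next load so a real server is not hammered
      if (queue.nonEmpty) Thread.sleep(delayMillis)
    }
    visited.toList
  }

  def main(args: Array[String]): Unit = {
    println(crawl("http://example.com/", 10, 50L))
  }
}
```

A real version would swap linkGraph for the download and regex steps from PageLoader; the page cap and the delay would carry over unchanged.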