Saturday, October 28, 2006

Finding Social Security Numbers with Google

Google has a helpful syntax x..y for searching web pages with numbers between x and y. This feature, combined with the stupidity of the general public, results in social security numbers being findable: inurl:resume ssn * 10000...99999999.

Note that Google proactively protects users by forbiding the query 100000000..999999999 (that's the full SSN range). It also bans a varity of searches for credit card numbers. However, using the wider range allows one to still find socials, granted with some false positives.

Wednesday, October 18, 2006

Math.abs Returns a Negative Number

Math.abs (Math.Abs in C# :-) is one of those methods that sould be pretty simple, right? Just take the value, if it's negative, return a positive version, duh! Well, it's not that simple.

There are 232 possible values of int, each of which is positive, negative or zero. There's only one integer that is zero, namely 0x00000000. That means that either there is one bit pattern of ints that doesn't represent an integer, or there is not a 1-1 mapping between positive and negative numbers. It turns out, the second is the case. The odd number is int.MinValue, which is equal to 0x80000000 in binary.

What happens when you negate this number? In two's complement -x=~x+1. ~x = 0x7fffffff. ~x+1=0x80000000. That means -int.MinValue == int.MinValue. Uh oh!

Enter Math.abs. What's the function to do when passed int.MinValue. Well, in C#, Math.Abs throws an OverflowException. Java, on the other hand, happily returns int.MinValue. Either way, the caller of the function probably wasn't expecting this.

I discovered this oddity when reading Effective Java (kindly provided to all Google Engineers). The book talks about the code: Math.abs (new Random ().nextInt ()) % n as a way to generate random numbers between 0 and n. This code usually works, except when Random happens to return int.MinValue. In this case, the abs returns a negative value, causing the modulus operator to return a negative number. Eeeek!

At this point, I realized that for the data structures course I TA, many students used a similar piece of code in a hashtable they wrote. I wrote a quick script to check how many people screwed up on int.MinValue. Most students did.

I wondered how much this occurred in production code. Luckily, I had one of the best test cases at my fingertips. Google has an internal "grep" tool (the inspiration for the new Code Search) utility. I made a quick regex to find where this occurred. There were many instances.

Now that Google has released an external version of this tool, you can see some of the places this anti-pattern is used in the real world:

Each of these snippits is a time bomb. One in 232 executions, the function will do something strange.

Having found this issue in Google's source code, I emailed somebody inside Google who had been running FindBugs internally. He in turn talked to Bill Pugh, who added a detector. I added a test case to the test suite for the homework assignment at CMU, so students will have to handle this case. All in all, a very obscure bug

So, what should you do rather than Math.abs? I've seen primarially two buggy patterns:

  • Hashtables People want to find "which bucket should I put an object with hashcode x in". The best way to do this is (o.GetHashCode () & 0x7fffffff) % table.Length. This has the advantage of being faster than Math.abs (no branches).
  • Random Numbers People want to say Math.Abs (new Random ().Next ()) % N. Not only is this buggy in the rare case, it can also cause a bad distribution of numbers for large N. Use the Next (int, int) method which allows you to specify bounds

Saturday, October 07, 2006

Edgy Memory Usage

I just installed the Ubuntu Edgy beta on my laptop. I'm impressed by the memory usage. On startup, it's using 87mb of ram. WOW.

In system monitor, there's one sore point. This system-tools-backend thingy is launching perl on startup. This requires 9 mb of writable memory on startup. What the hell is this thing?

Whatever the program does, it clearly does not have a reputation for performance. Ryan "kill wakeups" Lortie reported that the program made 20 wakesups per second

Filed a bug

Friday, October 06, 2006

Mono Summit; New Google Products

Mono Summit: I'm going to be at the Mono summit in Boston (thanks Mig!). I look forward to seeing lots of Mono hackers (and having some very good ice cream :-)

New Google Products: In the past month or so, Google has had a pretty crazy schedule of product releases.

  • Blogger Beta: Google finally made lots of improvements to Blogger, giving it tags, etc. Also, it was moved over to Google Accounts, killing yet another password. Most importantly: spell checking!
  • Google Reader: Completely new version. This thing is amazing, it really works the way I'd want a blog reader to work. The old Google Reader had really poor performance on Linux, and this one seems to fix it. I only have two issues with the program. First, it does not auto refresh feeds, so I need to remember to hit "r" to refresh things. Also, the server side seems to have less than stellar performance. I'm betting they did such a fantastic job that they got more users than they had server power.
  • Google Transit: Being a college student, I use public transportation (it's free in the sense that I am forced to pay a flat fee for the entire year). Pittsburgh's public transit does not publish schedules in a usable format. Transit completely fixed this for me.
  • Groups Beta: Google did a refresh of groups. The UI is much better, and they now have a wiki-like functionality and the ability to upload files (100mb of storage -- that's not too bad). Like Google Reader, the servers aren't quite as fast as they could be. I assume this will be fixed when it's out of beta
  • Code Search: global grep. No horrible interface like koders or krugle.
  • Image Labler: This is a version of a game by Luis Von Ahn (a professor here at CMU) that lets people label images. Google needs to do serious work on game play (their images are far too small, likely due to copyright paranoia, and they aren't using taboo words correctly well). However, if you look at the high scores, there are people who have been playing this thing for days. Imagine what would happen if Google were to increase the game play experience and put a link on the home page for one day.

I have to say, this is a pretty amazing array of products.