Saturday, August 29, 2015

Can Data Mining Algorithms Extract Value from your Personal Data (and should you get a piece of the action?)

Go to  to have your say.

Technology has made it easier than ever for people to collect and store a valuable trove of personal information about themselves. However, there is no readily available means by which individuals can reap a financial benefit by selling their personally generated data. Companies such as Facebook, LinkedIn and Twitter are multi-billion dollar companies built almost entirely on user-generated data, so it’s clear that, when used correctly, your personal data is extremely valuable.

There is a growing unease about the disparity between the value that companies realize from personal data and the financial rewards individuals glean from this information. Prof. Tim Wu from Columbia Law School recently argued that Facebook should pay us for our posts. Individually your data may not be worth very much, but collectively it is a goldmine. The problem is that there is currently no way for individuals to collect and monetize their data. It’s as if we need a virtual “trade union” for data producers which collects and amalgamates your personal data on your behalf.

The question is: Would you be willing to exchange access to your personal information for financial compensation?

I have designed a simulation to test this question in practice. In it, you will be presented with 20 simulated offers from fictitious companies to test your willingness to participate in a personal information market. You may be surprised by the results.

Tuesday, May 27, 2014

Beyond the Hype - Data Science in the Real World

I will be presenting this talk at Phil Brierley’s Melbourne Data Science Meetup on June 23rd. See for details. Hope you can join.

Thursday, February 14, 2013

Here is the R code for the competition entry mentioned in my previous post. See for the animation.

# clear everything
rm(list = ls())
# load the packages used below
library(sqldf)
library(ggplot2)
library(animation)
# Ingest data
data <- read.csv("unimelb_public_leaderboard.csv", header=TRUE)
# calculate days and date time as numeric
# (%I, the 12-hour clock, is the hour code that pairs with the AM/PM marker %p)
data <- data.frame(
  data
  , SubmissionDate_datetime = strptime(data$SubmissionDate, format="%m/%d/%Y %I:%M:%S %p")
  , Submission_day = round(strptime(data$SubmissionDate, format="%m/%d/%Y %I:%M:%S %p"), "day")
  , Submission_time_num  = as.numeric(strptime(data$SubmissionDate, format="%m/%d/%Y %I:%M:%S %p"))
)
start_time <- min(na.omit(data$SubmissionDate_datetime))
end_time <- max(na.omit(data$SubmissionDate_datetime))
start_day <- min(na.omit(data$Submission_day))
end_day <- max(na.omit(data$Submission_day))
# measure durations explicitly in days rather than relying on automatic units
duration <- round(difftime(end_time, start_time, units = "days"), 0)
team_names <- sqldf("select distinct TeamName from data")
num_teams <- nrow(team_names)
competition_days <- seq(start_day, by = "days", length.out = as.numeric(round(difftime(end_day, start_day, units = "days"), 0)))
#make a new column with the leaderboard scores in it
#sort data by team and submission date
data <- data[with(data, order(TeamName, Submission_time_num)),]
data <- data.frame(data, lb_score = rep(0, nrow(data)))
current_team <- data$TeamName[1]
current_score <- data$Score[1]
data$lb_score[1] <- current_score
for (i in 2:nrow(data)){
  if (data$TeamName[i] == current_team){
    # same team: the leaderboard score is its best score so far
    current_score <- max(current_score, data$Score[i])
    data$lb_score[i] <- current_score
  } else {
    # new team: start again from its first score
    current_team <- data$TeamName[i]
    current_score <- data$Score[i]
    data$lb_score[i] <- current_score
  }
}
# Make the animation
make_animation <- function(){
  start_time <- min(na.omit(data$SubmissionDate_datetime))
  end_time <- max(na.omit(data$SubmissionDate_datetime))
  start_day <- min(na.omit(data$Submission_day))
  end_day <- max(na.omit(data$Submission_day))
  # measure the duration in whole days so the arithmetic below uses known units
  duration <- round(difftime(end_time, start_time, units = "days"), 0)
  #fix the start time at an exact number of days before the end time
  t <- end_time - as.numeric(duration)*24*60*60
  while(t < end_time){
    t <- min(c(end_time, t+24*60*60))
    days_left <- round(difftime(end_time, t, units = "days"), 0)
    #extract the data prior to time t
    temp_dat <- na.omit(data[data$SubmissionDate_datetime <= t,])
    # only draw a frame once there is something to plot
    # (this guard is a reconstruction; the original line was lost from the post)
    if (nrow(temp_dat) > 0){
      #find the top 7 teams
      top_teams <- sqldf("select TeamName, max(lb_score) as lb_score
        from temp_dat
          group by TeamName
            order by 2 desc
              limit 7")
      #get the ranking for each team
      top_teams <- data.frame(top_teams, Leaderboard = paste(rownames(top_teams), " ", top_teams$TeamName, " (", top_teams$lb_score, ")", sep = ""))
      # select just the top teams
      temp_dat <- sqldf("select temp_dat.*, top_teams.Leaderboard as Leaderboard
        from temp_dat, top_teams
          where temp_dat.TeamName = top_teams.TeamName and temp_dat.TeamName in (select TeamName from top_teams)")
      #turn off the legend for alpha
      sc <- scale_alpha_continuous(guide = "none")
      #plot the data
      p <- ggplot(temp_dat, aes(x=SubmissionDate_datetime, y=lb_score, group=Leaderboard, colour = Leaderboard, alpha = rank(lb_score)))
      theme_set(theme_gray(base_size = 18))
      print(p + geom_line(size = 2)
            + geom_point(size = 4, aes(colour = Leaderboard, alpha = rank(lb_score)))
            + xlab("Submission Date")
            + coord_cartesian(xlim = c(min(na.omit(data$SubmissionDate_datetime)), max(na.omit(data$SubmissionDate_datetime))+20*24*60*60)
                              # , ylim = c(max(c(0, min(na.omit(data$Score)))), max(na.omit(data$Score))+0.01)
                              , ylim = c(0.85, max(na.omit(data$Score))+0.01))
            #add a vertical line to show the current date
            + geom_vline(linetype = 2, xintercept = as.numeric(t))
            #add a vertical line to show the end date
            + geom_vline(linetype = 2, xintercept = as.numeric(end_time))
            #label the end date line
            + annotate("text"
              , label = paste("Competition ends: ", end_day)
              , x = end_time+(48*60*60)
              , y = 0.86
              , hjust=0
              , vjust=0
              , angle = 90)
            # add a horizontal line to show the current top score
            + geom_hline(linetype = 2, yintercept = max(temp_dat$Score))
            # add a label to show the current top score
            + annotate("text"
              , label = paste("Top Score:", round(max(temp_dat$Score),4))
              , x = t+12*60*60
              , y = max(temp_dat$Score)
              , hjust=0
              , vjust=-1)
            #add a label to show the number of days left
            + annotate("text"
              , label = paste(days_left, "days left.")
              , x = t+12*60*60
              , y = max(temp_dat$Score)-0.002
              , hjust=0
              , vjust=1)
            # add a title
            + ggtitle(paste("'Grant Applications Comp' as of", max(temp_dat$Submission_day)))
            #turn off the legend for alpha
            + sc
      )
    }
  }
}
#create the file
# NOTE: the call that wrapped these options was lost from the post;
# saveHTML() from the animation package is assumed here
saveHTML(make_animation()
  , title = "Predict Grant Applications"
  , description = "Predict Grant Applications"
  , ani.width = 900
  , ani.height = 600
)
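
The leaderboard score used above is simply each team's best score so far. As a minimal sketch, and not the code used for the entry, the same running best can be computed without an explicit loop using base R's ave() and cummax(). The toy data frame below is made up for illustration, and the sketch assumes that rows are already sorted by team and submission time and that a higher score is better.

```r
# Hypothetical miniature leaderboard, already sorted by team and submission time
toy <- data.frame(
  TeamName = c("A", "A", "A", "B", "B"),
  Score    = c(0.90, 0.88, 0.93, 0.91, 0.89)
)

# Running best score within each team (assumes a higher score is better)
toy$lb_score <- ave(toy$Score, toy$TeamName, FUN = cummax)

print(toy)
```

On the real data the equivalent call would be applied to data$Score and data$TeamName after the sort step.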

Monday, February 4, 2013

Please support me on Kaggle

Hi All

I have just entered a Kaggle competition. Please vote for my entry here.

Ross Farrelly

Friday, January 25, 2013

Timeless Classics - the Antidote to Time Poverty

If, like many white-collar workers in today’s modern economy, you are “time poor” and constantly swamped by the ever-growing torrent of information coming at you every day, despair not. Help is at hand. But it comes in a somewhat unlikely guise. It’s not yet another more sophisticated news aggregation text-mining algorithm, nor is it the next-gen web 3.0 nanoblogging, retweeting, Facebook-posting multifunction one-stop web-accumulation app for your smart phone. No. It is those leather-bound volumes gathering dust on your bookshelf and that set of Penguin Classics you bought in a fit of self-improvement last year and have never read.

Let me explain.

The idea of being “time poor” really comes down to balancing our desires. If we want to do more than we have time for, we say we are time poor. There are two possible solutions – either want to do less, or find a way to do more. Focusing on the latter solution, for many professionals a closely related problem is that of deciding what information to consume and create at what time. Most professional occupations involve information processing - whether it be consumption, comprehension, dissemination or creation of information, or decision making based on information.

The problem can be described as follows. Each of us has a certain bandwidth or ability to ingest, digest and make sense of information each day. The question is, what is the optimal manner in which we should consume information? How do we decide what is the most important information to read on any given day? How do we decide when we should stop consuming and start creating? With the information age, this problem has become ever more pressing. In times gone by the problem was getting the information. Now it is deciding what not to read. My observation is that, because of the easy access to vast amounts of low-value information, the quantity of high-quality output has dwindled. Would Tocqueville have produced a classic like Democracy in America if he had been constantly answering emails, checking his Facebook page and posting on Twitter? Many people are good at the day-to-day detailed information - the immediate and the short term - but long-term investment in substantial texts is often lacking.

My solution to the problem of being time poor is to make some time in my busy schedule to read the classics. When you read you invest your time and energy, your attention and comprehension in understanding the text being read. In order to make the most of your investment you are best advised to read those books which are commonly regarded as the classics of the Western canon. This is because, by definition, a classic has stood the test of time – which is to say that many people, over many years, decades and sometimes centuries, have decided to invest their psychic energy into the effort required to read the book. This indicates that the classic contains and expresses something worthwhile about the human condition, something that transcends the day to day and speaks to that which is common between and among people from one epoch to another.

However, by no means does the value of reading a classic end there. The return on the investment of your time is multiplied many times over by the references to these classic works which you encounter as you read subsequent classics.

These observations on the benefits of reading the classics are not based on just a passing acquaintance with the subject. About eight and a half years ago a friend and I decided to read our way through a selection of the Western classics (I don’t mean the screenplay of True Grit). Simon is a banker, I’m a data scientist, so we are both reasonably well educated but not formally trained in the classics, history, philosophy or literature. Nevertheless, we could see no reason why, if we put our minds to it, we couldn’t read these texts and derive some significant benefit from doing so.

We didn’t get too hung up on what was defined to be a classic. We looked at lists such as the Encyclopedia Britannica’s Great Books of the Western World and the more eclectic list to be found in Harold Bloom’s The Western Canon, but we ended up choosing as our guide Invitation to the Classics, edited by Louise Cowan.

It wasn’t meant to be an onerous task either, so we would read in our own time and then meet to discuss the book over a leisurely lunch at The East Sydney Hotel, the only pub in Sydney which doesn’t have poker machines and where you can have a decent conversation without competing with a jukebox or MTV. (During the 2008 American elections when it was looking as if the US was about to elect its first African American president, a patron approached the bar and asked if the television could be turned on. The barman first eyed the patron suspiciously and then, looking at the TV as if it were an alien invention from another planet, grudgingly replied, “Well, I suppose so … but no sound.”)

While our regime was fairly relaxed, there were some informal guidelines we tried to stick to. Firstly, we wanted to read the classics in chronological order. Our thinking was that in doing so we would have a better chance of understanding any references we came across to previous classics.

For example, by reading The Iliad before The Odyssey we have a better picture of who Odysseus is, what he has been through, and the perseverance he shows in order to return home. By reading both these texts before approaching Aeschylus’s Oresteia we understand something of Agamemnon, his brother’s misfortunes and his sacrifice of Iphigenia to propitiate Artemis. Having read these texts we are then better prepared to understand The Aeneid, the action of which follows directly after the sacking of Troy and which makes numerous references to The Odyssey. Finally, when we get to the Divine Comedy and learn that Dante has chosen Virgil as his guide, the name carries weight and meaning for us. In this way every time a work we have read is referenced, we reap the rewards of our prior reading. When, in Uncle Tom’s Cabin, St. Clare says he “can’t turn Knight-errant” and free every slave, we reap the rewards for having read Don Quixote. These bonuses can be found in the most unexpected places. Who would have thought we would be rewarded for the effort of having read Aristophanes’ The Frogs in the middle of Surely You’re Joking, Mr. Feynman?

We also undertook to read the texts in chronological order to try to understand the impact the text would have had at the time it was written – that is to say, to bring to mind only ideas which had been expressed previously when trying to comprehend the text being read. This is not such an easy task. Some would say it’s impossible. Nevertheless, it’s informative to try to read a text in the absence of all ideas and views which have come after because it really helps us understand what the book meant when it was written. This is important because it helps us understand the stature of the author and his or her intention in writing the book.

This also raises another benefit of reading the classics – you gain a deeper, more rounded understanding of many commonly used words and phrases. The connotations of “Machiavellian” are quite different if you’ve read The Prince. “Oedipus complex” takes on a new shade of meaning if you’ve read Sophocles, and “quixotic” makes you grin in recognition if you’re familiar with Cervantes. I imagine you’d even get more out of the dreary reality TV show “Big Brother” if you’ve read Orwell – although that may be a bit of a stretch.

We also resolved to read just the actual text, not a commentary, explanation, interpretation or summary, but the full text word for word, and to see what we made of it before being influenced by the commentators. (One exception to this was the page or two in Invitation to the Classics which I found to be extremely useful in whetting the appetite to read the full work.)

In our discussions we resolved to discuss without reference to notes, only drawing on our understanding and memory of the works. Knowing that you are going to be called upon to discuss the book viva voce focusses the attention and encourages you to build up a mental picture of the book as you read it, perhaps not in every detail, but certainly in general terms. It also heightens your awareness of memorable themes and incidents which can be discussed later.

Proceeding in this manner we have read our way through classical works by Homer, Aeschylus, Sophocles, Euripides, Aristophanes, Aristotle and Virgil; Christian classics by Augustine, Aquinas, Dante, Luther and Calvin; Renaissance texts by Machiavelli and Cervantes (but skipped Shakespeare as we felt we had enough of a passing acquaintance with a number of his plays) and many more besides, including my personal favorite, Samuel Johnson’s Essays. We even read How to Read a Book by Mortimer J. Adler (not to be confused with the satirical essay, How to Read Two Books by Erasmus G Addlepate). There have been diversions - Emerson led us to Swedenborg, which turned out to be a bit of a cul-de-sac, and I took a detour via the eminently readable Stefan Zweig for a while.

Furthermore, you learn something true and lasting about the human condition. Let’s take just one example of a classic work and explore a few of the novel and important ideas expressed in it. Samuel Johnson’s Essays contains a wealth of novel insights – just one of his essays can be summarized as:

We can’t be free from troubles but we can rise above them. Although we can never hope for total equanimity we can strive towards it. People with little to do are troubled by small things. Rest without work is not restful. To seek happiness by changing anything but one’s own disposition is fruitless. Contemplation is essential to virtue as virtue involves long term goods.

Not a bad series of insights for a single essay – and he wrote over two hundred of them! Some of the observations in Johnson’s essays have stayed with me and changed the way I lead my life. His comment that everyone, even the thief, vindicates himself and justifies the way he leads his life has fundamentally changed the way I think about how best to advise my children. His advice to be content with being discontent and to relish the feeling of an unfulfilled desire as more enjoyable than its satisfaction has changed the way I set and achieve goals. And it’s not just Johnson. Emerson’s moving essay, Self Reliance, despite being internally contradictory, is one of the most motivational pieces of advice I have ever come across.

So if you’re engulfed by emails, foundering in Facebook, bowled over by blogs and tormented by Twitter – make some space in your busy schedule for a Great Book or two, invest some time in some serious reading and reap the rewards of time well spent.

Sunday, January 13, 2013

Review of The Innovator’s Dilemma by Clayton M. Christensen

Main Thesis – In order to stay competitive in the long term, companies need to invest in disruptive technologies: technologies which are typically low spec, more expensive and not required by their current customers. If they don’t, companies which do develop these new technologies will soon develop them into a mainstream product which will displace the previous industry standard, and the company will not thrive.

There is a difference between sustaining innovations, which improve existing ways of doing things, and disruptive technologies, which do things in quite different ways or at a significantly lower cost.
Christensen’s findings include:

  • Companies need to invest in disruptive technologies early on.
  • They usually need to do so by starting a new company or spin-off which is solely dedicated to developing the disruptive technology, commercializing it and finding or creating the right market to sell it into. They need to do this because trying to dedicate sufficient resources within the parent company may be theoretically possible but in practice it usually fails. It is very hard to prevent valuable resources being diverted to solve urgent problems with the main technology which addresses existing customers’ needs.
  • Disruptive technologies need to be developed in a very flexible way which will allow them to be reengineered as the market becomes clear. Large-scale up-front investment in a fixed product design is not advisable.