Scientific Memo

Sunday 2 December 2012

Why Open Access?

Open access is one of the hot topics discussed these days. I have recently commented about the issue in the context of ACM's position [link]. Editor Prof Vardi was kind enough to give me an opportunity to express my views. Apart from the views I express there, I also would like to point out the parallels with the open source software. Open access publishing currently viewed as just an access to article's content. Probably PLoS is the only model that goes beyond this, that one can leave comments on it after peer-review. I think this signifies future tendencies on peer-review, why a paper should be a static document? For example, largest preprint service arXiv supports versions on the articles. I wouldn't surprise if we adapt similar practices of open source movement under the open science umbrella; living peer-review and access to every details of the research. Well, it opens a big debate as well but looks like industry is embracing the value of openness, slowly.

Thursday 29 November 2012

Football Goal Distributions and Statistical Mechanics

With the advent of complex network science and its allied approaches in the last decade or so with the data driven research, using statistical mechanical techniques out side of materials or atomic physics becomes a standard and quite a popular practice. One of the most interesting of this usage was in football (or soccer for Americans). This is probably because of mass football-mania in the UK and rest of the Europe and the new-world of course. However, using statistics is not a new thing at all but finding similarities with the atomic systems. Specially quantitative approach to sports is very well known, a recent film featuring Brad Pitt, Moneyball, shows this. Every major sports club (merchandise in the US) is now running a statistics division, considering the sky high salaries of players. There are interesting works in the goal statistics, being non-Gaussian [link] [link] [link] and passing network for football strategies [link] among other works. This kind of research is classified as econophysics.

Monday 5 November 2012

Imputation of missing data: Recursive 1D discrete KNN algorithm

Any generated data is often have missing components or values. Probably, the most common occurrence manifest in time series data where there is no value available on the given time point, hence a NaN is placed in general (or NA in R). There is a large literature on how a statistical analysis must be performed in such data sets. For example, a seminal book by Little & Rubin called Statistical Analysis With Missing Data provides very detailed exposure to the field. Probably the simplest of all methods is called imputation. For example there are high quality R packages like imputation or mi that does the job for you. Similarly knnimpute from MATLAB bioinformatics toolbox provides similar solution.

k-nearest neighbour algorithm (KNN) is the most common approach to discover the closest available value in the data vector. However, often implementations of KNN contains a lot of options that are not needed for simple imputations in 1D and Euclidean metric. Here I propose 1D discrete KNN recursive algorithm that scans a given vector and determines the closest available value to given index). The main idea is assuming periodic boundary conditions for the vector index boundaries (Two-way linear search on the ring, see simple sketch). This is an $O(\bf{N})$ algorithm, we scan the vector twice in the worst case scenario. I have implemented this idea in MATLAB, for a given matrix. Files are available on the matworks file exchange [link]. This tool could be useful in imputing data in regression design matrices.

Monday 22 October 2012

Hands on Crash Tutorial for Supervised Learning : Multinomial Logistic Regression (MNL)

If you are a computer scientist, probably you would call a task supervised learning what others might call classification based on data. Data being anything that has a well defined structure and associated classes. We will work on hands on example for multi-class prediction with logistic regression i.e. multinomial logistic regression (MNL). What I meant by multi-class is here that we have $k \in \bf{Z}$ distinct classes for each observation, $k>1$. Actually if you think in terms of link functions from Generalized Linear Models (GLMs), the support of the distribution will tell you the distinction of the nature of the class. In statistics literature classes can be manifest as factors (or categories).

The following pointers would be helpful before you read further. From R perspective, I'd suggest German Rodrigez's notes for background reading. Data mining blog by Will Dwinnell is one of the clear descriptions in this direction for MATLAB. For sure, an authoritative resource on this is the book by Hastie et. al. called elements of statistical learning, you can obtain it from their Stanford page. An other excellent resource from Professor Ripley of Oxford, his books and R packages are known to be de facto standard in the field, particularly here I refer to nnet and applied statistics book.

Recall that a design matrix $\bf{X}$ is noting more then set of observations ($n$) in the rows. Observables ($p$) are placed along columns. The idea of supervised learning is that we train our model, here we choose to be multinomial GLM, against a data set and obtain coefficients of the model (parameters). Let's do this with the simplest possible example. (R codes start with '>' and MATLAB '>>')

Statistical toolbox brings set of functionality to do multinomial logistic regression. mnrfit is the main function we would like to use. Lets use a simple data set, it is trivial and the statistics we would get from this data might be very poor, but we aimed at the concept in this post.

>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]'; % design matrix
>> Y = [1 2 1 3 1 2 1 3 1 1]'; % associated classes

We can split this set to training and validation data set by choosing

>> trainIndex = [2 8 10 4 5];
>> validIndex = [1 3 6 7 9];

Now we can obtain the coefficients via mnrfit.

>> betaHat    = mnrfit(X(trainIndex), Y(trainIndex), 'model', 'ordinal', 'interactions', 'off', 'link', 'logit');
Note that, model is ordinal, meaning that it takes discrete values. Switched off interactions generates only one coefficient per class (betaHat vector). Then we can get the probabilities for the validation set.

>> predictProbs=mnrval(betaHat, X(validIndex), 'model', 'ordinal', 'interactions', 'off', 'link', 'logit');

ans =

    0.2980    0.1958    0.5061
    0.3636    0.2041    0.4324
    0.4242    0.2045    0.3713
    0.4346    0.2039    0.3615
    0.5084    0.1954    0.2961

So the highest probabilites form this matrix will give us [3, 3, 1, 1, 1] predictive set. With this approach we have predicted last two classes correctly.

Let's do this same example with R, even though the following approach may not be exactly the same procedure described above, however this is multinomial as well:
> require(nnet)
> # Design Matrix: 10 observations
> X            = matrix(c(0.0, 0.1, 0.7, 1.0, 1.1, 1.3, 1.4, 1.7, 2.1, 2.2));
> # Corresponding Classes: 3 ordinal classes
> Y            = matrix(c(0, 1, 0, 0,
                           0, 0, 1, 0,
                           0, 1, 0, 0,
                                     0, 0, 0, 1,
                                     0, 1, 0, 0,
                                     0, 0, 1, 0,
                                     0, 1, 0, 0,
                                     0, 0, 1, 0,
                                     0, 1, 0, 0,
                                     0, 1, 0, 0
                              ), nrow = 10, ncol=4, byrow=TRUE)

Similarly we can obtain training and validation sets

># Generate training and validation data sets for X, Y
> trainIndex   = c(2, 8, 10, 4, 5);
> validIndex   = c(1, 3, 6, 7, 9);
Now, we can use nnet package fitting function for multinomial log-linear.

> mfit = multinom(formula = Y[trainIndex, ] ~ X[trainIndex] + 0)

Note that we put 0 for the intercepts while first column of Y is dummy.

> predict(mfit, X[validIndex])
[1] 2 2 2 2 2

Resulting classification is not that great. The above example shown a basic approach to supervised learning in MATLAB and R. However, one must check statistical values, like residuals, deviance, p-values etc. for the quality of results.

Monday 15 October 2012

Compressed Sensing with R

Compressed sensing (CS) is pretty much appealing all current signal processing research community. At the same time, popularity of R language gaining a strong foot in the research and industry. Even though historically MATLAB is a de-facto standard in signal processing community, R is becoming a serious alternative to this. For example, the quality of R in time-series analysis or medical imaging is now an accepted fact.

Last year, I have demonstrated how one can use R in compressed sensing research in a short tutorial in the pub, close to Liverpool street in London. The package R1magic is available in CRAN. Package provides basic interface to perform 1-D compressed sensing with l1, TV penalized minimization. There are other packages doing similar regularized minimization. However the interface of R1magic is particularly designed for CS i.e. appearance of sparse bases in the objective function.

Here is one simple example ( Version 0.1):

library(R1magic)# Signal components
N <- 100
# Sparse components
K <- 4
# Up to Measurements > K LOG (N/K)
M <- 40
# Measurement Matrix (Random Sampling Sampling)
phi <- GaussianMatrix(N,M)
# R1magic generate random signal
xorg <- sparseSignal(N, K, nlev=1e-3)
y <- phi %*% xorg ;# generate measurement
T <- diag(N) ;# Do identity transform
p <- matrix(0, N, 1) ;# initial guess
# R1magic Convex Minimization ! (unoptimized-parameter)
ll <- solveL1(phi, y, T, p)
x1 <- ll$estimate
plot( 1:100, seq(0.011,1.1,0.011), type = "n",xlab="",ylab="")
title(main="Random Sparse Signal Recovery",
xlab="Signal Component",ylab="Spike Value")
lines(1:100, xorg , col = "red")
lines(1:100, x1, col = "blue", cex = 1.5)
# shifted by 5 for clearity

Blue line is the reconstructed signal with R1magic.

Sunday 19 February 2012

Gaussians in n-dimensions

New algorithms in Gaussian Mixture Model (GMM) may sound quite oxymoron to analyze n-dimensional data sets but considering principle of parsimony it may be
the chosen one among mixture models.

Sunday 5 February 2012

Micro Swimming

Mark Haw [link] has interesting popular articles on swimming in micro-scale and physics of biological systems in that scale. Specially in reference to Purcell's
seminal article [pdf].

Sunday 22 January 2012

Quantum Mechanics without wavefunction

During my studies I was fortunate enough to write a short review on density functional theory, specifically reduction of computational burden in computing many electron integrals [pdf]. What strikes me in that short literature review a quote that is attributed to van Vleck about the legitimacy of wave function in standard quantum mechanics: ".. wave function has no legitimate scientific concept as pointed out by Van Vleck for many electrons (more than 10 electrons)..." according to Nobel lecture given by Walter Kohn, so the concept of density functionals in solving quantum mechanical problems. In similar lines, recently Israeli-american scientists has formulated the non-relativistic quantum mechanics without the need of wave function or density functions : Quantum states are represented as ensembles of real-valued quantum trajectories, obtained by extremizing an action and satisfying energy conservation [link] I think interpretation of this new results might be quite interesting, worth to think about it.

Saturday 7 January 2012

Scaling of differentiation in networks

Differentiation of networks (link), a recent work points out efficiency in many metrics networks based on a natural selection appears to be more robust.