thinking: January 2012

Tuesday, 31 January 2012

MATLAB closes automatically when drawing a plot

Matlab 7.4
Mac 10.6

Problem:
When I use MATLAB to draw graphs, e.g. hist(x,y) or scatter(x,x), MATLAB closes automatically.

Solution:
http://www.mathworks.com/matlabcentral/answers/11086-matlab-closing-automatically-when-plotting-data-or-opening-an-existing-saved-figure

There is an incompatibility in a recent Java update (1.6.0_26) affecting MATLAB versions R2007a, R2007b and R2008a, that can be worked around by following these steps:

1. Close MATLAB, if it is running.

2. In Terminal or xterm, type:

open -a TextEdit /Applications/MATLAB_R2008a/bin/.matlab7rc.sh

In the path above, replace "/Applications/MATLAB_R2008a/bin/" with the bin folder under your own MATLAB root folder.

3. In the editor, navigate to Line 410 to locate:

DYLD_LIBRARY_PATH=

This line is a part of the following code in the "mac" section of matlab7rc.sh:

if [ "$DYLD_LIBRARY_PATH" != "" ]; then
   DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
else
   DYLD_LIBRARY_PATH=
fi

4. Change the line from

DYLD_LIBRARY_PATH=

to

DYLD_LIBRARY_PATH=/System/Library/Frameworks/JavaVM.framework/Libraries

For users on MATLAB R2007a and R2007b, after making the change in the "mac" section, make the same change in the "maci" section (right below the "mac" section).

5. Save the changes (Command-S).

6. Restart MATLAB.
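
Steps 2 to 5 can also be scripted. The sketch below is not part of the MathWorks answer: the function name patch_matlab_rc is made up, and it assumes the empty assignment appears as "DYLD_LIBRARY_PATH=" alone on its line. It rewrites every such line in the file, so both the "mac" and "maci" sections are changed, and it keeps the original file as a .bak copy.

```shell
#!/bin/sh
# Hypothetical helper: patch_matlab_rc <path-to-.matlab7rc.sh>
# Fills in every empty "DYLD_LIBRARY_PATH=" assignment with the Java library path.
patch_matlab_rc() {
    rc_file=$1
    java_libs=/System/Library/Frameworks/JavaVM.framework/Libraries

    if [ ! -f "$rc_file" ]; then
        echo "not found: $rc_file" >&2
        return 1
    fi

    # -i.bak edits in place and saves the untouched original as <file>.bak.
    # \(...\) captures the leading indentation so it is preserved.
    sed -i.bak \
        "s|^\([[:space:]]*\)DYLD_LIBRARY_PATH=[[:space:]]*\$|\1DYLD_LIBRARY_PATH=$java_libs|" \
        "$rc_file"
}

# Usage (close MATLAB first; adjust the folder name to your version):
# patch_matlab_rc /Applications/MATLAB_R2008a/bin/.matlab7rc.sh
```

Compare the patched file with the .bak copy before restarting MATLAB, since line numbers and section layout may differ between releases.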

Sunday, 29 January 2012

Set up MacVim

Initial setup
cd ~/          # go to the user's home directory
vim .gvimrc    # create and edit this file

Restart MacVim.
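
The note above does not say what goes into .gvimrc, so the following is only a sketch with placeholder settings. The function name init_gvimrc is made up, and the font, window size and colour scheme are examples rather than the settings originally used. It refuses to overwrite an existing file.

```shell
#!/bin/sh
# Hypothetical helper: create a starter gvimrc if none exists.
# The settings written below are examples only.
init_gvimrc() {
    target=${1:-"$HOME/.gvimrc"}

    if [ -e "$target" ]; then
        echo "$target already exists, leaving it untouched" >&2
        return 1
    fi

    # Lines starting with a double quote are comments in Vim script.
    cat > "$target" <<'EOF'
" Example GUI settings; adjust or remove to taste
set guifont=Menlo:h14
set lines=40 columns=120
colorscheme desert
EOF
}

# Usage:
# init_gvimrc            # writes ~/.gvimrc
```

MacVim reads the file at startup, which is why the restart is needed.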

Wednesday, 25 January 2012

Mahout Experience (2) - LDA

Convert the text files to a SequenceFile

./mahout seqdirectory -c UTF-8 -i /Users/ruihaidong/Documents/workspace_java/lda/data/sourceFilesource/ -o /Users/ruihaidong/Documents/workspace_java/lda/data/seqfiles

Convert the SequenceFile to vectors

./mahout seq2sparse -i /Users/ruihaidong/Documents/workspace_java/lda/data/seqfiles/ -o /Users/ruihaidong/Documents/workspace_java/lda/data/vectors -ow

Run LDA

./mahout lda -i /Users/ruihaidong/Documents/workspace_java/lda/data/vectors/tf-vectors/ -o /Users/ruihaidong/Documents/workspace_java/lda/data/ldaresult/ -k 20

Print the LDA topics

./mahout ldatopics -i /Users/ruihaidong/Documents/workspace_java/lda/data/ldaresult/state-44/ -d /Users/ruihaidong/Documents/workspace_java/lda/data/vectors/dictionary.file-0 -dt sequencefile -w 10
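
The four commands above share one data directory, so they can be wrapped in a small script. This is a sketch only: run, lda_train and lda_print_topics are made-up names, and the state directory has to be passed in by hand because its number (state-44 in my run) depends on how many iterations LDA needed. Setting DRY_RUN prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrappers around the commands above; run from Mahout's bin directory.

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

lda_train() {
    data_dir=$1   # e.g. .../workspace_java/lda/data
    run ./mahout seqdirectory -c UTF-8 -i "$data_dir/sourceFilesource/" -o "$data_dir/seqfiles" &&
    run ./mahout seq2sparse -i "$data_dir/seqfiles/" -o "$data_dir/vectors" -ow &&
    run ./mahout lda -i "$data_dir/vectors/tf-vectors/" -o "$data_dir/ldaresult/" -k 20
}

lda_print_topics() {
    data_dir=$1
    state=$2      # last state-N directory written by the lda step
    run ./mahout ldatopics -i "$data_dir/ldaresult/$state/" \
        -d "$data_dir/vectors/dictionary.file-0" -dt sequencefile -w 10
}

# Usage:
# lda_train /path/to/lda/data
# lda_print_topics /path/to/lda/data state-44
```

The && between steps stops the chain as soon as one command fails, so a failed conversion does not feed an empty directory into the next step.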

Monday, 9 January 2012

Mahout Experience (1) - LDA

LDA (Latent Dirichlet Allocation)

https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

Mahout's LDA implementation runs on a collection of SparseVectors of word counts. These word counts must be non-negative integers. For details on creating such vectors, see https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
Use TF as the weight, not TF-IDF.
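
To make "word counts" concrete, here is a rough shell sketch that counts terms in one piece of text. It is not how Mahout tokenizes (Mahout uses a Lucene analyzer), and term_counts is a made-up name; the point is only that a TF value is a whole number of occurrences, whereas a TF-IDF weight generally is not.

```shell
#!/bin/sh
# Hypothetical helper: print "term count" pairs for text read from stdin.
term_counts() {
    tr '[:upper:]' '[:lower:]' |      # fold case
        tr -cs '[:alpha:]' '\n' |     # one word per line, split on non-letters
        sed '/^$/d' |                 # drop empty lines
        sort | uniq -c |              # count identical words
        awk '{print $2, $1}'          # reorder to "term count"
}

printf 'the cat sat on the mat\n' | term_counts
# prints:
# cat 1
# mat 1
# on 1
# sat 1
# the 2
```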

1 Create a collection of SparseVectors of word counts
There are several ways to create a collection of SparseVectors.

From a Lucene index

Create the Lucene index. (Note) The Lucene version used to build the index must match the Lucene version that Mahout uses; you can find the latter by searching Mahout's POM file. Here I used Lucene 3.4.0.
Generate vectors from the Lucene index:
$MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
--output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
<--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>

From text files
Mahout's tools can generate vectors from text files.

Before creating vectors, the text must be converted to SequenceFile format.
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

Then convert the SequenceFile to SparseVectors:

$MAHOUT_HOME/bin/mahout seq2sparse \
-i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
<-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
<-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
<-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
<--minSupport <MINIMUM SUPPORT> 2> \
<--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
<--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
<--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}> \
<-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>
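
Since LDA wants raw counts, the one option from the template above that matters most here is -wt tf. The sketch below only assembles and prints such a command line; build_seq2sparse_cmd and the example paths are made up, and I have left out -seq because I am not sure whether the installed version expects a value after it (check the tool's help output).

```shell
#!/bin/sh
# Hypothetical helper: print a seq2sparse command that emits term-frequency vectors.
build_seq2sparse_cmd() {
    seq_dir=$1    # SequenceFiles produced by seqdirectory
    vec_dir=$2    # where the vectors and dictionary should be written
    mahout_bin=${MAHOUT_HOME:-.}/bin/mahout

    # -wt tf keeps integer counts instead of TF-IDF weights
    echo "$mahout_bin seq2sparse -i $seq_dir -o $vec_dir -wt tf"
}

# Example with placeholder paths:
build_seq2sparse_cmd /tmp/lda/seqfiles /tmp/lda/vectors
```

Printing the command first makes it easy to check the paths before pasting it into the terminal; paths containing spaces would need quoting by hand.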

2 Run LDA with Mahout