Monday, 9 January 2012

Mahout Experience (1) --LDA


LDA (Latent Dirichlet Allocation)

https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation


Mahout's LDA implementation runs on a collection of SparseVectors of word counts. These word counts must be non-negative integers. For details on creating such vectors, see https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
Use TF as the weight, not TF-IDF.
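
To make the TF requirement concrete, here is a minimal sketch in plain shell (not a Mahout command) that counts how often each word occurs in a toy sentence. The sentence and the variable names are made up for illustration. Every value it prints is a non-negative integer, which is the kind of count LDA expects; TF-IDF weighting would rescale such counts into fractional values.

```shell
# Toy document; the text is invented for illustration only.
doc="the cat sat on the mat the end"

# Put one word per line, then count how often each distinct word occurs.
# Output format is word:count, one pair per line.
counts=$(printf '%s\n' "$doc" | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2 ":" $1}')

printf '%s\n' "$counts"
```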



1 Creating a collection of SparseVectors of word counts
     There are several ways to create a collection of SparseVectors.

     From a Lucene index

  • Create the Lucene index. Note: the Lucene version used to build the index must match the Lucene version that Mahout uses. You can find the latter by searching Mahout's POM file. Here I used Lucene 3.4.0.
  • Generate vectors from the Lucene index:
    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
    --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
    <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>
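
As a rough illustration of the usage line above, the sketch below fills in the placeholders and prints the resulting command so it can be reviewed before running. The install location, all paths, and the field name `body` are hypothetical. Only flags from the usage line are used, and `--norm` is left out. Some Mahout versions appear to have a `--weight` option that defaults to TF-IDF; that is an assumption to confirm with `mahout lucene.vector --help`, since LDA needs TF.

```shell
# All paths and the field name below are hypothetical placeholders.
MAHOUT="${MAHOUT_HOME:-/opt/mahout}/bin/mahout"
INDEX_DIR="/data/lucene-index"
VECTORS_OUT="/data/vectors/part-out.vec"
DICT_OUT="/data/vectors/dict.txt"
FIELD="body"

# Assemble the command from the required arguments of the usage line.
cmd="$MAHOUT lucene.vector $INDEX_DIR --output $VECTORS_OUT --field $FIELD --dictOut $DICT_OUT"

# Print the command for review; once it looks right, run it with: eval "$cmd"
echo "$cmd"
```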
     From text files
     Mahout's tools can also generate vectors from text files.


  • Before creating vectors, convert the text to SequenceFile format:
    $MAHOUT_HOME/bin/mahout seqdirectory \
    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
  • Then convert the SequenceFiles to SparseVectors:
    $MAHOUT_HOME/bin/mahout seq2sparse \
    -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
    <--minSupport <MINIMUM SUPPORT> 2> \
    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
    <--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}> \
    <-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>
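
The two steps above can be sketched as one small script. The install location and the three directories are hypothetical, and the `run` helper is something introduced here for illustration: it only prints each command unless `RUN=1` is set. `-wt tf` follows the advice to use TF for LDA. Writing `-seq` as a bare flag is an assumption; the usage text describes it as false|true, so check `mahout seq2sparse --help` for your version.

```shell
# Hypothetical locations; adjust to your environment.
MAHOUT="${MAHOUT_HOME:-/opt/mahout}/bin/mahout"
DOCS_DIR="/data/docs"        # parent directory of the raw text files
SEQ_DIR="/data/docs-seq"     # SequenceFile output of step 1
VEC_DIR="/data/docs-vectors" # vectors and dictionary output of step 2

# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Step 1: text files -> SequenceFiles.
run "$MAHOUT" seqdirectory --input "$DOCS_DIR" --output "$SEQ_DIR" -c UTF-8

# Step 2: SequenceFiles -> SparseVectors with TF weighting for LDA.
# -seq asks for SequentialAccessVectors (flag form is an assumption).
run "$MAHOUT" seq2sparse -i "$SEQ_DIR" -o "$VEC_DIR" -wt tf -seq
```

Run it once as is to inspect the printed commands, then again with RUN=1 to execute them.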

2 Running LDA with Mahout
