Building a text classifier in Mahout's Spark Shell

This tutorial will walk you through the steps used to train a Multinomial Naive Bayes model and build a text classifier based on that model using the mahout spark-shell.

Prerequisites

This tutorial assumes that you have your Spark environment variables set up for the mahout spark-shell (see: Playing with Mahout's Shell). We also assume that Mahout is running in cluster mode (i.e., with the MAHOUT_LOCAL environment variable unset), since we will be reading from and writing to HDFS.

Downloading and vectorizing the Wikipedia dataset

As of Mahout v. 0.10.0, we still rely on the MapReduce versions of mahout seqwiki and mahout seq2sparse to extract and vectorize our text. A Spark implementation of seq2sparse is slated for Mahout v. 0.11. However, to download the Wikipedia dataset, extract the document bodies, label each document, and vectorize the text into TF-IDF vectors, we can simply run the wikipedia-classifier.sh example.

Please select a number to choose the corresponding task to run
1. CBayes (may require increased heap space on yarn)
2. BinaryCBayes
3. clean -- cleans up the work area in /tmp/mahout-work-wiki
Enter your choice :

Enter (2). This will download a large, recent XML dump of the Wikipedia database into a /tmp/mahout-work-wiki directory, unzip it, and place it in HDFS. It will run a MapReduce job to parse the Wikipedia set, extracting and labeling only the pages with category tags for [United States] and [United Kingdom] (~11,600 documents). It will then run mahout seq2sparse to convert the documents into TF-IDF vectors. The script will also build and test a Naive Bayes model using MapReduce. When it finishes, you should see a confusion matrix on the screen. For this tutorial we will ignore the MapReduce model and build a new model with Spark, based on the vectorized text output by seq2sparse.

Getting started

Launch the mahout spark-shell. There is an example script, spark-document-classifier.mscala (.mscala denotes a Mahout-Scala script, which can be run in much the same way as an R script). We will walk through this script in this tutorial, but if you just wanted to run it, you could simply issue the command:

mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala

For now, let's break the script down piece by piece. You can cut and paste the following code blocks into the mahout spark-shell.

Imports

Our Mahout Naive Bayes imports:

import org.apache.mahout.classifier.naivebayes._
import org.apache.mahout.classifier.stats._
import org.apache.mahout.nlp.tfidf._

Hadoop imports needed to read our dictionary:

import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable

Read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh

val pathToData = "/tmp/mahout-work-wiki/"
val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
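
Optionally (purely illustrative, not part of the original script), you can peek at the dimensions of the distributed row matrix to confirm the read, assuming the usual DRM nrow/ncol accessors:

// illustrative check: number of documents (rows) and dictionary terms (columns)
println(s"documents: ${fullData.nrow}, terms: ${fullData.ncol}")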

Extract the category of each observation and aggregate those observations by category

val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(
                                                             fullData)

Build a Multinomial Naive Bayes model and self-test on the training set

val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
println(resAnalyzer)

Printing the ResultAnalyzer will display the confusion matrix.

Read in the dictionary and document frequency count from HDFS

val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
                                  classOf[Text],
                                  classOf[IntWritable])
val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count",
                                              classOf[IntWritable],
                                              classOf[LongWritable])

// setup the dictionary and document frequency count as maps
val dictionaryRDD = dictionary.map { 
                                case (wKey, wVal) => wKey.asInstanceOf[Text]
                                                         .toString() -> wVal.get() 
                                   }
                                   
val documentFrequencyCountRDD = documentFrequencyCount.map {
                                        case (wKey, wVal) => wKey.asInstanceOf[IntWritable]
                                                                 .get() -> wVal.get() 
                                                           }

val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
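
As a quick sanity check (illustrative only, not in the original script), you can confirm that both maps were populated:

// illustrative check: sizes of the collected dictionary and document-frequency maps
println(s"dictionary terms: ${dictionaryMap.size}, df-count entries: ${dfCountMap.size}")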

Define a function to tokenize and vectorize new text using our current dictionary

For this simple example, our vectorizeDocument(...) function will break a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a Lucene analyzer for bigrams, trigrams, etc., and integrate Apache Tika to extract text from different document types (PDF, PPT, XLS, etc.). Here, however, we will keep it simple, splitting and tokenizing our text with regexes and native String methods.

def vectorizeDocument(document: String,
                        dictionaryMap: Map[String,Int],
                        dfMap: Map[Int,Long]): Vector = {
    val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                                .toLowerCase
                                .split(" ")
                                .groupBy(identity)
                                .mapValues(_.length)         
    val vec = new RandomAccessSparseVector(dictionaryMap.size)
    val totalDFSize = dfMap(-1)
    val docSize = wordCounts.size
    for (word <- wordCounts) {
        val term = word._1
        if (dictionaryMap.contains(term)) {
            val tfidf: TermWeight = new TFIDF()
            val termFreq = word._2
            val dictIndex = dictionaryMap(term)
            val docFreq = dfMap(dictIndex)
            val currentTfIdf = tfidf.calculate(termFreq,
                                               docFreq.toInt,
                                               docSize,
                                               totalDFSize.toInt)
            vec.setQuick(dictIndex, currentTfIdf)
        }
    }
    vec
}
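
As a quick smoke test (hypothetical input sentence; assumes dictionaryMap and dfCountMap are already in scope and that Mahout's Vector API exposes getNumNondefaultElements()), you can vectorize a toy string:

// smoke test (illustrative only): vectorize a toy sentence and count its populated entries
val toyVec = vectorizeDocument("the football match in london", dictionaryMap, dfCountMap)
println(s"non-default entries in toy vector: ${toyVec.getNumNondefaultElements()}")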

Set up our classifier

val labelMap = model.labelIndex
val numLabels = model.numLabels
val reverseLabelMap = labelMap.map(x => x._2 -> x._1)

// instantiate the correct type of classifier
val classifier = model.isComplementary match {
    case true => new ComplementaryNBClassifier(model)
    case _ => new StandardNBClassifier(model)
}

Define an argmax function

The label with the highest score wins the classification for a given document.

def argmax(v: Vector): (Int, Double) = {
    var bestIdx: Int = Integer.MIN_VALUE
    var bestScore: Double = Integer.MIN_VALUE.toDouble
    for(i <- 0 until v.size) {
        if(v(i) > bestScore){
            bestScore = v(i)
            bestIdx = i
        }
    }
    (bestIdx, bestScore)
}
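
As an aside, Mahout's Vector interface also exposes maxValue() and maxValueIndex(), so an equivalent helper could be sketched in one line (assumed to behave the same for our score vectors):

// alternative sketch using the Vector API's built-in max lookups
def argmaxBuiltin(v: Vector): (Int, Double) = (v.maxValueIndex(), v.maxValue())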

Define our TF(-IDF) vector classifier

def classifyDocument(clvec: Vector) : String = {
    val cvec = classifier.classifyFull(clvec)
    val (bestIdx, bestScore) = argmax(cvec)
    reverseLabelMap(bestIdx)
}

Two sample news articles: United States football and United Kingdom football

// A random United States football article
// http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
val UStextToClassify = new String("(Reuters) - Super Bowl security officials acknowledge" +
    " the NFL championship game represents a high profile target on a world stage but are" +
    " unaware of any specific credible threats against Sunday's showcase. In advance of" +
    " one of the world's biggest single day sporting events, Homeland Security Secretary" +
    " Jeh Johnson was in Glendale on Wednesday to review security preparations and tour" +
    " University of Phoenix Stadium where the Seattle Seahawks and New England Patriots" +
    " will battle. Deadly shootings in Paris and arrest of suspects in Belgium, Greece and" +
    " Germany heightened fears of more attacks around the world and social media accounts" +
    " linked to Middle East militant groups have carried a number of threats to attack" +
    " high-profile U.S. events. There is no specific credible threat, said Johnson, who" + 
    " has appointed a federal coordination team to work with local, state and federal" +
    " agencies to ensure safety of fans, players and other workers associated with the" + 
    " Super Bowl. I'm confident we will have a safe and secure and successful event." +
    " Sunday's game has been given a Special Event Assessment Rating (SEAR) 1 rating, the" +
    " same as in previous years, except for the year after the Sept. 11, 2001 attacks, when" +
    " a higher level was declared. But security will be tight and visible around Super" +
    " Bowl-related events as well as during the game itself. All fans will pass through" +
    " metal detectors and pat downs. Over 4,000 private security personnel will be deployed" +
    " and the almost 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear" +
    " device sniffing teams will be deployed and a network of Bio-Watch detectors will be" +
    " set up to provide a warning in the event of a biological attack. The Department of" +
    " Homeland Security (DHS) said in a press release it had held special cyber-security" +
    " and anti-sniper training sessions. A U.S. official said the Transportation Security" +
    " Administration, which is responsible for screening airline passengers, will add" +
    " screeners and checkpoint lanes at airports. Federal air marshals, behavior detection" +
    " officers and dog teams will help to secure transportation systems in the area. We" +
    " will be ramping it (security) up on Sunday, there is no doubt about that, said Federal"+
    " Coordinator Matthew Allen, the DHS point of contact for planning and support. I have" +
    " every confidence the public safety agencies that represented in the planning process" +
    " are going to have their best and brightest out there this weekend and we will have" +
    " a very safe Super Bowl.")

// A random United Kingdom football article
// http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship" +
    " deal with online financial trading company Swissquote, expanding the commercial" +
    " partnerships that have helped to make the English club one of the richest teams in" +
    " world soccer. United did not give a value for the deal, the club's first in the sector," +
    " but said on Monday it was a multi-year agreement. The Premier League club, 20 times" +
    " English champions, claim to have 659 million followers around the globe, making the" +
    " United name attractive to major brands like Chevrolet cars and sportswear group Adidas." +
    " Swissquote said the global deal would allow it to use United's popularity in Asia to" +
    " help it meet its targets for expansion in China. Among benefits from the deal," +
    " Swissquote's clients will have a chance to meet United players and get behind the scenes" +
    " at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that" +
    " allows retail investors to buy and sell foreign exchange, equities, bonds and other asset" +
    " classes. Like other retail FX brokers, Swissquote was left nursing losses on the Swiss" +
    " franc after Switzerland's central bank stunned markets this month by abandoning its cap" +
    " on the currency. The fallout from the abrupt move put rival and West Ham United shirt" +
    " sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 million" +
    " Swiss francs ($28 million) provision for its clients who were left out of pocket" +
    " following the franc's surge. United's ability to grow revenues off the pitch has made" +
    " them the second richest club in the world behind Spain's Real Madrid, despite a" +
    " downturn in their playing fortunes. United Managing Director Richard Arnold said" +
    " there was still lots of scope for United to develop sponsorships in other areas of" +
    " business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
    " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a" +
    " huge potential still for the club, and the other thing we have seen is there is very" +
    " significant growth even within categories. United have endured a tricky transition" +
    " following the retirement of manager Alex Ferguson in 2013, finishing seventh in the" +
    " Premier League last season and missing out on a place in the lucrative Champions League." +
    " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" + 
    " Kelly; editing by Keith Weir)")

Vectorize and classify our documents

val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)

println("Classifying the news article about superbowl security (united states)")
classifyDocument(usVec)

println("Classifying the news article about Manchester United (united kingdom)")
classifyDocument(ukVec)

Tie everything together in a new method to classify text

def classifyText(txt: String): String = {
    val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
    classifyDocument(v)
}

Now we can simply call our classifyText(...) method on any String:

classifyText("Hello world from Queens")
classifyText("Hello world from London")

Model persistence

You can save the model to HDFS:

model.dfsWrite("/path/to/model")

And retrieve it with:

val model =  NBModel.dfsRead("/path/to/model")

The trained model can now be embedded in an external application.
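
For example, an external application could reload the persisted model and rebuild the matching classifier from it; a minimal sketch, reusing only the classes shown above:

// minimal sketch: reload a persisted model and instantiate the matching classifier
val loadedModel = NBModel.dfsRead("/path/to/model")
val loadedClassifier = loadedModel.isComplementary match {
    case true => new ComplementaryNBClassifier(loadedModel)
    case _ => new StandardNBClassifier(loadedModel)
}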

 

Source: http://mahout.apache.org/docs/latest/tutorials/samsara/classify-a-doc-from-the-shell.html
