Available online at www.sciencedirect.com
ScienceDirect
Journal of Electrical Systems and Information Technology 5 (2018) 363–370

Improve the automatic classification accuracy for Arabic tweets using ensemble methods

Hammam M. Abdelaal a,∗, Ahmed N. Elmahdy a, Ali A. Halawa a, Hassan A. Youness b
a Department of Computers and Systems Engineering, Faculty of Engineering, Al-Azhar University, Cairo, Egypt
b Department of Computers and Systems Engineering, Faculty of Engineering, Minia University, Egypt

Received 16 November 2015; received in revised form 27 February 2018; accepted 17 March 2018; available online 4 April 2018

Abstract

Tweet classification has become a topic of interest in recent years, especially for the Arabic language. In this paper, Arabic tweets are classified automatically into one of several predetermined categories, mainly sport, culture, politics, technology and general, based on their linguistic characteristics and their contents. The classification accuracy for Arabic tweets is also improved by applying the ensemble methods bagging, boosting and stacking to the same dataset used in the initial classification, in order to verify the results and identify which classifier gives the highest accuracy. The experimental results showed that using ensemble methods improves the classification accuracy compared with using an individual classifier: the accuracy of the Naïve Bayes (NB) classifier increased by 1.6%, the Sequential Minimal Optimization (SMO) classifier by 2.2%, and the Decision Tree (J48) classifier by up to 3.2%, compared with using J48, NB, or SMO as a single classifier.
© 2018 Electronics Research Institute (ERI). Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Arabic tweets; Preprocessing; Classification; Classifier algorithms; Ensemble methods

∗ Corresponding author. E-mail address: hammammohamed36@yahoo.com (H.M. Abdelaal). Peer review under the responsibility of the Electronics Research Institute (ERI). https://doi.org/10.1016/j.jesit.2018.03.001

1. Introduction

Arabic text has a different nature than English text; therefore, preprocessing of Arabic text is more challenging, and it is a very important step before text classification, in order to extract knowledge from huge amounts of data and reduce the processing operations. Preprocessing includes many steps, mainly: removing useless words, known as stop words (such as "from", "in", "on", etc.), and unifying each word by reducing it to its root form, known as stemming. Text classification aims to classify documents into predefined classes. It is also called text categorization, document classification, and document categorization.
There are two approaches to classification: manual classification and automatic classification. Manual classification depends on rule-based classification, while automatic classification depends on machine learning algorithms. In this research we used the automatic classification approach. The aim of this classification is to assign Arabic tweets automatically to predetermined categories on the basis of their linguistic characteristics, their contents, and some of the words that characterize each category from the others. Three different algorithms, bagging, boosting and stacking, were also used to improve the accuracy of the NB, J48 and SMO classifiers. The main goal of these algorithms is to convert a weak learning algorithm into a strong learning algorithm by reducing the false positive rate. These algorithms can decrease the error because, even if each base classifier has, say, a 40% probability of error on any single example, the majority vote of many base classifiers is more likely to be correct than any one of them. An ensemble algorithm therefore has better accuracy than single classification techniques. The success of the ensemble approach depends on the variety of the individual classifiers with respect to misclassified instances (Lee and Cho, 2010). In our experiments we used the popular WEKA (Waikato Environment for Knowledge Analysis) tool, an important workbench for data mining and machine learning algorithms. The results showed that using ensemble methods achieves higher accuracy than using an individual classifier: the accuracy of the SMO classifier reached up to 88.60%, the J48 classifier 86.80%, and the NB classifier 88.60%, improvements of 2.2%, 3.2%, and 1.6%, respectively.

2. Related works

Tiwari and Prakash (2014) used the ensemble methods boosting, bagging and stacking to improve the accuracy of J48 on the Sonar dataset, which contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. Their experimental results showed that stacking works best, and that the other ensembles are better than the individual J48 algorithm.

Syarif et al. (2010) used ensemble methods to improve the accuracy of network intrusion detection systems. They used four different data mining classifiers: Naïve Bayes, J48 (decision tree), JRip (rule induction) and IBk (nearest neighbour). Their experiment shows that the prototype which applies the four base classifiers and three ensemble methods achieves an accuracy of more than 99% in detecting known intrusions.

Bekkali et al. (2014) used a data set in their experiments which was collected from Twitter. They used the NodeXL Excel Template, a free Excel template that makes it easier to collect Twitter network data.
This corpus was manually classified into six categories: Cinema, News, Documentary, Health, Tourism and Economics. The effectiveness of their system was evaluated and compared in terms of F1-measure using the NB and SMO classifiers.

Lee et al. (2011) classified tweets' trending topics into 18 general categories such as sports, politics, technology, etc. They used two data modeling methods, text-based data modeling and network-based data modeling. Their results show that classification accuracy of up to 65% and 70% can be achieved using text-based and network-based classification modeling, respectively.

Sriram et al. (2010) classified tweets into predefined categories such as events, deals, news, opinions, and private messages, based on author information and domain-specific features extracted from tweets. Experimental results show that the Bag-of-Words (BOW) approach performs decently, but a set of eight features (8F) performs significantly better with this set of generic classes.

3. Tweets collection and preprocessing

The tweets are collected from the Twitter site using the Twitter search Application Programming Interface (API). The collected tweets were about different public categories, with several keywords for each category. Table 1 shows the number of collected tweets for each category. The data set consists of 500 tweets of different categories; each tweet was manually labeled based on its contents and the domain that it was found within. The tweets have been categorized into five categories, mainly: sport, politics, technology, culture and general.

Table 1. Number of collected tweets per category.

Category name | Number of tweets
Sport | 100
Politics | 100
Technology | 100
Culture | 100
General | 100
Total | 500

Fig. 1. Sample of a file in the Boolean-approach format.
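The paper does not publish its collection script. As an illustration only, the following is a minimal sketch of querying the Twitter search API for one category keyword; it assumes the twitter4j library and valid API credentials, neither of which is mentioned in the paper, and the keyword shown is a hypothetical example rather than one of the authors' actual query terms.

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetCollector {
    public static void main(String[] args) throws TwitterException {
        // Credentials are read from twitter4j.properties; they are not part of the paper.
        Twitter twitter = TwitterFactory.getSingleton();

        // One illustrative keyword for the sport category; the paper's keywords are not listed.
        Query query = new Query("رياضة");
        query.setCount(100);   // 100 tweets per category, matching Table 1
        query.setLang("ar");   // restrict results to Arabic tweets

        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println(status.getText());   // raw tweet text, to be labeled manually
        }
    }
}

Repeating such a query with one or more keywords per category and manually labeling the returned tweets would yield a corpus of the shape summarized in Table 1.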
Data preprocessing is an important step in data mining because it allows us to clean the data of noisy words. Stop words, for example, are terms that occur frequently in most of the documents in a given collection. They are extremely common words that would appear to be of little value in selecting documents that match the user's need, so they must not be included as indexing terms. Most of those words are irrelevant to the categorization task and can be dropped with no harm to the classifier's performance; dropping them may even result in an improvement due to noise reduction (Han et al., 2006). Data preprocessing is a technique that can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. It is an important part of the knowledge discovery process, because quality decisions must be based on quality data: detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making (Feldman and Sanger, 2007).

Preprocessing includes the following steps (a code sketch of the cleaning steps is given at the end of this section):

• Remove non-Arabic words (neglect English words).
• Remove special characters such as #, %, &, @, ~, etc.
• Remove diacritics and punctuation marks.
• Replace the variant forms of alef (أ, إ, آ) with plain alef (ا).
• Remove the definite article (ال).
• Replace (ة) with (ه), and (ى) with (ي).
• Remove the conjunction (و) (and).
• Remove repetition of characters (a letter repeated several times is reduced to a single occurrence).
• Remove the common repeated "laughing" expression.
• Remove repetition of hyphens.
• Remove stop words such as (من, على, في) (from, on, in).
• Stemming: a process of linguistic normalization, in which the variant forms of a word are reduced to a common form; e.g. the words 'write', 'writing', and 'writer' are all turned into 'write' (Al-Shalabi et al., 2012).
• Term weighting: one of the preprocessing methods; it helps to highlight the important words in a document collection for classification purposes (Zhengwei et al., 2010). There are several approaches to term weighting, mainly the Boolean approach and Term Frequency-Inverse Document Frequency (TF-IDF). In this research the Boolean approach is used; it records the absence or presence of a word with the Boolean values 0 or 1, respectively (Saad and Ashour, 2010). Fig. 1 shows a sample of a file in this format, using the Boolean approach (0, 1).

The data is then converted into comma-separated values (CSV) using the WEKA program. Fig. 2 describes the main phases of tweet collection and preprocessing, which start with collecting 500 tweets. The tweets are then divided into five categories based on their contents, with 100 tweets per category. The next step is tweet preprocessing, which is a very important step before categorizing the tweets, in order to extract knowledge from the massive data and reduce the processing operations. The tweets are tokenized (i.e. the text is changed into a sequence of discrete tokens), and the tokens are then stored in the document without repetition (unique words only). The next step is to classify the tweets with the SMO, NB, and J48 classifiers using the cross-validation (CV) method. The final step is evaluating the accuracy results in two cases: the individual-classifier case and the combination (ensemble) case.

Fig. 2. Outline of tweet preprocessing and classification.

Table 2. Overall percentage accuracy for each individual classifier (10-fold cross-validation).

Classifier | J48 | NB | SMO
Accuracy (%) | 83.6 | 87.0 | 86.4
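As referenced above, the following is a minimal sketch of the cleaning steps in Java. The paper does not publish its implementation, so the class name TweetCleaner and the particular regular expressions are illustrative assumptions; stop-word removal and stemming are omitted here, and the Boolean (0/1) term weighting is assumed to be produced afterwards when the cleaned tweets are loaded into WEKA.

import java.util.regex.Pattern;

public class TweetCleaner {

    // Arabic diacritics (tashkeel) occupy the Unicode range U+064B-U+0652.
    private static final Pattern DIACRITICS = Pattern.compile("[\\u064B-\\u0652]");

    public static String clean(String tweet) {
        String t = tweet;
        t = t.replaceAll("\\p{Punct}", " ");            // special characters and punctuation (#, %, &, @, ~, ...)
        t = t.replaceAll("[^\\p{InArabic}\\s]", " ");   // drop non-Arabic tokens (English words, digits)
        t = DIACRITICS.matcher(t).replaceAll("");       // remove diacritics
        t = t.replaceAll("[أإآ]", "ا");                  // unify the variant forms of alef
        t = t.replaceAll("ة", "ه");                      // ta marbuta -> ha
        t = t.replaceAll("ى", "ي");                      // alef maqsura -> ya
        t = t.replaceAll("(^|\\s)ال", "$1");             // crude removal of the definite article "al-"
        t = t.replaceAll("(^|\\s)و", "$1");              // crude removal of the conjunction "wa-" (and)
        t = t.replaceAll("(.)\\1{2,}", "$1");            // collapse repeated characters (e.g. repeated laughing)
        return t.replaceAll("\\s+", " ").trim();         // normalize whitespace
    }
}

Passing each raw tweet through clean(...) before building the Boolean term vectors corresponds to the portion of the pipeline in Fig. 2 that precedes tokenization and classification.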
4. Experimental results

The accuracies of J48, NB, and SMO were first obtained as individual classifiers using the cross-validation (10-fold) method, in which the training data is divided randomly into n blocks; each block is held out once and the classifier is trained on the remaining n−1 blocks. Three algorithms, bagging, boosting and stacking, were then used to improve these accuracies for Arabic tweet classification. Table 2 shows the overall percentage accuracy for the individual classifiers NB, J48, and SMO.

4.1. Evaluation of results

There are several measures that can be used to assess classification quality: accuracy, precision, recall, and F-measure. Accuracy is the proportion of samples that are correctly classified. Precision and recall are computed from the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) (Sawaf et al., 2001); Fig. 3 shows the different outcomes of a two-class prediction and the rate of correctly predicted classes.

Precision is computed as:

Precision (P) = TP / (TP + FP)

and recall as:

Recall (R) = TP / (TP + FN)

where
TP is the number of tweets which are correctly assigned to the given category;
TN is the number of tweets which are correctly identified as not belonging to the category;
FP is the number of tweets which are incorrectly assigned to the category;
FN is the number of tweets which are incorrectly not assigned to the category.

Fig. 3. The rate of correctly predicted classes.

Table 3 and Fig. 4 show the Precision (P), Recall (R) and F1-measure (F) for each individual category (Technology, Culture, Sport, Politics and General) using the cross-validation method.

The F-measure combines precision and recall and is used to calculate the performance of text classifiers with the following equation:

F1-measure (F) = 2 × (Precision × Recall) / (Precision + Recall)

Finally, the accuracy (overall success rate) is the number of correct classifications divided by the total number of classifications:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Table 3. Overall accuracies and P, R, F for the individual categories (cross-validation).

Category \ Classifier | NB (P / R / F) | J48 (P / R / F) | SMO (P / R / F)
Technology | 0.895 / 0.77 / 0.828 | 0.952 / 0.59 / 0.728 | 0.645 / 1.00 / 0.784
Sport | 0.971 / 1.00 / 0.985 | 1.00 / 0.96 / 0.98 | 1.00 / 0.93 / 0.964
General | 0.91 / 0.71 / 0.798 | 0.959 / 0.70 / 0.809 | 0.973 / 0.72 / 0.828
Culture | 0.716 / 0.96 / 0.821 | 0.596 / 0.99 / 0.744 | 0.948 / 0.73 / 0.825
Politics | 0.919 / 0.91 / 0.915 | 0.913 / 0.94 / 0.926 | 0.931 / 0.94 / 0.935
Accuracy (%) | 87.0 | 83.6 | 86.4

Fig. 4. P, R, and F for the individual categories.
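The ensemble experiments described above can be reproduced programmatically with WEKA's Java API. The sketch below assumes the Boolean term vectors have been exported to an ARFF file named tweets.arff and that Naïve Bayes serves as the stacking meta-learner; the file name, the random seed and the meta-learner choice are assumptions, since the paper does not specify them.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EnsembleExperiment {
    public static void main(String[] args) throws Exception {
        // Boolean-weighted tweet vectors from the preprocessing stage ("tweets.arff" is an assumed name).
        Instances data = DataSource.read("tweets.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute holds the category label

        // Bagging around J48, one of the three ensemble settings used in the paper.
        Bagging bagging = new Bagging();
        bagging.setClassifier(new J48());

        // Boosting (AdaBoostM1) around Naive Bayes.
        AdaBoostM1 boosting = new AdaBoostM1();
        boosting.setClassifier(new NaiveBayes());

        // Stacking of the three base classifiers with Naive Bayes as the meta-learner (an assumption).
        Stacking stacking = new Stacking();
        stacking.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new SMO() });
        stacking.setMetaClassifier(new NaiveBayes());

        // 10-fold cross-validation, matching the protocol described above.
        for (Classifier c : new Classifier[] { bagging, boosting, stacking }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s accuracy: %.2f%%%n", c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}

Per-category precision, recall and F-measure of the kind reported in Table 3 can be read from the same Evaluation object via eval.precision(i), eval.recall(i) and eval.fMeasure(i) for each class index i.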