PREDICTING THE H-INDEX INCREASE FOR RESEARCH JOURNALS USING COST-SENSITIVE SELECTIVE NAIVE BAYES CLASSIFIERS

Reycardo Henglie, Yunianto Purnomo, Jusia Amanda Ginting

Abstract


The machine learning community is interested not only in maximizing classification accuracy, but also in minimizing the distance between the actual and the predicted class. Several ideas, such as the cost-sensitive learning approach, have been proposed to address this problem. In this paper, we propose two greedy wrapper forward cost-sensitive selective naive Bayes approaches. Both approaches readjust the probability thresholds of each class to select the class with the minimum expected cost. The first algorithm (CS-SNB-Accuracy) considers adding each variable to the model and measures the performance of the resulting model on the training data. The variable that most improves the accuracy, that is, the percentage of well-classified instances between the readjusted class and the actual class, is permanently added to the model. In contrast, the second algorithm (CS-SNB-Cost) considers adding variables that reduce the misclassification cost, that is, the distance between the readjusted class and the actual class. We have tested our algorithms in the area of bibliometric index prediction. Given the popularity of the well-known h-index, we have built several prediction models to forecast the annual increase of the h-index for Neurosciences journals over a four-year time horizon. Results show that our approaches, particularly CS-SNB-Accuracy, achieved higher accuracy values than the analyzed cost-sensitive classifiers and Bayesian classifiers. Furthermore, CS-SNB-Cost always achieved a lower average cost than all analyzed cost-sensitive and cost-insensitive classifiers. These cost-sensitive selective naive Bayes approaches outperform the selective naive Bayes in terms of both accuracy and average cost, so the cost-sensitive learning approach could also be applied in other probabilistic classification approaches.
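The minimum-expected-cost decision rule that both CS-SNB variants rely on can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name, the NumPy representation, and the example ordinal cost matrix (cost equal to the distance between class indices) are our own assumptions.

```python
import numpy as np

def min_expected_cost_class(posteriors, cost_matrix):
    """Return the class index with minimum expected misclassification cost.

    posteriors  : shape (n_classes,), class posteriors P(c | x),
                  e.g. from a (selective) naive Bayes model.
    cost_matrix : shape (n_classes, n_classes); cost_matrix[i, j] is the
                  cost of predicting class j when the true class is i.
    """
    # Expected cost of predicting j: sum_i P(i | x) * cost_matrix[i, j]
    expected_costs = posteriors @ cost_matrix
    return int(np.argmin(expected_costs))

# Ordinal cost matrix for 3 classes: the cost grows with the distance
# between the predicted and the actual class.
classes = np.arange(3)
cost = np.abs(classes[:, None] - classes[None, :]).astype(float)

posteriors = np.array([0.40, 0.15, 0.45])
# The plain argmax prediction is class 2, but the middle class has the
# lowest expected cost, so the readjusted prediction is class 1.
print(min_expected_cost_class(posteriors, cost))  # prints 1
```

In the greedy wrapper described above, a rule like this would replace the plain most-probable-class prediction when each candidate feature subset is scored, either by accuracy (CS-SNB-Accuracy) or by average cost (CS-SNB-Cost).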

Keywords


CS-SNB-Accuracy, CS-SNB-Cost, bibliometrics, classification, predicted distances





DOI: http://dx.doi.org/10.30813/j-alu.v7i1.6028



p-ISSN 2620-620X
e-ISSN 2621-9840

 
