In this project, over a series of blog posts I'll be buidling a model document classification, also known as text classification and deploying the model as part of a web application to predict the topic of research papers from their abstract. In the first blog post I will be working with the Scikit-learn library and an imbalanced dataset (corpus) that I will create from summaries of papers published on arxiv. The topic of each paper is already labeled as the category therefore alleviating the need for me to label the dataset. The imbalance in the dataset will be caused by the imbalance in the number of samples in each of the categories we are trying to predict. Imbalanced data occurs quite frequently in classification problems and makes developing a good model more challenging. Often times it is too expensive or not possible to get more data on the classes that have to few samples. Developing strategies for dealing with imbalanced data is therefore paramount for creating a good classification model. I will cover some of the basics of dealing with imbalanced data using the Imbalance-Learn library as well as building a Naive Bayes classifier and Support Vector Machine using from Scikit-learn. I will also over the basics of term frequency-inverse document frequency and visualizing it using the Plotly library.