June 22, 2022

Tweet topics and sentiments relating to distance learning among Italian Twitter users

Data

Twitter was chosen as the data source. It is one of the world’s leading social media platforms, with 199 million active users as of April 2021 (ref. 4), and it is also a common source of text for sentiment analyses23,24,25.

To collect tweets related to distance education, we used TrackMyHashtag (https://www.trackmyhashtag.com/), a tool that tracks hashtags in real time. Unlike the Twitter API, which does not provide tweets older than three weeks, TrackMyHashtag also provides historical data and can filter results by language and geolocation.

For our study, we chose the Italian words for “distance learning” as the search term, selected March 3, 2020 to November 23, 2021 as the period of interest, and kept only tweets written in Italian. A total of 25,100 tweets were collected.

Data preprocessing

To clean the data and prepare it for sentiment analysis, we applied the following preprocessing steps using NLP techniques implemented with Python:

  1. removed mentions, URLs and hashtags,
  2. replaced HTML character entities with their Unicode equivalents (for example, replacing ‘&amp;’ with ‘&’),
  3. removed HTML tags,
  4. removed unnecessary line breaks,
  5. removed special characters and punctuation,
  6. removed words that are numbers,
  7. translated the text of the Italian tweets into English using the ‘googletrans’ tool.
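The cleaning steps above can be sketched with Python's standard library. This is a minimal illustration rather than the code used in the study: the function name `clean_tweet`, the regular expressions and the order of operations are assumptions, and the translation step is left out because it depends on an external service.

```python
import html
import re


def clean_tweet(text: str) -> str:
    """Apply cleaning steps 1-6 to one raw tweet (hypothetical helper)."""
    text = html.unescape(text)                          # step 2: '&amp;' -> '&'
    text = re.sub(r"<[^>]+>", " ", text)                # step 3: HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # step 1: URLs
    text = re.sub(r"[@#]\w+", " ", text)                # step 1: mentions, hashtags
    text = re.sub(r"[^\w\s]|_", " ", text)              # step 5: punctuation, symbols
    text = re.sub(r"\b\d+\b", " ", text)                # step 6: number-only words
    return re.sub(r"\s+", " ", text).strip()            # step 4: line breaks, spaces


raw = "@Miur La #DAD &amp; la scuola:\n<b>3</b> lezioni oggi! https://t.co/abc"
cleaned = clean_tweet(raw)
```

Entities are decoded before tags are stripped so that escaped markup is removed as well. Tokens that mix letters and digits are kept, because only words made up entirely of digits match the step 6 pattern.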

For the second part of the analysis, the topic model requires a higher-quality dataset. Duplicate tweets were removed so that only unique tweets were kept. Beyond general data cleaning, tokenization and lemmatization can help the model perform better, because different inflected forms of the same word would otherwise be counted as separate terms and blur the patterns the model has to find. We therefore used NLTK’s WordNet library26 for lemmatization. We did not use stemming algorithms, which aggressively reduce words to a common base even when those words have different meanings. Finally, we lowercased all text so that every word appeared in a consistent form, and we trimmed the vocabulary by removing stop words and uninformative terms such as “like”, “of” and “would”.
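A minimal sketch of this second preparation stage follows. The function name, the whitespace tokenizer, the short stop-word set and the choice to deduplicate on the lowercased text are all illustrative assumptions. The lemmatizer is passed in as a callable so the sketch runs without NLTK.

```python
from typing import Callable, Iterable, List, Set

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS: Set[str] = {"like", "of", "would", "the", "a", "is", "to"}


def prepare_for_topic_model(
    tweets: Iterable[str],
    lemmatize: Callable[[str], str] = lambda word: word,  # identity by default
    stop_words: Set[str] = STOP_WORDS,
) -> List[List[str]]:
    """Deduplicate, lowercase, tokenize, lemmatize and drop stop words."""
    seen = set()
    docs = []
    for tweet in tweets:
        key = tweet.lower().strip()
        if key in seen:            # keep only the first copy of each tweet
            continue
        seen.add(key)
        tokens = [lemmatize(tok) for tok in key.split()]
        docs.append([tok for tok in tokens if tok not in stop_words])
    return docs


docs = prepare_for_topic_model(
    ["Schools would like lessons", "schools would like lessons", "The teacher is online"]
)
```

With NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument.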

Sentiment and emotion analysis

Among the main algorithms used for text mining, and for sentiment analysis specifically, we applied the Valence Aware Dictionary and sEntiment Reasoner (VADER) proposed by Hutto et al.27 to determine the polarity and intensity of tweets. VADER is a lexicon- and rule-based sentiment analysis tool built with a wisdom-of-the-crowd approach. Because of the considerable human effort behind its lexicon, it can analyze social media text quickly and with an accuracy close to that of human raters. We used VADER to obtain a sentiment score for the pre-processed text of each tweet. Following the classification method recommended by its authors, we then mapped each sentiment score to one of three categories: positive, negative or neutral (Fig. 1, step 1).

Figure 1

Steps of sentiment and emotion analysis.
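The mapping from score to category in step 1 can be sketched as below. The text does not state the cut-off values. The sketch assumes the thresholds of ±0.05 on the compound score that the VADER documentation suggests, and the function name is hypothetical.

```python
# Assumed cut-off on VADER's compound score (range -1 to 1), as suggested
# in the VADER documentation; the study's exact values are not given.
THRESHOLD = 0.05


def classify_compound(compound: float) -> str:
    """Map a compound sentiment score to positive, negative or neutral."""
    if compound >= THRESHOLD:
        return "positive"
    if compound <= -THRESHOLD:
        return "negative"
    return "neutral"


labels = [classify_compound(score) for score in (0.62, -0.40, 0.01)]
```

With the `vaderSentiment` package installed, the compound score of a tweet could be obtained as `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` and passed to this function.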

Then, to uncover the emotions underlying these categories, we applied the NRC28 algorithm, one of the methods included in the R package syuzhet29 for emotion analysis. The NRC algorithm applies an emotion dictionary to score each tweet on two sentiments (positive and negative) and eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust). Emotion recognition aims to identify the emotions that a tweet conveys. If a tweet is associated with a particular emotion or sentiment, it receives points that reflect the degree of valence for that category; otherwise it receives no score for that category. For example, if a tweet contains two words that appear in the word list for the emotion “joy”, its score in the joy category is 2.
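The counting rule in the last sentence can be illustrated with a toy dictionary. The three entries below are invented for the example and are not taken from the NRC lexicon. The study used the R package syuzhet, so this Python sketch only mirrors the scoring idea.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> categories it is listed under.
TOY_LEXICON = {
    "happy": {"joy", "positive"},
    "celebrate": {"joy", "positive", "anticipation"},
    "afraid": {"fear", "negative"},
}


def emotion_scores(text: str, lexicon=TOY_LEXICON) -> Counter:
    """Add one point per category for every word found in the lexicon."""
    scores = Counter()
    for word in text.lower().split():
        scores.update(lexicon.get(word, ()))
    return scores


scores = emotion_scores("happy students celebrate")
```

Here two words are listed under joy, so `scores["joy"]` is 2, while categories with no matching word, such as fear, stay at 0.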

With the NRC lexicon, each tweet receives a score for every emotion category rather than a single algebraic score that nets positive against negative words. However, this algorithm does not handle negators correctly. It also takes a bag-of-words approach, in which sentiment is based on the individual words in the text and the role of syntax and grammar is neglected. As a result, the VADER and NRC methods are not comparable in terms of the number of tweets assigned to each polarity category. We therefore used VADER for sentiment analysis and applied NRC only to uncover the emotions behind positive and negative tweets. The flowchart in Figure 1 depicts this two-step analysis. Tweets that VADER labels neutral are useful for classification but not informative for emotion analysis, so we focused on tweets with positive or negative sentiment. VADER performs very well on social media text: its rules account for lexical features such as punctuation, capitalization, degree modifiers, the contrastive conjunction “but” and trigrams in which a negation reverses polarity.

The topic model

Topic modeling is an unsupervised machine learning method, that is, a text mining procedure that identifies the topics or themes of documents in a large corpus30. Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods; it is a generative probabilistic model of a corpus based on a three-level hierarchical Bayesian model. The basic idea of LDA is that every document is a mixture of topics, and a topic can be defined as a distribution over words31. In LDA, each document in a corpus is generated by the following process:

  1. A mixture of \(k\) topics, \(\theta\), is sampled from a Dirichlet prior parameterized by \(\alpha\);
  2. A topic \(z_n\) is sampled from the multinomial distribution \(p(z_n = i \mid \theta)\), where \(\theta\), drawn from \(p(\theta \mid \alpha)\), is the document's topic distribution;
  3. With the number of topics fixed and indexed by \(k = 1, \ldots, K\), the distribution of words for each topic is denoted by \(\phi\), which is also a multinomial distribution whose hyperparameter \(\beta\) follows the Dirichlet distribution;
  4. Given the topic \(z_n\), a word \(w_n\) is then sampled from the multinomial distribution \(p(w \mid z_n; \beta)\).
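The four steps above can be sketched as a small sampler that uses only the standard library. It is an illustration of the generative story, not an inference procedure. The vocabulary, the hyperparameter values and the function names are assumptions made for the example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document following steps 1, 2 and 4."""
    theta = sample_dirichlet(alpha, rng)                      # step 1: topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # step 2: topic z_n
        w = rng.choices(vocab, weights=topic_word[z])[0]      # step 4: word w_n
        words.append(w)
    return theta, words


rng = random.Random(0)                     # fixed seed for reproducibility
vocab = ["school", "online", "teacher", "exam"]
K = 2                                      # assumed number of topics
beta = [0.5] * len(vocab)                  # assumed symmetric hyperparameter
# Step 3: one word distribution per topic, drawn from a Dirichlet prior.
topic_word = [sample_dirichlet(beta, rng) for _ in range(K)]
theta, doc = generate_document(8, [0.5] * K, topic_word, vocab, rng)
```

Fitting LDA reverses this process: given only the observed words, it estimates the topic mixtures and word distributions that most plausibly generated them.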

Overall, the probability of a document (or tweet, in our case) \(\mathbf{w}\) containing \(N\) words can be described as:

$$\begin{aligned} p(\mathbf{w}) = \int_\theta p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n=1}^{k} p(w_n \mid z_n; \beta)\, p(z_n \mid \theta) \right) \mathrm{d}\theta \end{aligned}$$

(1)

Finally, the probability of the corpus of \(M\) documents \(D = \{\mathbf{w}_1, \ldots, \mathbf{w}_M\}\) can be expressed as the product of the marginal probabilities of each document \(D_m\), as indicated in (2).

$$\begin{aligned} p(D) = \prod_{m=1}^{M} \int_{\theta_m} p(\theta_m \mid \alpha) \left( \prod_{n=1}^{N_m} \sum_{z_{m,n}=1}^{k} p(w_{m,n} \mid z_{m,n}; \beta)\, p(z_{m,n} \mid \theta_m) \right) \mathrm{d}\theta_m \end{aligned}$$

(2)

Because our analysis covers tweets over a 2-year period, tweet content varies over time, so the corpus cannot be treated as static. We therefore adopted the Dynamic LDA (DLDA) model, which is applied to topics aggregated into time epochs, with a state-space model handling the transition of topics from one epoch to the next. A Gaussian probabilistic model adds time as a further dimension, yielding posterior probabilities for the topics as they evolve along the timeline.

Figure 2

Dynamic topic model (for three time slices). The set of topics in each slice evolves from the set in the previous slice. The model for each time slice corresponds to the original LDA process, and the parameters of each topic evolve over time.

Figure 2 shows a graphical representation of the Dynamic Topic Model (DTM)32. As a member of the class of probabilistic topic models, the dynamic model can explain how the various tweet topics evolve. The tweet dataset used here (March 3, 2020 to November 23, 2021) covers 630 days, or exactly seven quarters of 90 days each. The dynamic topic model is therefore applied to seven time steps, one per quarter of the dataset. These time slices are passed to the model provided by gensim33.
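The slicing arithmetic can be checked with a short sketch. It assumes that a quarter means a 90-day window counted from the first day of the period and that the final day is folded into the last slice. The function names are hypothetical.

```python
from datetime import date

START = date(2020, 3, 3)
END = date(2021, 11, 23)
SLICE_DAYS = 90                                # assumed length of one quarter
N_SLICES = (END - START).days // SLICE_DAYS    # 630 // 90 = 7


def slice_index(day: date) -> int:
    """Return the zero-based time slice that a tweet date falls into."""
    if not START <= day <= END:
        raise ValueError(f"{day} is outside the study period")
    # The last day sits exactly on the boundary, so clamp it into slice 6.
    return min((day - START).days // SLICE_DAYS, N_SLICES - 1)


def time_slice_counts(days) -> list:
    """Count documents per slice, in chronological slice order."""
    counts = [0] * N_SLICES
    for day in days:
        counts[slice_index(day)] += 1
    return counts


counts = time_slice_counts([START, date(2020, 5, 31), date(2020, 6, 1), END])
```

A list of per-slice document counts like `counts` is the form that gensim's `LdaSeqModel` takes in its `time_slice` argument, with the corpus sorted chronologically. Whether the study built its slices exactly this way is not stated.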

A key challenge in DLDA (as in LDA) is determining an appropriate number of topics. Röder et al. proposed coherence scores to assess the quality of each topic model. In particular, topic coherence is the measure used to evaluate the coherence of the topics inferred by a model. As coherence measures, we used \(C_V\) and \(C_{UMass}\). The first is a sliding-window measure that uses normalized pointwise mutual information (NPMI) and cosine similarity. \(C_{UMass}\), instead, is based on document co-occurrence counts, a one-preceding segmentation and a logarithmic conditional probability as the confirmation measure. These values are intended to mimic the relative score that a human would likely assign to a topic and to indicate how well the topic’s words “make sense”; they capture the cohesion among the “top” words of a given topic. We also considered the distribution of topics under principal component analysis (PCA), which displays the topic models as a two-dimensional spatial distribution. A uniform distribution is preferred, since it indicates a high degree of independence between topics. A good model is thus one with higher coherence and an even distribution in the principal component view displayed by pyLDAvis34.
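To make the description of \(C_{UMass}\) concrete, the sketch below scores one topic from document co-occurrence counts. It is a simplified illustration, not gensim's implementation. The add-one smoothing, the averaging over word pairs and the function name are assumptions.

```python
import math


def umass_coherence(top_words, documents) -> float:
    """Average log conditional probability of each top word given every
    word that precedes it in the ranking (one-preceding segmentation)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words) -> int:
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            denom = doc_freq(w_j)
            if denom == 0:          # skip words absent from the corpus
                continue
            # Add-one smoothing keeps the logarithm finite (assumed choice).
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


documents = [["school", "online"], ["school", "online", "teacher"], ["teacher"]]
coherent = umass_coherence(["school", "online"], documents)
mixed = umass_coherence(["school", "teacher"], documents)
```

Words that tend to appear in the same documents give a higher score, so `coherent` exceeds `mixed` here. Comparing such scores across candidate numbers of topics is one way to choose among models.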