Gender Recognition on Dutch Tweets

The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions. On the female side, we see a representation of the world of the prototypical young female Twitter user.

They used lexical features, and present a very good breakdown of various word types.

The second classification system was Linguistic Profiling (LP; van Halteren), which was specifically designed for authorship recognition and profiling. Normalized 4-gram About K features. Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case.
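The normalization step described above can be illustrated with a minimal sketch. It assumes that "stripping diacritics" means removing Unicode combining marks after canonical decomposition; the function names are hypothetical and this is not the tokenizer actually used for the tweets.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Strip diacritics from a token and lowercase it.

    Assumption: diacritics are the combining marks left after NFD
    decomposition. Letters that do not decompose (e.g. 'ø') are kept as-is.
    """
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def normalize_tokens(tokens):
    """Apply normalize_token to every token in a tokenized tweet."""
    return [normalize_token(t) for t in tokens]
```

Under this sketch, variants such as "Één", "één" and "een" all map to the same normalized form, which is the effect the normalization is meant to have on haphazard spelling.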

Recognition accuracy as a function of the number of principal components provided to the systems, using token bigrams.

LP keeps its peak at 10, but now even lower than for the n-grams. Bigrams: two adjacent tokens.
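To make the feature definition concrete, here is a small sketch of token n-gram extraction, where a bigram is a pair of adjacent tokens. The function names and the use of raw counts as feature values are assumptions for illustration only.

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs (assumed feature value: raw count)."""
    return Counter(token_ngrams(tokens, n))
```

For example, `token_ngrams(["ik", "ga", "naar", "huis"], 2)` yields the three bigrams of that four-token sequence, and a sequence shorter than `n` yields no n-grams.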

As we approached the task from a machine learning viewpoint, we needed to select text features to be provided as input to the machine learning systems, as well as machine learning systems which are to use this input for classification.

The ones used more by women are plotted in green, those used more by men in red. From each user's tweets, we removed all retweets, as these did not contain original text by the author. For the unigrams, SVR reaches its peak [...]. As for systems, we will involve all five systems in the discussion.

The age is reconfirmed by the endearingly high presence of mama and papa. Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, where TiMBL chooses a wide range of numbers, and generally far lower than is optimal.

This apparently colours not only the discussion topics, which might be expected, but also the general language use. Instead, we will just look at the distribution of the various features over the female and male texts.

Juola and Koppel et al. However, all systems are in principle able to reach the same quality, i.e. [...]. Then we outline how we evaluated the various strategies (Section 3).

As the input features are numerical, we used IB1 with k equal to 5 so that we can derive a confidence value.
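One way a k-nearest-neighbour classifier with k = 5 can yield a confidence value is to report the share of the five neighbours that vote for the winning class. The sketch below assumes plain Euclidean distance and unweighted majority voting; TiMBL's IB1 may use a different metric or weighting, so this illustrates the idea rather than reproducing that system.

```python
import math
from collections import Counter


def knn_classify(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training items.

    train: list of (feature_vector, label) pairs with numeric vectors.
    Returns (label, confidence), where confidence is the fraction of the
    k neighbours that carry the winning label (an assumed definition).
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)
```

With k = 1 the vote share would always be 1.0, which is why a larger k is needed before the neighbours can express any degree of doubt.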

There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models.

If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays.

Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types.

However, we cannot conclude that what is wiped away by the normalization, use of diacritics, capitals and spacing, holds no information for the gender recognition.

We start with the accuracy of the various features and systems (Section 5). In this paper we restrict ourselves to gender recognition, and it is also this aspect we will discuss further in this section. For the measurements with PCA, the number of principal components provided to the classification system is learned from the development data.
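Learning the number of principal components from development data amounts to trying several candidate numbers and keeping the one that scores best on held-out authors. The sketch below assumes the vectors are already PCA-projected with components ordered by decreasing variance, and it uses a nearest-centroid classifier as a stand-in for the actual systems; all names, and the tie-break towards fewer components, are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dev_accuracy(train, dev, n_components):
    """Accuracy on dev when only the first n_components are used."""
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec[:n_components])
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    correct = 0
    for vec, label in dev:
        v = vec[:n_components]
        predicted = min(centroids, key=lambda lab: math.dist(centroids[lab], v))
        correct += predicted == label
    return correct / len(dev)


def select_n_components(train, dev, candidates):
    """Pick the candidate with the best dev accuracy; ties go to fewer components."""
    return max(candidates, key=lambda n: (dev_accuracy(train, dev, n), -n))
```

The selected number is then fixed before the test data is classified, so the test authors play no part in choosing it.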

Original 5-gram About K features.

Feature type: Unigram, Bigram, Trigram, Skipgram, Char 5-gram, Top Function.

[...] get the impression that Dutch is not his native language, which is supported by his name.

Figure 5 shows all token unigrams. However, we received confirmation that she writes almost all her tweets herself (Sargentini, personal communication). The dotted line represents exactly opposite scores for the two genders.


After this, we examine the classification of individual authors (Section 5). And LP just mirrors its behaviour with unigrams. The most extreme misclassification is reserved for a female, author [...]. Starting with the systems, we see that SVR using original vectors consistently outperforms the other two.