Description: This project classifies a name as a girl's name or a boy's name using a Naive Bayes classifier.
We made several assumptions about this problem, and some of them turned out to be effective. I would like to share some of our intuitions:
- Characters on the edge (starting and ending characters) are very helpful in classifying names. We define a variable FIX, the maximum number of edge characters to examine. Accuracy is highest when FIX = 3, and it reached 80% once this feature was implemented;
- Secondly, long names and short names should have different features, because the previous feature only looks at the beginning and ending characters and ignores length entirely. We therefore added features that take the length of the name string into account. For short names, we check the pattern of the name's composition: our pattern represents the runs of consecutive vowels and non-vowels in a name string. This feature is limited to short names; for longer names, some patterns would be too sparse to be effective. For long names, we extract some hedge segments from the training data. These complement the edge characters, which only cover the beginning and end of a name and ignore important segments in the middle. The hedge segments are selected with the following constraints: 1) a segment must appear very frequently in the data set; 2) it must appear more frequently in one class than in the other. After these two features were implemented, accuracy reached 84.6%.
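The features in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `edge_features`, `vowel_pattern`, and `select_segments`, the short-name cutoff `max_len`, the segment length `n`, and the thresholds `min_count` and `min_ratio` are all assumptions made for this example. Treating 'y' as a non-vowel and encoding the pattern by collapsing runs into single V/C symbols are also assumptions.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: 'y' counts as a non-vowel


def edge_features(name, fix=3):
    """Prefixes and suffixes of length 1..fix (fix plays the role of FIX)."""
    name = name.lower()
    feats = []
    for k in range(1, min(fix, len(name)) + 1):
        feats.append("prefix=" + name[:k])
        feats.append("suffix=" + name[-k:])
    return feats


def vowel_pattern(name, max_len=5):
    """Collapse runs of vowels / non-vowels into V and C (short names only)."""
    name = name.lower()
    if len(name) > max_len:  # hypothetical cutoff for a "short" name
        return None
    pattern = []
    for ch in name:
        symbol = "V" if ch in VOWELS else "C"
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
    return "pattern=" + "".join(pattern)


def select_segments(names, labels, n=3, min_count=2, min_ratio=2.0):
    """Pick interior n-grams that are frequent and skewed toward one class."""
    counts = {"girl": Counter(), "boy": Counter()}
    for name, label in zip(names, labels):
        name = name.lower()
        # Interior substrings only: skip the first and last character,
        # since those are already covered by the edge features.
        grams = {name[i:i + n] for i in range(1, len(name) - n)}
        counts[label].update(grams)
    selected = set()
    for gram in set(counts["girl"]) | set(counts["boy"]):
        g, b = counts["girl"][gram], counts["boy"][gram]
        frequent = g + b >= min_count
        skewed = max(g, b) >= min_ratio * max(min(g, b), 1)
        if frequent and skewed:
            selected.add(gram)
    return selected


train_names = ["christina", "kristina", "martina", "justin", "austin"]
train_labels = ["girl", "girl", "girl", "boy", "boy"]
segments = select_segments(train_names, train_labels)
print(edge_features("Emma"))
print(vowel_pattern("Anna"))
print(sorted(segments))
```

In this toy data, a segment that occurs equally often in both classes is rejected by the second constraint, while a segment seen only in one class is kept. The resulting feature strings could then be fed to any Naive Bayes implementation as categorical features.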
There is something else we learned from this project:
An effective feature should not be too sparse. For example, with edge characters, if FIX is too large, there will be many distinct features, and some of them may appear only a handful of times, which amounts to nothing more than 'feature memorization'. In the most extreme case, if FIX equals the length of the name, the whole name is used as a feature.
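The sparsity effect can be illustrated with a small sketch that counts how many distinct suffix features exist for a given FIX and how many of them occur only once. The helper `suffix_sparsity` and the toy name list are hypothetical and are not taken from the project's data.

```python
from collections import Counter


def suffix_sparsity(names, fix):
    """Return (distinct suffix features, how many of them occur only once)."""
    # Slicing with a fix longer than the name yields the whole name,
    # which is exactly the "memorization" extreme described above.
    counts = Counter(name.lower()[-fix:] for name in names)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons


names = ["anna", "hanna", "joanna", "emma", "john", "jonathan", "ethan"]
for fix in (1, 2, 3, 8):
    print(fix, suffix_sparsity(names, fix))
```

As FIX grows, the share of features seen only once rises, and once FIX covers every name in full, each feature is simply a memorized name.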