DOI: 10.34740/KAGGLE/DSV/2309388
Audience segmentation is vital to a successful advertising campaign, and knowing the user’s gender makes selecting segments many times easier.
In this article I show how collecting statistics on installed applications allows ML to predict a user’s gender.
users.csv: a list of users with their most likely gender and a list of several installed applications.
bundles_gender.csv: the gender distribution of users for each application.
Pay attention to the cnt field: it shows the number of users with a known gender who have this app installed, that is, the sample size behind the app’s statistics. This field can therefore also be used as a measure of confidence in the information about the application.
genders_df[
    (genders_df['F'] >= 0.3325) &
    (genders_df['F'] <= 0.3375)
].describe()
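One way to act on cnt as a confidence measure, sketched here on toy data, is to keep only the apps whose statistics rest on enough users. The bundle names, the numbers and the MIN_CNT cutoff are assumptions made up for illustration; the article does not say whether the original analysis filtered apps this way.

```python
import pandas as pd

# Toy stand-in for bundles_gender.csv; bundle names and values are invented.
genders_df = pd.DataFrame(
    {"F": [0.90, 0.35, 1.00], "cnt": [12000, 4300, 2]},
    index=["com.example.makeup", "com.example.news", "com.example.rare"],
)

MIN_CNT = 50  # hypothetical threshold; would need tuning on the real data

# Keep only apps whose gender share is backed by at least MIN_CNT users.
reliable_df = genders_df[genders_df["cnt"] >= MIN_CNT]
print(reliable_df)
```

The third app shows a 100% female share, but with only two known users behind it the figure carries almost no information, so it is dropped.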
users_df['apps_count'] = users_df['ids'].apply(len)
users_df.groupby('gend')['apps_count'].describe()
g_dict = genders_df['F'].to_dict()
# F_prob: mean female share over the user's apps that have statistics
users_df['F_prob'] = users_df['ids'].apply(
    lambda x: np.mean([g_dict[a] for a in x if a in g_dict])
)
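To make the F_prob feature concrete, here is a minimal sketch of the same averaging on a hand-made dictionary. The app ids, the shares and the helper name f_prob are hypothetical, and returning NaN for a user with no known apps is an assumption about how such a case might be handled.

```python
# Hypothetical per-app female shares keyed by app id (stand-in for g_dict).
g_dict = {101: 0.9, 102: 0.6, 103: 0.3}


def f_prob(app_ids, shares):
    """Mean female share over the apps that have statistics; NaN if none do."""
    known = [shares[a] for a in app_ids if a in shares]
    return sum(known) / len(known) if known else float("nan")


# App 999 has no statistics, so only 101 and 103 contribute to the mean.
user_score = f_prob([101, 103, 999], g_dict)
print(user_score)
```

A user whose apps skew female gets a score close to 1, which is why this single number already works as a baseline predictor.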
np.corrcoef(
    users_df['F_prob'],
    users_df['gend'].astype('category').cat.codes
)[0, 1]
-0.46602945129982887
y_true = users_df['gend'].astype('category').cat.codes
print(f"Accuracy: {accuracy_score(y_true, users_df['F_prob'] < 0.5)}")
# F_prob is negatively correlated with the class codes,
# so 1 - AUC gives the AUC of the reversed score
print(f"AUC: {1 - roc_auc_score(y_true, users_df['F_prob'])}")
Accuracy: 0.740925288445762
AUC : 0.7793767183917958
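The 1 - roc_auc_score(...) in the AUC line works because reversing a score mirrors its AUC around 0.5. The sketch below checks this on toy data with a hand-rolled pairwise AUC; it is an illustration of the identity, not scikit-learn's implementation, and the labels and scores are invented.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half). Quadratic time, for illustration only."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: class 1 tends to get LOW scores, like F_prob versus the class codes.
y = [0, 0, 1, 1, 0, 1]
f_prob = [0.9, 0.7, 0.2, 0.4, 0.5, 0.6]

raw = auc(y, f_prob)                       # below 0.5: the score points the wrong way
flipped = auc(y, [1 - s for s in f_prob])  # AUC of the reversed score
print(raw, flipped, raw + flipped)
```

The two values sum to one, so subtracting from 1 is equivalent to scoring with the reversed feature.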
train, test = train_test_split(
    users_df, train_size=0.7,
    random_state=0, stratify=users_df['gend'])
109186
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(users_df['ids'])
train_mlb = mlb.transform(train['ids'])
test_mlb = mlb.transform(test['ids'])
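For readers unfamiliar with MultiLabelBinarizer, this small sketch shows what it does to lists of app ids: each distinct id becomes a column, and each user becomes a sparse row of zeros and ones. The three toy users are made up.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy users, each with a list of installed app ids (invented values).
ids = [[3, 7], [7], [1, 3, 9]]

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(ids)   # sparse matrix, one row per user

print(mlb.classes_)          # column order: the sorted distinct ids
print(X.toarray())           # dense view, only sensible for tiny examples
```

With many distinct apps the matrix is extremely wide and almost entirely zeros, which is why the sparse output matters.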
I use the out-of-fold (OOF) approach to obtain reliable results and to reduce the influence of randomness in how the data is divided into training and validation subsamples. Rather than rely on a dedicated third-party library, I wrote a simple function. Please note that the split into folds must be stratified.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def get_oof_lr(n_folds, x_train, y, x_test, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]
    # Class probabilities: per seed for train rows, per seed and fold for test rows
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train[tr_i, :]
            y_tr = y[tr_i]
            x_te = x_train[t_i, :]
            y_te = y[t_i]
            model = LogisticRegression(
                random_state=seed,
                max_iter=10000,
                verbose=1,
                n_jobs=20
            )
            model.fit(x_tr, y_tr)
            # Score only the held-out fold, so every train row is predicted
            # by a model that never saw it
            oof_train[iseed, t_i, :] = \
                model.predict_proba(x_te)
            print(f"AUC: {roc_auc_score(y_te, oof_train[iseed, t_i, :][:,1])}")
            oof_test_skf[iseed, i, :, :] = \
                model.predict_proba(x_test)
            models[(seed, i)] = model
    # Average test predictions over folds, then over seeds
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    # Average train OOF predictions over seeds
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models
Seed 0, Fold 0: 0.8752592302937795
Seed 0, Fold 1: 0.87427
Seed 0, Fold 2: 0.8754404425783484
Seed 0, Fold 3: 0.8750862228494931
Seed 0, Fold 4: 0.8767777821454008
Seed 42, Fold 0: 0.876839970445301
Seed 42, Fold 1: 0.87774
Seed 42, Fold 2: 0.8762049208242458
Seed 42, Fold 3: 0.8725705419477277
Seed 42, Fold 4: 0.87309
Seed 888, Fold 0: 0.8752996641300741
Seed 888, Fold 1: 0.8749304780764804
Seed 888, Fold 2: 0.87626
Seed 888, Fold 3: 0.8765240184267109
Seed 888, Fold 4: 0.87256
Accuracy: 0.8208932240918818
AUC : 0.8798990678456793
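Stripped of the seed loop and the test-set averaging, the core of the out-of-fold idea can be sketched on synthetic data as follows. The data, the fold count and the model settings here are assumptions chosen to keep the example small; they are not the configuration behind the numbers above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

oof = np.full(len(y), np.nan)  # one slot per training row
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[tr_idx], y[tr_idx])
    # Each row is predicted by the one model that did not train on it.
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]

print(np.isnan(oof).sum())  # number of rows left without a prediction
```

Because every row ends up with exactly one prediction from a model that never saw it, the resulting vector can be scored honestly or reused as a feature for a second model. Note that get_oof_lr indexes y positionally (y[tr_i]), so the labels should be passed as a NumPy array rather than a pandas Series.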
When I look at the ids feature, I see a list of tokens. Why not try working with this data like plain text?
I chose CatBoost, a free library, for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Since release 0.19.1, it has supported text features for classification on GPU out of the box. Its main advantage is that it can take categorical and text features in your data without additional preprocessing. You can find more detail about text features in the article Unconventional Sentiment Analysis: BERT vs. Catboost.
!pip install catboost
from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',  # overfitting detector: stop when the metric stalls
        od_wait=1000,    # ...for this many iterations
        learning_rate=0.1,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True
    )
users_df['ids_txt'] = \
    users_df['ids'].apply(
        lambda x: " ".join([str(i) for i in x]))
columns = ['ids_txt', 'apps_count']
oof_train_cb, oof_test_cb, models_cb = get_oof_cb(
    n_folds=5,
    x_train=train[columns],
    y=train['gend'].values,
    x_test=test[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
Accuracy: 0.82011
AUC : 0.88566
As a new feature, I’ve added the OOF predictions from the logistic regression model (the lr column). I also keep the F_prob feature, which worked well for the base model.
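The construction of the extended frames is not shown, so the following is only a sketch of how the logistic regression predictions might be attached as a column. The toy frames and arrays are invented stand-ins for the real train/test data and for the arrays returned by get_oof_lr, and taking the second probability column as the feature is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames and for the (n_rows, 2) probability
# arrays returned by the logistic regression OOF function.
train = pd.DataFrame({"ids_txt": ["1 5", "2", "5 9"], "apps_count": [2, 1, 2]})
test = pd.DataFrame({"ids_txt": ["1 9"], "apps_count": [2]})
oof_train = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
oof_test = np.array([[0.55, 0.45]])

# Hypothetical construction: train rows get their out-of-fold prediction,
# test rows get the prediction averaged over folds and seeds.
train_2 = train.assign(lr=oof_train[:, 1])
test_2 = test.assign(lr=oof_test[:, 1])
print(train_2)
```

Using out-of-fold values for the training rows matters here: a prediction from a model that had already seen the row's label would leak the target into the second-stage model.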
columns = ['ids_txt', 'F_prob', 'lr', 'apps_count']
oof_train_cb_2, oof_test_cb_2, models_cb_2 = get_oof(
    n_folds=5,
    x_train=train_2[columns],
    y=train_2['gend'].values,
    x_test=test_2[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
Accuracy: 0.836950230713273
AUC : 0.90467