Training ML Models

Training Our Own Models at Talk Hiring

At Talk Hiring, we developed NLP-based text classification models to tackle two significant challenges:

  1. Analyzing mock job interview transcripts: We focused on detecting the PAR method (Problem-Action-Result) in behavioral interviews and identifying rambling (e.g., discussing multiple unrelated examples).
  2. Classifying job application emails: Using our Gmail API integration, we built a model to classify emails based on whether they pertained to a job application and, if so, determine the status of the application (e.g., Interview, Offer, Rejection, Assessment).

In this post, I’ll outline the process we followed to build one of these models—our custom job relevance email classification model—highlighting what worked and what didn’t.

Overall Strategy

Why Build a Custom Model?

Given the availability and accuracy of models from OpenAI and Anthropic, why did we choose to train a custom model? The answer lies in scalability and cost-efficiency.

Our email parsers needed to run inference on approximately 650,000 emails per day—a number that continues to grow. Processing thousands or tens of thousands of tokens per email, coupled with bursty email traffic, would have led to significant expenses and potential rate limiting from third-party providers like OpenAI or Anthropic.
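
As a rough illustration of the scale involved, here is a back-of-envelope calculation. The tokens-per-email figure and the per-token price are hypothetical placeholders for illustration, not actual numbers from any provider.

```python
# Back-of-envelope estimate of daily inference volume and hosted-API cost.
# TOKENS_PER_EMAIL and PRICE_PER_MILLION_TOKENS are illustrative assumptions.

EMAILS_PER_DAY = 650_000          # stated volume at the time
TOKENS_PER_EMAIL = 5_000          # assumption: mid-range of "thousands to tens of thousands"
PRICE_PER_MILLION_TOKENS = 1.00   # hypothetical $ per 1M input tokens

daily_tokens = EMAILS_PER_DAY * TOKENS_PER_EMAIL
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"{daily_tokens / 1e9:.2f}B tokens/day "
      f"-> ~${daily_cost:,.0f}/day, ~${daily_cost * 30:,.0f}/month")
```

Even under these conservative assumptions, the bill runs to tens of thousands of dollars per month, before accounting for rate limits during traffic bursts.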

The Power of Transfer Learning

Transfer learning became our preferred strategy. It let us fine-tune a pre-trained model, which has already learned from large, generalized datasets, to meet our specific needs with a relatively small labeled dataset.

For this, we leveraged a language model pre-trained on Wikitext-103, a widely used corpus for NLP tasks. Wikitext-103 contains roughly 100 million tokens drawn from high-quality Wikipedia articles, giving the pre-trained model a robust foundation in written English.

The Challenge of Email Classification

Training a model to determine whether an email is about a job application is more challenging than it might seem. Job seekers, particularly those who don’t unsubscribe from email lists, often receive a barrage of emails daily from both human and automated recruiters. These emails frequently encourage the recipient to apply for jobs (e.g., “You’re a great fit for XYZ job. Apply today!”). Moreover, distinguishing between emails related to job applications and those related to other kinds of applications (e.g., loans, housing, school, or training programs) can be tricky.

Building the Training Dataset

Our dataset comprised about 750,000 emails between job seekers and employers, accumulated over time. For this specific model, we needed about ten thousand labeled emails. However, the dataset was highly unbalanced, as most emails people receive aren’t about job applications they’ve submitted. Ideally, we would label more emails, and we can do so in the future to nudge accuracy up further.

To avoid biasing the model towards predicting that all emails are irrelevant to job applications, we constructed training/validation/test sets with a 50/50 split between emails about job applications and those that weren’t.
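
A minimal sketch of how such a balanced split can be constructed with pandas. The file name and the `text` and `is_job_related` columns are placeholders for illustration, not our actual schema.

```python
import pandas as pd

# Placeholder input: one row per labeled email, with a `text` column
# and a binary `is_job_related` label.
df = pd.read_csv("labeled.csv")

positives = df[df["is_job_related"] == 1]
negatives = df[df["is_job_related"] == 0]

# Downsample the (much larger) negative class to match the positive class,
# giving the 50/50 balance described above.
n = min(len(positives), len(negatives))
balanced = pd.concat([
    positives.sample(n, random_state=42),
    negatives.sample(n, random_state=42),
]).sample(frac=1, random_state=42)  # shuffle the combined set

# Hold out 10% for validation and 10% for testing.
train = balanced.iloc[: int(0.8 * len(balanced))]
valid = balanced.iloc[int(0.8 * len(balanced)) : int(0.9 * len(balanced))]
test  = balanced.iloc[int(0.9 * len(balanced)) :]
```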

But achieving this split wasn’t enough. We also manually reviewed the dataset to ensure it wasn’t over-indexed on certain types of emails. For example, emails generated by Indeed for job submissions are automated and often look very similar, varying only by employer or job title. We didn’t want the model to excel at predicting machine-generated emails without also being adept at handling human-written, job-related emails. To address this, I personally reviewed the training data to ensure that no more than a handful of examples of any one kind of email made it into our ~10,000-email set.
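
To make that manual review tractable, near-identical templated emails can be flagged automatically first. The sketch below uses a simple token-overlap (Jaccard) heuristic; the similarity threshold and the per-template cap are illustrative assumptions, not the exact rules we applied.

```python
def token_set(text: str) -> set:
    """Lowercased word set: a crude fingerprint of an email body."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def cap_near_duplicates(texts, threshold=0.8, max_per_group=5):
    """Keep at most `max_per_group` emails from each cluster of near-identical templates."""
    groups = []   # list of (fingerprint, count) for each template seen so far
    kept = []
    for text in texts:
        fp = token_set(text)
        for i, (other_fp, count) in enumerate(groups):
            if jaccard(fp, other_fp) >= threshold:
                if count < max_per_group:
                    groups[i] = (other_fp, count + 1)
                    kept.append(text)
                break
        else:
            groups.append((fp, 1))
            kept.append(text)
    return kept
```

This is quadratic in the number of candidates, which is fine at the ~10,000-email scale we were curating, though a larger corpus would call for hashing-based deduplication instead.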

Labeling the Data

We explored various options for labeling the data.

Model Training

I’m a big fan of FastAI and its native support for transfer learning. Additionally, Google Colab—especially with a GPU or TPU runtime—is excellent for saving work and running heavy ML workloads in the cloud. Here’s the process we followed:

  1. Fine-Tuning a Language Model: We started from a language model pre-trained on Wikitext-103 and trained it to predict the next word given the prior tokens. This language-model training is unsupervised, allowing us to train on many more email records than we had labels for. Once trained, we exported and saved the model weights for use in the next steps (a minimal sketch follows this list).

  2. Training the Email Classifier: This step involves both art and science. We loaded the pre-trained language model and used a Long Short-Term Memory (LSTM) architecture, iterating through rounds of training. Attention-based architectures are more popular and accurate today, but this work was done a few years back, so LSTM was the natural choice. We started with a higher learning rate and gradually reduced it with each training iteration. The key was to plot learning rate against loss and choose a learning rate where the slope was most negative, indicating the steepest reduction in loss. With each iteration, the learning rate was decreased to improve accuracy incrementally without overshooting. We ended up with a model that was more than 98% accurate at predicting whether an email was job-relevant (see the second sketch after this list).

  3. Model Deployment: After training, we exported a .pkl file containing the model weights and stored it in a versioned AWS S3 bucket, so we could always revert to an earlier model version if needed. We then built a small Flask server running on Paperspace, equipped with load-balanced GPUs. The server pulls the appropriate .pkl file into its Docker container at build time and runs inferences quickly and accurately (see the third sketch after this list).
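
For step 1, here is a minimal fastai-style sketch of fine-tuning the Wikitext-103-pretrained AWD-LSTM language model on unlabeled email text. The file name, column name, and hyperparameters are illustrative, not our exact setup.

```python
import pandas as pd
from fastai.text.all import *

# Placeholder input: a frame with a `text` column of raw email bodies (no labels needed).
df = pd.read_csv("emails.csv")

# Language-model DataLoaders: the target is simply the next token.
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)

# AWD_LSTM ships pretrained on Wikitext-103; fine-tune it on the email domain.
learn_lm = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
)
learn_lm.fit_one_cycle(1, 1e-2)   # train the newly initialized head first
learn_lm.unfreeze()
learn_lm.fit_one_cycle(3, 1e-3)   # then fine-tune the whole network

# Save the encoder, the part the classifier will reuse.
learn_lm.save_encoder("email_lm_encoder")
```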
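
For step 2, a sketch of the classifier stage: the learning-rate finder plots learning rate against loss, and training proceeds with gradual unfreezing and progressively smaller learning rates, in the spirit of ULMFiT. It reuses the `balanced` frame and `dls_lm` from the earlier sketches; hyperparameters are illustrative.

```python
from fastai.text.all import *

# Classification DataLoaders built from the balanced labeled set, reusing the LM vocab.
dls_clf = TextDataLoaders.from_df(
    balanced, text_col="text", label_col="is_job_related",
    valid_pct=0.1, text_vocab=dls_lm.vocab,
)

learn = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.load_encoder("email_lm_encoder")   # start from the fine-tuned language model

# Plot learning rate vs. loss and pick a rate where the curve falls most steeply.
learn.lr_find()

# Gradual unfreezing with decreasing learning rates.
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

# Export the whole pipeline (tokenizer, vocab, weights) as the .pkl pushed to S3.
learn.export("job_relevance_classifier.pkl")
```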
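
For step 3, a minimal sketch of the kind of Flask inference endpoint described above. The bucket name, object key, and route are hypothetical, and in practice the .pkl is pulled when the Docker image is built rather than at request time.

```python
import boto3
from flask import Flask, jsonify, request
from fastai.text.all import load_learner

# Hypothetical bucket/key; a versioned S3 bucket lets us pin or roll back model versions.
BUCKET = "example-models-bucket"
MODEL_KEY = "job_relevance_classifier.pkl"
LOCAL_PATH = "/tmp/model.pkl"

# Fetch the exported learner once at startup (or at image build time).
boto3.client("s3").download_file(BUCKET, MODEL_KEY, LOCAL_PATH)
learn = load_learner(LOCAL_PATH)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"text": "<email body>"}.
    text = request.get_json()["text"]
    pred_class, _, probs = learn.predict(text)
    return jsonify({"label": str(pred_class), "confidence": float(probs.max())})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```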

Final Thoughts

I wish I could share my Jupyter notebook, but it contains sensitive email data. However, if you have any questions, feel free to reach out—I’m happy to share some code snippets or dive deeper into any part of the process.