Calibrations for software engineering interviews

Posted on 11/2/2020 · 10-minute read

After posting my takes on posture in system design interviews in TechWriters, David Golden inspired me to write about calibration in technical interviews. The angle I take on calibrations in this article is intentionally narrow: I focus on how calibrations contribute to a fairer, more balanced and, as a consequence, more efficient hiring process. To that end, I'm limiting the subject of calibrations to two perspectives:

  1. How a group of peer interviewers can assess candidates and arrive at a shared understanding of who is the best fit for a role;
  2. How a hiring manager can increase the certainty that they are making the best hiring decision for their team and organisation.

Even though my goal is to approach the subject with as much transposability and objectivity as possible, calibrating interviews isn't by any means an immutable, transferable or exact matter. As with any complex system or concept, this is an over-simplified representation of the subject, an attempt to make it more digestible. Calibrations are highly nuanced even within the same organisational culture, and even more so across different organisations. Different organisations expect different outcomes from their hiring processes, and covering calibrations with the depth needed to do the subject justice would overlap with the specifics of the hiring process designed for your organisation, which is out of scope here.

Context & Pre-requisites

For those who are unfamiliar with the term: interview calibrations serve the purpose of finding common ground between interviewers about a candidate's overall performance, and then measuring candidates' performance against each other, so that you move forward with the most appropriate candidate to the next step or to a job offer.

Even though interview calibrations can be unstructured, as in simply talking through overall impressions of a candidate, approaching them from this angle has a large propensity to go wrong. Hiring managers and the seniormost people in the interview loop are usually well-positioned – even if unintentionally – to influence others' opinions, therefore making the process less fair and more biased if you go fully unstructured.

To that extent, I believe the pre-requisites for reasonably calibrating candidates are:

  1. A framework that defines leveling organisation-wide by outlining roles, expectations and impact for each level – what Senior, Staff, and other roles mean in practice;
  2. A well-defined interview process per level or career track that is equitably used for all candidates, regardless of circumstances such as referrals or returning employees;
  3. For more structured interview stages, such as technical interviews, a database of questions with clear guidelines at the interviewer's disposal;
  4. A training program, required before interviewing, that covers what matters for the organisation, from practicalities to D&I and unconscious bias.

These pre-requisites help to establish a foundation for objective comparison against existing talent at the organisation, a standardised way to assess candidates, and assurance that interviewers understand how to evaluate candidates fairly.

The interviewer's role

I believe a good starting point from the interviewers' perspective, regardless of their role, is studying the framework for leveling people. Understanding leveling is inherently understanding the expectations for a role – and that's a starting point for ensuring fairness. In other words, beyond defining that whoever does x, y and z is a Senior Engineer, defining what type of impact and effect you expect these people to have in the role helps to clarify what you are looking for in the open role. Studying the framework helps calibrations because, if you understand the expectations for a role, you can assess a candidate more objectively.

Next up, having comparative data and criteria between candidates, seen through different lenses, can be helpful too. In situations where there are constraints, such as fewer open roles than candidates passing the interview bar, it can be hard to define who is better suited for the role, especially if different people interviewed them. I've seen this solved in two ways: having more than one interviewer per interview (e.g., one lead interviewer and one shadow interviewer) or having more than one interview of the same type per candidate (e.g., two system design interviews within the same interview loop done by different people). Both help to decrease bias and give interviewers more angles to debate, compare and calibrate candidates between themselves, through more perspectives on the same criteria.

Another critical task is defining upfront what is expected as a good outcome for an interview stage, so that you have something to calibrate against after the interview in case there are multiple passes. What "better" means is relative most of the time, so discussing it on the same terms with the other interviewers in case of multiple passes helps to keep discussions to specifics. As an example, for a system design interview, you are likely looking for criteria such as "completion", "prioritisation", "depth of knowledge", "breadth of knowledge", "communication", "trade-off awareness" and so forth. Observations outside the stage's scope can be recorded as notes and help with the overall decision, but to ensure the process works as envisioned when it was designed, it is essential to limit grading to the stage's criteria.
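To make this concrete, here is a minimal sketch of what grading against fixed stage criteria could look like; the `StageRubric` structure, the criteria descriptions and the `grade` helper are assumptions made for this example, not a prescribed format:

```python
from dataclasses import dataclass


@dataclass
class StageRubric:
    """Hypothetical rubric for one interview stage (illustrative only)."""
    stage: str
    criteria: dict[str, str]  # criterion -> what a strong signal looks like


SYSTEM_DESIGN_RUBRIC = StageRubric(
    stage="system_design",
    criteria={
        "completion": "covers the core requirements end to end",
        "prioritisation": "tackles the riskiest or most valuable parts first",
        "depth_of_knowledge": "goes deep on at least one component",
        "breadth_of_knowledge": "is aware of adjacent concerns (storage, caching, operations)",
        "communication": "keeps the interviewer in the loop on decisions",
        "trade_off_awareness": "names alternatives and explains why they were discarded",
    },
)


def grade(scores: dict[str, int], rubric: StageRubric) -> dict[str, int]:
    """Accept grades only for the stage's criteria; anything else stays in free-form notes."""
    unknown = set(scores) - set(rubric.criteria)
    if unknown:
        raise ValueError(f"Out-of-rubric criteria graded: {sorted(unknown)}")
    return scores
```

The guard in `grade` is the important bit: out-of-scope observations never leak into the score, which is what keeps the stage comparable across interviewers.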

To this point, defining how interview stage performance maps to candidate leveling can help to inform the decision process. For context, candidate leveling usually takes into consideration the overall performance of a candidate during the interview, plus their previous accomplishments and experience. That is, the overall performance should be supported by stage performance, which in turn should be backed by the interview criteria. A slightly more tangible example: in system design interviews, seasoned engineers are more likely to have a deeper awareness of the trade-offs being made than less experienced engineers. Understanding what to expect performance-wise for each interview stage at each level can help to make the decision fairer. Be mindful, though, that turning the assessment into an exclusively objective or mechanical process is not desirable either, which brings us to the next two points.
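Continuing the sketch above, a rough way to encode per-level expectations for a single stage could look like the following; the levels, the 1-4 scale and the thresholds are invented for illustration:

```python
# Hypothetical minimum expected score per criterion (1 = weak signal, 4 = strong)
# for the system design stage, by level. Numbers are invented for illustration.
LEVEL_EXPECTATIONS = {
    "mid": {"completion": 2, "trade_off_awareness": 2, "communication": 2},
    "senior": {"completion": 3, "trade_off_awareness": 3, "communication": 3},
    "staff": {"completion": 3, "trade_off_awareness": 4, "communication": 3},
}


def supported_levels(scores: dict[str, int]) -> list[str]:
    """Return the levels whose stage expectations this performance supports."""
    return [
        level
        for level, minima in LEVEL_EXPECTATIONS.items()
        if all(scores.get(criterion, 0) >= minimum for criterion, minimum in minima.items())
    ]


# Example: strong senior-level signals that don't yet reach the staff bar on trade-offs.
print(supported_levels({"completion": 3, "trade_off_awareness": 3, "communication": 3}))
# -> ['mid', 'senior']
```

The output of such a mapping should only inform the discussion; as noted above, reducing the assessment to a mechanical pass/fail is exactly what you want to avoid.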

You likely won't need to make a decision based exclusively on the signals gathered in the interview stage you conducted. That said, giving candidates the benefit of the doubt by default can be a compassionate way to speak of the abilities of someone you barely know. Interviewing is a stressful experience, especially when the candidate really wants the role. Instead of pretentiously judging whether someone is fit for a role or not, understanding the interview as a snapshot of observed behaviours helps you stay open to being convinced otherwise – both in favour of hiring and towards withdrawing your hire recommendation.

Finally, if you get to a situation where you have more than one great candidate per role and grading between interviewers for the same interview type reveals a draw, delegate the decision to other stages. It is hard to get an accurate picture of someone's ability through a 45-minute interview. At some point, the precision of stack-ranking candidates speaks more about your own biases than about their qualities, so let other criteria drive the decision instead.

The hiring manager's role

One of the main lessons I've learned as a hiring manager is to hire first for the organisation, then for the team. Things will go wrong often, regardless of the organisation's maturity, and a narrow, tightly-fit hiring strategy can backfire dramatically. Good examples are the project the candidate was an excellent fit for being deprioritised, or a decision to buy software from a vendor instead of building it in-house. Those are circumstances where calibrations are helpful – if leveling is standardised across the organisation, moving people to their next endeavour is less disruptive for everyone involved.

That said, resist the urge to think exclusively of the capability gaps your teams have and your particular needs. As a hiring manager you obviously should be focused on your problem space, but people can end up needing more than that to stay and grow within the organisation. Investigating whether similar needs exist somewhere else in your organisation can help you make a more thoughtful decision. Validating your assumptions with peers and aligning a potential further remit for your hires before even hiring them forces you to think about the candidate's progression path and ensures that you bring in someone with a clear problem space to own. On the flip side of the previous example, if things go incredibly well, this person will have options to pick from for their path forward.

Before deciding on a specific candidate, calibrate them against the leveling framework to ensure you're leveling newcomers fairly. Most companies have levels with overlapping responsibilities and compensation bands. To be promoted to the next level, one has to be consistently performing at that level, so when bringing new people in you want to level them correctly, but even more so, you want to make things right for their peers too. In cases where you are unsure, down-leveling is usually the right thing to do, as promoting someone within 6-12 months is simpler than demoting them.

A typical scenario as a hiring manager is biasing hiring towards your own projects, teams and needs. This article from Google describes how project/team-driven hiring impacts overall talent quality. As the hiring manager, what you can do in that regard is calibrate candidates against their potential peer group. Doing this can help to achieve results similar, in talent-quality terms, to the strategy used by Google. Think of adjacent teams or organisations that have people at the same level you're hiring for, within the same discipline, and understand that peer group's duties, impact and responsibilities. Doing so will likely reveal how the candidate would fit into the overall picture, thus helping to ensure talent quality across the organisation.

Furthermore, calibrate interviewers against each other, so you have a weighted average as a measure. Different interviewers apply different criteria when assessing people, even with a well-set frame. I have the impression that I tend to err on the bright side; others might be much more risk-averse and tend to recommend "no-hire" where I make "hire" recommendations. The vast majority of applicant tracking systems have calibration reporting, which can help you understand how interviewers tend to grade interviewees who got offers, and this can serve as a signal to gauge overall performance quantitatively without going too deep into analysis.
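One crude way to quantify that difference in leniency, assuming you can export per-interviewer scores for candidates who received offers from your ATS, is to offset each interviewer's scores by how far their historical average sits from the overall average. The data, names and helpers below are invented for illustration, not a feature of any particular ATS:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export of (interviewer, score) pairs for candidates who got offers.
history = [
    ("alice", 3.5), ("alice", 3.8), ("alice", 3.6),
    ("bob", 2.8), ("bob", 3.0), ("bob", 2.9),
]


def leniency_offsets(history):
    """How far each interviewer's average score sits from the overall average."""
    overall = mean(score for _, score in history)
    per_interviewer = defaultdict(list)
    for interviewer, score in history:
        per_interviewer[interviewer].append(score)
    return {name: mean(scores) - overall for name, scores in per_interviewer.items()}


def adjusted_score(interviewer, raw_score, offsets):
    """Discount (or boost) a new score by the interviewer's historical leniency."""
    return raw_score - offsets.get(interviewer, 0.0)


offsets = leniency_offsets(history)
# A 3.4 from a lenient grader is discounted; a 3.2 from a strict one is boosted,
# so the two become easier to compare on the same scale.
print(round(adjusted_score("alice", 3.4, offsets), 2),
      round(adjusted_score("bob", 3.2, offsets), 2))
```

This is intentionally simplistic; the calibration reports mentioned above usually give a richer picture, but even a crude offset makes it easier to weigh recommendations from interviewers with very different baselines.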

Then, for split decisions, where interviewers have diametrically opposed opinions or the uncertainty level is high across multiple stages, increase alignment and consistency by hosting an interview packet review meeting. This type of meeting usually consists of bringing together all interviewers who attended each stage of the candidate's interview process, building consensus, and then making a final decision. It is usually an excellent opportunity to understand where there might be ambiguity in a criterion within an interview stage, to give feedback to interviewers and receive feedback about the candidate and the process, and, as a consequence, to improve the process and make fairer, more accurate decisions.

As previously said, this is just the tip of the iceberg – software engineering interview calibrations are a fascinating subject that will likely require cross-disciplinary efforts from people teams, such as talent acquisition and human resources, as well as technical input. To ensure fairer calibrations and hiring, it is fundamental to continuously assess and measure the process, and to evolve it through quantitative and qualitative signals.