Getting started with interview transcription for postgraduate research

How do you learn how to transcribe well? Should you just pay someone else to do it for you? Here's what I found out!

I want to share more of the experience of being a postgraduate researcher, so welcome to a few thoughts on learning to transcribe interviews during my first year of doctoral work. I hope it's useful to new postgrads out there!

This is based on doing fairly straightforward, substantive-content-only transcription, so if you need to capture more nuance, also read something like this advice from KCL (part 2, part 3).

Good audio quality

Regardless of whether you're going to transcribe by yourself or hire someone to do it, getting the highest-quality recording possible is vital to making it go smoothly later. A bad recording loses intelligibility at key moments and costs more of your time (or money, if you pay someone else to do it).

Fortunately for me, long ago I studied audio engineering on the internationally renowned Tonmeister course here at Surrey, so I'm well equipped for the technical aspects of recording. In my interviews I used my personal Zoom H2n field recorder, which gets excellent results out of a small package, with an option to use mid-side recording if you have a lot of off-axis noise you'd want to reduce. It can also be used as a USB audio interface with a laptop, if you need to record direct to encrypted storage for information governance reasons.

For phone conversations, using this Olympus in-ear microphone gave me surprisingly good results and luckily my H2n is able to power it through the input jack (not all recorders do this).

The H2n automatically manages the record gain level in a very reliable way without clipping, plus has a built in limiter pre-set for dialogue, so with no further work I can get very usable audio for my transcriptions.

I still found there was some noise worth reducing, be it traffic noise, people passing in the corridor outside, or the plastic tapping of my phone against the Olympus earpiece. For all of that I used the industry state of the art: iZotope RX 8. I'll have a full review of RX 8 for researchers in future, with audio examples; for now, what you need to know is that even the cheapest Elements version is really great for dialogue. It's easy to use and can be had from as little as $18 when it's on sale (but get it as part of the Elements Suite for $30 on sale; the other plugins will be useful too). iZotope offer a generous 50% educational discount too, if you can't wait for a better price.

Once you've got nice, clean audio to listen to, you'll want to invest in some studio-quality headphones. Preferably closed-back for physical noise isolation, or with active noise-cancellation. I've found many postgrads like to buy Sony or Bose active noise-cancelling headphones for use in the office anyway. If you don't want to spend that much, headphones without active noise-cancelling hit diminishing returns in quality beyond about £130. Personally I use Sennheiser HD 380 Pros, but Audio-Technica ATH-M50Xs are also a great option. If you've never owned good headphones you might be surprised what you can hear! Just watch the volume levels and listen at the very lowest you can get away with, so you don't damage your ears (hearing loss is based on volume and exposure time).

What is good transcription?

After cleaning the audio, my next question was this! What does good transcription actually mean? How literally do you write down what someone has said, in what tone of voice, with what non-verbals? What about pauses of various lengths? How do you handle punctuation for natural speech patterns? Is there some standard notation to follow?

Sage Research Methods seemed the obvious first stop for answering these questions. And while I did find a few good pieces of advice there, it didn't have nearly as much information as I expected. I think the reality is that what and how you transcribe will depend on the needs of your research, so maybe I should've expected a lack of consensus on the technicalities and process.

What I ended up deciding is that because I'm using realist methodology, which looks for units of theoretical explanation (as contexts, mechanisms and outcomes), I would be best off transcribing for substantive content only. That means for me that what someone said is more important than the way they said it. Although that of course will not be true of all research, especially ethnography.

This gave me permission not to transcribe the conversational decorations I'd had little idea we all used so much until I had to write them down. Not just ums and errs, but many "likes" and "you knows", and repetitions while someone works out actually what they're... what they're saying.

As I worked my way through a few interviews, it became clearer to me what meanings I wanted to include and how. So for example only a significant... pause for thought would be shown with ellipses. Broken off and restarted thoughts were indicated with a hyphen- Right into the next sentence. Dependent thoughts were indicated – as you might expect – with en-dashes. Only the most deliberate emphasis was shown with italics, because I found on listening back that we place varying weight on words all the time as part of the natural cadence of speech. In the back of my mind was also the idea that anyone else reading these transcriptions under our Open Data policies shouldn't need a manual to decipher them.

The thing I had most difficulty with was where to draw the line between commas and periods. The way we speak naturally flows into itself much more than written communication, so you end up with comma after comma, in long run-on sentences that we would strictly avoid in written forms. I ended up getting a feel for where, even though there was no pause verbally, I needed to shove a period in to break up the meaning. Otherwise I'd have a lot more difficulty parsing the meaning when it came to analysis time!

Losing the will to go on

I don't think I'm alone in finding transcription to be a boring, mechanical task. One of the great tips I got from Sage Research Methods was to do your first (and "freshest") analytical pass just listening to the audio once through, and I'm glad I did. I spent much of the transcription itself struggling to maintain my concentration. It takes a long time and, as I said above, you have to keep part of your brain working on issues like tonality and punctuation rather than consuming the content of the interview. PGR colleagues here have told me they enjoyed transcribing, with listening back sparking their memory of particular details, so your mileage may vary!

How long does it take to transcribe? I'm not a great typist, about 55wpm accurately, so for me it was somewhere between 3x and 4x real time to transcribe substantive content. I'm told that isn't actually bad, but it's clearly not at a professional level. I suspect I'm benefitting from a lifetime of musical study too, for the combined listening skills and finger dexterity. If I could type at a professional 120wpm, I might be able to do it almost uninterrupted.
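The arithmetic behind that multiplier is easy to sketch, if you're curious. The figures below (55wpm, ~7000 words for a 45-minute interview) are from this post; the function itself is just my illustration, assuming typing speed is the bottleneck:

```python
def transcription_multiplier(words, audio_minutes, wpm):
    """Real-time multiplier if you typed the transcript non-stop."""
    typing_minutes = words / wpm
    return typing_minutes / audio_minutes

# 7000 words at 55wpm is ~127 minutes of pure typing for 45 minutes
# of audio, i.e. roughly 2.8x real time before any rewinding or
# punctuation decisions. At 120wpm it drops to roughly 1.3x.
print(round(transcription_multiplier(7000, 45, 55), 1))   # ≈ 2.8
print(round(transcription_multiplier(7000, 45, 120), 1))  # ≈ 1.3
```

In other words, the gap between my 3-4x and the theoretical 2.8x is all the rewinding, relistening and punctuation deliberation.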

To make my life easier I set up VLC media player with global hotkeys enabled for play/pause and time-skipping, then wrote a small script with Bome MIDI Translator to bind the hotkeys to an old MIDI floorboard. This meant I could control playback with my feet while I typed in MS Word. The hotkeys had to be something that wouldn't trigger anything in Word, so I used Ctrl+F1 to F5.
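If you don't have Bome MIDI Translator, the same pedal-to-hotkey idea can be sketched in Python. This is only a sketch, assuming the third-party `mido` and `pyautogui` packages; the note numbers are placeholders, so check what your own floorboard actually sends before mapping anything:

```python
# Placeholder MIDI note numbers mapped to VLC global hotkeys
# (Ctrl+F1 to F5 chosen, as in the post, to avoid clashing with Word).
PEDAL_MAP = {
    60: ("ctrl", "f1"),  # play/pause
    62: ("ctrl", "f2"),  # skip backwards
    64: ("ctrl", "f3"),  # skip forwards
}

def hotkey_for(note):
    """Translate a MIDI note number to a hotkey tuple, or None if unmapped."""
    return PEDAL_MAP.get(note)

def listen(port_name):
    """Forward footswitch presses to the OS as keystrokes."""
    import mido, pyautogui  # assumed third-party packages
    with mido.open_input(port_name) as port:
        for msg in port:
            if msg.type == "note_on" and msg.velocity > 0:
                keys = hotkey_for(msg.note)
                if keys:
                    pyautogui.hotkey(*keys)
```

The advantage of a dedicated translator like Bome is that it runs reliably in the background; a script like this is just the free-and-hackable route.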

My guitar multi-fx floorboard from the early 2000s, still going strong today.

Don't forget you can use "find and replace" afterwards to create shortcuts for anything you're typing a lot. At the minimum I do this for interviewer and participant monikers that indicate who's speaking, but I'd be an idiot to waste time typing out "Schwartz Rounds" in full every time, instead of "sr".
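If you'd rather script the expansion than run find-and-replace by hand in Word, a few lines of Python will do it. A minimal sketch, where "sr" for "Schwartz Rounds" is from my own transcripts and the speaker monikers are made-up examples:

```python
import re

# Shorthand -> full text. Order doesn't matter here because no
# expansion contains another shortcut.
SHORTCUTS = {
    "sr": "Schwartz Rounds",
    "INT:": "Interviewer:",
    "P1:": "Participant 1:",
}

def expand(text):
    """Expand whole-word shortcuts, leaving everything else untouched."""
    for short, full in SHORTCUTS.items():
        # Lookarounds stop "sr" matching inside words like "surprise".
        text = re.sub(rf"(?<!\w){re.escape(short)}(?!\w)", full, text)
    return text

print(expand("INT: When did sr begin?"))
# -> Interviewer: When did Schwartz Rounds begin?
```

The word-boundary lookarounds matter: a naive find-and-replace on "sr" would happily mangle the middle of other words.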

With all that, there was still physical discomfort. A 45-ish-minute interview ended up as over 7000 words of substantive content, and I'm just not used to needing to type that much in a day. I can't for a minute imagine doing this sustainably on a laptop – fortunately my desktop PC at home has a mechanical keyboard. Not office friendly, but much nicer to type on at speed. It's not an ergonomic keyboard though, and I very much should have invested in some wrist support. By the time I'd finished an interview I felt like I'd done a fierce round of Hanon piano exercises! I'm now splitting transcription up into five-to-ten-minute chunks of audio, so I have regular breaks, both physically and mentally.

If you can't touch-type and you're embarking on your first adventures in qualitative research, now is the time to learn! Don't even think about transcribing for yourself if you have to look at the keyboard to know where to tap. There are loads of resources out there for touch-typing, even some fun ones (hello Sega's The Typing Of The Dead), so practising doesn't have to be boring.

Don't forget to back up your files

No, seriously, don't forget to back up your files in accordance with your data management plan. Multiple local and cloud redundancy is a minimum standard. Don't necessarily trust your host institution to get it right either; I've heard horror stories of researchers with irretrievable data loss that way. And of course encrypt everything you store with at least AES-128 (7-Zip is free and will do this for you, while compressing an archive) and a strong password!
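As a concrete example, here's roughly what that looks like with 7-Zip's command-line tool. The archive and folder names are placeholders; note that the .7z format's encryption is AES-256, comfortably above that AES-128 minimum:

```shell
# -p prompts for a password; -mhe=on also encrypts the file names,
# so the archive listing itself isn't readable without the password.
7z a -p -mhe=on interviews_backup.7z transcripts/ audio/
```

Keep the password somewhere safe and separate from the archive itself – an encrypted backup you can't open is just data loss with extra steps.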

Hang on, what about automated transcription?

So, I used some recordings of my supervision meetings (with consent) to see if Amazon's speech-to-text service on AWS (based on their Alexa technology) might get me, say, 80% of the way there. It did not.

Firstly, it was not at all competent at understanding multiple speakers with partially overlapping speech, so you would have to use a multi-channel mic setup just to get proper separation between interviewer and participant. I don't think clipping a lapel mic to someone and wiring them into a professional audio interface is a great way to start an interview, in most circumstances.

Secondly, it only accurately recognised common words, so you have to give it a phonetic dictionary of all the technical domain words you might encounter. Or train a custom model from domain-specific text data. For my purposes, interviewing highly-educated programme architects of a complex intervention, that was a non-starter. The list would just be too laborious to produce.

Thirdly, the output is a JSON-formatted list of words and timestamps, with ranked options based on the algorithm's confidence. Where it has misheard a word, you would want to quickly explore the lower-confidence alternatives, so now we're in a place where we need to write a custom interface to display and select from that list while replaying the audio. I did actually find an open-source example of that, but it wasn't quite how I wanted to work. Chances are, many of you reading wouldn't be able to write your own tool for this, and after years of coding practice, I'm only just able to now! But it's definitely not worth the time investment for a single research project.
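To give a flavour of the problem, here's a minimal sketch of walking that JSON in Python. The structure mirrors the general shape of the service's output (a list of items, each with ranked alternatives and confidence scores), but the sample below is hand-made and the 0.8 threshold is an arbitrary choice of mine:

```python
# Hand-made sample in the style of the service's output; confidences
# are strings in the real JSON, hence the float() conversions below.
sample = {
    "results": {
        "items": [
            {"type": "pronunciation",
             "alternatives": [{"content": "Schwartz", "confidence": "0.41"},
                              {"content": "shorts", "confidence": "0.39"}]},
            {"type": "pronunciation",
             "alternatives": [{"content": "Rounds", "confidence": "0.97"}]},
        ]
    }
}

def flag_uncertain(transcript, threshold=0.8):
    """Yield (best_word, other_alternatives); alternatives are only
    surfaced when the top pick falls below the confidence threshold."""
    for item in transcript["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # skip punctuation items, which carry no confidence
        alts = item["alternatives"]
        best = alts[0]
        if float(best["confidence"]) < threshold:
            yield best["content"], [a["content"] for a in alts[1:]]
        else:
            yield best["content"], []

print(list(flag_uncertain(sample)))
# -> [('Schwartz', ['shorts']), ('Rounds', [])]
```

That's the easy half; the hard half is the interface for reviewing each flagged word against the audio, which is exactly the tool I decided wasn't worth building.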

Fourthly, my local information governance office didn't seem to entirely understand the idea of this technology when I submitted for ethics, so even if it was worth it, I'd still have a fight to be able to use it. There are of course technical legal constraints under GDPR about regional transfer, storage and processing of data here. I have a law degree so I'm comfortable with navigating that, but that's not going to be the case for many postgrads. If you wanted to try Google's speech-to-text API for example, you might quickly land yourself in trouble by accidentally sending data outside the EU.

If you've seen a company get automated transcription right I'd love to hear about it, but I suspect most are white-labelling Amazon or Google services with some manual clean-up that they've quietly off-shored to India or similar, to get the accuracy they're claiming.


In conclusion...

  1. Record good audio. Learn how to clean it up for best quality.
  2. Think about what you need to transcribe to preserve the meaning you need and how you'll notate that.
  3. Don't neglect the physical and mental burden. Good chair. Good keyboard. Footpedals. Regular breaks.
  4. Improve your touch-typing as much as you reasonably can.
  5. BACK UP YOUR DATA EFFECTIVELY AND SECURELY. You will be a sad monkey if you don't.

This monkey regrets not backing up their research data. But it is too late.