How to generate SRT subtitles from audio recordings using Word
A quick note about a free-ish (or relatively low cost) and not-too-finicky but a-bit-technical way to generate subtitles for video interviews using Microsoft Word – up to 1hr in length (00:59:59,999) – or longer if you cut up your video into <1hr segments. How long this will keep working, I have no idea – but it works at the time of this post and it doesn’t involve using YouTube and dodgy .srt downloaders.
I looked around the interwebs for SubRip .srt subtitle generators for some long-form video interviews I’ve been doing, and it seems like most options, beyond short try-us-out freebies, require ‘pro’ subscriptions that cost hundreds of dollars a year (advertised as ‘monthly’ subscriptions, but cynically billed as annual charges). That might be fine for some production studios churning out hundreds of hours of content per annum. For me, it’s hardly worth the expense, as I only require this short term for a few videos – and there seems little guarantee that the accuracy of transcription will be very good for regional Australian accents with a technical lexicon, despite the hyperbolic promises of ‘AI’ 90% accuracy. Take that with several large grains of salt.
Like all subtitling approaches, results will need to be edited for human nuance, timing, concision and accuracy. Human-quality edited text transcriptions are also suitable for automated/human translation approaches.
This requires DaVinci Resolve, an audio editor, and both Word 365 online and the Word desktop edition. The pricey bit is Word, so perhaps there are alternative approaches via LibreOffice and other open-source speech-to-text solutions? Let me know if you know, in the comments. This approach worked well enough for me, without faffing around with complex regex editors in Bash/nano/vi/Python or whatever. NimbleText looks like a good option on Windows – certainly for learning about how regex works.
It’s far from perfect – this approach concatenates/conflates different Speakers, so you need to be aware of this at the outset – it’s best for a single fluent interviewee. It requires some manual interventions and looks a bit complicated, but once you understand the workflow it’s actually pretty quick – and free if you have access to Word online and desktop versions.
If you have an improvement or alternative solution, constructive feedback is welcome in comments below.
1] Edit your interview video in DaVinci Resolve (it’s my cross-platform preference and imports .srt format subtitle files).
2] Export the voice-only audio track that you wish to transcribe. Resolve has an ‘audio only’ export setting for .mp4 audio, using the AAC codec. Alternatively, open your video in QuickTime Pro, export to .wav and convert to .mp3 using Shutter Encoder or Audacity. Shutter Encoder will also export audio directly from video to .mp3. Whatever works.
Note: the reason to use .mp3 or .mp4 is to cut down file size from .wav format, as this will affect the next step.
3] Open Office 365 and create a new empty Word document.
4] Select ‘Transcribe’ under the Dictate menu and upload your audio file. It needs to be not too large (under 200MB) – hence our use of .mp3.
5] It will take a little while to upload and process the transcription. Just follow the onscreen prompts in the transcription pane to add the transcribed text to the document.
6] You should now have something like this:
7] Now we need to get this into .srt format, which seems to be pretty rigid with regard to whitespaces and other invisible horrors – certainly getting them into DaVinci Resolve required that it be spot on. I tried several times with seemingly innocuous blank lines that were clearly NOT ACCEPTABLE! – with no feedback whatsoever. There’s no conditional processing – if it’s wrong it just won’t work. It’s a dumb parser.
8] First of all delete all unnecessary title/heading info – select and delete the .mp3 file info, all the whitespace down to text (“Transcript”) and stop at the first timestamp 00:00:01. Leave the timestamp and everything after it.
9] We need to get this transcript strictly into SubRip .srt format, which is as follows:
(i) Each separate subtitle needs to be numbered in sequence
(ii) Timestamps need to be of the format HH:MM:SS,xxx – where xxx is milliseconds from 000 to 999.
(iii) Requires a subtitle duration start and end timestamp separated by " --> "
That’s “whitespace, hyphen, hyphen, greater-than sign, whitespace”, meaning two plain keyboard hyphens (short dashes, not en-dashes or em-dashes) followed by “>”, with a single space on either side. In your browser it may appear to be displayed as whitespace + long dash + whitespace, which is different. Who ever thought of this stuff? Bloody confusing – double-check the image above!
(iv) Requires the subtitle text body
(v) Requires a trailing blank line before the next subtitle number entry
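To make those five rules concrete, here is a minimal Python sketch that builds a single subtitle entry. The function names `to_timestamp` and `format_cue` are my own, purely for illustration, and the sample text is made up.

```python
def to_timestamp(total_ms: int) -> str:
    """Convert a time in milliseconds to the SRT form HH:MM:SS,xxx."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one entry: number, timing line, text body, trailing blank line."""
    timing = f"{to_timestamp(start_ms)} --> {to_timestamp(end_ms)}"
    return f"{index}\n{timing}\n{text}\n\n"


# Made-up example: subtitle 1 runs from 1 second to just under 10 seconds.
cue = format_cue(1, 1_000, 9_999, "Hello and welcome.")
print(cue, end="")
```

Note that the separator is typed as two ordinary hyphens and a greater-than sign, with one space either side, and that the entry ends with an empty line.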
Now we understand that, let’s start
10] Remove the strings ‘Speaker 1’, ‘Speaker 2’ etc by a simple wildcard Find and Replace
(i) Use wildcards
(ii) Find: “Speaker ?” (without the quotes!)
(iii) Replace: “” (without quotes – yep that means leave it blank)
(iv) Voilà! All instances are gone (including in the main text, if your speakers are talking about speakers, but you can add that again later manually if need be). There’s probably a more cunning way using ‘special’ functions rather than wildcards, but this was good enough for me.
(v) Save your file and download a copy to edit on the desktop. Good idea to keep an original version and make ‘save as’ incremental copies (xxx_v001) as you go along and inevitably make mistakes 🙂
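For anyone who would rather try this outside Word, the same wildcard idea can be sketched in Python. This is only an illustration: the `sample` text is my guess at roughly what a transcript looks like, not a real export.

```python
import re

# Hypothetical transcript snippet (layout assumed, not copied from Word).
sample = (
    "00:00:01 Speaker 1\n"
    "Hello and welcome.\n"
    "00:00:09 Speaker 2\n"
    "Thanks for having me.\n"
)

# "." plays the role of Word's "?" wildcard: exactly one character of any kind.
cleaned = re.sub(r"Speaker .", "", sample)
print(cleaned)
```

As with the Word version, this also removes the phrase anywhere else it appears in the text, and it leaves the space after each timestamp behind.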
11] The first step requires that all subtitle sections be sequentially numbered.
We resolve this by adding number sequence fields using the desktop version of Word.
(i) Position cursor at the first line of the document. Insert a new line if necessary (yep, press ‘enter’ or ‘return’).
(ii) Click on QuickParts in the ribbon and select ‘add field’
(iii) In the Choose fields menu, select ‘Numbering’
(iv) Select ‘Seq’ as field type (highlighted in blue)
(v) Give it a name, like “SEQ myseq”. You need this first identifier – “myseq” (or whatever you wish), but don’t bother with [Bookmark] and [Switches].
(vi) Click OK, and the number “1” will appear on the first line in the document, with an invisible paragraph (^p) marker next to it. That’s it for now – just leave it there in the document – you don’t need to add more. You’ve just added a sequential field (called “myseq”) and the first value is “1”. Simple.
12] Now comes the fun part!
(i) Select Replace
(ii) Use Wildcards – Find: “00:(??):(??)(*)00:(??):(??)” – without the quotes & make sure there are no white spaces! Note: this will only work for this particular Word transcription format and associated character strings.
This is basically Word’s version of a regular expression (regex) that says: find a string that starts with “00:”, then any two characters (partnumber 1), a colon, any two characters (partnumber 2), then some arbitrary bit of text we’ll leave as is (partnumber 3), then another “00:”, any two characters (partnumber 4), a colon and any two characters (partnumber 5). Each bracketed part can be reused in the replacement as \1 to \5.
(iii) Replace with (but DON’T CLICK REPLACE YET!): “^p^c01:\1:\2,000 --> 01:\4:\5,999^p\3” – no quotes and no extra white spaces!!
(iv) Leave the Replace dialogue box open, go back to the document, select that number ‘1’ number sequence field AND its invisible (or here visible) paragraph marker and control-X to cut it (or control-C to copy) – either way we need to copy it into the clipboard memory.
(v) Now click Replace All. If it stuffs up, just control-Z to undo and try again!
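If it helps to see how the numbered parts get reused, the find-and-replace pair above can be imitated in Python on a tiny made-up fragment. This is a sketch of the idea using Python’s `re` module, not what Word does internally, and the `fragment` text is an assumption about the transcript layout. The `^p^c` prefix (blank line plus number field) is left out because it has no Python equivalent.

```python
import re

# Hypothetical fragment: one timestamp, the spoken text, then the next timestamp.
fragment = "00:00:01 \nHello and welcome.\n00:00:09"

# ".." stands in for Word's "??" and the lazy ".*?" for Word's "*".
pattern = re.compile(r"00:(..):(..)(.*?)00:(..):(..)", re.DOTALL)

# Same layout as the Word replacement, without the leading ^p^c.
replacement = r"01:\1:\2,000 --> 01:\4:\5,999\n\3"

result = pattern.sub(replacement, fragment)
print(result)
```

In this sample, part 3 keeps the space and line break that followed the first timestamp, so a near-empty line ends up between the timing line and the words. That is the kind of leftover that needs tidying up afterwards.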
The “^p^c01:\1:\2,000 --> 01:\4:\5,999^p\3” instruction basically says: “insert a paragraph break to create a blank line before the following”, “insert the clipboard contents (sequential number field and paragraph marker)”, insert “01” (this replaces the 00 with 01, which conforms with DaVinci Resolve’s default timecode start), insert the same partnumber 1 that we found above, insert the same partnumber 2, add “000” as milliseconds, insert the " --> " separator with its whitespaces at start and end, insert “01”, insert partnumber 4, insert partnumber 5, insert “999” as milliseconds, insert a paragraph break, insert partnumber 3 (the arbitrary bit of text – which is what the Speaker is actually saying). Repeat till the end of the document. Stop.
13] Now you should have something like this. Select all (control-A) and right-click on one of the Field numbers and choose ‘Update Field’ – and all the entries will be sequentially numbered.
Just a couple more things to tidy up.
14] There are now too many blank lines between subtitle parts, so these need to be removed.
(i) Turn wildcards off. Find: “^w^p” – that means ‘whitespace+paragraph marker’
(ii) Replace: “” – yep, leave blank.
15] Now it should all be in the expected .srt format (see step 9, above) – though it pays to check manually that there are no blank lines between timestamp and interview body text. In this instance there was one I found, so not too bad. Manually deleted it and it works.
The final timestamp of your document must also be checked – manually modified or deleted – as it will be all on its own. Just make sure the preceding one is in the correct format.
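The manual check can be backed up with a few lines of Python once the text is saved as a plain .srt file. This is a rough sketch under my own assumptions: `check_srt` and `TIMING` are names I made up, and it only looks for the problems mentioned here (a malformed timing line, a missing number above it, or a blank line between the timing and the text).

```python
import re

# A well-formed timing line: HH:MM:SS,xxx --> HH:MM:SS,xxx
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def check_srt(text: str) -> list[str]:
    """Return a list of problems found around each timing line."""
    lines = text.splitlines()
    problems = []
    expected = 1  # subtitle number we expect above the next timing line
    for i, line in enumerate(lines):
        if "-->" not in line:
            continue
        if not TIMING.fullmatch(line):
            problems.append(f"line {i + 1}: malformed timing line {line!r}")
        if i == 0 or lines[i - 1].strip() != str(expected):
            problems.append(f"line {i + 1}: expected number {expected} on the line above")
        if i + 1 >= len(lines) or not lines[i + 1].strip():
            problems.append(f"line {i + 1}: blank line between timing and text")
        expected += 1
    return problems


good = "1\n01:00:01,000 --> 01:00:09,999\nHello and welcome.\n\n"
bad = "1\n01:00:01,000 --> 01:00:09,999\n\nHello and welcome.\n\n"
print(check_srt(good))
print(check_srt(bad))
```

An empty list means nothing suspicious was found. Subtitle text that itself contains "-->" would confuse this check, so treat it as a quick sanity pass rather than a validator.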
16] Read the DaVinci Resolve help to learn how to import, insert and display your .srt subtitles.
Voilà! Hours of manually transcribing into Resolve’s horrible little subtitle box avoided!
Final note: in my experience there is some discrepancy between the timestamps in the transcription and where they appear in Resolve. But they are close enough and easy to move around and edit. I expect to edit them anyway and correct all the mis-transcriptions. Of course, there would be cunning ways to work around this, but that’s not something I’m concerned with in this quick-and-cheap approach.
This is a fairly straightforward procedure for doing a good-enough job for what I needed to do. Of course, it will take a fair bit of editing within Resolve to get all the subtitles corrected, aligned, and separated into sub-parts as necessary – this is nowhere near an expensive fully automated process, so try at your own risk.
For my purposes it saved what could have been weeks of tedious manual transcription of hours of interviews, and the hundreds of dollars I might have had to pay using the current solutions I am aware of. It’s enabled me to get 8 hours of basically accurate subtitles set up in a day, and fixed with a couple of days’ editing, whilst editing the video concurrently. Not too bad. And it didn’t cost me a cent, apart from my time figuring out how to do it and doing it.
So is it worth it? Yes, absolutely – I didn’t need to sign up to yet ~another~ “service”. Don’t need to unsubscribe or cancel automated debit systems. Don’t need to watch awful YouTube regex videos: “Hey YouTubers, Blah blah blah, too-long-canned-logo-sequence, empty promises, adverts, irrelevant bollocks, adverts, simple solution, adverts, now pay here and actually it’s quite expensive”.
Of course, in a few years’ time this will probably be an AI ‘feature’ of Resolve, FCPX and Premiere Pro etc. Probably yet another subscription with its accompanying deluge of marketing emails. Sigh. Then is the time for DIY Python and making a more solid approach in something like Blender.
Until then, I hope this helps and good luck!