My first coding side project: Audio to Text
Every year or so, I fantasize about learning how to code, in some way or another.
- In college, I went from VBA to Python for financial analysis. I also discovered JavaScript and basic algorithm logic.
- In 2019 and 2020, I became interested in AI before the current AI madness. I followed a great course from Stanford taught by Andrew Ng and loved it. It provided me with useful high-level knowledge of how machine learning works, as well as Python.
- Then I used Python again to follow a crash course on Quantum Computing from Qbraid. I was more interested in the math/physics part of the curriculum but I enjoyed the coding part as well.
- At Cyrius, I used HTML and CSS to create phishing email templates.
- I also started to experiment with AI code helpers and created some browser extensions and basic web apps in Node.
But I was doing things wrong.
In the same way I realized that theoretical knowledge would not get me into entrepreneurship, I eventually figured out that piling up bits of unfinished courses would not teach me how to code in real life.
Recently, when I got interested in audio-to-text algorithms and RAG, I began by picking up a coding course on Brilliant (which is great, by the way). After a few hours, realizing I was following the same pattern again, I made myself a promise: this time, I would have a demo app live somewhere on the Internet.
My goal was to better understand what it takes to push an AI app into production. That includes digging into how LLMs work, the difference between open-source and commercial models, how to implement RAG... And, on top of that, all the normal steps of releasing a web app (the infra, the relationships between scripts, even Git basics!).
Above all, I would not try to code myself but rather have an AI do it for me. After all, I don't want to become a developer, I just want to understand how tech works.
After a few dozen hours of tweaking, and as simple as it may look, I am proud to present my very first deployed app to the world:
Audio To Text
Note: It will probably take 30-60 seconds to load. Here is the GitHub repo in case the page does not load at all.
The Process
I picked a simple problem: I would love to chat with the transcripts of my favorite French podcast, Generation Do It Yourself (400+ episodes of 2-4 hours each).
I envisioned a database of all transcripts that I could read instead of listening to and a chatbot feature that would retrieve information from all transcripts.
That seemed straightforward: take the audio, convert it to text, and make it searchable. However, the project, which began as a basic transcription tool, evolved into a sophisticated audio processing system through a series of challenges and discoveries.
Side note: As I am having Claude do all the work, I will use a "we" pronoun.
Starting Simple (Version 1)
We started with the simplest possible solution: a local implementation using OpenAI's Whisper model. This approach seemed logical - Whisper was open-source, well-documented, and had shown promising results in various languages, including French.
Our first implementation was basic but functional:
- Load the Whisper model locally
- Process the entire audio file as a single unit
- Output a plain text transcription
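Stripped to its essentials, Version 1 looked something like the sketch below. It uses the open-source openai-whisper package; the function name and the default model size are illustrative, not our exact setup:

```python
def transcribe_locally(audio_path, model_size="base"):
    """Version 1 in a nutshell: load Whisper, feed it the whole file, get text back."""
    import whisper  # lazy import: openai-whisper is a heavy dependency

    model = whisper.load_model(model_size)  # downloads the weights on first run
    # The entire episode is processed as a single unit -- exactly the design
    # choice that caused the memory and accuracy problems described above.
    result = model.transcribe(audio_path, language="fr")
    return result["text"]
```

Three lines of logic, which is why it seemed so reasonable at the time.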
However, this simple approach quickly revealed several critical limitations:
- Processing Time: What worked fine for short clips became impractical for full episodes. Processing a single four-hour episode could take several hours, making the system impractical for bulk processing or real-time applications.
- Memory Constraints: Loading long audio files into memory all at once led to system crashes. Even when the system didn't crash, memory usage was unsustainable for a production environment.
- Accuracy Degradation: We noticed that transcription accuracy tended to decrease as the audio length increased. This was particularly noticeable with French content, where context is crucial for accurate transcription.
Moving to the Cloud (Version 2)
These initial challenges led us to the first major pivot: moving from local processing to OpenAI's Whisper API. This decision wasn't just about offloading processing power; it was about accessing a more robust and optimized implementation of the model.
However, this solution introduced its own set of challenges:
- API Timeouts: The API had strict timeout limits, making it impossible to process long audio files in a single request. This led to the development of our chunking strategy.
- Cost Considerations: While local processing had hardware costs, API calls had direct financial implications. We needed to balance accuracy and completeness with cost-effectiveness.
- Chunking Complexity: Breaking long audio files into smaller chunks solved the timeout issue but introduced new challenges:
- How to ensure smooth transitions between chunks?
- What was the optimal chunk size for balancing processing time and accuracy?
- How to maintain speaker consistency across chunk boundaries?
Through experimentation, we settled on 20-minute chunks as our sweet spot. This duration was long enough to maintain context but short enough to avoid API timeouts and manage costs effectively. However, this solution, while functional, highlighted our next major challenge: speaker identification.
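The chunking arithmetic itself is simple. Here is a sketch of how the 20-minute windows can be computed; the 30-second overlap between consecutive chunks is an illustrative way to smooth transitions, not a value from our final system:

```python
CHUNK_SECONDS = 20 * 60   # the 20-minute sweet spot
OVERLAP_SECONDS = 30      # assumed overlap to smooth chunk transitions

def chunk_boundaries(duration):
    """Split an episode of `duration` seconds into overlapping (start, end) windows."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - OVERLAP_SECONDS  # back up a little so no words fall in a gap
    return bounds
```

A four-hour episode (14,400 seconds) comes out as 13 overlapping chunks, each small enough to stay under the API's timeout.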
The chunk-based approach worked for basic transcription but made it difficult to maintain speaker consistency throughout an episode. When we split a conversation into 20-minute segments, we lost the broader context that helps identify who's speaking. This realization led us to our next major evolution: the integration of speaker diarization.
Enter Speaker Diarization (Version 3)
Our first attempt at solving this problem introduced pyannote.audio, an open-source speaker diarization toolkit. The initial implementation was straightforward: process the audio to identify different speakers, then label each segment of the transcript accordingly.
However, this seemingly simple addition exposed new layers of complexity:
- The Chunk Problem: Our existing 20-minute chunk strategy, which worked well for basic transcription, became problematic. The diarization system would restart its speaker labeling for each chunk, meaning "Speaker A" in one segment might be "Speaker B" in the next, even when it was the same person talking.
- False Changes: Brief interruptions or background noises sometimes triggered false speaker changes, making conversations appear more fragmented than they actually were.
- Processing Demands: Adding diarization significantly increased our processing time, sometimes doubling or tripling it for a single episode.
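To make the chunk problem concrete: one way to stitch chunk-local labels back together is to compare per-speaker voice embeddings against a running registry. This is a toy sketch with made-up two-dimensional vectors and an assumed similarity threshold; real diarization embeddings (e.g. from pyannote) have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def relabel(chunks, threshold=0.8):
    """Map chunk-local labels ("SPEAKER_00", ...) to episode-wide ones ("Speaker A", ...).

    `chunks` is a list of dicts mapping each local label to that speaker's
    voice-embedding vector for the chunk."""
    registry = []  # one representative embedding per global speaker
    mapping = []   # per chunk: local label -> global label
    for chunk in chunks:
        local_map = {}
        for label, emb in chunk.items():
            best, best_sim = None, threshold
            for i, known in enumerate(registry):
                sim = cosine(emb, known)
                if sim > best_sim:
                    best, best_sim = i, sim
            if best is None:            # voice not seen before: register a new speaker
                registry.append(emb)
                best = len(registry) - 1
            local_map[label] = f"Speaker {chr(ord('A') + best)}"
        mapping.append(local_map)
    return mapping
```

Without something like this, "SPEAKER_00" in chunk one and "SPEAKER_00" in chunk two are unrelated labels, which is exactly the problem we kept hitting.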
This version taught us a crucial lesson: adding features often means rethinking your entire approach. What worked for simple transcription wasn't necessarily going to work when adding more sophisticated audio analysis.
A More Sophisticated Solution (Version 4)
Having learned from our initial diarization attempts, we developed a more sophisticated approach. Instead of processing speaker identification independently for each chunk, we implemented a system that could maintain speaker consistency across an entire episode.
The key innovation was speaker clustering - collecting voice characteristics across the whole episode and using this data to ensure consistent speaker identification. We also added smoothing algorithms to reduce sporadic speaker changes and false identifications.
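A minimal version of the smoothing pass might look like this; the one-second threshold is illustrative, and the real pipeline combined it with the clustering described above:

```python
def smooth(segments, min_duration=1.0):
    """Merge speaker turns shorter than `min_duration` seconds into the
    previous turn, treating them as false speaker changes.

    `segments` is a list of (start, end, speaker) tuples sorted by time."""
    smoothed = []
    for start, end, speaker in segments:
        if smoothed and (end - start < min_duration or speaker == smoothed[-1][2]):
            # Too short to be a real turn (or same speaker): absorb into the previous one.
            prev_start, _, prev_speaker = smoothed[-1]
            smoothed[-1] = (prev_start, end, prev_speaker)
        else:
            smoothed.append((start, end, speaker))
    return smoothed
```

A 0.4-second "interruption" by a supposed second speaker simply disappears into the surrounding turn, which is usually what a human transcriber would do too.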
But this solution, while more accurate, introduced its own challenges:
- Processing Time: The more sophisticated analysis required even more processing time and resources.
- Error Propagation: If the system misidentified a speaker early in the episode, this error could propagate through the entire transcript.
- Timeline Sync: Maintaining accurate synchronization between the transcript, speaker labels, and podcast timeline markers became increasingly complex.
Discovering AssemblyAI (Version 5)
Just as we were deep in the process of refining our custom solution, we discovered AssemblyAI. This service promised something compelling: integrated speaker diarization with strong French language support.
After our experience building these features ourselves, we could appreciate the significance of this offering.
The switch to AssemblyAI proved transformative:
- Unified Processing: What we had been trying to accomplish with multiple systems and complex coordination was now handled in a single, integrated process.
- Improved Accuracy: The service showed notably better accuracy with French content, especially with multiple speakers.
- Faster Processing: What had been taking hours could now be done in a fraction of the time.
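Part of what made the switch feel so clean is the shape of the output: AssemblyAI returns diarized utterances that each carry a speaker label, timestamps, and text. A sketch of turning that shape into a readable transcript, with plain dicts standing in for the SDK's objects and timestamps assumed to be in milliseconds:

```python
def format_transcript(utterances):
    """Render diarized utterances as readable, timestamped lines.

    Each utterance is a dict with "speaker", "start" (in milliseconds),
    and "text", mirroring the shape of AssemblyAI's diarized output."""
    lines = []
    for u in utterances:
        seconds = u["start"] // 1000
        stamp = f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"
        lines.append(f"[{stamp}] Speaker {u['speaker']}: {u['text']}")
    return "\n".join(lines)
```

Everything we had been hand-rolling in Versions 3 and 4 reduces, on the consumption side, to a formatting loop.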
This was a humbling but valuable lesson in technology choices. Sometimes the best solution isn't building everything yourself, but finding the right tool that already solves your problems effectively.
The Web Solution (Version 6)
Our initial web interface implementation seemed straightforward: create a simple upload form, process the file, and return the results. However, as we started development, we realized that moving from command line to web brought its own set of considerations:
- File Handling: We were no longer dealing with local files in a controlled environment. We needed to:
- Handle various audio formats (MP3, WAV, OGG, M4A, etc.)
- Manage secure file uploads
- Process files of varying sizes and quality
- Clean up temporary files after processing
- User Feedback: Command line users could see processing progress through log messages, but web users needed a different approach:
- Progress indicators for file uploads
- Status updates during processing
- Clear error messages when things went wrong
- Estimated completion times
- Service Selection: We decided to give users choice in their processing approach:
- Option to use either OpenAI or AssemblyAI
- Language selection for different content
- Quality vs. speed trade-offs
- Result Presentation: The transcript needed to be more than just text on a screen:
- Easy-to-read formatting with speaker labels
- Timeline navigation
- Copy and download options
- On-demand summarization of sections
The most significant challenge was managing user expectations during processing. What felt acceptable when running a local script became frustrating when waiting for a web page to respond.
This led us to implement real-time processing updates and allow users to work with the transcript while the rest of the file was still being processed.
When it came to security, we had to be particularly careful. Processing audio files on a web server introduced new risks:
- Potential exposure of API keys
- Server resource consumption
- Storage of sensitive content
- User data privacy
Our solution was to implement strict file validation, secure temporary storage, and automatic cleanup processes. We made it clear to users that their files were processed and then immediately deleted, never stored permanently on our servers.
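A sketch of those safeguards, validation plus self-cleaning temporary storage; the extension whitelist and the size cap are illustrative values, not our exact configuration:

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a"}
MAX_SIZE_BYTES = 500 * 1024 * 1024  # assumed cap; the real limit is a product choice

def validate_upload(filename, size_bytes):
    """Return a rejection message, or None if the upload passes validation."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"Unsupported format: {ext or 'no extension'}"
    if size_bytes > MAX_SIZE_BYTES:
        return "File too large"
    return None

@contextmanager
def temporary_upload(data, suffix=".mp3"):
    """Hold an upload in temporary storage and guarantee deletion afterwards,
    even if processing raises an error."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()
        yield Path(tmp.name)
    finally:
        Path(tmp.name).unlink(missing_ok=True)  # the file never outlives the request
```

The `finally` block is the whole point: whether transcription succeeds or blows up halfway through, the user's audio is gone from the server afterwards.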
A particular breakthrough came when we added on-demand summarization. Users could select any portion of the transcript and generate a quick summary, making it easier to navigate long episodes and find specific content. This feature transformed the tool from a simple transcription service into a content analysis platform.
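Behind the scenes, the selection step of on-demand summarization is just an interval query over the transcript: gather every segment overlapping the user's selection and hand that text to the model. A sketch of the selection logic, with the summarization call itself left out:

```python
def select_for_summary(segments, start, end):
    """Collect the text of transcript segments overlapping the selected
    [start, end] window (in seconds). The joined result is what would be
    sent to the summarization model."""
    picked = [text for seg_start, seg_end, text in segments
              if seg_start < end and seg_end > start]  # standard interval-overlap test
    return " ".join(picked)
```

Segments that merely touch the selection are included, so the model always sees complete sentences around the boundary.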
Conclusion: Beyond Simple Transcription
The Journey of Evolution
What began as a straightforward attempt to transcribe podcast episodes evolved into something far more complex and interesting. Each step of our journey revealed new layers of the problem we hadn't initially considered:
- We started by thinking we just needed to convert speech to text (Version 1)
- Then discovered we needed to handle long-form content efficiently (Version 2)
- Realized the importance of preserving speaker identity (Version 3)
- Learned that speaker identification needed to be consistent and accurate (Version 4)
- Discovered that sometimes the best solution is knowing when to use existing tools (Version 5)
- Finally understood that technical capability means nothing without accessibility (Version 6)
Key Insights
The Power of Incremental Problem-Solving
Perhaps the most valuable lesson from this journey was the importance of incremental problem-solving. Each version of our system didn't just add features – it responded to specific limitations we discovered in real-world use. This approach of "solve one problem, discover the next" proved more effective than trying to anticipate all requirements upfront.
The Balance of Build vs. Buy
Our transition from custom-built solutions to AssemblyAI highlighted an important reality of modern software development: sometimes the best solution isn't building everything yourself. The real skill lies in knowing when to build custom solutions and when to leverage existing tools. Our early attempts at building everything ourselves gave us the knowledge to better evaluate and integrate third-party solutions when we found them.
Final Thoughts
The journey from basic transcription script to full-featured web application mirrors many aspects of software development as a whole: what seems simple at first often reveals hidden complexity, and the best solutions come from understanding not just the technical challenges, but the human needs behind them.
My experience shows that success in software development isn't just about solving technical problems – it's about remaining flexible enough to recognize when your understanding of the problem itself needs to evolve. Sometimes the most important breakthroughs come not from building better solutions, but from better understanding what you're really trying to achieve.
To the next project!