Achille Morin Lemoine

My first coding side project: Audio to Text

Every year or so, I fantasize about learning how to code, in some way or another.

But I was doing things wrong.

In the same way I realized that theoretical knowledge would not get me into entrepreneurship, I eventually figured out that piling up bits of unfinished courses would not teach me how to code in real life.

Recently, when I got interested in audio-to-text algorithms and RAG, I began by picking up a coding course on Brilliant (which is great, by the way). After a few hours, realizing I was following the same pattern again, I made myself a promise: this time, I would have a demo app live somewhere on the Internet.

My goal was to better understand what it takes to push an AI app into production. That includes digging into how LLMs work, the difference between open-source and commercial models, how to implement RAG... And, on top of that, all the normal steps of releasing a web app (the infra, how the scripts fit together, even Git basics!).

Above all, I would not try to code myself but rather have an AI do it for me. After all, I don't want to become a developer, I just want to understand how tech works.

After a few dozen hours of tweaking, and as simple as it may look, I am proud to present to the world my very first deployed app:

Audio To Text

Note: the page will probably take 30-60 seconds to load. Here is the GitHub repo if it does not load at all.

The Process

I picked a simple problem: I would love to chat with the transcripts of my favorite French podcast, Generation Do It Yourself (400+ episodes of 2-4 hours each).

I envisioned a database of all transcripts that I could read instead of listening to, plus a chatbot feature that would retrieve information across them.

That seemed straightforward: take the audio, convert it to text, and make it searchable. However, the project, which began as a basic transcription tool, evolved into a sophisticated audio processing system through a series of challenges and discoveries.

Side note: since I am having Claude do all the work, I will use "we" from here on.

Starting Simple (Version 1)

We started with the simplest possible solution: a local implementation using OpenAI's Whisper model. This approach seemed logical - Whisper was open-source, well-documented, and had shown promising results in various languages, including French.

Our first implementation was basic but functional:
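The original script is not reproduced here, but a first local-Whisper pass looked roughly like the sketch below; the model size and the write-the-transcript-next-to-the-audio convention are my assumptions, not the project's actual code.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Where to save the transcript: same name as the audio, .txt extension."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_locally(audio_path: str, model_name: str = "base") -> str:
    """Run OpenAI's open-source Whisper model on a local file (French audio)."""
    import whisper  # lazy import: requires `pip install openai-whisper`

    model = whisper.load_model(model_name)  # "base" is an assumed model size
    result = model.transcribe(audio_path, language="fr")
    text = result["text"]
    transcript_path(audio_path).write_text(text, encoding="utf-8")
    return text
```

A few lines of code, which is exactly why this felt like the logical starting point.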

However, this simple approach quickly revealed several critical limitations.

Moving to the Cloud (Version 2)

These initial challenges led us to the first major pivot: moving from local processing to OpenAI's Whisper API. This decision wasn't just about offloading processing power; it was about accessing a more robust and optimized implementation of the model.

However, this solution introduced its own set of challenges. Long episodes far exceed the API's 25 MB upload limit, so we had to split each file into chunks, which raised new questions:

  1. How to ensure smooth transitions between chunks?
  2. What was the optimal chunk size for balancing processing time and accuracy?
  3. How to maintain speaker consistency across chunk boundaries?

Through experimentation, we settled on 20-minute chunks as our sweet spot. This duration was long enough to maintain context but short enough to avoid API timeouts and manage costs effectively. However, this solution, while functional, highlighted our next major challenge: speaker identification.
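In code, the chunking logic amounts to something like this sketch. The 20-minute window is the value from our experiments; the 30-second overlap is a hypothetical value I use here to illustrate smoothing the seams between chunks.

```python
CHUNK_SECONDS = 20 * 60   # the 20-minute sweet spot from our experiments
OVERLAP_SECONDS = 30      # hypothetical overlap to help stitch chunk boundaries


def chunk_bounds(duration_s: float,
                 chunk_s: int = CHUNK_SECONDS,
                 overlap_s: int = OVERLAP_SECONDS) -> list[tuple[float, float]]:
    """Split an episode into (start, end) windows with a small overlap."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # re-transcribe the seam to stitch text later
    return bounds
```

A 3-hour episode (10,800 seconds) yields ten windows, each overlapping the previous one by 30 seconds.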

The chunk-based approach worked for basic transcription but made it difficult to maintain speaker consistency throughout an episode. When we split a conversation into 20-minute segments, we lost the broader context that helps identify who's speaking. This realization led us to our next major evolution: the integration of speaker diarization.

Enter Speaker Diarization (Version 3)

Our first attempt at solving this problem introduced pyannote.audio, an open-source speaker diarization toolkit. The initial implementation was straightforward: process the audio to identify different speakers, then label each segment of the transcript accordingly.
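A minimal sketch of this stage, assuming pyannote's pretrained pipeline plus a small helper that labels each transcript segment with whichever speaker overlaps it most. The token placeholder and the helper are mine, not the project's actual code.

```python
def dominant_speaker(seg_start: float, seg_end: float, turns) -> str:
    """turns: list of (start, end, speaker). Pick the speaker whose diarized
    turns overlap this transcript segment the most."""
    overlap = {}
    for t_start, t_end, speaker in turns:
        shared = min(seg_end, t_end) - max(seg_start, t_start)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"


def diarize(audio_path: str):
    """Run pyannote.audio's pretrained pipeline (requires a Hugging Face token)."""
    from pyannote.audio import Pipeline  # lazy import: pip install pyannote.audio

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="HF_TOKEN")  # placeholder
    diarization = pipeline(audio_path)
    return [(turn.start, turn.end, speaker)
            for turn, _, speaker in diarization.itertracks(yield_label=True)]
```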

However, this seemingly simple addition exposed new layers of complexity.

This version taught us a crucial lesson: adding features often means rethinking your entire approach. What worked for simple transcription wasn't necessarily going to work when adding more sophisticated audio analysis.

A More Sophisticated Solution (Version 4)

Having learned from our initial diarization attempts, we developed a more sophisticated approach. Instead of processing speaker identification independently for each chunk, we implemented a system that could maintain speaker consistency across an entire episode.

The key innovation was speaker clustering - collecting voice characteristics across the whole episode and using this data to ensure consistent speaker identification. We also added smoothing algorithms to reduce sporadic speaker changes and false identifications.
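The smoothing idea can be sketched as a single pass that reassigns very short speaker runs to their neighbor; `min_run` is an illustrative threshold, not the value we actually used, and this is a simplification of the full clustering approach.

```python
def smooth_speakers(labels: list[str], min_run: int = 3) -> list[str]:
    """Relabel speaker runs shorter than min_run to the previous speaker,
    suppressing one-off misidentifications at segment boundaries."""
    if not labels:
        return []
    smoothed = labels[:]
    i = 0
    while i < len(smoothed):
        j = i
        while j < len(smoothed) and smoothed[j] == smoothed[i]:
            j += 1  # find the end of the current run
        if i > 0 and (j - i) < min_run:
            for k in range(i, j):  # absorb the short run into its neighbor
                smoothed[k] = smoothed[i - 1]
        i = j
    return smoothed
```

A lone "speaker B" segment sandwiched between long "speaker A" stretches gets folded back into A, which is exactly the sporadic flicker we wanted to eliminate.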

But this solution, while more accurate, introduced its own challenges.

Discovering AssemblyAI (Version 5)

Just as we were deep in the process of refining our custom solution, we discovered AssemblyAI. This service promised something compelling: integrated speaker diarization with strong French language support.

After our experience building these features ourselves, we could appreciate the significance of this offering.

The switch to AssemblyAI proved transformative.
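For reference, the AssemblyAI SDK reduces most of the above to roughly the following; the API key and the output format are placeholders of mine, not our production code.

```python
def format_utterances(utterances) -> str:
    """Render (speaker, text) pairs as a readable labeled transcript."""
    return "\n".join(f"Speaker {speaker}: {text}" for speaker, text in utterances)


def transcribe_with_assemblyai(audio_path: str) -> str:
    """One call replaces our whole chunking + diarization pipeline."""
    import assemblyai as aai  # lazy import: pip install assemblyai

    aai.settings.api_key = "YOUR_API_KEY"  # placeholder
    config = aai.TranscriptionConfig(speaker_labels=True, language_code="fr")
    transcript = aai.Transcriber().transcribe(audio_path, config=config)
    return format_utterances((u.speaker, u.text) for u in transcript.utterances)
```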

This was a humbling but valuable lesson in technology choices. Sometimes the best solution isn't building everything yourself, but finding the right tool that already solves your problems effectively.

The Web Solution (Version 6)

Our initial web interface implementation seemed straightforward: create a simple upload form, process the file, and return the results. However, as we started development, we realized that moving from command line to web brought its own set of considerations:

  1. File handling: we were no longer dealing with local files in a controlled environment.
  2. User feedback: command line users could see processing progress through log messages, but web users needed a different approach.
  3. Service selection: we decided to give users a choice of processing service.
  4. Result presentation: the transcript needed to be more than just text on a screen.
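The upload path can be sketched as follows, assuming Flask; the extension whitelist and size cap are illustrative values, not our exact configuration.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}  # assumed accepted formats
MAX_BYTES = 200 * 1024 * 1024  # hypothetical 200 MB upload cap


def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Server-side validation: extension whitelist plus a hard size limit."""
    suffix = Path(filename).suffix.lower()
    return suffix in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES


def create_app():
    """Minimal Flask upload endpoint (sketch)."""
    from flask import Flask, jsonify, request  # lazy import: pip install flask

    app = Flask(__name__)

    @app.post("/upload")
    def upload():
        f = request.files.get("audio")
        if f is None or not is_acceptable_upload(f.filename, request.content_length or 0):
            return jsonify(error="invalid file"), 400
        # hand the file off to the transcription worker here
        return jsonify(status="processing"), 202

    return app
```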

The most significant challenge was managing user expectations during processing. What felt acceptable when running a local script became frustrating when waiting for a web page to respond.

This led us to implement real-time processing updates and allow users to work with the transcript while the rest of the file was still being processed.
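One simple way to support this is an in-memory job store that the page polls after each chunk completes. This is a sketch of the idea, not our exact implementation; a production app might keep this state in Redis instead of a module-level dict.

```python
jobs: dict[str, dict] = {}  # in-memory job store, keyed by a job id


def start_job(job_id: str, total_chunks: int) -> None:
    """Register a new transcription job before processing begins."""
    jobs[job_id] = {"done": 0, "total": total_chunks, "partial_text": ""}


def report_chunk(job_id: str, chunk_text: str) -> dict:
    """Called after each chunk finishes. The web page polls this state to show
    a progress bar and let users read the transcript while the rest processes."""
    job = jobs[job_id]
    job["done"] += 1
    job["partial_text"] += chunk_text
    job["percent"] = round(100 * job["done"] / job["total"])
    return job
```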

When it came to security, we had to be particularly careful, because processing audio files on a web server introduced new risks.

Our solution was to implement strict file validation, secure temporary storage, and automatic cleanup processes. We made it clear to users that their files were processed and then immediately deleted, never stored permanently on our servers.
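The temp-file lifecycle is the kind of thing a context manager makes hard to get wrong; here is a sketch of the idea, assuming uploads arrive as bytes.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_upload(data: bytes, suffix: str = ".mp3"):
    """Write the upload to a private temp file and guarantee deletion afterwards,
    so nothing lingers on the server even if processing fails."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = Path(tmp.name)
    try:
        yield path
    finally:
        path.unlink(missing_ok=True)  # cleanup runs even if processing raised
```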

A particular breakthrough came when we added on-demand summarization. Users could select any portion of the transcript and generate a quick summary, making it easier to navigate long episodes and find specific content. This feature transformed the tool from a simple transcription service into a content analysis platform.

Conclusion: Beyond Simple Transcription

The Journey of Evolution

What began as a straightforward attempt to transcribe podcast episodes evolved into something far more complex and interesting. Each step of our journey revealed new layers of the problem we hadn't initially considered:

  1. We started by thinking we just needed to convert speech to text (Version 1)
  2. Then discovered we needed to handle long-form content efficiently (Version 2)
  3. Realized the importance of preserving speaker identity (Version 3)
  4. Learned that speaker identification needed to be consistent and accurate (Version 4)
  5. Discovered that sometimes the best solution is knowing when to use existing tools (Version 5)
  6. Finally understood that technical capability means nothing without accessibility (Version 6)

Key Insights

The Power of Incremental Problem-Solving

Perhaps the most valuable lesson from this journey was the importance of incremental problem-solving. Each version of our system didn't just add features – it responded to specific limitations we discovered in real-world use. This approach of "solve one problem, discover the next" proved more effective than trying to anticipate all requirements upfront.

The Balance of Build vs. Buy

Our transition from custom-built solutions to AssemblyAI highlighted an important reality of modern software development: sometimes the best solution isn't building everything yourself. The real skill lies in knowing when to build custom solutions and when to leverage existing tools. Our early attempts at building everything ourselves gave us the knowledge to better evaluate and integrate third-party solutions when we found them.

Final Thoughts

The journey from basic transcription script to full-featured web application mirrors many aspects of software development as a whole: what seems simple at first often reveals hidden complexity, and the best solutions come from understanding not just the technical challenges, but the human needs behind them.

My experience shows that success in software development isn't just about solving technical problems – it's about remaining flexible enough to recognize when your understanding of the problem itself needs to evolve. Sometimes the most important breakthroughs come not from building better solutions, but from better understanding what you're really trying to achieve.

To the next project!