I built a local audio transcriber in a couple of hours with Claude Code

Almost all my clients would rather talk than write. I get WhatsApp voice notes of three, five, eight minutes, with topic changes, background noise and a closing afterthought like “hey, what if we make it blue?”. Listening to them in real time doesn’t scale. Neither does taking notes while listening.

The natural move was to throw the audio at a transcription API and be done. But I didn’t want to:

Pay per minute for something I’ll use every day.
Upload client audio to a service I don’t control: clients sometimes say passwords in those notes, or things they’d rather not repeat.
Get locked in to a provider that changes its pricing tomorrow.

This morning I didn’t know Whisper existed. This afternoon I have whisper-transcriber running: a local web app that takes audio files, transcribes them in Spanish and returns .txt files. Everything runs inside a container, and no audio is sent to the internet.

whisper-transcriber demo: upload audio, pick a model, transcribe and download the results.

But the day didn’t start in Claude Code. It started in another conversation.

the brief

Before opening the editor, I sat down with Claude (the chat version) and described the problem: long audio, in Spanish, without paying for APIs, without uploading anything to the cloud. The conversation turned into a technical plan that covered almost everything I’d need afterward — the stack, the endpoints, the configuration details, even the UI decisions.

A few lines from the brief I wouldn’t have known how to write from scratch:

“API route at src/app/api/transcribe/route.ts with runtime = 'nodejs' and maxDuration = 300.”

“Store audio temporarily in the project’s /tmp, with unique names (randomUUID). Delete the audio file and the generated .txt after reading the transcription.”

“Set bodySizeLimit to 50mb in experimental.serverActions — WhatsApp audio can be heavy.”

“Process the audio in SERIES, not in parallel, to avoid saturating CPU/RAM.”

Each of those lines solves a problem I didn’t yet know existed. Knowing ahead of time where the potholes are and how to avoid them replaces the part of the work you normally do by tripping over each one once.
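The temporary-file line of the brief, for instance, could be sketched roughly like this. withTempAudio is a hypothetical helper, not the code Claude Code generated. It assumes a Node.js runtime and only handles the audio file; the generated .txt would be deleted the same way once its path is known.

```typescript
import { randomUUID } from "node:crypto";
import { mkdir, rm, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical helper: write the upload under a unique name, run some work
// against it, and delete the file afterwards whether the work succeeded or not.
async function withTempAudio<T>(
  tmpDir: string,
  data: Uint8Array,
  extension: string,
  work: (audioPath: string) => Promise<T>,
): Promise<T> {
  await mkdir(tmpDir, { recursive: true });
  // randomUUID keeps two uploads with the same original name from colliding.
  const audioPath = join(tmpDir, `${randomUUID()}${extension}`);
  await writeFile(audioPath, data);
  try {
    return await work(audioPath);
  } finally {
    // force: true means a file that is already gone is not an error.
    await rm(audioPath, { force: true });
  }
}
```

The try/finally is the point of the sketch: the cleanup runs on the error path too, so a failed transcription does not leave client audio sitting on disk.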

And the line that closed the brief, my favorite:

“If you find any minor ambiguity, decide it yourself with good judgment and tell me at the end what you decided.”

That instruction is what makes the next session productive: Claude Code decides the small details without stopping to ask, and reports at the end what it decided.

from brief to app

I handed the brief to Claude Code. In one pass the whole app came out: batch file upload, model selector (tiny, base, small), serial processing with per-file status (pending, processing, done, error), individual download or all combined into a single .txt with separators. At the end, a short list of “these are the small decisions I made” — exactly what the brief had asked of it.
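The serial processing with per-file status can be sketched as a plain loop that awaits each file before starting the next. This is an illustration under assumptions, not the generated code: processInSeries, Job and the injected transcribe function are hypothetical names, and in the real app transcribe would upload the file to the route handler.

```typescript
type Status = "pending" | "processing" | "done" | "error";

interface Job {
  name: string;
  status: Status;
  text?: string;
  error?: string;
}

// Hypothetical signature: stands in for the upload to the transcription route.
type Transcribe = (name: string) => Promise<string>;

async function processInSeries(
  names: readonly string[],
  transcribe: Transcribe,
  onChange: (jobs: readonly Job[]) => void = () => {},
): Promise<Job[]> {
  const jobs = names.map((name): Job => ({ name, status: "pending" }));
  for (const job of jobs) {
    job.status = "processing";
    onChange(jobs);
    try {
      // Awaiting inside the loop is what makes this serial: one file at a time.
      job.text = await transcribe(job.name);
      job.status = "done";
    } catch (error) {
      // A failed file is recorded and the batch moves on to the next one.
      job.error = error instanceof Error ? error.message : String(error);
      job.status = "error";
    }
    onChange(jobs);
  }
  return jobs;
}
```

Swapping the loop for Promise.all would start every transcription at once, which is exactly the CPU and RAM saturation the brief warns about.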

The heart is a Next.js route handler that receives the file, saves it to tmp/, calls whisper, reads the resulting .txt and returns it:

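// Route segment config: run on the Node.js runtime (the handler writes to disk)
// and allow up to 300 seconds per request.
export const runtime = "nodejs";
export const maxDuration = 300;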

export async function POST(request: Request): Promise<Response> {
  const formData = await request.formData();
  const file = formData.get("file");
  const model = formData.get("model");

  if (!(file instanceof File)) {
    return Response.json({ error: "Missing 'file'" }, { status: 400 });
  }
  if (file.size > MAX_BYTES) {
    return Response.json({ error: "File too large" }, { status: 413 });
  }
  if (!isModel(model)) {
    return Response.json({ error: "Invalid 'model'" }, { status: 400 });
  }

  // ...
}

The model validation leans on a tiny type guard, so the client can’t force the download of a model we don’t support:

export const MODELS = ["tiny", "base", "small"] as const;
export type Model = (typeof MODELS)[number];

export function isModel(value: unknown): value is Model {
  return (
    typeof value === "string" && (MODELS as readonly string[]).includes(value)
  );
}

the annoying detail: where did the `.txt` go?

There was a point in the implementation where Claude Code and I stared at the code for a while: nodejs-whisper doesn’t return the path of the output file. The whisper-cli binary with -otxt writes it next to the input, but if it internally converted to wav first, it ends up with a different name. In practice there are three possible patterns:

const candidateTxtPaths = [
  `${wavPath}.txt`, // if it converted to wav
  path.join(tmpDir, `${uuid}.txt`), // if it dropped the extension
  `${audioPath}.txt`, // if it took the original
];

let transcription: string | null = null;
for (const candidate of candidateTxtPaths) {
  try {
    transcription = await fs.readFile(candidate, "utf8");
    break;
  } catch {
    // try the next one
  }
}

Trying three paths is ugly, but it’s honest: it reflects the reality of the wrapper. The alternative was to read the whole directory and filter, which hides the problem instead of documenting it. If nodejs-whisper changes the pattern tomorrow, none of the three candidates will match and transcription will stay null, so the code fails loudly and clearly instead of silently picking up the wrong file.
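One way to make that failure explicit is to collect the paths that were tried and throw when none of them can be read. This is a sketch, not the project’s code: readFirstExisting and the injected readText function are hypothetical names, and the reader is injected only so the example runs without touching disk (in the route handler it would wrap fs.readFile with "utf8").

```typescript
// Hypothetical reader type: resolves with the file contents, rejects if missing.
type ReadText = (path: string) => Promise<string>;

async function readFirstExisting(
  candidates: readonly string[],
  readText: ReadText,
): Promise<{ path: string; text: string }> {
  const failures: string[] = [];
  for (const candidate of candidates) {
    try {
      const text = await readText(candidate);
      return { path: candidate, text };
    } catch (error) {
      // Remember why this candidate failed, then try the next one.
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${candidate}: ${reason}`);
    }
  }
  // No candidate matched: fail with every path that was tried.
  throw new Error(`No transcription file found. Tried:\n${failures.join("\n")}`);
}
```

If the wrapper ever changes its naming, the error message lists the three stale patterns, which is the documentation the directory-scan approach would have hidden.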

the Dockerfile, a separate iteration

The brief said nothing about Docker. Once the app was running, I asked Claude Code to package it so I could run it on any machine without depending on my environment. That’s where the Dockerfile came out, and it ended up being the part of the project where working with an assistant made the biggest difference.

It’s a non-trivial Dockerfile: multi-stage, native compilation of whisper.cpp, models pre-downloaded from Hugging Face at build time, layers designed to cache, and a pass that slims down the tree before the runtime stage. I had never written one like that.

FROM node:22-bookworm AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
      ffmpeg cmake build-essential curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN corepack enable
WORKDIR /app

COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile --config.package-import-method=copy

ENV WHISPER_CPP_DIR=/app/node_modules/nodejs-whisper/cpp/whisper.cpp

RUN ARCH=$(uname -m) \
 && if [ "$ARCH" = "aarch64" ] || [ "$ARCH" = "arm64" ]; then \
      EXTRA_CMAKE_FLAGS="-DCMAKE_C_FLAGS=-march=armv8.2-a+dotprod+fp16 -DCMAKE_CXX_FLAGS=-march=armv8.2-a+dotprod+fp16"; \
    else \
      EXTRA_CMAKE_FLAGS=""; \
    fi \
 && cmake -B "$WHISPER_CPP_DIR/build" -S "$WHISPER_CPP_DIR" \
      -DCMAKE_BUILD_TYPE=Release \
      -DGGML_NATIVE=OFF \
      $EXTRA_CMAKE_FLAGS \
 && cmake --build "$WHISPER_CPP_DIR/build" --config Release -j "$(nproc)"

Three details in there I wouldn’t have known to look for on my own:

package-import-method=copy in pnpm — so node_modules can be copied cleanly into the runtime stage without broken hardlinks to the global store.
GGML_NATIVE=OFF — keeps cmake from passing -march=native, which is unreliable inside a container (the image can be built on one machine and run on another).
The ARM64-specific flags — on Apple Silicon, the NEON intrinsics with dotprod explicitly require armv8.2-a+dotprod+fp16 or gcc 12 fails with "target specific option mismatch".

Any of the three on its own would have cost me an afternoon of Googling. Getting the three together, in a block I understand well enough to maintain, is what’s new.

Then come the models at build time (not at runtime, so the container starts without internet), the copy of the app code, the Next build, and a slimming pass over the tree that throws out everything not needed in production:

RUN pnpm prune --prod \
 && cd "$WHISPER_CPP_DIR/build" \
 && find . -type f \
      ! -name 'whisper-cli' \
      ! -name '*.so' \
      ! -name '*.so.*' \
      -delete \
 && find . -depth -type d -empty -delete

Only the whisper-cli binary and its .so shared libraries survive into the final stage. Without that pass, the image weighed nearly twice as much because of cmake intermediates, tests, examples and Metal shaders that aren’t used on Linux.

the flow, from the outside

Once the image is built:

docker run --rm -p 3000:3000 whisper-transcriber

I open localhost:3000, drag in the .ogg files I downloaded from WhatsApp (or any other audio), pick base, and hit Transcribe all. It takes roughly thirty seconds per three-minute audio. I download one .txt per file or all of them pasted together.
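The “all of them pasted together” download amounts to joining the transcriptions with a visible separator. A minimal sketch, assuming a separator format of my own choosing: the rule of equals signs and the names combineTranscripts and Transcript are illustrative, not what the app actually emits.

```typescript
interface Transcript {
  name: string;
  text: string;
}

// Assumed format: a rule line, the file name, another rule line, then the text.
function combineTranscripts(
  items: readonly Transcript[],
  rule: string = "=".repeat(40),
): string {
  return items
    .map(({ name, text }) => `${rule}\n${name}\n${rule}\n${text.trim()}\n`)
    .join("\n");
}
```

Putting the file name inside the separator keeps the combined .txt searchable when a batch holds many voice notes from the same client.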

two Claudes, two modes

What I take from the day isn’t the app — it’s the two-step flow.

Claude chat to think. The part where you don’t yet know what tools exist, what stack makes sense, what technical decisions you have to make before touching code. An open conversation produces a brief you wouldn’t have written from scratch. What’s notable isn’t that Claude knows about whisper.cpp — that’s search. What’s notable is that, in the same conversation, it translates that knowledge into a concrete spec for your case: route handler, paths, configuration, constraints.

Claude Code to build. The brief is the input. The session goes from “code that exists + brief” to “code that meets the brief”. Here the conversation is more surgical: errors, adjustments, low-level decisions. The brief’s final line — “decide it yourself with good judgment and tell me at the end what you decided” — is what makes the session flow instead of stopping every five minutes.

What the two steps save me is learning from scratch the five technical layers the project touches. The part that’s still on me (understanding, deciding, correcting) is the same as when I program without assistance.

What leaves me thinking is how many small day-to-day problems, the ones where you normally resign yourself to “I’ll pay for the API and move on”, now fit inside “a conversation to think it through + a couple of hours to build it”.

The code is on GitHub. A docker build, a docker run, and you’ve got it running.
