Apple's New Speech APIs Outpace Whisper for Fast Transcription

36 points by epaga 2 weeks ago | 10 comments
  • ewuhic 2 weeks ago
    Not open source/weights, FU Apple
    • eviks 2 weeks ago
      unfortunately no actual measure of quality given in the article
      • 1659447091 2 weeks ago
        >> By harnessing SpeechAnalyzer and SpeechTranscriber on-device, the command line tool tore through the 7GB video file a full 55% faster than MacWhisper’s Large V3 Turbo model, with no noticeable difference in transcription quality.

            App                         Transcription Time
            ----------------------------------------------
            Yap (uses Apple APIs)       0:45
            MacWhisper (Large V3 Turbo) 1:41
            VidCap                      1:55
            MacWhisper (Large V2)       3:55
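
A quick sanity check of the "55% faster" figure, as a minimal Python sketch. It assumes the table's times are minutes:seconds and that "faster" means "less wall-clock time than MacWhisper (Large V3 Turbo)"; the function names are made up for illustration and are not from the article.

```python
# Timings copied from the table above, read as "minutes:seconds".
TIMES = {
    "Yap (uses Apple APIs)": "0:45",
    "MacWhisper (Large V3 Turbo)": "1:41",
    "VidCap": "1:55",
    "MacWhisper (Large V2)": "3:55",
}


def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def time_reduction(fast: str, slow: str) -> float:
    """Fraction of the slower run's time that the faster run saves."""
    return (to_seconds(slow) - to_seconds(fast)) / to_seconds(slow)


def speedup(fast: str, slow: str) -> float:
    """How many times faster the fast run is (ratio of durations)."""
    return to_seconds(slow) / to_seconds(fast)


if __name__ == "__main__":
    yap = TIMES["Yap (uses Apple APIs)"]
    for app, elapsed in TIMES.items():
        if elapsed == yap:
            continue
        print(
            f"Yap vs {app}: "
            f"{time_reduction(yap, elapsed):.0%} less time, "
            f"{speedup(yap, elapsed):.2f}x speedup"
        )
```

Under that reading, 0:45 against 1:41 is 56 seconds saved out of 101, which is about 55% less time, or roughly a 2.2x speedup.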
        
        >> All three transcription workflows had similar trouble with last names and words like “AppStories,” which LLMs tend to separate into two words instead of camel casing.
        • eviks 2 weeks ago
          Did you forget to paste the 3rd "Transcription Quality Metric" column?
          • 1659447091 2 weeks ago
            No, because I pasted the "with no noticeable difference in transcription quality." part that you missed.
        • GeekyBear 2 weeks ago
          It sounds like the quality is better than YouTube's.

          > a game changer for anyone who uses voice transcription to create text from lectures, podcasts, YouTube videos, and more... generating transcripts that I upload to YouTube because the site’s built-in transcription isn’t very good.

          and on par with Whisper.

          > SpeechAnalyzer and SpeechTranscriber – available across the iPhone, iPad, Mac, and Vision Pro – mark a significant leap forward in transcription speed without compromising on quality

          • eviks 2 weeks ago
            How much better?
            • 6510 2 weeks ago
              All I know is that the most obvious room for improvement in YouTube transcription is that it regularly gets words wrong that are mentioned 50 times, or that are even spelled out in the title or description of the video. If there are many thousands of clues that the video is about, say, LLMs, it shouldn't keep failing to spell "chat GPT".
        • czarofvan 2 weeks ago
          WER scores?
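
For context, WER (word error rate) is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The article reports no such numbers, so the following is only a minimal Python sketch of how one could be computed. The example sentences are hypothetical, and lowercasing plus whitespace splitting is an assumed normalization, not a standard one.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example of the "AppStories" split mentioned upthread.
    reference = "welcome to AppStories"
    hypothesis = "welcome to app stories"
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

Splitting one reference word into two counts as one substitution plus one insertion, so a single "AppStories" miss costs two errors against three reference words here.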