Download Japanese podcast
2024-09-06 perl podcast Mojo::UserAgent Mojolicious japanese
I was looking for some podcasts to improve my Japanese language learning. Based on recommendations from Reddit, I found the Nihongo Con Teppei podcast. The idea is to listen to a native Japanese speaker discuss various topics, which helps in getting a feel for the language and eventually understanding it.
Since I often learn offline, I wanted to download the podcast episodes in .mp3 format. Note that the pages are currently static, as the author has moved to Spotify for delivering newer episodes. Therefore, I don’t need to worry too much about future changes. Upon analyzing the web page, I noticed there are numbered index.html pages like:
- index-70.html: first batch of episodes
- index-69.html: second batch
- and so on
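The numbering above suggests the page URLs could be generated rather than typed by hand. Here is a minimal sketch in plain Perl; the helper name index_pages and the range passed to it are assumptions for illustration, and I have not confirmed how far down the numbering actually goes.

```perl
use strict;
use warnings;

# Hypothetical helper: build index page URLs counting down from $first to $last.
# The base URL is the site discussed in this post; the range is an assumption.
sub index_pages {
    my ($first, $last) = @_;
    return map { "http://teppeisensei.com/index-$_.html" } reverse $last .. $first;
}

# Example: the three highest-numbered pages
my @pages = index_pages(70, 68);
print "$_\n" for @pages;
# prints:
# http://teppeisensei.com/index-70.html
# http://teppeisensei.com/index-69.html
# http://teppeisensei.com/index-68.html
```

Each generated URL could then be fed to the user agent in turn instead of a single hard-coded page.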
Each page contains article entries with a link to a generic download page in the lower part of the entry. The link includes a parameter u that points to the .mp3 file that we want to download:
<article class="entry">
<h2 class="entry__title">
<a href="http://teppeisensei.com/article/458763341.html">#49友達について</a>
</h2>
<div class="meta">
<div class="date">
<span class="icn icn--calendar"></span><time datetime="2018年04月13日">2018年04月13日</time>
</div>
</div>
<div class="article article--entry">
<div class="audio-link">
<audio style="max-width: 100%" controls="controls">
このブラウザでは再生できません。
<source src="https://blog.seesaa.jp/pages/tools/download/play?d=e7ba055eebd78bb90650b70a0679c452&u=https://teppeisensei.up.seesaa.net/image/Nihongo20con20Teppei2349.mp3">
</audio>
<br>
再生できない場合、ダウンロードは🎵
<a href="https://blog.seesaa.jp/pages/tools/download/index?d=e7ba055eebd78bb90650b70a0679c452&u=https://teppeisensei.up.seesaa.net/image/Nihongo20con20Teppei2349.mp3">
こちら
</a>
</div>
<a name="more"></a>
</div>
</article>
Using this, it’s relatively straightforward to build a simple script based on the Mojolicious framework and its web client, Mojo::UserAgent.
use 5.16.3;
use Mojo::URL;
use Mojo::UserAgent;

my $ua   = Mojo::UserAgent->new;
my $page = 'http://teppeisensei.com/index-66.html';

# fetch the index page and parse it into a DOM object
my $dom = $ua->get($page)->result->dom;

my @download_urls = $dom->find('a')
    # keep only links whose href attribute mentions an .mp3 file
    ->grep(sub { defined $_->attr('href') && $_->attr('href') =~ /\.mp3/ })
    # parse the href of the download page link
    ->map(sub { Mojo::URL->new($_->attr('href')) })
    # the u query parameter holds the direct URL of the .mp3 file
    ->map(sub { Mojo::URL->new($_->query->param('u')) })
    ->each;

for my $url (@download_urls) {
    # use the last path segment as the local filename
    my $filename = $url->path->parts->[-1];
    # skip episodes that were already downloaded
    next if -e $filename;
    warn "$url -> $filename\n";
    my $tx = $ua->get($url);
    $tx->result->save_to($filename);
}
Here are a few notes on the script above:
- Mojolicious provides well-structured objects, making it easy to manipulate data. For example, to obtain the @download_urls variable, you can turn the user agent result into a DOM parser, find all links using Mojo::Collection, filter URLs containing .mp3, and extract the u parameter as a Mojo::URL object.
- The download section extracts the filename from the URL, fetches the .mp3 file, and saves it locally.
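The parameter extraction described in the notes can also be tried in isolation, without any network access. The following is a minimal sketch that feeds the download link from the sample entry to Mojo::URL; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Mojo::URL;

# href of the download link, copied from the sample entry shown earlier
my $href = 'https://blog.seesaa.jp/pages/tools/download/index'
    . '?d=e7ba055eebd78bb90650b70a0679c452'
    . '&u=https://teppeisensei.up.seesaa.net/image/Nihongo20con20Teppei2349.mp3';

# the u query parameter carries the direct URL of the .mp3 file
my $mp3_url = Mojo::URL->new(Mojo::URL->new($href)->query->param('u'));

# the last path segment serves as the local filename
my $filename = $mp3_url->path->parts->[-1];

print "$mp3_url\n";
print "$filename\n";
# prints:
# https://teppeisensei.up.seesaa.net/image/Nihongo20con20Teppei2349.mp3
# Nihongo20con20Teppei2349.mp3
```

Running a single link through these two steps is a quick way to check the parsing before pointing the script at a whole page.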
I can highly recommend the podcast, and this quick script allowed me to enjoy it offline.