How do I extract website addresses from HTML (Linux)?

I have files with roughly the following content (the full source code of websites):
' Art Ever<em>&bull;</em></span></a></div>
<div class="sect clearfix"><a data="foo" class="logo" href="http://tasteofcountry.com" target="_blank" rel="nofollow"><img src="https://s3.amazonaws.com/tsm-images/logos/footer/204-light.png?id=78" alt="Logo for the website http://tasteofcountry.com"></a><a data-id="295354" class="article" href="http://tasteofcountry.com/shocking-country-music-splits/" title="Love Hurts: Most Shocking Country Music Splits" target="_blank" rel="nofollow"><span>Love Hurts: Most Shocking Country Music Splits<em>&bull;</em></span></a><a data-id="295275" class="article" href="http://tasteofcountry.com/reba-mcentire-narvel-blackstock-relationship-timeline/" title="Their Last 30 Years: A Look Back at Reba and Narvel's Relationship" target="_blank" rel="nofollow"><span>Their Last 30 Years: A Look Back at Reba and Narvel's Relationship<em>&bull;</em></span></a></div>
</div><hr><div class="row clearfix"><div class="sect clearfix"><a class="article img" href="http://screencrush.com/official-batman-vs-superman-plot-synopsis/?footer" title="The Real Reason Batman & Superman Are Fighting" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/screencrush.com/files/2015/07/batman-vs-superman-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="The Real Reason Batman & Superman Are Fighting"><span>The Real Reason Batman & Superman Are Fighting</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://popcrush.com/stars-who-were-born-rich/?footer" title="15 Stars Who Were Born Filthy Rich" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/popcrush.com/files/2015/04/born-rich-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="15 Stars Who Were Born Filthy Rich"><span>15 Stars Who Were Born Filthy Rich</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://diffuser.fm/offensive-band-names/?footer" title="27 Most Offensive Band Names Ever" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/diffuser.fm/files/2015/03/offensive-band-names.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="27 Most Offensive Band Names Ever"><span>27 Most Offensive Band Names Ever</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer" title="Spectacular Behind-the-Scenes Pics From Comic Book Movies" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="Spectacular Behind-the-Scenes Pics From Comic Book Movies"><span>Spectacular Behind-the-Scenes Pics From Comic Book Movies</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer" title="Surprising Taylor Swift Facts You Probably Didn't Know" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="Surprising Taylor Swift Facts You Probably Didn't Know"><span>Surprising Taylor Swift Facts You Probably Didn't Know</span></a></div>


How can I pull out of this code only the addresses that start with "http:// or "https:// and end with the " character, and write them to a file, one URL per line?
The output should look like this:
http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer
http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg
http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer
4 answers
@abcd0x00
Do it in two passes: first prepare the links, then extract them.
For the text above, saved as file.html:
[guest@localhost tmp]$ cat "file.html" | sed 's/"http/\n&/g' | sed -n 's/^"\(http[^"]*\)".*/\1/p'
http://tasteofcountry.com
https://s3.amazonaws.com/tsm-images/logos/footer/204-light.png?id=78
http://tasteofcountry.com/shocking-country-music-splits/
http://tasteofcountry.com/reba-mcentire-narvel-blackstock-relationship-timeline/
http://screencrush.com/official-batman-vs-superman-plot-synopsis/?footer
http://wac.450f.edgecastcdn.net/80450F/screencrush.com/files/2015/07/batman-vs-superman-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://popcrush.com/stars-who-were-born-rich/?footer
http://wac.450f.edgecastcdn.net/80450F/popcrush.com/files/2015/04/born-rich-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://diffuser.fm/offensive-band-names/?footer
http://wac.450f.edgecastcdn.net/80450F/diffuser.fm/files/2015/03/offensive-band-names.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer
http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer
http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
[guest@localhost tmp]$
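To spell out the two passes on a smaller input, here is a minimal sketch, assuming GNU sed (which turns \n in the replacement text into a newline). The function name two_pass_urls, the sample markup and the urls.txt file are made up for illustration. The last step also covers the "write them to a file" part of the question.

```shell
#!/bin/sh
# Illustrative sketch of the two-pass approach; assumes GNU sed.
# two_pass_urls, the sample markup and urls.txt are hypothetical names.

two_pass_urls() {
    # Pass 1: start a new line before every "http so each URL begins a line.
    # Pass 2: keep only lines that start with "http and print the part
    #         between the opening quote and the next quote.
    sed 's/"http/\n&/g' "$1" | sed -n 's/^"\(http[^"]*\)".*/\1/p'
}

workdir=$(mktemp -d)
printf '%s\n' '<a href="http://a.example/x"><img src="https://b.example/y.png" alt="z"></a>' > "$workdir/sample.html"

# What pass 1 alone produces (the URLs now sit at the start of a line):
sed 's/"http/\n&/g' "$workdir/sample.html"

# Both passes, written to a file with one URL per line:
two_pass_urls "$workdir/sample.html" > "$workdir/urls.txt"
cat "$workdir/urls.txt"
```

For real input, the same function could be called as two_pass_urls file.html > urls.txt.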
@Aves
grep -oP '(?<=")https?://.+?(?=")' file
or
perl -ne 'while(m/"(https?:\/\/.+?)"/g){print "$1\n"}' file
echo -e 'asdfasdfya.ru"asfafgoogle.com"adfadsf\nreddit.com"\nhttps://reddit.com/blabla"' | grep -E -o 'http://[^"]+"|https://[^"]+"' | sed 's/"//g'
sanchomaster
@sanchomaster
deployment engineer
This can probably be done even more simply:

egrep -o "(http|https)://[0-9a-z/A-Z.?=&-]*" file.html

or this one; I don't think it can be written any shorter:

egrep -o 'https?://[^"]*' file.html
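One thing that may matter for the sample in the question: patterns that do not require the opening quote also pick up URLs that appear inside attribute text, such as the alt="Logo for the website http://tasteofcountry.com" attribute. The sketch below compares the two behaviours on a tiny made-up input. The function names loose_urls and quoted_urls and the sample markup are illustrative only, and the quoted variant is one possible way to require the opening quote with plain grep -E, not something taken from the answers above.

```shell
#!/bin/sh
# Sketch comparing a pattern without the opening quote to one that requires it.
# loose_urls, quoted_urls and the sample markup are hypothetical names.

loose_urls() {
    # Matches any http(s):// run up to the next double quote,
    # including URLs that sit inside attribute text such as alt="...".
    grep -Eo 'https?://[^"]*' "$1"
}

quoted_urls() {
    # Requires the opening quote, then strips it from each match.
    grep -Eo '"https?://[^"]+' "$1" | tr -d '"'
}

sample=$(mktemp)
printf '%s\n' '<a href="http://a.example/"><img src="https://b.example/y.png" alt="Logo for http://a.example"></a>' > "$sample"

echo "without opening quote:"
loose_urls "$sample"      # also lists the URL from the alt text

echo "with opening quote:"
quoted_urls "$sample"     # lists only the href and src values
```

Which variant is right depends on whether URLs mentioned in plain text should count. The question asks only for addresses that begin with "http, which corresponds to the second function.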