
script question


kong
10-07-2001, 01:14 PM
Anyone know of a script or a fast way to strip URLs from an HTML file?

I'm trying to create a file of gallery URLs, one per line.

-Kong
icq: 96944506

salsbury
10-07-2001, 01:58 PM
this might work, but might not be perfect for all cases, and will require some work afterwards to make sure the output only contains the urls you want:


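#!/usr/bin/perl
use strict;
use warnings;

# Send all output through "sort -u" to sort it and drop duplicates.
open(STDOUT, "| sort -u") or die "cannot run sort: $!\n";
foreach my $input (@ARGV) {
    my $fh;
    if (!open($fh, '<', $input)) {
        print STDERR "$input: $!\n";
        next;
    }
    while (<$fh>) {
        chomp;
        # URLs wrapped in double quotes.
        while (/"(http:\/\/[^"]+)"/) {
            my $url = $1;
            s/\Q$url//;    # remove this URL so the next pass finds the next one
            print "$url\n";
        }
        # Unquoted URLs that end at a space, '>' or single quote.
        while (/(http:\/\/[^ >']+)[ >']/) {
            my $url = $1;
            s/\Q$url//;
            print "$url\n";
        }
    }
    close($fh);
}
close(STDOUT);    # let sort finish before the script exits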

note: it's not guaranteed that this script can catch every url listed in a file, especially those not preceded by http:// (of course) and those that had to be specially escaped for javascript. also note, it pipes the output through sort -u. this is a hack, but it's the easiest way to strip out duplicates. if you remove that first open line, you'll get the "raw" output.

usage:

perl scriptname filename.html filename2.html etc..
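
If piping through sort -u is not an option (say, there is no sort command on the machine), the duplicates can also be dropped inside Perl with a hash. This is only a sketch and not part of the script above: unique_sorted is a made-up helper name and the example.com URLs are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns its arguments
# with duplicates removed, in sorted order, roughly like "sort -u".
sub unique_sorted {
    my %seen;
    # Keep a value only the first time it is seen.
    my @unique = grep { !$seen{$_}++ } @_;
    my @sorted = sort @unique;
    return @sorted;
}

# Placeholder data standing in for the URLs the script would print.
my @urls = (
    'http://example.com/b.html',
    'http://example.com/a.html',
    'http://example.com/b.html',
);
print "$_\n" for unique_sorted(@urls);
```

In the real script you would push each $url onto an array instead of printing it, then print unique_sorted(@list) at the end. Perl's default sort compares plain strings, so the order can differ from sort -u under some locales.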

salsbury
10-08-2001, 12:38 AM
actually it can be compacted to:

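#!/usr/bin/perl
use strict;
use warnings;

# Send all output through "sort -u" to sort it and drop duplicates.
open(STDOUT, "| sort -u") or die "cannot run sort: $!\n";
foreach my $input (@ARGV) {
    my $fh;
    if (!open($fh, '<', $input)) {
        print STDERR "$input: $!\n";
        next;
    }
    while (<$fh>) {
        chomp;
        # A URL runs until a double quote, space, '>', single quote or backslash.
        while (/"*(http:\/\/[^" >'\\]+)[" >'\\]/) {
            my $url = $1;
            s/\Q$url//;    # remove this URL so the next pass finds the next one
            print "$url\n";
        }
    }
    close($fh);
}
close(STDOUT);    # let sort finish before the script exits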

i don't know why i had it in two while blocks before, hehe. must've been thinkin of somethin else. also, this fixes a bug related to javascript. btw, you can also make it list only non-image files (assuming all images are named png, gif, or jp[e]g) with:

next if $url =~ /\.(png|gif|jpe?g)$/i;

just before the print line. enjoy.
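
To show where that filter ends up, here is a rough sketch of the matching loop pulled out into a function. It is not the script above: extract_urls and its $skip_images flag are made-up names, the example.com line is a placeholder, and it walks the line with /g instead of deleting each match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the http:// URLs
# found in one line of HTML. With a true second argument, image URLs
# (png, gif, jpg, jpeg) are skipped.
sub extract_urls {
    my ($line, $skip_images) = @_;
    my @urls;
    # Same character classes as the compact script: a URL ends at a
    # double quote, space, '>', single quote or backslash.
    while ($line =~ m{(http://[^" >'\\]+)[" >'\\]}g) {
        my $url = $1;
        # The image filter sits right before the URL is kept.
        next if $skip_images && $url =~ /\.(png|gif|jpe?g)$/i;
        push @urls, $url;
    }
    return @urls;
}

# Placeholder line of HTML with one page link and one thumbnail.
my $line = '<a href="http://example.com/gallery1.html">'
         . '<img src="http://example.com/thumb1.jpg"></a>';
print "$_\n" for extract_urls($line, 1);   # prints http://example.com/gallery1.html
```

Like the original, it still needs a delimiter after the URL, so an unquoted URL at the very end of a line is not picked up.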