Author: deh3215 ()
Board: Perl
Title: Re: [Question] How to strip HTML tags from an email??
Time: Wed Mar 4 02:11:34 2009
※ Quoting flylinux (ㄚ琪):
: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
: Give this a try!
: ※ Quoting deh3215 ():
: : Using just s/<.*?>//gsi; doesn't seem to work well. Does anyone have experience stripping HTML tags with a module?
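To see where the plain s/<.*?>//gs falls short, here is a small sketch comparing it with the quote-aware pattern from the reply. Only the two regexes come from the thread; the sub names `strip_naive` and `strip_quoted` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Naive version: the match ends at the first '>', even when that '>'
# sits inside a quoted attribute value.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//gs;
    return $html;
}

# Quote-aware version from the reply: quoted attribute values are
# consumed as a unit, so a '>' inside them does not end the tag.
sub strip_quoted {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $sample = '<a title="x > y">link</a>';
print strip_naive($sample),  "\n";   # leaves part of the attribute behind
print strip_quoted($sample), "\n";   # leaves only the link text
```

Neither regex handles comments, scripts, or entities, which is one reason a parser module tends to be the safer choice for real mail.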
<FONT face=Verdana size=1><br> Dear
PayPalCustomer,<BR></div> <DIV>
<b> CONGRATULATIONS!</b><DIV> <DIV> You have
been chosen by our online department <br> to take part in our quick and
easy online departament.<br> In return we will credit $20
to your account - Just for your time! <====> the above is the content of 222.html/txt
---------------------------------------------------
Result after cleaning with HTML::FormatText:
Dear PayPal Customer,
CONGRATULATIONS! You have been chosen by our online department
to take part in our quick and easy online departament.
In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# parse_file reads the whole document and finishes the tree for us
my $tree = HTML::TreeBuilder->new->parse_file("c:/222.html");
my $formatter = HTML::FormatText->new;
print $formatter->format($tree);
---------------------------------------------------
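Result of a slightly modified version (code below): the tags all seem to be gone, but some words got wiped out as well
Dear ______ ==> PayPal, time! <==
Customer, Looks whitespace-related. Is this a bug in the module?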
CONGRATULATIONS!
You have been chosen by our online
department
to take part in our quick and easy
online
departament.
In return we will credit $20 to
your account - Just for your ______
-----------------------------------------------
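open(INPUT, "c:/222.txt") or die "Cannot open c:/222.txt: $!";
my @temp = <INPUT>;
chomp @temp;
use HTML::TreeBuilder;
use HTML::FormatText;
foreach my $string (@temp) {
    # Each line is parsed as its own document, and $tree->eof is never
    # called, so text still buffered by the parser may not reach the tree.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($string);
    my $formatter = HTML::FormatText->new;
    print $formatter->format($tree);
}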
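
The missing words are probably not a module bug. The HTML::TreeBuilder documentation says `$tree->eof` has to be called after the last `parse()` call (only `parse_file` does that for you); without it, text still buffered by the parser may never make it into the tree. A minimal sketch under that assumption, which also parses the file as one document instead of line by line. The helper name `html_to_text` is made up here; the path is the one from the post.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: convert one HTML string to plain text.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;                      # flush buffered text and finish the tree
    my $text = HTML::FormatText->new->format($tree);
    $tree->delete;                   # free the tree explicitly
    return $text;
}

# Read the whole file in one go rather than one line at a time.
if (open(my $in, '<', 'c:/222.txt')) {
    local $/;                        # slurp mode: read everything at once
    my $html = <$in>;
    close $in;
    print html_to_text($html);
}
else {
    warn "Cannot open c:/222.txt: $!\n";
}
```

Parsing the file as a whole also keeps words that wrap across source lines in the same paragraph, so the odd line breaks in the output above should go away too.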
--------------------------------------------------
Result after cleaning with HTML::Strip:
??Dear PayPalCustomer, ??CONGRATULATIONS!?? You have been chosen by our
online department ??to take part in our quick and easy online departament.
??In return we will credit $20 to your account - Just for your time!
---------------------------------------------------
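use HTML::Strip;
open(INPUT, "c:/222.txt") or die "Cannot open c:/222.txt: $!";
my @temp = <INPUT>;
chomp @temp;
foreach my $t (@temp) {
    # A fresh stripper per line; eof resets its state after each parse
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse($t);
    $hs->eof;
    print $clean_text;
}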
---------------------------------------------------
Cleaned with HTML::Strip; not sure why the question marks show up? ====> Deleting the " " from the txt file makes the question marks go away
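
One possible explanation, offered as a guess: if the " " in the file was a non-breaking space (for example written as `&nbsp;`), HTML::Strip decodes entities by default, and the resulting U+00A0 character can show up as `?` on a console that cannot display it. A small sketch that normalizes such characters after stripping, so the file itself does not have to be edited. `normalize_spaces` is a made-up helper name, and it assumes `$text` is a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn non-breaking spaces (U+00A0) into ordinary
# spaces, collapse runs of spaces/tabs, and trim both ends.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;    # non-breaking space -> plain space
    $text =~ s/[ \t]+/ /g;    # collapse runs of blanks
    $text =~ s/^ //;          # drop a leading blank
    $text =~ s/ $//;          # drop a trailing blank
    return $text;
}

print normalize_spaces("\x{A0}\x{A0}Dear PayPal Customer,"), "\n";
```

It would be called on `$clean_text` right before the `print` in the loop above.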
---------------------------------------------------
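In the end, HTML::TreeBuilder combined with HTML::FormatText still gives the cleaner result, but some randomly inserted hex strings like 0x123544 will probably have to be removed by hand @@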
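
The manual cleanup could be scripted with one more substitution. A rough sketch, assuming the junk always looks like `0x` followed by hex digits, as in the 0x123544 example; `strip_hex_tokens` is a made-up name, and the pattern would also delete any legitimate hex number in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: drop tokens that look like 0x followed by hex
# digits, together with any whitespace right after them.
sub strip_hex_tokens {
    my ($text) = @_;
    $text =~ s/\b0x[0-9A-Fa-f]+\b\s*//g;
    return $text;
}

print strip_hex_tokens("Dear 0x123544 Customer"), "\n";
```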
--
※ Posted from: PTT (ptt.cc)
◆ From: 59.116.2.192
※ Edited: deh3215 from: 140.117.168.75 (03/04 20:21)