How do I filter the hyperlinks out of fetched web page content?

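What I want to do is open every hyperlink on a web page in turn and save each one as a file.
So far I have fetched the page content, shown (in part) below: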
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><!--[30,59,1] published at 2009-07-10 21:30:13 from #150 by system--><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /><title>新浪首页</title><meta name="description" content="新浪网为全球用户24小时提供全面及时的中文资讯,内容覆盖国内外突发新闻事件、体坛赛事、娱乐时尚、产业资讯、实用信息等,设有新闻、体育、娱乐、财经、科技、房产、汽车等30多个内容频道,同时开设博客、视频、论坛等自由互动交流空间。"><meta name="stencil" content="6HtwmypggdgP1NLw7NOuQBI2TW8+CfkYCoyeB8IDbn8=" /><script type="text/javascript" src="http://i3.sinaimg.cn/home/sinaflash.js"></script><script language="javascript" type="text/javascript" src="http://d2.sina.com.cn/d1images/button/rotator.js"></script><style type="text/css">/* 全局样式 */body,ul,ol,li,p,h1,h2,h3,h4,h5,h6,form,fieldset,table,td,img,div{margin:0;padding:0;border:0;}body{background:#fff;color:#333;font-size:12px; margin-top:5px;font-family:"宋体";}ul,ol{list-style-type:none;}select,input,img,select{vertical-align:middle;}a{text-decoration:underline;}a:link{color:#009;}a:visited{color:#800080;}a:hover,a:active,a:focus{color:#c00;}.clearit{clear:both;}/* page */#page{width:950px; overflow: visible; _display:inline-block; margin:0 auto;}/* 顶部 top */.top{height:27px; position:relative; z-index:99; padding:1px; border:1px #fdd26c solid; border-bottom:1px #e1a841 solid; color:#000; background:url(http://i1.sinaimg.cn/home/deco/2008/0329/sinahome_0803_ws_001.gif) repeat-x 0 0 #fff;}.top a,.top a:visited{color:#000; text-decoration:none;}.top a:hover,.top a:active{color:#000; text-decoration:underline;}.topBlk{height:27px; overflow:hidden; _display:inline-block; background:url(http://i1.sinaimg.cn/home/deco/2008/0329/sinahome_0803_ws_001.gif) repeat-x 0 -50px
But I don't know how to filter the hyperlinks out of it.
Any help would be appreciated!

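------Solution--------------------
There are two approaches:

1. If you are programming in a browser environment (hosting MSHTML, where this method is declared on IHTMLDocument3), you can get all the <a> tags by calling the method below with the tag name "a":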
C/C++ code

HRESULT getElementsByTagName(
    BSTR v,                            // [in] tag name to look up, e.g. "a"
    IHTMLElementCollection **pelColl   // [out] receives the matching elements
);
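
Outside a browser host you only have the raw HTML string, as in the question, so there is no document object to call getElementsByTagName on. A rough stand-in is to scan the string for the href attribute of each <a> tag. The following is a minimal sketch using std::regex, not an HTML parser: `extract_hrefs` is a hypothetical helper name, and the pattern is a heuristic that assumes reasonably well-formed tags.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of MSHTML): collect the href value of every
// <a ...> tag found in a raw HTML string. A heuristic scan, not a real parser.
std::vector<std::string> extract_hrefs(const std::string& html) {
    // "<a", whitespace, any other attributes, then href = "v" | 'v' | v
    static const std::regex anchor(
        R"re(<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))re",
        std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        // Exactly one of the three capture groups takes part in a match.
        for (int g = 1; g <= 3; ++g) {
            if ((*it)[g].matched) {
                links.push_back((*it)[g].str());
                break;
            }
        }
    }
    return links;
}
```

Calling `extract_hrefs(pageSource)` on the fetched text returns the link targets in document order. Relative links such as `/sports/index.html` still have to be resolved against the page URL before they can be downloaded.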

------Solution--------------------
Regex (Perl syntax)
 
{
\b
# Match the leading part (proto://hostname, or just hostname)
(
# http://, or https:// leading part
(https?)://[-\w]+(\.\w[-\w]*)+
|
# or, try to find a hostname with more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\b
| edu\b
| biz\b
| gov\b
| in(?:t|fo)\b # .int or .info
| mil\b
| net\b
| org\b
| [a-z][a-z]\b # two-letter country code
)
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with /
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"\'<>()\[\]{}\s\x7F-\xFF]*
(
[.!,?]+ [^.!,?;"\'<>()\[\]{}\s\x7F-\xFF]+
)*
)?
}ix
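
The pattern above relies on Perl features (free-spacing /x mode, inline (?i:...) groups) that the default ECMAScript grammar of std::regex does not offer, so it cannot be pasted into C++ as is. Below is a reduced port, offered only as a sketch: it keeps the explicit http:// and https:// branch, the optional port and the path heuristics, and leaves out the bare-hostname branch and the \x7F-\xFF exclusion for simplicity. `find_urls` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: find http(s) URLs anywhere in a block of text.
// Reduced port of the Perl pattern above (scheme branch only).
std::vector<std::string> find_urls(const std::string& text) {
    static const std::regex url(
        // scheme and hostname, optional port, optional path that does not
        // end in sentence punctuation such as '.' or ','
        R"re(\bhttps?://[-\w]+(?:\.\w[-\w]*)+(?::\d+)?)re"
        R"re((?:/[^.!,?;"'<>()\[\]{}\s]*(?:[.!,?]+[^.!,?;"'<>()\[\]{}\s]+)*)?)re",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(text.begin(), text.end(), url), end;
         it != end; ++it) {
        urls.push_back(it->str());  // whole match
    }
    return urls;
}
```

Because trailing punctuation is only accepted when more URL characters follow it, a link at the end of a sentence is returned without the final period or comma. Note that this finds every URL in the text, including script and image addresses, not just those inside <a> tags.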