(Repost) A roundup of ways to fetch web page content with PHP

Posted by xajh32y on 2017-4-1 11:24:10

① Fetching web page content with PHP

http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html

header("Content-type: text/html; charset=utf-8");

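1. Using the MSXML2.XMLHTTP COM object (Windows only, requires PHP's COM extension)

// Create the XMLHTTP object and issue a synchronous GET request
$xhr = new COM("MSXML2.XMLHTTP");

$xhr->open("GET","http://localhost/xxx.php?id=2",false);

$xhr->send();

echo $xhr->responseText;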

2. Using file_get_contents

<?php

$url="http://www.blogjava.net/pts";

echo file_get_contents( $url );

?>
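For anything beyond a quick test, it may help to give file_get_contents a timeout and to check for failure. The sketch below shows one way to do that with a stream context. The helper name fetch_page and the user agent string are made up for illustration and do not come from the original post.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and return its body,
// or null when the read fails. Not part of the original post.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" options only apply when $url uses the http/https wrapper.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleFetcher/1.0', // placeholder value
        ],
    ]);

    // file_get_contents() returns false on failure; @ silences the warning
    // so the caller can handle the null return instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage: $html = fetch_page('http://www.blogjava.net/pts');
// A null result means the request failed or timed out.
```

Returning null rather than false is only a design preference here; it lets the return type be declared as ?string.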

3. Using fopen()

<?php

if ($stream = fopen('http://www.sohu.com', 'r')) {

    // print all the page starting at the offset 10

    echo stream_get_contents($stream, -1, 10);

    fclose($stream);

}

if ($stream = fopen('http://www.sohu.net', 'r')) {

    // print the first 5 bytes

    echo stream_get_contents($stream, 5);

    fclose($stream);

}

?>
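The second and third arguments of stream_get_contents() are easy to mix up: the second is the maximum number of bytes to read (-1 for everything that remains) and the third is the offset to seek to before reading. A small offline sketch, using an in-memory stream instead of a remote page (the sample data is made up):

```php
<?php
// Fill an in-memory stream with 20 known bytes so the result is easy to check.
$stream = fopen('php://memory', 'r+');
fwrite($stream, '0123456789abcdefghij');
rewind($stream);

// Read at most 5 bytes from the current position.
$first5 = stream_get_contents($stream, 5);

// Read everything (-1) after seeking to byte offset 10.
$from10 = stream_get_contents($stream, -1, 10);

fclose($stream);

echo $first5, "\n"; // 01234
echo $from10, "\n"; // abcdefghij
```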

② Fetching web page content with PHP

http://www.blogjava.net/pts/archive/2007/08/26/99188.html

The simple approach:

<?php

$url="http://www.blogjava.net/pts";

echo file_get_contents( $url );

?>

Or:

<?php

if ($stream = fopen('http://www.sohu.com', 'r')) {

    // print all the page starting at the offset 10

    echo stream_get_contents($stream, -1, 10);

    fclose($stream);

}

if ($stream = fopen('http://www.sohu.net', 'r')) {

    // print the first 5 bytes

    echo stream_get_contents($stream, 5);

    fclose($stream);

}

?>

③ Source code: fetch a site's content with PHP and save it as a TXT file

http://blog.chinaunix.net/u1/44325/showart_348444.html

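<?php

// Index page of the book to download
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';

// Extract the book's base directory from the index URL.
// ereg() was removed in PHP 7, so preg_match() is used instead. The [0-9]+
// classes are reconstructed from the URL above because the original pattern
// was garbled.
preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook);
$my_book_txt = $myBook[0];

$file_handle = fopen($my_book_url, "r"); // open the index page
if (file_exists("test.txt")) {
    unlink("test.txt"); // start from an empty output file
}
$handle = fopen("test.txt", 'a'); // open the output file once, in append mode

while (!feof($file_handle)) { // loop until the end of the index page
    $line = fgets($file_handle); // read one line
    // Look for a link to a chapter page (pattern reconstructed: any relative .html link)
    if (preg_match('/href="[^"\/]+\.html/', $line, $reg)) {
        $my_book_txt_url = str_replace('href="', '', $reg[0]);
        $my_book_txt_over_url = $my_book_txt . $my_book_txt_url; // build the chapter URL
        echo "$my_book_txt_over_url</p>"; // show progress
        $file_handle_txt = fopen($my_book_txt_over_url, "r"); // open the chapter page
        while (!feof($file_handle_txt)) {
            $line_txt = fgets($file_handle_txt);
            // Body text lines are recognized by a leading &nbsp
            if (preg_match('/^&nbsp.+/', $line_txt, $reg)) {
                // Filter out markup. The &nbsp; and &quot; entities are
                // reconstructed from the garbled original.
                $my_over_txt = str_replace("&nbsp;", " ", $reg[0]);
                $my_over_txt = str_replace("<br />", "", $my_over_txt);
                $my_over_txt = str_replace("<script language=\"javascript\">", "", $my_over_txt);
                $my_over_txt = str_replace("&quot;", "", $my_over_txt);
                fwrite($handle, "$my_over_txt\n"); // write to the output file
            }
        }
        fclose($file_handle_txt);
    }
}

fclose($handle);
fclose($file_handle); // close the index page
echo "Done</p>";

?>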
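The link-collection step of the script above can be tried without touching the network. The following is a minimal sketch, assuming chapter links are relative .html files in the same directory as the index page; the function name extract_chapter_urls and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collect relative .html links from an index page
// and turn them into absolute URLs under $baseUrl.
function extract_chapter_urls(string $indexHtml, string $baseUrl): array
{
    // Capture href values that contain no slash or colon, i.e. plain
    // relative file names such as "1001.html". Absolute links are skipped.
    preg_match_all('/href="([^"\/:]+\.html)"/i', $indexHtml, $matches);

    $urls = [];
    foreach (array_unique($matches[1]) as $relative) {
        $urls[] = rtrim($baseUrl, '/') . '/' . $relative;
    }
    return $urls;
}

// Made-up sample input for illustration.
$sample = '<a href="1001.html">Chapter 1</a> '
        . '<a href="http://example.com/x.html">ad</a> '
        . '<a href="1002.html">Chapter 2</a>';

print_r(extract_chapter_urls($sample, 'http://book.example.com/html/4/4550/'));
```

Collecting the links first and fetching them in a second pass keeps the parsing logic testable on its own.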

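Next comes a rather bolder approach.

It uses a class called Snoopy.

I first saw it here:

Snoopy, a package for fetching web page content in PHP

http://blog.declab.com/read.php/27.htm

Then there is Snoopy's official site:

http://sourceforge.net/projects/snoopy/

Some brief notes are here:

Code bookmark: the Snoopy class and simple usage
http://blog.passport86.com/?p=161

Download: http://sourceforge.net/projects/snoopy/

I only discovered this gem today and downloaded it right away to take a look. It uses parse_url.

Personally I am still more used to curl.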

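Snoopy is a PHP class that mimics a web browser; it can fetch web page content and submit forms.

Some of its features:

1. Easily fetch the contents of a web page
2. Easily fetch the text of a web page (with the HTML stripped)
3. Easily fetch the links in a web page
4. Supports proxy hosts
5. Supports basic user/password authentication
6. Supports custom user agent, referer, cookies and header content
7. Supports browser redirects, with control over the redirect depth
8. Expands the links in a page into fully qualified URLs (default)
9. Easily submit data and retrieve the results
10. Supports following HTML frames (added in v0.92)
11. Supports passing cookies on redirects

See the documentation in the download for detailed usage.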
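To give a feel for what "fetching the text of a page" involves, here is a rough sketch built only on PHP's standard functions. It is not Snoopy's implementation; the function name html_to_text is hypothetical and the cleanup rules are deliberately simple.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text.
function html_to_text(string $html): string
{
    // Drop <script> and <style> blocks together with their contents,
    // since strip_tags() would otherwise leave the code behind as text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Remove the remaining tags and decode entities such as &amp;.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<p>Hello <b>world</b> &amp; more</p><script>alert(1)</script>'), "\n";
```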

<?php

include "Snoopy.class.php";

$snoopy = new Snoopy;

$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");

print $snoopy->results;

?>

<?php

include "Snoopy.class.php";

$snoopy = new Snoopy;

$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"] = "normal";
$submit_vars["styleid"] = "1";
$submit_vars["cookietime"] = "315360000";
$submit_vars["loginfield"] = "username";
$submit_vars["username"] = "********"; // your username
$submit_vars["password"] = "*******"; // your password
$submit_vars["questionid"] = "0";
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "提   交";

$snoopy->submit($submit_url, $submit_vars);

print $snoopy->results;
?>

　　
Below is the Snoopy README:

NAME:

    Snoopy - the PHP net client v1.2.4



SYNOPSIS:

    include "Snoopy.class.php";

    $snoopy = new Snoopy;



    $snoopy->fetchtext("http://www.php.net/");

    print $snoopy->results;



    $snoopy->fetchlinks("http://www.phpbuilder.com/");

    print $snoopy->results;



    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";



    $submit_vars["q"] = "amiga";

    $submit_vars["submit"] = "Search!";

    $submit_vars["searchhost"] = "Altavista";



    $snoopy->submit($submit_url,$submit_vars);

    print $snoopy->results;



    $snoopy->maxframes=5;

    $snoopy->fetch("http://www.ispi.net/");

    echo "<PRE>\n";

    echo htmlentities($snoopy->results[0]);

    echo htmlentities($snoopy->results[1]);

    echo htmlentities($snoopy->results[2]);

    echo "</PRE>\n";

    $snoopy->fetchform("http://www.altavista.com");

    print $snoopy->results;

DESCRIPTION:

    What is Snoopy?



    Snoopy is a PHP class that simulates a web browser. It automates the

    task of retrieving web page content and posting forms, for example.

    Some of Snoopy's features:



    * easily fetch the contents of a web page

    * easily fetch the text from a web page (strip html tags)

    * easily fetch the links from a web page

    * supports proxy hosts

    * supports basic user/pass authentication

    * supports setting user_agent, referer, cookies and header content

    * supports browser redirects, and controlled depth of redirects

    * expands fetched links to fully qualified URLs (default)

    * easily submit form data and retrieve the results

    * supports following html frames (added v0.92)

    * supports passing cookies on redirects (added v0.92)





REQUIREMENTS:

    Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),

    which should be PHP 3.0.9 and up. For read timeout support, it requires

    PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.

CLASS METHODS:

    fetch($URI)

    -----------



    This is the method used for fetching the contents of a web page.

    $URI is the fully qualified URL of the page to fetch.

    The results of the fetch are stored in $this->results.

    If you are fetching frames, then $this->results

    contains each frame fetched in an array.



    fetchtext($URI)

    ---------------



    This behaves exactly like fetch() except that it only returns

    the text from the page, stripping out html tags and other

    irrelevant data.

    fetchform($URI)

    ---------------



    This behaves exactly like fetch() except that it only returns

    the form elements from the page, stripping out html tags and other

    irrelevant data.

    fetchlinks($URI)

    ----------------

    This behaves exactly like fetch() except that it only returns

    the links from the page. By default, relative links are

    converted to their fully qualified URL form.

    submit($URI,$formvars)

    ----------------------



    This submits a form to the specified $URI. $formvars is an

    array of the form variables to pass.





    submittext($URI,$formvars)

    --------------------------

    This behaves exactly like submit() except that it only returns

    the text from the page, stripping out html tags and other

    irrelevant data.

    submitlinks($URI)

    ----------------

    This behaves exactly like submit() except that it only returns

    the links from the page. By default, relative links are

    converted to their fully qualified URL form.

CLASS VARIABLES:    (default value in parenthesis)

    $host            the host to connect to

    $port            the port to connect to

    $proxy_host        the proxy host to use, if any

    $proxy_port        the proxy port to use, if any

    $agent            the user agent to masquerade as (Snoopy v0.1)

    $referer        referer information to pass, if any

    $cookies        cookies to pass if any

    $rawheaders        other header info to pass, if any

    $maxredirs        maximum redirects to allow. 0=none allowed. (5)

    $offsiteok        whether or not to allow redirects off-site. (true)

    $expandlinks    whether or not to expand links to fully qualified URLs (true)

    $user            authentication username, if any

    $pass            authentication password, if any

    $accept            http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)

    $error            where errors are sent, if any

    $response_code    response code returned from server

    $headers        headers returned from server

    $maxlength        max return data length

    $read_timeout    timeout on read operations (requires PHP 4 Beta 4+)

                    set to 0 to disallow timeouts

    $timed_out        true if a read operation timed out (requires PHP 4 Beta 4+)

    $maxframes        number of frames we will follow

    $status            http status of fetch

    $temp_dir        temp directory that the webserver can write to. (/tmp)

    $curl_path        system path to cURL binary, set to false if none



EXAMPLES:

    Example:     fetch a web page and display the return headers and

                the contents of the page (html-escaped):



    include "Snoopy.class.php";

    $snoopy = new Snoopy;



    $snoopy->user = "joe";

    $snoopy->pass = "bloe";



    if($snoopy->fetch("http://www.slashdot.org/"))

    {

        echo "response code: ".$snoopy->response_code."<br>\n";

        while(list($key,$val) = each($snoopy->headers))

            echo $key.": ".$val."<br>\n";

        echo "<p>\n";



        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";

    }

    else

        echo "error fetching document: ".$snoopy->error."\n";

    Example:    submit a form and print out the result headers

                and html-escaped page:

    include "Snoopy.class.php";

    $snoopy = new Snoopy;



    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";



    $submit_vars["q"] = "amiga";

    $submit_vars["submit"] = "Search!";

    $submit_vars["searchhost"] = "Altavista";



    if($snoopy->submit($submit_url,$submit_vars))

    {

        while(list($key,$val) = each($snoopy->headers))

            echo $key.": ".$val."<br>\n";

        echo "<p>\n";



        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";

    }

    else

        echo "error fetching document: ".$snoopy->error."\n";

    Example:    showing functionality of all the variables:



    include "Snoopy.class.php";

    $snoopy = new Snoopy;

    $snoopy->proxy_host = "my.proxy.host";

    $snoopy->proxy_port = "8080";



    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";

    $snoopy->referer = "http://www.microsnot.com/";



    $snoopy->cookies["SessionID"] = "238472834723489";

    $snoopy->cookies["favoriteColor"] = "RED";



    $snoopy->rawheaders["Pragma"] = "no-cache";



    $snoopy->maxredirs = 2;

    $snoopy->offsiteok = false;

    $snoopy->expandlinks = false;



    $snoopy->user = "joe";

    $snoopy->pass = "bloe";



    if($snoopy->fetchtext("http://www.phpbuilder.com"))

    {

        while(list($key,$val) = each($snoopy->headers))

            echo $key.": ".$val."<br>\n";

        echo "<p>\n";



        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";

    }

    else

        echo "error fetching document: ".$snoopy->error."\n";

    Example:     fetched framed content and display the results



    include "Snoopy.class.php";

    $snoopy = new Snoopy;



    $snoopy->maxframes = 5;



    if($snoopy->fetch("http://www.ispi.net/"))

    {

        echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";

    }

    else

        echo "error fetching document: ".$snoopy->error."\n";

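<?php
// Collect the content URLs from every index page and save them to a file
function get_index($save_file, $prefix="index_"){
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
    while($i<$count){
        $url = $prefix . $i .".htm";
        echo "Get ". $url ."...";
        // get_content_url() expects a base URL as its first argument, which the
        // original call omitted. Assumption: links are relative to the directory of $url.
        $url_str = get_content_url(dirname($url) ."/", get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}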
// Fetch the target multimedia objects
function get_object($url_file, $save_file, $split="|--:**:--|"){
    if (!file_exists($url_file)) die($url_file ." not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file ." not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    foreach($url_arr as $url){
        $url = trim($url); // file() keeps the trailing newline on each line
        if (empty($url)) continue;
        echo "Get ". $url ."...";
        $html_str = get_url($url);
        // The original echoed $html_str and $url and then called exit here,
        // apparently leftover debugging that stopped after the first URL.
        $obj_str = get_content_object($html_str, $split);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}
// Walk a directory and process the contents of each file
// ($dir is expected to end with a directory separator)
function get_dir($save_file, $dir){
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    // Strict comparison, so a file named "0" does not end the loop early
    while(($file = readdir($dp)) !== false){
        if ($file!="." && $file!=".."){
            echo "Read file ". $file ."...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// Fetch the content of the given URL
function get_url($url){
    // The slashes must be escaped inside a /-delimited pattern
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url ." invalid");
    $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
    $content = "";
    while($fc = fread($fp, 8192)){
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)){
        die("Get url: ". $url ." content failed.");
    }
    return $content;
}
// Fetch the given page over a raw socket
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    // Each header line must end with \r\n (the backslashes were lost in the original)
    $header = "GET /".$url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    // Send a single Connection header. The original also sent Keep-Alive,
    // which conflicts with Close.
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Extract the URLs contained in the given content
function get_content_url($host_url, $file_contents){
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    // The contents of the first group were lost in the original. "href" is
    // assumed from the parallel pattern in get_content_object().
    $rex = "/(href)\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all ($rex, $file_contents, $r);
    $result = ""; //array();
    foreach($r as $c){
        if (is_array($c)){
            foreach($c as $d){
                if (preg_match($reg, $d)){ $result .= $host_url . $d."\n"; }
            }
        }
    }
    return $result;
}
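// Extract the multimedia file from the given content
function get_content_object($str, $split="|--:**:--|"){
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(<b>.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    // The array indices were lost in the original. [2][0] (the label) and
    // [1][0] (the URL) are reconstructed from the two capture groups.
    if (count($result) == 3 && isset($result[2][0])){
        $label = str_replace("<b>多媒体： ", "", $result[2][0]);
        $label = str_replace("</b>", "", $label);
        return $label . $split . $result[1][0] . "\n";
    }
    return ""; // nothing matched
}
?>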
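get_content_by_socket() returns the raw response, with the status line and headers still attached to the body. A minimal sketch of separating them follows; the function name split_http_response and the sample response are hypothetical, and chunked or gzip-encoded bodies are not handled.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status line,
// headers (lower-cased names) and body.
function split_http_response(string $raw): array
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);

    $headerLines = explode("\r\n", $parts[0]);
    $statusLine  = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}

// Made-up response for illustration.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = split_http_response($raw);
echo $response['status'], "\n";
echo $response['body'], "\n";
```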

Fetching a specific div block and images from a web page with PHP

(2009-06-05 09:56:23) Repost
　　
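1. Get all the images in a given page:
<?php

// Fetch the content at the given address and store it in $text

$text=file_get_contents('http://andy.diimii.com/');

// Match every img tag and store the results in the two-dimensional array $match
// (the listing as reposted repeated the single-match code from item 2)

preg_match_all('/<img[^>]*>/Ui', $text, $match);

// Print $match

print_r($match);

?>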
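Often the image URL is wanted rather than the whole tag. The sketch below pulls the src attribute out of each img tag with one more capture group. The function name extract_image_sources is hypothetical, and like any regex over HTML it only covers quoted attribute values.

```php
<?php
// Hypothetical helper: return the src value of every <img> tag.
function extract_image_sources(string $html): array
{
    // Group 1 remembers the opening quote, group 2 captures the value
    // up to the matching closing quote.
    preg_match_all('/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2];
}

// Made-up markup for illustration.
$html = '<img src="a.png" alt="x"><IMG class="c" src=\'b.jpg\'><p>no image</p>';
print_r(extract_image_sources($html));
```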

-----------------

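2. Get the first image in a given page:
<?php

// Fetch the content at the given address and store it in $text

$text=file_get_contents('http://andy.diimii.com/');

// Match the first img tag and store it in the array $match (the regex is the same as above)

preg_match('/<img[^>]*>/Ui', $text, $match);

// Print $match

print_r($match);

?>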

------------------------------------

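3. Get a specific div block in a given page (identified by id):
<?php

// Fetch the content at the given address and store it in $text

$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

// Strip line breaks and whitespace characters (only needed when serializing the content)

//$text=str_replace(array("\r","\n","\t"," "), '', $text);

// Match the div tag whose id is PostContent and store the result in the array $match

preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si',$text,$match);

// Print the matched block ($match[0] is the full match, $match[1] the inner HTML)

print($match[0]);

?>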
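One caveat with the regex in item 3: the lazy (.*?) stops at the first closing div tag, so a block that contains nested divs gets cut short. A sketch of an alternative using PHP's DOM extension is shown here. The function name extract_div_by_id is hypothetical, the DOM extension must be available, and $id is assumed to be a plain identifier without quote characters.

```php
<?php
// Hypothetical helper: return the outer HTML of the div with the given id,
// or null when no such div exists.
function extract_div_by_id(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // saveHTML() with a node argument serializes just that subtree.
    return $doc->saveHTML($nodes->item(0));
}

// Made-up markup with a nested div, which the regex approach would truncate.
$html = '<div id="other">x</div>'
      . '<div id="PostContent"><p>Hello</p><div class="inner">nested</div></div>';
echo extract_div_by_id($html, 'PostContent'), "\n";
```

For non-ASCII pages the input encoding may need to be declared before calling loadHTML(); that detail is left out of this sketch.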

-------------------------------------------

4. Combining items 2 and 3 above:
<?php

// Fetch the content at the given address and store it in $text

$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

// Match the div tag whose id is PostContent and store the result in the array $match

preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si',$text,$match);

// Match the first img tag and store it in the array $match2

preg_
