PHP multi-process crawler: SSL in cURL and pcntl_fork

Anonymous technical user   2020-12-23 14:08

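Background

While writing a crawler with multiple PHP processes recently, I ran into a strange problem. In a multi-process PHP program, if the parent process makes an HTTPS request to some domain (for example https://www.jd.com), a child process that then makes an HTTPS request to the same site fails.

For example: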

<?php
 $ch = curl_init('https://www.jd.com/');          
 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $success = curl_exec($ch);
 var_dump($success !== false); // true
 curl_close($ch);

 $pid = pcntl_fork();

 if ($pid === 0) {
     $ch = curl_init('https://www.jd.com/');
     curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
     curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
     $success = curl_exec($ch);
     var_dump($success !== false); // false

     $errno = curl_errno($ch); // 35
     $error = curl_error($ch); // SSL connect error
     curl_close($ch);
 } else if ($pid > 0) {
     // wait for child process
     pcntl_wait($status);
 }
bool(true)
bool(false)

With curl's verbose output enabled, the following debug information is printed.

*Trying 183.56.147.1...
*TCP_NODELAY set
*Connected to www.jd.com (183.56.147.1) port 443 (#0)
*Initializing NSS with certpath: none
*skipping SSL peer certificate verification
*ALPN, server accepted to use http/1.1
*SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
*Server certificate:
*subject: CN=*.jd.com,O="BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD.",L=beijing,ST=beijing,C=CN
*start date: Jul 04 05:47:07 2017 GMT
*expire date: Aug 28 09:42:54 2018 GMT
*common name: *.jd.com
*issuer: CN=GlobalSign Organization Validation CA - SHA256 - G2,O=GlobalSign nv-sa,C=BE
> GET / HTTP/1.1
Host: www.jd.com
Accept: */*

< HTTP/1.1 200 OK
< Server: JDWS/2.0
< Date: Sat, 18 Nov 2017 05:45:52 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 124343
< Connection: keep-alive
< Vary: Accept-Encoding
< Vary: Accept-Encoding
< Expires: Sat, 18 Nov 2017 05:45:52 GMT
< Cache-Control: max-age=30
< ser: 101.115
< Via: BJ-M-YZ-NX-76(HIT), http/1.1 GZ-CT-1-JCS-24 ( [cRs f ])
< Age: 24
< Strict-Transport-Security: max-age=360
<
*Curl_http_done: called premature == 0
*Connection #0 to host www.jd.com left intact
bool(true)
*Trying 183.56.147.1...
*TCP_NODELAY set
*Connected to www.jd.com (183.56.147.1) port 443 (#0)
*NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
*A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
*Curl_http_done: called premature == 0
*Closing connection 0

In the debug output, the relevant part is:

*Trying 183.56.147.1...
*TCP_NODELAY set
*Connected to www.jd.com (183.56.147.1) port 443 (#0)
*NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
*A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
*Curl_http_done: called premature == 0
*Closing connection 0
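The HTTPS request in the child process fails with an NSS error. NSS (Network Security Services) is the TLS and cryptography library that this build of libcurl uses as its SSL backend.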
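
A trace like the one above can be captured with curl's verbose option. The following is only a minimal sketch: `make_verbose_handle` is a hypothetical helper name, and writing the trace to a `php://temp` stream is an assumption made for illustration (by default the trace goes to stderr).

```php
<?php
// Hypothetical helper: build a curl handle whose verbose trace is written to $log.
function make_verbose_handle(string $url, $log)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_VERBOSE, true);  // emit the "*", ">" and "<" trace lines
    curl_setopt($ch, CURLOPT_STDERR, $log);   // send the trace to $log instead of stderr
    return $ch;
}

// In-memory stream that collects the trace.
$log = fopen('php://temp', 'w+');

if (function_exists('curl_init')) {
    $ch = make_verbose_handle('https://www.jd.com/', $log);
    // After curl_exec($ch), read the trace back with:
    //   rewind($log); echo stream_get_contents($log);
}
```

Collecting the trace in a stream rather than on stderr makes it easier to tell the parent's output apart from the child's when both are printing at once.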

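Cause

From what I found online, the cause appears to lie in the libcurl library behind PHP's curl extension, or more precisely in the NSS backend it was built against. An HTTPS request adds certificate verification and an encrypted channel on top of HTTP, and NSS initializes its PKCS #11 crypto modules the first time the process makes such a request. Once the parent has made an HTTPS request, that initialized state is copied into the child by fork(). PKCS #11 modules are not meant to be reused across a fork without being re-initialized, so when the child tries to make its own HTTPS request, NSS appears to detect that it is now running under a different pid and returns CKR_DEVICE_ERROR, which curl reports as error 35 (SSL connect error).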
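
Since the explanation depends on libcurl being built against NSS, it can help to check which TLS backend a given PHP installation uses. This is a small sketch under assumptions: `ssl_backend_is_nss` is a hypothetical helper, and it assumes the backend string reported by `curl_version()` starts with the library name (strings such as "NSS/..." or "OpenSSL/...").

```php
<?php
// Hypothetical helper: true when the reported TLS backend string names NSS.
function ssl_backend_is_nss(string $sslVersion): bool
{
    return stripos($sslVersion, 'NSS') === 0;
}

// curl_version() returns an array; its 'ssl_version' entry names the TLS backend.
$backend = function_exists('curl_version') ? curl_version()['ssl_version'] : '';

echo 'TLS backend: ', $backend, PHP_EOL;
echo ssl_backend_is_nss($backend)
    ? "NSS build: avoid HTTPS in the parent before forking\n"
    : "Not an NSS build: this particular error is less likely\n";
```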

Solutions

  1. Use plain HTTP in the parent process, or use plain HTTP in every child process.
<?php
 $ch = curl_init('http://www.jd.com/');          
 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $success = curl_exec($ch);
 var_dump($success !== false); // true
 curl_close($ch);

 $pid = pcntl_fork();

 if ($pid === 0) {
     $ch = curl_init('https://www.jd.com/');
     curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
     curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
     $success = curl_exec($ch);
     var_dump($success !== false); // true

     $errno = curl_errno($ch); // 0
     $error = curl_error($ch); // "" (no error)
     curl_close($ch);
 } else if ($pid > 0) {
     // wait for child process
     pcntl_wait($status);
 }
bool(true)
bool(true)
  2. Use sockets instead of curl.

  3. Use pcntl_exec() instead of a bare pcntl_fork(), so that the worker runs in a fresh process image.
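
The third option can be sketched as fork followed by exec. This is a hedged illustration, not the original author's code: `run_in_fresh_process` is a hypothetical helper and the worker script name is an assumption. Replacing the child's process image discards whatever libcurl and NSS state was inherited from the parent.

```php
<?php
// Hypothetical helper: run a PHP script in a brand-new process image
// and return its exit code (-1 if it did not exit normally).
function run_in_fresh_process(string $script, array $args = []): int
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('pcntl_fork() failed');
    }

    if ($pid === 0) {
        // Child: replace the forked image with a fresh PHP interpreter,
        // so no initialized SSL state from the parent survives.
        pcntl_exec(PHP_BINARY, array_merge([$script], $args));
        // Only reached if pcntl_exec() itself failed.
        exit(127);
    }

    // Parent: wait for this specific child and report how it ended.
    pcntl_waitpid($pid, $status);
    return pcntl_wifexited($status) ? pcntl_wexitstatus($status) : -1;
}

// Usage (worker.php is an assumed script that performs the HTTPS request):
//   $code = run_in_fresh_process(__DIR__ . '/worker.php', ['https://www.jd.com/']);
```

The trade-off is that the worker no longer shares variables with the parent, so anything it needs has to be passed as arguments or through some other channel.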

References

  1. https://stackoverflow.com/questions/26285311/ssl-requests-made-with-curl-fail-after-process-fork

  2. https://stackoverflow.com/questions/15466809/libcurl-ssl-error-after-fork

  3. https://stackoverflow.com/questions/34901910/curl-and-pcntl-fork?lq=1
