Crawler Proxy IP Pool and Tunnel Proxy (2022.05.24)
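Contents
- Crawler Proxy IP Pool and Tunnel Proxy
- 1. Proxy IP Pool
- 1.1 Introduction
- 1.2 Implementation
- 1.3 Testing
- 2. Tunnel Proxy
- 2.1 Introduction
- 2.2 Implementation
- 2.2.1 Directory Structure
- 2.2.2 Configuration Files
- 2.2.3 openresty
- 2.3 Testing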
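In day-to-day development you occasionally need to scrape data from web pages, and a proxy IP pool is a common way to hide the machine's real IP. This article builds an easier-to-use tunnel proxy on top of openresty and a proxy IP pool.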
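1. Proxy IP Pool
1.1 Introduction
A proxy IP pool maintains a queue of usable proxy IPs in a database. A typical implementation works as follows:
- Periodically fetch lists of proxy IPs from free or paid proxy sites;
- Store the proxy IPs in Redis as a Hash;
- Periodically check each proxy IP and remove the ones that no longer work;
- Expose an API for managing the proxy IP pool.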
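The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not how jhao104/proxy_pool is actually implemented: a plain dict stands in for the Redis Hash, and the names ProxyPool, fetch and check are hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, Optional


class ProxyPool:
    """Toy proxy pool; a dict plays the role of the Redis Hash (assumption)."""

    def __init__(self, fetch: Callable[[], Iterable[str]], check: Callable[[str], bool]):
        self._fetch = fetch    # returns candidate proxies as "ip:port" strings
        self._check = check    # returns True if the proxy is still usable
        self._proxies: Dict[str, dict] = {}  # "ip:port" -> metadata

    def refresh(self) -> None:
        """Step 1 and 2: fetch candidates and store the new ones."""
        for proxy in self._fetch():
            self._proxies.setdefault(proxy, {"fail_count": 0})

    def validate(self) -> None:
        """Step 3: drop every proxy that fails the availability check."""
        for proxy in list(self._proxies):
            if not self._check(proxy):
                del self._proxies[proxy]

    # Step 4: the operations a management API would expose.
    def get(self) -> Optional[str]:
        """Return a random proxy, or None when the pool is empty."""
        return random.choice(list(self._proxies)) if self._proxies else None

    def delete(self, proxy: str) -> None:
        self._proxies.pop(proxy, None)

    def count(self) -> int:
        return len(self._proxies)
```

In a real deployment refresh and validate would run on a timer, and get, delete and count would sit behind HTTP endpoints.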
1.2 Implementation
Here I use the open-source project jhao104/proxy_pool; see its documentation for the implementation details.
1.3 Testing
import json

import requests
from retrying import retry


def get_proxy_ip() -> str:
    # Ask the proxy pool API for one proxy; the JSON response carries it as "ip:port".
    resp = requests.get(url="http://192.168.0.121:5010/get", timeout=5)
    assert resp.status_code == 200
    return f"http://{json.loads(resp.text)['proxy']}"


# Retry up to 5 times; every attempt fetches a fresh proxy from the pool.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies={"http": get_proxy_ip()}, timeout=5)
    assert resp.status_code == 200
    # httpbin echoes the caller's IP in "origin"; it should be the proxy's IP.
    print(f"origin: {json.loads(resp.text)['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")
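The test above reads the "proxy" field directly, which raises a KeyError if the pool has nothing to return. A slightly more defensive way to turn the API response into a proxies mapping for requests might look like the sketch below. The helper name build_proxies is hypothetical, and the "https" flag is an assumption about the response format, so check the proxy_pool documentation for the fields your version actually returns.

```python
from typing import Dict


def build_proxies(payload: dict) -> Dict[str, str]:
    """Turn a decoded pool API response into a `proxies` mapping for requests."""
    address = payload.get("proxy")
    if not address:
        # The pool may be empty, in which case no proxy field is present.
        raise ValueError("the pool returned no proxy")
    proxy_url = f"http://{address}"
    proxies = {"http": proxy_url}
    # Assumption: an "https" flag marks proxies that can also carry HTTPS traffic.
    if payload.get("https"):
        proxies["https"] = proxy_url
    return proxies
```

The result can be passed as the proxies argument of requests.get in place of the hand-built dict.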
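2. Tunnel Proxy
2.1 Introduction
The proxy IP pool hides the machine's real IP, but having to call the API for a new proxy IP before every request is inconvenient, which is where a tunnel proxy comes in. A tunnel proxy exposes a single proxy address and internally forwards requests through different proxy IPs.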
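To make the idea concrete, here is a minimal asyncio sketch of a tunnel: for each incoming connection it picks a random upstream proxy and relays raw bytes in both directions. It only illustrates the concept and is not the openresty setup used in this article; make_tunnel_handler and get_upstreams are hypothetical names, and get_upstreams is assumed to return "ip:port" strings such as those stored in the pool.

```python
import asyncio
import random
from typing import Callable, List


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


def make_tunnel_handler(get_upstreams: Callable[[], List[str]]):
    """Build a connection handler that forwards each client to a random upstream."""

    async def handle(client_reader, client_writer):
        upstreams = get_upstreams()
        if not upstreams:
            client_writer.close()
            return
        # Split "ip:port" at the last colon.
        host, _, port = random.choice(upstreams).rpartition(":")
        try:
            up_reader, up_writer = await asyncio.open_connection(host, int(port))
        except OSError:
            client_writer.close()
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            _pipe(client_reader, up_writer),
            _pipe(up_reader, client_writer),
        )

    return handle
```

Passing the handler to asyncio.start_server gives clients one fixed address to use as their proxy, while the upstream changes per connection.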
2.2 Implementation
Here I implement the tunnel proxy with openresty on top of the proxy IP pool set up above.
2.2.1 Directory Structure
openresty
├── conf.d
│   └── tunnel-proxy.stream
├── docker.sh
└── nginx.conf
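2.2.2 Configuration Files
- nginx.conf is the main openresty configuration file. The main change is that it now includes the stream configuration files. Its content is as follows: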
# nginx.conf -- docker-openresty
#
# This file is installed to:
# `/usr/local/openresty/nginx/conf/nginx.conf`
# and is the file loaded by nginx at startup,
# unless the user specifies otherwise.
#
# It tracks the upstream OpenResty's `nginx.conf`, but removes the `server`
# section and adds this directive:
# `include /etc/nginx/conf.d/*.conf;`
#
# The `docker-openresty` file `nginx.vh.default.conf` is copied to
# `/etc/nginx/conf.d/default.conf`. It contains the `server section
# of the upstream `nginx.conf`.
#
# See https://github.com/openresty/docker-openresty/blob/master/README.md#nginx-config-files
#
#user nobody;
#worker_processes 1;
# Enables the use of JIT for regular expressions to speed-up their processing.
pcre_jit on;
#error_log logs/error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;
#pid logs/nginx.pid;
events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;

    # Enables or disables the use of underscores in client request header fields.
    # When the use of underscores is disabled, request header fields whose names contain underscores are marked as invalid and become subject to the ignore_invalid_headers directive.
    # underscores_in_headers off;

    #log_format main '$remote_addr - $remote_user [$time_local] "$request" '
    #                '$status $body_bytes_sent "$http_referer" '
    #                '"$http_user_agent" "$http_x_forwarded_for"';
    #access_log logs/access.log main;

    # Log in JSON Format
    # log_format nginxlog_json escape=json '{ "timestamp": "$time_iso8601", '
    # '"remote_addr": "$remote_addr", '
    # '"body_bytes_sent": $body_bytes_sent, '
    # '"request_time": $request_time, '
    # '"response_status": $status, '
    # '"request": "$request", '
    # '"request_method": "$request_method", '
    # '"host": "$host",'
    # '"upstream_addr": "$upstream_addr",'
    # '"http_x_forwarded_for": "$http_x_forwarded_for",'
    # '"http_referrer": "$http_referer", '
    # '"http_user_agent": "$http_user_agent", '
    # '"http_version": "$server_protocol", '
    # '"nginx_access": true }';
    # access_log /dev/stdout nginxlog_json;

    # See Move default writable paths to a dedicated directory (#119)
    # https://github.com/openresty/docker-openresty/issues/119
    client_body_temp_path /var/run/openresty/nginx-client-body;
    proxy_temp_path /var/run/openresty/nginx-proxy;
    fastcgi_temp_path /var/run/openresty/nginx-fastcgi;
    uwsgi_temp_path /var/run/openresty/nginx-uwsgi;
    scgi_temp_path /var/run/openresty/nginx-scgi;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;

    # Don't reveal OpenResty version to clients.
    # server_tokens off;
}
stream {
    log_format proxy '$remote_addr [$time_local] '
                     '$protocol $status $bytes_sent $bytes_received '
                     '$session_time "$upstream_addr" '
                     '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';

    access_log /usr/local/openresty/nginx/logs/access.log proxy;
    error_log /usr/local/openresty/nginx/logs/error.log notice;
    open_log_file_cache off;

    include /etc/nginx/conf.d/*.stream;
}
- tunnel-proxy.stream configures the tunnel proxy. It queries Redis for a proxy IP and forwards the connection through that proxy to the target address. Its content is as follows:
# tunnel-proxy.stream
upstream backend {
    # Placeholder only; the real peer is chosen in balancer_by_lua_block.
    server 0.0.0.0:9870;

    balancer_by_lua_block {
        local balancer = require "ngx.balancer"
        -- Proxy address selected earlier in preread_by_lua_block
        local host = ngx.ctx.proxy_host
        local port = ngx.ctx.proxy_port

        local success, msg = balancer.set_current_peer(host, port)
        if not success then
            ngx.log(ngx.ERR, "Failed to set the peer. Error: ", msg, ".")
        end
    }
}

server {
    # Public listening port of the tunnel proxy
    listen 9870;
    listen [::]:9870;

    proxy_connect_timeout 10s;
    proxy_timeout 10s;
    proxy_pass backend;

    preread_by_lua_block {
        local redis = require("resty.redis")
        local redis_instance = redis:new()
        redis_instance:set_timeout(3000)

        -- Redis host
        local rhost = "192.168.0.121"
        -- Redis port
        local rport = 6379
        -- Redis database
        local database = 0
        -- Name of the Redis Hash that stores the proxies
        local rkey = "use_proxy"

        local success, msg = redis_instance:connect(rhost, rport)
        if not success then
            ngx.log(ngx.ERR, "Failed to connect to redis. Error: ", msg, ".")
            return ngx.exit(ngx.ERROR)
        end
        redis_instance:select(database)

        local proxys, msg = redis_instance:hkeys(rkey)
        -- hkeys returns an empty table when the Hash is missing or empty
        if not proxys or #proxys == 0 then
            ngx.log(ngx.ERR, "Proxys num error. Error: ", msg, ".")
            redis_instance:close()
            return ngx.exit(ngx.ERROR)
        end

        -- Pick one proxy at random and split it into ip and port
        math.randomseed(tostring(ngx.now()):reverse():sub(1, 6))
        local proxy = proxys[math.random(#proxys)]
        local colon_index = string.find(proxy, ":", 1, true)
        local proxy_ip = string.sub(proxy, 1, colon_index - 1)
        local proxy_port = string.sub(proxy, colon_index + 1)
        ngx.log(ngx.NOTICE, "Proxy: ", proxy, ", ip: ", proxy_ip, ", port: ", proxy_port, ".");

        -- Hand the choice over to balancer_by_lua_block
        ngx.ctx.proxy_host = proxy_ip
        ngx.ctx.proxy_port = tonumber(proxy_port)
        redis_instance:close()
    }
}
2.2.3 openresty
openresty is started with docker. For convenience I saved the docker command in a shell script (docker.sh), whose content is as follows:
docker run --name openresty -itd --restart always \
    -p 9870:9870 \
    -v "$PWD/nginx.conf:/usr/local/openresty/nginx/conf/nginx.conf" \
    -v "$PWD/conf.d:/etc/nginx/conf.d" \
    -e LANG=C.UTF-8 \
    -e TZ=Asia/Shanghai \
    --log-driver json-file \
    --log-opt max-size=1g \
    --log-opt max-file=3 \
    openresty/openresty:alpine
Run the following command to start openresty; the tunnel proxy is then ready to use:
bash docker.sh
2.3 Testing
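import json

import requests
from retrying import retry

# Every request goes to the single tunnel address; openresty picks the actual proxy.
proxies = {
    "http": "http://192.168.0.121:9870"
}


# Retry up to 5 times; each new connection may be forwarded through a different proxy.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies=proxies, timeout=5)
    assert resp.status_code == 200
    # httpbin echoes the caller's IP in "origin"; it should be a proxy IP from the pool.
    print(f"origin: {json.loads(resp.text)['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")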
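A single successful request does not show whether the exit IP actually rotates. One way to check is to repeat the request several times and tally the reported origins, as in the sketch below. The helper count_origins is a hypothetical name; it takes any zero-argument function that returns the origin string, so it is assumed that you wrap the requests call from the test above in such a function.

```python
from collections import Counter
from typing import Callable


def count_origins(fetch_origin: Callable[[], str], attempts: int = 10) -> Counter:
    """Call fetch_origin repeatedly and tally the origins it reports."""
    origins: Counter = Counter()
    for _ in range(attempts):
        try:
            origins[fetch_origin()] += 1
        except Exception:
            # Free proxies fail often; count failures instead of aborting.
            origins["<failed>"] += 1
    return origins
```

If the tunnel works as intended, the resulting counter should contain several different origin IPs rather than a single one, alongside however many attempts failed.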