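Scraping and analyzing the 2016 Fortune Global 500 data

Date: 2017-3-20

Author: MingWei

This started while I was looking up Apple's revenue and profit: Apple's profit margin was far higher than that of the companies listed next to it, which made me wonder where Apple would rank by profit margin.

Having had the idea, I got to work.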

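Required knowledge, tools, and environment:

Development environment: IPython from Anaconda
Required libraries: NumPy, pandas, Requests, BeautifulSoup
Knowledge: basic Python 2.7 programming

Let's start coding: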

# Part 1: scrape the data and store it in lists

from bs4 import BeautifulSoup
import requests

import pandas as pd

import numpy as np

# URL of the page
path = "http://www.fortunechina.com/fortune500/c/2016-07/20/content_266955.htm"
data = requests.get(path).content
soup = BeautifulSoup(data, 'html.parser', from_encoding='utf-8')

# Get the Global 500 table and the text of its header cells
table = soup.find('table', attrs={'class': 'rankingtable'})
th = table.find_all('th')
th_content = []

for i in th:
    th_ = i.get_text()
    th_content.append(th_)

td = table.findAll('td')
number =[]
last_number=[]
company=[]
income=[]
profit=[]
country=[]
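
The table lookup above relies on BeautifulSoup and a live page. As a way to try the same idea offline, here is a minimal sketch that collects the text of every td inside a table with class rankingtable. It swaps in Python 3's standard-library html.parser for BeautifulSoup, and the names CellCollector and SAMPLE_HTML, as well as the sample markup itself, are hypothetical placeholders rather than anything taken from the real page.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> inside <table class="rankingtable">."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "rankingtable" in classes:
            self.in_table = True
        elif tag == "td" and self.in_table:
            self.in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # append text to the current cell


# Hypothetical markup standing in for the downloaded page
SAMPLE_HTML = """
<table class="other"><tr><td>ignored</td></tr></table>
<table class="rankingtable">
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><td>1</td><td>Company A</td></tr>
  <tr><td>2</td><td>Company B</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(SAMPLE_HTML)
cells = collector.cells
```

Only cells from the matching table end up in cells; header cells and the other table are skipped. The sketch does not handle nested tables.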

# Inspecting the page shows that each company occupies 6 td cells in the table,
# so the cell index modulo 6 tells us which column a cell belongs to.
# enumerate supplies that index as an integer starting at 0.
for i, n in enumerate(td):
    content = n.get_text()
    if i % 6 == 0:
        number.append(content)
    elif i % 6 == 1:
        last_number.append(content)
    elif i % 6 == 2:
        company.append(content)
    elif i % 6 == 3:
        income.append(content)
    elif i % 6 == 4:
        profit.append(content)
    else:
        country.append(content)
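
The six-cells-per-company layout described above can also be handled by cutting the flat cell list into rows of six and then transposing the rows into columns. The following is only a sketch of that idea: the cells list holds made-up placeholder values, and chunk_rows and FIELDS_PER_ROW are hypothetical names introduced for illustration.

```python
# Hypothetical flat list standing in for the text of the <td> cells
cells = [
    "1", "1", "Company A", "1,000.5", "100.0", "Country X",
    "2", "--", "Company B", "900.0", "-50.5", "Country Y",
]

FIELDS_PER_ROW = 6  # rank, last rank, company, income, profit, country


def chunk_rows(flat, width):
    """Split a flat list into consecutive rows of `width` items."""
    if len(flat) % width != 0:
        raise ValueError("cell count is not a multiple of the row width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


rows = chunk_rows(cells, FIELDS_PER_ROW)

# zip(*rows) transposes the rows, yielding one tuple per column
number, last_number, company, income, profit, country = (
    list(col) for col in zip(*rows)
)
```

Checking that the cell count is a multiple of six catches a changed page layout early, instead of silently shifting values into the wrong column.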

# Convert the lists to Series, then put them into a pandas DataFrame

ser_num = pd.Series(number)
ser_num.head()
ser_last_num = pd.Series(last_number)
ser_company = pd.Series(company)
ser_country = pd.Series(country)
ser_income = pd.Series(income)
ser_profit = pd.Series(profit)

top_500 = pd.DataFrame(number,columns=['number'])
top_500['last_number']=ser_last_num
top_500['company']=ser_company
top_500['income']=ser_income
top_500['profit']=ser_profit
top_500['country']=ser_country

# Save the data

top_500.to_csv('top_500.txt',encoding='utf-8')

# Try converting revenue and profit from object (string) dtype to float

top_500['int_profit']=top_500.profit.apply(lambda x:float(x.replace(',','')))
top_500['int_income']=top_500.income.apply(lambda x:float(x.replace(',','')))
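
The conversion above assumes every cell is a number with thousands separators. As a hedged sketch of a more defensive version, the helper below (to_float is a hypothetical name) strips whitespace and commas and maps placeholder cells to NaN. Whether the income or profit columns actually contain placeholders such as '--' is an assumption; the source only shows '--' in the previous-rank column.

```python
import math


def to_float(text, missing=("", "--")):
    """Parse a string such as '1,234.5' into a float.

    Cells listed in `missing` (an assumed set of placeholders) become NaN.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned in missing:
        return float("nan")
    return float(cleaned)


# Made-up sample strings, not values from the real table
values = [to_float(s) for s in ["1,234.5", " -50.5 ", "--"]]
valid = [v for v in values if not math.isnan(v)]
```

With pandas, the same helper could be passed to apply in place of the lambda.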

# Compute the profit margin and sort by it in descending order

top_500['rate'] = top_500.int_profit / top_500.int_income
sort_top_500 = top_500.sort_values(by='rate', ascending=False)
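
To connect the sorted result back to the original question (where does a given company rank by profit margin?), here is a small plain-Python sketch of the same computation. The company names and figures are invented placeholders, and margin_rank is a hypothetical helper, not part of the original script.

```python
# Invented placeholder rows, not real Global 500 figures
companies = [
    {"company": "Company A", "income": 1000.0, "profit": 100.0},
    {"company": "Company B", "income": 200.0, "profit": 50.0},
    {"company": "Company C", "income": 500.0, "profit": -25.0},
]

# Profit margin = profit / income
for c in companies:
    c["rate"] = c["profit"] / c["income"]

# Highest margin first, mirroring ascending=False
ranked = sorted(companies, key=lambda c: c["rate"], reverse=True)


def margin_rank(name, ranked_rows):
    """Return the 1-based position of `name` in the ranked list, or None."""
    for position, row in enumerate(ranked_rows, start=1):
        if row["company"] == name:
            return position
    return None


rank_of_a = margin_rank("Company A", ranked)
```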

# Convert the rank columns to int; for companies new to the list, replace '--' with 500,
# then count how many companies moved up and how many moved down

top_500['int_last_number'] = top_500.last_number.apply(lambda x: x.replace('--', '500'))
top_500['int_last_number'] = top_500.int_last_number.apply(lambda x: int(x))
top_500['int_number'] = top_500.number.apply(lambda x: int(x))

top_500['change'] = top_500.int_last_number - top_500.int_number

len(top_500[top_500.change >= 0])  # rank improved or stayed the same (includes new entries)

len(top_500[top_500.change < 0])  # rank dropped
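
Note that change >= 0 lumps unchanged ranks together with improved ones. If the three cases should be counted separately, a sketch like the one below would do it. It uses plain Python with made-up (current rank, previous rank) pairs, and rank_change and NEW_ENTRY_RANK are hypothetical names.

```python
NEW_ENTRY_RANK = 500  # same substitution for '--' as in the text above


def rank_change(current, previous):
    """Positive result means the company moved up the list."""
    prev = NEW_ENTRY_RANK if previous.strip() == "--" else int(previous)
    return prev - int(current)


# Made-up (current rank, previous rank) pairs
pairs = [("1", "1"), ("2", "5"), ("3", "2"), ("4", "--")]
changes = [rank_change(cur, prev) for cur, prev in pairs]

risers = sum(1 for c in changes if c > 0)
unchanged = sum(1 for c in changes if c == 0)
fallers = sum(1 for c in changes if c < 0)
```

Because new entries are treated as coming from rank 500, they always count as risers (or as unchanged for rank 500 itself); that is a modelling choice inherited from the substitution, not a fact about the companies.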

# Count the listed companies per country

top_500.country.value_counts()

# Show the 20 companies with the highest profit margins

sort_top_500[:20]
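
For readers without pandas at hand, the per-country tally can be approximated with the standard library's collections.Counter, which plays a role similar to value_counts here. The country labels below are placeholders, not data from the list.

```python
from collections import Counter

# Placeholder labels standing in for the country column
countries = [
    "Country X", "Country Y", "Country X",
    "Country Z", "Country X", "Country Y",
]

counts = Counter(countries)

# The two most frequent countries, most common first
top = counts.most_common(2)
```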

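Summary:

This was a small project that took about two hours. It covered scraping the data, converting data formats, preprocessing, and finally some simple statistical analysis, and it was good practice for my Python skills. I'll keep uploading the small projects I do day to day.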
