This article walks through predicting the bulk modulus of materials and is intended as a reference for developers working on similar problems.
$$K = -V \frac{\partial p}{\partial V}$$
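The bulk modulus (K), also called the incompressibility, measures how strongly a material resists deformation when pressure is applied uniformly over its surface.
Definition: the pressure required to produce a unit relative decrease in volume. Its SI unit is the pascal (Pa).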
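As a quick numerical check of the definition, the sketch below estimates K = -V ∂p/∂V with a central finite difference. The isothermal ideal-gas law p = c/V is used purely as an illustrative assumption (it is not part of the dataset); for that equation of state the bulk modulus equals the pressure itself. The helper name `bulk_modulus` and the constant `C` are hypothetical.

```python
def bulk_modulus(pressure, volume, dv=1e-6):
    """Estimate K = -V * dp/dV with a central finite difference."""
    dp_dv = (pressure(volume + dv) - pressure(volume - dv)) / (2 * dv)
    return -volume * dp_dv


# Illustrative equation of state (assumption): isothermal ideal gas, p = C / V.
C = 101325.0  # Pa * m^3, chosen so that p(1 m^3) is one atmosphere


def ideal_gas_pressure(volume):
    return C / volume


k = bulk_modulus(ideal_gas_pressure, volume=1.0)
print(f"K = {k:.1f} Pa, p = {ideal_gas_pressure(1.0):.1f} Pa")
```

For a solid, the pressure function would instead come from an equation of state fitted to computed or measured data.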
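Meaning of the data columns
- G_Reuss: lower bound on the shear modulus of the polycrystalline material
- G_VRH: average of G_Reuss and G_Voigt
- G_Voigt: upper bound on the shear modulus of the polycrystalline material
- K_Reuss: lower bound on the bulk modulus of the polycrystalline material
- K_VRH: average of K_Reuss and K_Voigt
- K_Voigt: upper bound on the bulk modulus of the polycrystalline material
- cif: optional: description string for the structure
- compliance_tensor: tensor describing the elastic behavior
- elastic_anisotropy: measure of the directional dependence of the material's elasticity; the metric is always >= 0
- elastic_tensor: tensor describing the elastic behavior in the IEEE orientation, symmetrized to the crystal structure
- elastic_tensor_original: tensor describing the elastic behavior, unsymmetrized, in the orientation of the POSCAR conventional standard cell
- formula: chemical formula of the material
- kpoint_density: optional: sampling parameter from the calculation
- material_id: Materials Project ID of the material
- nsites: number of atoms in the unit cell of the calculation
- poisson_ratio: describes the lateral response to loading
- poscar: optional: POSCAR data
- space_group: space group of the material's crystal structure
- structure: pandas Series defining the structure of the material
- volume: volume of the unit cell in cubic angstroms. For supercell calculations, this quantity refers to the volume of the full supercell.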
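The relationships between several of these columns can be checked by hand. The sketch below takes the Reuss and Voigt bounds of mp-10003 (Nb4CoSi) from the `data.head()` output and recomputes the Hill averages, the isotropic Poisson ratio `(3K - 2G) / (2(3K + G))`, and the universal anisotropy index `5*G_Voigt/G_Reuss + K_Voigt/K_Reuss - 6`. That the dataset was built with exactly these formulas is an assumption, although they reproduce the tabulated values; the function names are illustrative.

```python
def hill_average(reuss, voigt):
    """Voigt-Reuss-Hill average: arithmetic mean of the lower and upper bounds."""
    return 0.5 * (reuss + voigt)


def poisson_ratio(k, g):
    """Isotropic Poisson ratio from bulk modulus k and shear modulus g."""
    return (3 * k - 2 * g) / (2 * (3 * k + g))


def universal_anisotropy(g_reuss, g_voigt, k_reuss, k_voigt):
    """Universal elastic anisotropy index; zero when both bounds coincide."""
    return 5 * g_voigt / g_reuss + k_voigt / k_reuss - 6


# Values for mp-10003 (Nb4CoSi), copied from the data.head() output.
g_reuss, g_voigt = 96.844535, 97.438674
k_reuss, k_voigt = 194.267623, 194.270146

k_vrh = hill_average(k_reuss, k_voigt)
g_vrh = hill_average(g_reuss, g_voigt)
nu = poisson_ratio(k_vrh, g_vrh)
anisotropy = universal_anisotropy(g_reuss, g_voigt, k_reuss, k_voigt)
print(f"K_VRH={k_vrh:.6f} G_VRH={g_vrh:.6f} nu={nu:.6f} A={anisotropy:.6f}")
```

Because the Voigt bound is never below the Reuss bound, the anisotropy index cannot be negative, which matches the ">= 0" note on elastic_anisotropy.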
Importing and processing the data
# Load the dataset
from matminer.datasets import load_dataset
data=load_dataset('elastic_tensor_2015',data_home='.')
data.head()
| | material_id | formula | nsites | space_group | volume | structure | elastic_anisotropy | G_Reuss | G_VRH | G_Voigt | K_Reuss | K_VRH | K_Voigt | poisson_ratio | compliance_tensor | elastic_tensor | elastic_tensor_original | cif | kpoint_density | poscar |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 12 | 124 | 194.419802 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 96.844535 | 97.141604 | 97.438674 | 194.267623 | 194.268884 | 194.270146 | 0.285701 | [[0.004385293093993, -0.0016070693558990002, -... | [[311.33514638650246, 144.45092552856926, 126.... | [[311.33514638650246, 144.45092552856926, 126.... | #\#CIF1.1\n###################################... | 7000 | Nb8 Co2 Si2\n1.0\n6.221780 0.000000 0.000000\n... |
1 | mp-10010 | Al(CoSi)2 | 5 | 164 | 61.987320 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 93.939650 | 96.252006 | 98.564362 | 173.647763 | 175.449907 | 177.252050 | 0.268105 | [[0.0037715428949660003, -0.000844229828709, -... | [[306.93357350984974, 88.02634955100905, 105.6... | [[306.93357350984974, 88.02634955100905, 105.6... | #\#CIF1.1\n###################################... | 7000 | Al1 Co2 Si2\n1.0\n3.932782 0.000000 0.000000\n... |
2 | mp-10015 | SiOs | 2 | 221 | 25.952539 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 120.962289 | 130.112955 | 139.263621 | 295.077545 | 295.077545 | 295.077545 | 0.307780 | [[0.0019959391925840004, -0.000433146670736000... | [[569.5291276937579, 157.8517489654999, 157.85... | [[569.5291276937579, 157.8517489654999, 157.85... | #\#CIF1.1\n###################################... | 7000 | Si1 Os1\n1.0\n2.960692 0.000000 0.000000\n0.00... |
3 | mp-10021 | Ga | 4 | 63 | 76.721433 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 12.205989 | 15.101901 | 17.997812 | 49.025963 | 49.130670 | 49.235377 | 0.360593 | [[0.021647143908635, -0.005207263618160001, -0... | [[69.28798774976904, 34.7875015216915, 37.3877... | [[70.13259066665267, 40.60474945058445, 37.387... | #\#CIF1.1\n###################################... | 7000 | Ga4\n1.0\n2.803229 0.000000 0.000000\n0.000000... |
4 | mp-10025 | SiRu2 | 12 | 62 | 160.300999 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 100.110773 | 101.947798 | 103.784823 | 255.055257 | 256.768081 | 258.480904 | 0.324682 | [[0.00410214297725, -0.001272204332729, -0.001... | [[349.3767766177825, 186.67131003104407, 176.4... | [[407.4791016459293, 176.4759188081947, 213.83... | #\#CIF1.1\n###################################... | 7000 | Si4 Ru8\n1.0\n4.037706 0.000000 0.000000\n0.00... |
We will try to predict K_VRH and G_VRH (the Voigt-Reuss-Hill averages of the bulk and shear modulus, respectively) and elastic_anisotropy.
Drop the columns that are not needed: volume, nsites, compliance_tensor, elastic_tensor, elastic_tensor_original, G_Reuss, G_Voigt, K_Reuss, K_Voigt, cif, kpoint_density, poscar
deleted_col=['volume','nsites','compliance_tensor','elastic_tensor','elastic_tensor_original','G_Reuss','G_Voigt','K_Reuss','K_Voigt','cif','kpoint_density','poscar']
data=data.drop(deleted_col,axis=1)
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio |
---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 |
data.describe()
| | space_group | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio |
---|---|---|---|---|---|
count | 1181.000000 | 1181.000000 | 1181.000000 | 1181.000000 | 1181.000000 |
mean | 163.403895 | 2.145013 | 67.543145 | 136.259661 | 0.287401 |
std | 65.040733 | 19.140097 | 44.579408 | 72.886978 | 0.062177 |
min | 4.000000 | 0.000005 | 2.722175 | 6.476135 | 0.042582 |
25% | 124.000000 | 0.145030 | 34.117959 | 76.435350 | 0.249159 |
50% | 193.000000 | 0.355287 | 59.735163 | 130.382766 | 0.290198 |
75% | 221.000000 | 0.923117 | 91.332142 | 189.574194 | 0.328808 |
max | 229.000000 | 397.297866 | 522.921225 | 435.661487 | 0.467523 |
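Adding descriptors to the data
We are looking for a relationship between the inputs (a material's composition and crystal structure) and the outputs (elastic properties such as K_VRH, G_VRH and elastic_anisotropy). To find such a relationship, we need to "featurize" the input data so that it becomes numbers that meaningfully represent the underlying physical quantities. For example, one "feature" or "descriptor" of the composition of a material such as Nb4CoSi is the standard deviation of the Pauling electronegativity of the elements in the compound, weighted by stoichiometry. Compositions with a high value of this quantity are more ionic, while those with a low value tend toward covalent or metallic bonding. A descriptor of the crystal structure might be the average coordination number of the sites; a higher coordination number indicates more bonds and therefore possibly a stiffer material. With matminer, we can generate hundreds of candidate descriptors from its library of featurizers. Data mining techniques can then use the available output data as a guide to narrow down which descriptors are most relevant to the target problem.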
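The electronegativity descriptor mentioned above can be computed by hand. The following is a minimal sketch, not matminer's implementation: the Pauling electronegativities are hardcoded, the weighting is a plain fraction-weighted (population) standard deviation, and matminer's own statistic may normalize differently. The function name `weighted_en_std` is hypothetical.

```python
import math

# Pauling electronegativities, hardcoded here for illustration only.
PAULING_EN = {"Nb": 1.6, "Co": 1.88, "Si": 1.90, "Na": 0.93, "Cl": 3.16}


def weighted_en_std(composition):
    """Stoichiometry-weighted standard deviation of Pauling electronegativity.

    composition maps element symbols to their amounts in the formula unit.
    """
    total = sum(composition.values())
    fractions = {el: amt / total for el, amt in composition.items()}
    mean = sum(frac * PAULING_EN[el] for el, frac in fractions.items())
    variance = sum(frac * (PAULING_EN[el] - mean) ** 2 for el, frac in fractions.items())
    return math.sqrt(variance)


intermetallic = weighted_en_std({"Nb": 4, "Co": 1, "Si": 1})  # Nb4CoSi
salt = weighted_en_std({"Na": 1, "Cl": 1})  # NaCl, a textbook ionic compound
print(f"Nb4CoSi: {intermetallic:.3f}, NaCl: {salt:.3f}")
```

The ionic compound gives a much larger spread than the intermetallic, which is the trend the descriptor is meant to capture.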
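Adding composition-based features
One major class of featurizers in matminer uses the chemical composition to featurize the input data. Let's add some composition-based features to our DataFrame.
The first step is to have a column that represents the chemical composition as a pymatgen Composition object. One way to do this is to use the conversions module in matminer to turn the string composition (our formula column) into a pymatgen Composition.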
from matminer.featurizers.conversions import StrToComposition
data=StrToComposition().featurize_dataframe(df=data,col_id='formula')
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition |
---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) |
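StrToComposition.featurize_dataframe
Performs the data conversion and sets the target column dynamically.
Args:
- df (pandas.DataFrame): DataFrame containing the input data.
- col_id (str or list of str): label of the column containing the objects to featurize. Can be multiple labels if the featurize function requires multiple inputs.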
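To make concrete what a string-to-composition conversion produces, here is a small stand-in parser written from scratch. It is not pymatgen's or matminer's implementation, it handles only simple formulas with optional parentheses, and `parse_formula` is a hypothetical name.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Parse a simple formula such as 'Al(CoSi)2' into {element: amount}."""
    tokens = re.findall(r"[A-Z][a-z]?|\(|\)|\d+(?:\.\d+)?", formula)
    stack = [Counter()]  # one counter per open parenthesis level
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "(":
            stack.append(Counter())
            i += 1
            continue
        # An optional number directly after an element or ')' is its multiplier.
        has_count = i + 1 < len(tokens) and tokens[i + 1][0].isdigit()
        count = float(tokens[i + 1]) if has_count else 1.0
        if token == ")":
            group = stack.pop()
            for element, amount in group.items():
                stack[-1][element] += amount * count
        else:
            stack[-1][token] += count
        i += 2 if has_count else 1
    return dict(stack[0])


for formula in ["Nb4CoSi", "Al(CoSi)2", "Ga"]:
    print(formula, parse_formula(formula))
```

The element sets it finds match the composition column shown above, for example Al, Co and Si for Al(CoSi)2.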
Next, use one of matminer's featurizers to add a set of descriptors to the DataFrame.
from matminer.featurizers.composition import ElementProperty
ep_feat=ElementProperty.from_preset(preset_name='magpie')  # build an ElementProperty featurizer from a preset name
data=ep_feat.featurize_dataframe(data,'composition')  # turn the composition column into features
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData range GSmagmom | MagpieData mean GSmagmom | MagpieData avg_dev GSmagmom | MagpieData mode GSmagmom | MagpieData minimum SpaceGroupNumber | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 1.548471 | 0.258079 | 0.430131 | 0.0 | 194.0 | 229.0 | 35.0 | 222.833333 | 9.611111 | 229.0 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 1.548471 | 0.619388 | 0.743266 | 0.0 | 194.0 | 227.0 | 33.0 | 213.400000 | 15.520000 | 194.0 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 210.500000 | 16.500000 | 194.0 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 64.0 | 64.0 | 0.0 | 64.000000 | 0.000000 | 64.0 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 205.000000 | 14.666667 | 194.0 |
5 rows × 141 columns
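Each MagpieData column is a summary statistic of one elemental property over the composition. The sketch below recomputes the GSmagmom statistics for Nb4CoSi and reproduces the values in the output above. The input values are inferred from that output (Co from the range column, Nb and Si taken as 0), and the tie rule for the mode is an assumption; `weighted_stats` is a hypothetical helper, not matminer's code.

```python
def weighted_stats(values, fractions):
    """Fraction-weighted summary statistics of one elemental property."""
    mean = sum(f * v for v, f in zip(values, fractions))
    avg_dev = sum(f * abs(v - mean) for v, f in zip(values, fractions))
    top = max(fractions)
    # On ties in abundance, take the smallest property value (assumed tie rule).
    mode = min(v for v, f in zip(values, fractions) if f == top)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "avg_dev": avg_dev,
        "mode": mode,
    }


# Nb4CoSi: ground-state magnetic moments (GSmagmom) for Nb, Co, Si.
gsmagmom = [0.0, 1.548471, 0.0]
fractions = [4 / 6, 1 / 6, 1 / 6]
stats = weighted_stats(gsmagmom, fractions)
print(stats)
```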
Also note that every featurizer has a citations() function that tells you where to find more information about it.
ep_feat.citations()
['@article{ward_agrawal_choudary_wolverton_2016, title={A general-purpose machine learning framework for predicting properties of inorganic materials}, volume={2}, DOI={10.1038/npjcompumats.2017.28}, number={1}, journal={npj Computational Materials}, author={Ward, Logan and Agrawal, Ankit and Choudhary, Alok and Wolverton, Christopher}, year={2016}}']
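Adding more composition-based features
Besides ElementProperty, many more composition-based featurizers are available in matminer.featurizers.composition. Let's try the OxidationStates featurizer, which requires knowing the oxidation states of the elements in the composition. That information is not in the data yet, but we can use the conversions module to guess the oxidation states and then apply the OxidationStates featurizer to the new column.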
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates
data=CompositionToOxidComposition().featurize_dataframe(data,'composition')
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData mean GSmagmom | MagpieData avg_dev GSmagmom | MagpieData mode GSmagmom | MagpieData minimum SpaceGroupNumber | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | composition_oxid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 0.258079 | 0.430131 | 0.0 | 194.0 | 229.0 | 35.0 | 222.833333 | 9.611111 | 229.0 | (Nb0+, Co0+, Si0+) |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 0.619388 | 0.743266 | 0.0 | 194.0 | 227.0 | 33.0 | 213.400000 | 15.520000 | 194.0 | (Al3+, Co2+, Co3+, Si4-) |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 210.500000 | 16.500000 | 194.0 | (Si4-, Os4+) |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 0.000000 | 0.000000 | 0.0 | 64.0 | 64.0 | 0.0 | 64.000000 | 0.000000 | 64.0 | (Ga0+) |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 205.000000 | 14.666667 | 194.0 | (Si4-, Ru2+) |
5 rows × 142 columns
os_feat=OxidationStates()
data=os_feat.featurize_dataframe(data,'composition_oxid')
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | composition_oxid | minimum oxidation state | maximum oxidation state | range oxidation state | std_dev oxidation state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 229.0 | 35.0 | 222.833333 | 9.611111 | 229.0 | (Nb0+, Co0+, Si0+) | 0 | 0 | 0 | 0.000000 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 227.0 | 33.0 | 213.400000 | 15.520000 | 194.0 | (Al3+, Co2+, Co3+, Si4-) | -4 | 3 | 7 | 3.872983 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 227.0 | 33.0 | 210.500000 | 16.500000 | 194.0 | (Si4-, Os4+) | -4 | 4 | 8 | 5.656854 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 64.0 | 0.0 | 64.000000 | 0.000000 | 64.0 | (Ga0+) | 0 | 0 | 0 | 0.000000 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 227.0 | 33.0 | 205.000000 | 14.666667 | 194.0 | (Si4-, Ru2+) | -4 | 2 | 6 | 4.242641 |
5 rows × 146 columns
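The four oxidation-state columns can be reproduced from the guessed species. The sketch below uses a weighted standard deviation with the correction factor `1 / (1 - sum(w**2))`; this form is an assumption, chosen because it reproduces the values in the output above. The species amounts for Al(CoSi)2 (one Co2+, one Co3+) are assumed from charge balance, and `oxidation_state_stats` is a hypothetical helper.

```python
import math


def oxidation_state_stats(states, amounts):
    """Summary statistics of oxidation states weighted by species amount."""
    total = sum(amounts)
    weights = [a / total for a in amounts]
    mean = sum(w * s for s, w in zip(states, weights))
    spread = sum(w * (s - mean) ** 2 for s, w in zip(states, weights))
    # Correction factor 1 / (1 - sum of squared weights); skipped for a single species.
    norm = 1.0 - sum(w * w for w in weights)
    std_dev = math.sqrt(spread / norm) if len(states) > 1 else 0.0
    return {
        "minimum": min(states),
        "maximum": max(states),
        "range": max(states) - min(states),
        "std_dev": std_dev,
    }


# Al(CoSi)2 guessed as Al3+, Co2+, Co3+, Si4- with amounts 1:1:1:2 (assumed).
al_co_si = oxidation_state_stats([3, 2, 3, -4], [1, 1, 1, 2])
# SiRu2 guessed as Si4-, Ru2+ with amounts 1:2.
si_ru2 = oxidation_state_stats([-4, 2], [1, 2])
# Elemental Ga has a single species with oxidation state 0.
ga = oxidation_state_stats([0], [1])
print(al_co_si, si_ru2, ga)
```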
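Adding some structure-based features
Not all featurizers operate on compositions. Matminer can also analyze crystal structures and featurize them. Let's start by adding some simple density features.
from matminer.featurizers.structure import DensityFeatures
df_feat=DensityFeatures()
data=df_feat.featurize_dataframe(data,'structure')  # takes the structure column as input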
data.head()
| | material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | composition_oxid | minimum oxidation state | maximum oxidation state | range oxidation state | std_dev oxidation state | density | vpa | packing fraction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 9.611111 | 229.0 | (Nb0+, Co0+, Si0+) | 0 | 0 | 0 | 0.000000 | 7.834556 | 16.201654 | 0.688834 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 15.520000 | 194.0 | (Al3+, Co2+, Co3+, Si4-) | -4 | 3 | 7 | 3.872983 | 5.384968 | 12.397466 | 0.644386 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 16.500000 | 194.0 | (Si4-, Os4+) | -4 | 4 | 8 | 5.656854 | 13.968635 | 12.976265 | 0.569426 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 0.000000 | 64.0 | (Ga0+) | 0 | 0 | 0 | 0.000000 | 6.036267 | 19.180359 | 0.479802 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 14.666667 | 194.0 | (Si4-, Ru2+) | -4 | 2 | 6 | 4.242641 | 9.539514 | 13.358418 | 0.598395 |
5 rows × 149 columns
Check which features were just added.
df_feat.feature_labels()
['density', 'vpa', 'packing fraction']
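Two of these three features follow directly from the cell contents and volume. The sketch below recomputes density and vpa (volume per atom) for mp-10003 and lands close to the tabulated 7.834556 and 16.201654. The atomic masses are approximate hardcoded values, the helper name is hypothetical, and packing fraction is left out because it needs atomic radii.

```python
AMU_TO_GRAMS = 1.66053907e-24  # grams per atomic mass unit
CUBIC_ANGSTROM_TO_CM3 = 1e-24  # cm^3 per cubic angstrom

# Approximate standard atomic masses in amu (hardcoded for illustration).
ATOMIC_MASS = {"Nb": 92.906, "Co": 58.933, "Si": 28.085}


def density_and_vpa(cell_contents, cell_volume):
    """Return (density in g/cm^3, volume per atom in cubic angstroms)."""
    n_atoms = sum(cell_contents.values())
    mass_amu = sum(ATOMIC_MASS[el] * n for el, n in cell_contents.items())
    density = (mass_amu * AMU_TO_GRAMS) / (cell_volume * CUBIC_ANGSTROM_TO_CM3)
    return density, cell_volume / n_atoms


# mp-10003: the POSCAR header lists Nb8 Co2 Si2; the volume column gives 194.419802.
density, vpa = density_and_vpa({"Nb": 8, "Co": 2, "Si": 2}, 194.419802)
print(f"density={density:.3f} g/cm^3, vpa={vpa:.3f} A^3/atom")
```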
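Using machine learning models
Defining the input and output data
For now, we use K_VRH (the bulk modulus) as the output.
As input, we use all the features we generated, that is, everything except the output data and non-numeric columns such as composition and structure. The other elastic properties (G_VRH, elastic_anisotropy, poisson_ratio) are dropped as well.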
y=data['K_VRH'].values
del_col=['G_VRH','formula','material_id','elastic_anisotropy','poisson_ratio','composition','composition_oxid','K_VRH','structure']
X=data.drop(del_col,axis=1).values
X.shape
(1181, 140)
Building a linear regression model with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
lr=LinearRegression()
lr.fit(X,y)
LinearRegression()
print('training R2=%.3f'%lr.score(X,y))
training R2=0.928
print('training RMSE:%.3f'%np.sqrt(mean_squared_error(y,lr.predict(X))))
training RMSE:19.587
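For reference, the two metrics printed above can be written out in a few lines. This is a plain-Python sketch of the usual definitions, with made-up numbers rather than the elastic dataset; the function names mirror common usage but are defined locally here.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up targets and predictions, only to exercise the two functions.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
print(f"R2={r2_score(y_true, y_pred):.3f}, RMSE={rmse(y_true, y_pred):.3f}")
```

R2 is dimensionless, while RMSE carries the units of the target, so the two are best read together.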
This looks reasonable, since linear regression is a simple (high-bias) model. But to really verify that we are not overfitting, we need to check the cross-validation score rather than the fitting score.
from sklearn.model_selection import KFold,cross_val_score
# use 10-fold cross-validation
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
scores=cross_val_score(lr,X,y,scoring='neg_mean_squared_error',cv=crossvalidation)
rmse_scores=[np.sqrt(abs(s)) for s in scores]
r2_scores=cross_val_score(lr,X,y,scoring='r2',cv=crossvalidation)
print(' mean R2:%.3f'%np.mean(r2_scores))
mean R2:0.903
print('mean RMSE:%.3f'%np.mean(rmse_scores))
mean RMSE:22.377
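The 10-fold procedure can be sketched without scikit-learn to show what happens in each fold: shuffle once, hold out one fold, fit on the rest, and score on the held-out part. This is a simplified illustration on synthetic one-feature data (an assumption, not the elastic dataset), and `fit_line` and `kfold_rmse` are hypothetical helpers.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kfold_rmse(xs, ys, n_splits=10, seed=1):
    """Shuffle once, split into folds, and score each held-out fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(math.sqrt(sum(errors) / len(errors)))
    return scores


# Synthetic data (assumption): a noisy straight line.
rng = random.Random(0)
xs = [float(i) for i in range(100)]
ys = [2.0 * x + 5.0 + rng.gauss(0.0, 3.0) for x in xs]
scores = kfold_rmse(xs, ys)
print(f"mean CV RMSE over {len(scores)} folds: {sum(scores) / len(scores):.3f}")
```

Every sample is predicted exactly once by a model that never saw it, which is why the cross-validated RMSE is a fairer estimate than the training RMSE.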
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_predict
sns.scatterplot(x=y,y=cross_val_predict(lr,X,y,cv=crossvalidation))
plt.plot(np.arange(0,400,1),np.arange(0,400,1),'r--')
[<matplotlib.lines.Line2D at 0x1e13781cb80>]
Trying a random forest model
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(n_estimators=50,random_state=1)
rf.fit(X,y)
print('training R2=%.3f'%rf.score(X,y))
training R2=0.989
print('training RMSE:%.3f'%np.sqrt(mean_squared_error(y,rf.predict(X))))
training RMSE:7.687
# use 10-fold cross-validation
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
scores=cross_val_score(rf,X,y,scoring='neg_mean_squared_error',cv=crossvalidation)
rmse_scores=[np.sqrt(abs(s)) for s in scores]
r2_scores=cross_val_score(rf,X,y,scoring='r2',cv=crossvalidation)
print(' mean R2:%.3f'%np.mean(r2_scores))
mean R2:0.924
print('mean RMSE:%.3f'%np.mean(rmse_scores))
mean RMSE:19.277
cm = plt.get_cmap('RdYlBu')
plt.scatter(y,cross_val_predict(rf,X,y,cv=crossvalidation),cmap=cm,c=data['poisson_ratio'])
plt.plot(np.arange(0,400,1),np.arange(0,400,1),'r--')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1e1356f21c0>
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
rf_reg = RandomForestRegressor(n_estimators=50, random_state=1)
rf_reg.fit(X_train, y_train)
RandomForestRegressor(n_estimators=50, random_state=1)
print('training R2:%.3f'%rf_reg.score(X_train,y_train))
training R2:0.987
print('training RMSE:%.3f'%np.sqrt(mean_squared_error(y_train,rf_reg.predict(X_train))))
training RMSE:8.259
print('testing R2:%.3f'%rf_reg.score(X_test,y_test))
testing R2:0.942
print('test RMSE:%.3f'%np.sqrt(mean_squared_error(y_test,rf_reg.predict(X_test))))
test RMSE:16.928
sns.histplot(x=y_train-rf_reg.predict(X_train),stat='probability',color='r',label='train')
sns.histplot(x=y_test-rf_reg.predict(X_test),stat='probability',color='b',label='test')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1e135a1a040>
Look at which features the random forest model considers most important.
importances=rf.feature_importances_
inclued=data.drop(del_col,axis=1).columns
inclued
Index(['space_group', 'MagpieData minimum Number', 'MagpieData maximum Number','MagpieData range Number', 'MagpieData mean Number','MagpieData avg_dev Number', 'MagpieData mode Number','MagpieData minimum MendeleevNumber','MagpieData maximum MendeleevNumber','MagpieData range MendeleevNumber',...'MagpieData mean SpaceGroupNumber','MagpieData avg_dev SpaceGroupNumber','MagpieData mode SpaceGroupNumber', 'minimum oxidation state','maximum oxidation state', 'range oxidation state','std_dev oxidation state', 'density', 'vpa', 'packing fraction'],dtype='object', length=140)
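A note on the slice syntax b = a[i:j:s]: i and j are the start and stop indices, and s is the step, which defaults to 1.
So a[i:j:1] is equivalent to a[i:j].
When s<0: if i is omitted it defaults to -1; if j is omitted it defaults to -len(a)-1.
So a[::-1] is equivalent to a[-1:-len(a)-1:-1], that is, a copy of the sequence from the last element to the first.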
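The slicing rule and its use for ranking can be tried on a small list. This is a plain-Python analogue of reversing an argsort result; the importances and feature names below are made up.

```python
importances = [0.10, 0.45, 0.05, 0.40]  # made-up importances for four features
names = ["a", "b", "c", "d"]  # hypothetical feature names

# Reversing with a negative step: both spellings select the same elements.
assert importances[::-1] == importances[-1:-len(importances) - 1:-1]

# Indices sorted by ascending importance, then reversed to get descending order.
ascending = sorted(range(len(importances)), key=lambda i: importances[i])
descending = ascending[::-1]
top_two = [names[i] for i in descending[:2]]
print(descending, top_two)
```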
indices=np.argsort(importances)[::-1]
chart=sns.barplot(x=inclued[indices][0:15],y=importances[indices][0:15],palette='BuGn_r')
chart.set_xticklabels(chart.get_xticklabels(), rotation=90, horizontalalignment='right')
[Text(0, 0, 'MagpieData mean MeltingT'),Text(1, 0, 'vpa'),Text(2, 0, 'MagpieData minimum MeltingT'),Text(3, 0, 'density'),Text(4, 0, 'MagpieData maximum MeltingT'),Text(5, 0, 'MagpieData mean GSvolume_pa'),Text(6, 0, 'MagpieData minimum Column'),Text(7, 0, 'MagpieData maximum GSvolume_pa'),Text(8, 0, 'MagpieData mean Electronegativity'),Text(9, 0, 'MagpieData mode Column'),Text(10, 0, 'MagpieData mode NValence'),Text(11, 0, 'MagpieData mean NUnfilled'),Text(12, 0, 'MagpieData range MendeleevNumber'),Text(13, 0, 'MagpieData minimum MendeleevNumber'),Text(14, 0, 'packing fraction')]
In the random forest model, the most important features are those related to melting temperature and to volume per atom or density.