- 195.50 KB
- 2022-08-29 发布
- 1、本文档由用户上传,淘文库整理发布,可阅读全部内容。
- 2、本文档内容版权归属内容提供方,所产生的收益全部归内容提供方所有。如果您对本文有版权争议,请立即联系网站客服。
- 3、本文档由用户上传,本站不保证质量和数量令人满意,可能有诸多瑕疵,付费之前,请仔细阅读内容确认后进行付费下载。
- 网站客服QQ:403074932
基础统计学BasicStatistics\n数据类型TypesofData数据Data计量型Variable计数型Attribute·离散性Discrete·连续性Continuous名义性序数性OrdinalNominal\n计量值和计数值VariableandAttributeData定量的数据叫做计量值。这些可测量的数据往往是用来回答如“多长”,“多少体积”,“多少时间”此类问题的(Datathatquantitativeiscalledvariabledata.Themeasureddatathatareanswerstoquestionslike”howlong”“whatvolume”&“howmuchtime”.)定性的数据叫做计数值。这些数据是用来回答如“多少”,“多久一次”此类问题(Whiledatathatisqualitativeiscalledattributedata.Counteddataareanswerstothequestionsof“howmany”or“howoften”)例子:上班时间和迟到记录(Example:Travelingtowork)Daytime18.0527.5538.1547.45DayLateOntime1v2v3v4vVariabledataAttributedataDataVariableAttributeDiscreteContinuousOrdinalNominal\n连续性和离散性ContinuousandDiscreteDataVariableAttributeDiscreteContinuousOrdinalNominal连续变量(ContinuousVariable)能够将刻度尺寸划分地更精确(Infinitelydivisiblescaleintodecimalorcontinuum.)数据通过测量获得(Dataobtainedbymeasuring.)数据可在趋势图中展示出来(Datadisplayedintrendchart.)比如:温度,时间和内径直径(E.g.temperature,timeandIDdiameter)离散变量(DiscreteVariable)在某一指定区域内不能再分,数据通过计算获得(cannotbeplottedonaninfinitelydivisiblescale.dataobtainedbycounting.)数据可以通过直方图来表示(DataBarchart.)例如:将缺陷的数目或零件故障的数目细分为1.5便没有任何意义(E.g.subdivisionsarenotmeaningfulasnumberofdefectsornumberofpartfailures1.5)\n名义性和序数性NominalandOrdinalDataVariableAttributeDiscreteContinuousOrdinalNominal名义性(Nominal)直观的,如:男性和女性以及准时和迟到(Categoricale.gMaleandFemale&on-timeandlate.)允许没有次序的计划安排(Noorderingschemeispossible)没有可比较性(Noonevalueisgreaterthananother)例如:一箱盒子包含下面的颜色:(E.g.Aboxofcassettecontainedthefollowingcolors:)绿色(Blue)17黄色(Yellow)11黑色(Black)10序数性(Ordinal)直观和有序的(Categoricalandorderable).如:某种饮料的等级为1到10,如10作为最高等级必然为大部分人们所选.(E.g.Therateofthesoftdrink;1to10,10indicateshighergradingformostpeoplechoose.)又如:产品的缺陷数目被做划分如下:(Productdefectsaretabulatedasfollows)A1’16B132C942\n统计学Statistics统计学是一门讲述通过对数据的收集,陈述,分析,诠释进行一系列处理以用于决策及解决问题的分支科学.(Statisticsisthebranchofsciencethatdealswiththecollection,presentation,analysis&interpretationofdataforthepurposeofdecision-makingandproblem-solving.)统计学作为品质改善上一项重要的技术,可用于描述和理解可变性.(Statisticsisacriticalskillinqualityimprovementasstatisticaltechniquescanbeusedtodescribeandtounderstandvariability.)\n总体和样本(PopulationvsSample)总体(Population)预测量的整体对象范围(theentiresetofmeasurementsofinterest)样本(Sample)来自总体的一个子集(asubsetofdatafromthepopulation)参数(Parameters)代表总体的测量数值(numericalmeasuresofapopulation)统计学(Statistics)代表样本的测量数值(numericalmeasuresofasample)PopulationSampleX,Sμ,σ\n参量和统计(ParametersvsStatistics)ParameterStatistic均值(Mean)方差(Variance)标准偏差(StandardDeviation)\n数据的测量NumericalMeasures描述数组的特性Describesthecharacteristicsofthedataset.主要的数组衡量(Keynumericalmeasures):位置的衡量(中值趋势)measuresoflocation(centraltendency)分散程度的衡量(方差)measuresofdispersion(variation)形状的衡量(分布)measuresofshape(distribution)\n测量的位置(MeasuresofLocation)均值Mean中值Median众值Mode\n均值(Mean)均值是指所观察一组样品的平均值;例子:SSI房10个员工的平均高度计算如下:(Meanisaverageoftheobservationforasampleofsize;n.Example:heightofthe10employeesinSSIroom)Mean,x=1.65+1.68+1.71+1.65+1.67+1.65+1.68+1.62+1.60+1.65=16.56/10=1.656\n中值(Median)中值的优点是不被数组里的最大值或最小值而影响.Theadvantageofthemedianisthatitisnotinfluencedverymuchbyhigherorlowervalues.中值是将一组数据由上升或下降趋势排列后所取的中间数值.如果是一个偶数数组,中值则是由中间两个数据和的平均值得到.(Medianisthemiddlevalueinasetofdatapointssortedeitherordescendingorder.Ifanevennumberofdatapoints,themiddleofthelistishalfwaybtwthe2middledatapoints.)1.601.621.651.651.651.661.671.681.681.71Median=(1.65+1.66)/2=1.655\n均值对中值MeanVsMedian:例1:Example1如有观察数组是:(Ifthesampleobservationsare)1342786那么此数组的均值中值是4.4和4.(Thesamplemeanandmedianare4.4and4respectively.)两个数据都表示出这组数据中心趋势的合理量度.Bothquantitiesgiveareasonablemeasureofthecentraltendencyofthedata.如果最后一个数据的值改变为:(Ifthelastobservationischangedsothatthedataare)1342782450这组数据的均值是353.6而中值没有改变(Thesamplemeanis353.6whilethesamplemedianremainsunchanged).\n众值(Mode)众值是指在一组观察数据中出现频率最多的数值.Modeistheobservationthatitoccursmostfrequentlyvalueinasetofthesample/datapoints.众值是比较独特的,可以是多个,有时也众值也不存在Themodemaybeunique,ortheremaybemorethan1mode.Sometimes,themodemaynotexist.1.601.621.651.651.651.661.671.681.681.71Mode=1.65众值跟中值一样,也不会因为出现一个较大或较小的值而受影响.Asformedian,itisnotinfluencedmuchbyhigherorlowervalue\n众值:例2Mode:Example2如果一组观察数据是:(Ifthesampleobservationsare)3693583463110这组数据的众值是3,因为它出现了4次.(Thesamplemodeis3,sinceitoccursfourtimes.)如果一组观察数据是:(Ifthesampleobservationsare)36935834631106256这组数据的众值是3和6,因为它们都出现了4次Thesamplemodesareat3and6,sincetheybothoccurfourtimes.如果一组观察数据是:Ifthesampleobservationsare1342768这组数据没有众值Thesamplemodedoesnotexist.\n分散的衡量(MeasuresofDispersion)极差Range方差Variance标准偏差StandardDeviation\n1)极差Range是一组观察数组中最大值和最小值的差距Thedifferencebetweenthelargestandthesmallestsampleobservations例如:SSI房10位员工的高度Example:Heightof10employeesinSSI1.601.621.651.651.651.661.671.681.681.71MaxvalueMinvalueRange=1.71–1.60=0.11分散的衡量MeasuresofDispersion\n距差:信息的丢失Range:InformationLoss分析两组观察数据1,3,5,8,9和1,5,5,5,9Considerthetwosamples1,3,5,8,9and1,5,5,5,9.都有着同样的距差(r=8)Bothhavethesamerange(r=8).然而,第二组数据中仅仅是两端的极值不同,而第一组中间的数据差值都相当.However,inthesecondsamplethereisvariabilityonlyinthetwoextremevalues,whileinthefirstsamplethemiddlevaluesvaryconsiderably.当观察数组个数小于10时,与距差相关的信息丢失不会太严重Whenthesampleissmall(n10),theinformationlossassociatedwiththerangeisnottooserious.\n2)方差和标准偏差Variance&StandardDeviation标准偏差—是表现数据相对于数据中心的集中程度StandardDeviation–howcloselythedatapointsareclusteredaroundthemeanvalueinasetofdataSigma,s=√∑(x-μ)2Mean=1.656m1.601.621.651.651.651.661.671.681.681.71n-1分散的衡量MeasuresofDispersion\n2)方差和标准偏差(Variance&StandardDeviation)分散的衡量(MeasuresofDispersion)1.60–1.656=-0.0561.62–1.656=-0.0361.65–1.656=-0.0061.65–1.656=-0.0061.65–1.656=-0.0061.66–1.656=0.0041.67–1.656=0.0141.68–1.656=0.0241.68–1.656=0.0241.71–1.656=0.054StdDeviationσ=√∑(x-μ)2n-11.60–1.656=(-0.056)21.62–1.656=(-0.036)21.65–1.656=(-0.006)21.65–1.656=(-0.006)21.65–1.656=(-0.006)21.66–1.656=(0.004)21.67–1.656=(0.014)21.68–1.656=(0.024)21.68–1.656=(0.024)21.71–1.656=(0.054)21.60–1.656=(-0.056)2+1.62–1.656=(-0.036)2+1.65–1.656=(-0.006)2+1.65–1.656=(-0.006)2+1.65–1.656=(-0.006)2+1.66–1.656=(0.004)2+1.67–1.656=(0.014)2+1.68–1.656=(0.024)2+1.68–1.656=(0.024)2+1.71–1.656=(0.054)2+\n2)方差和标准偏差Variance&StandardDeviation分散的衡量MeasuresofDispersionStdDeviation,σ=∑(x-μ)2n-1StdDeviation,σ=√0.00884/(10-1)=0.0306Variance=σ2=0.0009821.60–1.656=(-0.056)2+1.62–1.656=(-0.036)2+1.65–1.656=(-0.006)2+1.65–1.656=(-0.006)2+1.65–1.656=(-0.006)2+1.66–1.656=(0.004)2+1.67–1.656=(0.014)2+1.68–1.656=(0.024)2+1.68–1.656=(0.024)2+1.71–1.656=(0.054)2+∑(x-u)2=0.00884\n方差和标准偏差Variance&StandardDeviation以两个前面引用的数组为例:ForthetwosamplesquotedearlierSampleA:1,3,5,8,9SampleB:1,5,5,5,9分散的衡量例3MeasuresofDispersion–Example3\n形状的衡量MeasuresofShape偏度Skewness峰度Kurtosis\n偏度Skewness围绕均值分布的不对称程度被称为它的偏度Thedegreeofasymmetryofadistributionarounditsmeanisreferredtoasitsskewness.正偏度是指不对称一方尾部的分布趋向更高值.Positiveskewnessimpliesadistributionwithanasymmetrictailextendingtowardshighervalues.常常也叫做右偏度.Sometimesreferredtoasright-handedskew.负偏度是指不对称一方尾部的分布趋向更低值.Negativeskewnessimpliesadistributionwithanasymmetrictailextendingtowardslowervalues.常常也叫做左偏度.Sometimesreferredtoasleft-handedskew.\n偏度Skewness\n峰度Kurtosis峰度是相对于正态分布而言对频数分布曲线高峰形态尖梢或平阔的测度Kurtosischaracterizestherelativepeakednessorflatnessofadistributioncomparedtoanormal(mesokurtic)distribution.正峰度是指其分布较正态分布的峰尖峭Positivekurtosisindicatesarelativelypeaked(leptokurtic)distributioncomparedtothenormaldistribution.负峰度指其分布较正态分布的峰平阔。Negativekurtosisindicatesarelativelyflat(platykurtic)distributioncomparedtothenormaldistribution.峰度仅与不对称分布有关Kurtosisisrelevantonlyforsymmetricaldistributions.\nKurtosis\n绘图陈述GraphicalPresentations可视性的阐述一系列数据:Visualinterpretationthedataset.一般的绘图工具:(Commongraphicaltools)散布图Scatterplot(X/Yplot)柏拉图ParetoChart时间序列图TimeSeriesDiagram控制图ControlCharts点状图DotPlot直方图Histogram茎叶图Stem&LeafDiagram\n1)X/Y图(散布图)X/Yplots(ScatterPlot)展现两个变量之间的关系showstherelationshipbtw2variablesxyxyxy\n2)柏拉图ParetoDiagram问题或原因中的相对重要部分(Relativeimportanceinproblemorcauses.)在改善问题中帮助集中注意力在需要优先解决的问题上(Helptofocusonthepriorityissuesforimproving.)解决分布为80%的缺陷(Cutofmarkat80%)\n3)时间序列图TimeSeriesDiagram-一个线形图表将数据用时间顺序表示出来Alinegraphthatshowdatainatimesequence.X-轴表示时间X-axisastimeAVGtime\n类似一个时间序列图asatimeseriesplot.但不同的是上控制线和下控制线被加入控制图中,为了监控制程能力或后来的改善。ButthedifferentisonlytheUCLandLCLareaddedintocontrolchart.TomonitortheprocessperformanceorafterimprovementAVGtimeUCLLCL4)控制图ControlCharts\n5)点状图DotPlot来自两个供应商的结合线拉力情况如下:ThepullstrengthofbondingwiresfromtwosuppliersareshownbelowA:16.8516.4017.2116.3516.5217.0416.9617.1516.5916.57B:17.5017.6318.2518.0017.8617.7518.2217.9017.9618.15点状图揭示出A供应商的线材拉力强度较低,但是两组的变异性相同。TheDotPlotrevealsthatwiresfromsupplierAseemstoresultinlowerpullstrength,butthevariabilitywithinbothgroupsisaboutthesame.\n6)直方图Histograma)c)b)SampleSize=100unitsa)Bins=5Width=40b)Bins=9Width=20c)Bins=18Width=10\n7)茎叶图Stem&LeafDiagramStemLeafStemLeafStemLeaf613455661346170113578896*5566t381344788701136f45592357*578896s6813446*8*78870119237t39*57f67s77*889818t38f448s78*8899t239f59s9*注释Note:茎叶图相当于一个更原始的直方图,不会丢失信息细节.TheStem&LeafDiagramcanserveasacrudeHistogram,withoutlosingdetailsoftheinformation.然而茎的数量过多也会导致数据信息的丢失Excessivestemscanleadtoalossofinformation.Considerthefollowingsetofobserveddata:95717075928381778884847965788771648866637873616593