1 COVARIANCE, VARIANCE, AND CORRELATION

This chapter introduces covariance and correlation, two concepts that will prepare the way for the treatment of regression analysis to come. A second and equally important objective is to show you how to manipulate expressions involving sample variance and covariance. Several detailed examples are provided to give you practice. These manipulations are used very extensively in future chapters and it is vital that they become second nature to you. They simplify the mathematics and make the analysis much easier to follow.

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 1 August 1999

1.1 Sample Covariance

Sample covariance is a measure of association between two variables. The concept will be illustrated with a simple example. Table 1.1 shows years of schooling, S, and hourly earnings in 1992, in dollars, Y, for a subset of 20 respondents from the United States National Longitudinal Survey of Youth, the data set which is used for many of the practical illustrations and exercises in this text. S is the highest grade completed, in the case of those who did not go on to college, and 12 plus the number of years of college completed, for the others. Figure 1.1 shows the data plotted as a scatter diagram. You can see that there is a (weak) positive association between the two variables.

TABLE 1.1

Observation    S      Y        Observation    S      Y
     1        15    17.24          11        17    15.38
     2        16    15.00          12        12    12.70
     3         8    14.91          13        12    26.00
     4         6     4.50          14         9     7.50
     5        15    18.00          15        15     5.00
     6        12     6.29          16        12    21.63
     7        12    19.23          17        16    12.10
     8        18    18.69          18        12     5.55
     9        12     7.21          19        12     7.50
    10        20    42.06          20        14     8.00

[Figure 1.1. Hourly earnings and schooling, 20 NLSY respondents. Vertical axis: hourly earnings ($); horizontal axis: highest grade completed.]

The sample covariance statistic enables you to summarize this association with a single number. In general, given n observations on two variables x and y, the sample covariance between x and y is given by

\[
\operatorname{Cov}(x,y) = \frac{1}{n}\left[(x_1-\bar{x})(y_1-\bar{y}) + \dots + (x_n-\bar{x})(y_n-\bar{y})\right] = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \tag{1.1}
\]

Note: In Section 1.4 we will also define the population covariance. To distinguish between the two, we will use Cov(x, y), with a capital C, to refer to the sample covariance and pop.cov(x, y) to refer to the population covariance between x and y. Occasionally it will be convenient to refer to the latter as \(\sigma_{xy}\). This convention is parallel to the one we will use for variance: Var(x) referring to the sample variance, and pop.var(x) and \(\sigma_x^2\) referring to the population variance.

Further note: Some texts define sample covariance, like sample variance, dividing by n - 1 instead of n, for reasons that will be explained in Section 1.5.

The calculation of the sample covariance for S and Y is shown in Table 1.2. We start by calculating the sample means for schooling and earnings, which we will denote \(\bar{S}\) and \(\bar{Y}\). \(\bar{S}\) is 13.250 and \(\bar{Y}\) is 14.225. We then calculate the deviations of S and Y from these means for each individual in the sample (third and fourth columns of the table). Next we calculate the product of the deviations for each individual (fifth column). Finally we calculate the mean of these products, 15.294, and this is the sample covariance.

You will note that in this case the covariance is positive. This is what you would expect. A positive association, as in this example, will be summarized by a positive sample covariance, and a negative association by a negative one.

TABLE 1.2

Observation    S        Y       (S - S̄)   (Y - Ȳ)   (S - S̄)(Y - Ȳ)
     1        15      17.24      1.75      3.016        5.277
     2        16      15.00      2.75      0.775        2.133
     3         8      14.91     -5.25      0.685       -3.599
     4         6       4.50     -7.25     -9.725       70.503
     5        15      18.00      1.75      3.776        6.607
     6        12       6.29     -1.25     -7.935        9.918
     7        12      19.23     -1.25      5.006       -6.257
     8        18      18.69      4.75      4.466       21.211
     9        12       7.21     -1.25     -7.015        8.768
    10        20      42.06      6.75     27.836      187.890
    11        17      15.38      3.75      1.156        4.333
    12        12      12.70     -1.25     -1.525        1.906
    13        12      26.00     -1.25     11.776      -14.719
    14         9       7.50     -4.25     -6.725       28.579
    15        15       5.00      1.75     -9.225      -16.143
    16        12      21.63     -1.25      7.406       -9.257
    17        16      12.10      2.75     -2.125       -5.842
    18        12       5.55     -1.25     -8.675       10.843
    19        12       7.50     -1.25     -6.725        8.406
    20        14       8.00      0.75     -6.225       -4.668
Total        265     284.49                           305.888
Average       13.250  14.225                           15.294

[Figure 1.2. The scatter diagram of Figure 1.1, divided into quadrants A, B, C, and D by lines through the sample means. Vertical axis: hourly earnings ($); horizontal axis: highest grade completed.]

It is worthwhile investigating the reason for this. Figure 1.2 is the same as Figure 1.1, but the scatter of observations has been quartered by vertical and horizontal lines drawn through the points \(\bar{S}\) and \(\bar{Y}\), respectively. The intersection of these lines therefore is the point \((\bar{S}, \bar{Y})\), the point giving mean schooling and mean hourly earnings for the sample. To use a physical analogy, this is the center of gravity of the points representing the observations.

Any point lying in quadrant A is for an individual with above-average schooling and above-average earnings. For such an observation, both \((S-\bar{S})\) and \((Y-\bar{Y})\) are positive, and \((S-\bar{S})(Y-\bar{Y})\) must therefore be positive, so the observation makes a positive contribution to the covariance expression. For example, individual 10, who majored in biology in college and then went to medical school, has 20 years of schooling and her earnings are the equivalent of $42.06 per hour. \((S-\bar{S})\) is 6.75, \((Y-\bar{Y})\) is 27.84, and the product is 187.89.

Next consider quadrant B. Here the individuals have above-average schooling but below-average earnings. \((S-\bar{S})\) is positive, but \((Y-\bar{Y})\) is negative, so \((S-\bar{S})(Y-\bar{Y})\) is negative, and so the contribution to the covariance is negative. Example: Individual 20 completed two years of four-year college majoring in media studies, but then dropped out, and earns only $8.00 per hour working in the office of an automobile repair shop.

In quadrant C, both schooling and earnings are below average, so \((S-\bar{S})\) and \((Y-\bar{Y})\) are both negative, and \((S-\bar{S})(Y-\bar{Y})\) is positive. Example: Individual 4, who was born in Mexico and had only six years of schooling, is a manual worker in a market garden and has very low earnings.

Finally, individuals in quadrant D have above-average earnings despite having below-average schooling, so \((S-\bar{S})\) is negative, \((Y-\bar{Y})\) is positive, \((S-\bar{S})(Y-\bar{Y})\) is therefore negative, and so a negative contribution is made to the covariance. Example: Individual 3 has slightly above-average earnings as a construction laborer, despite only completing elementary school.

Since the sample covariance is simply the average value of \((S-\bar{S})(Y-\bar{Y})\) for the 20 observations, it will be positive if positive contributions from quadrants A and C dominate and negative if the negative ones from quadrants B and D dominate. In other words, the sample covariance will be positive if, as in this example, the scatter is upward-sloping, and negative if the scatter is downward-sloping.

1.2 Some Basic Covariance Rules

There are some rules that follow in a perfectly straightforward way from the definition of covariance, and since they are going to be used many times in future chapters it is worthwhile establishing them immediately:

Covariance Rule 1   If y = v + w, Cov(x, y) = Cov(x, v) + Cov(x, w)

Covariance Rule 2   If y = az, where a is a constant and z is a variable, Cov(x, y) = a Cov(x, z)

Covariance Rule 3   If y = a, where a is a constant, Cov(x, y) = 0

First these rules will be illustrated with an example, and we will verify that they are satisfied; then proofs will be given. As with much of this text, it is much more important to know what the rules mean and how to use them than to be able to prove them, but in fact the proofs are not difficult.

Demonstration and Proof of Rule 1

Suppose that you have the data shown in Table 1.3 for six households: total annual income, x; expenditure on food and clothing, y; expenditure on food, v; and expenditure on clothing, w. y of course is equal to the sum of v and w. Ignore z in the table for the time being.

TABLE 1.3

                          Expenditure                                Second sample:
             Household    on food and    Expenditure   Expenditure   expenditure on
             income       clothing       on food       on clothing   food and clothing
Household    (x)          (y)            (v)           (w)           (z)
    1         3,000        1,100            850           250           220
    2         2,500          850            700           150           170
    3         4,000        1,200            950           250           240
    4         6,000        1,600          1,150           450           320
    5         3,300        1,000            800           200           200
    6         4,500        1,300            950           350           260
Total        23,300        7,050          5,400         1,650         1,410
Average       3,883        1,175            900           275           235

In Table 1.4, \((x-\bar{x})\), \((y-\bar{y})\), \((v-\bar{v})\), and \((w-\bar{w})\) are calculated for each household. Hence, one obtains \((x-\bar{x})(y-\bar{y})\), \((x-\bar{x})(v-\bar{v})\), and \((x-\bar{x})(w-\bar{w})\) for each household. Cov(x, y) is given by the average of \((x-\bar{x})(y-\bar{y})\), which is 266,250. Similarly Cov(x, v) is 157,500 and Cov(x, w) is 108,750. We have verified that Cov(x, y) is the sum of Cov(x, v) and Cov(x, w).

TABLE 1.4

Household   (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)   (v - v̄)   (x - x̄)(v - v̄)   (w - w̄)   (x - x̄)(w - w̄)
    1         -883       -75         66,250          -50         44,167          -25         22,083
    2       -1,383      -325        449,583         -200        276,667         -125        172,917
    3          117        25          2,917           50          5,833          -25         -2,917
    4        2,117       425        899,583          250        529,167          175        370,416
    5         -583      -175        102,083         -100         58,333          -75         43,750
    6          617       125         77,083           50         30,833           75         46,250
Total                             1,597,500                     945,000                     652,500
Average                             266,250                     157,500                     108,750

It is quite easy to see why this must be the case. Consider household i. \((x_i-\bar{x})(y_i-\bar{y})\) is its contribution to the calculation of Cov(x, y). Since \(y_i = v_i + w_i\), and \(\bar{y} = \bar{v} + \bar{w}\),

\[
(x_i-\bar{x})(y_i-\bar{y}) = (x_i-\bar{x})(v_i + w_i - \bar{v} - \bar{w}) = (x_i-\bar{x})(v_i-\bar{v}) + (x_i-\bar{x})(w_i-\bar{w}) \tag{1.2}
\]

so we have shown that the contribution of household i to Cov(x, y) is the sum of its contributions to Cov(x, v) and Cov(x, w). This must be true for all the households and so for the covariance as a whole.

Demonstration and Proof of Rule 2

In Table 1.3 the last column, z, gives the expenditure on food and clothing of a second set of six households. z is in fact exactly one fifth of y in each case. It is supposed that the data for x are the same for the second set. To calculate Cov(x, z), you need data on \((x_i-\bar{x})\), as before, and \((z_i-\bar{z})\) (Table 1.5). From Table 1.5 it can be seen that Cov(x, z) is 53,250, which is exactly one fifth of Cov(x, y). So we have verified that Cov(x, y) = Cov(x, 5z) = 5 Cov(x, z).

TABLE 1.5

Household   (x - x̄)   (z - z̄)   (x - x̄)(z - z̄)
    1         -883       -15         13,245
    2       -1,383       -65         89,895
    3          117         5            585
    4        2,117        85        179,945
    5         -583       -35         20,405
    6          617        25         15,425
Total                               319,500
Average                              53,250

Again, it is easy to see why this is so. Consider household i. Since \(y_i = 5z_i\), and \(\bar{y} = 5\bar{z}\), \((x_i-\bar{x})(y_i-\bar{y})\) is equal to \((x_i-\bar{x})(5z_i-5\bar{z})\) and hence to \(5(x_i-\bar{x})(z_i-\bar{z})\), so the contribution of household i to the calculation of Cov(x, y) is exactly five times its contribution to the calculation of Cov(x, z). The same is true for all the other households. The average value of \((x-\bar{x})(y-\bar{y})\) is therefore exactly five times that of \((x-\bar{x})(z-\bar{z})\), and so Cov(x, y) = 5 Cov(x, z). Generalizing, if y = az (and hence \(\bar{y} = a\bar{z}\)),

\[
\operatorname{Cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(az_i-a\bar{z}) = \frac{a}{n}\sum_{i=1}^{n}(x_i-\bar{x})(z_i-\bar{z}) = a\operatorname{Cov}(x,z) \tag{1.3}
\]

Demonstration and Proof of Rule 3

This is trivial. Suppose that each of the households in the sample contains two adults, and suppose that in a moment of madness you try to calculate the covariance between total income, x, and the number of adults in the household, a. Of course \(a_i = 2\) for all households, so \(\bar{a} = 2\), and \((a_i-\bar{a}) = 0\) for each household. Every product \((x_i-\bar{x})(a_i-\bar{a})\) is therefore zero, and so is their average, Cov(x, a). If you insisted on constructing the usual table, it would look like Table 1.6.

TABLE 1.6

Household      x       a    (x - x̄)   (a - ā)   (x - x̄)(a - ā)
    1        3,000     2      -883        0            0
    2        2,500     2    -1,383        0            0
    3        4,000     2       117        0            0
    4        6,000     2     2,117        0            0
    5        3,300     2      -583        0            0
    6        4,500     2       617        0            0
Total       23,300    12                               0
Average      3,883     2                               0

Further Developments

With these basic rules, you can simplify much more complicated covariance expressions. For example, if a variable y is equal to the sum of three variables u, v, and w,

\[
\operatorname{Cov}(x,y) = \operatorname{Cov}(x,[u+v+w]) = \operatorname{Cov}(x,u) + \operatorname{Cov}(x,[v+w]) \tag{1.4}
\]

using Rule 1 and breaking up y into two parts, u and v + w. Hence

\[
\operatorname{Cov}(x,y) = \operatorname{Cov}(x,u) + \operatorname{Cov}(x,v) + \operatorname{Cov}(x,w) \tag{1.5}
\]

using Rule 1 again.

Another example: If y = a + bz, where a and b are constants and z is a variable,

\[
\begin{aligned}
\operatorname{Cov}(x,y) &= \operatorname{Cov}(x,[a+bz]) \\
&= \operatorname{Cov}(x,a) + \operatorname{Cov}(x,bz) && \text{using Rule 1} \\
&= 0 + \operatorname{Cov}(x,bz) && \text{using Rule 3} \\
&= b\operatorname{Cov}(x,z) && \text{using Rule 2}
\end{aligned}
\tag{1.6}
\]

It is not difficult when you have had a bit of practice.

1.3 Alternative Expression for Sample Covariance

The sample covariance between x and y has been defined as

\[
\operatorname{Cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \frac{1}{n}\left[(x_1-\bar{x})(y_1-\bar{y}) + \dots + (x_n-\bar{x})(y_n-\bar{y})\right] \tag{1.7}
\]

An alternative, and equivalent, expression is

\[
\operatorname{Cov}(x,y) = \left[\frac{1}{n}\sum_{i=1}^{n}x_iy_i\right] - \bar{x}\bar{y} \tag{1.8}
\]

In other words, you can calculate the covariance by calculating \(x_1y_1 + \dots + x_ny_n\), dividing by n, and subtracting \(\bar{x}\bar{y}\). You may find this to be more convenient if you are unfortunate enough to have to calculate a covariance by hand. In practice you will normally perform calculations of this kind using a statistical package on a computer.

For those who are keen, here is a proof that the two expressions are equivalent.

\[
\begin{aligned}
\operatorname{Cov}(x,y) &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \\
&= \frac{1}{n}\sum_{i=1}^{n}(x_iy_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) \\
&= \frac{1}{n}\left[x_1y_1 - x_1\bar{y} - \bar{x}y_1 + \bar{x}\bar{y} + \dots + x_ny_n - x_n\bar{y} - \bar{x}y_n + \bar{x}\bar{y}\right]
\end{aligned}
\tag{1.9}
\]

Adding each column, and using the fact that \(\sum x_i = n\bar{x}\) and \(\sum y_i = n\bar{y}\),

\[
\begin{aligned}
\operatorname{Cov}(x,y) &= \frac{1}{n}\left[\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i - \bar{x}\sum_{i=1}^{n}y_i + n\bar{x}\bar{y}\right] \\
&= \frac{1}{n}\left[\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}\right] \\
&= \frac{1}{n}\left[\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}\right] \\
&= \left[\frac{1}{n}\sum_{i=1}^{n}x_iy_i\right] - \bar{x}\bar{y}
\end{aligned}
\tag{1.10}
\]

Exercises

1.1 In a large bureaucracy the annual salary of each individual, y, is determined by the formula

    y = 10,000 + 500s + 200t

where s is the number of years of schooling of the individual and t is the length of time, in years, of employment. x is the individual's age. Calculate Cov(x, y), Cov(x, s), and Cov(x, t) for the sample of five individuals shown below and verify that

    Cov(x, y) = 500 Cov(x, s) + 200 Cov(x, t).

Explain analytically why this should be the case.

Individual   Age (years)   Years of Schooling   Length of Service   Salary
    1            18               11                    1           15,700
    2            29               14                    6           18,200
    3            33               12                    8           17,600
    4            35               16                   10           20,000
    5            45               12                    5           17,000

1.2* In a certain country the tax paid by a firm, T, is determined by the rule

    T = -1.2 + 0.2P - 0.1I

where P is profits and I is investment, the third term being the effect of an investment incentive. S is sales. All variables are measured in $ millions at annual rates. Calculate Cov(S, T), Cov(S, P), and Cov(S, I) for the sample of four firms shown below and verify that

    Cov(S, T) = 0.2 Cov(S, P) - 0.1 Cov(S, I).

Explain analytically why this should be the case.

Firm   Sales   Profits   Investment   Tax
  1     100      20         10        1.8
  2      50       9          4        0.2
  3      80      12          4        0.8
  4      70      15          6        1.2

1.4 Population Covariance

If x and y are random variables, the expected value of the product of their deviations from their means is defined to be the population covariance, \(\sigma_{xy}\):

\[
\text{pop.cov}(x,y) = \sigma_{xy} = E\{(x-\mu_x)(y-\mu_y)\} \tag{1.11}
\]

where \(\mu_x\) and \(\mu_y\) are the population means of x and y, respectively.

As you would expect, if the population covariance is unknown, the sample covariance will provide an estimate of it, given a sample of observations. Unfortunately the estimate will be biased downwards, for

\[
E\{\operatorname{Cov}(x,y)\} = \frac{n-1}{n}\,\text{pop.cov}(x,y) \tag{1.12}
\]

The reason is that the sample deviations are measured from the sample means of x and y and tend to underestimate the deviations from the true means. Obviously we can construct an unbiased estimator by multiplying the sample estimate by n/(n - 1). A proof of (1.12) will not be given here, but you could construct one yourself using Appendix R.3 as a guide (first read Section 1.5). The rules for population covariance are exactly the same as those for sample covariance, but the proofs will be omitted because they require integral calculus.

If x and y are independent, their population covariance is zero, since then

\[
E\{(x-\mu_x)(y-\mu_y)\} = E(x-\mu_x)\,E(y-\mu_y) = 0 \times 0 \tag{1.13}
\]

by virtue of the independence property noted in the Review and the fact that E(x) and E(y) are equal to \(\mu_x\) and \(\mu_y\), respectively.

1.5 Sample Variance

In the preceding Review the term variance has been used to refer to the population variance. For purposes that will become apparent in the discussion of regression analysis, it will be useful to introduce, with three warnings, the notion of sample variance. For a sample of n observations, \(x_1, \dots, x_n\), the sample variance will be defined as the average squared deviation in the sample:

\[
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \tag{1.14}
\]

The three warnings are:

1. The sample variance, thus defined, is a biased estimator of the population variance. Appendix R.3 demonstrates that \(s^2\), defined as

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
\]

is an unbiased estimator of \(\sigma^2\). It follows that the expected value of Var(x) is \([(n-1)/n]\sigma^2\) and that it is therefore biased downwards. Note that as n becomes large, (n - 1)/n tends to one, so the bias becomes progressively attenuated. It can easily be shown that plim Var(x) is equal to \(\sigma^2\) and hence that it is an example of a consistent estimator that is biased for small samples.

2. Because \(s^2\) is unbiased, some texts prefer to define it as the sample variance and either avoid referring to Var(x) at all or find some other name for it. Unfortunately, there is no generally agreed upon convention on this point. In each text, you must check the definition.

3. Because there is no agreed upon convention, there is no agreed upon notation, and a great many symbols have been pressed into service. In this text the population variance of a variable x is denoted pop.var(x) or, if convenient, \(\sigma_x^2\). If there is no ambiguity concerning the variable in question, the subscript may be dropped. The sample variance will always be denoted Var(x) with a capital V.

Why does the sample variance underestimate the population variance? The reason is that it is calculated as the average squared deviation from the sample mean rather than the true mean. Because the sample mean is automatically in the center of the sample, the deviations from it tend to be smaller than those from the population mean.

1.6 Variance Rules

There are some straightforward and very useful rules for variances, which are counterparts of those for covariance discussed in Section 1.2. They apply equally to sample variance and population variance.

Variance Rule 1   If y = v + w, Var(y) = Var(v) + Var(w) + 2 Cov(v, w)

Variance Rule 2   If y = az, where a is a constant, Var(y) = \(a^2\) Var(z)

Variance Rule 3   If y = a, where a is a constant, Var(y) = 0

Variance Rule 4   If y = v + a, where a is a constant, Var(y) = Var(v)

First, note that the variance of a variable x can be thought of as the covariance of x with itself:

\[
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x}) = \operatorname{Cov}(x,x) \tag{1.15}
\]

In view of this equivalence, we can make use of the covariance rules to establish the variance rules. We are also able to obtain an alternative form for Var(x), making use of (1.8), the alternative form for sample covariance:

\[
\operatorname{Var}(x) = \left[\frac{1}{n}\sum_{i=1}^{n}x_i^2\right] - \bar{x}^2 \tag{1.16}
\]

Proof of Rule 1

If y = v + w,

\[
\begin{aligned}
\operatorname{Var}(y) &= \operatorname{Cov}(y,y) = \operatorname{Cov}(y,[v+w]) \\
&= \operatorname{Cov}(y,v) + \operatorname{Cov}(y,w) && \text{using Covariance Rule 1} \\
&= \operatorname{Cov}([v+w],v) + \operatorname{Cov}([v+w],w) \\
&= \operatorname{Cov}(v,v) + \operatorname{Cov}(w,v) + \operatorname{Cov}(v,w) + \operatorname{Cov}(w,w) && \text{using Covariance Rule 1 again} \\
&= \operatorname{Var}(v) + \operatorname{Var}(w) + 2\operatorname{Cov}(v,w)
\end{aligned}
\tag{1.17}
\]

Proof of Rule 2

If y = az, where a is a constant, using Covariance Rule 2 twice,

\[
\operatorname{Var}(y) = \operatorname{Cov}(y,y) = \operatorname{Cov}(az,y) = a\operatorname{Cov}(z,y) = a\operatorname{Cov}(z,az) = a^2\operatorname{Cov}(z,z) = a^2\operatorname{Var}(z) \tag{1.18}
\]

Proof of Rule 3

If y = a, where a is a constant, using Covariance Rule 3,

\[
\operatorname{Var}(y) = \operatorname{Cov}(a,a) = 0 \tag{1.19}
\]

This is trivial. If y is a constant, its average value is the same constant and \((y-\bar{y})\) is zero for all observations. Hence Var(y) is zero.

Proof of Rule 4

If y = v + a, where v is a variable and a is a constant, using Variance Rule 1,

\[
\operatorname{Var}(y) = \operatorname{Var}(v+a) = \operatorname{Var}(v) + \operatorname{Var}(a) + 2\operatorname{Cov}(v,a) = \operatorname{Var}(v) \tag{1.20}
\]

since Var(a) is zero by Variance Rule 3 and Cov(v, a) is zero by Covariance Rule 3.

Population variance obeys the same rules, but again the proofs are omitted because they require integral calculus.

Exercises

1.3 Using the data in Exercise 1.1, calculate Var(y), Var(s), Var(t), and Cov(s, t) and verify that

    Var(y) = 250,000 Var(s) + 40,000 Var(t) + 200,000 Cov(s, t),

explaining analytically why this should be the case.

1.4* Using the data in Exercise 1.2, calculate Var(T), Var(P), Var(I), and Cov(P, I), and verify that

    Var(T) = 0.04 Var(P) + 0.01 Var(I) - 0.04 Cov(P, I),

explaining analytically why this should be the case.

1.7 Population Variance of the Sample Mean

If two variables are independent (and hence their population covariance is zero), the population variance of their sum is equal to the sum of their population variances:

\[
\text{pop.var}(x+y) = \text{pop.var}(x) + \text{pop.var}(y) + 2\,\text{pop.cov}(x,y) = \text{pop.var}(x) + \text{pop.var}(y) = \sigma_x^2 + \sigma_y^2 \tag{1.21}
\]

This result can be extended to obtain the general rule that the population variance of the sum of any number of mutually independent variables is equal to the sum of their variances, and one is able to show that, if a random variable x has variance \(\sigma^2\), the population variance of the sample mean, \(\bar{x}\), will be equal to \(\sigma^2/n\), where n is the number of observations in the sample, provided that the observations are generated independently.

\[
\begin{aligned}
\text{pop.var}(\bar{x}) &= \text{pop.var}\left(\frac{x_1 + \dots + x_n}{n}\right) \\
&= \frac{1}{n^2}\,\text{pop.var}(x_1 + \dots + x_n) \\
&= \frac{1}{n^2}\left[\text{pop.var}(x_1) + \dots + \text{pop.var}(x_n)\right] \\
&= \frac{1}{n^2}\left[\sigma^2 + \dots + \sigma^2\right] = \frac{1}{n^2}\left[n\sigma^2\right] \\
&= \frac{\sigma^2}{n}
\end{aligned}
\tag{1.22}
\]

As we have seen in the Review, the sample mean is the most efficient unbiased estimator of the population mean provided that the observations are independently drawn from the same distribution.

1.8 The Correlation Coefficient

In this chapter a lot of attention has been given to covariance. This is because it is very convenient mathematically, not because it is a particularly good measure of association. We shall discuss its deficiencies in this respect in Section 1.9. A much more satisfactory measure is its near-relative, the correlation coefficient.

Like variance and covariance, the correlation coefficient comes in two forms, population and sample. The population correlation coefficient is traditionally denoted \(\rho\), the Greek letter that is the equivalent of "r", and pronounced "row", as in row a boat. For variables x and y it is defined by

\[
\rho_{x,y} = \frac{\text{pop.cov}(x,y)}{\sqrt{\text{pop.var}(x)\,\text{pop.var}(y)}} = \frac{\sigma_{xy}}{\sqrt{\sigma_x^2\sigma_y^2}} \tag{1.23}
\]

If x and y are independent, \(\rho_{x,y}\) will be equal to zero because the population covariance will be zero. If there is a positive association between them, \(\sigma_{xy}\), and hence \(\rho_{x,y}\), will be positive. If there is an exact positive linear relationship, \(\rho_{x,y}\) will assume its maximum value of 1. Similarly, if there is a negative relationship, \(\rho_{x,y}\) will be negative, with minimum value of -1.

The sample correlation coefficient, r, is defined by replacing the population covariance and variances in (1.23) by their unbiased estimators. We have seen that these may be obtained by multiplying the sample variances and covariances by n/(n - 1). Hence

\[
r_{x,y} = \frac{\dfrac{n}{n-1}\operatorname{Cov}(x,y)}{\sqrt{\dfrac{n}{n-1}\operatorname{Var}(x)\,\dfrac{n}{n-1}\operatorname{Var}(y)}} \tag{1.24}
\]

The factors n/(n - 1) cancel, so we can conveniently define the sample correlation by

\[
r_{x,y} = \frac{\operatorname{Cov}(x,y)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(y)}} \tag{1.25}
\]

Like \(\rho\), r has maximum value 1, which is attained when there is a perfect positive association between the sample values of x and y (when you plot the scatter diagram, the points lie exactly on an upward-sloping straight line). Similarly, it has minimum value -1, attained when there is a perfect negative association (the points lying exactly on a downward-sloping straight line). A value of zero indicates that there is no linear association between the observations on x and y in the sample. Of course the fact that r = 0 does not necessarily imply that \(\rho\) = 0 or vice versa.

Illustration

We will use the education and earnings example in Section 1.1 to illustrate the calculation of the sample correlation coefficient. The data are given in Table 1.1 and they are plotted in Figure 1.1. We have already calculated Cov(S, Y) in Table 1.2, equal to 15.294, so we now need only Var(S) and Var(Y), calculated in Table 1.7. From the averages of the squared-deviation columns of Table 1.7, you can see that Var(S) is 10.888 and Var(Y) is 77.108. Hence

\[
r_{S,Y} = \frac{15.294}{\sqrt{10.888 \times 77.108}} = \frac{15.294}{28.975} = 0.53 \tag{1.26}
\]

TABLE 1.7

Observation    S        Y       (S - S̄)   (Y - Ȳ)   (S - S̄)²    (Y - Ȳ)²    (S - S̄)(Y - Ȳ)
     1        15      17.24      1.75      3.016      3.063       9.093         5.277
     2        16      15.00      2.75      0.775      7.563       0.601         2.133
     3         8      14.91     -5.25      0.685     27.563       0.470        -3.599
     4         6       4.50     -7.25     -9.725     52.563      94.566        70.503
     5        15      18.00      1.75      3.776      3.063      14.254         6.607
     6        12       6.29     -1.25     -7.935      1.563      62.956         9.918
     7        12      19.23     -1.25      5.006      1.563      25.055        -6.257
     8        18      18.69      4.75      4.466     22.563      19.941        21.211
     9        12       7.21     -1.25     -7.015      1.563      49.203         8.768
    10        20      42.06      6.75     27.836     45.563     774.815       187.890
    11        17      15.38      3.75      1.156     14.063       1.335         4.333
    12        12      12.70     -1.25     -1.525      1.563       2.324         1.906
    13        12      26.00     -1.25     11.776      1.563     138.662       -14.719
    14         9       7.50     -4.25     -6.725     18.063      45.219        28.579
    15        15       5.00      1.75     -9.225      3.063      85.091       -16.143
    16        12      21.63     -1.25      7.406      1.563      54.841        -9.257
    17        16      12.10      2.75     -2.125      7.563       4.514        -5.842
    18        12       5.55     -1.25     -8.675      1.563      75.247        10.843
    19        12       7.50     -1.25     -6.725      1.563      45.219         8.406
    20        14       8.00      0.75     -6.225      0.563      38.744        -4.668
Total        265     284.49                         217.750   1,542.150       305.888
Average       13.250  14.225                         10.888      77.108        15.294

Exercises

1.5 In the years following the Second World War, the economic growth of those countries which had suffered the greatest destruction, Germany and Japan, was greater than that of most other industrialized countries. Various hypotheses were offered to explain this. Nicholas Kaldor, a Hungarian economist, argued that the countries which had suffered the worst devastation had had to invest comprehensively in new plant and equipment. Because they were using up-to-date technology, their marginal costs were lower than those of their competitors in export markets, and they gained market share. Because they gained market share, they needed to increase their productive capacity and this meant additional investment, further lowering their marginal costs and increasing their market share. Meanwhile those countries that had suffered least, like the U.S. and the U.K., had less need to re-invest. As a consequence the same process worked in the opposite direction. Their marginal costs were relatively high, so they lost market share and had less need to increase capacity. As evidence for this hypothesis, Kaldor showed that there was a high correlation between the output growth rate, x, and the productivity growth rate, p, in the manufacturing sectors in the 12 countries listed below.

When a critic pointed out that it was inevitable that x and p would be highly correlated, irrespective of the validity of this hypothesis, Kaldor proposed a variation on his hypothesis. Economic growth was initially high in all countries for a few years after the war, but in some, particularly
the U.S. and the U.K., it was soon choked back by a shortage of labor, and a negative cycle took hold. In others, like Germany and Japan, where agriculture still accounted for a large share of employment, the manufacturing sector could continue to grow by attracting workers from the agricultural sector, and they would then have an advantage. A positive correlation between the growth rate of employment, e, and that of productivity would be evidence in favor of his hypothesis.

The table below reproduces his data set, which relates to the period 1953-1954 to 1963-1964 (annual exponential growth rates). Plot a scatter diagram and calculate the sample correlation coefficient for e and p. [If you are not able to use a spreadsheet application for this purpose, you are strongly advised to use equations (1.8) and (1.16) for the sample covariance and variance and to keep a copy of your calculation, as this will save you time with another exercise in Chapter 2.] Comment on your findings.

Annual Growth Rates (%)

                 Employment   Productivity
Austria             2.0           4.2
Belgium             1.5           3.9
Canada              2.3           1.3
Denmark             2.5           3.2
France              1.9           3.8
Italy               4.4           4.2
Japan               5.8           7.8
Netherlands         1.9           4.1
Norway              0.5           4.4
West Germany        2.7           4.5
U.K.                0.6           2.8
U.S.                0.8           2.6

1.6 Suppose that the observations on two random variables x and y lie on a straight line

    y = a + bx

Demonstrate that Cov(x, y) = b Var(x) and that Var(y) = \(b^2\) Var(x), and hence that the sample correlation coefficient is equal to 1 if the slope of the line is positive, -1 if it is negative.

1.7* Suppose that a variable y is defined by the exact linear relationship

    y = a + bx

and suppose that a sample of observations has been obtained for x, y, and a third variable, z. Show that the sample correlation coefficient for y and z must be the same as that for x and z, if b is positive.

1.9 Why the Covariance is Not a Good Measure of Association

The correlation coefficient is a much better measure of association than the covariance, the main reason being that the covariance depends on the units in which the variables x and y happen to be measured, whereas the correlation coefficient does not. This will be demonstrated for the sample concepts; the proof for the population concepts will be left as an exercise.

Returning to the schooling and earnings example, we will investigate what happens when hourly earnings are measured in cents rather than dollars. The covariance will be affected, but the correlation coefficient will not.

We will denote the revised earnings data by Y'. The data for S and Y' are shown in Table 1.8. Of course the data for Y' are just the data for Y in Table 1.2, multiplied by 100. As a consequence, the average value of Y' in the sample is 100 times as large as the average value of Y. When we come to calculate the earnings deviations \((Y'-\bar{Y}')\), these are 100 times those in Table 1.2 because \((Y'-\bar{Y}') = (100Y-100\bar{Y}) = 100(Y-\bar{Y})\). Hence the products \((S-\bar{S})(Y'-\bar{Y}')\) are 100 times those in Table 1.2 and the sample covariance, 1529.4, is 100 times that obtained when hourly earnings were measured in dollars. However, the correlation coefficient is unaffected. The correlation coefficient for S and Y' is

\[
r_{S,Y'} = \frac{\operatorname{Cov}(S,Y')}{\sqrt{\operatorname{Var}(S)\operatorname{Var}(Y')}} = \frac{1529.4}{\sqrt{10.888 \times 771080}} = 0.53 \tag{1.27}
\]

TABLE 1.8

Observation    S       Y'      (S - S̄)   (Y' - Ȳ')   (S - S̄)²    (Y' - Ȳ')²    (S - S̄)(Y' - Ȳ')
     1        15     1724       1.75       301.6       3.063        90930          527.7
     2        16     1500       2.75        77.5       7.563         6010          213.3
     3         8     1491      -5.25        68.5      27.563         4700         -359.9
     4         6      450      -7.25      -972.5      52.563       945660         7050.3
     5        15     1800       1.75       377.6       3.063       142540          660.7
     6        12      629      -1.25      -793.5       1.563       629560          991.8
     7        12     1923      -1.25       500.6       1.563       250550         -625.7
     8        18     1869       4.75       446.6      22.563       199410         2121.1
     9        12      721      -1.25      -701.5       1.563       492030          876.8
    10        20     4206       6.75      2783.6      45.563      7748150        18789.0
    11        17     1538       3.75       115.6      14.063        13350          433.3
    12        12     1270      -1.25      -152.5       1.563        23240          190.6
    13        12     2600      -1.25      1177.6       1.563      1386620        -1471.9
    14         9      750      -4.25      -672.5      18.063       452190         2857.9
    15        15      500       1.75      -922.5       3.063       850910        -1614.3
    16        12     2163      -1.25       740.6       1.563       548410         -925.7
    17        16     1210       2.75      -212.5       7.563        45140         -584.2
    18        12      555      -1.25      -867.5       1.563       752470         1084.3
    19        12      750      -1.25      -672.5       1.563       452190          840.6
    20        14      800       0.75      -622.5       0.563       387440         -466.8
Total        265    28449                            217.750     15421500        30588.8
Average       13.250 1422.5                           10.888       771080         1529.4

The numerator (the top half of the fraction) has been multiplied by 100, but so has the denominator (the bottom half), since Var(Y') is \(100^2\) Var(Y). (Remember that, when you multiply a variable by a constant, you multiply its variance by the constant squared.) The denominator is multiplied by 100, rather than \(100^2\), because Var(Y') is under a square root.

Exercise

1.8 Demonstrate that, in general, the sample correlation coefficient is not affected by a change in the unit of measurement of one of the variables.

2 SIMPLE REGRESSION ANALYSIS

This chapter shows how a hypothetical linear relationship between two variables can be quantified using appropriate data. The principle of least squares regression analysis is explained, and expressions for the coefficients are derived.

Most students taking an introductory course will already have taken a basic calculus course and should have no trouble following the derivations of the regression coefficients. Those who have not should skip Section 2.3 and the proof in Section 2.5. They will then have to take the expressions on trust, but they should still be able to understand in general terms how the expressions have been derived.

2.1 The Simple Linear Model

The correlation coefficient may indicate that two variables are associated with one another, but it does not give any idea of the kind of relationship involved. We will now take the investigation a step further in those cases for which we are willing to hypothesize that one variable depends on another.

It must be stated immediately that one would not expect to find an exact relationship between any two economic variables, unless it is true as a matter of definition. In textbook expositions of economic theory, the usual way of dealing with this awkward fact is to write down the relationship as if it were exact and to warn the reader that it is really only an approximation. In statistical analysis, however, one generally acknowledges the fact that the relationship is not exact by explicitly including in it a random factor known as the disturbance term.

We shall start with the simplest possible model:

\[
y = \alpha + \beta x + u \tag{2.1}
\]

y, described as the dependent variable, has two components: (1) the nonrandom component \(\alpha + \beta x\), x being described as the explanatory (or independent) variable, and the fixed quantities \(\alpha\) and \(\beta\) as the parameters of the equation, and (2) the disturbance term, u.

Figure 2.1 illustrates how these two components combine to determine y. \(x_1\), \(x_2\), \(x_3\), and \(x_4\) are four hypothetical values of the explanatory variable. If the relationship between y and x were exact, the corresponding values of y would be represented by the points \(Q_1\) to \(Q_4\) on the line. The disturbance term causes the actual values of y to be different. In the diagram, the disturbance term has been assumed to be positive in the first and fourth observations and negative in the other two, with the result that, if one plots the actual values of y
against the values of x, one obtains the points \(P_1\) to \(P_4\).

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 23.09.99.

[Figure 2.1. True relationship between y and x]

It must be emphasized that in practice the P points are all one can see of Figure 2.1. The actual values of \(\alpha\) and \(\beta\), and hence the location of the Q points, are unknown, as are the values of the disturbance term in the observations. The task of regression analysis is to obtain estimates of \(\alpha\) and \(\beta\), and hence the location of the line, given the P points.

It should be obvious that the smaller the values of u, the easier this task will be. Indeed, if the disturbance term did not exist at all, the P points would coincide with the Q points and show exactly where the line is located. One could merely plot it and read off the values of \(\alpha\) and \(\beta\).

Why does the disturbance term exist at all? There are several reasons.

1. Omission of explanatory variables: The relationship between y and x is almost certain to be a great simplification. In reality there will be other factors affecting y, which have been left out of (2.1), and their influence will cause the points to lie off the line. It often happens that there are variables that you would like to include in the regression equation but cannot because you are unable to measure them. For example, later on in this chapter we will fit an earnings function relating hourly earnings to years of schooling. We know very well that schooling is not the only determinant of earnings and eventually we will improve the model by including other variables like years of work experience. However, even the best-specified earnings function accounts for at most half of the variation in earnings. Many other factors affect the chances of obtaining a good job, like the unmeasurable attributes of an individual, and even pure luck in the sense of the individual finding a job which is a good match for his or her attributes. All of these other factors contribute to the disturbance term.

2. Aggregation of variables: In many cases the relationship is an attempt to summarize in aggregate a number of microeconomic relationships. For example, the aggregate consumption function is an attempt to summarize a set of individual expenditure decisions. Since the individual relationships are likely to have different parameters, any attempt to relate aggregate expenditure to aggregate income can only be an approximation. The discrepancy is attributed to the disturbance term.

3. Model misspecification: The model may be misspecified in terms of its structure. Just to give one of the many possible examples, if the relationship refers to time series data, the value of y may depend not on the actual value of x but on the value that had been anticipated in the previous period. If the anticipated and actual values are closely related, there will appear to be a relationship between y and x, but it will only be an approximation, and again the disturbance term will pick up the discrepancy.

4. Functional misspecification: The functional relationship between y and x may be misspecified mathematically. For example, the true relationship may be nonlinear instead of linear. We will consider the fitting of nonlinear relationships in Chapter 4. Obviously, one should try to avoid this problem by using an appropriate mathematical specification, but even the most sophisticated specification is likely to be only an approximation, and the discrepancy contributes to the disturbance term.

5. Measurement error: If the measurement of one or more of the variables in the relationship is subject to error, the observed values will not appear to conform to an exact relationship, and the discrepancy contributes to the disturbance term.

The disturbance term is the collective outcome of all these factors. Obviously, if you were concerned only with measuring the effect of x on y, it would be much more convenient if the disturbance term did not exist. Were it not for its presence, you would know that every change in y from observation to observation was due to a change in x, and you would be able to calculate \(\alpha\) and \(\beta\) exactly. However, in fact, part of each change in y is due to a change in u, and this makes life difficult. For this reason, u is sometimes described as noise.

2.2 Least Squares Regression

Suppose that you are given the four observations on x and y represented in Figure 2.1 and you are asked to deduce the values of \(\alpha\) and \(\beta\) in equation (2.1). As a rough approximation, you could do this by plotting the four P points and drawing a line to fit them as best you can. This has been done in Figure 2.2. The intersection of the line with the y-axis provides an estimate of the intercept \(\alpha\), which will be denoted a, and the slope provides an estimate of \(\beta\), which will be denoted b. The fitted line will be written

\[
\hat{y} = a + bx \tag{2.2}
\]

the caret mark over y indicating that it is the fitted value of y corresponding to x, not the actual value. In Figure 2.3, the fitted points are represented by the points \(R_1\) to \(R_4\).

[Figure 2.2. Fitted line]

[Figure 2.3. Fitted regression line showing residuals]

One thing that should be accepted from the beginning is that you can never discover the true values of \(\alpha\) and \(\beta\), however much care you take in drawing the line. a and b are only estimates, and they may be good or bad. Once in a while your estimates may be absolutely accurate, but this can only be by coincidence, and even then you will have no way of knowing that you have hit the bull's-eye exactly.

This remains the case even when you use more sophisticated techniques. Drawing a regression line by eye is all very well, but it leaves a lot to subjective judgement. Furthermore, as will become obvious, it is not even possible when you have a variable y depending on two or more explanatory variables instead of only one. The question arises, is there a way of calculating good estimates of \(\alpha\) and \(\beta\) algebraically?

The first step is to define what is known as a residual for each observation. This is the difference between the actual value of y in any observation and the fitted value given by the regression line, that is, the vertical distance between \(P_i\) and \(R_i\) in observation i. It will be denoted \(e_i\):

\[
e_i = y_i - \hat{y}_i \tag{2.3}
\]

The residuals for the four observations are shown in Figure 2.3. Substituting (2.2) into (2.3), we obtain

\[
e_i = y_i - a - bx_i \tag{2.4}
\]

and hence the residual in each observation depends on our choice of a and b. Obviously, we wish to fit the regression line, that is, choose a and b, in such a way as to make the residuals as small as possible. Equally obviously, a line that fits some observations well will fit others badly and vice versa. We need to devise a criterion of fit that takes account of the size of all the residuals simultaneously.

There are a number of possible criteria, some of which work better than others. It is useless minimizing the sum of the residuals, for example. The sum will automatically be equal to zero if you make a equal to \(\bar{y}\) and b equal to zero,
obtaining the horizontal line $y = \bar{y}$. The positive residuals will then exactly balance the negative ones, but the fit may be appalling.

One way of overcoming the problem is to minimize S, the sum of the squares of the residuals. For Figure 2.3,

\[ S = e_1^2 + e_2^2 + e_3^2 + e_4^2 \tag{2.5} \]

The smaller one can make S, the better is the fit, according to this criterion. If one could reduce S to zero, one would have a perfect fit, for this would imply that all the residuals are equal to zero. The line would go through all the points, but of course in general the disturbance term makes this impossible.

There are other quite reasonable solutions, but the least squares criterion yields estimates of $\alpha$ and $\beta$ that are unbiased and the most efficient of their type, provided that certain conditions are satisfied. For this reason, the least squares technique is far and away the most popular in uncomplicated applications of regression analysis. The form used here is usually referred to as ordinary least squares and abbreviated OLS. Variants designed to cope with particular problems will be discussed later in the text.

2.3 Least Squares Regression: Two Examples

Example 1

First, a very simple example indeed, with only two observations, just to show the mechanics working. y is observed to be equal to 3 when x is equal to 1; y is equal to 5 when x is equal to 2, as shown in Figure 2.4.

[Figure 2.4. Two-observation example]

TABLE 2.1

    x    y    $\hat{y}$    e
    1    3    a + b        3 - a - b
    2    5    a + 2b       5 - a - 2b

We shall assume that the true model is

\[ y = \alpha + \beta x + u \tag{2.6} \]

and we shall estimate the coefficients a and b of the equation

\[ \hat{y} = a + bx. \tag{2.7} \]

Obviously, when there are only two observations, we can obtain a perfect fit by drawing the regression line through the two points, but we shall pretend that we have not realized this. Instead we shall arrive at this conclusion by using the regression technique.

When x is equal to 1, $\hat{y}$ is equal to (a + b), according to the regression line. When x is equal to 2, $\hat{y}$ is equal to (a + 2b). Therefore, we can set up Table 2.1. The value of $\hat{y}_1$ (the y value of the point R1 in Figure 2.3) is (a + b), and the value of $\hat{y}_2$ is (a + 2b). So the residual for the first observation, $e_1$, which is given by $(y_1 - \hat{y}_1)$, is equal to (3 - a - b), and $e_2$, given by $(y_2 - \hat{y}_2)$, is equal to (5 - a - 2b). Hence

\begin{align*}
S &= (3 - a - b)^2 + (5 - a - 2b)^2 \\
  &= 9 + a^2 + b^2 - 6a - 6b + 2ab \\
  &\quad + 25 + a^2 + 4b^2 - 10a - 20b + 4ab \\
  &= 34 + 2a^2 + 5b^2 - 16a - 26b + 6ab \tag{2.8}
\end{align*}

Now we want to choose a and b to minimize S. To do this, we use calculus and find the values of a and b that satisfy

\[ \frac{\partial S}{\partial a} = 0 \quad \text{and} \quad \frac{\partial S}{\partial b} = 0 \tag{2.9} \]

Taking partial differentials,

\[ \frac{\partial S}{\partial a} = 4a + 6b - 16 \tag{2.10} \]

and

\[ \frac{\partial S}{\partial b} = 10b + 6a - 26 \tag{2.11} \]

and so we have

\[ 2a + 3b - 8 = 0 \tag{2.12} \]

and

\[ 3a + 5b - 13 = 0 \tag{2.13} \]

Solving these two equations, we obtain a = 1 and b = 2, and hence the regression equation

\[ \hat{y} = 1 + 2x \tag{2.14} \]

Just to check that we have come to the right conclusion, we shall calculate the residuals:

\[ e_1 = 3 - a - b = 3 - 1 - 2 = 0 \tag{2.15} \]
\[ e_2 = 5 - a - 2b = 5 - 1 - 4 = 0 \tag{2.16} \]

Thus both the residuals are zero, implying that the line passes exactly through both points, which of course we knew from the beginning. If you have only two observations, you draw a line through them; there is no need to bother with regression analysis.

Example 2

We shall take the example in the previous section and add a third observation: y is equal to 6 when x is equal to 3. The three observations, shown in Figure 2.5, do not lie on a straight line, so it is impossible to obtain a perfect fit. We will use least squares regression analysis to calculate the position of the line. We start with the standard equation

\[ \hat{y} = a + bx. \tag{2.17} \]

For values of x equal to 1, 2, and 3, this gives fitted values of y equal to (a + b), (a + 2b), and (a + 3b), respectively, and one has Table 2.2.

[Figure 2.5. Three-observation example]

TABLE 2.2

    x    y    $\hat{y}$    e
    1    3    a + b        3 - a - b
    2    5    a + 2b       5 - a - 2b
    3    6    a + 3b       6 - a - 3b

Hence

\begin{align*}
S &= (3 - a - b)^2 + (5 - a - 2b)^2 + (6 - a - 3b)^2 \\
  &= 9 + a^2 + b^2 - 6a - 6b + 2ab \\
  &\quad + 25 + a^2 + 4b^2 - 10a - 20b + 4ab \\
  &\quad + 36 + a^2 + 9b^2 - 12a - 36b + 6ab \\
  &= 70 + 3a^2 + 14b^2 - 28a - 62b + 12ab \tag{2.18}
\end{align*}

The first-order conditions $\partial S / \partial a = 0$ and $\partial S / \partial b = 0$ give us

\[ 6a + 12b - 28 = 0 \tag{2.19} \]

and

\[ 12a + 28b - 62 = 0 \tag{2.20} \]

Solving these two equations, one obtains a = 1.67, b = 1.50. The regression equation is therefore

\[ \hat{y} = 1.67 + 1.50x \tag{2.21} \]

The three points and the regression line are shown in Figure 2.6.

[Figure 2.6. Three-observation example with regression line]

2.4 Least Squares Regression with One Independent Variable

We shall now consider the general case where there are n observations on two variables x and y and, supposing y to depend on x, we will fit the equation

\[ \hat{y} = a + bx. \tag{2.22} \]

The fitted value of the dependent variable in observation i, $\hat{y}_i$, will be $(a + b x_i)$, and the residual $e_i$ will be $(y_i - a - b x_i)$. We wish to choose a and b so as to minimize the sum of the squares of the residuals, S, given by

\[ S = e_1^2 + \dots + e_n^2 = \sum e_i^2 \tag{2.23} \]

We will find that S is minimized when

\[ b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \tag{2.24} \]

and

\[ a = \bar{y} - b\bar{x} \tag{2.25} \]

The derivation of the expressions for a and b will follow the same procedure as the derivation in the two preceding examples, and you should compare the general version with the examples at each step. We will begin by expressing the square of the residual in observation i in terms of a, b, and the data on x and y:

\begin{align*}
e_i^2 &= (y_i - \hat{y}_i)^2 = (y_i - a - b x_i)^2 \\
      &= y_i^2 + a^2 + b^2 x_i^2 - 2a y_i - 2b x_i y_i + 2ab x_i \tag{2.26}
\end{align*}

Summing over all the n observations, we can write S as

\begin{align*}
S &= (y_1 - a - b x_1)^2 + \dots + (y_n - a - b x_n)^2 \\
  &= y_1^2 + a^2 + x_1^2 b^2 - 2 y_1 a - 2 x_1 y_1 b + 2 x_1 ab \\
  &\quad + y_2^2 + a^2 + x_2^2 b^2 - 2 y_2 a - 2 x_2 y_2 b + 2 x_2 ab \\
  &\quad + \dots \\
  &\quad + y_n^2 + a^2 + x_n^2 b^2 - 2 y_n a - 2 x_n y_n b + 2 x_n ab \\
  &= \sum y_i^2 + n a^2 + b^2 \sum x_i^2 - 2a \sum y_i - 2b \sum x_i y_i + 2ab \sum x_i \tag{2.27}
\end{align*}

Note that S is effectively a quadratic expression in a and b, with numerical coefficients determined by the data on x and y in the sample. We can influence the size of S only through our choice of a and b. The data on x and y, which determine the locations of the observations in the scatter diagram, are fixed once we have taken the sample. The equation is the generalized version of equations (2.8) and (2.18) in the two examples.

The first-order conditions for a minimum, $\partial S / \partial a = 0$ and $\partial S / \partial b = 0$, yield the following equations:

\[ 2an - 2\sum y_i + 2b \sum x_i = 0 \tag{2.28} \]
\[ 2b \sum x_i^2 - 2 \sum x_i y_i + 2a \sum x_i = 0 \tag{2.29} \]

These equations are known as the normal equations for the regression coefficients and are generalized versions of (2.12) and (2.13) in the first example and (2.19) and (2.20) in the second. Equation (2.28) allows us to write a in terms of $\bar{y}$, $\bar{x}$, and the as yet unknown b. Noting that $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$, (2.28) may be rewritten

\[ 2an - 2n\bar{y} + 2bn\bar{x} = 0 \tag{2.30} \]

and hence

\[ a = \bar{y} - b\bar{x}. \tag{2.31} \]

Substituting for a in (2.29), and again noting that $\sum x_i$ is equal to $n\bar{x}$, we obtain

\[ 2b \sum x_i^2 - 2 \sum x_i y_i + 2(\bar{y} - b\bar{x}) n\bar{x} = 0 \tag{2.32} \]

Separating the terms involving b and not involving b on opposite sides of the equation, we have

\[ 2b \left[ \sum x_i^2 - n\bar{x}^2 \right] = 2 \sum x_i y_i - 2n\bar{x}\bar{y} \tag{2.33} \]

Dividing both sides by 2n,

\[ \left[ \frac{1}{n} \sum x_i^2 - \bar{x}^2 \right] b = \frac{1}{n} \sum x_i y_i - \bar{x}\bar{y} \tag{2.34} \]

Using the alternative expressions for sample variance and covariance, this may be rewritten

\[ b \, \mathrm{Var}(x) = \mathrm{Cov}(x, y) \tag{2.35} \]

and so

\[ b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \tag{2.36} \]

Having found b from (2.36), you find a from (2.31). Those who know about the second-order conditions will have no difficulty confirming that we have minimized S.
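The general result can be checked numerically against the three-observation example of Section 2.3. The following is a minimal sketch, not part of the original text: the function names `sample_mean`, `sample_cov`, and `ols_simple` are illustrative choices, and the covariance uses the 1/n divisor of equation (1.1). It computes b from (2.36) and a from (2.31), and then confirms that the normal equations (2.28) and (2.29) hold, which is equivalent to the residuals summing to zero and being orthogonal to x.

```python
def sample_mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance with the 1/n divisor, as in equation (1.1)."""
    x_bar, y_bar = sample_mean(x), sample_mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


def ols_simple(x, y):
    """Return (a, b) using b = Cov(x, y) / Var(x) and a = y_bar - b * x_bar."""
    b = sample_cov(x, y) / sample_cov(x, x)  # Var(x) is Cov(x, x)
    a = sample_mean(y) - b * sample_mean(x)
    return a, b


# Three-observation example from Section 2.3
x = [1, 2, 3]
y = [3, 5, 6]
a, b = ols_simple(x, y)

# Dividing (2.28) by -2 gives sum(e) = 0; dividing (2.29) by -2 gives
# sum(x * e) = 0, where e = y - a - b * x is the residual.
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))

print(round(a, 2), round(b, 2))  # prints: 1.67 1.5
```

The estimates agree with equation (2.21), and both `sum_e` and `sum_xe` are zero up to floating-point rounding.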
In the second numerical example in Section 2.3, with the sample covariance and variance defined as in Chapter 1, Cov(x, y) is equal to 1.000, Var(x) to 0.667, $\bar{y}$ to 4.67, and $\bar{x}$ to 2.00, so

\[ b = 1.000 / 0.667 = 1.5 \tag{2.37} \]
\[ a = \bar{y} - b\bar{x} = 4.67 - 1.5(2.00) = 1.67 \tag{2.38} \]

which confirms the original calculation.

Alternative expressions for b

From the definitions of Cov(x, y) and Var(x) one can obtain alternative expressions for b in $\Sigma$ notation:

\[ b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{\frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n} \sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{2.39} \]

One may obtain further variations using equation (2.33):

\[ b = \frac{\frac{1}{n} \sum x_i y_i - \bar{x}\bar{y}}{\frac{1}{n} \sum x_i^2 - \bar{x}^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \tag{2.40} \]

2.5 Two Decompositions of the Dependent Variable

In the preceding pages we have encountered two ways of decomposing the value of the dependent variable in a regression model. They are going to be used throughout the text, so it is important that they be understood properly and that they be kept apart conceptually.

The first decomposition relates to the process by which the values of y are generated:

\[ y_i = \alpha + \beta x_i + u_i. \tag{2.41} \]

Each value of y is generated as the sum of two components, the nonstochastic component, $\alpha + \beta x_i$, and the disturbance term $u_i$. This decomposition is purely theoretical. We will use it in the analysis of the properties of the regression estimators. It is illustrated in Figure 2.7a, where QT is the nonstochastic component of y and PQ is the disturbance term.

The other decomposition relates to the regression line:

\begin{align*}
y_i &= \hat{y}_i + e_i \\
    &= a + b x_i + e_i. \tag{2.42}
\end{align*}

[Figure 2.7a. Decomposition of y into nonstochastic part and disturbance term]

[Figure 2.7b. Decomposition of y into fitted value and residual]

Once we have chosen the values of a and b, each value of y is split into the fitted value, $\hat{y}_i$, and the residual, $e_i$. This decomposition is operational, but it is to some extent arbitrary because it depends on our criterion for determining a and b and it will inevitably be affected by the particular values taken by the disturbance term in the observations in the sample. It is illustrated in Figure 2.7b, where RT is the fitted value and PR is the residual.

2.6 Interpretation of a Regression Equation

There are two stages in the interpretation of a regression equation. The first is to turn the equation into words so that it can be understood by a nonstatistician. The second is to decide whether this literal interpretation should be taken at face value or whether the relationship should be investigated further. Both stages are important. We will leave the second until later and concentrate for the time being on the first. It will be illustrated with an earnings function, hourly earnings in 1992, EARNINGS, measured in dollars, being regressed on highest grade completed, HGC, for the 570 respondents in Data Set 21 in the EAEF data set on the website.

    . reg earnings hgc

      Source |       SS       df       MS            Number of obs =     570
    ---------+------------------------------         F(1, 568)     =   65.64
       Model |  3977.38016     1  3977.38016         Prob > F      =  0.0000
    Residual |  34419.6569   568  60.5979875         R-squared     =  0.1036
    ---------+------------------------------         Adj R-squared =  0.1020
       Total |  38397.0371   569  67.4816117         Root MSE      =  7.7845

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   1.073055   .1324501      8.102   0.000       .8129028    1.333206
       _cons |  -1.391004   1.820305     -0.764   0.445      -4.966354    2.184347
    ------------------------------------------------------------------------------

The table above gives the Stata output for the regression. The scatter diagram and regression line are shown in Figure 2.8. For the time being, ignore everything except the column headed "Coef." in the bottom half of the table. This gives the estimates of the coefficient of HGC and the constant, and thus the following fitted equation:

\[ \widehat{EARNINGS} = -1.39 + 1.07\,HGC. \tag{2.43} \]

Interpreting it literally, the slope coefficient indicates that, as HGC increases by one unit (of HGC), EARNINGS increases by 1.07 units (of EARNINGS). Since HGC is measured in years, and EARNINGS is measured in dollars per hour, the coefficient of HGC implies that hourly earnings increase by $1.07 for every extra year of schooling.

What about the constant term? Strictly speaking, it indicates the predicted level of EARNINGS when HGC is zero. Sometimes the constant will have a clear meaning, but sometimes not. If the sample values of the explanatory variable are a long way from zero, extrapolating the regression line back to zero may be dangerous. Even if the regression line gives a good fit for the sample of observations, there is no guarantee that it will
continue to do so when extrapolated to the left or the right.

[Figure 2.8. A simple earnings function. Vertical axis: hourly earnings ($); horizontal axis: highest grade completed]

Interpretation of a Linear Regression Equation

This is a foolproof way of interpreting the coefficients of a linear regression

\[ \hat{y} = a + bx \]

when y and x are variables with straightforward natural units. The first step is to say that a one-unit increase in x (measured in units of x) will cause a b unit increase in y (measured in units of y). The second step is to check to see what the units of x and y actually are, and to replace the word "unit" with the actual unit of measurement. The third step is to see whether the result could be expressed in a better way, without altering its substance. The constant, a, gives the predicted value of y (in units of y) for x equal to zero. It may or may not have a plausible meaning, depending on the context.

In this case a literal interpretation of the constant would lead to the nonsensical conclusion that an individual with no schooling would have hourly earnings of -$1.39. In this data set, no individual had less than six years of schooling and only three failed to complete elementary school, so it is not surprising that extrapolation to zero leads to trouble.

It is important to keep three things in mind when interpreting a regression equation. First, a is only an estimate of $\alpha$ and b is only an estimate of $\beta$, so the interpretation is really only an estimate. Second, the regression equation refers only to the general tendency for the sample. Any individual case will be further affected by the luck factor. Third, the interpretation is conditional on the equation being correctly specified.

In fact, this is actually a naïve specification of an earnings function. We will reconsider it several times in later chapters. You should be undertaking parallel experiments using one of the other EAEF data sets on the website.

Having fitted a regression, the next question to ask is, do we have any means of telling how accurate our estimates are? This very important issue will be discussed in the next chapter.

Exercises

Note: Some of the exercises in this and later chapters require you to fit regressions using one of the EAEF data sets on the website (http://econ.lse.ac.uk/staff/crsd/ie/). You will need to download the EAEF
regression exercises manual and one of the 20 data sets.

2.1* The result of fitting an educational attainment function, regressing HGC (highest grade completed) on ASVABC (score on a test of cognitive ability) using Data Set 21 is shown below. Give an interpretation of the regression coefficients. The ASVABC scores are scaled so that they have mean 50 and standard deviation 10. The lowest score in the sample was 22. For further details on ASVABC, see the EAEF regression exercises manual.

    use "C:C220AEFaef21.dta", clear
    . reg hgc asvabc

      Source |       SS       df       MS            Number of obs =     570
    ---------+------------------------------         F(1, 568)     =  284.89
       Model |  1153.80864     1  1153.80864         Prob > F      =  0.0000
    Residual |  2300.43873   568  4.05006818         R-squared     =  0.3340
    ---------+------------------------------         Adj R-squared =  0.3329
       Total |  3454.24737   569  6.07073351         Root MSE      =  2.0125

    ------------------------------------------------------------------------------
         hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      asvabc |   .1545378   .0091559     16.879   0.000       .1365543    .1725213
       _cons |   5.770845   .4668473     12.361   0.000       4.853888    6.687803
    ------------------------------------------------------------------------------

2.2 Calculate by hand a regression of the data for p on the data for e in Exercise 1.3, first using all 12 observations, then excluding the observation for Japan, and provide an economic interpretation. (Note: You do not need to calculate the regression coefficients from scratch, since you have already performed most of the arithmetical calculations in Exercise 1.3.)

2.3 Fit an educational attainment function parallel to that in Exercise 2.1, using your EAEF data set, and give an interpretation of the coefficients.

2.4 Fit an earnings function parallel to that discussed in Section 2.6, using your EAEF data set, and give an interpretation of the coefficients.

2.5* The output below shows the result of regressing the weight of the respondent in 1985, measured in pounds, against his or her height, measured in inches. Provide an interpretation of the coefficients.

    . reg weight85 height

      Source |       SS       df       MS            Number of obs =     550
    ---------+------------------------------         F(1, 548)     =  343.00
       Model |  245463.095     1  245463.095         Prob > F      =  0.0000
    Residual |  392166.897   548  715.633025         R-squared     =  0.3850
    ---------+------------------------------         Adj R-squared =  0.3838
       Total |  637629.993   549  1161.43897         Root MSE      =  26.751

    ------------------------------------------------------------------------------
    weight85 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      height |   5.399304   .2915345     18.520   0.000       4.826643    5.971966
       _cons |  -210.1883   19.85925    -10.584   0.000      -249.1979   -171.1788
    ------------------------------------------------------------------------------

2.6* The table below shows the average rates of growth of GDP, g, and employment, e, for 25 OECD countries for the period 1988-1997. The regression output shows the result of regressing e on g. Provide an interpretation of the coefficients.

Average Rates of Employment Growth and GDP Growth, 1988-1997

                    employment    GDP                        employment    GDP
    Australia          1.68       3.04    Korea                 2.57       7.73
    Austria            0.65       2.55    Luxembourg            3.02       5.64
    Belgium            0.34       2.16    Netherlands           1.88       2.86
    Canada             1.17       2.03    New Zealand           0.91       2.01
    Denmark            0.02       2.02    Norway                0.36       2.98
    Finland           -1.06       1.78    Portugal              0.33       2.79
    France             0.28       2.08    Spain                 0.89       2.60
    Germany            0.08       2.71    Sweden               -0.94       1.17
    Greece             0.87       2.08    Switzerland           0.79       1.15
    Iceland           -0.13       1.54    Turkey                2.02       4.18
    Ireland            2.16       6.40    United Kingdom        0.66       1.97
    Italy             -0.30       1.68    United States         1.53       2.46
    Japan              1.06       2.81

    . reg employ gdp if employ<4.5

      Source |       SS       df       MS            Number of obs =      25
    ---------+------------------------------         F(1, 23)      =   33.22
       Model |  14.2762167     1  14.2762167         Prob > F      =  0.0000
    Residual |  9.88359869    23  .429721682         R-squared     =  0.5909
    ---------+------------------------------         Adj R-squared =  0.5731
       Total |  24.1598154    24  1.00665898         Root MSE      =  .65553

    ------------------------------------------------------------------------------
      employ |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         gdp |   .4846863   .0840907      5.764   0.000       .3107315    .6586411
       _cons |  -.5208643   .2707298     -1.924   0.067      -1.080912     .039183
    ------------------------------------------------------------------------------

2.7 Two individuals fit earnings functions relating EARNINGS to HGC as defined in Section 2.6, using Data Set 21. The first individual does it correctly and obtains the result found in Section 2.6:

\[ \widehat{EARNINGS} = -1.39 + 1.07\,HGC \]

The second individual makes a mistake and regresses HGC on EARNINGS, obtaining the following result:

\[ \widehat{HGC} = 12.255 + 0.097\,EARNINGS \]

From this result the second individual derives

\[ \widehat{EARNINGS} = -126.95 + 10.36\,HGC \]

Explain why this equation is different from that fitted by the first individual.

2.8* Derive, with a proof, the coefficients that would have been obtained in Exercise 2.5 if weight and height had been measured in metric units. (Note: one pound is 454 grams, and one inch is 2.54 cm.)

2.9* A researcher has data on the aggregate expenditure on services, y, and aggregate disposable personal income, x, both measured in $ billion at constant prices, for each of the U.S. states and fits the equation

\[ y = \alpha + \beta x + u \]

The researcher initially fits the equation using OLS regression analysis. However, suspecting that tax evasion causes both y and x to be substantially underestimated, the researcher adopts two alternative methods of compensating for the under-reporting:

1. The researcher adds $90 billion to the data for y in each state and $200 billion to the data for x.
2. The researcher increases the figures for both y and x in each state by 10%.

Evaluate the impact of the adjustments on the regression results.

2.10* A researcher has international cross-section data on aggregate wages, W, aggregate profits, P, and aggregate income, Y, for a sample of n countries. By definition,

\[ Y = W + P \]

The regressions

\[ \hat{W} = a_0 + a_1 Y \]
\[ \hat{P} = b_0 + b_1 Y \]

are fitted using OLS regression analysis. Show that the regression coefficients will automatically satisfy the following equations:

\[ a_1 + b_1 = 1 \]
\[ a_0 + b_0 = 0 \]

Explain intuitively why this should be so.

2.11* Derive from first principles the least squares estimator of $\beta$ in the model

\[ y = \beta x + u \]

2.12 Derive from first principles the least squares estimator of $\alpha$ in the even more primitive model

\[ y = \alpha + u \]

(In other words, y consists simply of a constant plus a disturbance term. First define S and then differentiate.)

Three Useful Results Relating to OLS Regressions

(i) $\bar{e} = 0$, (ii) $\bar{\hat{y}} = \bar{y}$, (iii) $\mathrm{Cov}(\hat{y}, e) = 0$.

Proof of (i)

\[ e_i = y_i - \hat{y}_i = y_i - a - b x_i \]

so

\[ \sum e_i = \sum y_i - na - b \sum x_i \]

Dividing by n,

\[ \bar{e} = \bar{y} - a - b\bar{x} = \bar{y} - (\bar{y} - b\bar{x}) - b\bar{x} = 0 \]

Proof of (ii)

\[ e_i = y_i - \hat{y}_i \]

so

\[ \sum e_i = \sum y_i - \sum \hat{y}_i \]

Dividing by n,

\[ \bar{e} = \bar{y} - \bar{\hat{y}} \]

But $\bar{e} = 0$, so $\bar{\hat{y}} = \bar{y}$.

Proof of (iii)

\begin{align*}
\mathrm{Cov}(\hat{y}, e) &= \mathrm{Cov}([a + bx], e) = \mathrm{Cov}(a, e) + \mathrm{Cov}(bx, e) \\
  &= b\,\mathrm{Cov}(x, e) = b\,\mathrm{Cov}(x, [y - a - bx]) \\
  &= b\left[ \mathrm{Cov}(x, y) - \mathrm{Cov}(x, a) - b\,\mathrm{Var}(x) \right] \\
  &= b\left[ \mathrm{Cov}(x, y) - \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \mathrm{Var}(x) \right] = 0
\end{align*}

2.7 Goodness of Fit: $R^2$

The aim of regression analysis is to explain the behavior of the dependent variable y. In any given sample, y is relatively low in some observations and relatively high in others. We want to know why. The variations in y in any sample can be summarized by the sample variance Var(y). We should like to be able to account for the size of this variance.

We have seen that we can split the value of $y_i$ in each observation into two components, $\hat{y}_i$ and $e_i$, after running a regression.

\[ y_i = \hat{y}_i + e_i \tag{2.44} \]

We can use this to decompose the variance of y:

\begin{align*}
\mathrm{Var}(y) &= \mathrm{Var}(\hat{y} + e) \\
  &= \mathrm{Var}(\hat{y}) + \mathrm{Var}(e) + 2\,\mathrm{Cov}(\hat{y}, e) \tag{2.45}
\end{align*}

Now it so happens that $\mathrm{Cov}(\hat{y}, e)$ must be equal to zero (see the box). Hence we obtain

\[ \mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(e) \tag{2.46} \]

This means that we can decompose the variance of y into two parts, $\mathrm{Var}(\hat{y})$, the part "explained" by the regression line, and Var(e), the "unexplained" part.

[Note: The words explained and unexplained have been put in quotation marks because the explanation may in fact be bogus. y might really depend on some other variable z, and x might be acting as a proxy for z (more about this later). It would be safer to use the expression apparently explained instead of explained.]

In view of (2.46), $\mathrm{Var}(\hat{y}) / \mathrm{Var}(y)$ is the proportion of the variance explained by the regression line. This proportion is known as the coefficient of determination, or more usually $R^2$:

\[ R^2 = \frac{\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)} \tag{2.47} \]

The maximum value of $R^2$ is 1. This occurs when the regression line fits the observations exactly, so that $\hat{y}_i = y_i$ in all observations and all the residuals are zero. Then $\mathrm{Var}(\hat{y}) = \mathrm{Var}(y)$, Var(e) is zero, and one has a perfect fit. If there is no apparent relationship between the values of y and x in the sample, $R^2$ will be close to zero.

Often it is convenient to decompose the variance as "sums of squares". From (2.46) one has

\[ \frac{1}{n} \sum (y_i - \bar{y})^2 = \frac{1}{n} \sum (\hat{y}_i - \bar{\hat{y}})^2 + \frac{1}{n} \sum (e_i - \bar{e})^2 \tag{2.48} \]

and so

\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum e_i^2 \tag{2.49} \]

multiplying through by n and using $\bar{e} = 0$ and $\bar{\hat{y}} = \bar{y}$ (see the box). Thus

\[ TSS = ESS + RSS \tag{2.50} \]

where TSS, the total sum of squares, is given by the left side of the equation and ESS, the explained sum of squares, and RSS, the residual sum of squares
, are the two terms on the right side.

Other things being equal, one would like $R^2$ to be as high as possible. In particular, we would like the coefficients a and b to be chosen in such a way as to maximize $R^2$. Does this conflict with our criterion that a and b should be chosen to minimize the sum of the squares of the residuals? No, they are easily shown to be equivalent criteria. In view of (2.46) we can rewrite $R^2$ as

\[ R^2 = 1 - \frac{\mathrm{Var}(e)}{\mathrm{Var}(y)} \tag{2.51} \]

Thus

\[ R^2 = 1 - \frac{\frac{1}{n} \sum e_i^2}{\mathrm{Var}(y)} \tag{2.52} \]

and so the values of a and b which minimize $\sum e_i^2$ automatically maximize $R^2$.

Alternative interpretation of $R^2$

It should be intuitively obvious that, the better is the fit achieved by the regression equation, the higher should be the correlation coefficient for the actual and predicted values of y. We will show that $R^2$ is in fact equal to the square of this correlation coefficient, which we will denote $r_{y,\hat{y}}$:

\begin{align*}
r_{y,\hat{y}} &= \frac{\mathrm{Cov}(y, \hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}} = \frac{\mathrm{Cov}([\hat{y} + e], \hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}} \\
  &= \frac{\mathrm{Cov}(\hat{y}, \hat{y}) + \mathrm{Cov}(e, \hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}} = \frac{\mathrm{Var}(\hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}} \\
  &= \sqrt{\frac{\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)}} = \sqrt{R^2} \tag{2.53}
\end{align*}

Note that the proof makes use of the fact that $\mathrm{Cov}(e, \hat{y}) = 0$ (see the box).

Example of how $R^2$ is calculated

$R^2$ is always calculated by the computer as part of the regression output, so this example is for illustration only. We shall use the primitive three-observation example described in Section 2.3, where the regression line

\[ \hat{y} = 1.6667 + 1.5000x \tag{2.54} \]

was fitted to the observations on x and y in Table 2.3. The table also shows $\hat{y}_i$ and $e_i$ for each observation, calculated from (2.54), and all the other figures needed to calculate Var(y), $\mathrm{Var}(\hat{y})$ and Var(e). (Note that $\bar{e}$ must be zero, so Var(e) is $\frac{1}{n} \sum e_i^2$.)

TABLE 2.3

    Observation   x   y        $\hat{y}$   e         $y - \bar{y}$   $\hat{y} - \bar{\hat{y}}$   $(y - \bar{y})^2$   $(\hat{y} - \bar{\hat{y}})^2$   $e^2$
    1             1   3        3.1667      -0.1667   -1.6667         -1.5                        2.7778              2.25                            0.0278
    2             2   5        4.6667       0.3333    0.3333          0.0                        0.1111              0.00                            0.1111
    3             3   6        6.1667      -0.1667    1.3333          1.5                        1.7778              2.25                            0.0278
    Total         6   14       14                                                                4.6667              4.50                            0.1667
    Average       2   4.6667   4.6667                                                            1.5556              1.50                            0.0556

From Table 2.3, you can see that Var(y) = 1.5556, $\mathrm{Var}(\hat{y})$ = 1.5000, and Var(e) = 0.0556. Note that $\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(e)$, as it must. From these figures, we can calculate $R^2$ using either (2.47) or (2.51):

\[ R^2 = \frac{\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)} = \frac{1.5000}{1.5556} = 0.96 \tag{2.55} \]

\[ R^2 = 1 - \frac{\mathrm{Var}(e)}{\mathrm{Var}(y)} = 1 - \frac{0.0556}{1.5556} = 0.96 \tag{2.56} \]

Exercises

2.13 Using the data in Table 2.3, calculate the correlation between y and $\hat{y}$ and verify that its square is equal to the value of $R^2$.

2.14 What was the value of $R^2$ in the educational attainment regression fitted by you in Exercise 2.3? Comment on it.

2.15 What was the value of $R^2$ in the earnings function fitted by you in Exercise 2.4? Comment on it.

2.16* The output below shows the result of regressing weight in 1994 on height, using EAEF Data Set 21. In 1994 the respondents were aged 29-36. Explain why $R^2$ is lower than in the regression reported in Exercise 2.5.

    . reg weight94 height

      Source |       SS       df       MS            Number of obs =     545
    ---------+------------------------------         F(1, 543)     =  247.48
       Model |   268361.40     1   268361.40         Prob > F      =  0.0000
    Residual |  588805.041   543  1084.35551         R-squared     =  0.3131
    ---------+------------------------------         Adj R-squared =  0.3118
       Total |   857166.44   544   1575.6736         Root MSE      =   32.93

    ------------------------------------------------------------------------------
    weight94 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      height |   5.659775     .35977     15.732   0.000       4.953064    6.366487
       _cons |  -212.8358   24.51422     -8.682   0.000      -260.9901   -164.6815
    ------------------------------------------------------------------------------

3 PROPERTIES OF THE REGRESSION COEFFICIENTS AND HYPOTHESIS TESTING

With the aid of regression analysis we can obtain estimates of the parameters of a relationship. However, they are only estimates. The next question to ask is, how reliable are they? We shall answer this first in general terms, investigating the conditions for unbiasedness and the factors governing their variance. Building on this, we shall develop a means of testing whether a regression estimate is compatible with a specific prior hypothesis concerning the true value of a parameter, and hence we shall derive a confidence interval for the true value, that is, the set of all hypothetical values not contradicted by the experimental result. We shall also see how to test whether the goodness of fit of a regression equation is better than might be expected on the basis of pure chance.

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 22.09.99.

3.1 The Random Components of the Regression Coefficients

A least squares regression coefficient is a special form of random variable whose properties depend on those of the disturbance term in the equation. This will be demonstrated first theoretically and then by means of a controlled experiment. In particular, we will investigate the implications for the regression coefficients of certain assumptions concerning the disturbance term.

Throughout the discussion we shall continue to work with the simple regression model where y depends on x according to the relationship

\[ y = \alpha + \beta x + u \tag{3.1} \]

and we are fitting the regression equation

\[ \hat{y} = a + bx \tag{3.2} \]

given a sample of n observations. We shall also continue to assume that x is a nonstochastic exogenous variable; that is, its value in each observation may be considered to be predetermined by factors unconnected with the present relationship.

First, note that y has two components. It has a nonrandom component $(\alpha + \beta x)$, which owes nothing to the laws of chance ($\alpha$ and $\beta$ may be unknown, but nevertheless they are fixed constants), and it has the random component u.

This implies that, when we calculate b according to the usual formula

\[ b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \tag{3.3} \]

b also has a random component. Cov(x, y) depends on the values of y, and the values of y depend on the values of u. If the values of the disturbance term had been different in the n observations, we would have obtained different values of y, hence of Cov(x, y), and hence of b.

We can in theory decompose b into its nonrandom and random components. In view of (3.1),

\begin{align*}
\mathrm{Cov}(x, y) &= \mathrm{Cov}(x, [\alpha + \beta x + u]) \\
  &= \mathrm{Cov}(x, \alpha) + \mathrm{Cov}(x, \beta x) + \mathrm{Cov}(x, u) \tag{3.4}
\end{align*}

using Covariance Rule 1 in Section 1.2. By Covariance Rule 3, $\mathrm{Cov}(x, \alpha)$ must be equal to zero. By Covariance Rule 2, $\mathrm{Cov}(x, \beta x)$ is equal to $\beta\,\mathrm{Cov}(x, x)$. Cov(x, x) is the same thing as Var(x). Hence we can write

\[ \mathrm{Cov}(x, y) = \beta\,\mathrm{Var}(x) + \mathrm{Cov}(x, u) \tag{3.5} \]

and so

\[ b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \beta + \frac{\mathrm{Cov}(x, u)}{\mathrm{Var}(x)} \tag{3.6} \]

Thus we have shown that the regression coefficient b obtained from any sample consists of (1) a fixed component, equal to the true value, $\beta$, and (2) a random component dependent on Cov(x, u), which is responsible for its variations around this central tendency. Similarly, one may easily show that a has a fixed component equal to the true value, $\alpha$, plus a random component that depends on the random factor u.

Note that you are not able to make these decompositions in practice because you do not know the true values of $\alpha$ and $\beta$ or the actual values of u in the sample. We are interested in them because they enable us to say something about the theoretical properties of a and b, given certain assumptions.

3.2 A Monte Carlo Experiment

Nobody seems to know for certain how the Monte Carlo experiment got its name. Probably it has something to do with the famous casino, as a symbol of the laws of chance.

The basic concept will be explained by means of an analogy. Suppose you have trained a pig to find truffles for you. These fungi grow wild in the ground in France and Italy and are considered to be delicious. They are expensive because they are hard to find, and a good truffle pig is highly valued. The question is, how do you know if your pig is any good at truffle hunting? It may find them from time to time, but for all you know it may miss a lot as well. If you were really interested you could evaluate your pig by taking a piece of land, burying truffles in several places, letting the pig loose, and seeing how many it located. By means of this controlled experiment, you would have a direct measure of its success rate.

What has this got to do with regression analysis? The problem is that we never know the true values of $\alpha$ and $\beta$ (otherwise, why should we use regression analysis to estimate them?), so we have no means of telling whether the technique is giving us good or bad estimates. A Monte Carlo experiment is an artificial, controlled experiment that allows us to check.

The simplest possible Monte Carlo experiment has three parts. First,

1. you choose the true values of $\alpha$ and $\beta$,
2. you choose the value of x in each observation, and
3. you use some random number generating process to provide the random factor u in each observation.

Second, you generate the value of y in each observation, using the relationship (3.1) and the values of $\alpha$, $\beta$, x and u. Third, using only the values of y thus generated and the data for x, you use regression analysis to obtain estimates a and b. You can then see if a is a good estimator of $\alpha$ and if b is a good estimator of $\beta$, and this will give you some idea of whether the regression technique is working properly.

In the first two steps you are preparing a challenge for the regression technique. You are in complete control of the model that you are constructing and you know the true values of the parameters because you yourself have determined them. In the third step you see whether the regression technique can meet your challenge and provide good estimates of $\alpha$ and $\beta$ using only the data on y and x. Note that the inclusion of a stochastic term in the generation of y is responsible for the element of challenge. If you did not include it, the observations would lie exactly on the straight line (3.1), and it would be a trivial matter to determine the exact values of $\alpha$ and $\beta$ from the data on y and x.

Quite arbitrarily, let us put $\alpha$ equal to 2 and $\beta$ equal to 0.5, so the true relationship is

\[ y = 2 + 0.5x + u \tag{3.7} \]

TABLE 3.1

     x      u       y         x      u       y
     1    -0.59    1.91      11     1.59    9.09
     2    -0.24    2.76      12    -0.92    7.08
     3    -0.83    2.67      13    -0.71    7.79
     4     0.03    4.03      14    -0.25    8.75
     5    -0.38    4.12      15     1.69   11.19
     6    -2.19    2.81      16     0.15   10.15
     7     1.03    6.53      17     0.02   10.52
     8     0.24    6.24      18    -0.11   10.89
     9     2.53    9.03      19    -0.91   10.59
    10    -0.13    6.87      20     1.42   13.42

TABLE 3.2

    Sample     a       b
     1        1.63    0.54
     2        2.52    0.48
     3        2.13    0.45
     4        2.14    0.50
     5        1.71    0.56
     6        1.81    0.51
     7        1.72    0.56
     8        3.18    0.41
     9        1.26    0.58
    10        1.94    0.52

To keep things simple, we will assume that we have 20 observations and that the values of x go from 1 to 20. For u, the disturbance term, we will use random numbers drawn from a normally distributed population with zero mean and unit variance. We will need a set of 20 and will denote them $rn_1$ to $rn_{20}$. $u_1$, the disturbance term in the first observation, is simply equal to $rn_1$, $u_2$ to $rn_2$, etc.

Given the value of x and u in each observation, it is possible to calculate the value of y using (3.7), and this is done in Table 3.1. If you now regress y on x, you obtain

\[ \hat{y} = 1.63 + 0.54x \tag{3.8} \]

In this case a is an underestimate of $\alpha$ (1.63 as opposed to 2.00) and b is a slight overestimate of $\beta$ (0.54 as opposed to 0.50). The discrepancies are caused by the collective effects of the disturbance terms in the 20 observations.

Of course, one experiment such as this is hardly enough to allow us to evaluate the regression technique. It gave quite good results, but perhaps this was a fluke. To check further, we will repeat the experiment, keeping the same true equation (3.7) and the same values of x, but using a new set of random numbers for the disturbance term drawn from the same distribution (zero mean and unit variance). From these, and the values of x, we generate a new set of values for y.

To save space, the table giving the new values of u and y is omitted. The result when the new values of y are regressed on x is

\[ \hat{y} = 2.52 + 0.48x \tag{3.9} \]

This second experiment also turned out quite well. Now a is an overestimate of $\alpha$ and b is a slight underestimate of
$\beta$. Table 3.2 gives the estimates $a$ and $b$ with the experiment repeated 10 times, using a different set of random numbers for the disturbance term in each case.

You can see that, although you sometimes get overestimates and sometimes underestimates, on the whole $a$ and $b$ are clustered around the true values of 2.00 and 0.50, respectively. And there are more good estimates than bad ones. Taking $b$, for example, if you repeated the experiment a very large number of times and constructed a frequency table, you would obtain an approximation to the probability density function shown in Figure 3.1. It is a normal distribution with mean 0.50 and standard deviation 0.0388.

Figure 3.1. Distribution of b in the Monte Carlo experiment

It has been asserted that the discrepancies between the regression coefficients and the true values of the parameters are caused by the disturbance term $u$. A consequence of this is that the bigger the random element, the less accurate the estimate will be, in general.

This will be illustrated with a second set of Monte Carlo experiments related to the first. We shall use the same values for $\alpha$ and $\beta$ as before, the same values of $x$, and the same source of random numbers for the disturbance term, but we will now make the disturbance term in each observation, which will be denoted $u'$, equal to twice the random number drawn: $u'_1 = 2rn_1$, $u'_2 = 2rn_2$, etc. In fact, we will use exactly the same sample of random numbers as before, but double them. Corresponding to Table 3.1, we now have Table 3.3. Regressing $y$ on $x$, we now obtain the equation

$$\hat{y} = 1.26 + 0.58x \tag{3.10}$$

This is much less accurate than its counterpart, equation (3.8).

TABLE 3.3

  x       u       y         x       u       y
  1    -1.18    1.32       11     3.18   10.68
  2    -0.48    2.52       12    -1.84    6.16
  3    -1.66    1.84       13    -1.42    7.08
  4     0.06    4.06       14    -0.50    8.50
  5    -0.76    3.74       15     3.38   12.88
  6    -4.38    0.62       16     0.30   10.30
  7     2.06    7.56       17     0.04   10.54
  8     0.48    6.48       18    -0.22   10.78
  9     5.06   11.56       19    -1.82    9.68
 10    -0.26    6.74       20     2.84   14.84

TABLE 3.4

 Sample     a       b
   1      1.26    0.58
   2      3.05    0.45
   3      2.26    0.39
   4      2.28    0.50
   5      1.42    0.61
   6      1.61    0.52
   7      1.44    0.63
   8      4.37    0.33
   9      0.52    0.65
  10      1.88    0.55

Table 3.4 gives the results for all 10 experiments, putting $u' = 2rn$. We will call this set of experiments II and the original set, summarized in Table 3.2, I. Comparing Tables 3.2 and 3.4, you can see
that the values of $a$ and $b$ are much more erratic in the latter, although there is still no systematic tendency either to underestimate or to overestimate.

Detailed inspection reveals an important feature. In Set I, the value of $b$ in sample 1 was 0.54, an overestimate of 0.04. In Set II, the value of $b$ in sample 1 was 0.58, an overestimate of 0.08, exactly twice as much as before. The same is true for each of the other nine samples, and also for the regression coefficient $a$ in each sample. Doubling the disturbance term in each observation causes a doubling of the errors in the regression coefficients.

This result follows directly from the decomposition of $b$ given by (3.6). In Set I the error component of $b$ is given by $\operatorname{Cov}(x,u)/\operatorname{Var}(x)$. In Set II it is given by $\operatorname{Cov}(x,u')/\operatorname{Var}(x)$, and

$$\frac{\operatorname{Cov}(x,u')}{\operatorname{Var}(x)} = \frac{\operatorname{Cov}(x,2u)}{\operatorname{Var}(x)} = 2\,\frac{\operatorname{Cov}(x,u)}{\operatorname{Var}(x)} \tag{3.11}$$

Figure 3.2. Distribution of b when the standard deviation of u is doubled

The increase in inaccuracy is reflected in the probability density function for $b$ in Set II, shown as the solid curve in Figure 3.2. This is still centered over the true value, 0.50, but, if you compare it with that for Set I, the dotted curve, you will see that it is flatter and wider. Doubling the values of $u$ has caused a doubling of the standard deviation of the distribution.

3.3 Assumptions Concerning the Disturbance Term

It is thus obvious that the properties of the regression coefficients depend critically on the properties of the disturbance term. Indeed the latter has to satisfy four conditions, known as the Gauss-Markov conditions, if ordinary least squares regression analysis is to give the best possible results. It is not an exaggeration to say that one of the things that distinguishes a competent user of regression analysis from an incompetent one is an awareness of the importance of these conditions. If they are not satisfied, the user should be aware of the fact. If remedial action is possible, he or she should be capable of taking it. If it is not possible, he or she should be able to judge how seriously the results may have been affected. We shall list the conditions one by one, explaining briefly why they are important. The last three will be treated in detail in later chapters.

Gauss-Markov Condition 1: $E(u_i) = 0$ for All Observations

The first condition is that the expected value of the disturbance term in any observation should be zero. Sometimes it will be positive, sometimes negative, but it should not have a systematic tendency in either direction.

Actually, if a constant term is included in the regression equation, it is usually reasonable to assume that this condition is satisfied automatically, since the role of the constant term is to pick up any systematic tendency in $y$ not accounted for by the explanatory variables included in the regression equation.

Gauss-Markov Condition 2: pop.var($u_i$) Constant for All Observations

The second condition is that the variance of the disturbance term should be constant for all observations. Sometimes the disturbance term will be greater, sometimes smaller, but there should not be any a priori reason for it to be more erratic in some observations than in others. The constant is usually denoted $\sigma_u^2$, often abbreviated to $\sigma^2$, and the condition is written

$$\text{pop.var}(u_i) = \sigma_u^2 \quad \text{for all } i \tag{3.12}$$

Since $E(u_i)$ is zero, pop.var($u_i$) is equal to $E(u_i^2)$, so the condition can also be written

$$E(u_i^2) = \sigma_u^2 \quad \text{for all } i \tag{3.13}$$

$\sigma_u$, of course, is unknown. One of the tasks of regression analysis is to estimate the standard deviation of the disturbance term.

If this condition is not satisfied, the OLS regression coefficients will be inefficient, and you should be able to obtain more reliable results by using a modification of the regression technique. This will be discussed in Chapter **.

Gauss-Markov Condition 3: $u_i$ Distributed Independently of $u_j$ ($i \neq j$)

This condition states that there should be no systematic association between the values of the disturbance term in any two observations. For example, just because the disturbance term is large and positive in one observation, there should be no tendency for it to be large and positive in the next (or large and negative, for that matter, or small and positive, or small and negative). The values of the disturbance term should be absolutely independent of one another.

The condition implies that the population covariance between $u_i$ and $u_j$ is zero, because

$$\text{pop.cov}(u_i,u_j) = E[(u_i-\mu_u)(u_j-\mu_u)] = E(u_iu_j) = E(u_i)E(u_j) = 0 \tag{3.14}$$

(Note that the population means of $u_i$ and $u_j$ are zero, by virtue of the first Gauss-Markov condition, and that rewriting $E(u_iu_j)$ as $E(u_i)E(u_j)$ requires $u_i$ and $u_j$ to be independent; see the Review chapter.)

If this condition is not satisfied, OLS will again give inefficient estimates. Chapter ** discusses the problems that arise and ways of getting around them.

Gauss-Markov Condition 4: The Disturbance Term Should Be Distributed Independently of the Explanatory Variables

The final condition comes in two versions, weak and strong. The strong version is that the explanatory variables should be non-stochastic, that is, not have random components. This is actually very unrealistic for economic variables and we will eventually switch to the weak version of the condition, where the explanatory variables are allowed to have random components provided that they are distributed independently of the disturbance term. However, for the time being we will use the strong version because it simplifies the analysis of the properties of the estimators.

It is not easy to think of truly nonstochastic variables, other than time, so the following example is a little artificial. Suppose that we are relating earnings to schooling in terms of highest grade completed (HGC). Suppose that we know from the national census that 1% of the population have HGC = 8, 3% have HGC = 9, 5% have HGC = 10, 7% have HGC = 11, 43% have HGC = 12 (graduation from high school), and so on. Suppose that we have decided to undertake a survey with sample size 1,000 and we want the sample to match the population as far as possible. We might then select what is known as a stratified random sample, designed so that it includes 10 individuals with HGC = 8, 30 individuals with HGC = 9, and so on. The values of HGC in the sample would then be predetermined and therefore nonstochastic. Schooling and other demographic variables in large surveys drawn in such a way as to be representative of the population as a whole, like the National Longitudinal Survey of Youth, probably approximate this condition quite well.

If this condition is satisfied, it follows that the population covariance between the explanatory variable and the disturbance term is zero. Since $E(u_i)$ is zero, and the term involving $x$ is nonstochastic,

$$\text{pop.cov}(x_i,u_i) = E[(x_i-\bar{x})(u_i-\mu_u)] = (x_i-\bar{x})E(u_i) = 0 \tag{3.15}$$

Chapters ** and ** discuss two important cases in which this condition is unlikely to be satisfied, and the consequences.

The Normality Assumption

In addition
to the Gauss-Markov conditions, one usually assumes that the disturbance term is normally distributed. You should know all about the normal distribution from your introductory statistics course. The reason for the assumption is that if $u$ is normally distributed, so will be the regression coefficients, and this will be useful to us later in the chapter when we come to the business of performing tests of hypotheses and constructing confidence intervals for $\alpha$ and $\beta$ using the regression results.

The justification for the assumption depends on the Central Limit Theorem. In essence, this states that, if a random variable is the composite result of the effects of a large number of other random variables, it will have an approximately normal distribution even if its components do not, provided that none of them is dominant. The disturbance term $u$ is composed of a number of factors not appearing explicitly in the regression equation, so, even if we know nothing about the distribution of these factors (or even their identity), we are entitled to assume that it is normally distributed. At any rate, you are very unlikely to be challenged on this point.

3.4 Unbiasedness of the Regression Coefficients

From (3.6) we can show that $b$ must be an unbiased estimator of $\beta$ if the fourth Gauss-Markov condition is satisfied:

$$E(b) = E\left\{\beta + \frac{\operatorname{Cov}(x,u)}{\operatorname{Var}(x)}\right\} = \beta + E\left\{\frac{\operatorname{Cov}(x,u)}{\operatorname{Var}(x)}\right\} \tag{3.16}$$

since $\beta$ is a constant. If we adopt the strong version of the fourth Gauss-Markov condition and assume that $x$ is nonrandom, we may also take $\operatorname{Var}(x)$ as a given constant, and so

$$E(b) = \beta + \frac{1}{\operatorname{Var}(x)}E[\operatorname{Cov}(x,u)] \tag{3.17}$$

We will demonstrate that $E[\operatorname{Cov}(x,u)]$ is zero:

$$\begin{aligned}
E[\operatorname{Cov}(x,u)] &= E\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(u_i-\bar{u})\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}E[(x_i-\bar{x})(u_i-\bar{u})] \\
&= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})E(u_i-\bar{u}) = 0
\end{aligned} \tag{3.18}$$

In the second line, the second expected value rule has been used to bring $(1/n)$ out of the expression as a common factor, and the first rule has been used to break up the expectation of the sum into the sum of the expectations. In the third line, the term involving $x$ has been brought out because $x$ is nonstochastic. By virtue of the first Gauss-Markov condition, $E(u_i)$ is zero, and hence $E(\bar{u})$ is also zero. Therefore

$$E(b) = \beta \tag{3.19}$$

In other words, $b$ is an unbiased estimator of $\beta$. We can obtain the same result with the weak version of the fourth Gauss-Markov condition (allowing $x$ to have a random component but assuming that it is distributed independently of $u$); this is demonstrated in Chapter **.

Unless the random factor in the $n$ observations happens to cancel out exactly, which can happen only by coincidence, $b$ will be different from $\beta$ in any particular experiment, but in view of (3.19) there will be no systematic tendency for it to be either higher or lower.

The same is true for the regression coefficient $a$. Using equation (2.31),

$$a = \bar{y} - b\bar{x} \tag{3.20}$$

Hence

$$E(a) = E(\bar{y}) - \bar{x}E(b) \tag{3.21}$$

Since $y_i$ is determined by

$$y_i = \alpha + \beta x_i + u_i \tag{3.22}$$

we have

$$E(y_i) = \alpha + \beta x_i + E(u_i) = \alpha + \beta x_i \tag{3.23}$$

because $E(u_i)$ is zero if the first Gauss-Markov condition is satisfied. Hence

$$E(\bar{y}) = \alpha + \beta\bar{x} \tag{3.24}$$

Substituting this into (3.21), and using the result that $E(b) = \beta$,

$$E(a) = (\alpha + \beta\bar{x}) - \bar{x}\beta = \alpha \tag{3.25}$$

Thus $a$ is an unbiased estimator of $\alpha$ provided that the Gauss-Markov conditions 1 and 4 are satisfied. Of course in any given sample the random factor will cause a discrepancy.

3.5 Precision of the Regression Coefficients

Now we shall consider the population variances of $a$ and $b$ about their population means. These are given by the following expressions (proofs for equivalent expressions can be found in Thomas, 1983, Section 8.3.3):

$$\text{pop.var}(a) = \frac{\sigma_u^2}{n}\left\{1 + \frac{\bar{x}^2}{\operatorname{Var}(x)}\right\} \quad\text{and}\quad \text{pop.var}(b) = \frac{\sigma_u^2}{n\operatorname{Var}(x)} \tag{3.26}$$

Equation (3.26) has three obvious implications. First, the variances of both $a$ and $b$ are inversely proportional to the number of observations in the sample. This makes good sense. The more information you have, the more accurate your estimates are likely to be.

Second, the variances are proportional to the variance of the disturbance term. The bigger the luck factor, the worse the estimates are likely to be, other things being equal. This is illustrated graphically in Figures 3.3a and 3.3b. In both diagrams the nonstochastic component of the relationship between $y$ and $x$, depicted by the dotted line, is given by

$$y = 3.0 + 0.8x \tag{3.27}$$

There are 20 observations, with the values of $x$ being the integers from 1 to 20. The same random numbers are used to generate the values of the disturbance term, but those in Figure 3.3b have been multiplied by a factor of 5. As a consequence the regression line, depicted by the solid line, is a much poorer approximation to the nonstochastic relationship in Figure 3.3b than in Figure 3.3a.

Third, the variance of the regression coefficients is inversely related to the variance of $x$. What is the reason for this? Remember that (1) the regression coefficients are calculated on the assumption that the observed variations in $y$ are due to variations in $x$, but (2) they are in reality partly due to variations in $x$ and partly to variations in $u$. The smaller the variance of $x$, the greater is likely to be the relative influence of the luck factor in determining the variations in $y$ and the more likely is regression analysis to be led astray. This is illustrated by Figures 3.4a and 3.4b. The nonstochastic component of the relationship is given by (3.27), and the disturbance terms are identical. In Figure 3.4a the values of $x$ are the integers from 1 to 20. In Figure 3.4b, the values of $x$ are the numbers 9.1, 9.2, ..., 10.9, 11. In Figure 3.4b the effect of the variance of $x$ on the values of $y$ is completely overshadowed by the effect of the much larger random variance of $u$.

Of course, Figures 3.3 and 3.4 make the same point in different ways. As can be seen from (3.26), it is the relative size of $\sigma_u^2$ and $\operatorname{Var}(x)$ that is important, rather than the actual size of either.

Figure 3.3a. Disturbance term with relatively small variance

Figure 3.3b. Disturbance term with relatively large variance

Figure 3.4a. x with relatively large variance

Figure 3.4b. x with relatively small variance

Unfortunately, in practice, one cannot calculate the population variances of either $a$ or $b$ because $\sigma_u^2$ is unknown. However, we can derive an estimator of $\sigma_u^2$ from the residuals. Clearly the scatter of the residuals around the regression line will reflect the unseen scatter of $u$ about the line $y = \alpha + \beta x$, although in general the residual and the value of the disturbance term in any given observation are not equal to one another. Hence the sample variance of the residuals, $\operatorname{Var}(e)$, which we can measure, will be a guide to $\sigma_u^2$, which we cannot.

Before going any further, ask yourself the following question. Which line is likely to be closer to the points representing the sample of observations on $x$ and $y$, the true line $y = \alpha + \beta x$ or the regression line $\hat{y} = a + bx$? The answer is the regression line, because by definition it is drawn in such a way as to minimize the sum of the squares of the distances between it and the observations. Hence the spread of the residuals will tend to be smaller than the spread of the values of $u$, and $\operatorname{Var}(e)$ will tend to underestimate $\sigma_u^2$. Indeed, it can be shown that the expected value of $\operatorname{Var}(e)$, when there is just one explanatory variable, is $[(n-2)/n]\sigma_u^2$. However, it follows that, if one defines $s_u^2$ by

$$s_u^2 = \frac{n}{n-2}\operatorname{Var}(e) \tag{3.28}$$

$s_u^2$ will be an unbiased estimator of $\sigma_u^2$ (for a proof, see Thomas).

Using (3.26) and (3.28), one can obtain estimates of the population variances of $a$ and $b$ and, by taking square roots, estimates of their standard deviations. Rather than talk about the "estimate of the standard deviation of the probability density function" of a regression coefficient, which is a bit cumbersome, one uses the term "standard error" of a regression coefficient, which in this text will frequently be abbreviated to s.e. For simple regression analysis, therefore, one has

$$\text{s.e.}(a) = \sqrt{\frac{s_u^2}{n}\left\{1 + \frac{\bar{x}^2}{\operatorname{Var}(x)}\right\}} \quad\text{and}\quad \text{s.e.}(b) = \sqrt{\frac{s_u^2}{n\operatorname{Var}(x)}} \tag{3.29}$$

The standard errors of the regression coefficients will automatically be calculated for you as part of the computer output.

These relationships will be illustrated with the Monte Carlo experiment described in Section 3.2. In Set I, $u$ was determined by random numbers drawn from a population with zero mean and unit variance, so $\sigma_u^2 = 1$. $x$ was the set of numbers from 1 to 20, and one can easily calculate $\operatorname{Var}(x)$, which is 33.25. Hence

$$\text{pop.var}(a) = \frac{1}{20}\left\{1 + \frac{10.5^2}{33.25}\right\} = 0.2158 \tag{3.30}$$

and

$$\text{pop.var}(b) = \frac{1}{20 \times 33.25} = 0.001504 \tag{3.31}$$

TABLE 3.5

 Sample   s.e.(b)      Sample   s.e.(b)
   1       0.043          6      0.044
   2       0.041          7      0.039
   3       0.038          8      0.040
   4       0.035          9      0.033
   5       0.027         10      0.033

Therefore, the true standard deviation of $b$ is $\sqrt{0.001504} = 0.039$. What did the computer make of it in the 10 samples in Set I? It has to calculate the standard error using (3.29), with the results shown in Table 3.5. As you can see, most of the estimates are quite good.

One fundamental point must be emphasized. The standard error gives only a general guide to the likely accuracy of a regression coefficient. It enables you
to obtain some idea of the width, or narrowness, of its probability density function as represented in Figure 3.1, but it does not tell you whether your regression estimate comes from the middle of the function, and is therefore accurate, or from the tails, and is therefore relatively inaccurate.

The higher the variance of the disturbance term, the higher the sample variance of the residuals is likely to be, and hence the higher will be the standard errors of the coefficients in the regression equation, reflecting the risk that the coefficients are inaccurate. However, it is only a risk. It is possible that in any particular sample the effects of the disturbance term in the different observations will cancel each other out and the regression coefficients will be accurate after all. The trouble is that in general there is no way of telling whether you happen to be in this fortunate position or not.

Exercises

Where performance on a game of skill is measured numerically, the improvement that comes with practice is called a learning curve. This is especially obvious with arcade-type video games. The first time players try a new one, they are likely to score very little. With more attempts, their scores should gradually improve as they become accustomed to the game, although obviously there will be variations caused by the luck factor. Suppose that their scores are determined by the learning curve

$$y = 500 + 100x + u,$$

where $y$ is the score, $x$ is the number of times that they have played before, and $u$ is a disturbance term.

The following table gives the results of the first 20 games of a new player: $x$ automatically goes from 0 to 19; $u$ was set equal to 400 times the numbers generated by a normally distributed random variable with zero mean and unit variance; and $y$ was determined by $x$ and $u$ according to the learning curve.

 Observation     x        u        y
      1          0     -236      264
      2          1      -96      504
      3          2     -332      368
      4          3       12      812
      5          4     -152      748
      6          5     -876      124
      7          6      412    1,512
      8          7       96    1,296
      9          8    1,012    2,312
     10          9      -52    1,348
     11         10      636    2,136
     12         11     -368    1,232
     13         12     -284    1,416
     14         13     -100    1,700
     15         14      676    2,576
     16         15       60    2,060
     17         16        8    2,108
     18         17      -44    2,156
     19         18     -364    1,936
     20         19      568    2,968

Regressing $y$ on $x$, one obtains the equation (standard errors in parentheses):

$$\hat{y} = \underset{(190)}{369} + \underset{(17.1)}{116.8}\,x$$

3.1 Why is the constant in this equation not equal to 500 and the coefficient of $x$ not equal to 100?

3.2 What is the meaning of the standard errors?

3.3 The experiment is repeated with nine other new players (the disturbance term being generated by 400 times a different set of 20 random numbers in each case), and the regression results for all ten players are shown in the following table. Why do the constant, the coefficient of $x$, and the standard errors vary from sample to sample?

 Player   Constant   Standard error   Coefficient   Standard error of
                      of constant        of x        coefficient of x
    1       369          190            116.8             17.1
    2       699          184             90.1             16.5
    3       531          169             78.5             15.2
    4       555          158             99.5             14.2
    5       407          120            122.6             10.8
    6       427          194            104.3             17.5
    7       412          175            123.8             15.8
    8       613          192             95.8             17.3
    9       234          146            130.1             13.1
   10       485          146            109.6             13.1

3.4 The variance of $x$ is equal to 33.25 and the population variance of $u$ is equal to 160,000. Show that the standard deviation of the probability density function of the coefficient of $x$ is equal to 15.5 using equation (3.29). Are the standard errors in the table good estimates of this standard deviation?

3.6 The Gauss-Markov Theorem

In the Review, we considered estimators of the unknown population mean $\mu$ of a random variable $x$, given a sample of observations. Although we instinctively use the sample mean $\bar{x}$ as our estimator, we saw that it was only one of an infinite number of possible unbiased estimators of $\mu$. The reason that the sample mean is preferred to any other estimator is that, under certain assumptions, it is the most efficient.

Similar considerations apply to regression coefficients. We shall see that the OLS estimators are not the only unbiased estimators of the regression coefficients, but, provided that the Gauss-Markov conditions are satisfied, they are the most efficient. The other side of the coin is that, if the Gauss-Markov conditions are not satisfied, it will in general be possible to find estimators that are more efficient than OLS.

We will not attempt a general discussion of these issues here. We will instead give an illustration. We shall assume that we have a relationship given by

$$y = \alpha + \beta x + u, \tag{3.32}$$

and we shall confine our attention to estimators of $\beta$. Someone who had never heard of regression analysis, on seeing a scatter diagram of a sample of observations, might be tempted to obtain an estimate of the slope merely by joining the first and the last observations, and by dividing the increase in the height by the horizontal distance between them, as in Figure 3.5. The estimator $b$ would then be given by

$$b = \frac{y_n - y_1}{x_n - x_1}. \tag{3.33}$$

Figure 3.5. Naïve estimation of β

What are the properties of this estimator? First, we will investigate whether it is biased or unbiased. Applying (3.32) to the first and last observations, we have

$$y_1 = \alpha + \beta x_1 + u_1 \tag{3.34}$$

$$y_n = \alpha + \beta x_n + u_n \tag{3.35}$$

Hence

$$b = \frac{\beta x_n + u_n - \beta x_1 - u_1}{x_n - x_1} = \beta + \frac{u_n - u_1}{x_n - x_1} \tag{3.36}$$

Thus we have decomposed this naïve estimator into two components, the true value and an error term. This decomposition is parallel to that for the OLS estimator in Section 3.1, but the error term is different. The expected value of the estimator is given by

$$E(b) = E(\beta) + E\left(\frac{u_n - u_1}{x_n - x_1}\right) = \beta + \frac{1}{x_n - x_1}E(u_n - u_1) \tag{3.37}$$

since $\beta$ is a constant and $x_1$ and $x_n$ are nonstochastic. If the first Gauss-Markov condition is satisfied,

$$E(u_n - u_1) = E(u_n) - E(u_1) = 0 \tag{3.38}$$

Therefore, despite being naïve, this estimator is unbiased.

This is not by any means the only estimator besides OLS that is unbiased. You could derive one by joining any two arbitrarily selected observations, and in fact the possibilities are infinite if you are willing to consider less naïve procedures.

It is intuitively easy to see that we would not prefer a naïve estimator such as (3.33) to OLS. Unlike OLS, which takes account of every observation, it employs only the first and the last and is wasting most of the information in the sample. The naïve estimator will be sensitive to the value of the disturbance term $u$ in those two observations, whereas the OLS estimator combines all the values of the disturbance term and takes greater advantage of the possibility that to some extent they cancel each other out. More rigorously, it can be shown that the population variance of the naïve estimator is greater than that of the OLS estimator, and that the naïve estimator is therefore less efficient.

With less naïve estimators the superior efficiency of OLS may not be so obvious. Nevertheless, provided that the Gauss-Markov conditions for the disturbance term are satisfied, the OLS regression coefficients will be best linear unbiased estimators (BLUE): unbiased, as has already been demonstrated; linear, because they are linear functions of the values of $y$; and best because they are the most efficient of the class of unbiased linear estimators. This is proved by the Gauss-Markov theorem (for a concise treatment not using matrix algebra, see Thomas, 1983, Section 8.3).

Exercises

3.5 An investigator correctly believes that the relationship between two variables $x$ and $y$ is given by

$$y = \alpha + \beta x + u.$$

Given a sample of $n$ observations, the investigator estimates $\beta$ by calculating it as the average value of $y$ divided by the average value of $x$. Discuss the properties of this estimator. What difference would it make if it could be assumed that $\alpha$ is equal to zero?

3.6* An investigator correctly believes that the relationship between two variables $x$ and $y$ is given by

$$y = \alpha + \beta x + u.$$

Given a sample of observations on $y$, $x$ and a third variable $z$ (which is not a determinant of $y$), the investigator estimates $\beta$ as $\operatorname{Cov}(y,z)/\operatorname{Cov}(x,z)$. Discuss the properties of this estimator. (It can be shown that its population variance is equal to the population variance of the corresponding OLS estimator divided by the square of $r_{x,z}$, where $r_{x,z}$ is the correlation coefficient for $x$ and $z$.)

3.7 Testing Hypotheses Relating to the Regression Coefficients

Which comes first, theoretical hypothesizing or empirical research? This is a bit like asking which came first, the chicken or the egg. In practice, theorizing and experimentation feed on each other, and questions of this type cannot be answered. For this reason, we will approach the topic of hypothesis testing from both directions. On the one hand, we may suppose that the theory has come first and that the purpose of the experiment is to evaluate its plausibility. This will lead to the execution of significance tests. Alternatively, we may perform the experiment first and then consider what theoretical hypotheses would be consistent with the results. This will lead to the construction of confidence intervals.

You will already have encountered the logic underlying significance tests and confidence intervals in an introductory statistics course. You will thus be familiar with most of the concepts in the following applications to regression analysis. There is, however, one topic that may be new: the use of one-tailed tests. Such tests are used very frequently in regression analysis. Indeed, they are, or they ought to be, more common than the traditional textbook two-tailed tests. It is therefore important that you understand the rationale for their use, and this involves a sequence of small analytical
steps. None of this should present any difficulty, but be warned that, if you attempt to use a short cut or, worse, try to reduce the whole business to the mechanical use of a few formulae, you will be asking for trouble.

Formulation of a Null Hypothesis

We will start by assuming that the theory precedes the experiment and that you have some hypothetical relationship in your mind. For example, you may believe that the percentage rate of price inflation in an economy, $p$, depends on the percentage rate of wage inflation, $w$, according to the linear equation

$$p = \alpha + \beta w + u \tag{3.39}$$

where $\alpha$ and $\beta$ are parameters and $u$ is a disturbance term. You might further hypothesize that, apart from the effects of the disturbance term, price inflation is equal to wage inflation. Under these circumstances you would say that the hypothesis that you are going to test, known as your null hypothesis and denoted $H_0$, is that $\beta$ is equal to 1. We also define an alternative hypothesis, denoted $H_1$, which represents your conclusion if the experimental test indicates that $H_0$ is false. In the present case $H_1$ is simply that $\beta$ is not equal to 1. The two hypotheses are stated using the notation

$$H_0: \beta = 1$$
$$H_1: \beta \neq 1$$

In this particular case, if we really believe that price inflation is equal to wage inflation, we are trying to establish the credibility of $H_0$ by subjecting it to the strictest possible test and hoping that it emerges unscathed. In practice, however, it is more usual to set up a null hypothesis and attack it with the objective of establishing the alternative hypothesis as the correct conclusion. For example, consider the simple earnings function

$$\text{EARNINGS} = \alpha + \beta\,\text{HGC} + u \tag{3.40}$$

where EARNINGS is hourly earnings in dollars and HGC is highest grade completed. On very reasonable theoretical grounds, you expect earnings to be dependent on schooling, but your theory is not strong enough to enable you to specify a particular value for $\beta$. You can nevertheless establish the dependence of earnings on schooling by the inverse procedure in which you take as your null hypothesis the assertion that earnings does not depend on schooling, that is, that $\beta$ is zero. Your alternative hypothesis is that $\beta$ is not equal to zero, that is, that schooling does affect earnings. If you can reject the null hypothesis, you have established the relationship, at least in general terms. Using the conventional notation, your null and alternative hypotheses are $H_0: \beta = 0$ and $H_1: \beta \neq 0$, respectively.

The following discussion uses the simple regression model

$$y = \alpha + \beta x + u. \tag{3.41}$$

It will be confined to the slope coefficient, $\beta$, but exactly the same procedures are applied to the constant term, $\alpha$. We will take the general case, where you have defined a null hypothesis that $\beta$ is equal to some specific value, say $\beta_0$, and the alternative hypothesis is that $\beta$ is not equal to this value ($H_0: \beta = \beta_0$, $H_1: \beta \neq \beta_0$), and you may be attempting to attack or defend the null hypothesis as it suits your purpose. We will assume that the four Gauss-Markov conditions are satisfied.

Figure 3.6. Structure of the normal distribution of b in terms of standard deviations about the mean

Developing the Implications of a Hypothesis

If $H_0$ is correct, estimates of $\beta$ obtained using regression analysis will be distributed with mean $\beta_0$ and variance $\sigma_u^2/[n\operatorname{Var}(x)]$ (see 3.26). We will now introduce the assumption that $u$ has a normal distribution. If this is the case, $b$ will also be normally distributed, as shown in Figure 3.6. "sd" in the figure refers to the standard deviation of $b$, that is $\sqrt{\sigma_u^2/[n\operatorname{Var}(x)]}$. In view of the structure of the normal distribution, most estimates of $\beta$ will lie within two standard deviations of $\beta_0$ (if $H_0: \beta = \beta_0$ is true).

Initially we will assume that we know the standard deviation of $b$. This is a most unreasonable assumption, and we will drop it later. In practice we have to estimate it, along with $\alpha$ and $\beta$, but it will simplify the discussion if for the time being we suppose that we know it exactly, and hence are in a position to draw Figure 3.6.

We will illustrate this with the price inflation model (3.39). Suppose that for some reason we know that the standard deviation of $b$ is equal to 0.1. Then, if our null hypothesis $H_0: \beta = 1$ is correct, regression estimates would be distributed as shown in Figure 3.7. You can see that, provided that the null hypothesis is correct, the estimates will generally lie between 0.8 and 1.2.

Figure 3.7. Example distribution of b (price inflation/wage inflation model)

Compatibility, Freakiness and the Significance Level

Now we come to the crunch. Suppose that we take an actual sample of observations on average rates of price inflation and wage inflation over the past five years for a sample of OECD countries and estimate $\beta$ using regression analysis. If the estimate is close to 1.0, we should almost certainly be satisfied with the null hypothesis, since it and the sample result are compatible with one another, but suppose, on the other hand, that the estimate is a long way from 1.0. Suppose that it is equal to 0.7. This is three standard deviations below 1.0. The probability of being three standard deviations away from the mean, positive or negative, is only 0.0027, which is very low. You could come to either of two conclusions about this worrisome result:

1. You could continue to maintain that your null hypothesis $\beta = 1$ is correct, and that the experiment has given a freak result. You concede that the probability of such a low value of $b$ is very small, but nevertheless it does occur 0.27 percent of the time and you reckon that this is one of those times.

2. You could conclude that the hypothesis is contradicted by the regression result. You are not convinced by the explanation in (1) because the probability is so small and you think that a much more likely explanation is that $\beta$ is not really equal to 1. In other words, you adopt the alternative hypothesis $H_1: \beta \neq 1$ instead.

How do you decide when to choose (1) and when to choose (2)? Obviously, the smaller the probability of obtaining a regression estimate like the one you have obtained, given your hypothesis, the more likely you are to abandon the hypothesis and choose (2). How small should the probability be before choosing (2)?

There is, and there can be, no definite answer to this question. In most applied work in economics either 5 percent or 1 percent is taken as the critical limit. If 5 percent is taken, the switch to (2) is made when the null hypothesis implies that the probability of obtaining such an extreme value of $b$ is less than 5 percent. The null hypothesis is then said to be rejected at the 5 percent significance level.

This occurs when $b$ is more than 1.96 standard deviations from $\beta_0$. If you look up the normal distribution table, Table A.1 at the end of the text, you will see that the probability of $b$ being more than 1.96 standard deviations above its mean is 2.5 percent, and similarly the probability of it being more than 1.96 standard deviations below its mean is 2.5 percent. The total probability of it being more than 1.96 standard deviations away is thus 5 percent.

We can summarize this decision rule mathematically by saying that we will reject the null hypothesis if

$$Z > 1.96 \quad\text{or}\quad Z < -1.96 \tag{3.42}$$

where $Z$ is the number of standard deviations between the regression estimate and the hypothetical value of $\beta$:

$$Z = \frac{\text{distance between regression estimate and hypothetical value}}{\text{standard deviation of } b} = \frac{b - \beta_0}{\text{s.d.}(b)} \tag{3.43}$$

The null hypothesis will not be rejected if

$$-1.96 \leq Z \leq 1.96 \tag{3.44}$$

This condition can be expressed in terms of $b$ and $\beta_0$ by substituting for $Z$ from (3.43):

$$-1.96 \leq \frac{b - \beta_0}{\text{s.d.}(b)} \leq 1.96 \tag{3.45}$$

Multiplying through by the standard deviation of $b$, one obtains

$$-1.96\,\text{s.d.}(b) \leq b - \beta_0 \leq 1.96\,\text{s.d.}(b) \tag{3.46}$$

from which one obtains

$$\beta_0 - 1.96\,\text{s.d.}(b) \leq b \leq \beta_0 + 1.96\,\text{s.d.}(b) \tag{3.47}$$

Equation (3.47) gives the set of values of $b$ which will not lead to the rejection of a specific null hypothesis $\beta = \beta_0$. It is known as the acceptance region for $b$, at the 5 percent significance level.

In the case of the price inflation/wage inflation example, where s.d.($b$) is equal to 0.1, you would reject at the 5 percent level if $b$ lies more than 0.196 above or below the hypothetical mean, that is, above 1.196 or below 0.804. The acceptance region is therefore those values of $b$ from 0.804 to 1.196. This is illustrated by the unshaded area in Figure 3.8.

Figure 3.8. Acceptance region for b, 5 percent significance level

Type I and Type II Errors in Everyday Life

The problem of trying to avoid Type I and Type II errors will already be familiar to everybody. A criminal trial provides a particularly acute example. Taking as the null hypothesis that the defendant is innocent, a Type I error occurs when the jury wrongly decides that the defendant is guilty. A Type II error occurs when the jury wrongly acquits the defendant.

Similarly, the null hypothesis is said to be rejected at the 1 percent significance level if the hypothesis implies that the probability of obtaining such an extreme value of $b$ is less than 1 percent. This occurs when $b$ is more than 2.58 standard deviations above or below the hypothetical value of $\beta$, that is, when

$$Z > 2.58 \quad\text{or}\quad Z < -2.58 \tag{3.48}$$

Looking at the normal distribution table again, you will see that the probability of $b$ being more than 2.58 standard deviations above its mean is 0.5 percent, and there is the same probability of it being more than 2.58 standard deviations below it, so the combined probability of such an extreme value is 1 percent. In the case of our example, you would reject the null hypothesis $\beta = 1$ if the regression estimate lay above 1.258 or below 0.742.

You may ask, why do people usually report, or at least consider reporting, the results at both the 5 percent and the 1 percent significance levels? Why not just one? The answer is that they are trying to strike a balance between the risks of making Type I errors and Type II errors. A Type I error occurs when you reject a true null hypothesis. A Type II error occurs when you do not reject a false one.

Obviously, the lower your critical probability, the smaller is the risk of a Type I error. If your significance level is 5 percent, you will reject a true hypothesis 5 percent of the time. If it is 1 percent, you will make a Type I error 1 percent of the time. Thus the 1 percent significance level is safer in this respect. If you reject the hypothesis at this level, you are almost certainly right to do so. For this reason the 1 percent significance level is described as higher than the 5 percent.

At the same time, if the null hypothesis happens to be false, the higher your significance level, the wider is your acceptance region, the greater is your chance of not rejecting it, and so the greater is the risk of a Type II error. Thus you are caught between the devil and the deep blue sea. If you insist on a very high significance level, you incur a relatively high risk of a Type II error if the hypothesis happens to be false. If you choose a low significance level, you run a relatively high risk of making a Type I error if the hypothesis happens to be true.

Most people take out a crude form of insurance policy and perform the test at both these levels, being prepared to quote the results of each. Actually, it is frequently superfluous to quote both results explicitly. Since $b$ has to be more extreme for the hypothesis to be rejected at the 1 percent level than at the 5 percent level, if you reject at the 1 percent level it automatically follows that you reject at the 5 percent level, and there is no need to say so. Indeed, you look ignorant if you do. And if you do not reject at the 5 pe
rcentlevel,itautomaticallyfollowsthatyouwillnotrejectatthe1percentlevel,andagainyouwouldlookignorantifyousaidso.Theonlytimewhenyoushouldquotebothresultsiswhenyourejectthenullhypothesisatthe5percentlevelbutnotatthe1percentlevel.WhatHappensiftheStandardDeviationofbisNotKnownSofarwehaveassumedthatthestandarddeviationofbisknown,whichismostunlikelyinpractice.Ithastobeestimatedbythestandarderrorofb,givenby(3.29).Thiscausestwomodificationstothetestprocedure.First,Zisnowdefinedusings.e.(b)insteadofs.d.(b),anditisreferredtoasthetstatistic:b-b0t=(3.49)s.e.(b)Second,thecriticallevelsoftdependuponwhatisknownasatdistributioninsteadofanormaldistribution.Wewillnotgointothereasonsforthis,orevendescribethetdistributionmathematically.Sufficetosaythatitisacousinofthenormaldistribution,itsexactshapedependingonthenumberofdegreesoffreedomintheregression,andthatitapproximatesthenormaldistributionincreasinglycloselyasthenumberofdegreesoffreedomincreases.Youwillcertainlyhaveencounteredthetdistributioninyourintroductorystatisticscourse.TableA.2attheendofthetextgivesthecriticalvaluesoftcross-classifiedbysignificancelevelandthenumberofdegreesoffreedom.Theestimationofeachparameterinaregressionequationconsumesonedegreeoffreedominthesample.Hencethenumberofdegreesoffreedomisequaltothenumberofobservationsinthesampleminusthenumberofparametersestimated.Theparametersaretheconstant(assumingthatthisisspecifiedintheregressionmodel)andthecoefficientsoftheexplanatoryvariables.Inthepresentcaseofsimpleregressionanalysis,onlytwoparameters,aandb,areestimatedandhencethenumberofdegreesoffreedomisn-2.Itshouldbeemphasizedthatwhenwecometomultipleregressionanalysisamoregeneralexpressionwillberequired.Thecriticalvalueoft,whichwewilldenotetcrit,replacesthenumber1.96in(3.45),sotheconditionthataregressionestimateshouldnotleadtotherejectionofanullhypothesisH0:b=b0isb-b0-tcrit££tcrit(3.50)s.e.(b)\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS26b-b0b-b0Hencewehavethedecisionrule:rejectH0if>tcrit,donotrejectif£tcrit,s.e.(b)s.e.(b)b
-b0whereistheabsolutevalue(numericalvalue,neglectingthesign)oft.s.e.(b)Examples.regearningshgcSource|SSdfMSNumberofobs=570---------+------------------------------F(1,568)=65.64Model|3977.3801613977.38016Prob>F=0.0000Residual|34419.656956860.5979875R-squared=0.1036---------+------------------------------AdjR-squared=0.1020Total|38397.037156967.4816117RootMSE=7.7845------------------------------------------------------------------------------earnings|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------hgc|1.073055.13245018.1020.000.81290281.333206_cons|-1.3910041.820305-0.7640.445-4.9663542.184347------------------------------------------------------------------------------InSection2.6hourlyearningswereregressedonyearsofschoolingusingdatafromtheUnitedStatesNationallongitudinalSurveyofYouthwiththeoutputshownabove.Thefirsttwocolumnsgivethenamesofthevariables,herejustHGCandtheintercept(Statadenotesthisas_cons)andtheestimatesoftheircoefficients.Thethirdcolumngivesthecorrespondingstandarderrors.Letussupposethatoneofthepurposesoftheregressionwastoconfirmourintuitionthatearningsareaffectedbyeducation.Accordingly,wesetupthenullhypothesisthatbisequaltozeroandtrytorefuteit.Thecorrespondingtstatistic,using(3.49),issimplytheestimateofthecoefficientdividedbyitsstandarderror:b-b0b-01.073t====8.13(3.51)s.e.(b)s.e.(b)0.132Sincethereare570observationsinthesampleandwehaveestimatedtwoparameters,thenumberofdegreesoffreedomis568.TableA.2doesnotgivethecriticalvaluesoftfor568degreesoffreedom,butweknowthattheymustbelowerthanthecorrespondingcriticalvaluesfor120,sincethecriticalvalueisinverselyrelatedtothenumberofdegreesoffreedom.Thecriticalvaluewith120degreesoffreedomatthe5percentlevelis1.980.HencewecanbesurethatwewouldrejectH0atthe5percentlevelwith568degreesoffreedomandweconcludethatschoolingdoesaffectearnings.Toputthistestintowords,with568degreesoffreedomtheupperandlower2.5percenttailsofthetdistributionstartapproximately1.980sta
ndarddeviationsaboveandbelowitsmeanofzero.Aregressioncoefficientwhichisestimatedtoliewithin1.980standarddeviationsofzerowouldnotcausethenullhypothesistoberejected.Inthiscase,however,thediscrepancyisequivalentto8.13estimatedstandarddeviationsandwecometotheconclusionthattheregressionresultcontradictsthenullhypothesis.Ofcoursesinceweareusingthe5percentsignificancelevelasthebasisforthetestthereisinprinciplea5percentriskofaTypeIerror,ifthenullhypothesisistrue.Inthiscasewecouldreduce\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS27theriskto1percentbyusingthe1percentsignificancelevelinstead.Thecriticalvalueoftatthe1percentsignificancelevelwith120degreesoffreedomis2.617.Sincethetstatisticisgreaterthanthis,weseethatwecaneasilyrejectthenullhypothesisatthislevelaswell.Notethatwhenthe5percentand1percenttestsleadtothesameconclusion,thereisnoneedtoreportboth,andindeedyouwouldlookignorantifyoudid.Readcarefullytheboxonreportingtestresults.Thisprocedureofestablishingarelationshipbetweenadependentandanexplanatoryvariablebysettingup,andthenrefuting,anullhypothesisthatbisequaltozeroisusedveryfrequentlyindeed.Consequentlyallseriousregressionpackagesautomaticallyprintoutthetstatisticforthisspecialcase,thatis,thecoefficientdividedbyitsstandarderror.Theratioisoftendenoted"the"t-statistic.Intheregressionoutput,thetstatisticsfortheconstantandslopecoefficientappearinthemiddlecolumn.(Youwillnotethatthetstatisticfortheslopecoefficientwasactually8.10,not8.13.Thecalculationabovewassubjecttoroundingerror.)However,ifthenullhypothesisspecifiessomenon-zerovalueofb,themoregeneralexpression(3.50)hastobeusedandthetstatistichastobecalculatedbyhand.Forexample,consideragainthepriceinflation/wageinflationmodel(3.39)andsupposethatthefittedmodelis(standarderrorsinparentheses):pˆ=-1.21+0.82w(3.52)(0.05)(0.10)Ifwenowinvestigatethehypothesisthatpriceinflationisequaltowageinflation,ournullhypothesisisthatthecoefficientofwisequalto1.0.Thecorrespondingtstatisticisb-b00.82-1.00t===-1.8(3.53)s.e.(b)0.10Ifthereare,say,20obser
vationsinthesample,thenumberofdegreesoffreedomis18andthecriticalvalueoftatthe5percentsignificancelevelis2.101.Theabsolutevalueofourtstatisticislessthanthis,soonthisoccasionwedonotrejectthenullhypothesis.Theestimate0.82isbelowourhypothesizedvalue1.00,butnotsofarbelowastoexcludethepossibilitythatthenullhypothesisiscorrect.Onefinalnoteonreportingregressionresults:somewritersplacethetstatisticinparenthesesunderacoefficientinsteadofthestandarderror.Youshouldbecarefultocheck,andwhenyouarepresentingresultsyourself,youshouldmakeitclearwhichyouaregiving.TheReject/Fail-to-RejectTerminologyInthissectionithasbeenshownthatyoushouldrejectthenullhypothesisiftheabsolutevalueofthetstatisticisgreaterthantcrit,andthatyoufailtorejectitotherwise.Why"failtoreject",whichisaclumsyexpression?Woulditnotbebetterjusttosaythatyouacceptthehypothesisiftheabsolutevalueofthetstatisticislessthantcrit?Theargumentagainstusingthetermacceptisthatyoumightfindyourself"accepting"severalmutuallyexclusivehypothesesatthesametime.Forinstance,intheprice-inflation/wageinflationexample,youwouldnotrejectanullhypothesisH0:b=0.9,oranullhypothesisH0:b=0.8.Itislogicaltosaythatyouwouldnotrejectthesenullhypotheses,aswellasthenullhypothesisH0:b=1discussedinthetext,butitmakeslittlesensetosaythatyousimultaneouslyacceptthethreehypotheses.Inthenextsectionyouwillseethatonecandefineawholerangeofhypotheseswhichwouldnotberejectedbyagivenexperimentalresult,soitwouldbeincautioustopickoutoneasbeing"accepted"\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS28ReportingtheResultsofttestsSupposeyouhaveatheoreticalrelationshipy=a+bx+uandyournullandalternativehypothesesareH0:b=b0,H0:b¹b0.Givenanexperimentalestimatebofb0,theacceptanceandrejectionregionsforthehypothesisforthe5percentand1percentsignificancelevelscanberepresentedingeneraltermsbytheleftpartofFigure3.9.Therightsideofthefiguregivesthesameregionsforaspecificexample,thepriceinflation/wagenflationmodel,thenullhypothesisbeingthatbisequalto1.Thenullhypothesiswillnotberejectedatthe5percentleve
lifblieswithin2.101standarderrorsofunity,thatis,intherange0.79to1.21,anditwillnotberejectedatthe1percentlevelifblieswithin2.878standarddeviationsofunity,thatis,intherange0.71to1.29.FromFigure3.9itcanbeseenthattherearethreetypesofdecisionzone:1.wherebissofarfromthehypotheticalbthatthenullhypothesisisrejectedatboththe5percentandthe1percentlevels.2.wherebisfarenoughfromthehypotheticalbforthenullhypothesistoberejectedatthe5percentbutnotthe1percentlevel3.wherebiscloseenoughtothehypotheticalbforthenullhypothesisnottoberejectedateitherlevel.Fromthediagramitcanbeverifiedthatifthenullhypothesisisrejectedatthe1percentlevel,itisautomaticallyrejectedatthe5percentlevel.Henceincase(1)itisonlynecessarytoreporttherejectionofthehypothesisatthe1percentlevel.Toreportthatitisrejectedalsoatthe5percentlevelissuperfluousandsuggeststhatyouarenotawareofthis.Itwouldbeabitlikereportingthatacertainhigh-jumpercancleartwometres,andthenaddingthattheathletecanalsoclearoneandahalfmetres.Incase(3),likewise,youonlyneedtomakeonestatement,inthiscasethatthehypothesisisnotrejectedatthe5percentlevel.Itautomaticallyfollowsthatitisnotrejectedatthe1percentlevel,andtoaddastatementtothiseffectaswellwouldbelikesayingthatthehigh-jumpercannotclearoneandahalfmetres,andalsoreportingthattheathletecannotcleartwometreseither.Onlyincase(2)isitnecessary(anddesirable)toreporttheresultsofbothtests.Notethatifyoufindthatyoucanrejectthenullhypothesisatthe5%level,youshouldnotstopthere.Youhaveestablishedthatthenullhypothesiscanberejectedatthatlevel,butthereremainsa5percentchanceofaTypeIerror.Youshouldalsoperformthetestatthe1%level.Ifyoufindthatyoucanrejectthenullhypothesisatthislevel,thisistheoutcomethatyoushouldreport.TheriskofaTypeIerrorisnowonly1percentandyourconclusionismuchmoreconvincing.Thisiscase(1)above.Ifyoucannotrejectatthe1percentlevel,youhavereachedcase(2)andyoushouldreporttheresultsofbothtests.\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS29PRICEINFLATION/GENERALCASEDECISIONWAGEINFLATIONEXAMPLERejectH0at1%level(andalsoat
5%level)b0+tcrit(1%)×s.e.1.29RejectH0at5%levelbutnotat1%levelb0+tcrit(5%)×s.e.1.21b0DonotrejectH0at1%level(orat1%level)1.00b0-tcrit(5%)×s.e.0.79RejectH0at5%levelbutnotat1%levelb0-tcrit(1%)×s.e.0.71RejectH0at1%level(andalsoat5%level)Figure3.9.Reportingtheresultsofattest(noneedtoreportconclusionsinparentheses)pvaluesThefifthcolumnoftheoutputabove,headedp|t|,providesanalternativeapproachtoreportingthesignificanceofregressioncoefficients.Thefiguresinthiscolumngivetheprobabilityofobtainingthecorrespondingtstatistic,ifthenullhypothesisH0:b=0weretrue.Apvalueoflessthan0.01meansthattheprobabilityislessthan1percent,whichinturnmeansthatthenullhypothesiswouldberejectedatthe1percentlevel;apvaluebetween0.01and0.05meansthatthenullhypothesiswouldberejectedatthe5percent,butnotthe1percentlevel;andapvalueof0.05ormoremeansthatitwouldnotberejectedatthe5percentlevel.Thepvalueapproachismoreinformativethanthe5percent/1percentapproach,inthatitgivestheexactprobabilityofaTypeIerror,ifthenullhypothesisistrue.Forexample,intheearningsfunctionoutputabove,thepvaluesfortheinterceptis0.445,meaningthattheprobabilityofobtainingatstatisticof0.764orgreater,inabsoluteterms,onapurechancebasisisinthiscase44.5percent.Hencethenullhypothesisthattheinterceptiszerowouldnotberejectedatanysensiblesignificancelevel.Inthecaseoftheslopecoefficient,thepvalueis0.0000,meaningthattheprobabilityofobtainingatstatisticaslargeas8.102,orlarger,islessthan0.005percent.Hencewewouldrejectthenullhypothesisthattheslopecoefficientiszeroatthe1%level.Indeedwewouldrejectitatthe0.1%level–seebelow.Choicebetweenthepvalueapproachandthe5percent/1percentapproachappearstobeentirelyconventional.Themedicalliteratureusespvalues,buttheeconomicsliteraturegenerallyuses5percent/1percent.\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS300.1percenttestsIfthetstatisticisveryhigh,youshouldcheckwhetheryoucanrejectthenullhypothesisatthe0.1%level.Ifyoucan,youshouldreportit,becauseitdemonstratesthatyouareabletorejectthenullhypothesiswithminimalriskofaTypeIerror.Exe
rcises3.7GivemoreexamplesofdailyinstancesinwhichdecisionsinvolvingpossibleTypeIandTypeIIerrorsmayarise.3.8Beforebeginningacertaincourse,36studentsaregivenanaptitudetest.Thescores,andthecourseresults(pass/fail)aregivenbelow:testcoursetestcoursetestcoursestudentstudentstudentscoreresultscoreresultscoreresult130fail1326fail259fail229pass1443pass2636pass333fail1543fail2761pass462pass1668pass2879fail559fail1763pass2957fail663pass1842fail3046pass780pass1951fail3170fail832fail2045fail3231pass960pass2122fail3368pass1076pass2230pass3462pass1113fail2340fail3556pass1241pass2426fail3636passDoyouthinkthattheaptitudetestisusefulforselectingstudentsforadmissiontothecourse,andifso,howwouldyoudeterminethepassmark?(Discussthetrade-offbetweenTypeIandTypeIIerrorsassociatedwiththechoiceofpass-mark.)3.9Aresearcherhypothesizesthatyearsofschooling,HGC,mayberelatedtothenumberofsiblings(brothersandsisters),SIBLINGS,accordingtotherelationshipHGC=a+bSIBLINGS+uSheispreparedtotestthenullhypothesisH0:b=0againstthealternativehypothesisH1:b¹0atthe5percentand1percentlevels.Shehasasampleof60observations.Whatshouldshereport1.ifb=0.20,s.e.(b)=0.07?2.ifb=-0.12,s.e.(b)=0.07?3.ifb=0.06,s.e.(b)=0.07?4.ifb=0.20,s.e.(b)=0.07?\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS313.10*Aresearcherwithasampleof50individualswithsimilareducationbutdifferingamountsoftraininghypothesizesthathourlyearnings,EARNINGS,mayberelatedtohoursoftraining,TRAINING,accordingtotherelationshipEARNINGS=a+bTRAINING+uHeispreparedtotestthenullhypothesisH0:b=0againstthealternativehypothesisH1:b¹0atthe5percentand1percentlevels.Whatshouldhereport1.ifb=0.30,s.e.(b)=0.12?2.ifb=0.55,s.e.(b)=0.12?3.ifb=0.10,s.e.(b)=0.12?4.ifb=-0.27,s.e.(b)=0.12?3.11PerformattestontheslopecoefficientandtheinterceptoftheeducationalattainmentfunctionfittedusingyourEAEFdataset,andstateyourconclusions.3.12PerformattestontheslopecoefficientandtheinterceptoftheearningsfunctionfittedusingyourEAEFdataset,andstateyourconclusions.3.13*InExercise2.7,thegrowthrateofemploymentwasregre
ssedonthegrowthrateofGDPforasampleof25OECDcountries.Performttestsontheslopecoefficientandtheinterceptandstateyouronclusions.3.8ConfidenceIntervalsThusfarwehavebeenassumingthatthehypothesisprecededtheempiricalinvestigation.Thisisnotnecessarilythecase.Usuallytheoryandexperimentationareinteractive,andtheearningsfunctionregressionprovidesatypicalexample.Werantheregressioninthefirstplacebecauseeconomictheorytellsustoexpectearningstobeaffectedbyschooling.Theregressionresultconfirmedthisintuitionsincewerejectedthenullhypothesisb=0.butwewerethenleftwithsomethingofavacuum,sinceourtheoryisnotstrongenoughtosuggestthatthetruevalueofbisequaltosomespecificnumber.However,wecannowmoveintheoppositedirectionandaskourselvesthefollowingquestion:givenourregressionresult,whathypotheseswouldbecompatiblewithit?Obviouslyahypothesisb=1.073wouldbecompatible,becausethenhypothesisandexperimentalresultcoincide.Alsob=1.072andb=1.074wouldbecompatible,becausethedifferencebetweenhypothesisandexperimentalresultwouldbesosmall.Thequestionis,howfarcanahypotheticalvaluedifferfromourexperimentalresultbeforetheybecomeincompatibleandwehavetorejectthenullhypothesis?Wecananswerthisquestionbyexploitingthepreviousanalysis.From(3.50),wecanseethatregressioncoefficientbandhypotheticalvaluebareincompatibleifeither\nPROPERTIESOFTHEREGRESSIONCOEFFICIENTS32b-bb-b>tcritor<-tcrit(3.54)s.e.(b)s.e.(b)thatis,ifeitherb–b>s.e.(b)´tcritorb–b<-s.e.(b)´tcrit(3.55)thatis,ifeitherb–s.e.(b)´tcrit>borb+s.e.(b)´tcritb0Wehavediscussedthecaseinwhichthealternativehypothesisinvolvedaspecifichypotheticalvalueb1,withb1greaterthanb0.Clearly,thelogicthatledustouseaone-tailedtestwouldstillapplyevenifH1weremoregeneralandmerelyassertedthatb1>b0,withoutstatinganyparticularvalue.WewouldstillwishtoeliminatethelefttailfromtherejectionregionbecausealowvalueofbismoreprobableunderH0:b=b0thanunderH1:b>b0,andthiswouldbeevidenceinfavorofH0,notagainstit.Therefore,wewouldstillpreferaone-tailedttest,usingtherighttailastherejectionregion,toatwo-tailedtest.N
otethat,sinceb1isnotdefined,wenowhavenowayofcalculatingthepowerofsuchatest.However,wecanstillbesurethat,foranygivensignificancelevel,thepowerofaone-tailedtestwillbegreaterthanthatofthecorrespondingtwo-tailedtest.H0:b=b1,H1:b0insteadofthemoregeneralH1:b¹0.ThisisanadvantagebecausethecriticalvalueoftforrejectingH0islowerfortheone-tailedtest,soitiseasiertorefutethenullhypothesisandestablishtherelationship.ExamplesIntheearningsfunctionregression,therewere568degreesoffreedomandthecriticalvalueoft,usingthe1percentsignificancelevelandatwo-tailedtest,isapproximately2.59.Ifwetakeadvantageofthefactthatitisreasonabletoexpectincomenottohaveanegativecoefficient,wecoulduseaone-tailedtestandthecriticalvalueisreducedtoapproximately2.33.Thetstatisticisinfactequalto8.10,sointhiscasetherefinementmakesnodifference.Theestimatedcoefficientissolargerelativetoitsstandarderrorthatwerejectthenullhypothesisregardlessofwhetherweuseatwo-tailedoraone-tailedtest.Inthepriceinflation/wageinflationexample,exploitingthepossibilityofusingaone-tailedtestdoesmakeadifference.ThenullhypothesiswasthatwageinflationisreflectedfullyinpriceinflationandwehaveH0:b=1.Themainreasonwhythetypesofinflationmaybedifferentisthatimprovementsinlaborproductivitymaycausepriceinflationtobelowerthanwageinflation.Certainlythiswillnotcausepriceinflationtobegreaterthanwageinflationandsointhiscasewearejustifiedinrulingoutb>1.WeareleftwithH0:b=1andH1:b<1.Givenaregressioncoefficient0.82andastandarderror0.10,thetstatisticforthenullhypothesisis-1.80.Thiswasnothighenough,inabsoluteterms,tocauseH0toberejectedatthe5percentlevelusingatwo-tailedtest(criticalvalue2.10).However,ifweuseaone-tailedtest,asweareentitledto,thecriticalvaluefallsto1.73andwecanrejectthenullhypothesis.Inotherwords,wecanconcludethatpriceinflationissignificantlylowerthanwageinflation.Exercises3.17Explainwhetheritwouldhavebeenpossibletoperformone-tailedtestsinsteadoftwo-tailedtestsinExercise3.9.Ifyouthinkthatone-tailedtestsarejustified,performthemandstatewhethertheuseof
a one-tailed test makes any difference.

3.18* Explain whether it would have been possible to perform one-tailed tests instead of two-tailed tests in Exercise 3.10. If you think that one-tailed tests are justified, perform them and state whether the use of a one-tailed test makes any difference.

3.19* Explain whether it would have been possible to perform one-tailed tests instead of two-tailed tests in Exercise 3.11. If you think that one-tailed tests are justified, perform them and state whether the use of a one-tailed test makes any difference.

3.10 The F Test of Goodness of Fit

Even if there is no relationship between $y$ and $x$, in any given sample of observations there may appear to be one, if only a faint one. Only by coincidence will the sample covariance be exactly equal to zero. Accordingly, only by coincidence will the correlation coefficient and $R^2$ be exactly equal to zero.

This presents us with a problem. How do we know if the value of $R^2$ for the regression reflects a true relationship or if it has arisen as a matter of chance?

We could in principle adopt the following procedure. We take as our null hypothesis that there is no relationship between $y$ and $x$, and calculate the value that would be exceeded by $R^2$ as a matter of chance, 5 percent of the time. We then take this figure as the critical level of the 5 percent significance test. If it is exceeded, we reject the null hypothesis.

Such a test, like the t test on a coefficient, would not be foolproof. Indeed, at the 5 percent significance level, one would risk making a Type I error (rejecting the null hypothesis when it is in fact true) 5 percent of the time. Of course you could cut down on this risk by using a higher significance level, for example, the 1 percent level. The critical level of $R^2$ would then be that which would be exceeded by chance only 1 percent of the time, so it would be higher than the critical level for the 5 percent test.

How does one find the critical level of $R^2$ at either significance level? Well, there is a slight problem. There is no such thing as a table of critical levels of $R^2$. The traditional procedure is to use an indirect approach and perform what is known as an F test based on analysis of variance. (For an explanation of the theory behind it, see, for example, Mood and Graybill, 1963.)

Suppose that, as in this case, you can decompose the variance of the dependent variable into "explained" and "unexplained" components using (2.46):

$$\text{Var}(y) = \text{Var}(\hat{y}) + \text{Var}(e) \tag{3.61}$$

Using the definition of sample variance, and multiplying through by $n$, we can rewrite the decomposition as

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum e^2 \tag{3.62}$$

(Remember that $\bar{e}$ is zero and that the sample mean of $\hat{y}$ is equal to the sample mean of $y$.)

The left side is TSS, the total sum of squares of the values of the dependent variable about its sample mean. The first term on the right side is ESS, the explained sum of squares, and the second term is RSS, the unexplained, residual sum of squares which we initially simply called $S$:

$$TSS = ESS + RSS \tag{3.63}$$

The F statistic for the goodness of fit of a regression is written as the explained sum of squares, per explanatory variable, divided by the residual sum of squares, per degree of freedom remaining:

$$F = \frac{ESS/k}{RSS/(n - k - 1)} \tag{3.64}$$

where $k$ is the number of explanatory variables.

By dividing both the numerator and the denominator of the ratio by TSS, this F statistic may equivalently be expressed in terms of $R^2$:

$$F = \frac{(ESS/TSS)/k}{(RSS/TSS)/(n - k - 1)} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)} \tag{3.65}$$

In the present context, $k$ is 1, so (3.65) becomes

$$F = \frac{R^2}{(1 - R^2)/(n - 2)} \tag{3.66}$$

Having calculated $F$ from your value of $R^2$, you look up $F_{crit}$, the critical level of $F$, in the appropriate table. If $F$ is greater than $F_{crit}$, you reject the null hypothesis and conclude that the "explanation" of $y$ is better than is likely to have arisen by chance.

Table A.3 gives the critical levels of $F$ at the 5 percent, 1 percent and 0.1 percent significance levels. In each case the critical level depends on the number of explanatory variables, $k$, which is read from along the top of the table, and the number of degrees of freedom, $(n - k - 1)$, which is read off down the side. In the present context, we are concerned with simple regression analysis, $k$ is 1, and we should use the first column of the table.

In the earnings function example, $R^2$ was 0.1036. Since there were 570 observations, the F statistic is equal to $R^2/\{(1 - R^2)/568\} = 0.1036/(0.8964/568) = 65.65$. At the 0.1 percent significance level, the critical level of $F$ for 1 and 120 degrees of freedom (looking at the first column, row 120) is 11.38. The critical value for 1 and 568 degrees of freedom must be lower, so we have no hesitation in rejecting the null hypothesis in this particular example. In other words, the underlying value of $R^2$ is so high that we reject the suggestion that it could have arisen by chance. In practice the F statistic is always computed for you, along with $R^2$, so you never actually have to use (3.65) yourself.

Why do people bother with this indirect approach? Why not have a table of critical levels of $R^2$? The answer is that the F table is useful for testing many forms of analysis of variance, of which $R^2$ is only one. Rather than have a specialized table for each application, it is more convenient (or, at least, it saves a lot of paper) to have just one general table, and make transformations like (3.65) when necessary.

Of course you could derive critical levels of $R^2$ if you were sufficiently interested. The critical level of $R^2$ would be related to the critical level of $F$ by

$$F_{crit} = \frac{R^2_{crit}/k}{(1 - R^2_{crit})/(n - k - 1)} \tag{3.67}$$

which yields

$$R^2_{crit} = \frac{k F_{crit}}{k F_{crit} + (n - k - 1)} \tag{3.68}$$

In the earnings function example, the critical value of $F$ at the 0.1 percent significance level was approximately 11.38. Hence in this case, with $k = 1$,

$$R^2_{crit} = \frac{11.38}{11.38 + 568} = 0.020 \tag{3.69}$$

Although it is low, our $R^2$ is greater than 0.020, so a direct comparison of $R^2$ with its critical value confirms the conclusion of the F test that we should reject the null hypothesis.

Exercises

3.20 In Exercise 2.6, in the regression of employment growth rates on growth rates of GDP using a sample of 25 OECD countries, $R^2$ was 0.5909. Calculate the corresponding F statistic and check that it is equal to 33.22, the value printed in the output. Perform the F test at the 5 percent and 1 percent significance levels. Is it necessary to report the results of the tests at both levels?

3.21 Similarly, calculate the F statistic from the value of $R^2$ obtained in the earnings function fitted using your EAEF data set and check that it is equal to the value printed in the output. Perform an appropriate F test.

3.11 Relationship between the F Test of Goodness of Fit and the t test on the Slope Coefficient in Simple Regression Analysis

In the context of simple regression analysis (and only simple regression analysis) the F test on $R^2$ and the two-tailed t test on the slope coefficient both have $H_0$: $\beta = 0$ as their null hypothesis and $H_1$: $\beta \ne 0$ as their alternative hypothesis. This gives rise to the possibility that they might lead to different conclusions. Fortunately, they are in fact equivalent. The F statistic is equal to the square of the $t$ statistic, and the critical value of $F$, at any given significance level, is equal to the square of the critical value of $t$. Starting with the definition of $F$ in (3.66),

$$
\begin{aligned}
F &= \frac{R^2}{(1 - R^2)/(n - 2)}
 = \frac{\dfrac{\text{Var}(\hat{y})}{\text{Var}(y)}}{\left\{1 - \dfrac{\text{Var}(\hat{y})}{\text{Var}(y)}\right\}\Big/(n - 2)} \\[6pt]
 &= \frac{\dfrac{\text{Var}(\hat{y})}{\text{Var}(y)}}{\left\{\dfrac{\text{Var}(y) - \text{Var}(\hat{y})}{\text{Var}(y)}\right\}\Big/(n - 2)}
 = \frac{\dfrac{\text{Var}(\hat{y})}{\text{Var}(y)}}{\left\{\dfrac{\text{Var}(e)}{\text{Var}(y)}\right\}\Big/(n - 2)} \\[6pt]
 &= \frac{\text{Var}(a + bx)}{\left\{\dfrac{1}{n}\sum e_i^2\right\}\Big/(n - 2)}
 = \frac{b^2\,\text{Var}(x)}{\dfrac{1}{n}s_u^2}
 = \frac{b^2}{\dfrac{s_u^2}{n\,\text{Var}(x)}}
 = t^2
\end{aligned}
\tag{3.70}
$$

The proof that the critical value of $F$ is equal to the square of the critical value of $t$ is a little more complicated and will be omitted. When we come to multiple regression analysis, we will see that the F test and the t tests have different roles and different null hypotheses. However, in simple regression analysis the fact that they are equivalent means that there is no point in performing both. Indeed, you would look ignorant if you did. Obviously, provided that it is justifiable, a one-tailed t test would be preferable to either.

Exercises

3.22 Verify that the F statistic in the earnings function regression run by you using your EAEF data set is equal to the square of the $t$ statistic for the slope coefficient, and that the critical value of $F$ at the 1 percent significance level is equal to the square of the critical value of $t$.

3.23 In Exercise 2.7 both researchers obtained values of $R^2$ equal to 0.79 in their regressions. Was this a coincidence?

4 MULTIPLE REGRESSION ANALYSIS

In this chapter least squares regression analysis is generalized to cover the case in which there are several or many explanatory variables in the regression model, rather than just one. Two new topics are discussed. One is the problem of discriminating between the effects of different explanatory variables, a problem that, when particularly severe, is known as multicollinearity. The other is the evaluation of the joint explanatory power of the independent variables, as opposed to their individual marginal effects.

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 05.11.99.

4.1 Illustration: A Model with Two Explanatory Variables

Multiple regression analysis is an extension of simple regression analysis to cover cases in which the dependent variable is hypothesized to depend on more than one explanatory variable. Much of the analysis will be a straightforward extension of the simple regression model, but we will encounter two new problems. First, when evaluating the influence of a given explanatory variable on the dependent variable, we now have to face the problem of discriminating between its effects and the effects of the other explanatory variables. Second, we shall have to tackle the problem of model specification. Frequently a number of variables might be thought to influence the behavior of the dependent variable; on the other hand, they might be irrelevant. We shall have to decide which should be included in the regression equation and which should be excluded. The second problem will be discussed in Chapter 7. In this chapter, we will assume that the model specification is correct. For much of it, we will confine ourselves to the basic case where there are only two explanatory variables.

We will begin by considering an example, the determinants of earnings. We will extend the earlier model to allow for the possibility that earnings are influenced by cognitive ability as well as education and assume that the true relationship can be expressed as

$$EARNINGS = \alpha + \beta_1 HGC + \beta_2 ASVABC + u \tag{4.1}$$

where EARNINGS is hourly earnings, HGC is highest grade completed, ASVABC is composite score on the cognitive tests in the Armed Services Vocational Aptitude Battery, and $u$ is a disturbance term. This model is still of course a great simplification, both in terms of the explanatory variables included in the relationship and in terms of its mathematical specification.

Figure 4.1. True model with two explanatory variables: earnings as a function of schooling and ability (axes: EARNINGS, HGC, ASVABC; marked on the diagram: pure HGC effect, pure ASVABC effect, combined effect of HGC and ASVABC)

To illustrate the relationship geometrically, one needs a three-dimensional diagram with separate axes for EARNINGS, HGC and ASVABC as in Figure 4.1. The base of Figure 4.1 shows the axes for HGC and ASVABC, and, if one neglects the effect of the disturbance term for the moment, the tilted plane above it shows the value of EARNINGS corresponding to any (HGC, ASVABC) combination, measured by the vertical height of the plane above the base at that point. Since earnings may be expected to increase with both schooling and ability, the diagram has been drawn on the assumption that $\beta_1$ and $\beta_2$ are both positive.

Literally, the intercept $\alpha$ gives the predicted earnings for zero schooling and zero ability score. However, such an interpretation would be dangerous because the ASVABC score is scaled in such a way as to make it impossible to score less than 20. Furthermore, there was nobody with no schooling in the NLSY data set. Indeed very few individuals failed to complete 8 years of schooling.

Mathematically (4.1) implies that, if ASVABC were zero, for any positive HGC, earnings would be equal to $\alpha + \beta_1 HGC$, the increase $\beta_1 HGC$ being marked "pure HGC effect" in the figure. Keeping HGC at zero, the equation implies that for any positive value of ASVABC, earnings would be equal to $\alpha + \beta_2 ASVABC$, the increase $\beta_2 ASVABC$ being marked "pure ASVABC effect". The combined effect of schooling and ability, $\beta_1 HGC + \beta_2 ASVABC$, is also indicated.

We have thus far neglected the disturbance term. If it were not for the presence of this in (4.1), the values of EARNINGS in a sample of observations on EARNINGS, HGC and ASVABC would lie exactly on the tilted plane and it would be a trivial matter to deduce the exact values of $\alpha$, $\beta_1$ and $\beta_2$ (not trivial geometrically, unless you are a genius at constructing three-dimensional models, but easy enough algebraically).

    . reg earnings hgc asvabc

      Source |       SS       df       MS             Number of obs =     570
    ---------+------------------------------          F(2, 567)     =   39.98
       Model |  4745.74965     2  2372.87483          Prob > F      =  0.0000
    Residual |  33651.2874   567  59.3497133          R-squared     =  0.1236
    ---------+------------------------------          Adj R-squared =  0.1205
       Total |  38397.0371   569  67.4816117          Root MSE      =  7.7039

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   .7390366   .1606216      4.601   0.000       .4235506    1.054523
      asvabc |   .1545341   .0429486      3.598   0.000       .0701764    .2388918
       _cons |  -4.624749     2.0132     -2.297   0.022      -8.578989   -.6705095
    ------------------------------------------------------------------------------

The disturbance term causes the actual value of earnings to be sometimes above and sometimes below the value indicated by the tilted plane. Consequently one now has a three-dimensional counterpart to the two-dimensional problem illustrated in Figure 2.2. Instead of locating a line to fit a two-dimensional scatter of points, we now have to locate a plane to fit a three-dimensional scatter. The equation of the fitted plane will be

$$\widehat{EARNINGS} = a + b_1 HGC + b_2 ASVABC \tag{4.2}$$

and its location will depend on the choice of $a$, $b_1$ and $b_2$, the estimates of $\alpha$, $\beta_1$ and $\beta_2$, respectively. Using Data Set 21 we obtain the regression output shown above.

The equation should be interpreted as follows. For every additional grade completed, holding the ability score constant, hourly earnings increase by $0.74. For every point increase in the ability score, holding schooling constant, earnings increase by $0.15.

4.2 Derivation and Interpretation of the Multiple Regression Coefficients

As in the simple regression case, we choose the values of the regression coefficients to make the fit as good as possible in the hope that we will obtain the most satisfactory estimates of the unknown true parameters. As before, our definition of goodness of fit is the minimization of $S$, the sum of squares of the residuals:

$$S = e_1^2 + \dots + e_n^2 \tag{4.3}$$

where $e_i$ is the residual in observation $i$, the difference between the actual value of $y$ in that observation and the value $\hat{y}$ predicted by the regression equation:

$$\hat{y}_i = a + b_1 x_{1i} + b_2 x_{2i} \tag{4.4}$$

$$e_i = y_i - \hat{y}_i = y_i - a - b_1 x_{1i} - b_2 x_{2i} \tag{4.5}$$

Using (4.5), we can write

$$S = \sum e_i^2 = \sum (y_i - a - b_1 x_{1i} - b_2 x_{2i})^2 \tag{4.6}$$

The first-order conditions for a minimum, $\dfrac{\partial S}{\partial a} = 0$, $\dfrac{\partial S}{\partial b_1} = 0$, and $\dfrac{\partial S}{\partial b_2} = 0$, yield the following equations:

$$\frac{\partial S}{\partial a} = -2\sum (y_i - a - b_1 x_{1i} - b_2 x_{2i}) = 0 \tag{4.7}$$

$$\frac{\partial S}{\partial b_1} = -2\sum x_{1i}(y_i - a - b_1 x_{1i} - b_2 x_{2i}) = 0 \tag{4.8}$$

$$\frac{\partial S}{\partial b_2} = -2\sum x_{2i}(y_i - a - b_1 x_{1i} - b_2 x_{2i}) = 0 \tag{4.9}$$

Hence we have three equations in the three unknowns, $a$, $b_1$, and $b_2$. The first can easily be rearranged to express $a$ in terms of $b_1$, $b_2$, and the data on $x$ and $y$:

$$a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \tag{4.10}$$

Using this expression and the other two equations, with a little work one can obtain the following expression for $b_1$:

$$b_1 = \frac{\text{Cov}(x_1, y)\,\text{Var}(x_2) - \text{Cov}(x_2, y)\,\text{Cov}(x_1, x_2)}{\text{Var}(x_1)\,\text{Var}(x_2) - \{\text{Cov}(x_1, x_2)\}^2} \tag{4.11}$$

A parallel expression for $b_2$ can be obtained by interchanging $x_1$ and $x_2$ in (4.11).

The intention of this discussion is to press home two basic points. First, the principles behind the calculation of the regression coefficients are the same for multiple regression as for simple regression. Second, the expressions, however, are different, and so you shouldn
ottrytouseexpressionsderivedforsimpleregressioninamultipleregressioncontext.TheGeneralModelIntheprecedingexample,weweredealingwithonlytwoexplanatoryvariables.Whentherearemorethantwo,itisnolongerpossibletogiveageometricalrepresentationofwhatisgoingon,buttheextensionofthealgebraisinprinciplequitestraightforward.Weassumethatavariableydependsonkexplanatoryvariablesx1,...,xkaccordingtoatrue,unknownrelationship\nMULTIPLEREGRESSIONANALYSIS5y=a+b1x1+…+bkxk+u.(4.12)Givenasetofnobservationsony,x1,...,xk,weuseleastsquaresregressionanalysistofittheequationyˆ=a+b1x1+...+bkxk(4.13)Thisagainmeansminimizingthesumofthesquaresoftheresiduals,whicharegivenbyei=yi-yˆi=yi-a-b1x1i-...-bkxki(4.14)(4.14)isthegeneralizationof(4.5).Wenowchoosea,b1,...,bksoastominimizeS,thesumofthe2¶S¶S¶Ssquaresoftheresiduals,åei.Weobtaink+1firstorderconditions=0,=0,…,=0,¶a¶b1¶bkandtheseprovidek+1equationsforsolvingforthek+1unknowns.Itcanreadilybeshownthatthefirstoftheseequationsyieldsacounterpartto(4.10)inthetwo-explanatory-variablecase:a=y-b1x1-...-bkxk.(4.15)Theexpressionsforb1,...,bkbecomeverycomplicatedindeedandthemathematicswillnotbepresentedexplicitlyhere.Thecalculationsaremuchbetterdonewithmatrixalgebraandinanycasethereisnotheoreticalorpracticalapplicationintherestofthistext.Forpracticalexamples,handcalculationisgenerallyoutofthequestionandacomputerhastobeusedtofindthesolutions.InterpretationoftheMultipleRegressionCoefficientsMultipleregressionanalysisallowsonetodiscriminatebetweentheeffectsoftheexplanatoryvariables,makingallowanceforthefactthattheymaybecorrelated.Theregressioncoefficientofeachxvariableprovidesanestimateofitsinfluenceony,controllingfortheeffectsofalltheotherxvariables.Thiscanbedemonstratedintwoways.Oneistoshowthattheestimatorsareunbiased,ifthemodeliscorrectlyspecifiedandtheGauss-Markovconditionsarefulfilled.Weshalldothisinthenextsectionforthecasewherethereareonlytwoexplanatoryvariables.Asecondmethodistorunasimpleregressionofyagainstoneofthexvariables,havingfirstpurgedbothyandthexvariab
leofthecomponentsthatcouldbeaccountedforbytheotherexplanatoryvariables.Theestimateoftheslopecoefficientanditsstandarderrorthusobtainedareexactlythesameasinthemultipleregression.Itfollowsthatascatterdiagramplottingthepurgedyagainstthepurgedxvariablewillprovideavalidgraphicalrepresentationoftheirrelationshipwhichcanbeobtainedinnootherway.ThisresultwillnotbeprovedbutitwillbeillustratedusingtheearningsfunctionintheSection4.1:EARNINGS=a+b1HGC+b2ASVABC+u.(4.16)Supposethatweareparticularlyinterestedintherelationshipbetweenearningsandschoolingandthatwewouldliketoillustrateitgraphically.AstraightforwardplotofEARNINGSonHGC,asinFigure2.8,wouldgiveadistortedviewoftherelationshipbecauseASVABCiscorrelatedwithHGC.Asa\nMULTIPLEREGRESSIONANALYSIS6consequence,asHGCincreases,(i)EARNINGSwilltendtoincrease,becauseb1ispositive;(ii)ASVABCwilltendtoincrease,becauseHGCandASVABCarepositivelycorrelated;and(iii)EARNINGSwillreceiveaboostduetotheincreaseinASVABCandthefactthatb2ispositive.Inotherwords,thevariationsinEARNINGSwillexaggeratetheapparentinfluenceofHGCbecauseinparttheywillbeduetoassociatedvariationsinASVABC.Asaconsequence,inasimpleregressiontheestimatorofb1willbebiased.WewillinvestigatethebiasanalyticallyinSection7.2.Inthisexample,thereisonlyoneotherexplanatoryvariable,ASVABC.TopurgeEARNINGSandHGCoftheirASVABCcomponents,wefirstregressthemonASVABC:EARNˆINGS=c0+c1ASVABC(4.17)HGˆC=d0+d1ASVABC(4.18)Wethensubtractthefittedvaluesfromtheactualvalues:EEARN=EARNINGS-EARNˆINGS(4.19)EHGC=HGC-HGˆC(4.20)ThepurgedvariablesEEARNandEHGCareofcoursejusttheresidualsfromtheregressions(4.17)and(4.18).WenowregressEEARNonEHGCandobtaintheoutputshown..regeearnehgcSource|SSdfMSNumberofobs=570---------+------------------------------F(1,568)=21.21Model|1256.4423911256.44239Prob>F=0.0000Residual|33651.287356859.2452241R-squared=0.0360---------+------------------------------AdjR-squared=0.0343Total|34907.729756961.3492613RootMSE=7.6971------------------------------------------------------------------------------eea
rn|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------ehgc|.7390366.16048024.6050.000.42382961.054244_cons|-5.99e-09.32239570.0001.000-.6332333.6332333------------------------------------------------------------------------------YoucanverifythatthecoefficientofEHGCisidenticaltothatofHGCinthemultipleregression.Thestandarderrordiffersatthefourthdecimalplace,butthisisjustduetoroundingerrorinthecomputations.Figure4.2showstheregressionlineinascatterdiagram.ThedottedlineinthefigureistheregressionlinefromasimpleregressionofEARNINGSonHGC,shownforcomparison.\nMULTIPLEREGRESSIONANALYSIS770EEARN6050403020100-8-6-4-202468EHGC-10-20Figure4.2.RegressionofEARNINGSresidualsonHGCresidualsTheestimateoftheinterceptintheregressionusesacommonconventionforfittingverylargenumbersorverysmallonesintoapredefinedfield.exindicatesthatthecoefficientshouldbex-xmultipliedby10.Similarlye–xindicatesthatitshouldbemultipliedby10.Thusinthisregressiontheinterceptiseffectivelyzero.Exercises4.1Theresultoffittinganeducationalattainmentfunction,regressingHGConASVABC,HGCM,andHGCF,highestgradecompletedbytherespondent’smotherandfather,respectively,usingDataSet21isshownbelow.Giveaninterpretationoftheregressioncoefficients..reghgcasvabchgcmhgcfSource|SSdfMSNumberofobs=570---------+------------------------------F(3,566)=110.83Model|1278.241533426.080508Prob>F=0.0000Residual|2176.005845663.84453329R-squared=0.3700---------+------------------------------AdjR-squared=0.3667Total|3454.247375696.07073351RootMSE=1.9607------------------------------------------------------------------------------hgc|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------asvabc|.1295006.009954413.0090.000.1099486.1490527hgcm|.069403.04229741.6410.101-.013676.152482hgcf|.1102684.03119483.5350.000.0489967.1715401_cons|4.914654.50635279.7060.0003.9200945.909214----------------------------------------------------------
--------------------\nMULTIPLEREGRESSIONANALYSIS84.2FitanearningsfunctionparalleltothatinSection4.1,usingyourEAEFdataset,andgiveaninterpretationofthecoefficients.4.3FitaneducationalattainmentfunctionparalleltothatinExercise4.1,usingyourEAEFdataset,andgiveaninterpretationofthecoefficients.4.4UsingyourEAEFdataset,makeagraphicalrepresentationoftherelationshipbetweenHGCandHGCMusingthetechniquedescribedabove,assumingthatthetruemodelisasinExercise4.3.Todothis,regressHGConASVABCandHGCFandsavetheresiduals.DothesamewithHGCM.PlottheHGCandHGCMresiduals.Alsoregresstheformeronthelatter,andverifythattheslopecoefficientisthesameasthatobtainedinExercise4.3.4.5*ExplainwhytheinterceptintheregressionofEEARNonEHGCisequaltozero.4.3PropertiesoftheMultipleRegressionCoefficientsAsinthecaseofsimpleregressionanalysis,theregressioncoefficientsshouldbethoughtofasspecialkindsofrandomvariableswhoserandomcomponentsareattributabletothepresenceofthedisturbanceterminthemodel.Eachregressioncoefficientiscalculatedasafunctionofthevaluesofyandtheexplanatoryvariablesinthesample,andyinturnisdeterminedbytheexplanatoryvariablesandthedisturbanceterm.Itfollowsthattheregressioncoefficientsarereallydeterminedbythevaluesoftheexplanatoryvariablesandthedisturbancetermandthattheirpropertiesdependcriticallyuponthepropertiesofthelatter.WeshallcontinuetoassumethattheGauss-Markovconditionsaresatisfied,namely(1)thattheexpectedvalueofuinanyobservationiszero,(2)thatthevarianceofitsdistributionisthesameforallobservations,(3)thatthepopulationcovarianceofitsvaluesinanytwoobservationsiszero,and(4)thatitisdistributedindependentlyofanyexplanatoryvariable.Thefirstthreeconditionsarethesameasforsimpleregressionanalysisand(4)isageneralizationofitscounterpart.Forthetimebeingweshalladoptastrongerversionof(4)andassumethattheexplanatoryvariablesarenonstochastic.Therearetwofurtherpracticalrequirements.First,theremustbeenoughdatatofittheregressionline;thatis,theremustbeatleastasmany(independent)observationsasthereareparameterstobeestimat
ed.Second,asweshallseeinthissection,theremustnotbeanexactlinearrelationshipamongtheexplanatoryvariables.UnbiasednessWewillfirstshowthatb1isanunbiasedestimatorofb1inthecasewheretherearetwoexplanatoryvariables.Theproofcaneasilybegeneralized,usingmatrixalgebra,toanynumberofexplanatoryvariables.Asonecanseefrom(4.11),b1iscalculatedasafunctionofx1,x2,andy;yinturnisgeneratedbyx1,x2,andu.Henceb1dependsinfactonthevaluesofx1,x2anduinthesample(providedthatyouunderstandwhatisgoingon,youmayskipthedetailsofthemathematicalworking):\nMULTIPLEREGRESSIONANALYSIS9Cov(x1,y)Var(x2)-Cov(x2,y)Cov(x1,x2)b1=2Var(x1)Var(x2)-[Cov(x1,x2)]1ìCov(x1,[a+b1x1+b2x2+u])Var(x2)ü=íý(4.21)Dî-Cov(x2,[a+b1x1+b2x2+u])Cov(x1,x2)þ1ì[Cov(x1,a)+Cov(x1,b1x1)+Cov(x1,b2x2)+Cov(x1,u)]Var(x2)ü=íýDî-[Cov(x2,a)+Cov(x2,b1x1)+Cov(x2,b2x2)+Cov(x2,u)]Cov(x1,x2)þ2whereDisVar(x1)Var(x1)–[Cov(x1,x2)].Cov(x1,a)andCov(x1,a)arebothzero,usingCovarianceRule3,becauseaisaconstant.Cov(x1,b1x1)isb1Var(x1)usingCovarianceRule2andthefactthatCov(x1,x1)isthesameasVar(x1).SimilarlyCov(x2,b2x2)isb2Var(x2).Hence1ì[b1Var(x1)+b2Cov(x1,x2)+Cov(x1,u)]Var(x2)üb1=íýDî-[b1Cov(x2,x1)+b2Var(x2)+Cov(x2,u)]Cov(x1,x2)þ1ìb1Var(x1)Var(x2)+b2Cov(x1,x2)Var(x2)+Cov(x1,u)Var(x2)ü=íý2Dî-b1[Cov(x2,x1)]-b2Var(x2)Cov(x1,x2)-Cov(x2,u)Cov(x1,x2)þ1{()2}=b1Var(x1)Var(x2)-[Cov(x2,x1)]+Cov(x1,u)Var(x2)-Cov(x2,u)Cov(x1,x2)(4.22)D1={}b1D+Cov(x1,u)Var(x2)-Cov(x2,u)Cov(x1,x2)D1=b1+{}Cov(x1,u)Var(x2)-Cov(x2,u)Cov(x1,x2)DThusb1hastwocomponents:thetruevalueb1andanerrorcomponent.Ifwetakeexpectations,wehave1E(b1)=b1+{}Var(x2)E[Cov(x1,u)]-Cov(x1,x2)E[Cov(x2,u)]D(4.23)=b1providedthatthefourthGauss-Markovconditionissatisfied.(TheproofthatE[Cov(x1,u)]andE[Cov(x2,u)]arezeroisparalleltothatforE[Cov(x,u)]beingzeroinChapter3.)EfficiencyTheGauss-Markovtheoremprovesthat,formultipleregressionanalysis,asforsimpleregressionanalysis,theordinaryleastsquares(OLS)techniqueyieldsthemostefficientlinearestimators,inthesensethatitisimpossibletofindotherunbiasedestimatorswithlowervariances,usingthe
same sample information, provided that the Gauss-Markov conditions are satisfied. We will not attempt to prove this theorem since matrix algebra is required.

Precision of the Multiple Regression Coefficients

We will investigate the factors governing the likely precision of the regression coefficients for the case where there are two explanatory variables. Similar considerations apply in the more general case, but with more than two variables one needs to switch to matrix algebra.

If the true relationship is

\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u, \tag{4.24} \]

and you fit the regression line

\[ \hat{y} = a + b_1 x_1 + b_2 x_2, \tag{4.25} \]

using suitable data, the population variance of the probability distribution for \(b_1\) is given by

\[ \text{pop.var}(b_1) = \frac{\sigma_u^2}{n\,\mathrm{Var}(x_1)} \times \frac{1}{1 - r_{x_1 x_2}^2} \tag{4.26} \]

where \(\sigma_u^2\) is the population variance of u. A parallel expression may be obtained for the population variance of \(b_2\), replacing \(\mathrm{Var}(x_1)\) with \(\mathrm{Var}(x_2)\).

From (4.26) you can see that, as in the case of simple regression analysis, it is desirable for n and \(\mathrm{Var}(x_1)\) to be large and for \(\sigma_u^2\) to be small. However, we now have the further term \((1 - r_{x_1 x_2}^2)\) and clearly it is desirable that there should be a low correlation between \(x_1\) and \(x_2\).

It is easy to give an intuitive explanation of this. The greater the correlation, the harder it is to discriminate between the effects of the explanatory variables on y, and the less accurate will be the regression estimates. This can be a serious problem and it is discussed in the next section.

The standard deviation of the distribution of \(b_1\) is the square root of the variance. As in the simple regression case, the standard error of \(b_1\) is the estimate of the standard deviation. For this we need to estimate \(\sigma_u^2\). The variance of the residuals provides a biased estimator:

\[ E\{\mathrm{Var}(e)\} = \frac{n-k-1}{n}\,\sigma_u^2 \tag{4.27} \]

but we can obtain an unbiased estimator, \(s_u^2\), by neutralizing the bias:

\[ s_u^2 = \frac{n}{n-k-1}\,\mathrm{Var}(e) \tag{4.28} \]

The standard error is then given by

\[ \text{s.e.}(b_1) = \sqrt{\frac{s_u^2}{n\,\mathrm{Var}(x_1)} \times \frac{1}{1 - r_{x_1,x_2}^2}} \tag{4.29} \]

The determinants of the standard error will be illustrated by comparing them in earnings functions fitted to two subsamples of the respondents in Data Set 21, those who reported that their wages were set by collective bargaining (CB) and the remainder (not CB). Regression output for the two subsamples is shown below. In Stata, subsamples may be selected by adding an “if” expression to a command. COLLBARG is a variable in the data set defined to be 0 for the not CB subsample and 1 for the CB subsample. Note that Stata requires the = sign to be doubled (==) when it is used to test equality in such an expression.

    . reg earnings hgc asvabc if collbarg==0

      Source |       SS       df       MS              Number of obs =     507
    ---------+------------------------------           F(  2,   504) =   40.31
       Model |  4966.96516     2  2483.48258           Prob > F      =  0.0000
    Residual |  31052.2066   504  61.6115211           R-squared     =  0.1379
    ---------+------------------------------           Adj R-squared =  0.1345
       Total |  36019.1718   506   71.184134           Root MSE      =  7.8493

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   .8891909   .1741617     5.106   0.000       .5470186    1.231363
      asvabc |   .1398727   .0461806     3.029   0.003       .0491425    .2306029
       _cons |  -6.100961    2.15968    -2.825   0.005      -10.34404   -1.857877
    ------------------------------------------------------------------------------

    . reg earnings hgc asvabc if collbarg==1

      Source |       SS       df       MS              Number of obs =      63
    ---------+------------------------------           F(  2,    60) =    2.58
       Model |  172.902083     2  86.4510417           Prob > F      =  0.0844
    Residual |  2012.88504    60  33.5480841           R-squared     =  0.0791
    ---------+------------------------------           Adj R-squared =  0.0484
       Total |  2185.78713    62  35.2546311           Root MSE      =  5.7921

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |  -.3872787   .3530145    -1.097   0.277      -1.093413    .3188555
      asvabc |   .2309133   .1019211     2.266   0.027       .0270407    .4347858
       _cons |   8.291716   4.869209     1.703   0.094      -1.448152    18.03158
    ------------------------------------------------------------------------------

The standard error of the coefficient of HGC is 0.1742 in the first regression and 0.3530, twice as large, in the second. We will investigate the reasons for the difference. It will be convenient to rewrite (4.29) in such a way as to isolate the contributions of the various factors:

\[ \text{s.e.}(b_1) = s_u \times \frac{1}{\sqrt{n}} \times \frac{1}{\sqrt{\mathrm{Var}(x_1)}} \times \frac{1}{\sqrt{1 - r_{x_1,x_2}^2}} \tag{4.30} \]

The first element we need, \(s_u\), can be obtained directly from the regression output. \(s_u^2\) is equal to the sum of the squares of the residuals divided by n - k - 1, here n - 3:

\[
\begin{aligned}
s_u^2 &= \frac{n}{n-k-1}\,\mathrm{Var}(e) = \frac{n}{n-k-1} \times \frac{1}{n} \sum (e_i - \bar{e})^2 \\
&= \frac{1}{n-k-1} \sum e_i^2 = \frac{1}{n-k-1}\,RSS
\end{aligned}
\tag{4.31}
\]

(Note that \(\bar{e}\) is equal to zero. This was proved for the simple regression model in Chapter 3, and the proof generalizes easily.) RSS is given in the top left quarter of the regression output, as part of the decomposition of the total sum of squares into the explained sum of squares (here denoted the model sum of squares) and the residual sum of squares. The value of n - k - 1 is given to the right of RSS, and the ratio RSS/(n - k - 1) to the right of that. The square root, \(s_u\), is listed as the Root MSE (root mean square error) in the top right quarter of the regression output, 7.8493 for the not CB regression and 5.7921 for the CB regression.

The number of observations, 507 in the first regression and 63 in the second, is also listed in the top right quarter of the regression output. The variances of HGC, 6.0645 and 6.0136, had to be calculated from the sample data. The correlations between HGC and ASVABC, 0.5826 and 0.5380 respectively, were calculated using the Stata “cor” command. The factors of the standard error in equation (4.30) were then derived and are shown in the lower half of Table 4.1.

TABLE 4.1
Decomposition of the Standard Error of HGC

    Component    s_u       n        Var(HGC)       r(HGC,ASVABC)     s.e.
    Not CB       7.8493    507      6.0645         0.5826            0.1742
    CB           5.7921    63       6.0136         0.5380            0.3530

    Factor       s_u       1/√n     1/√Var(HGC)    1/√(1 - r²)       product
    Not CB       7.8493    0.0444   0.4061         1.2304            0.1741
    CB           5.7921    0.1260   0.4078         1.1863            0.3531

    Note: CB = wages determined by collective bargaining

It can be seen that, in this example, the reason that the standard error of HGC in the CB subsample is relatively large is that the number of observations in that subsample is relatively small. The effect of the variance of HGC is neutral, and those of the other two factors are in the opposite direction, but not enough to make much difference.

t tests and Confidence Intervals

t tests on the regression coefficients are performed in the same way as for simple regression analysis. Note that when you are looking up the critical level of t at any given significance level, it will depend on the number of degrees of freedom, n - k - 1: the number of observations minus the number of coefficients estimated (one coefficient for each explanatory variable, and the constant term). The confidence intervals are also constructed in exactly the same way as
in simple regression analysis, subject to the above comment about the number of degrees of freedom. As can be seen from the regression output, Stata automatically calculates confidence intervals for the coefficients (95% by default, other levels if desired), but this is not a standard feature of regression packages.

Consistency

Provided that the fourth Gauss-Markov condition is satisfied, OLS yields consistent estimates in the multiple regression model, as in the simple regression model. As n becomes large, the population variance of the estimator of each regression coefficient tends to zero and the distribution collapses to a spike, one condition for consistency. Since the estimator is unbiased, the spike is located at the true value, the other condition for consistency.

Exercises

4.6 Perform t tests on the coefficients of the variables in the educational attainment function reported in Exercise 4.1.

4.7 Perform t tests on the coefficients of the variables in the earnings function and the educational attainment function fitted by you in Exercises 4.2 and 4.3.

4.8 The following earnings functions were fitted separately for males and females, using Data Set 21 (standard errors in parentheses):

    males     EARNINGS^ = -3.6121 + 0.7499 HGC + 0.1558 ASVABC
                          (2.8420)  (0.2434)     (0.0600)

    females   EARNINGS^ = -5.9010 + 0.8803 HGC + 0.1088 ASVABC
                          (2.6315)  (0.1910)     (0.0577)

Using equation (4.30), explain why the standard errors of the coefficients of HGC and ASVABC are greater for the male subsample than for the female subsample, and why the difference in the standard errors is relatively large for HGC.

Further data:

                      males    females
    s_u                8.47       6.23
    n                   325        245
    r(HGC,ASVABC)      0.61       0.55
    Var(HGC)           5.88       6.26
    Var(ASVABC)       96.65      68.70

4.9* Demonstrate that \(\bar{e}\) is equal to zero in multiple regression analysis. (Note: The proof is a generalization of the proof for the simple regression model, given in Section 2.7.)

4.10 Investigate whether you can extend the determinants of weight model using your EAEF data set, taking WEIGHT94 as the dependent variable, and HEIGHT and other continuous variables in the data set. Provide an interpretation of the coefficients and perform t tests on them.

4.4 Multicollinearity

In the previous section, in the context of a model with two explanatory variables, it was seen that the higher the correlation between the explanatory variables, the larger are the population variances of the distributions of their coefficients, and the greater is the risk of obtaining erratic estimates of the coefficients. If the correlation causes the regression model to become unsatisfactory in these respects, it is said to be suffering from multicollinearity.

A high correlation does not necessarily lead to poor estimates. If all the other factors determining the variances of the regression coefficients are helpful, that is, if the number of observations and the sample variances of the explanatory variables are large, and the variance of the disturbance term small, you may well obtain good estimates after all. Multicollinearity therefore must be caused by a combination of a high correlation and one or more of the other factors being unhelpful. And it is a matter of degree, not kind. Any regression will suffer from it to some extent, unless all the explanatory variables are uncorrelated. You only start to talk about it when you think that it is affecting the regression results seriously.

It is an especially common problem in time series regressions, that is, where the data consist of a series of observations on the variables over a number of time periods. If two or more of the explanatory variables have a strong time trend, they will be highly correlated and this condition may give rise to multicollinearity.

It should be noted that the presence of multicollinearity does not mean that the model is misspecified. Accordingly, the regression coefficients remain unbiased and the standard errors remain valid. The standard errors will be larger than they would have been in the absence of multicollinearity, warning you that the regression estimates are unreliable.

We will consider first the case of exact multicollinearity where the explanatory variables are perfectly correlated. Suppose that the true relationship is

\[ y = 2 + 3x_1 + x_2 + u \tag{4.32} \]

Suppose that there is a linear relationship between \(x_1\) and \(x_2\):

\[ x_2 = 2x_1 - 1 \tag{4.33} \]

and suppose that \(x_1\) increases by one unit in each observation. \(x_2\) will increase by 2 units, and y by approximately 5 units, for example as shown in Table 4.2.

TABLE 4.2

    x1    x2    y           Change in x1    Change in x2    Approximate change in y
    10    19    51 + u1     1               2               5
    11    21    56 + u2     1               2               5
    12    23    61 + u3     1               2               5
    13    25    66 + u4     1               2               5
    14    27    71 + u5     1               2               5
    15    29    76 + u6     1               2               5

Looking at the data, you could come to any of the following conclusions:

1. the correct one, that y is determined by (4.32)

2. that \(x_2\) is irrelevant and y is determined by the relationship

\[ y = 1 + 5x_1 + u \tag{4.34} \]

3. that \(x_1\) is irrelevant and y is determined by the relationship

\[ y = 3.5 + 2.5x_2 + u \tag{4.35} \]

In fact these are not the only possibilities. Any relationship which is a weighted average of (4.34) and (4.35) would also fit the data. [(4.32) may be regarded as such a weighted average, being (4.34) multiplied by 0.6 plus (4.35) multiplied by 0.4.]

In such a situation it is impossible for regression analysis, or any other technique for that matter, to distinguish between these possibilities. You would not even be able to calculate the regression coefficients because both the numerator and the denominator of the regression coefficients would collapse to zero. This will be demonstrated with the general two-variable case

\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u \tag{4.36} \]

\[ x_2 = \lambda + \mu x_1 \tag{4.37} \]

Substituting for \(x_2\) in (4.11), one obtains

\[
\begin{aligned}
b_1 &= \frac{\mathrm{Cov}(x_1, y)\,\mathrm{Var}(x_2) - \mathrm{Cov}(x_2, y)\,\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_1)\,\mathrm{Var}(x_2) - [\mathrm{Cov}(x_1, x_2)]^2} \\
&= \frac{\mathrm{Cov}(x_1, y)\,\mathrm{Var}(\lambda + \mu x_1) - \mathrm{Cov}([\lambda + \mu x_1], y)\,\mathrm{Cov}(x_1, [\lambda + \mu x_1])}{\mathrm{Var}(x_1)\,\mathrm{Var}(\lambda + \mu x_1) - [\mathrm{Cov}(x_1, [\lambda + \mu x_1])]^2} \\
&= \frac{\mathrm{Cov}(x_1, y)\,\mathrm{Var}(\mu x_1) - \mathrm{Cov}(\mu x_1, y)\,\mathrm{Cov}(x_1, \mu x_1)}{\mathrm{Var}(x_1)\,\mathrm{Var}(\mu x_1) - [\mathrm{Cov}(x_1, \mu x_1)]^2}
\end{aligned}
\tag{4.38}
\]

By virtue of Variance Rule 4, the additive \(\lambda\) in the variances can be dropped. A similar rule could be developed for covariances, since an additive \(\lambda\) does not affect them either. Hence

\[
\begin{aligned}
b_1 &= \frac{\mathrm{Cov}(x_1, y)\,\mu^2 \mathrm{Var}(x_1) - \mu\,\mathrm{Cov}(x_1, y)\,\mu\,\mathrm{Cov}(x_1, x_1)}{\mathrm{Var}(x_1)\,\mu^2 \mathrm{Var}(x_1) - [\mu\,\mathrm{Cov}(x_1, x_1)]^2} \\
&= \frac{\mu^2 \mathrm{Cov}(x_1, y)\,\mathrm{Var}(x_1) - \mu^2 \mathrm{Cov}(x_1, y)\,\mathrm{Var}(x_1)}{\mu^2 \mathrm{Var}(x_1)\,\mathrm{Var}(x_1) - [\mu\,\mathrm{Var}(x_1)]^2} = \frac{0}{0}
\end{aligned}
\tag{4.39}
\]

It is unusual for there to be an exact relationship among the explanatory variables in a regression. When this occurs, it is typically because there is a logical error in the specification. An example is provided by Exercise 4.*. However, it often happens that there is an approximate relationship. Here is a regression of EARNINGS on HGC, ASVABC, and ASVAB5. ASVAB5 is the score on a speed test of the ability to perform very simple arithmetical computations. Like ASVABC, the raw scores on this test were scaled so that they had mean 50 and standard deviation 10.

    . reg earnings hgc asvabc asvab5

      Source |       SS       df       MS              Number of obs =     570
    ---------+------------------------------           F(  3,   566) =   27.66
       Model |  4909.11468     3  1636.37156           Prob > F      =  0.0000
    Residual |  33487.9224   566  59.1659406           R-squared     =  0.1279
    ---------+------------------------------           Adj R-squared =  0.1232
       Total |  38397.0371   569  67.4816117           Root MSE      =  7.6919

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   .7115506   .1612235     4.413   0.000       .3948811     1.02822
      asvabc |   .1104595   .0504223     2.191   0.029       .0114219    .2094972
      asvab5 |   .0770794   .0463868     1.662   0.097      -.0140319    .1681908
       _cons |  -5.944977   2.161409    -2.751   0.006      -10.19034   -1.699616
    ------------------------------------------------------------------------------

    . cor asvabc asvab5
    (obs=570)

            |   asvabc   asvab5
    --------+------------------
     asvabc |   1.0000
     asvab5 |   0.6371   1.0000

The regression result indicates that an extra year of schooling increases hourly earnings by $0.71. An extra point on ASVABC increases hourly earnings by $0.11. An individual with a score one standard deviation above the mean would therefore tend to earn an extra $1.10 per hour, compared with an individual at the mean. An extra point on the numerical computation speed test increases hourly earnings by $0.08.

Does ASVAB5 belong in the earnings function? A t test reveals that its coefficient is just significantly different from zero at the 5% level, using a one-tailed test. (A one-tailed test is justified by the fact that it is unlikely that a good score on this test would adversely affect earnings.) In this regression, the coefficient of ASVABC is significant only at the 5% level. In the regression without ASVAB5, reported in Section 4.1, its t statistic was 3.60, making it significantly different from zero at the 0.1% level. The reason for the reduction in its t ratio is that it is highly correlated with ASVAB5. This makes it difficult to pinpoint the individual effects of ASVABC and ASVAB5. As a consequence the regression estimates tend to be erratic. The high correlation causes the standard errors to be larger than they would have been if ASVABC and ASVAB5 had been less highly correlated, warning us that the point estimates are unreliable. In this regression, multicollinearity is making it difficult to determine whether ASVAB5 is a determinant of earnings. It is possible that
it is not, and that its marginally significant t statistic has occurred as a matter of chance.

Multicollinearity in Models with More Than Two Explanatory Variables

The foregoing discussion of multicollinearity was restricted to the case where there are two explanatory variables. In models with a greater number of explanatory variables, multicollinearity may be caused by an approximate linear relationship among them. It may be difficult to discriminate between the effects of one variable and those of a linear combination of the remainder. In the model with two explanatory variables, an approximate linear relationship automatically means a high correlation, but when there are three or more, this is not necessarily the case. A linear relationship does not inevitably imply high pairwise correlations between any of the variables. The effects of multicollinearity are the same as in the two-variable case, and, as in the two-variable case, the problem may not be serious if the population variance of the disturbance term is small, the number of observations large, and the variances of the explanatory variables large.

What Can You Do About Multicollinearity?

The various ways of trying to alleviate multicollinearity fall into two categories: direct attempts to improve the four conditions responsible for the reliability of the regression estimates, and indirect methods.

First, you may try to reduce \(\sigma_u^2\). The disturbance term is the joint effect of all the variables influencing y which you have not included explicitly in the regression equation. If you can think of an important variable that you have omitted, and is therefore contributing to u, you will reduce the population variance of the disturbance term if you add it to the regression equation.

By way of illustration, we will take the earnings function discussed in the previous section, where a high correlation between ASVABC, the composite cognitive ability score, and ASVAB5, the score on a numerical computation speed test, gave rise to a problem of multicollinearity.

    . reg earnings hgc asvabc asvab5

      Source |       SS       df       MS              Number of obs =     570
    ---------+------------------------------           F(  3,   566) =   27.66
       Model |  4909.11468     3  1636.37156           Prob > F      =  0.0000
    Residual |  33487.9224   566  59.1659406           R-squared     =  0.1279
    ---------+------------------------------           Adj R-squared =  0.1232
       Total |  38397.0371   569  67.4816117           Root MSE      =  7.6919

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   .7115506   .1612235     4.413   0.000       .3948811     1.02822
      asvabc |   .1104595   .0504223     2.191   0.029       .0114219    .2094972
      asvab5 |   .0770794   .0463868     1.662   0.097      -.0140319    .1681908
       _cons |  -5.944977   2.161409    -2.751   0.006      -10.19034   -1.699616
    ------------------------------------------------------------------------------

We now add three new variables which are often found to be determinants of earnings: length of tenure with the current employer, here measured in weeks, sex of respondent, and whether the respondent was living in an urban or a rural area. The last two variables are qualitative variables and their treatment will be explained in Chapter 6. All of these new variables have high t statistics and as a consequence the estimate of \(\sigma_u^2\) falls, from 59.17 to 54.50 (see the calculation of the residual sum of squares divided by the number of degrees of freedom in the top left quarter of the regression output). However the joint contribution of the new variables to the explanatory power of the model is small, despite being highly significant, and the reduction in the standard errors of the coefficients of HGC, ASVABC, and ASVAB5 is negligible. They could even have increased. The new variables happen to have very low correlations with HGC, ASVABC, and ASVAB5. If they had been linearly related to one or more of the variables already in the equation, their inclusion could have made the problem of multicollinearity worse. Note how unstable the coefficients are, another sign of multicollinearity.

    . reg earnings hgc asvabc asvab5 tenure male urban

      Source |       SS       df       MS              Number of obs =     570
    ---------+------------------------------           F(  6,   563) =   23.60
       Model |  7715.87322     6  1285.97887           Prob > F      =  0.0000
    Residual |  30681.1638   563  54.4958505           R-squared     =  0.2009
    ---------+------------------------------           Adj R-squared =  0.1924
       Total |  38397.0371   569  67.4816117           Root MSE      =  7.3821

    ------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         hgc |   .8137184   .1563975     5.203   0.000       .5065245    1.120912
      asvabc |   .0442801    .049716     0.891   0.373      -.0533714    .1419317
      asvab5 |   .1113769   .0458757     2.428   0.016       .0212685    .2014853
      tenure |    .287038   .0676471     4.243   0.000       .1541665    .4199095
        male |   3.123929     .64685     4.829   0.000       1.853395    4.394463
       urban |   2.061867   .7274286     2.834   0.005       .6330618    3.490672
       _cons |  -10.60023   2.195757    -4.828   0.000      -14.91311   -6.287358
    ------------------------------------------------------------------------------

The next factor to consider is n, the number of observations. If you are working with cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could increase the size of the sample by negotiating a bigger budget. Alternatively, you could make a fixed budget go further by using a technique known as clustering. You divide the country geographically into localities. For example, the U.S. National Longitudinal Survey of Youth, from which the EAEF data are drawn, divides the country into counties, independent cities and standard metropolitan statistical areas. You select a number of localities randomly, perhaps using stratified random sampling to make sure that metropolitan, other urban and rural areas are properly represented. You then confine the survey to the localities selected. This reduces the travel time and cost of the fieldworkers, allowing them to interview a greater number of respondents.

If you are working with time series data, you may be able to increase the sample by working with shorter time intervals for the data, for example quarterly or even monthly data instead of annual data. This is such an obvious and easy thing to do that most researchers working with time series almost automatically use quarterly data, if they are available, instead of annual data, even if there does not appear to be a problem of multicollinearity, simply to minimize the population variances of the regression coefficients. There are, however, potential problems. You may introduce, or aggravate, autocorrelation (see Chapter **), but this can be neutralized. Also you may introduce, or aggravate, measurement error bias (see Chapter **) if the quarterly data are less accurately measured than the corresponding annual data. This problem
is not so easily overcome, but it may be a minor one.

. reg earnings hgc asvabc asvab5

  Source |       SS       df       MS                  Number of obs =    2868
---------+------------------------------               F(3, 2864)    =  183.45
   Model |  36689.8765     3  12229.9588               Prob > F      =  0.0000
Residual |  190928.139  2864   66.664853               R-squared     =  0.1612
---------+------------------------------               Adj R-squared =  0.1603
   Total |  227618.016  2867  79.3924017               Root MSE      =  8.1649

------------------------------------------------------------------------------
earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   1.002693   .0787447     12.733   0.000       .8482905    1.157095
  asvabc |   .1448345   .0241135      6.006   0.000        .097553    .1921161
  asvab5 |   .0483846   .0218352      2.216   0.027       .0055703     .091199
   _cons |  -9.654593   1.033311     -9.343   0.000       -11.6807   -7.628485
------------------------------------------------------------------------------

Here is the result of running the regression with all 2,868 observations in the EAEF data set. Comparing this result with that using Data Set 21, we see that the standard errors are much smaller, as expected. As a consequence, the t statistics are higher. In the case of ASVABC, this is partly due to the fact that the point estimate of the coefficient is higher. However, in the case of ASVAB5, the t statistic is higher despite the fact that the coefficient is smaller.

A third possible way of reducing the problem of multicollinearity might be to increase the variance of the explanatory variables. This is possible only at the design stage of a survey. For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households by stratifying the sample. (For a discussion of sampling theory and techniques, see, for example, Kish, 19**, or Moser and Kalton, 19**.)

The fourth direct method is the most direct of all. If you are still at the design stage of a survey, you should do your best to obtain a sample where the explanatory variables are less related (more easily said than done, of course).

If the correlated variables are similar conceptually, it may be reasonable to combine them into some overall index. That is precisely what has been done with the three cognitive ASVAB variables. ASVABC has been calculated as a weighted average of ASVAB2 (arithmetic reasoning), ASVAB3 (word knowledge), and ASVAB4 (paragraph comprehension). Here is a regression of EARNINGS on HGC and the three components of ASVABC. ASVAB2 has a highly significant coefficient, but ASVAB3 does not and the coefficient of ASVAB4 has the wrong sign. This is not surprising, given the high correlations between the ASVAB variables.

. reg earnings hgc asvab2 asvab3 asvab4

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(4, 565)     =   25.68
   Model |  5906.47726     4  1476.61931               Prob > F      =  0.0000
Residual |  32490.5598   565  57.5054156               R-squared     =  0.1538
---------+------------------------------               Adj R-squared =  0.1478
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.5832

------------------------------------------------------------------------------
earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   .7362439   .1586812      4.640   0.000       .4245668    1.047921
  asvab2 |   .2472668   .0472249      5.236   0.000        .154509    .3400246
  asvab3 |   .0137422    .058716      0.234   0.815      -.1015861    .1290705
  asvab4 |  -.1051868   .0544682     -1.931   0.054      -.2121716     .001798
   _cons |  -4.734303    2.06706     -2.290   0.022      -8.794363   -.6742428
------------------------------------------------------------------------------

. cor asvab2 asvab3 asvab4
(obs=570)

        |   asvab2   asvab3   asvab4
--------+---------------------------
 asvab2 |   1.0000
 asvab3 |   0.6916   1.0000
 asvab4 |   0.6536   0.7628   1.0000

Comparing this regression with the regression with ASVABC, it can be seen that the standard error of ASVABC is smaller than those of its components, as you would expect. The t statistic of ASVABC is smaller than that of ASVAB2, but that is because its coefficient is smaller.

Of course it could be argued that ASVAB3 and ASVAB4 do not belong in the model. This suggests another possible solution to the problem of multicollinearity: drop some of the correlated variables, if they have insignificant coefficients.

. reg earnings hgc asvab2

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(2, 567)     =   48.81
   Model |  5639.37111     2  2819.68556               Prob > F      =  0.0000
Residual |   32757.666   567  57.7736613               R-squared     =  0.1469
---------+------------------------------               Adj R-squared =  0.1439
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.6009

------------------------------------------------------------------------------
earnings |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   .6449415   .1519755      4.244   0.000       .3464378    .9434452
  asvab2 |   .2019724   .0376567      5.364   0.000       .1280086    .2759361
   _cons |  -5.796398   1.957987     -2.960   0.003      -9.642191   -1.950605
------------------------------------------------------------------------------

Here is the earnings function with ASVAB3 and ASVAB4 dropped. As expected, the standard error of the coefficient of ASVAB2 is smaller than in the regression including ASVAB3 and ASVAB4. However, this approach to mitigating the problem of multicollinearity is dangerous because some of the variables dropped may truly belong in the model and their omission may cause omitted variable bias (see Chapter 7).

A further way of dealing with the problem of multicollinearity is to use extraneous information, if available, concerning the coefficient of one of the variables.

$$ y = a + b_1 x + b_2 p + u \tag{4.40} $$

For example, suppose that y in equation (4.40) is the demand for a category of consumer expenditure, x is aggregate disposable personal income, and p is a price index for the category. To fit a model of this type you would use time series data. If x and p possess strong time trends and are therefore highly correlated, which is often the case with time series variables, multicollinearity is likely to be a problem. Suppose, however, that you also have cross-section data on y and x derived from a separate household survey. These variables will be denoted y' and x' to indicate that the data are household data, not aggregate data. Assuming that all the households in the survey were paying roughly the same price for the commodity, fit the simple regression

$$ \hat{y}' = a' + b_1' x' \tag{4.41} $$

Now substitute $b_1'$ for $b_1$ in the time series model,

$$ y = a + b_1' x + b_2 p + u, \tag{4.42} $$

subtract $b_1' x$ from both sides,

$$ y - b_1' x = a + b_2 p + u \tag{4.43} $$

and regress $z = y - b_1' x$ on price. This is a simple regression, so multicollinearity has been eliminated.

There are two possible problems with this technique, which you should bear in mind. First, your estimate of $b_2$ is conditional on the accuracy of your estimate of $b_1'$, and this of course is subject to sampling error. Second, you are assuming that the income coefficient has the same meaning in time series and cross-section contexts, and this may not be the case. For many commodities the short-run and long-run effects of changes in income may differ quite markedly. One reason is that expenditure patterns are subject to inertia, which may dominate income effects in the short run. Another is that a change in income can affect expenditure both directly, by altering the budget constraint, and indirectly, through causing a change in lifestyle, and the indirect effect is much slower than the direct one. As a first approximation, it is commonly argued that time series regressions, particularly those using short sample periods, estimate short-run effects while cross-section regressions estimate long-run ones. For a discussion of this and related issues, see Kuh and Meyer, 1957.

Last, but by no means least, is the use of a theoretical restriction, which is defined as a hypothetical relationship among the parameters of a regression model. It will be explained using an educational attainment model as an example. Suppose that we hypothesize that highest grade completed, HGC, depends on ASVABC, and the highest grade completed by the respondent's mother and father, HGCM and HGCF, respectively:

$$ HGC = a + b_1 ASVABC + b_2 HGCM + b_3 HGCF + u \tag{4.44} $$

Fitting the model using Data Set 21, we obtain the following output:

. reg hgc asvabc hgcm hgcf

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(3, 566)     =  110.83
   Model |  1278.24153     3  426.080508               Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329               R-squared     =  0.3700
---------+------------------------------               Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9607

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295006   .0099544     13.009   0.000       .1099486    .1490527
    hgcm |    .069403   .0422974      1.641   0.101       -.013676     .152482
    hgcf |   .1102684   .0311948      3.535   0.000       .0489967    .1715401
   _cons |   4.914654   .5063527      9.706   0.000       3.920094    5.909214
------------------------------------------------------------------------------

The regression coefficients imply that HGC increases by 0.13 years for every one-point increase in ASVABC, by 0.07 years for every extra year of schooling of the mother and by 0.11 years for every extra year of schooling of the father. Mother's education is generally held to be at least as important as father's education for educational attainment, if not more so, so this outcome is unexpected. It is also surprising that the coefficient of HGCM is not significant, even at the 5% level, using a one-tailed test. However, assortative mating leads to a high correlation between HGCM and HGCF and the regression appears to be suffering from multicollinearity.

Suppose that we hypothesize that mother's and father's education are equally important. We can then impose the restriction $b_2 = b_3$. This allows us to write the equation as

$$ HGC = a + b_1 ASVABC + b_2 (HGCM + HGCF) + u \tag{4.45} $$

Defining HGCP to be the sum of HGCM and HGCF, the equation may be rewritten with ASVABC and HGCP as the explanatory variables:

$$ HGC = a + b_1 ASVABC + b_2 HGCP + u \tag{4.46} $$

. g hgcp = hgcm + hgcf
. reg hgc asvabc hgcp

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(2, 567)     =  166.22
   Model |  1276.73764     2  638.368819               Prob > F      =  0.0000
Residual |  2177.50973   567  3.84040517               R-squared     =  0.3696
---------+------------------------------               Adj R-squared =  0.3674
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9597

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295653   .0099485     13.024   0.000       .1100249    .1491057
    hgcp |    .093741   .0165688      5.658   0.000       .0611973    .1262847
   _cons |   4.823123   .4844829      9.955   0.000       3.871523    5.774724
------------------------------------------------------------------------------

The estimate of $b_2$ is now 0.094. Not surprisingly, this is a compromise between the coefficients of HGCM and HGCF in the previous specification. The standard error of HGCP is much smaller than those of HGCM and HGCF, indicating that the use of the restriction has led to a gain in efficiency, and as a consequence the t statistic is very high. Thus the problem of multicollinearity
has been eliminated. However, the restriction might not be valid. We should test it. We shall see how to do this in Chapter 7.

Exercises

4.11 Using your EAEF data set, regress HGC on HGCM, HGCF and ASVAB2, ASVAB3, and ASVAB4, the three components of the ASVABC composite score. Compare the coefficients and their standard errors with those of ASVABC in a regression of HGC on HGCM, HGCF and ASVABC. Calculate correlation coefficients for the three ASVAB components.

4.12 Investigate the determinants of family size by regressing SIBLINGS on HGCM and HGCF using your EAEF data set. HGCM and HGCF are likely to be highly correlated (find the correlation in your data set) and the regression may be subject to multicollinearity. Introduce the restriction that the theoretical coefficients of HGCM and HGCF are equal and run the regression a second time replacing HGCM and HGCF by their sum, HGCP. Evaluate the regression results.

4.13* A researcher investigating the determinants of the demand for public transport in a certain city has the following data for 100 residents for the previous calendar year: expenditure on public transport, E, measured in dollars; number of days worked, W; and number of days not worked, NW. By definition NW is equal to 365 - W. He attempts to fit the following model

$$ E = a + b_1 W + b_2 NW + u $$

Explain why he is unable to fit this equation. (Give both intuitive and technical explanations.) How might he resolve the problem?

4.14 Years of work experience in the labor force is generally found to be an important determinant of earnings. There is no direct measure of work experience in the EAEF data set, but potential work experience, PWE, defined by

PWE = AGE - HGC - 6

may approximate it. This is the maximum number of years since the completion of full-time education, assuming that an individual enters first grade at the age of 6. Using your EAEF data set, first regress EARNINGS on HGC and PWE, and then run the regression a second time adding AGE as well. Comment on the regression results.

4.5 Goodness of Fit: $R^2$

As in simple regression analysis, the coefficient of determination, $R^2$, measures the proportion of the variance of y explained by the regression and is defined equivalently by $Var(\hat{y})/Var(y)$, by $\{1 - Var(e)/Var(y)\}$, or by the square of the correlation coefficient for y and $\hat{y}$. It can never decrease, and generally will increase, if you add another variable to a regression equation, provided that you retain all the original explanatory variables. To see this, suppose that you regress y on $x_1$ and $x_2$ and fit the equation

$$ \hat{y} = a + b_1 x_1 + b_2 x_2. \tag{4.47} $$

Next suppose that you regress y against $x_1$ only and the result is

$$ \hat{y} = a^* + b_1^* x_1 \tag{4.48} $$

This can be rewritten

$$ \hat{y} = a^* + b_1^* x_1 + 0 x_2 \tag{4.49} $$

Comparing (4.47) and (4.49), the coefficients in the former have been determined freely by the OLS technique using the data for y, $x_1$ and $x_2$ to give the best possible fit. In (4.49), however, the coefficient of $x_2$ has arbitrarily been set at zero, and the fit will be suboptimal unless, by coincidence, $b_2$ happens to be zero, when the fit will be the same. (In this case $a^*$ will then be equal to a, and $b_1^*$ will be equal to $b_1$.) Hence, in general, the level of $R^2$ will be higher in (4.47) than in (4.49), and it can never be lower. Of course, if the new variable does not genuinely belong in the equation, the increase in $R^2$ is likely to be negligible.

You might think that, because $R^2$ measures the proportion of the variance jointly explained by the explanatory variables, it should be possible to deduce the individual contribution of each explanatory variable and thus obtain a measure of its relative importance. At least it would be very convenient if one could. Unfortunately, such a decomposition is impossible if the explanatory variables are correlated because their explanatory power will overlap. The problem will be discussed further in Section 7.*.

F tests

We saw in Section 3.10 that an F test is used to assess an analysis of variance. After you have used regression analysis to divide the variance of the dependent variable into "explained" and "unexplained" components, you construct the F statistic:

$$ F(k, n-k-1) = \frac{ESS/k}{RSS/(n-k-1)} \tag{4.50} $$

where ESS is the explained sum of squares, RSS is the residual (unexplained) sum of squares, and k is the number of degrees of freedom used up in the explanation, and perform an F test to see whether the amount apparently explained is greater than is likely to have occurred by chance. To do this, you look up the critical level of F in the column corresponding to k degrees of freedom and the row corresponding to n-k-1 degrees of freedom in the appropriate part of Table A.3.

The most common use of the F test is to assess whether the explanation provided by the equation as a whole is significant. In addition one may also perform some supplementary tests, and we shall look at these as well.

Whole Equation

When performing an F test for the equation as a whole, the null hypothesis, expressed verbally, is that the model has no explanatory power at all. Of course we hope to reject it and conclude that the model does have some explanatory power. The model will have no explanatory power if it turns out that y is unrelated to any of the explanatory variables. Mathematically, therefore, if the model is

$$ y = a + b_1 x_1 + \dots + b_k x_k + u, \tag{4.51} $$

the null hypothesis is that all the b coefficients are zero:

$$ H_0: b_1 = b_2 = \dots = b_k = 0 \tag{4.52} $$

The alternative hypothesis $H_1$ is that at least one b coefficient is different from zero.

In the simple regression model the F test was equivalent to the (two-tailed) t test on the slope coefficient because there was just one variable. In the multiple regression model there is a difference between the roles of the F and t tests. The F test tests the joint explanatory power of the variables, while the t tests test whether the individual contributions of the explanatory variables are significant, testing separately the hypotheses $b_1 = 0, \dots, b_k = 0$.

When calculating the F statistic for the equation as a whole, it may be convenient to divide both the numerator and denominator of (4.50) by TSS, the total sum of squares. Noting that ESS/TSS is $R^2$ and RSS/TSS is $(1 - R^2)$, the F statistic may then be written

$$ F(k, n-k-1) = \frac{R^2/k}{(1 - R^2)/(n-k-1)} \tag{4.53} $$

Example

The educational attainment model will be used as an illustration. We will suppose that HGC depends on ASVABC, HGCM, and HGCF:

$$ HGC = a + b_1 ASVABC + b_2 HGCM + b_3 HGCF + u. \tag{4.54} $$

The null hypothesis for the F test of goodness of fit is that all three slope coefficients are equal to zero:

$$ H_0: b_1 = b_2 = b_3 = 0 \tag{4.55} $$

The alternative hypothesis is that at least one of them is non-zero. Here is the regression output using Data Set 21.

. reg hgc asvabc hgcm hgcf

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(3, 566)     =  110.83
   Model |  1278.24153     3  426.080508               Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329               R-squared     =  0.3700
---------+------------------------------               Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9607

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295006   .0099544     13.009   0.000       .1099486    .1490527
    hgcm |    .069403   .0422974      1.641   0.101       -.013676     .152482
    hgcf |   .1102684   .0311948      3.535   0.000       .0489967    .1715401
   _cons |   4.914654   .5063527      9.706   0.000       3.920094    5.909214
------------------------------------------------------------------------------

In this example, k, the number of explanatory variables, is equal to 3 and n-k-1, the number of degrees of freedom, is equal to 566. The numerator of the F statistic is the explained sum of squares divided by k. In the Stata output these numbers, 1278.2 and 3, respectively, are given in the Model row. The denominator is the residual sum of squares divided by the number of degrees of freedom remaining, 2176.0 and 566, respectively. Hence the F statistic is 110.8. All serious regression packages compute it for you as part of the diagnostics in the regression output.

$$ F(3, 566) = \frac{1278.2/3}{2176.0/566} = 110.8 \tag{4.56} $$

The critical value for F(3, 566) is not given in the F tables, but we know it must be lower than F(3, 120), which is given. At the 0.1% level, this is 5.78. Hence we easily reject $H_0$ at the 0.1% level. This result could have been anticipated because both ASVABC and HGCF have highly significant t statistics. So we knew in advance that both $b_1$ and $b_3$ were non-zero.

In principle the F statistic might not be significant even if some of the t statistics are significant. Suppose that you ran a nonsense regression with 40 explanatory variables, none being a true determinant of the dependent variable. Then the F statistic should be low enough for $H_0$ not to be rejected. However, if you are performing t tests on the slope coefficients at the 5% level, with a 5% chance of a Type I error, on average 2 of the 40 variables could be expected to have "significant" coefficients. In practice, however, the F statistic will be significant if any t statistic is.

The opposite can easily happen, though. Suppose you have a multiple regression model which is correctly specified and the $R^2$ is high. You would be likely to have a highly significant F statistic. However, if the explanatory variables are highly correlated and the model is subject to severe multicollinearity, the standard errors of the slope coefficients could all be so large that none of the t statistics is significant. In this situation you would know that your model is a good one, but you are not in a position to pinpoint the contributions made by the explanatory variables individually.

Further Analysis of Variance

Besides testing the equation as a whole, you can use an F test to see whether or not the joint marginal contribution of a group of variables is significant. Suppose that you first run a regression with k explanatory variables and the explained sum of squares is $ESS_k$. Then you add some more variables, bringing the total to m, and the explained sum of squares rises to $ESS_m$. You have therefore explained an additional sum of squares equal to $ESS_m - ESS_k$ using up an additional m-k degrees of freedom, and you want to see whether the increase is greater than is likely to have arisen by chance. Again an F test is used and the appropriate F statistic may be expressed in verbal terms as

$$ F = \frac{\text{Improvement in fit} \,/\, \text{Extra degrees of freedom used up}}{\text{Residual sum of squares remaining} \,/\, \text{Degrees of freedom remaining}} \tag{4.57} $$

Since $RSS_m$, the unexplained sum of squares with all m variables, is equal to $TSS - ESS_m$, and $RSS_k$, the residual sum of squares with only k variables, is equal to $TSS - ESS_k$, the improvement in the fit when the extra variables are added, $ESS_m - ESS_k$, is equal to $RSS_k - RSS_m$. Hence the appropriate F statistic is

$$ F(m-k, n-m-1) = \frac{(RSS_k - RSS_m)/(m-k)}{RSS_m/(n-m-1)} \tag{4.58} $$

TABLE 4.3

                        Sum of Squares         Degrees of   Sum of Squares Divided   F Statistic
                                               Freedom      by Degrees of Freedom

Explained by original   ESS_k                  k            ESS_k/k                  [ESS_k/k] / [RSS_k/(n-k-1)]
variables
Residual                RSS_k = TSS - ESS_k    n-k-1        RSS_k/(n-k-1)

Explained by new        ESS_m - ESS_k          m-k          (RSS_k - RSS_m)/(m-k)    [(RSS_k - RSS_m)/(m-k)] / [RSS_m/(n-m-1)]
variables               = RSS_k - RSS_m
Residual                RSS_m = TSS - ESS_m    n-m-1        RSS_m/(n-m-1)

Under the null hypothesis that the additional variables contribute nothing to the equation, the F statistic in (4.58) is distributed with m-k and n-m-1 degrees of freedom. The upper half of Table 4.3 gives the analysis of variance for the explanatory power of the original k variables. The lower half gives it for the joint marginal contribution of the new variables.

Example

We will illustrate the test with the educational attainment example. Here is HGC regressed on ASVABC using Data Set 21. We make a note of the residual sum of squares, 2300.4.

. reg hgc asvabc

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(1, 568)     =  284.89
   Model |  1153.80864     1  1153.80864               Prob > F      =  0.0000
Residual |  2300.43873   568  4.05006818               R-squared     =  0.3340
---------+------------------------------               Adj R-squared =  0.3329
   Total |  3454.24737   569  6.07073351               Root MSE      =  2.0125

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1545378   .0091559     16.879   0.000       .1365543    .1725213
   _cons |   5.770845   .4668473     12.361   0.000       4.853888    6.687803
------------------------------------------------------------------------------

Now we add a group of two variables, the highest grade completed by each parent. Do these variables jointly make a significant contribution to the explanatory power of the model? Well, we can see that a t test would show that HGCF has a highly significant coefficient, but we will perform the F test anyway. We make a note of RSS, 2176.0.

. reg hgc asvabc hgcm hgcf

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(3, 566)     =  110.83
   Model |  1278.24153     3  426.080508               Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329               R-squared     =  0.3700
---------+------------------------------               Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9607

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295006   .0099544     13.009   0.000       .1099486    .1490527
    hgcm |    .069403   .0422974      1.641   0.101       -.013676     .152482
    hgcf |   .1102684   .0311948      3.535   0.000       .0489967    .1715401
   _cons |   4.914654   .5063527      9.706   0.000       3.920094    5.909214
------------------------------------------------------------------------------

The improvement in the fit on adding the parental variables is the reduction in the residual sum of squares, 2300.4 - 2176.0. The cost is two degrees of freedom because two additional parameters have been estimated. The residual sum of squares remaining unexplained after adding HGCM and HGCF is 2176.0. The number of degrees of freedom remaining, n-k-1, is 570-3-1 = 566.

$$ F(2, 570-3-1) = \frac{(RSS_1 - RSS_2)/2}{RSS_2/(570-3-1)} = \frac{(2300.4 - 2176.0)/2}{2176.0/566} = 16.18 \tag{4.59} $$

Thus the F statistic is 16.18. The critical value of F(2, 120) at the 0.1% level is 7.32. The critical value of F(2, 566) must be lower, so we reject $H_0$ and conclude that the parental education variables do have significant joint explanatory power.

Relationship between F statistic and t statistic

Suppose that you are considering the following alternative model specifications:

$$ y = a + b_1 x_1 + \dots + b_{k-1} x_{k-1} + u \tag{4.60} $$

$$ y = a + b_1 x_1 + \dots + b_{k-1} x_{k-1} + b_k x_k + u \tag{4.61} $$

the only difference being the addition of $x_k$ as an explanatory variable in (4.61). You now have two ways to test whether $x_k$ belongs in the model. You could perform a t test on its coefficient when (4.61) is fitted. Alternatively, you could perform an F test of the type just discussed, treating $x_k$ as a "group" of just one variable, to test its marginal explanatory power. For the F test the null hypothesis will be $H_0: b_k = 0$, since only $x_k$ has been added, and this is the same null hypothesis as that for the t test. Thus it might appear that there is a risk that the outcomes of the two tests might conflict with each other.

Fortunately, this is impossible, since it can be shown that the F statistic must be equal to the square of the t statistic and that the critical value of F is equal to the square of the critical value of t (two-tailed test). This result means that the t test of the coefficient of a variable is in effect a test of its marginal explanatory power, after all the other variables have been included in the equation. If the variable is correlated with one or more of the other variables, its marginal explanatory power may be quite low, even if it genuinely belongs in the model. If all the variables are correlated, it is possible for all of them to have low marginal explanatory power and for none of the t tests to be significant, even though the F test for their joint explanatory power is highly significant. If this is the case, the model is said to be suffering from the problem of multicollinearity discussed earlier in this chapter.

No proof of the equivalence will be offered here, but it will be illustrated with the educational attainment model. In the first regression it has been hypothesized that HGC depends on ASVABC and HGCM. In the second, it has been hypothesized that it depends on HGCF as well.

. reg hgc asvabc hgcm

  Source |       SS       df       MS
---------+------------------------------
   Model |   1230.2039     2  615.101949
Residual |  2224.04347   567  3.92247526
---------+------------------------------
   Total |  3454.24737   569  6.07073351

----------------------------------------
     hgc |      Coef.   Std. Err.       t
---------+------------------------------
  asvabc |   .1381062   .0097494   14.166
    hgcm |    .154783   .0350728    4.413
   _cons |   4.791277   .5102431    9.390
----------------------------------------

. reg hgc asvabc hgcm hgcf

  Source |       SS       df       MS
---------+------------------------------
   Model |  1278.24153     3  426.080508
Residual |  2176.00584   566  3.84453329
---------+------------------------------
   Total |  3454.24737   569  6.07073351

----------------------------------------
     hgc |      Coef.   Std. Err.       t
---------+------------------------------
  asvabc |   .1295006   .0099544   13.009
    hgcm |    .069403   .0422974    1.641
    hgcf |   .1102684   .0311948    3.535
   _cons |   4.914654   .5063527    9.706
----------------------------------------

The improvement on adding HGCF is the reduction in the residual sum of squares, 2224.0 - 2176.0. The cost is just the single degree of freedom lost when estimating the coefficient of HGCF. The residual sum of squares remaining after adding HGCF is 2176.0. The number of degrees of freedom remaining after adding HGCF is 570-3-1 = 566. Hence the F statistic is 12.49.

$$ F(1, 570-3-1) = \frac{(RSS_1 - RSS_2)/1}{RSS_2/(570-3-1)} = \frac{(2224.0 - 2176.0)/1}{2176.0/566} = 12.49 \tag{4.62} $$

The critical value of F at the 0.1% significance level with 120 degrees of freedom is 11.38. The critical value with 566 degrees of freedom must be lower, so we reject $H_0$ at the 0.1% level. The t statistic for the coefficient of HGCF in the second regression is 3.535. The critical value of t at the 0.1% level with 120 degrees of freedom is 3.373. The critical value with 566 degrees of freedom must be lower, so we also reject $H_0$ with the t test. The square of 3.535 is 12.50, equal to the F statistic, except for rounding error, and the square of 3.373 is 11.38, equal to the critical value of F(1, 120). (The critical values shown are for 120 degrees of freedom, but the same relationship must hold for 566 degrees of freedom.) Hence the conclusions of the two tests must coincide.

"Adjusted" $R^2$

If you look at regression output, you will almost certainly find near the $R^2$ statistic something called the "adjusted" $R^2$. Sometimes it is called the "corrected" $R^2$. However, "corrected" makes it sound as if it is better than the ordinary one, and many would disagree.

As was noted in Section 4.2, $R^2$ can never fall, and generally increases, if you add another variable to a regression equation. The adjusted $R^2$, usually denoted $\bar{R}^2$, attempts to compensate for this automatic upward shift by imposing a penalty for increasing the number of explanatory variables. It is defined as

$$ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1} = \frac{n-1}{n-k-1}R^2 - \frac{k}{n-k-1} = R^2 - \frac{k}{n-k-1}(1 - R^2) \tag{4.63} $$

where k is the number of explanatory variables. As k increases, k/(n-k-1) increases, and so the negative adjustment to $R^2$ increases.

It can be shown that the addition of a new variable to a regression will cause $\bar{R}^2$ to rise if and only if the absolute value of its t statistic is greater than one. Hence a rise in $\bar{R}^2$ when a new variable is added does not necessarily mean that its coefficient is significantly different from zero. It therefore does not follow, as is sometimes suggested, that a rise in $\bar{R}^2$ implies that the specification of an equation has improved.

This is one reason why $\bar{R}^2$ has lost favor as a diagnostic statistic. Another is the decrease in attention paid to $R^2$ itself. At one time there was a tendency for applied econometricians to regard $R^2$ as a key indicator of the success of model specification. In practice, however, as will be seen in the following chapters, even a very badly specified regression model may yield a high $R^2$, and recognition of this fact has led to the demotion of $R^2$ in importance. It is now regarded as just one of a whole set of diagnostic statistics that should be examined when evaluating a regression model, and probably one of the less important ones at that. Consequently, there is little to be gained by fine tuning it with a "correction" of dubious value.

Exercises

4.15 Fit an educational attainment function, regressing HGC on ASVABC, HGCM, and HGCF. Calculate the F statistic using $R^2$ and perform a test of the explanatory power of the equation as a whole.

4.16 Fit an educational attainment function using the specification in Exercise 4.15, adding the ASVAB speed test scores ASVAB5 and ASVAB6. Perform an F test of the joint explanatory power of ASVAB5 and ASVAB6, using the results of this regression and that in Exercise 4.15.

4.17 Fit an educational attainment function, regressing HGC on ASVABC, HGCM, HGCF, and ASVAB5. Perform an F test of the explanatory power of ASVAB6, using the results of this regression and that in Exercise 4.16. Verify that it leads to the same conclusion as a two-tailed t test.

4.18* The researcher in Exercise 4.13 decides to divide the number of days not worked into the number of days not worked because of illness, I, and the number of days not worked for other reasons, O. The mean value of I in the sample is 2.1 and the mean value of O is 120.2. He fits the regression (standard errors in parentheses):

$$ \hat{E} = -9.6 + 2.10 W + 0.45 O \qquad R^2 = 0.72 $$
$$ \qquad (8.3) \quad (1.98) \quad (1.77) $$

Perform t tests on the regression coefficients and an F test on the goodness of fit of the equation. Explain why the t tests and F test have different outcomes.

5 TRANSFORMATIONS OF VARIABLES

Nonlinear relationships are more plausible than linear ones for many economic processes. In this chapter we will first define what is meant by linear regression analysis and then show how some apparently nonlinear relationships can be fitted by it. We will next see what can be done when linear methods cannot be used. The chapter ends with an exposition of a technique for discriminating statistically between linear and nonlinear relationships.

5.1 Basic Procedure

One of the limitations of linear regression analysis is implicit in its very name, in that it can be used to fit only linear equations where every explanatory term, except the constant, is written in the form of a coefficient multiplied by a variable:

$$ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 \tag{5.1} $$

Equations such as

$$ y = a + \frac{b}{x} \tag{5.2} $$

and

$$ y = a x^b \tag{5.3} $$

are nonlinear. If you choose values for a and b and draw graphs, you will find that both of them represent curves.

However, both (5.2) and (5.3) have been suggested as suitable forms for Engel curves, the relationship between the demand for a particular commodity, y, and total income, x. Given data on y and x, how could one estimate the parameters a and b in each equation?

Actually, in both cases, with a little preparation one can use linear regression analysis after all. First, note that (5.1) is linear in two senses. The right side is linear in variables because the variables are included exactly as defined, rather than as functions. It therefore consists of a weighted sum of the variables, the parameters being the weights. The right side is also linear in the parameters since it consists of a weighted sum of these as well, the x variables being the weights this time.

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 19.11.99.

For the purpose of linear regression analysis, only the second type of linearity is important. Nonlinearity in the variables can always be sidestepped by using appropriate definitions. For example, suppose that the relationship were of the form

$$ y = a + b_1 x_1^2 + b_2 x_2 + b_3 \log x_3 + \dots \tag{5.4} $$

By defining $z_1 = x_1^2$, $z_2 = x_2$, $z_3 = \log x_3$, etc., the relationship can be rewritten

$$ y = a + b_1 z_1 + b_2 z_2 + b_3 z_3 + \dots \tag{5.5} $$

and it is now linear in variables as well as in parameters. This type of transformation is only cosmetic, and you will usually see the regression equation presented with the variables left in their nonlinear form. This avoids the need for explanation and extra notation.

On the other hand an equation such as (5.3) is nonlinear in both parameters and variables and cannot be handled by a mere redefinition. (Do not be tempted to think that you can make it linear by defining $z = x^b$ and replacing $x^b$ with z; since you do not know b, you have no way of calculating sample data for z.) We will discuss the problem of fitting relationships that are nonlinear in parameters in the next section.

In the case of (5.2), however, all we have to do is to define z = 1/x. Equation (5.2) now becomes

$$ y = a + bz \tag{5.6} $$

and this is linear, so you regress y on z and there are no problems. The constant term in the regression will be an estimate of a and the coefficient of z will be an estimate of b.

Example

Suppose that you are investigating the relationship between annual consumption of bananas (boring, safe example) and annual income, and you have the observations shown in Table 5.1 for 10 households (ignore z for the time being):

TABLE 5.1

              Bananas    Income
              (lbs)      ($10,000)
Household     y          x           z

    1          1.71       1          1.000
    2          6.88       2          0.500
    3          8.25       3          0.333
    4          9.52       4          0.250
    5          9.81       5          0.200
    6         11.43       6          0.167
    7         11.09       7          0.143
    8         10.87       8          0.125
    9         12.15       9          0.111
   10         10.94      10          0.100

[Figure 5.1. Regression of expenditure on bananas on income. Vertical axis: Bananas (lbs); horizontal axis: Income ($ thousand).]

These observations are plotted in Figure 5.1, together with the line obtained by regressing y on x (standard errors in parentheses):

$$ \hat{y} = 4.62 + 0.84 x \qquad R^2 = 0.69 \tag{5.7} $$
$$ \qquad (1.26) \quad (0.20) $$

Now, if you look at Figure 5.1, you will see that the regression line does not fit the observations very well, despite the fact that the coefficient of income is significantly different from zero at the 1 percent level. Quite obviously, the observations lie on a curve, while the regression equation is of course a straight line. In this case, it is easy to see that the functional relationship between y and x has been misspecified. In the case of multiple regression analysis, nonlinearity might be detected using the graphical technique described in Section 4.2. Alternatively, an examination of the residuals may be sufficient to indicate that something is wrong. In this case the residuals are as shown in Table 5.2.

The residuals ought to be randomly positive or negative, large or small. Instead, they start out being negative, cross to being positive, reach a maximum, fall again, and cross back to being negative: very suspicious indeed.

TABLE 5.2

Household     y        y-hat     e         Household     y        y-hat     e

    1         1.71     5.46     -3.75          6        11.43     9.69      1.74
    2         6.88     6.31      0.57          7        11.09    10.53      0.55
    3         8.25     7.15      1.10          8        10.87    11.38     -0.51
    4         9.52     8.00      1.52          9        12.15    12.22     -0.07
    5         9.81     8.84      0.97         10        10.94    13.07     -2.13

[Figure 5.2. Regression of expenditure on bananas on the reciprocal of income. Vertical axis: y; horizontal axis: z.]

The values of y and x in this example were generated using the Monte Carlo technique, the true relationship being

$$ y = 12 - \frac{10}{x} + \text{disturbance term}, \tag{5.8} $$

x taking the numbers from 1 to 10 and the values of the disturbance term being obtained using normally distributed random numbers with zero mean and standard deviation equal to 0.5.

If we realize this and define z = 1/x, this equation becomes of the linear form (5.6). The value of z for each household has already been calculated in Table 5.1. Regressing y on z, we obtain (standard errors in parentheses):

$$ \hat{y} = 12.48 - 10.99 z \qquad R^2 = 0.97 \tag{5.9} $$
$$ \qquad (0.26) \quad (0.65) $$

Substituting z = 1/x, this becomes

$$ \hat{y} = 12.48 - \frac{10.99}{x} \tag{5.10} $$

In view of the excellent fit obtained with (5.9), it is not surprising that (5.10) is close to the true equation (5.8). The regression relationship, together with the observations on y, x, and z, is shown in Figures 5.2 and 5.3. The improvement in the fit, as measured by $R^2$, is clear from a comparison of Figures 5.1 and 5.3.

Figure 5.3. Nonlinear regression of expe
nditureonbananasonincome5.2LogarithmicTransformationsNextwewilltacklefunctionssuchas(5.3),whicharenonlinearinparametersaswellasvariables:by=ax(5.11)Whenyouseesuchafunction,youcanimmediatelysaythattheelasticityofywithrespecttoxisconstantandequaltob.Thisiseasilydemonstrated.Regardlessofthemathematicalrelationshipconnectingyandx,orthedefinitionsofyandx,theelasticityofywithrespecttoxisdefinedtobetheproportionalchangeinyforagivenproportionalchangeinx:dyyelasticity=(5.12)dxxThus,forexample,ifyisdemandandxisincome,theaboveexpressiondefinestheincomeelasticityofdemandforthecommodityinquestion.Theexpressionmayberewritten(dy/dx)/(y/x).Inthecaseofthedemandexample,thismaybeinterpretedasthemarginalpropensitytoconsumethecommoditydividedbytheaveragepropensitytoconsumeit.Iftherelationshipbetweenyandxtakestheform(5.11),dyb-1y=abx=b.(5.13)dxx\nTRANSFORMATIONSOFVARIABLES6UseofLogarithmsFirst,somebasicrules:1.Ify=xz,logy=logx+logz2.Ify=x/z,logy=logx-logzn3.Ify=x,logy=nlogxTheserulescanbecombinedtotransformmorecomplicatedexpressions.Forexample,takeequation(5.11):bbIfy=ax,logy=loga+logxusingRule1=loga+blogxusingRule3Thusfarwehavenotspecifiedwhetherwearetakinglogarithmstobaseeortobase10.Throughoutthistextweshallbeusingeasthebase,andsoweshallbeusingwhatareknownas"natural"logarithms.Thisisnowstandardineconometrics.Puristssometimeswritelninsteadoflogtoemphasizethattheyareworkingwithnaturallogarithms,butthisisnowunnecessary.Nobodyuseslogarithmstobase10anymore.Theyweretabulatedinthedreadedlogtablesthatwereuniversallyemployedformultiplyingordividinglargenumbersuntiltheearly1970s.Whenthepocketcalculatorwasinvented,theybecameredundant.Theyarenotmissed.Witheasabase,wecanstateanotherrule:x4.Ify=e,logy=xxxe,alsosometimeswrittenexp(x),isfamiliarlyknownastheantilogofx.Onecansaythatlogexisthelogoftheantilogofx,andsincelogandantilogcancelout,itisnotsurprisingthatlogeturnsoutjusttobex.Hencedydxbyxelasticity===b.(5.14)yxyxThus,forexample,ifyouseeanEngelcurveoftheform0.3y=0.01x(5.15)\nTRANSFORMATIONSOFVARIA
BLES7thismeansthattheincomeelasticityofdemandisequalto0.3.Ifyouaretryingtoexplainthistosomeonewhoisnotfamiliarwitheconomicjargon,theeasiestwaytoexplainitistosaythata1%changeinx(income)willcausea0.3%changeiny(demand).Afunctionofthistypecanbeconvertedintoalinearequationbyusinglogarithms.Youwillcertainlyhaveencounteredlogarithmsinabasicmathematicscourse.Youprobablythoughtthatwhenthatcoursewasfinished,youcouldforgetaboutthem,writingthemoffasoneofthoseacademictopicsthatneverturnouttobeofpracticaluse.Nosuchluck.Ineconometricworktheyareindispensable,soifyouareunsureabouttheiruse,youshouldreviewyournotesfromthatbasicmathcourse.Themainpropertiesoflogarithmsaregiveninaboxbelow,topromptyou.Intheboxitisshownthat(5.11)maybelinearizedaslogy=loga+blogx(5.16)Ifwewritey'=logy,z=logx,anda'=loga,theequationmayberewritteny'=a'+bz(5.17)Theregressionprocedureisnowasfollows.Firstcalculatey'andzforeachobservation,takingthelogarithmsoftheoriginaldata.Yourregressionapplicationwillalmostcertainlydothisforyou,giventheappropriateinstructions.Second,regressy'onz.Thecoefficientofzwillbeadirectestimateofb.Theconstanttermwillbeanestimateofa',thatis,ofloga.Toobtainanestimateofa,youhavetotaketheantilog,thatis,calculateexp(a').Example:EngelcurveFigure5.4plotsannualhouseholdexpenditureonfoodeatenathome,FDHO,andtotalannualhouseholdexpenditure,bothmeasuredindollars,for869representativehouseholdsintheUnitedStatesin1995,thedatabeingtakenfromtheConsumerExpenditureSurvey.FDHO1600014000120001000080006000400020000020000400006000080000100000120000140000EXP160000Figure5.4.Regressionofexpenditureonfoodeatenathomeontotalhouseholdexpenditure\nTRANSFORMATIONSOFVARIABLES8.regfdhoexpSource|SSdfMSNumberofobs=869---------+------------------------------F(1,867)=381.47Model|9158435741915843574Prob>F=0.0000Residual|2.0815e+098672400831.16R-squared=0.3055---------+------------------------------AdjR-squared=0.3047Total|2.9974e+098683453184.55RootMSE=1549.5----------------------------------------------------------------------
--------fdho|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------exp|.0528427.002705519.5310.000.0475325.0581529_cons|1916.14396.5459119.8470.0001726.6522105.634------------------------------------------------------------------------------.reglgfdholgexpSource|SSdfMSNumberofobs=868---------+------------------------------F(1,866)=396.06Model|84.4161692184.4161692Prob>F=0.0000Residual|184.579612866.213140429R-squared=0.3138---------+------------------------------AdjR-squared=0.3130Total|268.995781867.310260416RootMSE=.46167------------------------------------------------------------------------------lgfdho|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------lgexp|.4800417.024121219.9010.000.4326988.5273846_cons|3.166271.24429712.9610.0002.6867873.645754------------------------------------------------------------------------------Whenanalyzinghouseholdexpendituredata,itisusualtorelatetypesofexpendituretototalhouseholdexpenditureratherthanincome,thereasonbeingthatincomedatatendtoberelativelyerratic.Theoutputsfromlinearandlogarithmicregressionsareshown.10LGFDHO987657891011121LGEXP3Figure5.5.Logarithmicregressionofexpenditureonfoodeatenathomeontotalhouseholdexpenditure\nTRANSFORMATIONSOFVARIABLES9FDHO1600014000120001000080006000400020000020000400006000080000100000120000140000EXP160000Figure5.6.LinearandlogarithmicregressionsofexpenditureonfoodeatenathomeontotalhouseholdexpenditureThelinearregressionindicatesthat5.3centsoutofthemarginaldollararespentonfoodeatenathome.Interpretationoftheinterceptisproblematicbecauseliterallyitimpliesthat$1,916wouldbespentonfoodeatenathomeeveniftotalexpenditurewerezero.Thelogarithmicregression,showninFigure5.5,indicatesthattheelasticityofexpenditureonfoodeatenathomewithrespecttototalhouseholdexpenditureis0.48.Isthisfigureplausible?Yes,becausefoodeatenathomeisanecessityratherthanaluxury,soonewouldexpecttheelasticitytob
elessthan1.Theintercepthasnoeconomicmeaning.Figure5.6plotsthelogarithmicregressionlineintheoriginaldiagram.Whilethereisnotmuchdifferencebetweentheregressionlinesoverthemiddlepartoftherangeofobservation,itisclearthatthelogarithmicregressiongivesabetterfitforverylowandveryhighlevelsofhouseholdexpenditure.Semi-logarithmicmodelsAnothercommonfunctionalform.isgivenbyequation(5.18):bxy=ae,(5.18)Herebshouldbeinterpretedastheproportionalchangeinyperunitchangeinx.Again,thisiseasilydemonstrated.Differentiating,dybx=abe=by,(5.19)dxHence\nTRANSFORMATIONSOFVARIABLES10dydx=b.(5.20)yInpracticeitisoftenmorenaturaltospeakofthepercentagechangeinyratherthantheproportionalchange,perunitchangeinx,inwhichcaseonemultipliestheestimateofbby100.Thefunctioncanbeconvertedintoamodelwhichislinearinparametersbytakingthelogarithmsofbothsides:bxbxlogy=logae=loga+loge=loga+bxloge(5.21)=loga+bxNotethatonlytheleftsideislogarithmicinvariables,andforthisreasonthemodelisdescribedassemi-logarithmic.Theinterpretationofbastheproportionalchangeinyperunitchangeinxisvalidonlyforsmallchangesinx.Whenchangesarenotsmall,theinterpretationmaybealittlemorecomplex.Supposethatyisrelatedtoxby(5.18)andthatxincreasesbyoneunittox'.Theny',thenewvalueofyisgivenbybx'b(x+1)y'=ae=aebxbc=aee=ye(5.22)2b=y(1+b++...)2!2Ifbissmall,bandfurthertermswillbeverysmallandcanbeneglected.Inthatcase,therightsideoftheequationsimplifiestoy(1+b)andtheoriginalmarginalinterpretationofbstillapplies.Example:semilogarithmicearningsfunctionForfittingearningsfunctions,thesemi-logarithmicmodelisgenerallyconsideredtobesuperiortothelinearmodel.Wewillstartwiththesimplestpossibleversion:bHGCEARNINGS=ae,(5.23)whereEARNINGSishourlyearnings,measuredindollars,andHGCishighestgradecompleted.AftertakinglogarithmsitbecomesLGEARN=a'+bHGC(5.24)whereLGEARNisthenaturallogarithmofEARNINGSanda'isthelogarithmofa.\nTRANSFORMATIONSOFVARIABLES11.reglgearnhgcSource|SSdfMSNumberofobs=570---------+------------------------------F(1,568)=93.21Model|21.681253121.681253Prob>F=0.0000
Residual |   132.12064   568   .23260676               R-squared     =  0.1410
---------+------------------------------               Adj R-squared =  0.1395
   Total |  153.801893   569  .270302096               Root MSE      =  .48229

------------------------------------------------------------------------------
  lgearn |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   .0792256   .0082061      9.655   0.000       .0631077    .0953435
   _cons |   1.358919   .1127785     12.049   0.000       1.137406    1.580433
------------------------------------------------------------------------------

The regression output, which uses EAEF Data Set 21, indicates that every extra year of schooling increases earnings by a proportion 0.079, that is, 7.9 percent, as a first approximation. Strictly speaking, a whole extra year of schooling is not marginal, so it would be more accurate to calculate $e^{0.079}$, which is 1.082. Thus a more accurate interpretation is that an extra year of schooling raises earnings by 8.2 percent.

The scatter diagram for the semi-logarithmic regression is shown in Figure 5.7. For the purpose of comparison, it is plotted together with the linear regression in a plot with the untransformed variables in Figure 5.8. The two regression lines do not differ greatly in their overall fit, but the semi-logarithmic specification has the advantage of not predicting negative earnings for individuals with low levels of schooling and of allowing the increase in earnings per year of schooling to increase with schooling. As will be seen in a later chapter, the most compelling reasons for preferring the semi-logarithmic specification are technical.

[Figure 5.7. Semi-logarithmic regression of earnings on schooling]

[Figure 5.8. Linear and semi-logarithmic regressions of earnings on schooling]

Exercises

5.1 Download the CES data set from the website and fit linear and (double) logarithmic regressions for your commodity, excluding observations with zero expenditure on your commodity. Interpret the regressions and perform appropriate tests.

5.2 Repeat the logarithmic regression in Exercise 5.1, adding the logarithm of the size of the household as an additional explanatory variable. Interpret the results and perform appropriate tests.

5.3 Using your EAEF data set, regress the (natural) logarithm of WEIGHT85 on the logarithm of HEIGHT. Interpret the regression results and perform appropriate tests.

5.4 Using your EAEF data set, regress the logarithm of earnings on HGC and ASVABC. Interpret the regression results and perform appropriate tests.

5.5* Using the OECD data set on employment growth rates and GDP growth rates in Exercise 2.6, plot a scatter diagram and investigate whether a nonlinear specification might be superior to a linear one.

5.3 The Disturbance Term

Thus far, nothing has been said about how the disturbance term is affected by these transformations. Indeed, in the discussion above it has been left out altogether.

The fundamental requirement is that the disturbance term should appear in the transformed equation as an additive term ($+\,u$) that satisfies the Gauss-Markov conditions. If it does not, the least squares regression coefficients will not have the usual properties, and the tests will be invalid.

For example, it is highly desirable that (5.6) should be of the form

$$y = \alpha + \beta z + u \tag{5.25}$$

when we take the random effect into account. Working backwards, this implies that the original (untransformed) equation should be of the form

$$y = \alpha + \frac{\beta}{x} + u \tag{5.26}$$

In this particular case, if it is true that in the original equation the disturbance term is additive and the Gauss-Markov conditions are satisfied, it will also be true in the transformed equation. No problem here.

What happens when we start off with a model such as

$$y = \alpha x^\beta \tag{5.27}$$

As we have seen, the regression model, after linearization by taking logarithms, is

$$\log y = \log\alpha + \beta\log x + u \tag{5.28}$$

when the disturbance term is included. Working back to the original equation, this implies that (5.27) should be rewritten

$$y = \alpha x^\beta v \tag{5.29}$$

where $v$ and $u$ are related by $\log v = u$. Hence to obtain an additive disturbance term in the regression equation, we must start with a multiplicative disturbance term in the original equation.

The disturbance term $v$ modifies $\alpha x^\beta$ by increasing it or reducing it by a random proportion, rather than by a random amount. Note that $u$ is equal to zero when $\log v$ is equal to zero, which occurs when $v$ is equal to 1. The random factor will be zero in the estimating equation (5.28) if $v$ happens to be equal to one. This makes sense, since if $v$ is equal to 1 it is not modifying $\alpha x^\beta$ at all.

For the $t$ tests and the $F$ tests to be valid, $u$ must be normally distributed. This means that $\log v$ must be normally distributed, which will occur only if $v$ is lognormally distributed.

What would happen if we assumed that the disturbance term in the original equation was additive, instead of multiplicative?

$$y = \alpha x^\beta + u \tag{5.30}$$

The answer is that when you take logarithms, there is no mathematical way of simplifying $\log(\alpha x^\beta + u)$. The transformation does not lead to a linearization. You would have to use a nonlinear regression technique, for example, of the type discussed in Section 5.4.

Example

The central limit theorem suggests that the disturbance term should have a normal distribution. It can be demonstrated that if the disturbance term has a normal distribution, so will the residuals from a properly-specified regression. An examination of the distribution of the residuals thus provides indirect evidence of the adequacy of the specification of a regression model. Figure 5.9 shows the residuals from linear and semi-logarithmic regressions of EARNINGS on HGC using EAEF Data Set 21, standardized so that they have standard deviation equal to 1, for comparison. The distribution for the semi-logarithmic residuals is much closer to a normal distribution than that for the linear regression. It is left skewed, but not nearly as sharply as the distribution for the linear regression.

[Figure 5.9. Standardized residuals from earnings function regressions (linear and semi-logarithmic)]

5.4 Nonlinear Regression

Suppose you believe that a variable $y$ depends on a variable $x$ according to the relationship

$$y = \alpha + \beta x^\gamma + u, \tag{5.31}$$

and you wish to obtain estimates of $\alpha$, $\beta$ and $\gamma$ given data on $y$ and $x$. There is no way of transforming (5.31) to obtain a linear relationship, and so it is not possible to apply the usual regression procedure.

Nevertheless one can still use the principle of minimizing the sum of the squares of the residuals to obtain estimates of the parameters. The procedure is best described as a series of steps:

1. You start by guessing plausible values for the parameters.
2. You calculate the predicted values of $y$ from the data on $x$, using these values of the parameters.
3. You calculate the residual for each observation in the sample, and hence RSS, the sum of the squares of the residuals.
4. You then make small changes in one or more of your estimates of the parameters.
5. You calculate the new predicted values of $y$, residuals and RSS.
6. If RSS is smaller than before, your new estimates of the parameters are better than the old ones and you take them as your new starting point.
7. You repeat steps 4, 5 and 6 again and again until you are unable to make any changes in the estimates of the parameters that would reduce RSS.
8. You conclude that you have minimized RSS, and you can describe the final estimates of the parameters as the least squares estimates.

Example

We will return to the bananas example in Section 5.1, where $y$ and $x$ are related by

$$y = \alpha + \frac{\beta}{x} + u \tag{5.32}$$

To keep things as simple as possible, we will assume that we know that $\alpha$ is equal to 12, so we have only one unknown parameter to estimate. We will suppose that we have guessed that the relationship is of the form (5.32), but we are too witless to think of the transformation discussed in Section 5.1. We instead use nonlinear regression.

Figure 5.10 shows the value of RSS that would result from any choice of $\beta$, given the values of $y$ and $x$ in Table 5.1. Suppose we started off with a guess of $-6.0$ for $\beta$. Our provisional equation would be

$$y = 12 - \frac{6}{x} \tag{5.33}$$

We would calculate the predicted values of $y$ and the residuals, and from the latter calculate a value of 29.17 for RSS.

Next we try $\beta = -7$. RSS is now 18.08, which is lower. We are going in the right direction. So we next try $\beta = -8$. RSS is 10.08. We keep going. Putting $\beta = -9$, RSS is 5.19. Putting $\beta = -10$, RSS is 3.39. Putting $\beta = -11$, RSS is 4.70.

Clearly with $\beta = -11$ we have overshot, because RSS has started rising again. We start moving backwards, but with smaller steps, say 0.1, trying $-10.9$, $-10.8$ etc. We keep moving backwards until we overshoot again, and then start moving forwards, with even smaller steps, say 0.01. Each time we overshoot, we reverse direction, cutting the size of the step. We continue doing this until we have achieved the desired accuracy in the calculation of the estimate of $\beta$. Table 5.3 shows the steps in this example.

[Figure 5.10. Nonlinear regression, RSS as a function of $\beta$]

The process shown in Table 5.3 was terminated after 25 iterations, by which time it is clear that the estimate, to two decimal places, is $-10.08$. Obviously, greater precision would have been obtained by continuing the iterative process further.

Note that, although the estimate is close to the true value of $-10$, it is not exactly the same as the estimate obtained in equation (5.9), which was $-10.99$. In principle the two sets of results should be identical, because both are minimizing the sum of the squares of the residuals. The discrepancy is caused by the fact that we have cheated slightly in the nonlinear case. We have assumed that $\alpha$ is equal to its true value, 12, instead of estimating it. If we had really failed to spot the transformation which allows us to use linear regression analysis, we would have had to use a nonlinear technique hunting for the best values of $\alpha$ and $\beta$ simultaneously, and the final values of $\alpha$ and $\beta$ would have been 12.48 and $-10.99$, respectively, as in equation (5.9).

TABLE 5.3

  $\beta$    RSS      $\beta$    RSS      $\beta$     RSS      $\beta$     RSS
   -6      29.17     -10.8     4.19     -10.1      3.38     -10.06     3.384
   -7      18.08     -10.7     3.98     -10.0      3.393    -10.07     3.384
   -8      10.08     -10.6     3.80     -10.01     3.391    -10.08     3.383
   -9       5.19     -10.5     3.66     -10.02     3.389    -10.09     3.384
  -10       3.39     -10.4     3.54     -10.03     3.387
  -11       4.70     -10.3     3.46     -10.04     3.386
  -10.9     4.43     -10.2     3.41     -10.05     3.385

In practice, the algorithms used for minimizing the residual sum of squares in a nonlinear model are mathematically far more sophisticated than the simple trial-and-error method described above. Nevertheless, until fairly recently a major problem with the fitting of nonlinear regressions was that it was very slow compared with linear regression, especially when there were several parameters to be estimated, and the high computing cost discouraged the use of nonlinear regression. This has changed with the enormous increase in the speed and power of computers that has occurred in the past few years. As a consequence more interest is being taken in the technique and some regression applications incorporate user-friendly nonlinear regression features.

5.5 Choice of Function: Box-Cox Tests

The possibility of fitting nonlinear models, either by means of a linearizing transformation or by the use of a nonlinear regression algorithm, greatly increases the flexibility
of regression analysis, but it also makes your task as a researcher more complex. You have to ask yourself whether you should start off with a linear relationship or a nonlinear one, and if the latter, what kind.

A graphical inspection, using the technique described in Section 4.2 in the case of multiple regression analysis, might help you decide. In the illustration in Section 5.1, it was obvious that the relationship was nonlinear, and it should not have taken much effort to discover that an equation of the form (5.2) would give a good fit. Usually, however, the issue is not so clear-cut. It often happens that several different nonlinear forms might approximately fit the observations if they lie on a curve.

When considering alternative models with the same specification of the dependent variable, the selection procedure is straightforward. The most sensible thing to do is to run regressions based on all the plausible functions you can imagine and choose the function that explains the greatest proportion of the variance of the dependent variable. If two or more functions are more or less equally good, you should present the results of each. Looking again at the illustration in Section 5.1, you can see that the linear function explained 69 percent of the variance of $y$, whereas the hyperbolic function (5.2) explained 97 percent. In this instance we have no hesitation in choosing the latter.

However, when alternative models employ different functional forms for the dependent variable, the problem of model selection becomes more complicated because you cannot make direct comparisons of $R^2$ or the sum of the squares of the residuals. In particular (and this is the most common example of the problem) you cannot compare these statistics for linear and logarithmic dependent variable specifications.

For example, in Section 2.6, the linear regression of earnings on highest grade completed had an $R^2$ of 0.104, and RSS was 34,420. For the semi-logarithmic version in Section 5.2, the corresponding figures are 0.141 and 132. RSS is much smaller for the logarithmic version, but this means nothing at all. The values of LGEARN are much smaller than those of EARNINGS, so it is hardly surprising that the residuals are also much smaller. Admittedly $R^2$ is unit-free, but it is referring to different concepts in the two equations. In one equation it is measuring the proportion of the variance of earnings explained by the regression, and in the other it is measuring the proportion of the variance of the logarithm of earnings explained. If $R^2$ is much greater for one model than for the other, you would probably be justified in selecting it without further fuss. But if the $R^2$s for the two models are close, simple eyeballing will not do.

The standard procedure under these circumstances is to perform what is known as a Box-Cox test (Box and Cox, 1964). If you are interested only in comparing models using $y$ and $\log y$ as the dependent variable, you can use a version developed by Zarembka (1968). It involves scaling the observations on $y$ so that the residual sums of squares in the linear and logarithmic models are rendered directly comparable. The procedure has the following steps:

1. You calculate the geometric mean of the values of $y$ in the sample. (This is equal to the exponential of the mean of $\log y$, so it is easy to calculate.)

2. You scale the observations on $y$ by dividing by this figure. So

$$y_i^* = y_i \,/\, \text{geometric mean of } y, \tag{5.34}$$

where $y_i^*$ is the scaled value in observation $i$.

3. You then regress the linear model using $y^*$ instead of $y$ as the dependent variable, and the logarithmic model using $\log y^*$ instead of $\log y$, but otherwise leaving the models unchanged. The residual sums of squares of the two regressions are now comparable, and the model with the lower sum is providing the better fit.

4. To see if one model is providing a significantly better fit, you calculate $(n/2)\log Z$ where $Z$ is the ratio of the residual sums of squares in the scaled regressions and $n$ is the number of observations, and take the absolute value (that is, ignore a minus sign if present). Under the null hypothesis that there is no difference, this statistic is distributed as a $\chi^2$ (chi-squared) statistic with 1 degree of freedom. If it exceeds the critical level of $\chi^2$ at the chosen significance level, you conclude that there is a significant difference in the fit.

Example

The test will be performed for the alternative specifications of the earnings function. The mean value of LGEARN is 2.430133. The scaling factor is therefore $\exp(2.430133) = 11.3604$. The residual sum of squares in a regression of the Zarembka-scaled earnings on HGC was 266.7; the residual sum of squares in a regression of the logarithm of Zarembka-scaled earnings was 132.1. Hence the test statistic was

$$\frac{570}{2}\log_e\frac{266.7}{132.1} = 200.2 \tag{5.35}$$

The critical value of $\chi^2$ with 1 degree of freedom at the 0.1 percent level is 10.8. Hence there is no doubt, according to this test, that the semi-logarithmic specification provides a better fit.

Note: the Zarembka-scaled regressions are solely for deciding which model you prefer. You should not pay any attention to their coefficients, only to their residual sums of squares. You obtain the coefficients directly from the unscaled version of the preferred model.

Exercises

5.6 Perform a Box-Cox test parallel to that described in this section using your EAEF data set.

5.7 Linear and logarithmic Zarembka-scaled regressions of expenditure on food at home on total household expenditure were fitted using the CES data in Section 5.2. The residual sums of squares were 225.1 and 184.6, respectively. The number of observations was 868, the household reporting no expenditure on food at home being dropped. Perform a Box-Cox test and state your conclusion.

5.8 Perform a Box-Cox test for your commodity in the CES data set, dropping households reporting no expenditure on your commodity.

Appendix 5.1

A More General Box-Cox Test

(Note: This section contains relatively advanced material which can safely be omitted at a first reading.)

The original Box-Cox procedure is more general than the version described in Section 5.5. Box and Cox noted that $y - 1$ and $\log y$ are special cases of the function $(y^\lambda - 1)/\lambda$, $y - 1$ being the function when $\lambda$ is equal to one, $\log y$ being the (limiting form of the) function as $\lambda$ tends to zero. There is no reason to suppose that either of these values of $\lambda$ is optimal, and hence it makes sense to try a range of values and see which yields the minimum value of RSS (after performing the Zarembka scaling). This exercise is known as a grid search. There is no purpose-designed facility for it in the typical regression package, but nevertheless it is not hard to execute. If you are going to try 10 values of $\lambda$, you generate within the regression package 10 new dependent variables using the functional form and the different values of $\lambda$, after first performing the Zarembka scaling. You then regress each of these separately on the explanatory variables. Table 5.4 gives the results for food expenditure at home, using the CES data, for various values of $\lambda$. The regressions were
run with disposable personal income being transformed in the same way as $y$, except for the Zarembka scaling. This is not necessary; you can keep the right-side variable or variables in linear form if you wish, if you think this appropriate, or you could execute a simultaneous, separate grid search for a different value of $\lambda$ for them.

TABLE 5.4

  $\lambda$    RSS      $\lambda$    RSS
   1.0      225.1       0.4      176.4
   0.9      211.2       0.3      175.5
   0.8      199.8       0.2      176.4
   0.7      190.9       0.1      179.4
   0.6      184.1       0.0      184.6
   0.5      179.3

The results indicate that the optimal value of $\lambda$ is about 0.3. In addition to obtaining a point estimate for $\lambda$, one may also obtain a confidence interval, but the procedure is beyond the level of this text. (If you are interested you could consult Spitzer, 1982.)

6 DUMMY VARIABLES (draft)

Note: This draft will not be updated in the current academic year. Nor will the slideshows. The slideshows cover the lectures, but are in an old format.

© C. Dougherty 1999. All rights reserved. Copies may be made for personal use. Version of 20.11.99.
http://econ.lse.ac.uk/ie

The Basic Idea

[Figure: COST plotted against N, with parallel lines for occupational schools (intercept $\alpha + \delta$) and regular schools (intercept $\alpha$)]

Suppose that you hypothesize that the schools have the following cost functions:

Regular schools:        $COST = \alpha + \beta N + u$
Occupational schools:   $COST = \alpha' + \beta N + u$

These equations incorporate the implicit assumption that the schools have the same marginal cost. Only the overhead cost differs. The assumption may or may not be true and it will be relaxed later.

Defining $\delta$ to be equal to $(\alpha' - \alpha)$, the equation for the occupational schools can be rewritten:

Occupational schools:   $COST = \alpha + \delta + \beta N + u$

Now the two equations can be combined as

$$COST = \alpha + \delta D + \beta N + u$$

where $D$ is an artificial variable, known as a dummy variable, with two possible values: 0 and 1. It is set equal to 0 for observations relating to the regular schools and to 1 for observations relating to the occupational schools. By including it we allow the intercept to switch between $\alpha$, for regular schools, and $\alpha + \delta$, for occupational schools.

Hence we can run just one regression using the complete sample, instead of two separate regressions for the different types of school. This has two advantages. The main one is that we have a larger sample, which will reduce the population variances of the coefficients, and this should be reflected by smaller standard errors. The other is that we obtain a single estimate of $\beta$, instead of two separate ones that are likely to conflict. The price we have to pay is that we have to assume that $\beta$ is the same for both subsamples. We will relax this assumption in due course.

Example: Regular and Occupational Schools in Shanghai

The scatter diagram shows the annual recurrent expenditure (COST), measured in yuan, then worth about 40 cents U.S., plotted against enrolment (N), for a sample of 74 secondary schools in Shanghai in the mid-1980s. As can be seen from the diagram, the occupational schools tend to cost more to run than the regular schools.

[Scatter diagram: COST against N, occupational schools and regular schools]

We will fit a cost function including a dummy variable OCC which is equal to 1 for occupational schools and 0 for regular schools:

$$COST = \alpha + \delta\,OCC + \beta N + u$$

Data

Table 9.1 shows the data for the first 10 of the 74 observations. Note that the dummy variable OCC takes value 0 if the observation relates to a regular school and 1 if it relates to an occupational school.

Table 9.1: Recurrent Expenditure (COST) and Enrolment (N) by Type of School

School   Type              COST      N    OCC
   1     Occupational    345,000    623    1
   2     Occupational    537,000    653    1
   3     Regular         170,000    400    0
   4     Occupational    526,000    663    1
   5     Regular         100,000    563    0
   6     Regular          28,000    236    0
   7     Regular         160,000    307    0
   8     Occupational     45,000    173    1
   9     Occupational    120,000    146    1
  10     Occupational     61,000     99    1

Once it has been defined, OCC is treated like any other explanatory variable in the regression model. The Stata output which follows shows the result of regressing COST on N and OCC.

. reg cost n occ

  Source |       SS       df       MS                  Number of obs =      74
---------+------------------------------               F(  2,    71) =   56.86
   Model |  9.0582e+11     2  4.5291e+11               Prob > F      =  0.0000
Residual |  5.6553e+11    71  7.9652e+09               R-squared     =  0.6156
---------+------------------------------               Adj R-squared =  0.6048
   Total |  1.4713e+12    73  2.0155e+10               Root MSE      =   89248

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |   331.4493   39.75844      8.337   0.000       252.1732    410.7254
     occ |   133259.1   20827.59      6.398   0.000       91730.06    174788.1
   _cons |  -33612.55   23573.47     -1.426   0.158      -80616.71    13391.61
------------------------------------------------------------------------------

The regression result is therefore (standard errors in parentheses)

$$\widehat{COST} = \underset{(24{,}000)}{-34{,}000} + \underset{(21{,}000)}{133{,}000}\,OCC + \underset{(40)}{331}\,N$$

Putting OCC equal to 0 and 1, respectively, we can obtain the implicit cost functions shown in the following diagram for the two types of school.

Regular schools:        $\widehat{COST} = -34{,}000 + 331N$
Occupational schools:   $\widehat{COST} = -34{,}000 + 133{,}000 + 331N = 99{,}000 + 331N$

[Diagram: fitted cost functions for occupational schools and regular schools, COST against N]

Extension to More than Two Categories

In actual fact there are four types of school in the sample: secondary technical schools and skilled workers' schools, the occupational schools in the previous example, and general academic schools and vocational schools. The vocational schools were classified with the general schools as regular schools in the previous example because they tended to be converted general schools without serious provision of occupational training.

[Scatter diagram: COST against N for technical schools, vocational schools, general schools and workers' schools]

We need to choose a reference category to which the basic equation applies. It is usually best to choose the dominant or most normal category, if there is one. We will choose general academic schools. Then define dummies for the other categories: TECH for technical schools, WORKER for skilled workers' schools, and VOC for vocational schools. They are defined to be 1 if the observation relates to that type of school and 0 otherwise. The model is now

$$COST = \alpha + \delta_T\,TECH + \delta_W\,WORKER + \delta_V\,VOC + \beta N + u$$

Data

Table 9.2: Recurrent Expenditure, Enrolments and Type of School

School   Type                 COST      N    TECH   WORKER   VOC
   1     Technical          345,000    623     1       0      0
   2     Technical          537,000    653     1       0      0
   3     General            170,000    400     0       0      0
   4     Skilled workers'   526,000    663     0       1      0
   5     General            100,000    563     0       0      0
   6     Vocational          28,000    236     0       0      1
   7     Vocational         160,000    307     0       0      1
   8     Technical           45,000    173     1       0      0
   9     Technical          120,000    146     1       0      0
  10     Skilled workers'    61,000     99     0       1      0

. reg cost n tech worker voc

  Source |       SS       df       MS                  Number of obs =      74
---------+------------------------------               F(  4,    69) =   29.63
   Model |  9.2996e+11     4  2.3249e+11               Prob > F      =  0.0000
Residual |  5.4138e+11    69  7.8461e+09               R-squared     =  0.6320
---------+------------------------------               Adj R-squared =  0.6107
   Total |  1.4713e+12    73  2.0155e+10               Root MSE      =   88578

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |   342.6335    40.2195      8.519   0.000       262.3978    422.8692
    tech |   154110.9   26760.41      5.759   0.000       100725.3    207496.4
  worker |   143362.4    27852.8      5.147   0.000       87797.57    198927.2
     voc |   53228.64   31061.65      1.714   0.091      -8737.646    115194.9
   _cons |  -54893.09   26673.08     -2.058   0.043      -108104.4   -1681.748
------------------------------------------------------------------------------

The regression result is therefore (standard errors in parentheses)

$$\widehat{COST} = \underset{(27{,}000)}{-55{,}000} + \underset{(27{,}000)}{154{,}000}\,TECH + \underset{(28{,}000)}{143{,}000}\,WORKER + \underset{(31{,}000)}{53{,}000}\,VOC + \underset{(40)}{342}\,N$$

From this equation we can obtain the implicit cost functions shown in the following diagram for the four types of school.

General schools:      $\widehat{COST} = -55{,}000 + 342N$
Technical schools:    $\widehat{COST} = -55{,}000 + 154{,}000 + 342N = 99{,}000 + 342N$
Skilled workers':     $\widehat{COST} = -55{,}000 + 143{,}000 + 342N = 88{,}000 + 342N$
Vocational schools:   $\widehat{COST} = -55{,}000 + 53{,}000 + 342N = -2{,}000 + 342N$

[Diagram: fitted cost functions for technical schools, vocational schools, general schools and workers' schools, COST against N]

Joint Explanatory Power of a Group of Dummy Variables

We can test the joint explanatory power of the dummy variables as a group by comparing the residual sum of squares in a regression containing them with that in an equation omitting them. If our theoretical model is

$$COST = \alpha + \delta_T\,TECH + \delta_W\,WORKER + \delta_V\,VOC + \beta N + u$$

then the null hypothesis for the test is $H_0$: $\delta_T = \delta_W = \delta_V = 0$. $H_1$ is that at least one $\delta$ is non-zero.

To perform the test we need the result of regressing COST only on N:

. reg cost n

  Source |       SS       df       MS                  Number of obs =      74
---------+------------------------------               F(  1,    72) =   46.82
   Model |  5.7974e+11     1  5.7974e+11               Prob > F      =  0.0000
Residual |  8.9160e+11    72  1.2383e+10               R-squared     =  0.3940
---------+------------------------------               Adj R-squared =  0.3856
   Total |  1.4713e+12    73  2.0155e+10               Root MSE      = 1.1e+05

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interv
al]---------+--------------------------------------------------------------------n|339.043249.551446.8420.000240.2642437.8222_cons|23953.327167.960.8820.381-30205.0478111.65------------------------------------------------------------------------------6\nhttp://econ.lse.ac.uk/ie7ThetestisastandardFtestforagroupofexplanatoryvariables.ThenumeratoristhereductioninRSSwhentheyareadded,dividedbythecost.Thecostisthenumberofdegreesoffreedomgivenup,whichinthiscaseis3,oneforeachextraparameterestimated.ThedenominatorisRSSafterthedummyvariableshavebeenincluded,dividedbythenumberofdegreesoffreedomremaining,inthiscase69.(.89160541383-.)11674.F(,)369===149.5413869.007846.11NotethattheRSSgivenintheregressionresultswereallmultipliedby10,butthisfactorcanbeignoredbecauseitappearsinboththenumeratorandthedenominatoroftheFstatistic.Notealsothattheratioswerecalculatedtofoursignificantfigures.ThiswillensurethattheFstatisticwillbecorrecttothreesignificantfigures.ThecriticalvalueofF(3,69)willbealittlebelow4.13,thecriticalvalueforF(3,60),atthe1%significancelevel,sowecanrejectH0atthislevel.Indeedifweperformeda0.1%significancetest,wewouldstillrejectH0.ThisisonlytobeexpectedbecausettestsshowedthatdTanddVwerebothsignificantlydifferentfromzero,anditisrare(butnotimpossible)fortheFtestnottorejectH0whenoneormorecoefficientsissignificant.ChangeofOmittedCategoryTheskilledworkers'schoolswereconsiderablylessacademicthantheothers,eventhetechnicalschools.Supposethatwewishtoinvestigatewhethertheircostsweresignificantlydifferentfromtheothers.Theeasiestwaytodothisistomakethemtheomittedcategory(referencecategory).Thenthecoefficientsofthedummyvariablesbecomeestimatesofthedifferencesbetweentheoverheadcostsoftheothertypesofschoolandthoseoftheskilledworkers'schools.Sinceskilledworkers'schoolsarenowthereferencecategory,weneedadummyvariable,whichwillbecalledGEN,forthegeneralacademicschools.COST=a+dTTECH+dGGEN+dVVOC+bN+uThedatatablenowisasshowninTable9.3.Table9.3:RecurrentExpenditure,EnrolmentsandTypeofSchoolSchoolTyp
eCOSTNTECHGENVOC1Technical345,0006231002Technical537,0006531003General170,0004000104Skilledworkers'526,0006630005General100,0005630106Vocational28,0002360017Vocational160,0003070018Technical45,0001731009Technical120,00014610010Skilledworkers'61,000990007\nhttp://econ.lse.ac.uk/ie8TheStataregressionresultisgivenbelow:.regcostntechvocgenSource|SSdfMSNumberofobs=74---------+------------------------------F(4,69)=29.63Model|9.2996e+1142.3249e+11Prob>F=0.0000Residual|5.4138e+11697.8461e+09R-squared=0.6320---------+------------------------------AdjR-squared=0.6107Total|1.4713e+12732.0155e+10RootMSE=88578------------------------------------------------------------------------------cost|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------n|342.633540.21958.5190.000262.3978422.8692tech|10748.5130524.870.3520.726-50146.9371643.95voc|-90133.7433984.22-2.6520.010-157930.4-22337.07gen|-143362.427852.8-5.1470.000-198927.2-87797.57_cons|88469.2928849.563.0670.00330916.01146022.6------------------------------------------------------------------------------Theregressionresultistherefore(standarderrorsinparentheses)^COST=88,000+11,000TECH-143,000GEN-90,000VOC+342N(29,000)(30,000)(28,000)(34,000)(40)Fromthisequationwecanagainobtaintheimplicitcostfunctionsshowninthefollowingdiagramforthefourtypesofschool.^Generalschools:COST=88,000–143,000+342N=-55,000+342N^Technicalschools:COST=88,000+11,000+342N=99,000+342N^Skilledworkers':COST=88,000+342N^Vocationalschools:COST=88,000-90,000+342N=-2,000+342NNotethattheseequationsareidenticaltothoseobtainedwhengeneralschoolsweretheomittedcategory.Thechoiceofomittedcategorydoesnotaffectthesubstanceoftheregressionresults.Theonlycomponentswhichchangearethestandarderrorsandthe2meaningofthettests.R,thecoefficientsoftheothervariables,thetstatisticsfortheothervariables.andtheFstatisticfortheequationasawholedonotalter.Andofcoursethediagramrepresentingthefourcostfunctionsisthesameasbefore.8\nhttp:/
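
Two of the calculations in this part of the text are easy to verify numerically: the F statistic for the joint explanatory power of the three school-type dummies, and the claim that the implicit cost functions do not depend on which category is omitted. The following is a minimal sketch in Python, not part of the original notes. The coefficients and residual sums of squares are copied from the Stata output; the function names `f_statistic` and `implied_intercepts` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative check only; names are assumptions, figures come from the Stata output.

def f_statistic(rss_restricted, rss_unrestricted, n_restrictions, df_remaining):
    """F statistic for a group of variables: the reduction in RSS per degree of
    freedom given up, divided by the remaining RSS per degree of freedom left."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / df_remaining
    return numerator / denominator


def implied_intercepts(coefficients, reference):
    """Intercept of the implicit cost function for every school type:
    the constant for the reference category, constant plus dummy otherwise."""
    result = {reference: coefficients["_cons"]}
    for name, value in coefficients.items():
        if name not in ("_cons", "n"):
            result[name] = coefficients["_cons"] + value
    return result


# RSS without and with the dummies (regressions of cost on n, and on n tech worker voc).
F = f_statistic(8.9160e11, 5.4138e11, n_restrictions=3, df_remaining=69)

# General schools as the omitted category.
general_ref = {"_cons": -54893.09, "tech": 154110.9, "worker": 143362.4,
               "voc": 53228.64, "n": 342.6335}
# Skilled workers' schools as the omitted category.
worker_ref = {"_cons": 88469.29, "tech": 10748.51, "gen": -143362.4,
              "voc": -90133.74, "n": 342.6335}

a = implied_intercepts(general_ref, "gen")
b = implied_intercepts(worker_ref, "worker")

print(f"F(3, 69) = {F:.1f}")
for school in ("gen", "tech", "worker", "voc"):
    print(f"{school:>6}: {a[school]:12.2f} {b[school]:12.2f}")
```

The two columns of intercepts agree up to the rounding in the printed coefficients, and the slope on N is the same in both regressions, which is what it means for the choice of omitted category to leave the fitted cost functions unchanged.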
More Than One Set of Dummy Variables

It is common to have more than one set of dummy variables in a regression equation. An example will be given here. Some of the schools were boarding schools (residential schools), others day schools. You would expect the overhead costs of boarding schools to be relatively high, so we introduce a dummy variable, BOARD, which is equal to 1 for boarding schools and 0 for the others.

[Figure: scatter diagram of COST against N, distinguishing non-residential occupational schools, residential occupational schools, non-residential regular schools, and residential regular schools]

For the sake of simplicity we will revert to the occupational/regular classification of school type. The model now becomes

$$COST = \alpha + \delta\,OCC + \varepsilon\,BOARD + \beta N + u$$

Data

Of the first 10 schools in the sample, the second, fourth and seventh were boarding schools. Hence the values of BOARD are as shown in Table 9.4.

Table 9.4: Recurrent Expenditure by School Type and Whether Day or Boarding

School  Type                        COST      N   OCC  BOARD
  1     Occupational, day         345,000    623    1     0
  2     Occupational, boarding    537,000    653    1     1
  3     Regular, day              170,000    400    0     0
  4     Occupational, boarding    526,000    663    1     1
  5     Regular, day              100,000    563    0     0
  6     Regular, day               28,000    236    0     0
  7     Regular, boarding         160,000    307    0     1
  8     Occupational, day          45,000    173    1     0
  9     Occupational, day         120,000    146    1     0
 10     Occupational, day          61,000     99    1     0

The Stata regression result is given below:

. reg cost n occ board

  Source |       SS       df       MS              Number of obs =      74
---------+------------------------------           F(  3,    70) =   40.43
   Model |  9.3297e+11     3  3.1099e+11           Prob > F      =  0.0000
Residual |  5.3838e+11    70  7.6911e+09           R-squared     =  0.6341
---------+------------------------------           Adj R-squared =  0.6184
   Total |  1.4713e+12    73  2.0155e+10           Root MSE      =   87699

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |    321.833   39.40225      8.168   0.000     243.2477    400.4183
     occ |   109564.6   24039.58      4.558   0.000     61619.15      157510
   board |   57909.01   30821.31      1.879   0.064    -3562.137    119380.2
   _cons |  -29045.27   23291.54     -1.247   0.217    -75498.78    17408.25
------------------------------------------------------------------------------

The regression result is therefore (standard errors in parentheses)

$$\widehat{COST} = \underset{(23{,}000)}{-29{,}000} + \underset{(24{,}000)}{110{,}000}\,OCC + \underset{(31{,}000)}{58{,}000}\,BOARD + \underset{(39)}{322}\,N$$

[Figure: COST plotted against N, with the fitted cost functions for non-residential occupational schools, residential occupational schools, non-residential regular schools, and residential regular schools]

Slope Dummy Variables

Suppose that we wish to relax the assumption that the marginal cost per student is the same for all types of school. We can do this by introducing a slope dummy variable, NOCC, defined to be the product of N and OCC:

$$COST = \alpha + \delta\,OCC + \beta N + \lambda\,NOCC + u$$

If OCC is zero, so is NOCC and the equation becomes

$$COST = \alpha + \beta N + u$$

If OCC is one, NOCC is equal to N and the equation becomes

$$COST = \alpha + \delta + (\beta + \lambda) N + u$$

$\lambda$ is therefore the incremental marginal cost associated with occupational schools, in the same way that $\delta$ is the incremental overhead cost associated with them.

Data

School  Type             COST      N   OCC
  1     Occupational   345,000    623    1
  2     Occupational   537,000    653    1
  3     Regular        170,000    400    0
  4     Occupational   526,000    663    1
  5     Regular        100,000    563    0
  6     Regular         28,000    236    0
  7     Regular        160,000    307    0
  8     Occupational    45,000    173    1
  9     Occupational   120,000    146    1
 10     Occupational    61,000     99    1

. g nocc = n*occ
. reg cost n occ nocc

  Source |       SS       df       MS              Number of obs =      74
---------+------------------------------           F(  3,    70) =   49.64
   Model |  1.0009e+12     3  3.3363e+11           Prob > F      =  0.0000
Residual |  4.7045e+11    70  6.7207e+09           R-squared     =  0.6803
---------+------------------------------           Adj R-squared =  0.6666
   Total |  1.4713e+12    73  2.0155e+10           Root MSE      =   81980

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |   152.2982   60.01932      2.537   0.013     32.59349     272.003
     occ |  -3501.177   41085.46     -0.085   0.932    -85443.55    78441.19
    nocc |   284.4786   75.63211      3.761   0.000     133.6351    435.3221
   _cons |   51475.25   31314.84      1.644   0.105    -10980.24    113930.7
------------------------------------------------------------------------------

[Figure: COST plotted against N, with separate fitted cost functions for occupational schools and regular schools]

Joint Explanatory Power of the Intercept and Slope Dummy Variables

The joint explanatory power of the intercept and slope dummies can be tested using the usual F test for a group of variables. RSS in the regression without the dummy variables was $8.9160 \times 10^{11}$, and in the regression with the dummy variables it was $4.7045 \times 10^{11}$. The F statistic is therefore

$$F(2, 70) = \frac{(8.9160 - 4.7045)/2}{4.7045/70} = 31.3$$

The critical value of F(2, 70) at the 1% significance level is a little below 4.98, the critical value for F(2, 60), so we come to the conclusion that the null hypothesis $H_0: \delta = \lambda = 0$ should be rejected. We know from the t tests that $\lambda$ is significantly different from zero but $\delta$ is not.

Chow Test

The Chow test is designed to test whether the same regression model can be applied to two or more distinct subsamples of observations in the sample. We will illustrate it with reference to the school cost function data, making a simple distinction between regular and occupational schools.

We need to run three regressions. In the first we regress COST on N using the whole sample. We have already done this (see above). This is called the pooled regression. We make a note of RSS for it, $8.9160 \times 10^{11}$. In the second and third we run the same regression for the two subsamples of regular and occupational schools separately and again make a note of RSS.

. reg cost n if occ==0

  Source |       SS       df       MS              Number of obs =      40
---------+------------------------------           F(  1,    38) =   13.53
   Model |  4.3273e+10     1  4.3273e+10           Prob > F      =  0.0007
Residual |  1.2150e+11    38  3.1973e+09           R-squared     =  0.2626
---------+------------------------------           Adj R-squared =  0.2432
   Total |  1.6477e+11    39  4.2249e+09           Root MSE      =   56545

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |   152.2982   41.39782      3.679   0.001     68.49275    236.1037
   _cons |   51475.25   21599.14      2.383   0.022     7750.064    95200.43
------------------------------------------------------------------------------

. reg cost n if occ==1

  Source |       SS       df       MS              Number of obs =      34
---------+------------------------------           F(  1,    32) =   55.52
   Model |  6.0538e+11     1  6.0538e+11           Prob > F      =  0.0000
Residual |  3.4895e+11    32  1.0905e+10           R-squared     =  0.6344
---------+------------------------------           Adj R-squared =  0.6229
   Total |  9.5433e+11    33  2.8919e+10           Root MSE      = 1.0e+05

------------------------------------------------------------------------------
    cost |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       n |   436.7769   58.62085      7.451   0.000     317.3701    556.1836
   _cons |   47974.07   33879.03      1.416   0.166    -21035.26    116983.4
------------------------------------------------------------------------------

RSS is $1.2150 \times 10^{11}$ for the regular schools and $3.4895 \times 10^{11}$ for the occupational schools. The total RSS from the subsample regressions is therefore $4.7045 \times 10^{11}$. It is lower than RSS for the pooled regression because the subsample regressions fit their subsamples better than the pooled regression. The question is whether the difference in the fit is significant, and once again we test this with an F test.

The numerator is the improvement in fit on splitting the sample, divided by the cost (having to estimate two sets of parameters instead of only one). In this case it is $(8.9160 - 4.7045) \times 10^{11}$ divided by 2 (we have had to estimate two intercepts and two slope coefficients, instead of only one of each). The denominator is the joint RSS remaining after splitting the sample, divided by the joint number of degrees of freedom remaining. In this case it is $4.7045 \times 10^{11}$ divided by 70 (74 observations, less four degrees of freedom because two parameters were estimated in each equation). When we calculate the F statistic the $10^{11}$ factors cancel out and we have

$$F(2, 70) = \frac{4.2115/2}{4.7045/70} = 31.3$$

The critical value of F(2, 70) at the 1% significance level is a little below 4.98, the critical value for F(2, 60), so we come to the conclusion that there is a significant improvement in the fit on splitting the sample and that we should not use the pooled regression.

Note that this test is exactly equivalent to the F test on the joint explanatory power of the intercept and slope dummy variables: the residual sums of squares, the degrees of freedom, and hence the F statistic are the same in the two tests.

[Figure: COST plotted against N, with the separately fitted cost functions for occupational schools and regular schools]

7 SPECIFICATION OF REGRESSION VARIABLES: A PRELIMINARY SKIRMISH

What are the consequences of including in the regression a variable that should not be there? What are the consequences of leaving out a variable that should be included? What happens if you have difficulty finding data on a variable and use a proxy instead? This chapter is a preliminary skirmish with these issues in the sense that it focuses on the consequences of variable misspecification, rather than on procedures for model selection, a much more complex subject that is left to later in the text. The chapter concludes by showing how simple restrictions on the parameters can be tested.

7.1 Model Specification

The construction of an economic model involves the specification of the relationships that constitute it, the specification of the variables that participate in each relationship, and the mathematical function representing each relationship. The last element was discussed in Chapter 5. In this chapter, we will consider the second element, and we will continue to assume that the model consists of just one equation. We will discuss the application of regression analysis to models consisting of systems of simultaneous relationships in Chapter 10.

If we know exactly which explanatory variables ought to be included in the equation when we undertake regression analysis, our task is limited to calculating estimates of their coefficients, confidence intervals for these estimates, and so on. In practice, however, we can never be sure that we have specified the equation correctly. Economic theory ought to provide a guide, but theory is never perfect. Without being aware of it, we might be including some variables that ought not to be in the equation, and we might be leaving out others that ought to be included.

The properties of the regression estimates of the coefficients depend crucially on the validity of the specification of the model. The consequences of misspecification of the variables in a relationship are summarized in Table 7.1:

© C. Dougherty 2000. All rights reserved. Copies may be made for personal use. Version of 6 January 2000.

TABLE 7.1
Consequences of Variable Specification

Fitted model $\hat{y} = a + b_1 x_1$
  True model $y = \alpha + \beta_1 x_1 + u$: Correct specification, no problems.
  True model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$: Coefficients are biased (in general). Standard errors are invalid.

Fitted model $\hat{y} = a + b_1 x_1 + b_2 x_2$
  True model $y = \alpha + \beta_1 x_1 + u$: Coefficients are unbiased (in general) but inefficient. Standard errors are valid (in general).
  True model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$: Correct specification, no problems.

1. If you leave out a variable that ought to be included, the regression estimates are in general (but not always) biased. The standard errors of the coefficients and the corresponding t tests are in general invalid.
2. If you include a variable that ought not to be in the equation, the regression coefficients are in general (but not always) inefficient but not biased. The standard errors are in general valid but, because the regression estimation is inefficient, they will be needlessly large.

We will begin by discussing these two cases and then come to some broader issues of model specification.

7.2 The Effect of Omitting a Variable that Ought to be Included

The Problem of Bias

Suppose that the dependent variable y depends on two variables $x_1$ and $x_2$ according to a relationship

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u, \tag{7.1}$$

but you are unaware of the importance of $x_2$. Thinking that the model should be

$$y = \alpha + \beta_1 x_1 + u, \tag{7.2}$$

you use regression analysis to fit

$$\hat{y} = a + b_1 x_1, \tag{7.3}$$

and calculate $b_1$ using the expression $Cov(x_1, y)/Var(x_1)$, instead of the correct expression given by (4.11). By definition, $b_1$ is an unbiased estimator of $\beta_1$ if and only if $E(b_1)$ is equal to $\beta_1$. In fact, if (7.1) is true,

$$E\left\{\frac{Cov(x_1, y)}{Var(x_1)}\right\} = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} \tag{7.4}$$

We shall give first an intuitive explanation of this and then a formal proof.

If $x_2$ is omitted from the regression model, $x_1$ will appear to have a double effect, as illustrated in Figure 7.1. It will have a direct effect and also a proxy effect when it mimics the effect of $x_2$. The apparent indirect effect of $x_1$ on y depends on two factors: the apparent ability of $x_1$ to mimic $x_2$, and the effect of $x_2$ on y.

[Figure 7.1. Diagram showing the direct effect of $x_1$ on y, holding $x_2$ constant; the effect of $x_2$ on y; and the apparent effect of $x_1$, acting as a mimic for $x_2$]

The apparent ability of $x_1$ to explain $x_2$ is determined by the slope coefficient h in the pseudo-regression

$$\hat{x}_2 = g + h x_1. \tag{7.5}$$

h of course is given by the usual simple regression formula, in this case $Cov(x_1, x_2)/Var(x_1)$. The effect of $x_2$ on y is $\beta_2$, so the mimic effect via $x_2$ may be written $\beta_2 Cov(x_1, x_2)/Var(x_1)$. The direct effect of $x_1$ on y is $\beta_1$, and hence when y is regressed on $x_1$, omitting $x_2$, the coefficient of $x_1$ is given by

$$\beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} + \text{sampling error} \tag{7.6}$$

Provided that $x_1$ is nonstochastic, the expected value of the coefficient will be the sum of the first two terms. The presence of the second term implies that in general the expected value of the coefficient will be different from the true value $\beta_1$ and therefore biased.

The formal proof of (7.4) is quite straightforward. We begin by making a theoretical expansion of the estimator $b_1$:

$$b_1 = \frac{Cov(x_1, y)}{Var(x_1)} = \frac{Cov(x_1, [\alpha + \beta_1 x_1 + \beta_2 x_2 + u])}{Var(x_1)}$$
$$= \frac{1}{Var(x_1)}\left\{Cov(x_1, \alpha) + Cov(x_1, \beta_1 x_1) + Cov(x_1, \beta_2 x_2) + Cov(x_1, u)\right\}$$
$$= \frac{1}{Var(x_1)}\left\{0 + \beta_1 Var(x_1) + \beta_2 Cov(x_1, x_2) + Cov(x_1, u)\right\}$$
$$= \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} + \frac{Cov(x_1, u)}{Var(x_1)} \tag{7.7}$$

Provided that $x_1$ and $x_2$ are nonstochastic, the first two terms are unaffected when we take the expected value of $b_1$, and the expected value of the third is 0. Hence we obtain (7.4).

This confirms our earlier intuitive conclusion that $b_1$ is biased by an amount $\beta_2 Cov(x_1, x_2)/Var(x_1)$. The direction of the bias will depend on the signs of $\beta_2$ and $Cov(x_1, x_2)$. For example, if $\beta_2$ is positive and the covariance is positive, the bias will be positive and $b_1$ will tend to overestimate $\beta_1$. You can work out the other cases for yourself.

There is, however, one exceptional case where $b_1$ is unbiased after all. That is when the sample covariance between $x_1$ and $x_2$ happens to be exactly 0. If $Cov(x_1, x_2)$ is 0, the bias term disappears. Indeed, the regression coefficient obtained using simple regression will be exactly the same as if you had used a properly specified multiple regression. Of course the bias term would also be 0 if $\beta_2$ were 0, but then the misspecification does not arise.

Invalidation of the Statistical Tests

Another serious consequence of omitting a variable that ought to be included in the regression is that the standard errors of the coefficients and the test statistics are in general invalidated. This means of course that you are not in principle able to test any hypotheses with your regression results.

Example

The problem of omitted variable bias will first be illustrated with the educational attainment function using EAEF Data Set 21. For the present purposes, it will be assumed that the true model is

$$HGC = \alpha + \beta_1 ASVABC + \beta_2 HGCM + u, \tag{7.8}$$

although obviously this is a great oversimplification. The first part of the regression output shows the result of this regression. The second and third parts of the output then show the effects of omitting HGCM and ASVABC, respectively. When HGCM is omitted,

$$E(b_1) = \beta_1 + \beta_2 \frac{Cov(ASVABC, HGCM)}{Var(ASVABC)} \tag{7.9}$$

. reg hgc asvabc hgcm

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  2,   567) =  156.81
   Model |   1230.2039     2  615.101949           Prob > F      =  0.0000
Residual |  2224.04347   567  3.92247526           R-squared     =  0.3561
---------+------------------------------           Adj R-squared =  0.3539
   Total |  3454.24737   569  6.07073351           Root MSE      =  1.9805

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1381062   .0097494     14.166   0.000     .1189567    .1572556
    hgcm |    .154783   .0350728      4.413   0.000     .0858946    .2236715
   _cons |   4.791277   .5102431      9.390   0.000      3.78908    5.793475
------------------------------------------------------------------------------

. reg hgc asvabc

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  1,   568) =  284.89
   Model |  1153.80864     1  1153.80864           Prob > F      =  0.0000
Residual |  2300.43873   568  4.05006818           R-squared     =  0.3340
---------+------------------------------           Adj R-squared =  0.3329
   Total |  3454.24737   569  6.07073351           Root MSE      =  2.0125

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1545378   .0091559     16.879   0.000     .1365543    .1725213
   _cons |   5.770845   .4668473     12.361   0.000     4.853888    6.687803
------------------------------------------------------------------------------

. reg hgc hgcm

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  1,   568) =   83.59
   Model |  443.110436     1  443.110436           Prob > F      =  0.0000
Residual |  3011.13693   568  5.30129742           R-squared     =  0.1283
---------+------------------------------           Adj R-squared =  0.1267
   Total |  3454.24737   569  6.07073351           Root MSE      =  2.3025

------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
    hgcm |   .3445198   .0376833      9.142   0.000     .2705041    .4185354
   _cons |   9.506491   .4495754     21.145   0.000     8.623458    10.38952
------------------------------------------------------------------------------

The correlation between ASVABC and HGCM is positive (0.38). Therefore the covariance term is positive. Since variances are always positive (unless equal to 0), the only other relevant factor for determining the sign of the bias is $\beta_2$. It is reasonable to assume that this is positive, and the fact that its estimate in the first regression is indeed positive and highly significant provides overwhelming corroborative evidence. One would therefore anticipate that the coefficient of ASVABC will be upwards biased when HGCM is omitted, and you can see that it is indeed larger (0.155 compared with 0.138). Not all of the difference should be attributed to bias. Part of it may be attributable to the effects of the disturbance term, which could go either way.

Similarly, when ASVABC is omitted,

$$E(b_2) = \beta_2 + \beta_1 \frac{Cov(ASVABC, HGCM)}{Var(HGCM)} \tag{7.10}$$

Since $\beta_1$ is also likely to be positive, the coefficient of HGCM in the third regression should be upwards biased. The estimate in the third regression is indeed higher than that in the first (0.345 compared with 0.155).

In this example, the omission of one explanatory variable causes the coefficient of the other to be overestimated. However, the bias could just as easily be negative. The sign of the bias depends on the signs of the true coefficient of the omitted variable and of the sample covariance between the included and omitted variables, and these will depend on the nature of the model being investigated.

It should be emphasized that the analysis above applies only to the case where the true model is a multiple regression model with two explanatory variables. When there are more explanatory variables, it may be difficult to predict the impact of omitted variable bias mathematically. Nevertheless it may be possible to conclude that the estimates of the coefficients of some of the variables may have been inflated or deflated by the bias.

$R^2$ in the Presence of Omitted Variable Bias

In Section 4.* it was asserted that in general it is impossible to determine the contribution to $R^2$ of each explanatory variable in multiple regression analysis, and we are now in a position to see why.

We will discuss the issue first with reference to the educational attainment model above. In the regression of HGC on ASVABC alone, $R^2$ was 0.33. In the regression on HGCM alone, it was 0.13. Does this mean that ASVABC explains 33% of the variance in HGC and HGCM 13%? No, because this would imply that together they would explain 46% of the variance, and this conflicts with the finding in the multiple regression that their joint explanatory power is 0.36.

The explanation is that in the simple regression of HGC on ASVABC, ASVABC is acting partly as a variable in its own right and partly as a proxy for the missing HGCM, as in Figure 7.1. $R^2$ for that regression therefore reflects the combined explanatory power of ASVABC in both of these roles, and not just its direct explanatory power. Hence 0.33 overestimates the latter.

Similarly, in the simple regression of HGC on HGCM, HGCM is acting partly as a proxy for the missing ASVABC, and the level of $R^2$ in that regression reflects the combined explanatory power of HGCM in both those roles, and not just its direct explanatory power.

In this example, the explanatory power of the two variables overlapped, with the consequence that $R^2$ in the multiple regression was less than the sum of $R^2$ in the individual simple regressions. However it is also possible for $R^2$ in the multiple regression to be greater than the sum of $R^2$ in the individual simple regressions, as is shown in the accompanying regression output for an earnings function model. It is assumed that the true model is

$$LGEARN = \alpha + \beta_1 HGC + \beta_2 MALE + u, \tag{7.11}$$

where MALE is a dummy variable equal to 1 for males and 0 for females. The first part of the regression output shows the result of fitting (7.11), and the second and third parts show the results of omitting, first MALE, and then HGC. $R^2$ in the multiple regression is 0.188, while it is 0.141 and 0.038 in the simple regressions, the sum being 0.179. As in the previous example, it can be assumed that both $\beta_1$ and $\beta_2$ are positive. However HGC and MALE are negatively correlated, so in this case the coefficients of HGC and MALE in the second and third regressions may be expected to be biased downwards. As a consequence, the apparent explanatory power of HGC and MALE in the simple regressions is underestimated.

. reg lgearn hgc male

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  2,   567) =   65.74
   Model |   28.951332     2   14.475666           Prob > F      =  0.0000
Residual |  124.850561   567  .220194992           R-squared     =  0.1882
---------+------------------------------           Adj R-squared =  0.1854
   Total |  153.801893   569  .270302096           Root MSE      =  .46925

------------------------------------------------------------------------------
  lgearn |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   .0818944   .0079976     10.240   0.000     .0661858     .097603
    male |   .2285156   .0397695      5.746   0.000     .1504021    .3066291
   _cons |    1.19254   .1134845     10.508   0.000     .9696386    1.415441
------------------------------------------------------------------------------

. reg lgearn hgc

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  1,   568) =   93.21
   Model |   21.681253     1   21.681253           Prob > F      =  0.0000
Residual |   132.12064   568   .23260676           R-squared     =  0.1410
---------+------------------------------           Adj R-squared =  0.1395
   Total |  153.801893   569  .270302096           Root MSE      =  .48229

------------------------------------------------------------------------------
  lgearn |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hgc |   .0792256   .0082061      9.655   0.000     .0631077    .0953435
   _cons |   1.358919   .1127785     12.049   0.000     1.137406    1.580433
------------------------------------------------------------------------------

. reg lgearn male

  Source |       SS       df       MS              Number of obs =     570
---------+------------------------------           F(  1,   568) =   22.51
   Model |  5.86288165     1  5.86288165           Prob > F      =  0.0000
Residual |  147.939011   568  .260456005           R-squared     =  0.0381
---------+------------------------------           Adj R-squared =  0.0364
   Total |  153.801893   569  .270302096           Root MSE      =  .51035

------------------------------------------------------------------------------
  lgearn |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |   .2048652   .0431797      4.744   0.000     .1200538    .2896767
   _cons |   2.313324    .032605     70.950   0.000     2.249282    2.377365
------------------------------------------------------------------------------

Exercises

7.1 Regress LGEARN (1) on HGC and ASVABC, (2) on HGC only, and (3) on ASVABC only. Calculate the correlation between HGC and ASVABC. Compare the coefficients of HGC in regressions (1) and (2). Give both mathematical and intuitive explanations of the direction of the change. Also compare the coefficients of ASVABC in regressions (1) and (3) and explain the direction of the change.

7.2* The table gives the results of multiple and simple regressions of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household, using a sample of 869 households in the 1995 Consumer Expenditure Survey. The correlation coefficient for LGEXP and LGSIZE was 0.45. Explain the variations in the regression coefficients.

               (1)       (2)       (3)
LGEXP         0.29      0.48        -
             (0.02)    (0.02)
LGSIZE        0.49        -       0.63
             (0.03)              (0.02)
constant      4.72      3.17      7.50
             (0.22)    (0.24)    (0.02)
$R^2$         0.52      0.31      0.42

7.3 Suppose that y is determined by $x_1$ and $x_2$ according to $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$, and that $Cov(x_1, x_2)$ is 0. Use this to simplify the multiple regression coefficient $b_1$ given by

$$b_1 = \frac{Cov(x_1, y)\,Var(x_2) - Cov(x_2, y)\,Cov(x_1, x_2)}{Var(x_1)\,Var(x_2) - \{Cov(x_1, x_2)\}^2}$$

and show that it reduces to the simple regression expression. What are the implications for the specification of the regression equation?

7.3 The Effect of Including a Variable that Ought not to be Included

Suppose that the true model is

$$y = \alpha + \beta_1 x_1 + u \tag{7.12}$$

and you think it is

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u, \tag{7.13}$$

and you estimate $b_1$ using (4.11) instead of $Cov(x_1, y)/Var(x_1)$.

In general there is no problem of bias, even though $b_1$ has been calculated incorrectly. $E(b_1)$ will still be equal to $\beta_1$, but in general $b_1$ will be an inefficient estimator. It will be more erratic, in the sense of having a larger variance about $\beta_1$, than if it had been calculated correctly. This is illustrated in Figure 7.2.

[Figure 7.2. Probability density functions of $b_1$ about $\beta_1$: a narrower distribution when $b_1$ is estimated using $Cov(x_1, y)/Var(x_1)$ (exploits the information $\beta_2 = 0$) and a wider one when $b_1$ is estimated using (4.11) (does not exploit the information $\beta_2 = 0$)]

This is easy to explain intuitively. The true model may be rewritten

$$y = \alpha + \beta_1 x_1 + 0 x_2 + u \tag{7.14}$$

So if you regress y on $x_1$ and $x_2$, $b_1$ will be an unbiased estimator of $\beta_1$ and $b_2$ will be an unbiased estimator of 0, provided that the Gauss-Markov conditions are satisfied. Effectively, you are discovering for yourself that $\beta_2$ is 0. If you had realized beforehand that $\beta_2$ is 0, you would have been able to exploit this information to exclude $x_2$ and use simple regression, which in this context is more efficient.

The loss of efficiency caused by including $x_2$ when it ought not to be included depends on the correlation between $x_1$ and $x_2$. Compare the variances of $b_1$ using simple and multiple regression (Table 7.2). The variance will in general be larger in the case of multiple regression, and the difference will be the greater the closer the correlation coefficient is to plus or minus 1. The one exception to the loss of efficiency occurs when the correlation coefficient happens to be exactly equal to 0. In this case the estimator $b_1$ for multiple regression will be identical to that for simple regression. The proof of this will be left as an easy exercise.

There is one exception to the unbiasedness conclusion, which ought to be kept in mind. If $x_2$ is correlated with u, the regression coefficients will be biased after all. Writing the model as (7.14), this amounts to the fourth Gauss-Markov condition not being satisfied with respect to $x_2$.

TABLE 7.2

Simple regression: $\sigma_{b_1}^2 = \dfrac{\sigma_u^2}{n\,Var(x_1)}$
Multiple regression: $\sigma_{b_1}^2 = \dfrac{\sigma_u^2}{n\,Var(x_1)} \times \dfrac{1}{1 - r_{x_1 x_2}^2}$

Example

The regression output shows the results of regressions of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household, using a sample of 869 households in the 1995 Consumer Expenditure Survey. In the second regression, LGHOUS, the logarithm of annual expenditure on housing services, has been added. It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from 0. It is however highly correlated with LGEXP (correlation coefficient 0.81), and also, to a lesser extent, with LGSIZE (correlation coefficient 0.33). Its inclusion does not cause the coefficients of those variables to be biased but it does increase their standard errors, particularly that of LGEXP, as you would expect, given the loss of efficiency.

. reg lgfdho lgexp lgsize

  Source |       SS       df       MS              Number of obs =     868
---------+------------------------------           F(  2,   865) =  460.92
   Model |  138.776549     2  69.3882747           Prob > F      =  0.0000
Residual |  130.219231   865  .150542464           R-squared     =  0.5159
---------+------------------------------           Adj R-squared =  0.5148
   Total |  268.995781   867  .310260416           Root MSE      =    .388

------------------------------------------------------------------------------
  lgfdho |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   lgexp |   .2866813   .0226824     12.639   0.000     .2421622    .3312003
  lgsize |   .4854698   .0255476     19.003   0.000     .4353272    .5356124
   _cons |   4.720269   .2209996     21.359   0.000     4.286511    5.154027
------------------------------------------------------------------------------

. reg lgfdho lgexp lgsize lghous

  Source |       SS       df       MS              Number of obs =     868
---------+------------------------------           F(  3,   864) =  307.22
   Model |  138.841976     3  46.2806586           Prob > F      =  0.0000
Residual |  130.153805   864  .150640978           R-squared     =  0.5161
---------+------------------------------           Adj R-squared =  0.5145
   Total |  268.995781   867  .310260416           Root MSE      =  .38812

------------------------------------------------------------------------------
  lgfdho |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   lgexp |   .2673552   .0370782      7.211   0.000     .1945813     .340129
  lgsize |   .4868228   .0256383     18.988   0.000     .4365021    .5371434
  lghous |   .0229611   .0348408      0.659   0.510    -.0454214    .0913436
   _cons |   4.708772   .2217592     21.234   0.000     4.273522    5.144022
------------------------------------------------------------------------------

Exercises

7.4* A social scientist thinks that the level of activity in the shadow economy, Y, depends either positively on the level of the tax burden, X, or negatively on the level of government expenditure to discourage shadow economy activity, Z. Y might also depend on both X and Z. International cross-section data on Y, X, and Z, all measured in U.S. $ millions, are obtained for a sample of 30 industrialized countries and a second sample of 30 developing countries. The social scientist regresses (1) log Y on log X alone, (2) log Y on log Z alone, and (3) log Y on both log X and log Z, for each sample, with the following results (standard errors in parentheses):

                Industrialized Countries         Developing Countries
               (1)       (2)       (3)          (1)       (2)       (3)
log X         0.699       -       0.201        0.806       -       0.727
             (0.154)             (0.112)      (0.137)             (0.090)
log Z           -      -0.646    -0.053          -      -0.091     0.427
                       (0.162)   (0.124)                (0.117)   (0.116)
constant     -1.137    -1.065     1.230       -1.122    -1.024     2.824
             (0.863)   (1.069)   (0.896)      (0.873)   (0.858)   (0.835)
$R^2$         0.44      0.10      0.01         0.71      0.70      0.33

X was positively
correlated with Z in both samples. Having carried out the appropriate statistical tests, write a short report advising the social scientist how to interpret these results.

7.4 Proxy Variables

It frequently happens that you are unable to obtain data on a variable that you would like to include in a regression equation. Some variables, like socioeconomic status and quality of education, are so vaguely defined that it may be impossible even in principle to measure them. Others might be measurable, but require so much time and energy that in practice they have to be abandoned. Sometimes you are frustrated because you are using survey data collected by someone else, and an important variable (from your point of view) has been omitted.

Whatever the reason, it is usually a good idea to use a proxy for the missing variable, rather than leave it out entirely. For socioeconomic status, you might use income as a substitute, if data on it are available. For quality of education, you might use the staff-student ratio or expenditure per student. For a variable omitted in a survey, you will simply have to look at the data actually collected to see if there is a suitable proxy.

There are two good reasons for trying to find a proxy. First, if you simply leave the variable out, your regression is likely to suffer from omitted variable bias of the type described in Section 7.2, and the statistical tests will be invalidated. Second, the results from your proxy regression may indirectly shed light on the variable for which it is a substitute.

Suppose that the true model is

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u \tag{7.15}$$

Suppose that we have no data for $x_1$, but another variable z is an ideal proxy for it in the sense that there exists an exact linear relationship between $x_1$ and z:

$$x_1 = \lambda + \mu z \tag{7.16}$$

$\lambda$ and $\mu$ being fixed, but unknown, constants. (Note that if $\lambda$ and $\mu$ were known, we could calculate $x_1$ from z, and so there would be no need to use z as a proxy for it. Note further that we cannot estimate $\lambda$ and $\mu$ by regression analysis, because to do that we need data on $x_1$.)

Substituting for $x_1$ from (7.16) into (7.15), the model may be rewritten

$$y = \alpha + \beta_1(\lambda + \mu z) + \beta_2 x_2 + \dots + \beta_k x_k + u = \alpha + \beta_1\lambda + \beta_1\mu z + \beta_2 x_2 + \dots + \beta_k x_k + u \tag{7.17}$$

The model is now formally correctly specified in terms of observable variables, and if we fit it, the following results will obtain:

1. The coefficients of $x_2, \dots, x_k$ will be the same as if $x_1$ had been used instead of z.
2. The standard errors and t statistics of the coefficients of $x_2, \dots, x_k$ will be the same as if $x_1$ had been used instead of z.
3. $R^2$ will be the same as if $x_1$ had been used instead of z.
4. The coefficient of z will be an estimate of $\beta_1\mu$, and so it will not be possible to obtain an estimate of $\beta_1$ unless you are able to guess the value of $\mu$.
5. However, the t statistic for z will be the same as that which would have been obtained for $x_1$, and so you are able to assess the significance of $x_1$, even if you are not able to estimate its coefficient.
6. It will not be possible to obtain an estimate of $\alpha$, since the intercept is now $\alpha + \beta_1\lambda$, but usually the intercept is of secondary interest anyway.

With regard to the fourth point, suppose that you are investigating the brain drain from country A to country B and you are using the (very naive) model

$$y = \alpha + \beta x + u \tag{7.18}$$

where y is the rate of migration of a certain type of worker from A to B, and x is the ratio of the wage rate in B to the wage rate in A. You think that the higher the relative wage rate, the higher the migration. But suppose that you only have data on GDP per capita, not wages. You might define a proxy variable p which is the ratio of GDP in B to GDP in A. In this case it might be reasonable to assume, as a first approximation, that relative wages are proportional to relative GDP. If the relationship were exactly true, one could write (7.16) with $\lambda$ equal to 0 and $\mu$ equal to 1. In this case the coefficient of relative GDP would yield a direct estimate of the coefficient of relative wages. Since variables in regression analysis are frequently defined in relative terms, this special case actually has quite a wide application.

In this discussion we have assumed that z is an ideal proxy for $x_1$, and the validity of all the foregoing results depends on this condition. In practice it is unusual to find a proxy which is exactly linearly related to the missing variable, but if the relationship is close the results will hold approximately. A major problem is posed by the fact that there is never any means of testing whether the condition is or is not approximated satisfactorily. One has to justify the use of the proxy subjectively.

Example

The main determinants of educational attainment appear to be the cognitive ability of an individual and the support and motivation provided by the family background. The NLSY data set is exceptional in that cognitive ability measures are available for virtually all the respondents, the data being obtained when the Department of Defense, needing to re-norm the Armed Services Vocational Aptitude Battery scores, sponsored the administration of the tests. However, there are no data which bear directly on support and motivation provided by the family background. This factor is difficult to define and probably has several dimensions. Accordingly, it is unlikely that a single proxy could do justice to it.

Parental educational attainment is commonly used as a proxy, but it is obviously imperfect. Another possible proxy is the number of siblings of the respondent, the rationale being that parents who are ambitious for their children tend to limit the family size in order to concentrate resources. The NLSY data set also contains three dummy variables specifically intended to capture family background effects: whether anyone in the family possessed a library card, whether anyone in the family bought magazines, and whether anyone in the family bought newspapers, when the respondent was aged 14. However the explanatory power of these variables appears to be very limited.

The regression output shows the results of regressing HGC on ASVABC only and on ASVABC, parental education, number of siblings, and the library card dummy variable. ASVABC is positively correlated with HGCM, HGCF, and LIBRARY (correlation coefficients 0.38, 0.42 and 0.22, respectively), and negatively correlated with SIBLINGS (correlation coefficient -0.19). Its coefficient is therefore unambiguously biased upwards in the first regression. However, there may still be an element of bias in the second, given the weakness of the proxy variables.

. reg hgc asvabc
  Source |       SS       df       MS           Number of obs =     570
---------+------------------------------        F(1, 568)     =  284.89
   Model |  1153.80864     1  1153.80864        Prob > F      =  0.0000
Residual |  2300.43873   568  4.05006818        R-squared     =  0.3340
---------+------------------------------        Adj R-squared =  0.3329
   Total |  3454.24737   569  6.07073351        Root MSE      =  2.0125
------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1545378   .0091559   16.879    0.000     .1365543    .1725213
   _cons |   5.770845   .4668473   12.361    0.000     4.853888    6.687803
------------------------------------------------------------------------------

. reg hgc asvabc hgcm hgcf library siblings
  Source |       SS       df       MS           Number of obs =     570
---------+------------------------------        F(5, 564)     =   66.87
   Model |  1285.58208     5  257.116416        Prob > F      =  0.0000
Residual |  2168.66529   564  3.84515122        R-squared     =  0.3722
---------+------------------------------        Adj R-squared =  0.3666
   Total |  3454.24737   569  6.07073351        Root MSE      =  1.9609
------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1277852    .010054   12.710    0.000     .1080373     .147533
    hgcm |   .0619975   .0427558    1.450    0.148    -.0219826    .1459775
    hgcf |   .1045035   .0314928    3.318    0.001      .042646     .166361
 library |   .1151269   .1969844    0.584    0.559    -.2717856    .5020394
siblings |  -.0509486    .039956   -1.275    0.203    -.1294293     .027532
   _cons |   5.236995   .5665539    9.244    0.000     4.124181    6.349808
------------------------------------------------------------------------------

Unintentional Proxies

It sometimes happens that you use a proxy without realizing it. You think that y depends upon z, but in reality it depends upon x. If the correlation between z and x is low, the results will be poor, so you may realize that something is wrong, but, if the correlation is good, the results may appear to be satisfactory ($R^2$ up to the anticipated level, etc.) and you may remain blissfully unaware that the relationship is bogus.

Does this matter? Well, it depends on why you are running the regression in the first place. If the purpose of fitting the regression line is to predict future values of y, the use of a proxy will not matter much, provided of course that the correlation remains high and was not a statistical fluke in the sample period. However if your intention is to use the explanatory variable as a policy instrument for influencing the dependent variable, the consequences could be disastrous. Unless there happens to be a functional connection between the proxy and the true explanatory variable, manipulating the proxy will have no effect at all on the dependent variable. If the motive for your regression is scientific curiosity, the outcome is equally unsatisfactory.

Unintentional proxies are especially common in time series analysis, particularly in macroeconomic models. If the true explanatory variable is subject to a time trend, you will probably get a good fit if you substitute (intentionally or otherwise) any other variable with a time trend. Even if you relate changes in your dependent variable to changes in your explanatory variable, you are likely to get similar results whether you are using the correct explanatory variable or a proxy, since macroeconomic variables tend to change in concert over the trade cycle.

Exercise

7.5 Length of work experience is generally found to be an important determinant of earnings. The data set does not contain this variable, but TENURE, tenure with the current employer, could be taken as a proxy. An alternative is to calculate PWE, years of potential work experience, as a proxy. This is defined to be current age, AGE, less age of completion of full-time education. The latter can be estimated as years of schooling plus 5, assuming that schooling begins at the age of 6. Hence PWE = AGE - HGC - 5. Regress LGEARN on HGC, ASVABC, MALE, ETHBLACK, ETHHISP and PWE. Compare the results with the corresponding regression without PWE. You are likely to find that the coefficient of HGC is greater than before. Can you explain why?

The data set includes TENURE, tenure with current employer. This allows one to divide PWE into two components: PWEBEF and TENURE. Define PWEBEF as PWEBEF = PWE - TENURE and regress LGEARN on the variables as before, replacing PWE by PWEBEF and TENURE. Compare the result with that of the previous regression.

Variation: PWE is not likely to be a satisfactory proxy for work experience for females because it does not take into account time spent out of the labor force rearing children. Investigate this by running the regressions with PWE for the male and female subsamples separately (observations 1-325 and 326-570, respectively). You must drop the MALE dummy from the specification (explain why). Do the same for the regressions with PWEBEF and TENURE.

7.5 Testing a Linear Restriction

In Section 4.4 it was demonstrated that you can reduce the number of explanatory variables in a regression equation by one if you believe that there exists a
linear relationship between the parameters in it. By exploiting the information about the relationship, you will make the regression estimates more efficient. If there was previously a problem of multicollinearity, it may be alleviated. Even if the original model was not subject to this problem, the gain in efficiency may yield a welcome improvement in the precision of the estimates, as reflected by their standard errors.

The example discussed in Section 4.4 was an educational attainment model with HGC related to ASVABC, HGCM, and HGCF:

. reg hgc asvabc hgcm hgcf
  Source |       SS       df       MS           Number of obs =     570
---------+------------------------------        F(3, 566)     =  110.83
   Model |  1278.24153     3  426.080508        Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329        R-squared     =  0.3700
---------+------------------------------        Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351        Root MSE      =  1.9607
------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295006   .0099544   13.009    0.000     .1099486    .1490527
    hgcm |    .069403   .0422974    1.641    0.101     -.013676     .152482
    hgcf |   .1102684   .0311948    3.535    0.000     .0489967    .1715401
   _cons |   4.914654   .5063527    9.706    0.000     3.920094    5.909214
------------------------------------------------------------------------------

Somewhat surprisingly, the coefficient of HGCM is not significant, even at the 5% level, using a one-tailed test. However assortative mating leads to a high correlation between HGCM and HGCF and the regression appeared to be suffering from multicollinearity. We then hypothesized that mother's and father's education are equally important, allowing us to impose the restriction $\beta_2 = \beta_3$ and rewrite the equation as

$$\mathit{HGC} = \alpha + \beta_1 \mathit{ASVABC} + \beta_2(\mathit{HGCM} + \mathit{HGCF}) + u = \alpha + \beta_1 \mathit{ASVABC} + \beta_2 \mathit{HGCP} + u \tag{7.19}$$

where HGCP was the sum of HGCM and HGCF.

. g hgcp = hgcm + hgcf
. reg hgc asvabc hgcp
  Source |       SS       df       MS           Number of obs =     570
---------+------------------------------        F(2, 567)     =  166.22
   Model |  1276.73764     2  638.368819        Prob > F      =  0.0000
Residual |  2177.50973   567  3.84040517        R-squared     =  0.3696
---------+------------------------------        Adj R-squared =  0.3674
   Total |  3454.24737   569  6.07073351        Root MSE      =  1.9597
------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295653   .0099485   13.024    0.000     .1100249    .1491057
    hgcp |    .093741   .0165688    5.658    0.000     .0611973    .1262847
   _cons |   4.823123   .4844829    9.955    0.000     3.871523    5.774724
------------------------------------------------------------------------------

The standard error of HGCP is much smaller than those of HGCM and HGCF, indicating that the use of the restriction has led to a gain in efficiency, and as a consequence the t statistic is very high. Thus the problem of multicollinearity has been eliminated. However, we are obliged to test the validity of the restriction, and there are two equivalent procedures.

F test of a restriction

Run the regression in both the restricted and the unrestricted forms and denote the sum of the squares of the residuals as $RSS_R$ in the restricted case and $RSS_U$ in the unrestricted case. Since the imposition of the restriction makes it more difficult to fit the regression equation to the data, $RSS_R$ cannot be less than $RSS_U$ and will in general be greater. We would like to test whether the improvement of the fit on going from the restricted to the unrestricted version is significant. If it is, the restriction should be rejected.

For this purpose we can use an F test whose structure is the same as that described in Section 4.5:

$$F = \frac{\text{Improvement in fit} \,/\, \text{Extra degrees of freedom used up}}{\text{Residual sum of squares remaining} \,/\, \text{Degrees of freedom remaining}} \tag{7.20}$$

In this case the improvement on going from the restricted to the unrestricted version is $(RSS_R - RSS_U)$, 1 extra degree of freedom is used up in the unrestricted version (because there is one more parameter to estimate), and the residual sum of squares remaining after the shift from the restricted to the unrestricted version is $RSS_U$. Hence the F statistic is in this case

$$F(1, n-k-1) = \frac{RSS_R - RSS_U}{RSS_U/(n-k-1)} \tag{7.21}$$

where k is the number of explanatory variables in the unrestricted version. It is distributed with 1 and n-k-1 degrees of freedom under the null hypothesis that the restriction is valid.

In the case of the educational attainment function, the null hypothesis was $H_0: \beta_2 = \beta_3$, where $\beta_2$ is the coefficient of HGCM and $\beta_3$ is the coefficient of HGCF. The residual sum of squares was 2177.51 in the restricted version and 2176.01 in the unrestricted version. Hence the F statistic was

$$F(1, n-k-1) = \frac{2177.51 - 2176.01}{2176.01/566} = 0.39 \tag{7.22}$$

Since the F statistic is less than 1, it is not significant at any significance level and we do not reject the null hypothesis that the coefficients of HGCM and HGCF are equal.

t test of a restriction

Linear restrictions can also be tested using a t test. This involves writing down the model for the restricted version and adding the term that would convert it back to the unrestricted version. The test evaluates whether this additional term is needed. To find the conversion term, we write the restricted version of the model under the unrestricted version and subtract.

$$\mathit{HGC} = \alpha + \beta_1 \mathit{ASVABC} + \beta_2 \mathit{HGCM} + \beta_3 \mathit{HGCF} + u \tag{7.23}$$

$$\mathit{HGC} = \alpha + \beta_1 \mathit{ASVABC} + \beta_2 \mathit{HGCP} + u \tag{7.24}$$

$$0 = \beta_2 \mathit{HGCM} + \beta_3 \mathit{HGCF} - \beta_2 \mathit{HGCP} = \beta_2 \mathit{HGCM} + \beta_3 \mathit{HGCF} - \beta_2(\mathit{HGCM} + \mathit{HGCF}) = (\beta_3 - \beta_2)\mathit{HGCF} \tag{7.25}$$

We add this term to the restricted model and investigate whether it is needed.

$$\mathit{HGC} = \alpha + \beta_1 \mathit{ASVABC} + \beta_2 \mathit{HGCP} + (\beta_3 - \beta_2)\mathit{HGCF} + u \tag{7.26}$$

The null hypothesis is that the coefficient of the conversion term is 0, and the alternative hypothesis is that it is different from 0. Of course the null hypothesis is that the restriction is valid. If it is valid, the conversion term is not needed, and the restricted version is an adequate representation of the data.

Here is the corresponding regression for the educational attainment example. We see that the coefficient of HGCF is not significantly different from 0, indicating that the term is not needed and that the restricted version is an adequate representation of the data.

. reg hgc asvabc hgcp hgcf
  Source |       SS       df       MS           Number of obs =     570
---------+------------------------------        F(3, 566)     =  110.83
   Model |  1278.24153     3  426.080508        Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329        R-squared     =  0.3700
---------+------------------------------        Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351        Root MSE      =  1.9607
------------------------------------------------------------------------------
     hgc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asvabc |   .1295006   .0099544   13.009    0.000     .1099486    .1490527
    hgcp |    .069403   .0422974    1.641    0.101     -.013676     .152482
    hgcf |   .0408654   .0653386    0.625    0.532    -.0874704    .1692012
   _cons |   4.914654   .5063527    9.706    0.000     3.920094    5.909214
------------------------------------------------------------------------------

Why is the t test approach equivalent to that of the F test? Well, the F test tests the improvement in fit when you go from the restricted version to the unrestricted version. This is accomplished by adding the extra term, but, as we know, an F test on the improvement in fit when you add an extra term is equivalent to the t test on the coefficient of that term (see Section 4.5).

Exercises

7.6 You will have found in Exercise 7.5 that the estimates of the coefficients of PWEBEF and TENURE are different. This raises the issue of whether the difference is due to random factors or whether they are significantly different. Set up the null hypothesis $H_0: \delta_1 = \delta_2$, where $\delta_1$ is the coefficient of PWEBEF and $\delta_2$ is the coefficient of TENURE. Explain why the regression with PWE is the correct specification if $H_0$ is true, while the regression with PWEBEF and TENURE should be used if $H_0$ is false. Perform an F test of the restriction using RSS for the two regressions. Do this for the combined sample and also for males and females separately.

7.7* The first regression shows the result of regressing LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household, using a sample of 869 households in the 1995 Consumer Expenditure Survey. In the second regression, LGFDHOPC, the logarithm of food expenditure per capita (FDHO/SIZE), is regressed on LGEXPPC, the logarithm of total expenditure per capita (EXP/SIZE). In the third regression LGFDHOPC is regressed on LGEXPPC and LGSIZE.

. reg lgfdho lgexp lgsize
  Source |       SS       df       MS           Number of obs =     868
---------+------------------------------        F(2, 865)     =  460.92
   Model |  138.776549     2  69.3882747        Prob > F      =  0.0000
Residual |  130.219231   865  .150542464        R-squared     =  0.5159
---------+------------------------------        Adj R-squared =  0.5148
   Total |  268.995781   867  .310260416        Root MSE      =    .388
------------------------------------------------------------------------------
  lgfdho |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   lgexp |   .2866813   .0226824   12.639    0.000     .2421622    .3312003
  lgsize |   .4854698   .0255476   19.003    0.000     .4353272    .5356124
   _cons |   4.720269   .2209996   21.359    0.000     4.286511    5.154027
------------------------------------------------------------------------------

. reg lgfdhopc lgexppc
  Source |       SS       df       MS           Number of obs =     868
---------+------------------------------        F(1, 866)     =  313.04
   Model |  51.4364364     1  51.4364364        Prob > F      =  0.0000
Residual |  142.293973   866  .164311747        R-squared     =  0.2655
---------+------------------------------        Adj R-squared =  0.2647
   Total |   193.73041   867  .223449146        Root MSE      =  .40535
------------------------------------------------------------------------------
lgfdhopc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
 lgexppc |    .376283   .0212674   17.693    0.000     .3345414    .4180246
   _cons |   3.700667   .1978925   18.700    0.000     3.312262    4.089072
------------------------------------------------------------------------------

. reg lgfdhopc lgexppc lgsize
  Source |       SS       df       MS           Number of obs =     868
---------+------------------------------        F(2, 865)     =  210.94
   Model |  63.5111811     2  31.7555905        Prob > F      =  0.0000
Residual |  130.219229   865  .150542461        R-squared     =  0.3278
---------+------------------------------        Adj R-squared =  0.3263
   Total |   193.73041   867  .223449146        Root MSE      =    .388
------------------------------------------------------------------------------
lgfdhopc |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
 lgexppc |   .2866813   .0226824   12.639    0.000     .2421622    .3312004
  lgsize |  -.2278489   .0254412   -8.956    0.000    -.2777826   -.1779152
   _cons |   4.720269   .2209996   21.359    0.000     4.286511    5.154027
------------------------------------------------------------------------------

1. Explain why the second model is a restricted version of the first, stating the restriction.
2. Perform an F test of the restriction.
3. Perform a t test of the restriction.
4. Summarize your conclusions from the analysis of the regression results.

7.8 In his classic article (Nerlove, 1963), Nerlove derives the following cost function for electricity generation:

$$C = \alpha Y^{\beta} P_1^{\gamma_1} P_2^{\gamma_2} P_3^{\gamma_3} v,$$

where C is total production cost, Y is output (measured in kilowatt hours), $P_1$ is the price of labor input, $P_2$ is the price of capital input, $P_3$ is the price of fuel (all measured in appropriate units), and v is a disturbance term. Theoretically, the sum of the price elasticities should be 1:

$$\gamma_1 + \gamma_2 + \gamma_3 = 1$$

and hence the cost function may be rewritten

$$\frac{C}{P_3} = \alpha Y^{\beta} \left(\frac{P_1}{P_3}\right)^{\gamma_1} \left(\frac{P_2}{P_3}\right)^{\gamma_2} v$$

The two versions of the cost function are fitted to the 29 medium-sized firms in Nerlove's sample, with the following results (standard errors in parentheses):

$$\log \hat{C} = \underset{(1.62)}{-4.93} + \underset{(0.11)}{0.94} \log Y + \underset{(0.23)}{0.31} \log P_1 - \underset{(0.29)}{0.26} \log P_2 + \underset{(0.07)}{0.44} \log P_3 \qquad RSS = 0.336$$

$$\widehat{\log (C/P_3)} = \underset{(0.16)}{-6.55} + \underset{(0.11)}{0.91} \log Y + \underset{(0.23)}{0.51} \log (P_1/P_3) + \underset{(0.19)}{0.09} \log (P_2/P_3) \qquad RSS = 0.364$$

Compare the regression results for the two equations and perform a test of the validity of the restriction.

7.6 Getting the Most Out of Your Residuals

There are two ways of looking at the residuals obtained after fitting a regression equation to a set of data. If you are pessimistic and passive, you will simply see them as evidence of failure. The bigger the residuals, the worse is your fit, and the smaller is $R^2$. The whole object of the exercise is to fit the regression equation in such a way as to minimize the sum of the squares of the residuals. However, if you are enterprising, you will also see the residuals as a potentially fertile source of new ideas, perhaps even new hypotheses. They offer both a challenge and constructive criticism. The challenge is the one that provides the stimulus for most scientific research: evidence of the need to find a better explanation of the facts. The constructive criticism comes in because the residuals, taken individually, indicate when and where and by how much the existing model is failing to fit the facts.

Taking advantage of this constructive criticism requires patience on the part of the researcher. If the sample is small enough, you should look carefully at every observation with a large positive or negative residual, and try to hypothesize explanations for them. Some of these explanations may involve special factors specific to the observations in question. These are not of much use to the theorist. Other factors, however, may appear to be associated with the residuals in several observations. As soon as you detect a regularity of this kind, you have the makings of progress. The next step is to find a sensible way of quantifying the factor and of including it in the model.

8 HETEROSCEDASTICITY

Medicine is traditionally divided into the three branches of anatomy, physiology, and pathology: what a body is made of, how it works, and what can go wrong with it. It is time to start discussing the pathology of least squares regression analysis. The properties of the estimators of the regression coefficients depend on the properties of the disturbance term in the regression model. In this and the next two chapters we shall be looking at some of the problems that arise when the Gauss-Markov conditions, the assumptions relating to the disturbance term, are not satisfied.

8.1 Heteroscedasticity and Its Implications

In this chapter we will be concerned with the assumption that "the variance of the disturbance term in each observation should be constant". This sounds peculiar and needs a bit of explanation. The disturbance term in each observation has only one value, so what can be meant by its "variance"?

What we are talking about is its potential behavior before the sample is generated. When we write the model

$$y = \alpha + \beta x + u, \tag{8.1}$$

the first two Gauss-Markov conditions state that the disturbance terms $u_1, u_2, \dots, u_n$ in the n observations potentially come from probability distributions that have 0 mean and the same variance. Their actual values in the sample will sometimes be positive, sometimes negative, sometimes relatively far from 0, sometimes relatively close, but there will be no a priori reason to anticipate a particularly erratic value in any given observation. To put it another way, the probability of u reaching a given positive (or negative) value will be the same in all observations. This condition is known as homoscedasticity, which means "same dispersion".

[Figure 8.1. Homoscedasticity]

© C. Dougherty 2000. All rights reserved. Copies may be made for personal use. Version of 12 January 2000.

Figure 8.1 provides an illustration of homoscedasticity. To keep the diagram simple, the sample contains only five observations. Let us start with the first observation, where x has the value $x_1$. If there were no disturbance term in the model, the observation would be represented by the circle vertically above $x_1$ on the line $y = \alpha + \beta x$. The effect of the disturbance term is to shift the observation upwards or downwards vertically. The potential distribution of the disturbance term, before the observation has been generated, is shown by the normal distribution centered on the circle. The actual value of the disturbance term for this observation turned out to be negative, the observation being represented by the solid marker. The potential distribution of the disturbance term, and the actual outcome, are shown in a similar way for the other four observations.

Although homoscedasticity is often taken for granted in regression analysis, in some contexts it may be more reasonable to suppose that the potential distribution of the disturbance term is different for different observations in the sample. This is illustrated in Figure 8.2, where the variance of the potential distribution of the disturbance term is increasing as x increases. This does not mean that the disturbance term will necessarily have a particularly large (positive or negative) value in an observation where x is large, but it does mean that the a priori probability of having an erratic value will be relatively high. This is an example of heteroscedasticity, which means "differing dispersion". Mathematically, homoscedasticity and heteroscedasticity may be defined:

Homoscedasticity:    $\sigma_{u_i}^2 = \sigma_u^2$, constant for all observations i
Heteroscedasticity:  $\sigma_{u_i}^2 = \sigma_i^2$, not the same for all observations i

[Figure 8.2. Heteroscedasticity]

[Figure 8.3. Model with a heteroscedastic disturbance term]

Figure 8.3 illustrates how a typical scatter diagram would look if y were an increasing function of x and the heteroscedasticity were of the type shown in Figure 8.2. You can see that, although the observations are not necessarily further away from the nonstochastic component of the relationship, represented by the line $y = \alpha + \beta x$, there is a tendency for their dispersion to increase as x increases. (You should be warned that heteroscedasticity is not necessarily of the type shown in Figures 8.2 and 8.3. The term refers to any case in which the variance of the probability distribution of the disturbance term is not the same in all observations.)

Why does heteroscedasticity matter? This particular Gauss-Markov condition does not appear to have been used anywhere in the analysis so far, so it might seem almost irrelevant. In particular, the proofs of the unbiasedness of the OLS regression coefficients did not use this condition.

There are two reasons why it matters. The first concerns the variances of the regression coefficients. You want these to be as small as possible so that, in a probabilistic sense, you have maximum precision. If there is no heteroscedasticity, and if the other Gauss-Markov conditions are satisfied, the OLS regression coefficients have the lowest variances of all the unbiased estimators that are linear functions of the observations of y. If heteroscedasticity is present, the OLS estimators are inefficient because you could, at least in principle, find other estimators that have smaller variances and are still unbiased.

The second, equally important, reason is that the estimators of the standard errors of the regression coefficients will be wrong. They are computed on the assumption that the distribution of the disturbance term is homoscedastic. If this is not the case, they are biased, and as a consequence the t tests, and also the usual F tests, are invalid. It is quite likely that the standard errors will be underestimated, so the t statistics will be overestimated and you will have a misleading impression of the precision of your regression coefficients. You may be led to believe that a coefficient is significantly different from 0, at a given significance level, when in fact it is not.

The inefficiency property can be explained intuitively quite easily. Suppose that heteroscedasticity of the type displayed in Figures 8.2 and 8.3 is present. An observation where the potential distribution of the disturbance term has a small standard deviation, like the first observation in Figure 8.2, will tend to lie close to the line $y = \alpha + \beta x$ and hence will be a good guide to the location of that line. By contrast, an observation where the potential distribution has a large standard deviation, like that for the fifth observation in Figure 8.2, will be an unreliable guide to the location of the line. OLS does not discriminate between the quality of the observations, giving equal weight to each, irrespective of whether they are good or poor guides to the location of the line. It follows that if we can find a way of giving more weight to the high-quality observations and less to the unreliable ones, we are likely to obtain a better fit. In other words, our estimators of $\alpha$ and $\beta$ will be more efficient. We shall see how to do this below.

Possible Causes of Heteroscedasticity

Heteroscedasticity is likely to be a problem when the values of the variables in the regression equation vary substantially in different observations. If the true relationship is given by $y = \alpha + \beta x + u$, it may well be the case that the variations in the omitted variables and the measurement errors that are jointly responsible for the disturbance term will be relatively small when y and x are small and large when they are large, economic variables tending to move in size together.

For example, suppose that you are using the simple regression model to investigate the relationship between value added in manufacturing, MANU, and gross domestic product, GDP, in different countries, and that you have collected the sample of observations given in Table 8.1, which includes small economies such as Slovenia and Slovakia as well as large ones like France, the UK, and Italy. Manufacturing output tends to account for 15 to 25 percent of GDP, variations being caused by comparative advantage and historical economic development. Clearly, when GDP is large, a 1 percent variation will make a great deal more difference, in absolute terms, than when it is small.

[Figure 8.4. Manufacturing output and GDP. Vertical axis: manufacturing output (U.S. $ millions); horizontal axis: GDP (U.S. $ millions); labeled observations: South Korea, Singapore, Mexico, Greece.]

TABLE 8.1
Manufacturing Value Added (MANU), GDP, and Population (POP) for a Sample of Countries, 1994

Country          MANU       GDP      POP   MANU/POP   GDP/POP
Belgium         44517    232006    10093       4411     22987
Canada         112617    547203    29109       3869     18798
Chile           13096     50919    13994        936      3639
Denmark         25927    151266     5207       4979     29050
Finland         21581     97624     5085       4244     19199
France         256316   1330998    57856       4430     23005
Greece           9392     98861    10413        902      9494
Hong Kong       11758    130823     6044       1945     21645
Hungary          7227     41506    10162        711      4084
Ireland         17572     52662     3536       4970     14893
Israel          11349     74121     5362       2117     13823
Italy          145013   1016286    57177       2536     17774
Korea, S.      161318    380820    44501       3625      8558
Kuwait           2797     24848     1754       1595     14167
Malaysia        18874     72505    19695        958      3681
Mexico          55073    420788    89564        615      4698
Netherlands     48595    334286    15382       3159     21732
Norway          13484    122926     4314       3126     28495
Portugal        17025     87352     9824       1733      8892
Singapore       20648     71039     3268       6318     21738
Slovakia         2720     13746     5325        511      2581
Slovenia         4520     14386     1925       2348      7473
Spain           80104    483652    39577       2024     12221
Sweden          34806    198432     8751       3977     22675
Switzerland     57503    261388     7104       8094     36794
Syria            3317     44753    13840        240      3234
Turkey          31115    135961    59903        519      2270
UK             244397   1024609    58005       4213     17664

Source: UNIDO Yearbook 1997
Note: MANU and GDP are measured in U.S. $ millions. POP is measured in thousands. MANU/POP and GDP/POP are measured in U.S. $.

South Korea and Mexico are both countries with relatively large GDP. The manufacturing sector is relatively important in South Korea, so its observation is far above the trend line. The opposite was the case for Mexico, at least in 1997. Singapore and Greece are another pair of countries with relatively large and small manufacturing sectors. However, because the GDP of both countries is small, their variations from the trend relationship are also small.

8.2 Detection of Heteroscedasticity

In principle there is no limit to the possible different types of heteroscedasticity and accordingly a large number of different tests appropriate for different circumstances have been proposed. We will confine our attention to three tests which hypothesize a relationship between the variance of the disturbance term and the size of the explanatory variable(s): the Spearman rank correlation test, the Goldfeld-Quandt test, and the Glejser test.

The Spearman Rank Correlation Test

The Spearman rank correlation test assumes that the variance of the disturbance term is either increasing or decreasing as x increases and that therefore there will be a correlation between the absolute size of the residuals and the size of x in an OLS regression. The data on x and the residuals are both ranked, and the rank correlation coefficient is defined as

$$r_{x,e} = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} \tag{8.2}$$

where $D_i$ is the difference between the rank of x and the rank of e in observation i. Under the assumption that the population correlation coefficient is 0, the rank correlation coefficient has a normal distribution with 0 mean and variance 1/(n-1) in large samples. The appropriate test statistic is therefore $r_{x,e}\sqrt{n-1}$ and the null hypothesis of homoscedasticity will be rejected at the 5% level if its absolute value is greater than 1.96 and at the 1% level if its absolute value is great
erthan2.58,usingtwo-tailedtests.Ifthereismorethanoneexplanatoryvariableinthemodel,thetestmaybeperformedwithanyoneofthem.ExampleUsingthedatainTable8.1,anOLSregressionofmanufacturingoutputonGDPyieldsthefollowingresult(standarderrorsinparentheses):2MAˆNU=604+0.194GDPR=0.89(8.3)(5700)(0.013)Thisimpliesthatmanufacturingaccountsfor$194,000outofevery$1millionincreaseinGDPinthecrosssection.TheresidualsfromtheregressionandGDParebothrankedinTable8.2andDiand2Diarecomputed.Thesumofthelattercameto1608.Therankcorrelationcoefficientisthus6´16081-=0.56(8.4)28´783andtheteststatisticis0.5627=2.91.Thisisabove2.58andhencethenullhypothesisofhomoscedasticityisrejectedatthe1percentlevel.\nHETEROSCEDASTICITY7TABLE8.222GDPRank|e|RankDDGDPRank|e|RankDD1374615472-11130823151418523-86414386211304-241359611641761241624848326208-5251512661739761163641506414175-1119843218423314416447535595515-1010023200619102531625650919626299-392613882062701739526627676819-12144334286211675824-39710398628418-10100380820228695228-636725059422713-416420788232703425-24741211036111000483652241418022248735211499110100547203256024169819762412206766361016286265243927-1198861131036020-749102460927453332611122926141092921-7491330998282093721441TheGoldfeld-QuandtTestPerhapsthemostcommonformaltestforheteroscedasticityisthatofGoldfeldandQuandt(1965).Itassumesthatsi,thestandarddeviationoftheprobabilitydistributionofthedisturbanceterm,isproportionaltothevalueofxinthatobservation.ItalsoassumesthatthedisturbancetermisnormallydistributedandsatisfiestheotherGauss-Markovconditions.Thenobservationsinthesampleareorderedbythemagnitudeofxandseparateregressionsarerunforthefirstn'andforthelastn'observations,themiddle(n–2n')observationsbeingdroppedentirely.Ifheteroscedasticityispresent,andiftheassumptionconcerningitsnatureistrue,thevarianceofuinthelastn'observationswillbegreaterthanthatinthefirstn',andthiswillbereflectedinthesumsofthesquaresoftheresidualsinthetwosubregressions.DenotingthesebyRSS1andRSS2forthesubregressionswiththefirstn'a
ndthelastn'observations,respectively,theratioRSS2/RSS1willbedistributedasanFstatisticwith(n'-k-1)and(n'-k-1)degreesoffreedom,wherekisthenumberofexplanatoryvariablesintheequation,underthenullhypothesisofhomoscedasticity.Thepowerofthetestdependsonthechoiceofn'inrelationton.Asaresultofsomeexperimentsundertakenbythem,GoldfeldandQuandtsuggestthatn'shouldbeabout11whennis30andabout22whennis60,suggestingthatn'shouldbeaboutthree-eighthsofn.Ifthereismorethanoneexplanatoryvariableinthemodel,theobservationsshouldbeorderedbythatwhichishypothesizedtobeassociatedwithsi.ThenullhypothesisforthetestisthatRSS2isnotsignificantlygreaterthanRSS1,andthealternativehypothesisisthatitissignficantlygreater.IfRSS2turnsouttobesmallerthanRSS1,youarenotgoingtorejectthenullhypothesisandthereisnopointincomputingtheteststatisticRSS2/RSS1.However,theGoldfeld-Quandttestcanalsobeusedforthecasewherethestandarddeviationofthedisturbancetermishypothesizedtobeinverselyproportionaltoxi.Theprocedureisthesameasbefore,buttheteststatisticisnowRSS1/RSS2,anditwillagainbedistributedasanFstatisticwith(n'-k-1)and(n'-k-1)degreesoffreedomunderthenullhypothesisofhomoscedasticity.\nHETEROSCEDASTICITY8ExampleUsingthedatainTable8.1,OLSregressionswererunusingtheobservationsforthe11countrieswithsmallestGDPandthenforthe11countrieswithlargestGDP.Theresidualsumofsquaresinthefirst66regressionwas157x10,andintheseconditwas13,518x10..TheratioRSS2/RSS1wastherefore86.1.ThecriticalvalueofF(9,9)atthe0.1percentlevelis10.1,andthenullhypothesisofhomoscedasticitywasthereforerejected.TheGlejserTestTheGlejsertestallowsyoutoexplorethenatureoftheheteroscedasticityalittlemoreclosely.Werelaxtheassumptionthatsiisproportionaltoxiandinvestigatewhethersomeotherfunctionalformmaybemoreappropriate,forexamplegsi=a+bxi(8.5)Tousetheprocedure,youregressyonxusingOLSandthenfittheabsolutevaluesoftheresiduals,|e|,tothefunctionforagivenvalueofg.Youmayfitseveralsuchfunctions,varyingthechoiceofg.Ineachcasethenullhypothesisofhomoscedasticitywillberejectediftheesti
mateofbissignificantlydifferentfrom0.Ifmorethanonefunctiongivesrisetoasignificantestimateofb,thatwiththebestfitmaybeaguidetothenatureoftheheteroscedasticity.ExampleUsingthedataonGDPand|e|inTable8.2,(8.5)wasfittedusingvaluesofgfrom–1.0to1.5.TheresultsaresummarizedinTable8.3.Notethatthedifferentestimatesofbarenotcomparablebecausethedefinitionoftheexplanatorygvariable,x,isdifferentineachcase.Significantfits,atthe1percentlevel,areobtainedforthemiddle2threevaluesofg.ThelevelsofRarecomparablebecausethedependentvariableisthesameineachcase.Thebestfitsareobtainedwithgequalto0.25and0.5,sothestandarddeviationofthedistributionofudoesappeartobeincreasingwithGDPbutnotinthesameproportion.TABLE8.32gbs.e.(b)R88-1.0-3.51x101.94x100.1166-0.5-4.21x101.71x100.1966-0.250.56x100.20x100.230.2516405200.280.536.111.80.271.00.0260.0100.21-6-61.519.8x109.3x100.15\nHETEROSCEDASTICITY9Exercise8.1Thefileeduc.dtaintheheteroscedasticdatasetsfolderonthewebsitecontainsinternationalcross-sectiondataonEDUC,aggregateexpenditureoneducation,GDP,grossdomesticproduct,andPOP,population,forasampleof38countriesin1997.EDUCandGDParemeasuredinU.S.$millionsandPOPismeasuredinthousands.Downloadthedataset,plotascatterdiagramofEDUConGDP,andcommentonwhetherthedatasetappearstobesubjecttoheteroscedasticity.SortthedatasetbyGDPandperformaGoldfeld-Quandttestforheteroscedasticity,runningregressionsusingthesubsamplesof14countrieswiththesmallestandgreatestGDP..8.3WhatCanYouDoaboutHeteroscedasticity?Supposethatthetruerelationshipisy=a+bx+u(8.6)Letthestandarddeviationofthedisturbanceterminobservationibesi.Ifyouhappenedtoknowsiforeachobservation,youcouldeliminatetheheteroscedasticitybydividingeachobservationbyitsvalueofs.Themodelbecomesyi1xiui=a+b+(8.7)sisisisiThedisturbancetermui/siishomoscedasticbecauseì2üìuiüïæuiöï1212populationvarianceofíý=Eíçç÷÷ý=2E(ui)=2sui=1(8.8)îsiþïîèsiøïþsisiTherefore,everyobservationwillhaveadisturbancetermdrawnfromadistributionwithpopulationvariance1,andthemodelwillbehomoscedastic.Therevisedmodelmaybe
rewritten

\[ y' = \alpha h + \beta x' + u' \tag{8.9} \]

where \(y_i' = y_i/\sigma_i\), \(x_i' = x_i/\sigma_i\), h is a new variable whose value in observation i is \(1/\sigma_i\), and \(u_i' = u_i/\sigma_i\). Note that there should not be a constant term in the equation. By regressing y' on h and x', you will obtain efficient estimates of \(\alpha\) and \(\beta\) with unbiased standard errors.

A mathematical demonstration that the revised model will yield more efficient estimates than the original one is beyond the scope of this text, but it is easy to give an intuitive explanation. Those observations with the smallest values of \(\sigma_i\) will be most useful for locating the true relationship between y and x because they will tend to have the smallest disturbance terms. We are taking advantage of this fact by performing what is sometimes called a weighted regression. The fact that observation i is given weight \(1/\sigma_i\) automatically means that the better its quality, the greater the weight that it receives.

The snag with this procedure is that it is most unlikely that you will know the actual values of the \(\sigma_i\). However, if you can think of something that is proportional to it in each observation, and divide the equation by that, this will work just as well. Suppose that you can think of such a variable, which we shall call z, and it is reasonable to suppose that \(\sigma_i\) is proportional to \(z_i\):

\[ \sigma_i = \lambda z_i \tag{8.10} \]

for some constant, \(\lambda\). If we divide the original equation through by z, we have

\[ \frac{y_i}{z_i} = \alpha\frac{1}{z_i} + \beta\frac{x_i}{z_i} + \frac{u_i}{z_i} \]

The model is now homoscedastic because

\[ \text{population variance of } \left\{\frac{u_i}{z_i}\right\} = E\left\{\left(\frac{u_i}{z_i}\right)^2\right\} = \frac{1}{z_i^2}E(u_i^2) = \frac{1}{z_i^2}\sigma_{u_i}^2 = \frac{\lambda^2 z_i^2}{z_i^2} = \lambda^2 \tag{8.11} \]

We do not need to know the value of \(\lambda\), and indeed in general will not know it. It is enough that it should be constant for all observations.

In particular, it may be reasonable to suppose that \(\sigma_i\) is roughly proportional to \(x_i\), as in the Goldfeld-Quandt test. If you then divide each observation by its value of x, the model becomes

\[ \frac{y_i}{x_i} = \alpha\frac{1}{x_i} + \beta + \frac{u_i}{x_i} \tag{8.12} \]

and, with a little bit of luck, the new disturbance term u/x will have constant variance. You now regress y/x on 1/x, including a constant term in the regression. The coefficient of 1/x will be an efficient estimate of \(\alpha\) and the constant will be an efficient estimate of \(\beta\). In the case of the manufacturing output example in the previous section, the dependent variable would be manufacturing output as a proportion of GDP, and the explanatory variable would be the reciprocal of GDP.

Sometimes there may be more than one variable that might be used for scaling the equation. In the case of the manufacturing output example, an alternative candidate would be the size of the population of the country, POP. Dividing the original model through by POP, one obtains

\[ \frac{y_i}{POP_i} = \alpha\frac{1}{POP_i} + \beta\frac{x_i}{POP_i} + \frac{u_i}{POP_i} \tag{8.13} \]

and again one hopes that the disturbance term, \(u_i/POP_i\), will have constant variance across observations. Thus now one is regressing manufacturing output per capita on GDP per capita and the reciprocal of the size of the population, this time without a constant term.

In practice it may be a good idea to try several variables for scaling the observations and to compare the results. If the results are roughly similar each time, and tests fail to reject the null hypothesis of homoscedasticity, your problem should be at an end.

Examples

In the previous section it was found that a linear regression of MANU on GDP using the data in Table 8.1 and the model

\[ MANU = \alpha + \beta\,GDP + u \tag{8.14} \]

was subject to severe heteroscedasticity. One possible remedy might be to scale the observations by population, the model becoming

\[ \frac{MANU}{POP} = \alpha\frac{1}{POP} + \beta\frac{GDP}{POP} + \frac{u}{POP} \tag{8.15} \]

Figure 8.5 provides a plot of MANU/POP on GDP/POP. Despite scaling, the plot still looks heteroscedastic. When (8.15) is fitted using the 11 countries with smallest GDP per capita and the 11 countries with the greatest, the residual sums of squares are 5,378,000 and 17,362,000. The ratio, and hence the F statistic, is 3.23. However, with such small subsamples, it is possible to obtain high ratios under the null hypothesis of homoscedasticity. In this case, the null hypothesis is just rejected at the 5% level, the critical value of F(9,9) being 3.18.

[Figure 8.5. Scatter diagram of manufacturing output per capita against GDP per capita.]

[Figure 8.6. Scatter diagram of Manufacturing/GDP against 1/GDP x 1,000,000.]

Figure 8.6 shows the result of scaling through by GDP itself, manufacturing as a share of GDP being plotted against the reciprocal of GDP. In this case the residual sums of squares for the subsamples are 0.065 and 0.070, and so finally we have a model where the null hypothesis of homoscedasticity is not rejected.

We will compare the regression results for the
unscaled model and the two scaled models, summarized in equations (8.16)-(8.18) (standard errors in parentheses):

\[ \widehat{MANU} = \underset{(5{,}700)}{604} + \underset{(0.013)}{0.194}\,GDP \qquad R^2 = 0.89 \tag{8.16} \]

\[ \frac{\widehat{MANU}}{POP} = \underset{(1{,}371)}{612}\,\frac{1}{POP} + \underset{(0.016)}{0.182}\,\frac{GDP}{POP} \qquad R^2 = 0.70 \tag{8.17} \]

\[ \frac{\widehat{MANU}}{GDP} = \underset{(0.019)}{0.189} + \underset{(841)}{533}\,\frac{1}{GDP} \qquad R^2 = 0.02 \tag{8.18} \]

First, note that the estimate of the coefficient of GDP is much the same in the three regressions: 0.194, 0.182, and 0.189 (remember that it becomes the intercept when scaling through by the x variable). One would not expect dramatic shifts since heteroscedasticity does not give rise to bias. The estimator in the third equation should have the smallest variance and therefore ought to have a tendency to be the most accurate. Perhaps surprisingly, its standard error is the largest, but then the standard errors in the first two regressions should be disregarded because they are invalidated by the heteroscedasticity.

In this model the intercept does not have any sensible economic interpretation. In any case its estimate in the third equation, where it has become the coefficient of 1/GDP, is not significantly different from 0. The only apparent problem with the third model is that \(R^2\) is very low. We will return to this in the next subsection.

Nonlinear Models

Heteroscedasticity, or perhaps apparent heteroscedasticity, may be a consequence of misspecifying the model mathematically. Suppose that the true model is nonlinear, for example

\[ y = \alpha x^{\beta} v \tag{8.19} \]

with (for sake of argument) \(\beta\) positive so that y is an increasing function of x. The multiplicative disturbance term v has the effect of increasing or reducing y by a random proportion. Suppose that the probability distribution of v is the same for all observations. This implies, for example, that the probability of a 5 percent increase or decrease in y due to its effects is just the same when x is small as when x is large. However, in absolute terms a 5 percent increase has a larger effect on y when x is large than when x is small. If y is plotted against x, the scatter of observations will therefore tend to be more widely dispersed about the true relationship as x increases, and a linear regression of y on x may therefore exhibit heteroscedasticity.

The solution, of course, is to run a logarithmic regression instead. Not only would this be a more appropriate mathematical specification, but it makes the regression model homoscedastic:

\[ \log y = \log\alpha + \beta\log x + \log v \tag{8.20} \]

\(\log v\) now affects the dependent variable, \(\log y\), additively, so the absolute size of its effect is independent of the magnitude of \(\log x\).

Example

Figure 8.7 shows the logarithm of manufacturing output plotted against the logarithm of GDP using the data in Table 8.1. At first sight at least, the plot does not appear to exhibit heteroscedasticity. Logarithmic regressions using the subsamples of 11 countries with smallest and greatest GDP yield residual sums of squares 2.14 and 1.04, respectively. In this case the conventional Goldfeld-Quandt test is superfluous. Since the second RSS is smaller than the first, it cannot be significantly greater. However, the Goldfeld-Quandt test can also be used to test for heteroscedasticity where the standard deviation of the distribution of the disturbance term is inversely proportional to the size of the x variable. The F statistic is the same, with \(RSS_1\) and \(RSS_2\) interchanged. In the present case the F statistic is 2.06, which is lower than the critical value of F at the 5% level, and we do not reject the null hypothesis of homoscedasticity. Running the regression with the complete sample, we obtain (standard errors in parentheses):

[Figure 8.7. Scatter diagram of log manufacturing output against log GDP.]

\[ \log\widehat{MANU} = \underset{(0.785)}{-1.694} + \underset{(0.066)}{0.999}\,\log GDP \qquad R^2 = 0.90 \tag{8.21} \]

implying that the elasticity of MANU with respect to GDP is equal to 1.

We now have two models free from heteroscedasticity, (8.18) and (8.21). The latter might seem more satisfactory, given that it has a very high \(R^2\) and (8.18) a very low one, but in fact, in this particular case, they happen to be equivalent. (8.21) is telling us that manufacturing output increases proportionally with GDP in the cross-section of countries in the sample. In other words, manufacturing output accounts for a constant proportion of GDP. To work out this proportion, we rewrite the equation as

\[ \widehat{MANU} = e^{-1.694}\,GDP^{0.999} = 0.184\,GDP^{0.999} \tag{8.22} \]

(8.18) is telling us that the ratio MANU/GDP is effectively a constant, since the 1/GDP term appears to be redundant, and that the constant is 0.189. Hence in substance the interpretations coincide.

Does Heteroscedasticity Really Matter?

This will depend on the extent to which your observations vary in size (assuming that the size of the observation is a guide to the size of the standard deviation of the disturbance term). Some calculations undertaken by Geary (1966) indicate that, if the standard deviation of the disturbance term is proportional to the magnitude of the explanatory variable in a simple regression, the variance of the estimator of the slope coefficient could be three times as great using OLS as it would be using a technique that corrected for the heteroscedasticity.

Exercises

8.2 Using your EAEF data set, choose a specification for an earnings function with EARNINGS as the dependent variable. Sort the observations by size of HGC and perform a Goldfeld-Quandt test to test for heteroscedasticity in the HGC dimension. Repeat using LGEARN as the dependent variable.

8.3 With the data set on aggregate expenditure on education and aggregate GDP used in Exercise 8.1, use the Goldfeld-Quandt test to investigate whether scaling by population or by GDP, or whether running the regression in logarithmic form, would eliminate the heteroscedasticity. Compare the results of regressions using the alternative specifications and the entire sample.

10 SIMULTANEOUS EQUATIONS ESTIMATION (draft)

© C. Dougherty 2000. All rights reserved. Copies may be made for personal use. Version of 22 January 2000.

10.1 Simultaneous Equations Models: Some Terminology

Research into the behavior of the firm is an important branch of applied econometrics. Many empirical studies have used enterprise-level data to investigate the determinants of profitability, growth and other important outcomes for firms, investigating the roles of expenditure on research and development, marketing, training, and other factors. We will consider a very simplified model of this type.

Suppose that you have a sample of n enterprises belonging to various industries within the manufacturing sector of a country and you wish to investigate how much, if at all, expenditure on research and development contributes to growth. To avoid the problem of heteroscedasticity, we will choose measures of both which are not automatically associated with the size of the enterprise. We choose as the dependent variable the average growth rate over the past five years, denoted G, and as an explanatory variable average expenditure on research and development as a proportion of total costs over the past five years, denoted R. Some of the industries within the manufacturing sector will have grown faster than others and we would expect the average growth rate of its industry, I, to be a determinant of the growth rate of the firm. We would expect G to be positively influenced by both R and I:

\[ G = \alpha + \beta_1 R + \beta_2 I + u \tag{10.1} \]

Now those firms that are growing fastest are also likely to be those spending most on research and development because dynamic, well-managed firms tend to spend more on research and development, training, etc., than the laggards. So we also hypothesize the relationship

\[ R = \gamma + \delta_1 G + v \tag{10.2} \]

Thus we have a simple example of a simultaneous equations model. By its very specification it involves a certain amount of circularity: R helps to determine G in the first equation, and in turn G determines R in the second. To cut through the circularity we need to make a distinction between endogenous and exogenous variables. Endo- and exo- are Greek prefixes that mean within and outside, respectively. Endogenous variables are variables whose values are determined by the interaction of the relationships in the model. Exogenous ones are those whose values are determined externally. Thus in the present case G and R are both endogenous and I is exogenous. The exogenous variables and the disturbance terms ultimately determine the values of the endogenous variables, once one has cut through the circularity. The mathematical relationships expressing the endogenous variables in terms of the exogenous variables and disturbance terms are known as the reduced form equations. The original equations that we wrote down when specifying the model are described as the structural equations.

We will derive the reduced form equations for G and R. To obtain that for G, we take the structural equation for G and substitute for R from the second equation:

\[ G = \alpha + \beta_1 R + \beta_2 I + u = \alpha + \beta_1(\gamma + \delta_1 G + v) + \beta_2 I + u \tag{10.3} \]

Hence

\[ (1 - \beta_1\delta_1)G = (\alpha + \beta_1\gamma) + \beta_2 I + u + \beta_1 v \tag{10.4} \]

and so

\[ G = \frac{\alpha + \beta_1\gamma + \beta_2 I + u + \beta_1 v}{1 - \beta_1\delta_1} \tag{10.5} \]

Similarly we obtain the reduced form equation for R:

\[ R = \gamma + \delta_1 G + v = \gamma + \delta_1(\alpha + \beta_1 R + \beta_2 I + u) + v \tag{10.6} \]

Hence

\[ (1 - \beta_1\delta_1)R = (\gamma + \alpha\delta_1) + \delta_1\beta_2 I + \delta_1 u + v \tag{10.7} \]

and so

\[ R = \frac{\gamma + \alpha\delta_1 + \delta_1\beta_2 I + \delta_1 u + v}{1 - \beta_1\delta_1} \tag{10.8} \]

10.2 Simultaneous Equations Bias

In many (but be careful: by no means all) simultaneous equations models, the reduced form equations express the endogenous variables in terms of all of
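the exogenous variables and all of the disturbance terms. You can see that this is the case here. In this model, there is only one exogenous variable. I is an explanatory variable in the structural equation for G, so it is also a component of the reduced form equation for G. R does not depend directly on I but it does so indirectly because G is a determinant. Similarly, both G and R depend on u, G directly and R indirectly. And both depend on v, R directly and G indirectly.

The dependence of R on u means that OLS would yield inconsistent estimates if used to fit equation (10.1), the structural equation for G. R is a stochastic regressor and its random component is not distributed independently of the disturbance term u. Similarly the dependence of G on v means that OLS would yield inconsistent estimates if used to fit (10.2). Since this is a simple regression equation, it is easy to analyze the large-sample bias in the OLS estimator of \(\delta_1\) and we will do so.

After writing down the expression for \(d_1^{OLS}\), the first step, as usual, is to substitute for R. Here we have to make a decision. We now have two equations for R, the structural equation (10.2) and the reduced form equation (10.8). Ultimately it does not matter which we use, but the algebra is a little more straightforward if we use the structural equation because the expression decomposes immediately into the true value and the error term. We can then concentrate on the error term. If we use OLS to fit the structural equation for R,

\[ d_1^{OLS} = \frac{\mathrm{Cov}(R,G)}{\mathrm{Var}(G)} = \frac{\mathrm{Cov}([\gamma + \delta_1 G + v],G)}{\mathrm{Var}(G)} = \frac{\mathrm{Cov}(\gamma,G) + \mathrm{Cov}(\delta_1 G,G) + \mathrm{Cov}(v,G)}{\mathrm{Var}(G)} = \delta_1 + \frac{\mathrm{Cov}(v,G)}{\mathrm{Var}(G)} \tag{10.9} \]

The error term is a nonlinear function of both u and v (remember that G depends on both) and it is not possible to obtain an analytical expression for its expected value. Instead we will investigate its probability limit, using the rule that the probability limit of a ratio is equal to the probability limit of the numerator divided by the probability limit of the denominator, provided that both exist. We will first focus on plim Cov(v, G), substituting for G from the reduced form equation (10.5):

\[ \mathrm{plim}\,\mathrm{Cov}(v,G) = \mathrm{plim}\,\mathrm{Cov}\left(v, \frac{1}{1-\beta_1\delta_1}\left([\alpha + \beta_1\gamma] + \beta_2 I + u + \beta_1 v\right)\right) \]
\[ = \frac{1}{1-\beta_1\delta_1}\left[\mathrm{plim}\,\mathrm{Cov}(v,[\alpha + \beta_1\gamma]) + \beta_2\,\mathrm{plim}\,\mathrm{Cov}(v,I) + \mathrm{plim}\,\mathrm{Cov}(v,u) + \beta_1\,\mathrm{plim}\,\mathrm{Cov}(v,v)\right] \tag{10.10} \]

\(\mathrm{Cov}(v,[\alpha + \beta_1\gamma])\) is 0 since \(\alpha + \beta_1\gamma\) is a constant. plim Cov(v, I) will be 0 if I is truly exogenous, as we have assumed. plim Cov(v, u) will be 0 provided that the disturbance terms in the structural equations are independent. But plim Cov(v, v) is nonzero because it is plim Var(v) and the limiting value of the sample variance of v is its population variance, \(\sigma_v^2\). Hence

\[ \mathrm{plim}\,\mathrm{Cov}(v,G) = \frac{\beta_1\sigma_v^2}{1-\beta_1\delta_1} \tag{10.11} \]

Now for plim Var(G).

\[ \mathrm{plim}\,\mathrm{Var}(G) = \mathrm{plim}\,\mathrm{Var}\left(\frac{\alpha + \beta_1\gamma}{1-\beta_1\delta_1} + \frac{\beta_2 I + u + \beta_1 v}{1-\beta_1\delta_1}\right) = \mathrm{plim}\,\mathrm{Var}\left(\frac{\beta_2 I + u + \beta_1 v}{1-\beta_1\delta_1}\right) \tag{10.12} \]

since \(\dfrac{\alpha + \beta_1\gamma}{1-\beta_1\delta_1}\) is an additive constant. So

\[ \mathrm{plim}\,\mathrm{Var}(G) = \frac{1}{(1-\beta_1\delta_1)^2}\left[\mathrm{plim}\,\mathrm{Var}(\beta_2 I) + \mathrm{plim}\,\mathrm{Var}(u) + \mathrm{plim}\,\mathrm{Var}(\beta_1 v) + 2\,\mathrm{plim}\,\mathrm{Cov}(\beta_2 I,u) + 2\,\mathrm{plim}\,\mathrm{Cov}(\beta_2 I,\beta_1 v) + 2\,\mathrm{plim}\,\mathrm{Cov}(u,\beta_1 v)\right] \tag{10.13} \]

Now if I, u and v are independently distributed, the limiting values of the three covariance terms are 0. The limiting values of the variance terms are the corresponding population variances. Hence

\[ \mathrm{plim}\,\mathrm{Var}(G) = \frac{1}{(1-\beta_1\delta_1)^2}\left[\beta_2^2\sigma_I^2 + \sigma_u^2 + \beta_1^2\sigma_v^2\right] \tag{10.14} \]

Thus

\[ \mathrm{plim}\,d_1^{OLS} = \delta_1 + (1-\beta_1\delta_1)\frac{\beta_1\sigma_v^2}{\beta_2^2\sigma_I^2 + \sigma_u^2 + \beta_1^2\sigma_v^2} \tag{10.15} \]

and so \(d_1^{OLS}\) is an inconsistent estimator of \(\delta_1\).

The direction of simultaneous equations bias depends on the structure of the model being fitted. Can one say anything about it in this case? Variances are always positive, if not 0, so, given that \(\beta_1\) is positive, it depends on the sign of \((1-\beta_1\delta_1)\). Looking at the reduced form equations (10.5) and (10.8), it is reasonable to suppose that G and R will be positively influenced by I. Since it is also reasonable to suppose that \(\beta_2\) and \(\delta_1\) are both positive, one may infer that \((1-\beta_1\delta_1)\) is also positive. Actually, this is a condition for equilibrium in this model. Consider the effect of an increase \(\Delta I\) in I. In view of (10.1), its immediate effect is to increase G by an amount \(\beta_2\Delta I\). Looking at (10.2), this in turn increases R by an amount \(\delta_1\beta_2\Delta I\). Returning to (10.1), this causes a secondary increase in G of \(\beta_1\delta_1\beta_2\Delta I\), and hence, returning to (10.2), a secondary increase in R equal to \(\beta_1\delta_1^2\beta_2\Delta I\). Returning again to (10.1), this causes a further increase in G equal to \(\beta_1^2\delta_1^2\beta_2\Delta I\). The total increase in G will therefore be

\[ \Delta G = (1 + \beta_1\delta_1 + \beta_1^2\delta_1^2 + \beta_1^3\delta_1^3 + \dots)\beta_2\Delta I \tag{10.16} \]

and this will be finite only if \(\beta_1\delta_1 < 1\).

A Monte Carlo Experiment

This section reports on a Monte Carlo experiment that investigates the performance of OLS and, later, IV when fitting the research and development equation in the research and development and growth model. Numerical values were assigned to the parameters of the equations as follows:

\[ G = -5.0 + 2.0R + 0.9I + u \tag{10.17} \]

\[ R = 2.0 + 0.2G + v \tag{10.18} \]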
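An experiment of this kind can be sketched in a few lines of Python. The sketch below is illustrative only and is not the author's program. It takes the parameter values from (10.17) and (10.18), and uses the disturbance standard deviations (0.25 and 0.4) and the sample size (40) reported for the experiment. The text does not say how I was generated, so drawing it from a uniform distribution on [1, 5] is purely an assumption, as are the function names.

```python
import numpy as np

# Structural parameters from (10.17) and (10.18).
ALPHA, BETA1, BETA2 = -5.0, 2.0, 0.9
GAMMA, DELTA1 = 2.0, 0.2
SIGMA_U, SIGMA_V = 0.25, 0.4   # standard deviations of u and v
N_OBS = 40                     # observations per replication


def simulate_sample(rng, n=N_OBS):
    """Generate one sample of (G, R, I) from the simultaneous equations model."""
    # ASSUMPTION: I is uniform on [1, 5]; the text does not specify its distribution.
    I = rng.uniform(1.0, 5.0, size=n)
    u = SIGMA_U * rng.standard_normal(n)
    v = SIGMA_V * rng.standard_normal(n)
    # Reduced form for G, equation (10.5).
    G = (ALPHA + BETA1 * GAMMA + BETA2 * I + u + BETA1 * v) / (1.0 - BETA1 * DELTA1)
    # Structural equation for R, equation (10.2).
    R = GAMMA + DELTA1 * G + v
    return G, R, I


def ols_slope(x, y):
    """Simple-regression slope coefficient, Cov(x, y) / Var(x)."""
    xc = x - x.mean()
    return float(np.dot(xc, y - y.mean()) / np.dot(xc, xc))


def plim_ols_slope(var_I):
    """Large-sample limit of the OLS slope, equation (10.15)."""
    numerator = (1.0 - BETA1 * DELTA1) * BETA1 * SIGMA_V ** 2
    denominator = BETA2 ** 2 * var_I + SIGMA_U ** 2 + BETA1 ** 2 * SIGMA_V ** 2
    return DELTA1 + numerator / denominator


def run_experiment(replications=20, seed=0):
    """Return the OLS slope estimate of R on G for each replication."""
    rng = np.random.default_rng(seed)
    slopes = []
    for _ in range(replications):
        G, R, _ = simulate_sample(rng)
        slopes.append(ols_slope(G, R))
    return np.array(slopes)


if __name__ == "__main__":
    slopes = run_experiment()
    print("mean OLS slope estimate:", slopes.mean())
    # Variance of a uniform distribution on [1, 5] is (5 - 1)^2 / 12.
    print("plim from (10.15) under the assumed Var(I):", plim_ols_slope(16.0 / 12.0))
```

Under these assumptions the slope estimates should cluster well above the true value of 0.2, close to the limit given by (10.15); a different choice for the distribution of I would change the size of the bias but not its direction.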
TABLE 10.1

Sample     g   s.e.(g)    d1   s.e.(d1)
   1     1.60    0.16    0.30    0.04
   2     1.43    0.12    0.35    0.02
   3     1.61    0.10    0.30    0.02
   4     1.48    0.13    0.33    0.03
   5     1.54    0.14    0.30    0.03
   6     1.46    0.11    0.31    0.02
   7     1.62    0.10    0.28    0.02
   8     1.65    0.10    0.29    0.02
   9     1.67    0.10    0.26    0.02
  10     1.45    0.09    0.33    0.02
  11     1.54    0.11    0.32    0.02
  12     1.70    0.13    0.29    0.03
  13     1.52    0.08    0.31    0.02
  14     1.64    0.12    0.29    0.02
  15     1.69    0.13    0.28    0.03
  16     1.36    0.09    0.33    0.02
  17     1.59    0.14    0.30    0.03
  18     1.59    0.10    0.30    0.02
  19     1.59    0.13    0.31    0.03
  20     1.56    0.10    0.30    0.02

u and v were generated as normal random variables with 0 mean and unit variance, scaled by 0.25 and 0.4, respectively. Each replication of the experiment used a sample of 40 observations. Using the expression derived above, plim \(d_1^{OLS}\) = 0.31 when the research and development equation is fitted with OLS. The experiment was replicated 20 times with the results shown in Table 10.1.

It is evident that the estimates are heavily biased. Every estimate of the slope coefficient is above the true value of 0.2, and every estimate of the intercept is below the true value of 2.0. The mean of the slope coefficients is 0.30, very close to the theoretical plim for the OLS estimate. Figure 10.1 shows how the bias arises. The hollow circles show what the relationship between R and G would look like in the absence of the disturbance terms, for 20 observations. The disturbance term u is not responsible for bias when OLS is used to fit the research and development equation, so to keep things simple, it will be suppressed. The disturbance term v alters the values of both R and G in each observation when it is introduced. It has a direct effect on R equal to v, but, because R has increased, the growth equation is affected and G increases by an amount \(\beta_1 v\). Returning to the research and development equation, the increase in G causes a secondary increase in R equal to \(\beta_1\delta_1 v\). Returning to the growth equation, this secondary increase in R causes a further increase in G equal to \(\beta_1^2\delta_1 v\). Taking account of the full multiplier effect, R increases by \(v/(1-\beta_1\delta_1)\) and G by \(\beta_1 v/(1-\beta_1\delta_1)\). It follows that the shift is along a line with slope \(1/\beta_1\).

[Figure 10.1. Scatter diagram of R against G.]

The solid circles are the actual observations, after v has been introduced. The shift line has been drawn for each observation. As can be seen, the overall effect is to skew the pattern of observations, with the result that the OLS slope coefficient is a compromise between the slope of the true relationship, \(\delta_1\), and the slope of the shift lines, \(1/\beta_1\).

Exercise

10.1 A researcher is investigating the impact of advertising on sales using cross-section data from firms producing recreational goods. For each firm there are data on sales, S, and expenditure on advertising, A, both measured in suitable units, for a recent year. The researcher proposes the following model:

\[ S = \alpha + \beta A + u \]
\[ A = \gamma + \delta S + v \]

where u and v are disturbance terms. The first relationship reflects the positive effect of advertising on sales, and the second the fact that the largest firms, as measured by sales, tend to spend most on advertising. Give a mathematical analysis of what would happen if the researcher tried to fit the model using OLS.

10.3 Instrumental Variables

As we saw in the discussion of measurement error, the instrumental variables approach may offer a solution to the problems caused by a violation of the fourth Gauss-Markov condition. In the present case, when we fit the structural equation for R, the fourth Gauss-Markov condition is violated because G is not distributed independently of v. We need a variable which is correlated with G but not with v, and which does not already appear in the equation in its own right. The reduced form equation for G gave us some bad news: it revealed that G was dependent on v. But it also gives us some good news: it shows that G is correlated with I, which is exogenous and thus independent of v. So we can fit the equation using I as an instrument for G. Recalling that, for simple regression analysis, the instrumental variables estimator of the slope coefficient is given by the covariance of the dependent variable with the instrument divided by the covariance of the explanatory variable with the instrument, the IV estimator is given by

\[ d_1^{IV} = \frac{\mathrm{Cov}(I,R)}{\mathrm{Cov}(I,G)} \tag{10.19} \]

We will demonstrate that it is consistent. Substituting from the structural equation for R,

\[ d_1^{IV} = \frac{\mathrm{Cov}(I,[\gamma + \delta_1 G + v])}{\mathrm{Cov}(I,G)} = \frac{\mathrm{Cov}(I,\gamma) + \mathrm{Cov}(I,\delta_1 G) + \mathrm{Cov}(I,v)}{\mathrm{Cov}(I,G)} = \delta_1 + \frac{\mathrm{Cov}(I,v)}{\mathrm{Cov}(I,G)} \tag{10.20} \]

since the first covariance in the numerator is 0 and the second is equal to \(\delta_1\mathrm{Cov}(I,G)\). Now plim Cov(I, v) is 0 if I is exogenous and so distributed independently of v. plim Cov(I, G) is nonzero because I is a determinant of G. Hence the instrumental variables estimator is a consistent estimator of \(\delta_1\).

Table 10.2 shows the results when IV is used to fit the model described in Section 10.2. In contrast to the OLS estimates, the IV estimates are distributed around the true values, the mean of the estimates of the slope coefficient being 0.19 and of those of the intercept being 2.1. There is no point in comparing the standard errors using the two approaches. Those for OLS may appear to be slightly smaller, but the simultaneous equations bias renders them invalid.

In this example, IV definitely gave better results than OLS, but that outcome was not inevitable. If you compare Tables 10.1 and 10.2, you will see that the distribution of the OLS estimates of the slope coefficient is much more concentrated than that of the IV estimates. The standard deviation of the OLS estimates (calculated directly from the estimates, ignoring the standard errors) is 0.020. For the IV estimates it is 0.045. So if the bias had been smaller, it is possible that OLS might have yielded superior estimates according to a criterion like the mean square error that allows a trade-off between bias and variance.

TABLE 10.2

Sample     g   s.e.(g)    d1   s.e.(d1)
   1     2.51    0.32    0.09    0.07
   2     2.03    0.21    0.21    0.05
   3     1.99    0.14    0.22    0.03
   4     2.23    0.26    0.15    0.06
   5     2.35    0.28    0.10    0.07
   6     1.88    0.16    0.20    0.04
   7     2.02    0.15    0.18    0.04
   8     1.93    0.12    0.22    0.03
   9     2.01    0.14    0.18    0.03
  10     1.84    0.15    0.24    0.03
  11     2.00    0.17    0.22    0.04
  12     2.29    0.21    0.16    0.04
  13     1.76    0.11    0.25    0.02
  14     2.09    0.17    0.19    0.04
  15     2.25    0.21    0.15    0.04
  16     1.65    0.12    0.25    0.03
  17     2.29    0.25    0.13    0.06
  18     1.97    0.15    0.20    0.03
  19     2.29    0.25    0.15    0.05
  20     1.89    0.14    0.22    0.03

Exercises

10.2 The researcher in Exercise 10.1 discovers that last year's advertising budget, A(-1), is also an important determinant of A. Explain how this information could be used to obtain a consistent estimator of \(\beta\), and prove that it is consistent.

10.3 Suppose that A(-1) in Exercise 10.2 also has an influence on S. How would this affect the fitting of the model?

10.4 Overidentification and Two-Stage Least Squares

Consider the model

\[ G = \alpha + \beta_1 R + \beta_2 I + u \tag{10.21} \]

\[ R = \gamma + \delta_1 G + \delta_2 K + v \tag{10.22} \]

where K is a measure of capital investment per worker. It supposes that the greater the capital investment per worker, the greater the need for research and development. We will assume that K is an exogenous variable. The reduced form is then

\[ G = \frac{\alpha + \beta_1\gamma + \beta_2 I + \beta_1\delta_2 K + u + \beta_1 v}{1 - \beta_1\delta_1} \tag{10.23} \]

\[ R = \frac{\gamma + \alpha\delta_1 + \delta_1\beta_2 I + \delta_2 K + \delta_1 u + v}{1 - \beta_1\delta_1} \tag{10.24} \]

The growth equation is now identified because K is available as an instrument for R. The research and development equation remains identified, with I acting as an instrument for G.

Next consider the model

\[ G = \alpha + \beta_1 R + \beta_2 I + \beta_3 M + u \tag{10.25} \]

\[ R = \gamma + \delta_1 G + v \tag{10.26} \]

where M is a measure of the marketing effort of the firm, the growth rate being positively influenced by it. In this case the reduced form is

\[ G = \frac{\alpha + \beta_1\gamma + \beta_2 I + \beta_3 M + u + \beta_1 v}{1 - \beta_1\delta_1} \tag{10.27} \]

\[ R = \frac{\gamma + \alpha\delta_1 + \delta_1\beta_2 I + \delta_1\beta_3 M + \delta_1 u + v}{1 - \beta_1\delta_1} \tag{10.28} \]

The growth equation is underidentified (not identified) because both exogenous variables appear in it in their own right. However, the research and development equation is now overidentified because we have two instruments for G. We could use I as an instrument for G, as before:

\[ d_1^{IV} = \frac{\mathrm{Cov}(I,R)}{\mathrm{Cov}(I,G)} \tag{10.29} \]

or we could use M as an instrument:

\[ d_1^{IV} = \frac{\mathrm{Cov}(M,R)}{\mathrm{Cov}(M,G)} \tag{10.30} \]

Both are consistent estimators, so they would converge to the true value, and therefore to each other, as the sample size became large, but for finite samples they would give different estimates. Suppose that you had to choose between them (you do not, as we will see). Which would you choose? The population variance of the first is given by

\[ \sigma_{d_1^{IV}}^2 = \frac{\sigma_v^2}{n\,\mathrm{Var}(G)} \times \frac{1}{r_{G,I}^2} \tag{10.31} \]

The population variance of the second estimator is given by a similar expression with the correlation coefficient replaced by that between G and M. We want the population variance to be as small as possible, so we would choose the instrument with the higher correlation coefficient.

Two-Stage Least Squares

In practice, rather than choose between the instruments in this situation, we would construct a linear combination of them and use that instead. The main reason for this is that in general a linear combination, with suitably chosen weights, will be more efficient than either individually. A secondary reason is that using a linear combination eliminates the problem of conflicting estimates.

Let the linear combination be Z, where

\[ Z = h_1 I + h_2 M \tag{10.32} \]

How do we choose \(h_1\) and \(h_2\)? The answer is straightforward. Using OLS, regress G on I and M, save the fitted values, and call the saved variable Z. The fitted values are automatically linear combinations of I and M:

\[ \hat{G} = h_0 + h_1 I + h_2 M \tag{10.33} \]
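The two stages can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than a definitive implementation: the data are simulated from (10.25) and (10.26) using hypothetical parameter values (those of the earlier Monte Carlo experiment, plus an assumed value of 0.5 for the coefficient of M), I and M are drawn from uniform distributions chosen only for convenience, and all function names are invented for the example. The first stage regresses G on I and M and saves the fitted values as Z; the second stage uses Z as the instrument for G in the research and development equation.

```python
import numpy as np


def cov(x, y):
    """Sample covariance with divisor n, matching the text's Cov(x, y)."""
    return float(np.mean((x - x.mean()) * (y - y.mean())))


def first_stage_fitted(G, I, M):
    """Stage 1: OLS regression of G on a constant, I and M; returns fitted values Z."""
    X = np.column_stack([np.ones_like(I), I, M])
    h, *_ = np.linalg.lstsq(X, G, rcond=None)   # h = (h0, h1, h2)
    return X @ h


def tsls_slope(G, R, I, M):
    """Stage 2: IV estimator of the slope, using the fitted values Z as instrument for G."""
    Z = first_stage_fitted(G, I, M)
    return cov(Z, R) / cov(Z, G)


def simulate(n, seed=0):
    """Simulate the model (10.25)-(10.26). All numerical values are hypothetical."""
    rng = np.random.default_rng(seed)
    alpha, beta1, beta2, beta3 = -5.0, 2.0, 0.9, 0.5   # beta3 is an assumed value
    gamma, delta1 = 2.0, 0.2
    I = rng.uniform(1.0, 5.0, n)   # ASSUMPTION: distribution of I
    M = rng.uniform(0.0, 4.0, n)   # ASSUMPTION: distribution of M
    u = 0.25 * rng.standard_normal(n)
    v = 0.4 * rng.standard_normal(n)
    # Reduced form for G, equation (10.27).
    G = (alpha + beta1 * gamma + beta2 * I + beta3 * M + u + beta1 * v) / (1.0 - beta1 * delta1)
    # Structural equation for R, equation (10.26).
    R = gamma + delta1 * G + v
    return G, R, I, M


if __name__ == "__main__":
    G, R, I, M = simulate(20000)
    print("OLS slope :", cov(G, R) / cov(G, G))
    print("TSLS slope:", tsls_slope(G, R, I, M))
```

With a large simulated sample the two-stage estimate should lie close to the assumed true slope of 0.2, while the OLS estimate remains biased upwards.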
TIMATION10Further,thehcoefficientswillhavebeenchoseninsuchawayastomaximizethecorrelationbetweenthefittedvaluesandG.AswesawinChapter2,providedthatthemodeliscorrectlyspecifiedandthattheGauss-Markovconditionsaresatisfied,OLSyieldsestimatesthatareoptimalaccordingtothreemutually-equivalentcriteria:minimizingthesumofthesquaresoftheresiduals,2maximizingR,and(thecriterionwhichisrelevanthere)maximizingthecorrelationbetweentheactualandthefittedvalues.Exercise10.4TheresearcherinExercise10.1findsoutthattheaveragepriceoftheproduct,P,andlastyear'ssales,S(-1),areimportantdeterminantsofS.Howwouldthisaffectthefittingofthemodel?\n11BINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION(draft)Economistsareofteninterestedinthefactorsbehindthedecision-makingofindividualsorenterprises.Examplesare:·Whydosomepeoplegotocollegewhileothersdonot?·Whydosomewomenenterthelaborforcewhileothersdonot?·Whydosomepeoplebuyhouseswhileothersrent?·Whydosomepeoplemigratewhileothersstayput?Themodelsthathavebeendevelopedareknownasbinarychoiceorqualitativeresponsemodelswiththeoutcome,whichwewilldenotey,beingassignedavalueof1iftheeventoccursand0otherwise.Modelswithmorethantwopossibleoutcomeshavebeendeveloped,butwewillrestrictourattentiontobinarychoice.Thelinearprobabilitymodelapart,binarychoicemodelsarefittedusingmaximumlikelihoodestimation.Thechapterendswithanintroductiontothistopic.11.1TheLinearProbabilityModelThesimplestbinarychoicemodelisthelinearprobabilitymodelwhere,asthenameimplies,theprobabilityoftheeventoccurring,p,isassumedtobealinearfunctionofasetofexplanatoryvariable(s):pi=p(yi=1)=a+bxi+ui(11.1)Graphically,therelationshipisasshowninFigure11.1,ifthereisjustoneexplanatoryvariable.Ofcoursepisunobservable.Onehasdataonlyontheoutcome,y.Inthelinearprobabilitymodelthisisusedlikeadummyvariableforthedependentvariable.Asanillustration,wewilltakethefirstquestionlisted.WewilldefineavariableCOLLEGEthatisequalto1iftheindividualwenttocollege,and0forthosewhodonot,andwewillregressitonASVABC,thecompositescoreontestsofcogni
tiveability.Theregressionoutputshowstheresultoffittingthislinearprobabilitymodel,usingEAEFDataSet21.ÓC.Dougherty2000.Allrightsreserved.Copiesmaybemadeforpersonaluse.Versionof3February2000.\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION2p1a+bxi0xixFigure11.1.Linearprobabilitymodel.regcollegeasvabcSource|SSdfMSNumberofobs=570---------+------------------------------F(1,568)=195.35Model|36.464907136.464907Prob>F=0.0000Residual|106.028075568.186669147R-squared=0.2559---------+------------------------------AdjR-squared=0.2546Total|142.492982569.250427034RootMSE=.43205------------------------------------------------------------------------------college|Coef.Std.Err.tP>|t|[95%Conf.Interval]---------+--------------------------------------------------------------------asvabc|.027473.001965613.9770.000.0236121.0313338_cons|-.8813018.1002259-8.7930.000-1.07816-.6844431------------------------------------------------------------------------------Theregressionresultsuggeststhattheprobabilityofgoingtocollegeincreasesby0.027,thatis,2.7percent,foreverypointincreaseintheASVABCscore.ASVABCisscaledsothatithasmean50andstandarddeviation10,soaone-standarddeviationincreaseinthescorewouldincreasetheprobabilityofgoingtocollegeby27%.TheinterceptimpliesthatifASVABCwere0,theprobabilitywouldbeminus88%.HowevertheASVABCscoreisscaledinsuchawayastomakeitsminimumabout20,andtheinterpretationisobviousnonsense.Unfortunately,thelinearprobabilitymodelhassomeseriousdefects.First,thereareproblemswiththedisturbanceterm.Asusual,thevalueofthedependentvariableyiinobservationihasanonstochasticcomponentandarandomcomponent.Thenonstochasticcomponentdependsonxiandtheparameters.Therandomcomponentisthedisturbancetermyi=E(yi)+ui(11.2)Thenonstochasticcomponentinobservationiisitsexpectedvalueinthatobservation.Thisissimpletocompute,becauseitcantakeonlytwovalues.Itis1withprobabilitypiand0withprobability(1-pi)E(yi)=1´pi+0´(1–pi)=pi=a+bxi(11.3)\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION3yA1u41-a-bxia+bxia+bxiB0xixF
igure11.2.DisturbanceterminthelinearprobabilitymodelTheexpectedvalueinobservationiisthereforea+bxi.Thismeansthatwecanrewritethemodelasyi=a+bxi+ui(11.4)Theprobabilityfunctionisthusalsothenonstochasticcomponentoftherelationshipbetweenyandx.Itfollowsthat,fortheoutcomevariableyitobeequalto1,asrepresentedbythepointAinFigure11.2,thedisturbancetermmustbeequalto(1-a-bxi).Fortheoutcometobe0,asrepresentedbythepointB,thedisturbancetermmustbe(-a-bxi).Thusthedistributionofthedisturbancetermconsistsofjusttwospecificvalues.Itisnotevencontinuous,nevermindnormal.Thismeansthatthestandarderrorsandtheusualteststatisticsareinvalidated.Forgoodmeasure,thetwopossiblevaluesofthedisturbancetermchangewithx,sothedistributionisheteroscedasticaswell,foritcanbeshownthatthepopulationvarianceofuiis(a+bxi)(1-a-bxi),andthisvarieswithxi.Theotherproblemisthatthepredictedprobabilitymaybegreaterthan1orlessthan0forextremevaluesofx.Inthegoing-to-collegeexample,theregressionequationpredictsaprobabilitygreaterthan1probabilityforvaluesofASVABCgreaterthan68andaprobabilitylessthan0forvalueslessthan33.Therewerenorespondentswithscoresabove68inthesamplebuttherewere25withscoreslessthan33.Thefirstproblemisdealtwithbyfittingthemodelwithatechniqueknownasmaximumlikelihoodestimation,describedinSection11.4,insteadofleastsquares.Thesecondprobleminvolveselaboratingthemodelasfollows.Defineavariablezthatisalinearfunctionoftheexplanatoryvariables.Inthepresentcase,sinceweonlyhaveoneexplanatoryvariable,thisfunctioniszi=a+bxi(11.5)\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION41.00F(z)0.750.500.250.00-8-6-4-20246zFigure11.3.LogisticfunctionNext,supposethatpisasigmoidfunctionofz,forexampleasshowninFigure11.3.Belowacertainvalueofz,thereisverylittlechanceoftheindividualgoingtocollege.Aboveacertainvalue,theindividualisalmostcertaintogo.Inbetween,theprobabilityissensitivetothevalueofz.Thisdealswiththeproblemofnonsenseprobabilityestimates,butthenthereisthequestionofwhatshouldbetheprecisemathematicalformofthisfunction.Thereisnodefi
nitiveanswertothis.Thetwomostpopularformsarethelogisticfunction,whichisusedinlogitestimation,andthecumulativenormaldistribution,whichisusedinprobitestimation.Accordingtooneoftheleadingauthoritiesonthesubject,Amemiya(1981),bothgivesatisfactoryresultsmostofthetimeandneitherhasanyparticularadvantage.Wewillstartwiththeformer.11.2LogitAnalysisInlogitestimationonehypothesizesthattheprobabilityoftheeventisdeterminedbythefunction1p=F(z)=.(11.6)ii1+e-zi-zThisisthefunctionshowninFigure11.3.Asztendstoinfinity,etendsto0andphasalimiting-zupperboundof1.Asztendstominusinfinity,etendstoinfinityandphasalimitingvalueof0.Hencethereisnopossibilityofgettingpredictionsoftheprobabilitybeinggreaterthan1orlessthan0.Themarginaleffectofzontheprobability,whichwillbedenotedf(z),isgivenbythederivativeofthisfunctionwithrespecttoz:\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION5f(z)0.20.10-8-6-4-20246zFigure11.4.Marginaleffectofzontheprobability-zdpefz()==(11.7)-z2dz()1+eThefunctionisshowninFigure11.4.Youcanseethattheeffectofchangesinzontheprobabilityisverysmallforlargepositiveorlargenegativevaluesofz,andthatthesensitivityoftheprobabilitytochangesinzisgreatestatthemidpointvalueof0.Inthecaseofthegoing-to-collegeexample,thefunctionis1p=(11.8)i1+e-a-bASVABCiIfwefitthemodel,wegetthefollowingoutput:.logitcollegeasvabcIteration0:LogLikelihood=-395.07986Iteration1:LogLikelihood=-315.56222Iteration2:LogLikelihood=-311.8771Iteration3:LogLikelihood=-311.81849Iteration4:LogLikelihood=-311.81847LogitEstimatesNumberofobs=570chi2(1)=166.52Prob>chi2=0.0000LogLikelihood=-311.81847PseudoR2=0.2107------------------------------------------------------------------------------college|Coef.Std.Err.zP>|z|[95%Conf.Interval]---------+--------------------------------------------------------------------asvabc|.1448371.013540310.6970.000.1182986.1713756_cons|-7.350942.700885-10.4880.000-8.724652-5.977233------------------------------------------------------------------------------\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODES
TIMATION61.000.750.030.500.02MarginaleffectCumulativeeffect0.250.010.0000102030405060708090100ASVABCFigure11.5.CumulativeandmarginaleffectsofASVABCThereasonfortheiterationsisthatthemodelisnonlinearandmaximumlikelihoodestimationinvolvesatrial-and-errorsearchproceduretoestimatetheparameters.ThezstatisticsintheStataoutputareapproximationstotstatisticsandhavingnothingtodowiththezvariablediscussedinthetext.(Someregressionapplicationsdescribethemaststatistics.)ThezstatisticforASVABCishighlysignificant.Howshouldoneinterpretthecoefficients?TocalculatethemarginaleffectofASVABConpweneedtocalculatedp/dASVABC.Youcouldcalculatethedifferentialdirectly,butthebestwaytodothis,especiallyifzisafunctionofmorethanonevariable,istobreakitupintotwostages.pisafunctionofz,andzisafunctionofASVABC,sodpdpdz=.=f(z).b(11.9)dASVABCdzdASVABCwheref(z)isasdefinedabove.Theprobabilityofgoingtocollege,andthemarginaleffect,areplottedasfunctionsofASVABCinFigure11.5.HowcanyousummarizetheeffectoftheASVABCscoreontheprobabilityofgoingtocollege?Theusualmethodistocalculatethemarginaleffectatthemeanvalueoftheexplanatoryvariables.In-zthissamplethemeanvalueofASVABCwas50.15.Forthisvalue,zisequalto–0.079,andeisequalto1.082.Usingthis,f(z)is0.250andthemarginaleffectis0.036.Inotherwords,atthesamplemean,aone-pointincreaseinASVABCatthesamplemeanincreasestheprobabilityofgoingtocollegeby3.6percent.OfcoursewecouldcalculatethemarginaleffectforothervaluesofASVABCifwewished.For-zexample,whenASVABCis65,nearthetopofitsactualrange,zis2.074,eis0.126,f(z)is0.099,andthemarginaleffectis0.014,or1.4percent..Itismuchlowerbecauseindividualswithascoreof65alreadyhaveaveryhighprobabilityofgoingtocollege,89%,andanincreaseinASVABCthencanmakelittledifference.Inthisexample,themarginaleffecthappenedtobegreatestclosetothemeanvalueofASVABC,butthiswasacoincidence.\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION7GeneralizationtoMorethanOneExplanatoryVariable.Logitanalysisiseasilyextendedtothecasewherethereismorethanoneexplanatoryvariable.Supposeth
atwedecidetorelategoingtocollegetoASVABC,HGCM,thenumberofyearsofschoolingofthemother,HGCF,thenumberofyearsofschoolingofthefather,andthesexoftherespondent,usingadummyvariableMALEthatisequalto1formales,0forfemales.Thezvariablebecomesz=a+b1ASVABC+b2HGCM+b3HGCF+b4MALE(11.10)Thecorrespondingregressionoutput(withiterationmessagesdeleted)isshownbelow:.logitcollegeasvabchgcmhgcfmaleLogitEstimatesNumberofobs=570chi2(4)=193.54Prob>chi2=0.0000LogLikelihood=-298.30875PseudoR2=0.2449------------------------------------------------------------------------------college|Coef.Std.Err.zP>|z|[95%Conf.Interval]---------+--------------------------------------------------------------------asvabc|.132425.01462769.0530.000.1037555.1610945hgcm|.0794132.05607571.4160.157-.0304931.1893195hgcf|.1032057.04019442.5680.010.0244262.1819852male|-.5715936.2063839-2.7700.006-.9760987-.1670885_cons|-8.565297.8425983-10.1650.000-10.21676-6.913835------------------------------------------------------------------------------ThemeanvaluesofASVABC,HGCM,HGCF,andMALEwereasshowninTable11.1,and-zhencethevalueofzatthemeanwas-0.136.Fromthisoneobtains1.146foreand0.249forf(z).Thetableshowsthemarginaleffects,calculatedbymultiplyingf(z)bytheestimatesofthecoefficientsofthelogitregression.TABLE11.1LogitEstimation.DependentVariable:COLLEGEVariableMeanbMean´bf(z)bf(z)ASVABC50.150.1326.6170.2490.033HGCM11.650.0790.9200.2490.020HGCF11.820.1031.2170.2490.026MALE0.57-0.572-0.3260.249-0.142Constant1.00-8.565-8.565Total-0.136\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION8Accordingtothecomputations,aone-pointincreaseintheASVABCscoreincreasestheprobabilityofgoingtocollegeby3.3percent,everyadditionalyearofschoolingofthemotherincreasestheprobabilityby2.0percent;everyadditionalyearofschoolingofthefatherincreasestheprobabilityby2.6percent,andbeingmalereducestheprobabilityby14.2percent.Fromtheregressionoutputitcanbeseenthatalloftheseeffects,exceptthatoftheschoolingofthemother,weresignificantatthe1%levelorhigher.Exercise11.1Inves
tigatethefactorsaffectinggraduationfromhighschoolusingyourEAEFdataset.DefineabinaryvariableGRADtobeequalto1ifHGC>11and0otherwise.RegressGRADonASVABC,MALE,HGCMandHGCF(1)usingordinaryleastsquares,and(2)usinglogitanalysis.CalculatethemarginaleffectsinthelogitanalysisandcomparethemwiththoseobtainedusingOLS.11.3ProbitAnalysisAnalternativeapproachtothebinarychoicemodelistousethecumulativestandardizednormaldistributiontomodelthesigmoidrelationshipF(z).(Astandardizednormaldistributionisonewith0meanandunitvariance).Aswithlogitanalysis,youstartbydefiningavariablezthatisalinearfunctionofthevariablesthatdeterminetheprobability:z=a+b1x1+b2x2+...+bkxk(11.11)F(z),thestandardizedcumulativenormaldistribution,givestheprobabilityoftheeventoccurringforanyvalueofz.pi=F(zi)(11.12)Maximumlikelihoodanalysisisusedtoobtainestimatesoftheparameters.Themarginaleffectofxiis¶p/¶xithat,asinthecaseoflogitanalysis,isbestcomputedas¶pdp¶z=.=f(z).bi(11.13)¶xidz¶xiNowsinceF(z)isthecumulativestandardizednormaldistribution,f(z),itsderivative,isjustthestandardizednormaldistributionitself:121-zfz()=e2(11.14)2p\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION91.000.40.750.30.500.2MarginaleffectCumulativeeffect0.250.10.000-3-2-1012zFigure11.6.CumulativeandmarginalnormaleffectsofzFigure11.6plotsF(z)andf(z)forprobitanalysis.Aswithlogitanalysis,themarginaleffectofanyvariableisnotconstant.Itdependsonthevalueoff(z),whichinturndependsonthevaluesofeachoftheexplanatoryvariables.Toobtainasummarystatisticforthemarginaleffect,theusualprocedureisparalleltothatusedinlogitanalysis.Youcalculatezforthemeanvaluesoftheexplanatoryvariables.Nextyoucalculatef(z),asin(11.14).Thenyoucalculatef(z)bitoobtainthemarginaleffectofxi.Thiswillbeillustratedusingthegoing-to-collegeexample,usingthesamespecificationasinthelogitregression..Theregressionoutput(withiterationmessagesdeleted)isshownbelow:.probitcollegeasvabchgcmhgcfmaleProbitEstimatesNumberofobs=570chi2(4)=191.89Prob>chi2=0.0000LogLikelihood=-299.13589PseudoR2=0.2428--------------
----------------------------------------------------------------college|Coef.Std.Err.zP>|z|[95%Conf.Interval]---------+--------------------------------------------------------------------asvabc|.0787402.00817629.6300.000.0627151.0947653hgcm|.0407284.03183581.2790.201-.0216687.1031254hgcf|.0565258.02332072.4240.015.0108181.1022335male|-.3295115.1208056-2.7280.006-.566286-.0927369_cons|-4.945308.4476284-11.0480.000-5.822643-4.067972------------------------------------------------------------------------------Thecomputationofthemarginaleffects,atthesamplemeans,isshowninTable11.2.zis-0.021whenevaluatedatthemeanvaluesofthevariablesandf(z)is0.399.IfyoucompareTables11.1and11.2,youwillseethattheestimatesofthemarginaleffectsareverysimilarforthelogitandprobitregressions.Thisistypicallythecaseforsamplesofobservationswithbroadlysimilarnumbershavingoutcomes1and0.Howevertheshapesofthetailsofthelogitandprobitdistributionsaredissimilarandtheymayyielddifferentresultsifthesampleisdominatedby1or0outcomes.\nBINARYCHOICEMODELSANDMAXIMUMLIKELIHOODESTIMATION10TABLE11.2ProbitEstimation.DependentVariable:COLLEGEVariableMeanbMean´bf(z)bf(z)ASVABC50.150.0793.9610.3990.032HGCM11.650.0410.4780.3990.016HGCF11.820.0570.6740.3990.023MALE0.57-0.330-0.1880.399-0.132Constant1.00-4.945-4.945Total-0.021Exercise11.2RegressthevariableGRADdefinedinExercise11.1onASVABC,MALE,HGCMandHGCFusingprobitanalysis.CalculatethemarginaleffectsandcomparethemwiththoseobtainedusingOLSandlogitanalysis.11.4AnIntroductiontoMaximumLikelihoodEstimationSupposethatarandomvariablexhasanormaldistributionwithunknownmeanmandstandarddeviations.Forthetimebeingwewillassumethatweknowthatsisequalto1.Wewillrelaxthisassumptionlater.Youhaveasampleoftwoobservations,values4and6,andyouwishtoobtainanestimateofm.Thecommon-senseansweris5,andwehaveseenthatthisisscientificallyrespectableaswellsincethesamplemeanistheleastsquaresestimatorandassuchanunbiasedandefficientestimatorofthepopulationmean,providedcertainassumptionsarevalid.However,wehaveseen
that in practice in econometrics the necessary assumptions, in particular the Gauss-Markov conditions, are often not satisfied and as a consequence least squares estimators lose one or more of their desirable properties. We have seen that in some circumstances they may be inconsistent and we have been concerned to develop alternative estimators that are consistent. Typically we are not able to analyze the finite-sample properties of these estimators and we just hope that the estimators are well-behaved.

Once we are dealing with consistent estimators, there is no guarantee that those based on the least squares criterion of goodness of fit are optimal. Indeed it can be shown that, under certain assumptions, a different approach, maximum likelihood estimation, will yield estimators that, besides being consistent, are asymptotically efficient (efficient in large samples).

To return to the numerical example, suppose for a moment that the true value of \(\mu\) is 3.5. The probability density function of the normal distribution is given by

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (11.15) \]

Figure 11.7. Probability densities at \(x_1 = 4\) and \(x_2 = 6\) conditional on \(\mu = 3.5\) (densities 0.3521 and 0.0175)

Figure 11.7 shows the distribution of x conditional on \(\mu = 3.5\) and \(\sigma = 1\). In particular, the probability density is 0.3521 when x = 4 and 0.0175 when x = 6. The joint probability density for the two observations is the product, 0.0062.

Now suppose that the true value of \(\mu\) is 4. Figure 11.8 shows the distribution of x conditional on this value. The probability density is 0.3989 when x = 4 and 0.0540 when x = 6. The joint probability density for the two observations is now 0.0215. We conclude that the joint probability density of getting values 4 and 6 for the two observations would be more than three times as great if \(\mu\) were 4 than it would be if \(\mu\) were 3.5. In that sense, \(\mu = 4\) is more likely than \(\mu = 3.5\). If we had to choose between these estimates, we should therefore choose 4. Of course we do not have to choose between them. According to the maximum likelihood principle, we should consider all possible values of \(\mu\) and select the one that gives the observations the greatest joint probability density.

Figure 11.8. Probability densities at \(x_1 = 4\) and \(x_2 = 6\) conditional on \(\mu = 4.0\) (densities 0.3989 and 0.0540)

TABLE 11.3

     μ      p(4|μ)    p(6|μ)      L       log(L)
    3.5     0.3521    0.0175    0.0062    -5.0879
    4.0     0.3989    0.0540    0.0215    -3.8379
    4.5     0.3521    0.1295    0.0456    -3.0879
    4.6     0.3332    0.1497    0.0499    -2.9979
    4.7     0.3123    0.1714    0.0535    -2.9279
    4.8     0.2897    0.1942    0.0563    -2.8779
    4.9     0.2661    0.2179    0.0580    -2.8479
    5.0     0.2420    0.2420    0.0585    -2.8379
    5.1     0.2179    0.2661    0.0580    -2.8479
    5.2     0.1942    0.2897    0.0563    -2.8779
    5.3     0.1714    0.3123    0.0535    -2.9279
    5.4     0.1497    0.3332    0.0499    -2.9979
    5.5     0.1295    0.3521    0.0456    -3.0879
    6.0     0.0540    0.3989    0.0215    -3.8379
    6.5     0.0175    0.3521    0.0062    -5.0879

Table 11.3 computes the probability densities at x = 4 and x = 6 for values of \(\mu\) from 3.5 to 6.5. The fourth column gives the joint probability density, which is known as the likelihood function. The likelihood function is plotted in Figure 11.9. You can see that it reaches a maximum for \(\mu = 5\), the average value of the two observations. We will now demonstrate mathematically that this must be the case.

Figure 11.9. Likelihood and log-likelihood functions for \(\mu\) (likelihood on a scale from 0 to 0.07 and log-likelihood on a scale from -12 to 0, plotted against \(\mu\) from 0 to 8)

First, a little terminology. The likelihood function, written \(L(\mu \mid x_1 = 4, x_2 = 6)\), gives the joint probability density as a function of \(\mu\), given the sample observations. We will choose \(\mu\) so as to maximize this function.

In this case, given the two observations and the assumption \(\sigma = 1\), the likelihood function is given by

\[ L(\mu) = \left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(4-\mu)^2}\right)\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(6-\mu)^2}\right) \qquad (11.16) \]

We will now differentiate this with respect to \(\mu\) and set the result equal to 0 to obtain the first-order condition for a maximum. We will then differentiate a second time to check the second-order condition. Well, actually we won't. Even with only two observations in the sample, this would be laborious, and when we generalize to n observations it would be very messy. We will use a trick to simplify the proceedings. log L is a monotonically increasing function of L. So the value of \(\mu\) that maximizes L also maximizes log L, and vice versa. log L is much easier to work with, since

\[
\begin{aligned}
\log L &= \log\left[\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(4-\mu)^2}\right)\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(6-\mu)^2}\right)\right] \\
&= \log\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(4-\mu)^2}\right) + \log\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(6-\mu)^2}\right) \\
&= \log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(4-\mu)^2 + \log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(6-\mu)^2
\end{aligned}
\qquad (11.17)
\]

The maximum likelihood estimator, which we will denote \(\hat{\mu}\), is the value of \(\mu\) that maximizes this function, given the data for x. It is given by the first-order condition

\[ \frac{d \log L}{d\mu} = (4 - \hat{\mu}) + (6 - \hat{\mu}) = 0 \qquad (11.18) \]

Thus \(\hat{\mu} = 5\). The second derivative is -2, so this gives a maximum value for log L, and hence L. [Note that \(-\frac{1}{2}(a-\mu)^2 = -\frac{1}{2}a^2 + a\mu - \frac{1}{2}\mu^2\). Hence the differential with respect to \(\mu\) is \((a - \mu)\).]

Generalization to a Sample of n Observations

Consider a sample that consists of n observations \(x_1, \dots, x_n\). The likelihood function \(L(\mu \mid x_1, \dots, x_n)\) is now the product of n terms:

\[ L(\mu) = \left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_1-\mu)^2}\right) \times \dots \times \left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_n-\mu)^2}\right) \qquad (11.19) \]

The log-likelihood function is now the sum of n terms:

\[
\begin{aligned}
\log L &= \log\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_1-\mu)^2}\right) + \dots + \log\left(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_n-\mu)^2}\right) \\
&= \log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(x_1-\mu)^2 + \dots + \log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{1}{2}(x_n-\mu)^2
\end{aligned}
\qquad (11.20)
\]

Hence the maximum likelihood estimator of \(\mu\) is given by

\[ \frac{d \log L}{d\mu} = (x_1 - \hat{\mu}) + \dots + (x_n - \hat{\mu}) = 0 \qquad (11.21) \]

Thus

\[ \sum x_i - n\hat{\mu} = 0 \qquad (11.22) \]

and hence the maximum likelihood estimator of \(\mu\) is the sample mean. Note that the second derivative is -n, confirming that the log-likelihood has been maximized.

Generalization to the Case Where \(\sigma\) is Unknown

We will now relax the assumption that \(\sigma\) is equal to 1 and accept that in practice it would be unknown, like \(\mu\). We will investigate the determination of its maximum likelihood estimate graphically using the two-observation example and then generalize to a sample of n observations.

Figure 11.10. Probability densities at \(x_1 = 4\) and \(x_2 = 6\) conditional on \(\sigma = 2\) (both densities 0.1760)

Figure 11.11. Probability densities at \(x_1 = 4\) and \(x_2 = 6\) conditional on \(\sigma = 0.5\) (both densities 0.1080)

Figure 11.10 shows the probability distribution for x conditional on \(\mu\) being equal to 5 and \(\sigma\) being equal to 2. The probability density at \(x_1 = 4\) and \(x_2 = 6\) is 0.1760 and the joint density 0.0310. Clearly we would obtain higher densities, and higher joint density, if the distribution had smaller variance. If we try \(\sigma\) equal to 0.5, we obtain the distribution shown in Figure 11.11. Here the individual densities are 0.1080 and the joint density 0.0117. Clearly we have made the distribution too narrow, for \(x_1\) and \(x_2\) are now in its tails with even lower density than before.

Figure 11.12 plots the joint density as a function of \(\sigma\). We can see that it is maximized when \(\sigma\) is equal to 1, and this is therefore the maximum likelihood estimate, provided that we have been correct in assuming that the maximum likelihood estimate of \(\mu\) is 5.

Figure 11.12. Likelihood function for \(\sigma\) (L, on a scale from 0 to 0.06, plotted against \(\sigma\) from 0 to 4.5)

We will now derive the maximum likelihood estimators of both \(\mu\) and \(\sigma\) simultaneously, for the general case of a sample of n observations. The likelihood function is

\[ L(\mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2}\right) \times \dots \times \left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2}\right) \qquad (11.23) \]

and so the log-likelihood function is

\[
\begin{aligned}
\log L &= \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2}\right) \times \dots \times \left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2}\right)\right] \\
&= \log\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2}\right) + \dots + \log\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2}\right) \\
&= n\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2 - \dots - \frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2 \\
&= n\log\frac{1}{\sigma} + n\log\frac{1}{\sqrt{2\pi}} + \frac{1}{\sigma^2}\left(-\frac{1}{2}(x_1-\mu)^2 - \dots - \frac{1}{2}(x_n-\mu)^2\right)
\end{aligned}
\qquad (11.24)
\]

The partial derivative of this with respect to \(\mu\) is

\[ \frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\left[(x_1-\mu) + \dots + (x_n-\mu)\right] \qquad (11.25) \]

Setting this equal to 0, one finds that the maximum likelihood estimator of \(\mu\) is the sample mean, as before. The partial derivative with respect to \(\sigma\) is

\[ \frac{\partial \log L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum (x_i-\mu)^2 \qquad (11.26) \]

Substituting its maximum likelihood estimator for \(\mu\), and putting the expression equal to 0, we obtain

\[ \hat{\sigma}^2 = \frac{1}{n}\sum (x_i-\bar{x})^2 \qquad (11.27) \]

Note that this is actually biased downwards in finite samples, the unbiased estimator being given by the same expression with n replaced by (n - 1). However, it is asymptotically more efficient using the mean square error criterion, its smaller variance more than compensating for the bias. The bias in any case attenuates as the sample size becomes large.

Application to the Simple Regression Model

Suppose that y depends on x according to the simple relationship

\[ y = \alpha + \beta x + u \qquad (11.28) \]

Suppose that x has value \(x_i\) in observation i. Potentially, before the observations are generated, \(y_i\) has a distribution around \((\alpha + \beta x_i)\), according to the value of the disturbance term. We will assume that the disturbance term is normally distributed with 0 mean and standard deviation \(\sigma\), so

\[ f(u) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{u}{\sigma}\right)^2} \qquad (11.29) \]

The probability density of y taking a specific value \(y_i\) in observation i is determined by the probability density of \(u_i\) being equal to \((y_i - \alpha - \beta x_i)\). Given the expression above, this is

\[ \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i-\alpha-\beta x_i}{\sigma}\right)^2} \qquad (11.30) \]

The joint probability density function for the observations in the sample is the product of the terms for each observation. Taking the observations as given, and treating the unknown parameters as variables, we say that the likelihood function for \(\alpha\), \(\beta\) and \(\sigma\) is given by

\[ L(\alpha, \beta, \sigma \mid x_1, \dots, x_n) = \left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_1-\alpha-\beta x_1}{\sigma}\right)^2}\right) \times \dots \times \left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_n-\alpha-\beta x_n}{\sigma}\right)^2}\right) \qquad (11.31) \]

The log-likelihood function is thus given by

\[ \log L = n\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2\sigma^2}\left[(y_1-\alpha-\beta x_1)^2 + \dots + (y_n-\alpha-\beta x_n)^2\right] \qquad (11.32) \]

The values of \(\alpha\) and \(\beta\) that maximize this function are exactly the same as those obtained using the least squares principle. However, the estimate of \(\sigma\) is slightly different.

12 MODELS USING TIME SERIES DATA

12.1 Static Models

We now switch from cross-section to time series data. Much of the analysis will be illustrated with a core data set for fitting demand functions. It is drawn from national accounts data published by the U.S. Bureau of the Census and consists of annual aggregate data on 24 different categories of consumer expenditure for the period 1959-1994, along with data on disposable personal income, DPI, and price index numbers for the 24 categories. The data set, together with a manual giving more detailed information about it, can be downloaded from the website. It will be updated periodically, so the time period covered will be extended and there may be changes to the categories of consumer expenditure.

Two of the categories, FOOD and HOUS (consumer expenditure on food and housing services, respectively) are used as examples in the text and exercises. The other categories are intended for practical work by a small group of students, each student working with a different category, starting with a simple regression specification and gradually developing a more sophisticated one.

We will start with a very simple specification for the demand equation for housing services, regressing consumer expenditure on this category, HOUS, on DPI and a price index for housing, PRELHOUS:

\[ HOUS_t = \alpha + \beta_1 DPI_t + \beta_2 PRELHOUS_t + u_t \qquad (12.1) \]

Figure 12.1. Relative price series for housing services, 1959-1994 (1992 = 100)

© C. Dougherty 2000. All rights reserved. Copies may be made for personal use. Version of 16 February 2000.
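As a rough sketch of how a specification of the form (12.1) is fitted by least squares, the following Python fragment regresses a dependent variable on a constant, an income series and a relative price index. Everything numerical here is an assumption made for illustration: the series are synthetic (they are not the Demand Functions data set), the "true" parameters 10, 0.15 and -0.5 are invented, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic illustration only: these are NOT the Demand Functions data.
rng = np.random.default_rng(0)
n = 36                                    # length of an annual 1959-1994 sample
dpi = np.linspace(1500.0, 5000.0, n)      # hypothetical income series
prel = 100.0 + rng.normal(0.0, 4.0, n)    # hypothetical relative price index
u = rng.normal(0.0, 5.0, n)               # disturbance term
hous = 10.0 + 0.15 * dpi - 0.5 * prel + u # assumed "true" relationship

# Design matrix with a constant, matching the form of equation (12.1).
X = np.column_stack([np.ones(n), dpi, prel])
coef, *_ = np.linalg.lstsq(X, hous, rcond=None)

# Standard errors from the usual OLS formula s^2 (X'X)^-1.
resid = hous - X @ coef
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

for name, b, s in zip(["C", "DPI", "PRELHOUS"], coef, se):
    print(f"{name:9s} {b:12.6f} {s:12.6f} {b / s:10.3f}")
```

The printed columns are the coefficient, its standard error and the t statistic, which is the same information a regression package reports for each variable.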
HOUS and DPI are measured in $ billion at 1992 constant prices. PRELHOUS is an index constructed by dividing the nominal price deflator for housing, PHOUS, by the price deflator for total personal expenditure, PTPE, and multiplying by 100. PRELHOUS thus is a real or relative price index that keeps track of whether housing is becoming more or less expensive relative to other types of expenditure. It is plotted in Figure 12.1, which shows that the relative price declined by about 10% from the early 1960s to the late 1970s and then rose again by about the same amount.

A straightforward linear regression using EViews gives the following result:

    =============================================================
    Dependent Variable: HOUS
    Method: Least Squares
    Sample: 1959 1994
    Included observations: 36
    =============================================================
    Variable       Coefficient   Std. Error   t-Statistic   Prob.
    =============================================================
    C                 2.654701     28.91571      0.091808   0.9274
    DPI               0.151521     0.001243      121.9343   0.0000
    PRELHOUS         -0.556949     0.290640     -1.916285   0.0640
    =============================================================
    R-squared             0.997811   Mean dependent var   429.3306
    Adjusted R-squared    0.997679   S.D. dependent var   149.1037
    S.E. of regression    7.183749   Akaike info criter   4.023298
    Sum squared resid     1703.006   Schwarz criterion    4.155258
    Log likelihood       -120.5012   F-statistic          7522.482
    Durbin-Watson stat    0.809993   Prob(F-statistic)    0.000000
    =============================================================

The equation implies that an increase of $1 billion in disposable personal income leads to an increase of $0.15 billion in expenditure on housing. In other words, out of the marginal dollar, 15 cents is spent on housing. Is this a plausible figure? It is a bit difficult to tell, but certainly housing is the largest category of consumer expenditure and one would expect a substantial coefficient. Note that we are talking about housing services, and not investment in housing. Housing services is the value of the services provided by the existing housing stock. In the case of rented housing, rents are taken as a measure of the value of the services. In the case of owner-occupied housing and housing rented at a subsidized rate, imputed rents,
that is, the market rents the housing could command, are used instead.

The coefficient of PRELHOUS implies that a one-point increase in the price index leads to a reduction of $0.56 billion in expenditure on housing. The constant term literally indicates the amount that would be spent on housing if DPI and PRELHOUS were both 0, but obviously any such interpretation is nonsense. If the observations referred to households, there might be some with no income yet nevertheless purchasing housing services and other essentials with transfer payments, but here we are talking about aggregate data for the whole of the United States and that kind of interpretation is not sensible.

It is common to hypothesize that a constant elasticity function of the type

\[ HOUS = \alpha\, DPI^{\beta_1}\, PRELHOUS^{\beta_2}\, v \qquad (12.2) \]

is mathematically more appropriate for demand functions. Linearizing it by taking logarithms, one obtains

\[ LGHOUS = \alpha' + \beta_1 LGDPI + \beta_2 LGPRHOUS + u, \qquad (12.3) \]

where LGHOUS, LGDPI and LGPRHOUS are the (natural) logarithms of HOUS, DPI and PRELHOUS, respectively, u is the natural logarithm of the disturbance term v, \(\alpha'\) is the logarithm of \(\alpha\), and \(\beta_1\) and \(\beta_2\) are income and price elasticities, respectively. The regression result is now as shown:

    =============================================================
    Dependent Variable: LGHOUS
    Method: Least Squares
    Sample: 1959 1994
    Included observations: 36
    =============================================================
    Variable       Coefficient   Std. Error   t-Statistic   Prob.
    =============================================================
    C                -2.032685     0.322726     -6.298478   0.0000
    LGDPI             1.132248     0.008705      130.0650   0.0000
    LGPRHOUS         -0.227634     0.065841     -3.457323   0.0015
    =============================================================
    R-squared             0.998154   Mean dependent var   5.996930
    Adjusted R-squared    0.998042   S.D. dependent var   0.377702
    S.E. of regression    0.016714   Akaike info criter  -8.103399
    Sum squared resid     0.009218   Schwarz criterion   -7.971439
    Log likelihood        97.77939   F-statistic          8920.496
    Durbin-Watson stat    0.846451   Prob(F-statistic)    0.000000
    =============================================================

The coefficients of LGDPI and LGPRHOUS are direct estimates of the income and price elasticities, respectively. Is 1.13 a plausible income elasticity? Probably. It is conventional to classify consumer expenditure into normal goods and inferior goods, types of expenditure whose income elasticities are positive and negative, respectively, and to subdivide normal goods into necessities and luxuries, types of expenditure whose income elasticities are less than 1 and greater than 1, respectively. Housing is obviously a necessity, so you might expect the elasticity to be positive but less than 1. However, it also has a luxury element, since people spend more on better quality housing as their income rises. Overall the elasticity seems to work out at about 1, so the present estimate seems reasonable.

Exercises

12.1 The results of linear and logarithmic regressions of consumer expenditure on food, FOOD, on DPI and a relative price index series for food, PRELFOOD, using the data for the United States, 1959-1994, in the Demand Functions data set, are summarized below (standard errors in parentheses). Provide an economic interpretation of the coefficients and perform appropriate statistical tests.

\[ \widehat{FOOD} = \underset{(31.9)}{232.6} + \underset{(0.002)}{0.089}\, DPI - \underset{(0.332)}{0.534}\, PRELFOOD \qquad R^2 = 0.989 \]

\[ \widehat{LGFOOD} = \underset{(0.28)}{2.66} + \underset{(0.01)}{0.61}\, LGDPI - \underset{(0.07)}{0.30}\, LGPRFOOD \qquad R^2 = 0.993 \]

12.2 Download the Demand Functions data set and associated manuals from the website. You should choose, or be assigned by your instructor, one category of expenditure, and it may be helpful to simplify the data set by deleting the expenditure and price variables relating to the other categories. Construct a relative price index series for your category as indicated in Section 1 of the Demand Functions Regression Exercises manual. Plot the series and try to explain why it has changed over the time period.

12.3 Regress expenditure on your category on DPI and the relative price index series constructed in Exercise 12.2. Give an economic interpretation of the regression coefficients and perform appropriate statistical tests.

12.4 Regress the logarithm of expenditure on your category on LGDPI and the logarithm of the relative price series. Give an economic interpretation of the regression coefficients and perform appropriate statistical tests.

12.5* Perform a Box-Cox test to determine whether there is a significant difference in the fit of the linear and logarithmic regressions for your category of expenditure.

12.6 Sometimes a time trend is included in a regression as an explanatory variable, acting as a proxy for some gradual change not associated with income or price. Changing tastes might be an example. However, in the present case the addition of a time trend might give rise to a problem of multicollinearity because it will be highly correlated with the income series and perhaps also the price series. Calculate the correlations between the TIME variable in the data set, LGDPI, and the logarithm of expenditure on your category. Regress the logarithm of expenditure on your category on LGDPI, the logarithm of the relative price series, and TIME (not the logarithm of TIME). Provide an interpretation of the regression coefficients, perform appropriate statistical tests, and compare the regression results with those of the same regression without TIME.

12.2 Dynamic Models

Next, we will introduce some simple dynamics. One might suppose that some types of consumer expenditure are largely determined by current income and price, but this is not so for a category like housing that is subject to substantial inertia. We will consider specifications in which expenditure on housing depends on lagged values of income and price. A variable X lagged one time period has values that are simply the previous values of X, and it is conventionally referred to as X(-1). Generalizing, a variable lagged s time periods has the X values s periods previously, and is denoted X(-s). Major regression packages understand this convention and for these there is no need to define lagged variables separately. Table 12.1 shows the data for LGDPI, LGDPI(-1) and LGDPI(-2). Note that there is a very high correlation between LGDPI, LGDPI(-1) and LGDPI(-2), and this is going to cause problems.

The first column of Table 12.2 presents the results of a regression using current income and price. The second and third columns show the results of regressing expenditure on housing on income and price lagged one and two time periods, respectively. It is reasonable to hypothesize that expenditure on a category of consumer expenditure might depend on both current and lagged income and price. The fourth column shows the results of a regression using current income and price and the same variables lagged one time period. The fifth column adds the same variables lagged two time periods as well.

TABLE 12.
1CurrentandLaggedValuesoftheLogarithmofDisposablePersonalIncomeYearLGDPILGDPI(-1)LGDPI(-2)19595.2750--19605.32595.2750-19615.37205.32595.275019625.42675.37205.325919635.47195.42675.372019645.51755.47195.426719655.57065.51755.4719................................................19876.36896.33776.311919886.39846.36896.337719896.42106.39846.368919906.44136.42106.398419916.45396.44136.421019926.47206.45396.4413`19936.48466.47206.453919946.50466.48466.4720\nMODELSUSINGTIMESERIESDATA6TABLE12.2AlternativeDynamicSpecifications,ExpenditureonHousingServicesVariable(1)(2)(3)(4)(5)LGDPI1.130.380.33--(0.01)(0.15)(0.14)LGDPI(-1)1.100.730.28--(0.01)(0.15)(0.21)LGDPI(-2)1.070.48---(0.01)(0.15)LGPRHOUS-0.23-0.19-0.13--(0.07)(0.08)(0.19)LGPRHOUS(-1)-0.200.140.25--(0.06)(0.08)(0.33)LGPRHOUS(-2)-0.19-0.33---(0.06)(0.19)2R0.9980.9990.9980.9990.999Thefirstthreeregressionsarealmostidentical.ThisisbecauseLGDPI,LGDPI(-1)andLGDPI(-2)areveryhighlycorrelated.Thelasttwodisplaytheclassicsymptomsofmulticollinearity.Thepointestimatesareunstableandthestandarderrorsbecomemuchlargerwhencurrentandlaggedvaluesofincomeandpriceareincludedasregressors.Foratypeofexpenditurelikehousing,whereonemightexpectlonglags,thisisclearlynotaconstructiveapproachtodeterminingthelagstructure.Acommonsolutiontothemulticollinearityproblemistohypothesizethatthedynamicprocesshasaparsimoniouslagstructure,thatis,alagstructurethatcanbecharacterizedwithfewparameters.OneofthemostpopularlagstructuresistheKoyckdistribution,whichassumesthatthecoefficientsoftheexplanatoryvariableshavegeometricallydecliningweights.Wewilllookattwosuchmodels,theadaptiveexpectationsmodelandthepartialadjustmentmodel.Exercises12.7GiveaneconomicinterpretationofthecoefficientsofLGDPI,LGDPI,andLGDPI,incolumn5ofTable12.2.12.8Toallowforthepossibilitythatexpenditureonyourcategoryispartlysubjecttoaone-periodlag,regressthelogarithmofexpenditureonyourcommodityonLGDPI,thelogarithmofyourrelativepriceseries,andthosetwovariableslaggedoneperiod.InmostregressionpackagesX(-
1)isrecognizedasXlaggedoneperiodandsothereisnoneedtodefinenewvariables.RepeattheexperimentaddingLGDPI(-2)andthelogarithmofthepriceserieslaggedtwoperiods.RepeatitagainaddingLGDPI(-3)andthelogarithmofthepriceserieslaggedtwoperiods.Comparetheregressionresults,payingattentiontothechangesintheregressioncoefficientsandtheirstandarderrors.\nMODELSUSINGTIMESERIESDATA712.3TheAdaptiveExpectationsModelThemodelingofexpectationsisfrequentlyanimportantanddifficulttaskoftheappliedeconomistusingtimeseriesdata.Thisisespeciallytrueinmacroeconomics,inthatinvestment,saving,andthedemandforassetsareallsensitivetoexpectationsaboutthefuture.Introductorymacroeconomictextbooks,discussingthebasicincomedetermination(IS-LM)model,tendtotreataggregateinvestmentasifitwerefixed,oratmostastraightforwarddecliningfunctionoftherateofinterest.Youarethengivenproblemssuchas,investigatetheeffectofanincreaseingovernmnetexpenditureonaggregateouput,andyouaresupposedtoassumethattheonlyeffectonaggregateinvestimentwillbeviatherateofinterest.Thisisnotthecase.Ifthegovernmentundertakesanexpansionarypolicy,thiswillaffectbusinessmen'sexpectationsaboutthefuturestateoftheeconomyingeneral,andprofitabilityinparticular,andhenceinfluencetheirplans,irrespectiveofwhathappenstotherateofinterest.Forexample,ifthereisconsiderableunemployment,thegovernmentactionmaybeseenasunequivocallybeneficial,andinvestmentwillbestimulated.Ontheotherhand,iftheeconomyisnearfullemployment,thegovernmentpolicymaybeseenasleadingtoanincreaseintherateofinflation,withtheresultthatbusinessconfidenceisunderminedandinvestmentfalls.Itisquiteatrickymatter,asKeynesstressedinhisGeneralTheory.Certainlyhediscussesatlengththemarginalefficiencyofcapitalschedule,whereinvestmentisrelatedtotherateofinterest,buthealsoemphasizesthedependenceofinvestmentonexpectationsandleavesyouinnodoubtthathethoughtthattheIScurve,aswenowdescribeit,washighlyvolatile.Unfortunately,thereisnosatisfactorywayofmeasuringexpectationsdirectlyformacroeconomicpurposes.Consequently,macroeconomic
models tend not to give particularly accurate forecasts, and this makes economic management difficult.

As a makeshift solution, some models use an indirect technique known as the adaptive expectations process. This involves a simple learning process in which in each time period the actual value of the variable is compared with the value that had been expected. If the actual value is greater, the expected value is adjusted upwards for the next period. If it is lower, the expected value is adjusted downwards. The size of the adjustment is hypothesized to be proportional to the discrepancy between the actual and expected value.

If $x$ is the variable in question, and $x^e_t$ is the value expected in time period $t$ given the information available at time period $t-1$, then

$$x^e_{t+1} - x^e_t = \lambda (x_t - x^e_t) \qquad (0 \le \lambda \le 1) \tag{12.4}$$

This can be rewritten

$$x^e_{t+1} = \lambda x_t + (1-\lambda) x^e_t, \qquad (0 \le \lambda \le 1) \tag{12.5}$$

which states that the expected value of $x$ in the next time period is a weighted average of the actual value of $x$ in the current time period and the value that had been anticipated. The larger the value of $\lambda$, the quicker the expected value adjusts to previous actual outcomes.

For example, suppose that you hypothesize that a dependent variable, $y_t$, is related to the expected value of the explanatory variable, $x$, in year $t+1$, $x^e_{t+1}$:

$$y_t = \alpha + \beta x^e_{t+1} + u_t. \tag{12.6}$$

(12.6) expresses $y_t$ in terms of $x^e_{t+1}$, which is unobservable and must somehow be replaced by observable variables, that is, by actual current and lagged values of $x$, and perhaps lagged values of $y$. We start by substituting for $x^e_{t+1}$ using (12.5):

$$\begin{aligned} y_t &= \alpha + \beta\left(\lambda x_t + (1-\lambda) x^e_t\right) + u_t \\ &= \alpha + \beta\lambda x_t + \beta(1-\lambda) x^e_t + u_t \end{aligned} \tag{12.7}$$

Of course we still have the unobservable variable $x^e_t$ as an explanatory variable, but if (12.5) is true for time period $t$, it is also true for time period $t-1$:

$$x^e_t = \lambda x_{t-1} + (1-\lambda) x^e_{t-1}. \tag{12.8}$$

Substituting for $x^e_t$ in (12.7), we now have

$$y_t = \alpha + \beta\lambda x_t + \beta\lambda(1-\lambda) x_{t-1} + \beta(1-\lambda)^2 x^e_{t-1} + u_t \tag{12.9}$$

After lagging and substituting $s$ times, the expression becomes

$$\begin{aligned} y_t = {} & \alpha + \beta\lambda x_t + \beta\lambda(1-\lambda) x_{t-1} + \beta\lambda(1-\lambda)^2 x_{t-2} + \dots \\ & + \beta\lambda(1-\lambda)^{s-1} x_{t-s+1} + \beta(1-\lambda)^s x^e_{t-s+1} + u_t \end{aligned} \tag{12.10}$$

Now it is reasonable to suppose that $\lambda$ lies between 0 and 1, in which case $(1-\lambda)$ will also lie between 0 and 1. Thus $(1-\lambda)^s$ becomes progressively smaller as $s$ increases. Eventually there will be a point where the term $\beta(1-\lambda)^s x^e_{t-s+1}$ is so small that it can be neglected and we have a model in which all the variables are observable.

Because the lagged variables have geometrically declining weights, the lag structure is described as having a Koyck distribution. As can be seen from (12.10), it is highly parsimonious in terms of its parameterization, requiring only one parameter more than the static version. Since it is nonlinear in the parameters, OLS should not be used to fit it, for two reasons. First, multicollinearity would almost certainly make the estimates of the coefficients so erratic that they would be worthless; it is precisely this problem that caused us to search for another way of specifying a lag structure. Second, the point estimates of the coefficients would yield conflicting estimates of the parameters. For example, suppose that the fitted relationship began:

$$\hat{y}_t = 101 + 0.60 x_t + 0.45 x_{t-1} + 0.20 x_{t-2} + \dots \tag{12.11}$$

Relating the theoretical coefficients of the current and lagged values of $x$ in (12.10) to the estimates in (12.11), one has $\beta\lambda = 0.60$, $\beta\lambda(1-\lambda) = 0.45$, and $\beta\lambda(1-\lambda)^2 = 0.20$. From the first two you could infer that $\beta$ was equal to 2.40 and $\lambda$ was equal to 0.25, but these values would conflict with the third equation (they imply a coefficient of about 0.34 for $x_{t-2}$, not 0.20) and indeed with the equations for all the remaining coefficients in the regression.

Instead, a nonlinear estimation technique should be used. Most major regression applications have facilities for performing nonlinear regressions built into them. If your application does not, you could fit the model using a grid search. It is worth describing this technique, despite the fact that it is obsolete, because it makes it clear that the problem of multicollinearity has been solved. We rewrite (12.10) as two equations:

$$y_t = \alpha + \beta z_t + u_t \tag{12.12}$$

$$z_t = \lambda x_t + \lambda(1-\lambda) x_{t-1} + \lambda(1-\lambda)^2 x_{t-2} + \lambda(1-\lambda)^3 x_{t-3} + \dots \tag{12.13}$$

The values of $z_t$ depend of course on the value of $\lambda$. You construct ten versions of the $z_t$ variable using the following values for $\lambda$: 0.1, 0.2, 0.3, ..., 1.0 and fit (12.12) with each of them. The version with the lowest residual sum of squares is by definition the least squares solution. Note that each regression is a simple regression of $y$ on one version of $z$, and so the problem of multicollinearity has been completely eliminated.

Table 12.3 shows the parameter estimates and residual sums of squares for a grid search where the dependent variable was the logarithm
of housing services and the explanatory variables were the logarithms of DPI and the relative price series for housing. Eight lagged values were used. You can see that the optimal value of $\lambda$ is between 0.4 and 0.5, and that the income elasticity is about 1.13 and the price elasticity about -0.32. If we had wanted a more accurate estimate of $\lambda$, we could have continued the grid search with steps of 0.01 over the range from 0.4 to 0.5. Note that the implicit income coefficient for $x_{t-8}$, $\beta\lambda(1-\lambda)^8$, was about $1.13 \times 0.5^9 = 0.0022$. The corresponding price coefficient was even smaller. Hence in this case eight lags were more than sufficient.

TABLE 12.3
Logarithmic Regression of Expenditure on Housing Services on Disposable Personal Income and a Relative Price Index, Assuming an Adaptive Expectations Model, Fitted Using a Grid Search

lambda   b1      s.e.(b1)   b2       s.e.(b2)   RSS
0.1      1.67    0.01       -0.35    0.07       0.001636
0.2      1.22    0.01       -0.28    0.04       0.001245
0.3      1.13    0.01       -0.28    0.03       0.000918
0.4      1.12    0.01       -0.30    0.03       0.000710
0.5      1.13    0.01       -0.32    0.03       0.000666
0.6      1.15    0.01       -0.34    0.03       0.000803
0.7      1.16    0.01       -0.36    0.03       0.001109
0.8      1.17    0.01       -0.38    0.04       0.001561
0.9      1.17    0.01       -0.39    0.04       0.002137
1.0      1.18    0.01       -0.40    0.05       0.002823

Dynamics in the Adaptive Expectations Model

As you can see from (12.10), the current value of $x$ has coefficient $\beta\lambda$ in the equation for $y_t$. This is the short-run or impact effect of $x$ on $y$. At time $t$, the terms involving lagged values of $x$ are already determined and hence effectively form part of the intercept in the short-run relationship. However we can also derive a long-run relationship between $y$ and $x$ by seeing how the equilibrium value of $y$ would be related to the equilibrium value of $x$, if equilibrium were ever achieved. Denoting equilibrium $y$ and $x$ by $\bar{y}$ and $\bar{x}$, respectively, in equilibrium $y_t = \bar{y}$ and $x_t = x_{t-1} = x_{t-2} = \dots = \bar{x}$. Substituting into (12.10), one has

$$\begin{aligned} \bar{y} &= \alpha + \beta\lambda\bar{x} + \beta\lambda(1-\lambda)\bar{x} + \beta\lambda(1-\lambda)^2\bar{x} + \dots \\ &= \alpha + \beta\bar{x}\left[\lambda + \lambda(1-\lambda) + \lambda(1-\lambda)^2 + \dots\right] \\ &= \alpha + \beta\bar{x} \end{aligned} \tag{12.14}$$

To see the last step, write

$$S = \lambda + \lambda(1-\lambda) + \lambda(1-\lambda)^2 + \dots \tag{12.15}$$

Then

$$(1-\lambda)S = \lambda(1-\lambda) + \lambda(1-\lambda)^2 + \lambda(1-\lambda)^3 + \dots \tag{12.16}$$

Subtracting (12.16) from (12.15),

$$S - (1-\lambda)S = \lambda \tag{12.17}$$

and hence $S$ is equal to 1. Thus the long-run effect of $x$ on $y$ is given by $\beta$.

An alternative way of exploring the dynamics of an adaptive expectations model is to perform what is known as a Koyck transformation. This allows us to express the dependent variable in terms of a finite number of observable variables: the current values of the explanatory variable(s) and the dependent variable itself, lagged one time period. We start again with the original equations and combine them to obtain (12.20):

$$y_t = \alpha + \beta x^e_{t+1} + u_t \tag{12.18}$$

$$x^e_{t+1} = \lambda x_t + (1-\lambda) x^e_t \tag{12.19}$$

$$\begin{aligned} y_t &= \alpha + \beta\left(\lambda x_t + (1-\lambda) x^e_t\right) + u_t \\ &= \alpha + \beta\lambda x_t + \beta(1-\lambda) x^e_t + u_t \end{aligned} \tag{12.20}$$

Now if (12.18) is true for time $t$, it is also true for time $t-1$:

$$y_{t-1} = \alpha + \beta x^e_t + u_{t-1} \tag{12.21}$$

Hence

$$\beta x^e_t = y_{t-1} - \alpha - u_{t-1} \tag{12.22}$$

Substituting this into (12.20), we now have

$$\begin{aligned} y_t &= \alpha + \beta\lambda x_t + (1-\lambda)(y_{t-1} - \alpha - u_{t-1}) + u_t \\ &= \alpha\lambda + (1-\lambda) y_{t-1} + \beta\lambda x_t + u_t - (1-\lambda) u_{t-1} \end{aligned} \tag{12.23}$$

As before the short-run coefficient of $x$ is $\beta\lambda$, the effective intercept for the relationship being $\alpha\lambda + (1-\lambda) y_{t-1}$ at time $t$. In equilibrium, the relationship implies

$$\bar{y} = \alpha\lambda + (1-\lambda)\bar{y} + \beta\lambda\bar{x} \tag{12.24}$$

so that $\bar{y} = \alpha + \beta\bar{x}$, and hence again we obtain the result that $\beta$ gives the long-run effect of $x$ on $y$.

We will investigate the relationship between the short-run and long-run dynamics graphically. We will suppose, for convenience, that $\beta$ is positive and that $x$ increases with time, and we will neglect the effect of the disturbance term. At time $t$, $y_t$ is given by (12.23). $y_{t-1}$ has already been determined, so the term $(1-\lambda) y_{t-1}$ is fixed. The equation thus gives the short-run relationship between $y_t$ and $x_t$. $[\alpha\lambda + (1-\lambda) y_{t-1}]$ is effectively the intercept and $\beta\lambda$ is the slope coefficient. When we come to time $t+1$, $y_{t+1}$ is given by

$$y_{t+1} = \alpha\lambda + \beta\lambda x_{t+1} + (1-\lambda) y_t \tag{12.25}$$

and the effective intercept is now $[\alpha\lambda + (1-\lambda) y_t]$. Since $x$ is increasing, $y$ is increasing, so the intercept is larger than that for $y_t$ and the short-run relationship has shifted upwards. The slope is the same as before, $\beta\lambda$. Thus two factors are responsible for the growth of $y$: the direct effect of the increase in $x$, and the gradual upward shift of the short-run relationship. Figure 12.2 shows the outcomes for time $t$ as far as time $t+4$. You can see that the long-run relationship is steeper than the short-run ones.

[Figure 12.2. Short-run and long-run dynamics in the adaptive expectations model. Vertical axis: $y$, with successive short-run intercepts $\alpha\lambda + (1-\lambda) y_{t-1}$ to $\alpha\lambda + (1-\lambda) y_{t+3}$; horizontal axis: $x$, with values $x_t$ to $x_{t+4}$.]

Example: Friedman’s Permanent Income Hypothesis

Without doubt the most celebrated application of the adaptive expectations model is Friedman’s use of it when fitting an aggregate consumption function using time series data. In the early years after the Second World War, econometricians working with macroeconomic data were puzzled by the fact that the long-run average propensity to consume seemed to be roughly constant despite the marginal propensity to consume being much lower. A model in which current consumption was a function of current income could not explain this phenomenon and was therefore clearly too simplistic. Several more sophisticated models were developed which could explain this apparent contradiction, notably Friedman's Permanent Income Hypothesis, Brown's Habit Persistence Model (discussed in the next section), Duesenberry's Relative Income Hypothesis and the Modigliani-Ando-Brumberg Life Cycle Model.

Under the Permanent Income Hypothesis, permanent consumption, $C^P$, is proportional to permanent income, $Y^P$:

$$C^P = \beta Y^P \tag{12.26}$$

Actual consumption, $C$, and actual income, $Y$, also contain transitory components, $C^T$ and $Y^T$, respectively:

$$C = C^P + C^T \tag{12.27}$$

$$Y = Y^P + Y^T \tag{12.28}$$

It is assumed, at least as a first approximation, that the transitory components of consumption and income have expected value 0 and are distributed independently of their permanent counterparts and of each other. Substituting for $C^P$ in (12.26) using (12.27) one has

$$C = \beta Y^P + C^T \tag{12.29}$$

We thus obtain a relationship between actual consumption and permanent income in which $C^T$ plays the role of a disturbance term, previously lacking in the model.

Earlier, when we discussed the permanent income hypothesis in the context of cross-section data, the observations related to households. When Friedman fitted the model, he actually used aggregate time series data. To solve the problem that permanent income is unobservable, he hypothesized that it was subject to an adaptive expectations process in which the notion of permanent income was updated by a proportion of the difference between actual income and the previous period’s permanent income:

$$Y^P_t - Y^P_{t-1} = \lambda (Y_t - Y^P_{t-1}) \tag{12.30}$$

Hence permanent income at time $t$ is a weighted average of actual income at time $t$ and permanent income at time $t-1$:

$$Y^P_t = \lambda Y_t + (1-\lambda) Y^P_{t-1} \tag{12.31}$$

Friedman used (12.31) to relate permanent income to current and lagged values of income. Of course it cannot be used directly to measure permanent income in year $t$ for the two reasons that we do not know $\lambda$,
and we have no way of measuring $Y^P_{t-1}$. We can solve the second difficulty by noting that, if (12.31) holds for time $t$, it also holds for time $t-1$:

$$Y^P_{t-1} = \lambda Y_{t-1} + (1-\lambda) Y^P_{t-2} \tag{12.32}$$

Substituting this into (12.31), we obtain

$$Y^P_t = \lambda Y_t + \lambda(1-\lambda) Y_{t-1} + (1-\lambda)^2 Y^P_{t-2} \tag{12.33}$$

This includes the unmeasurable term $Y^P_{t-2}$, but we can deal with it by lagging (12.31) two periods and substituting, thus obtaining $Y^P_t$ in terms of $Y_t$, $Y_{t-1}$, $Y_{t-2}$ and $Y^P_{t-3}$. Continuing this process indefinitely, we can write $Y^P_t$ as a weighted sum of current and past measured income:

$$Y^P_t = \lambda Y_t + \lambda(1-\lambda) Y_{t-1} + \lambda(1-\lambda)^2 Y_{t-2} + \lambda(1-\lambda)^3 Y_{t-3} + \dots \tag{12.34}$$

Provided that $\lambda$ lies between 0 and 1, a reasonable assumption, $(1-\lambda)^s$ is a decreasing function of $s$ and eventually the weights attached to lagged values become so small that they can be neglected.

This still leaves us with the problem of estimating $\lambda$. Friedman’s solution was to use a grid search. He tried a large number of different values between 0 and 1, calculated the corresponding permanent income time series for each, regressed consumption on each of these sets of permanent income data, and looked at the fit as measured by $R^2$. He then chose the value of $\lambda$ that produced the data on $Y^P$ that gave him the best fit. Effectively, of course, he was fitting the nonlinear model

$$C_t = \beta\lambda Y_t + \beta\lambda(1-\lambda) Y_{t-1} + \beta\lambda(1-\lambda)^2 Y_{t-2} + \dots + C^T_t \tag{12.35}$$

The dynamic properties of the model are as illustrated in Figure 12.2. Mathematically they are best analyzed by performing the Koyck transformation on the model. This could be done on the lines of equations (12.21)-(12.23) above, or by lagging (12.35) one period and multiplying through by $(1-\lambda)$:

$$(1-\lambda) C_{t-1} = \beta\lambda(1-\lambda) Y_{t-1} + \beta\lambda(1-\lambda)^2 Y_{t-2} + \beta\lambda(1-\lambda)^3 Y_{t-3} + \dots + (1-\lambda) C^T_{t-1} \tag{12.36}$$

Subtracting (12.36) from (12.35), one has

$$C_t - (1-\lambda) C_{t-1} = \beta\lambda Y_t + C^T_t - (1-\lambda) C^T_{t-1} \tag{12.37}$$

and so

$$C_t = \beta\lambda Y_t + (1-\lambda) C_{t-1} + C^T_t - (1-\lambda) C^T_{t-1} \tag{12.38}$$

The short-run marginal propensity to consume is $\beta\lambda$ and the long-run propensity is $\beta$. Since $\lambda$ is less than 1, the model is able to reconcile a low short-run marginal propensity to consume with a higher long-run average propensity.

Exercise

12.9* The output below shows the result of fitting the model

$$LGFOOD = \alpha + \beta_1\lambda\,LGDPI + \beta_1\lambda(1-\lambda)\,LGDPI(-1) + \beta_1\lambda(1-\lambda)^2\,LGDPI(-2) + \beta_2\lambda\,LGPRFOOD + u$$

using the data on expenditure on food in the Demand Functions data set. LGFOOD and LGPRFOOD are the logarithms of expenditure on food and the relative price index series for food. C(1), C(2), C(3), and C(4) are estimates of $\alpha$, $\beta_1$, $\lambda$, and $\beta_2$, respectively. Explain how the regression equation could be interpreted as an adaptive expectations model and discuss the dynamics implicit in it, both short-run and long-run. Should the specification have included further lagged values of LGDPI?

=============================================================
Dependent Variable: LGFOOD
Method: Least Squares
Sample(adjusted): 1961 1994
Included observations: 34 after adjusting endpoints
Convergence not achieved after 100 iterations
LGFOOD=C(1)+C(2)*C(3)*LGDPI+C(2)*C(3)*(1-C(3))*LGDPI(-1)
       +C(2)*C(3)*(1-C(3))*(1-C(3))*LGDPI(-2)+C(4)*LGPRFOOD
=============================================================
              Coefficient   Std. Error   t-Statistic   Prob.
=============================================================
C(1)             2.435198     0.317643      7.666467   0.0000
C(2)             0.604266     0.012630      47.84488   0.0000
C(3)             0.939554     0.345192      2.721829   0.0107
C(4)            -0.251486     0.073084     -3.441051   0.0017
=============================================================
R-squared            0.990881   Mean dependent var   6.132194
Adjusted R-squared   0.989969   S.D. dependent var   0.179564
S.E. of regression   0.017984   Akaike info criter  -5.088545
Sum squared resid    0.009703   Schwarz criterion   -4.908973
Log likelihood       90.50526   F-statistic          1086.639
Durbin-Watson stat   0.613299   Prob(F-statistic)    0.000000
=============================================================

12.4 The Partial Adjustment Model

In the partial adjustment model it is assumed that the behavioral equation determines the desired (or "target") value, $y^*_t$, of the dependent variable, rather than the actual value, $y_t$:

$$y^*_t = \alpha + \beta x_t + u_t \tag{12.39}$$

It is then assumed that the actual increase in the dependent variable, $y_t - y_{t-1}$, is proportional to the discrepancy between the desired value and the previous value, $y^*_t - y_{t-1}$:

$$y_t - y_{t-1} = \lambda (y^*_t - y_{t-1}) + v_t \qquad (0 \le \lambda \le 1) \tag{12.40}$$

where $v_t$ is a disturbance term. This may be rewritten

$$y_t = \lambda y^*_t + (1-\lambda) y_{t-1} + v_t \tag{12.41}$$

so it can be seen that $y_t$ is a weighted average of the current desired value and the previous actual value. The higher the value of $\lambda$, the more rapid the adjustment process. If $\lambda$ is equal to 1, $y_t$ is equal to $y^*_t$ and there is full adjustment in one period. At the other extreme, if $\lambda$ is equal to 0, $y_t$ does not adjust at all.

Substituting for $y^*_t$ from the target relationship, one obtains

$$\begin{aligned} y_t &= \lambda(\alpha + \beta x_t + u_t) + (1-\lambda) y_{t-1} + v_t \\ &= \alpha\lambda + \beta\lambda x_t + (1-\lambda) y_{t-1} + w_t \end{aligned} \tag{12.42}$$

where $w_t = \lambda u_t + v_t$. Thus the parameters $\alpha$, $\beta$ and $\lambda$ of the behavioral model can be estimated by regressing $y_t$ on $x_t$ and $y_{t-1}$. The model relates $y$ to the current value of $x$ and the lagged value of itself, and so has the same structure as the Koyck-transformed version of the adaptive expectations model. It follows that its dynamics are exactly the same. The coefficient of $y_{t-1}$ yields an estimate of $(1-\lambda)$ and hence of $\lambda$, the speed of adjustment. The coefficient of $x_t$ gives the short-run effect of $x$ on $y$ and also, when divided by the estimate of $\lambda$, the long-run effect.

Example: Brown’s Habit-Persistence Model of Aggregate Consumption

The first attempts by econometricians to fit an aggregate consumption function naturally used the simple static model

$$C_t = \alpha + \beta Y_t + u_t \tag{12.43}$$

With estimates of $\beta$ well below 1, this model implied that the average propensity to consume should fall. Nevertheless long-run data showed no tendency for this to happen. Consequently macroeconomists started looking for more elaborate models that could reconcile these apparently contradictory facts. Friedman’s Permanent Income Hypothesis was one such model. Another was Brown’s Habit-Persistence Model (Brown, 1952). In this model, desired consumption $C^*_t$ was related to wage income, $W_t$, and non-wage income, $NW_t$:

$$C^*_t = \alpha + \beta_1 W_t + \beta_2 NW_t + \delta A + u_t \tag{12.44}$$

Brown used aggregate data for Canada for the years 1926-1949, omitting the war years 1942-1945, $A$ being a dummy variable equal to 0 for the pre-war period and 1 for the post-war period. The division of income into wage income and non-wage income follows the observation of Michael Kalecki that the marginal propensity to consume out of wage income was likely to be much higher than that for non-wage income, for two reasons. First, non-wage income tends to be received by relatively rich households with higher savings rates than poorer ones. Second, in a market economy, much non-wage income originates as company profits, and companies normally pass on only part of their profits as dividends to shareholders, retaining the remainder for investment in the business.

Because households are slow to adapt their spending patterns in response to changes in income, Brown hypothesized a partial adjustment process for actual consumption:

$$C_t - C_{t-1} = \lambda (C^*_t - C_{t-1}) \tag{12.45}$$

From this one obtains current consumption as the weighted average of desired consumption and consumption in the previous time period:

$$C_t = \lambda C^*_t + (1-\lambda) C_{t-1} \tag{12.46}$$

Substituting for $C^*_t$ from (12.44), one then has an equation in observable variables:

$$\begin{aligned} C_t &= \lambda(\alpha + \beta_1 W_t + \beta_2 NW_t + \delta A + u_t) + (1-\lambda) C_{t-1} \\ &= \alpha\lambda + \beta_1\lambda W_t + \beta_2\lambda NW_t + (1-\lambda) C_{t-1} + \lambda\delta A + \lambda u_t \end{aligned} \tag{12.47}$$

Fitting the model with a simultaneous equations technique, Brown obtained (t statistics in parentheses):

$$\hat{C}_t = \underset{(4.8)}{0.90} + \underset{(7.4)}{0.61}\,W_t + \underset{(4.2)}{0.28}\,NW_t + \underset{(2.8)}{0.22}\,C_{t-1} + \underset{(4.8)}{0.69}\,A \tag{12.48}$$

The variables were all measured in billions of Canadian dollars at constant prices of the 1935-1939 period. From the regression one obtains short-run marginal propensities to consume of 0.61 and 0.28 for wage income and non-wage income, respectively. The coefficient of $C_{t-1}$, 0.22, indicates that 0.78 of the discrepancy between desired and actual consumption is eliminated in one year. Dividing the short-run marginal propensities by the speed of adjustment, one obtains long-run propensities to consume of 0.78 and 0.36 for wage income and non-wage income, respectively.

Comparison of the Friedman and Brown Models

Despite the fact that their theoretical frameworks are completely different, one concerned with the future and expectations, the other concerned with the past and inertia, the Friedman model, in its Koyck-transformed form (12.38), and the habit-persistence model (12.47) are virtually identical. They both incorporate short-run and long-run propensities to consume and a speed of adjustment. The only difference in the variable specification is that the Brown model divides income into wage income and non-wage income. This is a useful refinement that should be applied to the Friedman model as well. Indeed it is now a standard feature of empirical models. The Friedman model does not have an intercept, but this is a minor empirical detail. The disturbance term in the Friedman model is different from that in the Brown model, and its structure may cause problems, but as will be seen in the next chapter, this is likewise not an important difference. This is an example of the problem of observationally equivalent theories, where two or more theories can be used to fit the same data set in the same way and there is no possibility of discriminating between them.

Exercises

12.10 Expenditure on housing services, HOUS, was regressed on DPI, the relative price index for housing, PRELHOUS, and the lagged value of HOUS, HOUS(-1), for the period 1959-1994 for the United States using the Demand Functions data set. The regression was repeated in logarithmic form, LGHOUS being regressed on LGDPI, LGPRHOUS, and LGHOUS(-1), with the results summarized below. Give an interpretation of the regression coefficients, paying attention to the dynamics implicit in the model.

$$\widehat{LGHOUS} = \underset{(0.015)}{-0.39} + \underset{(0.05)}{0.32}\,LGDPI - \underset{(0.02)}{0.07}\,LGPRHOUS + \underset{(0.05)}{0.70}\,LGHOUS(-1) \qquad R^2 = 0.999$$

12.11 Perform regressions parallel to those reported in Exercise 12.10 for your category of expenditure in the Demand Functions data set. Give an interpretation of the regression coefficients, paying attention to the dynamics implicit in the model.

12.12 How would you test Kalecki's assertion, if you had access to Brown's data set?

12.13 In his classic study Distributed Lags and Investment Analysis (1954), Koyck investigated the relationship between investment in railcars and the volume of freight carried on the U.S. railroads using data for the period 1884-1939. Assuming that the desired stock of railcars in year $t$ depended on the volume of freight in year $t-1$ and year $t-2$ and a time trend, and assuming that investment in railcars was subject to a partial adjustment process, he fitted the following regression equation using OLS (standard errors and constant term not reported):

$$\hat{I}_t = 0.077 F_{t-1} + 0.017 F_{t-2} - 0.0033 t - 0.110 K_{t-1} \qquad R^2 = 0.85$$

Provide an interpretation of the equation and describe the dynamic process implied by it. (Note: It is best to substitute $K_t - K_{t-1}$ for $I_t$ in the regression and treat it as a dynamic relationship determining $K_t$.)

12.5 Prediction

Suppose that you have fitted a model

$$y_t = \alpha + \beta x_t + u_t \tag{12.49}$$

to a sample of $T$ time series observations ($t = 1, \dots, T$):

$$\hat{y}_t = a + b x_t \tag{12.50}$$

Given any post-sample period value of $x$, say $x_{T+p}$, you are now in a position to predict the corresponding value of $y$:

$$\hat{y}_{T+p} = a + b x_{T+p} \tag{12.51}$$

There are two reasons why such predictions may be important to you. First, you may be one of those econometricians whose business it
is to peer into the economic future. Some econometricians are concerned with teasing out economic relationships with the aim of improving our understanding of how the economy works, but for others this is only a means to the more practical objective of trying to anticipate what will happen. In most countries macroeconomic forecasting has a particularly high profile, teams of econometricians being employed by the Ministry of Finance or other branches of government, private financial institutions, universities and research institutes, and their predictions are actively used for framing public policy, for commenting on it, or for business purposes. When they are published in the press, they typically attract far more attention than most other forms of economic analysis, both on account of their subject matter and because, unlike most other forms of economic analysis, they are easily understood by the ordinary citizen. Even the most innumerate and nontechnically minded person can have a good understanding of what is meant by estimates of the future levels of unemployment, inflation, etc.

There is, however, a second use of econometric prediction, one that has made it of concern to most econometricians, irrespective of whether they are involved in forecasting. It provides a method of evaluating the robustness of a regression model that is more searching than the diagnostic statistics that have been used so far.

Before we go any further, we will have to clarify what we mean by prediction. Unfortunately, in the econometric literature this term can have several slightly different meanings, according to the status of $x_{T+p}$ in (12.51). We will differentiate between ex-post predictions and forecasts. This classification corresponds to what seems to be the most common usage, but you should be warned that the terminology is not standard.

Ex-Post Predictions

We will describe $\hat{y}_{T+p}$ as an ex-post prediction if $x_{T+p}$ is known. How can this be the case? In general, econometricians make use of all available data, to maximize the sample size and hence minimize the population variances of their estimators, so $x_T$ will simply be the most recent recorded value of $x$ available at the time of running the regression. Nevertheless, there are two circumstances when $x_{T+p}$ will be known as well: when you have waited $p$ or more periods after running the regression, and when you have deliberately terminated the sample period early so that you have a few of the most recent observations left over. The reason for doing the latter, as we shall see in the next section, is to enable you to evaluate the predictive accuracy of the model without having to wait.

For example, referring again to the price-inflation/wage-inflation model of equation (3.39), suppose that we had fitted the equation

$$\hat{p} = 1.0 + 0.80 w \tag{12.52}$$

during the sample period, where $p$ and $w$ are the percentage annual rates of price inflation and wage inflation, respectively, and that we know that the rate of wage inflation was 6% in some prediction period year. Then we can say that the ex-post prediction of the rate of price inflation is $1.0 + 0.80 \times 6 = 5.8$%. We should, of course, be able to compare it immediately with the actual rate of price inflation for that year, and hence we can evaluate the prediction error, which is just the difference between the predicted value and the actual outcome. In general, if $\hat{y}_{T+p}$ is the predicted value, and $y_{T+p}$ the actual outcome, the prediction error is defined as $f_{T+p}$ where

$$f_{T+p} = \hat{y}_{T+p} - y_{T+p} \tag{12.53}$$

Why is there a prediction error? For two reasons. First, $\hat{y}_{T+p}$ has been calculated using the parameter estimates, $a$ and $b$, instead of the true values. And second, $\hat{y}_{T+p}$ cannot take account of the disturbance term $u_{T+p}$, which is a component of $y_{T+p}$. In the discussion that follows, we shall assume that the data include $T+p$ observations on the variables, the first $T$ (the sample period) being used to fit the regression and the last $p$ (the prediction period or prediction interval) being used to check predictive accuracy.

Example

Suppose that, when we fitted the demand function for housing using the Demand Functions data set, we had only used the first 32 observations in the sample, that is, the observations for 1959-90, reserving the last four observations for checking predictions. The fitted equation for 1959-90 is (standard errors in parentheses)

$$\widehat{LGHOUS} = \underset{(0.39)}{-1.96} + \underset{(0.01)}{1.13}\,LGDPI - \underset{(0.08)}{0.24}\,LGPRHOUS \qquad R^2 = 0.998 \tag{12.54}$$

TABLE 12.4
Predicted and Actual Expenditure on Housing Services, 1991-1994

        Logarithms                          Absolute equivalent
Year    Predicted   Actual    Error        Predicted   Actual   Error
        LGHOUS      LGHOUS                 HOUS        HOUS
1991    6.4374      6.4539    -0.0166      624.8       635.2    -10.4
1992    6.4697      6.4720    -0.0023      645.3       646.8    -1.5
1993    6.4820      6.4846    -0.0026      653.3       655.0    -1.7
1994    6.5073      6.5046     0.0027      670.0       668.2     1.8

The predicted values of LGHOUS for the period 1991-94, using the equation and the actual values of disposable personal income and the relative price of housing services for those years, are shown in Table 12.4, together with the actual outcomes and the prediction errors. The predictions, like the basic data, are in logarithmic form. For convenience, Table 12.4 also shows the absolute values, derived from the logarithmic values, expressed in $ billion at 1992 prices.

We can see that in this case the predicted value of expenditure on housing services is roughly 2 percent below the actual outcome in 1991, and very close in the remaining three years. Is this predictive performance satisfactory? We shall see in the next section.

Forecasts

If you are willing to predict a particular value of $y_{T+p}$, without knowing the actual value of $x_{T+p}$, you are said to be making a forecast, at least in the terminology of this text. The macroeconomic divinations published in the press are usually forecasts in this sense. Policymakers, and indeed the general public, are not much interested in two-handed economists (ones who say "on the one hand this … but if … then on the other hand that …"). They want the best possible single-point estimates, perhaps with some indication of the likely margin of error, often not even that. Forecasts are less accurate than predictions because they are subject to an additional source of error, the error in the prediction of $x_{T+p}$. Obviously, those making forecasts normally attempt to minimize the additional error by modeling the behavior of $x$ as carefully as possible, in some instances constructing a separate model for it, in others bringing the equation determining $y$ and the equation determining $x$ together, usually with other relationships as well, in a simultaneous equations model of the type discussed in Chapter 10.

Properties of Least Squares Predictors

In the discussion that follows, we will be concerned with predictions rather than forecasts, the reason being that we are in a position to make statements about the properties of the regression coefficients and the disturbance term, but not about $x$ if its values are not known. First, there is some good news. If $y_{T+p}$ is generated by the same process as the sample period values of $y$ [that is, according to equation (12.49) with $u_{T+p}$ conforming to the Gauss-Markov conditions], and if we make our prediction $\hat{y}_{T+p}$ using equation (12.51), the prediction error $f_{T+p}$ will have 0 mean and minimum variance. The first property is easily demonstrated:

$$\begin{aligned} E(f_{T+p}) &= E(\hat{y}_{T+p}) - E(y_{T+p}) \\ &= E(a + b x_{T+p}) - E(\alpha + \beta x_{T+p} + u_{T+p}) \\ &= E(a) + x_{T+p} E(b) - \alpha - \beta x_{T+p} - E(u_{T+p}) \\ &= \alpha + \beta x_{T+p} - \alpha - \beta x_{T+p} = 0 \end{aligned} \tag{12.55}$$

since $E(a) = \alpha$, $E(b) = \beta$, and $E(u_{T+p}) = 0$. We will not attempt to prove the minimum variance property (for a proof, see Johnston and DiNardo (1997)). Both of these properties carry over to the general case of multiple regression analysis.

In the simple regression case, the population variance of $f_{T+p}$ is given by

$$\sigma^2_{f_{T+p}} = \left\{ 1 + \frac{1}{n} + \frac{(x_{T+p} - \bar{x})^2}{n\,\mathrm{Var}(x)} \right\} \sigma^2_u \tag{12.56}$$

where $\bar{x}$ and $\mathrm{Var}(x)$ are the sample period mean and variance of $x$. Unsurprisingly, this implies that the farther the value of $x$ is from its sample mean, the larger will be the population variance of the prediction error. It also implies, again unsurprisingly, that the larger the sample, the smaller will be the population variance of the prediction error, with a lower limit of $\sigma^2_u$. As the sample becomes large, $a$ and $b$ will tend to their true values (provided that the Gauss-Markov conditions hold), so the only source of error in the prediction will be $u_{T+p}$, and by definition this has population variance $\sigma^2_u$.

Confidence Intervals for Predictions

We can obtain the standard error of the prediction error by replacing $\sigma^2_u$ in (12.56) by $s^2_u$ and taking the square root. Then $(\hat{y}_{T+p} - y_{T+p})$/standard error follows a t distribution with the number of degrees of freedom when fitting the equation in the sample period. Hence we can derive a confidence interval for the actual outcome, $y_{T+p}$:

$$\hat{y}_{T+p} - t_{\text{crit}} \times \text{s.e.}$$

not supplied]) that in large samples

$$d \rightarrow 2 - 2\rho \tag{13.6}$$

If there is no autocorrelation present, $\rho$ is 0, so $d$ should be close to 2. If there is positive autocorrelation, $d$ should tend to be less than 2. If there is negative autocorrelation, it should tend to be greater than 2. The test assumes that $\rho$ lies in the interval $-1 < \rho < 1$ and hence that $d$ lies between 4 and 0.

The null hypothesis for the test is that $\rho$ is equal to 0. Of course, even if $H_0$ is true, $d$ will not be exactly equal to 2, except by freak chance. However a value of $d$ much lower than 2 OHD
YHV#\RX#ZLWKWZR#FKRLFHV1##2QH#LV#WR#DVVXPH#WKDW#+3#LV#WUXH#DQG#WKDW#WKH#ORZ#YDOXH#RI#G#KDV#DULVHQ#DV#D#PDWWHU#RIFKDQFH1###7KH#RWKHU#LV#WKDW#WKH#GLVWXUEDQFH#WHUP#LV#VXEMHFW#WR#SRVLWLYH#DXWRFRUUHODWLRQ1##$V#XVXDO/#WKHFKRLFH#LV#PDGH#E\#HVWDEOLVKLQJ#D#FULWLFDO#YDOXH#GFULW#EHORZ#ZKLFK#G#ZRXOG#QRW#VLQN/#VD\/#PRUH#WKDQ#8SHUFHQW#RI#WKH#WLPH1##,I#G#ZHUH#EHORZ#GFULW/#\RX#ZRXOG#WKHQ#UHMHFW#+3#DW#WKH#8#SHUFHQW#VLJQLILFDQFH#OHYHO17KH#FULWLFDO#YDOXH#RI#G/#DW#DQ\#VLJQLILFDQFH#OHYHO/#GHSHQGV/#DV#\RX#PLJKW#H[SHFW/#RQ#WKH#QXPEHU#RIH[SODQDWRU\#YDULDEOHV#LQ#WKH#UHJUHVVLRQ#HTXDWLRQ#DQG#WKH#QXPEHU#RI#REVHUYDWLRQV#LQ#WKH#VDPSOH18QIRUWXQDWHO\/#LW#DOVR#GHSHQGV#RQ#WKH#SDUWLFXODU#YDOXHV#WDNHQ#E\#WKH#H[SODQDWRU\#YDULDEOHV1##7KXV#LW#LVQRW#SRVVLEOH#WR#FRQVWUXFW#D#WDEOH#JLYLQJ#WKH#H[DFW#FULWLFDO#YDOXHV#IRU#DOO#SRVVLEOH#VDPSOHV/#DV#RQH#FDQZLWK#WKH#W#WHVW#DQG#WKH#)#WHVW/#EXW#LW#LV#SRVVLEOH#WR#FDOFXODWH#XSSHU#DQG#ORZHU#OLPLWV#IRU#WKH#FULWLFDO#YDOXHRI#G1##7KRVH#IRU#SRVLWLYH#DXWRFRUUHODWLRQ#DUH#XVXDOO\#GHQRWHG#G8#DQG#G/1)LJXUH#4616#UHSUHVHQWV#WKH#VLWXDWLRQ#VFKHPDWLFDOO\/#ZLWK#WKH#DUURZ#LQGLFDWLQJ#WKH#FULWLFDO#OHYHO#RI#G/ZKLFK#ZLOO#EH#GHQRWHG#GFULW1##,I#\RX#NQHZ#WKH#H[DFW#YDOXH#RI#GFULW/#\RX#FRXOG#FRPSDUH#WKH#G#VWDWLVWLF#IRU\RXU#UHJUHVVLRQ#ZLWK#LW1##,I#G#!#GFULW/#\RX#ZRXOG#IDLO#WR#UHMHFW#WKH#QXOO#K\SRWKHVLV#RI#QR#DXWRFRUUHODWLRQ1,I#G#?#GFULW/#\RX#ZRXOG#UHMHFW#WKH#QXOO#K\SRWKHVLV#DQG#FRQFOXGH#WKDW#WKHUH#LV#HYLGHQFH#RI#SRVLWLYHDXWRFRUUHODWLRQ1+RZHYHU/#DOO#\RX#NQRZ#LV#GFULW#WKDW#OLHV#VRPHZKHUH#EHWZHHQ#G/#DQG#G81##7KLV#OHDYHV#\RX#ZLWK#WKUHHSRVVLEOH#RXWFRPHV#IRU#WKH#WHVW1\n$872&255(/$7,2183570G8GFUL70G/7)LJXUH#46171##'XUELQ0:DWVRQ#WHVW#IRU#DXWRFRUUHODWLRQ/#VKRZLQJ#WKH#]RQH#RILQGHWHUPLQDF\#LQ#WKH#FDVH#RI#VXVSHFWHG#QHJDWLYH#DXWRFRUUHODWLRQ41G#LV#OHVV#WKDQ#G/1##,Q#WKLV#FDVH/#LW#PXVW#EH#ORZHU#WKDQ#GFULW/#VR#\RX#ZRXOG#FRQFOXGH#WKDW#SRVLWLYHDXWRFRUUHODWLRQ#LV#SUHVHQW151G#LV#JUHDWHU#WKDQ#G81##,Q#WKLV#FDVH/#G#PXVW#EH#JUHDWHU#WKDQ#GFULW/#VR#\RX#ZRXOG#IDLO#WR#UHMHFW#WKHQXOO#K\SRWKHVLV161G#OLHV#EHWZHHQ#G/#DQG#G81##,Q#
this case, $d$ might be greater or less than $d_{\text{crit}}$.

… [not supplied] at the end of this text gives $d_L$ and $d_U$ cross-classified by number of explanatory variables and number of observations, for the 5 percent and 1 percent significance levels. It shows the critical values in the case of positive autocorrelation, the case most frequently encountered in economic models.

… but its variance is given by

$$\sigma_\varepsilon^2 + \sigma_\varepsilon^2 + \dots + \sigma_\varepsilon^2 = t\sigma_\varepsilon^2 \qquad (14.11)$$

and so it is not independent of time. Figure 14.2 provides an example of a random walk.

In the more general version of the autoregressive process with the constant $\mu$, the process becomes what is known as a random walk with drift if $\rho$ equals 1:

$$x_t = x_{t-1} + \mu + \varepsilon_t \qquad (14.12)$$

[Figure 14.2. Random Walk]

INTRODUCTION TO NONSTATIONARY TIME SERIES

[Figure 14.3. Random Walk with Drift]

If the series starts at $x_0$ at time 0, $x_t$ is given by

$$x_t = x_0 + \mu t + \varepsilon_1 + \dots + \varepsilon_t \qquad (14.13)$$

Now the expectation of $x_t$ at time 0, $(x_0 + \mu t)$, is also a function of $t$. Figure 14.3 provides an example of a random walk with drift.

Random walks are not the only type of nonstationary process. Another common example of a nonstationary time series is one possessing a time trend:

$$x_t = \alpha + \beta t + \varepsilon_t \qquad (14.14)$$

Its expected value at time $t$, $(\alpha + \beta t)$, is not independent of $t$ and its population variance is not defined.

Difference-Stationarity and Trend-Stationarity

In the discussion that follows, a distinction will be made between difference-stationary and trend-stationary time series. If a nonstationary process can be transformed into a stationary one by differencing, it is said to be difference-stationary. A random walk, with or without drift, is an example. If $x_t$ is a random walk with drift, as in equation (14.12),

$$\Delta x_t = (x_t - x_{t-1}) = \mu + \varepsilon_t \qquad (14.15)$$

This is a stationary process with population mean $\mu$ and variance $\sigma_\varepsilon^2$, both independent of time. If a time series can be transformed into a stationary process by differencing once, as in this case, it is described as integrated of order 1, I(1). If a time series can be made stationary by differencing twice, it is known as I(2), and so on. To complete the picture, a stationary process, which by definition needs no differencing, is described as I(0).

A nonstationary time series is described as being trend-stationary if it can be transformed into a stationary process by extracting a time trend. For example, the very simple model given by equation (14.14) can be detrended by fitting the equation

$$\hat{x}_t = a + bt \qquad (14.16)$$

and defining a new variable

$$\tilde{x}_t = x_t - \hat{x}_t = x_t - a - bt \qquad (14.17)$$

The new, detrended, variable is of course just the residuals from the regression of $x$ on $t$.

Spurious Regressions

In a celebrated article, Granger and Newbold (1974) report the results of a Monte Carlo experiment in which two variables were generated as independent random walks and then one was regressed on the other. Obviously, a regression of one random walk on another ought not to yield significant results except as a matter of Type I error. Granger and Newbold ran the experiment 100 times and so, using a 5% significance test, one would anticipate that the slope coefficient would not be significantly different from 0 about 95 times, the other 5 times where it appeared to be significant being cases of Type I error. However, they found that the slope coefficient had a significant $t$ statistic on 77 occasions. Using a more cautious 1% test made very little difference. The null hypothesis of no relationship was rejected on 70 occasions.

The reason for this is that random walks are nonstationary, and OLS, and variations like AR(1), yield invalid standard errors and test statistics when the regression model includes nonstationary variables. It is also likely that the estimates of the coefficients will be inconsistent.

14.3   Detection of Nonstationarity

Unfortunately for econometricians working with time series data, many economic time series appear to be of the I(1) type. It is therefore important to test time series for nonstationarity before attempting to use them in a regression model. Testing for nonstationarity is often described as testing for unit roots, for reasons that need not concern us here. The standard test, pioneered by Dickey and Fuller, is based on the model

$$x_t = \rho x_{t-1} + \beta t + \mu + \varepsilon_t \qquad (14.18)$$

This process is difference-stationary if $\rho = 1$ and $\beta = 0$, trend-stationary if $\rho < 1$ and $\beta \neq 0$, and it also allows the possibility that the series is stationary after all ($\rho < 1$ and $\beta = 0$). One can allow for more complex dynamics by letting $x$ depend on further lagged values of itself, for example as in

$$x_t = \rho_1 x_{t-1} + \rho_2 x_{t-2} + \beta t + \mu + \varepsilon_t \qquad (14.19)$$

In this case the process will be stationary if $\rho_1 + \rho_2 < 1$ and $\beta = 0$. It will be difference-stationary if $\rho_1 + \rho_2 = 1$ and $\beta = 0$. It is convenient to rewrite the equation as

$$\Delta x_t = (\rho_1 + \rho_2 - 1) x_{t-1} - \rho_2 \Delta x_{t-1} + \beta t + \mu + \varepsilon_t \qquad (14.20)$$

and test the null hypothesis that the coefficients of $x_{t-1}$ and $t$ are equal to 0. The tests (there is a family of them, because there are various alternatives) are known as augmented Dickey-Fuller tests. The usual procedure is to perform a one-tailed $t$ test on the coefficient of $x_{t-1}$ under the null hypothesis that it is equal to 0 (that is, that there is a unit root and the series is nonstationary). The reason for performing a one-tailed test is that if the alternative hypothesis, stationarity, is correct, the coefficient will be negative. Under the null hypothesis, $t$ does not have its usual distribution and the critical value, for any given significance level, is higher in absolute terms than that shown in the standard tables. Critical values are provided in the appendix to this note.

14.4   Cointegration

If one or more time series in a model exhibit nonstationarity, conventional regression techniques are inappropriate. It is beyond the scope of this introduction to describe what should be done under these circumstances. The literature on the subject, although growing very rapidly, is still
in its infancy.

However, suppose that the model

$$y_t = \alpha + \beta x_t + u_t \qquad (14.21)$$

is a correct specification. The error term $u_t$ can be thought of as measuring the deviation between the components of the model:

$$u_t = y_t - \alpha - \beta x_t \qquad (14.22)$$

In the short run the divergence between the components will fluctuate, but if the model is genuinely correct there should be a limit to the divergence. Hence even if $y_t$ and $x_t$ are nonstationary, $u_t$ should be stationary. If this is found to be the case, $y_t$ and $x_t$ are said to be cointegrated and the relationship is interpreted as a long-run one.

To test for cointegration, the first step is to check that the time series involved are of the same order of integration, for otherwise it is not possible for them to be cointegrated. The second is to run an OLS regression and to test the residuals for nonstationarity using a unit root test. If the null hypothesis is rejected in favor of stationarity, a cointegrating relationship may have been found. Note that, because the OLS regression has used nonstationary time series, the standard errors and other diagnostic statistics should be disregarded.

Exercises

14.1 (A repeat of Granger and Newbold's experiment). Construct two 100-observation random walks and regress one on the other. Does the $t$ statistic on the slope coefficient appear to be significant using a 5% test? Repeat the experiment several times (at least 5 times; 20 would be better) and note the frequency of Type I errors.

14.2 Test the logarithms of disposable personal income, expenditure on your commodity, and the relative price series for your commodity for difference-stationarity. Calculate the first differences and test these for difference-stationarity.

14.3 Run logarithmic regressions of expenditure on your commodity on disposable personal income and relative price and test for cointegration.

Reference

Granger, C.W.J., and P. Newbold (1974) Spurious regressions in econometrics, Journal of Econometrics 2 (2), 111-120
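
A possible starting point for Exercise 14.1 is sketched below in Python. It is only an illustration under stated assumptions, not the procedure used by Granger and Newbold: the function names `slope_t_stat` and `spurious_rejection_rate` are hypothetical, the innovations are assumed to be standard normal, the number of replications (200) is arbitrary, and the two-tailed 5% critical value of 1.984 is an approximation for 98 degrees of freedom. Only NumPy is used.

```python
import numpy as np


def slope_t_stat(y, x):
    """Regress y on a constant and x by OLS; return the t statistic of the slope."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # regressors: intercept and x
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimates (a, b)
    resid = y - X @ coef
    s2 = resid @ resid / (n - 2)                  # estimate of the disturbance variance
    cov = s2 * np.linalg.inv(X.T @ X)             # conventional OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])


def spurious_rejection_rate(n_obs=100, n_reps=200, t_crit=1.984, seed=0):
    """Share of replications in which the slope looks 'significant'.

    Assumptions (not from the text): standard normal innovations and
    t_crit = 1.984 as the approximate two-tailed 5% value for 98 d.f.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        # Two independent random walks without drift: cumulative sums of noise.
        y = np.cumsum(rng.standard_normal(n_obs))
        x = np.cumsum(rng.standard_normal(n_obs))
        if abs(slope_t_stat(y, x)) > t_crit:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    # With a valid 5% test the share would be near 0.05; for independent
    # random walks it is typically far higher.
    print(spurious_rejection_rate())
```

Replacing the `np.cumsum(...)` calls with the raw `rng.standard_normal(n_obs)` draws gives two independent stationary series, for which the rejection share should fall back to roughly the nominal 5 percent, which makes a useful comparison.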