@xxh
2015-10-14T03:12:05.000000Z
PRML Chapter 1
probability theory
- probability is introduced to put the handling of uncertainty on a more scientific footing
- expectation
- def: the weighted average of a function
- if we are given a finite number N of points drawn from the probability distribution or probability density, the expectation can be approximated by the sample average (see the numpy sketch after this list):
- E[f] ≈ (1/N) \sum_{n=1}^{N} f(x_n)
- consider the expectation of a function of several variables, e.g. f(x,y)
- the expectation carries a subscript indicating which variable is being averaged over:
- E_x[f(x,y)] and E_y[f(x,y)]
- variance: var[x] = E[x^2] − E[x]^2
- consider functions of several variables, e.g. f(x,y):
- covariance: cov[x,y] = E_{x,y}[(x − E[x])(y^T − E[y^T])]
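To make the sample-based estimates above concrete, here is a minimal numpy sketch; the choices x ~ N(0,1), f(x) = x², and y = 2x + noise are made-up for illustration, not from PRML:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy data: N points drawn from N(0, 1)
N = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=N)

# E[f] ≈ (1/N) Σ f(x_n), with f(x) = x**2; exact value is 1 here
E_f = np.mean(x ** 2)

# var[x] = E[x^2] − E[x]^2; exact value is 1 here
var_x = np.mean(x ** 2) - np.mean(x) ** 2

# cov[x,y] = E_{x,y}[(x − E[x])(y − E[y])], with y = 2x + noise; exact value is 2
y = 2.0 * x + rng.normal(loc=0.0, scale=0.5, size=N)
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

print(E_f, var_x, cov_xy)  # ≈ 1.0, 1.0, 2.0
```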
interpretation of probabilities
- popular: the classical or frequentist interpretation
- the probability P(A) of an uncertain event A is defined as the frequency of that event based on previous observations
- another point of view: the Bayesian view - probability provides a quantification of uncertainty
- for a future event we have no historical data, so we cannot count a frequency.
- but we can measure the belief in a statement a based on some 'knowledge' K, denoted P(a|K); different K can generate different P(a|K), and even the same K can yield different P(a|K) -- the belief is subjective
- Bayes rule
- consider conditional probabilities
- P(A|B)=P(B|A)P(A)/P(B)
- interpretation: updating our belief about a hypothesis A in the light of new evidence B
- in the likelihood setting this reads: the belief about the output y (playing the role of A) given the input values + parameters (playing the role of B)
- P(A|B): posterior belief
- P(A): prior belief
- P(B|A): likelihood, i.e. how probable the evidence B (the observed data) is if the hypothesis A (the parameter setting) is true.
- P(B) is computed by marginalisation: P(B) = \sum_i P(B|A_i) P(A_i)
- in machine learning, Bayes' theorem is used to convert a prior probability P(A)=P(β) into a posterior probability P(A|B)=P(β|y) by incorporating the evidence provided by the observed data
- for β in the polynomial curve fitting model, we can take an approach based on Bayes' theorem:
- P(β|y) = P(y|β) P(β) / P(y)
- given data {y_1, y_2, ...}, we want to know β but cannot observe it directly; P(β|y) := posterior probability
- P(β):= prior probability; our assumption of β
- P(y):= normalization constant since the given data is fixed
- P(y|β):= likelihood function;
- can be viewed as a function of the parameter β
- not a probability distribution over β, so its integral w.r.t. β need not equal 1
- state Bayes' theorem as: posterior ∝ likelihood × prior, with all three viewed as functions of the parameters β
- integrating both sides over β (the posterior integrates to 1) gives: p(y) = \int p(y|β) p(β) dβ
- issue with the Bayesian approach: the need to marginalize (sum or integrate) over the whole of parameter space, which is often intractable (see the grid-approximation sketch below)
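A hedged numerical sketch of posterior ∝ likelihood × prior, with the evidence p(y) computed by marginalizing over a discretized parameter space; the Bernoulli (coin-toss) model and uniform prior are assumed toy choices standing in for the curve-fitting model:

```python
import numpy as np

# hypothetical data: coin tosses, 1 = heads; β is the unknown P(heads)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
k, n = y.sum(), y.size

# grid approximation: discretize the parameter space of β
beta = np.linspace(1e-3, 1 - 1e-3, 999)
d_beta = beta[1] - beta[0]

# prior p(β): uniform on (0, 1)
prior = np.ones_like(beta)

# likelihood p(y|β), viewed as a function of β (need not integrate to 1)
likelihood = beta ** k * (1 - beta) ** (n - k)

# evidence p(y) = \int p(y|β) p(β) dβ, by marginalization over the grid
evidence = np.sum(likelihood * prior) * d_beta

# posterior = likelihood × prior / evidence, a proper density over β
posterior = likelihood * prior / evidence
print(np.sum(posterior) * d_beta)   # ≈ 1.0: the posterior normalizes
print(beta[np.argmax(posterior)])   # 0.75, the MAP estimate (= MLE k/n here)
```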
different view of likelihood function
- likelihood function: P(y|β)
- from frequentist way of interpretation:
- the parameter β is a fixed quantity whose value is determined by some 'estimator'
- a widely used frequentist estimator is maximum likelihood, in which β is set to the value that maximizes the likelihood function
- i.e. choosing β s.t. the probability of the observed data is maximized
- in practice, one minimizes the negative log of the likelihood function, called the error function; since −log is monotonically decreasing, maximizing the likelihood is equivalent to minimizing this error
- one approach to determining frequentist error bars is the bootstrap (see the sketch at the end of this section):
- s1: from the existing data set (size N), randomly create L data sets (each of size N) by drawing points from the original data set with replacement (some points may be drawn several times, others not at all)
- s2: look at the variability of predictions between the different bootstrap data sets, then evaluate the accuracy of the estimates of the parameter
- drawback of maximum likelihood: it can reach extreme conclusions on small or unlucky data sets, e.g. a fair-looking coin is tossed three times and lands heads each time; maximum likelihood then sets β so that P(lands heads) = 1
- from Bayesian viewpoint: there is only the single, actually observed data set, and the uncertainty in the parameters is expressed through a probability distribution over β
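To illustrate the frequentist machinery above (not the Bayesian view), here is a minimal sketch of a maximum-likelihood estimate with bootstrap error bars; the Gaussian data set and the sample mean as the estimator are made-up choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data set: N points from a Gaussian with unknown mean
data = rng.normal(loc=2.0, scale=1.0, size=50)

# maximum likelihood: for a Gaussian, minimizing the negative log-likelihood
# over the mean yields the sample mean in closed form
mle_mean = data.mean()

# bootstrap, as in s1/s2 above:
# s1: draw L data sets of size N from the original, with replacement
# s2: recompute the estimator on each and look at its variability
L, N = 1000, data.size
boot = np.array([rng.choice(data, size=N, replace=True).mean() for _ in range(L)])
print(mle_mean, boot.std())  # the estimate and its bootstrap standard error

# the drawback from the notes: three tosses, all heads → the MLE sets
# P(heads) = 1, and every bootstrap resample of {1, 1, 1} is again {1, 1, 1}
coins = np.array([1, 1, 1])
print(coins.mean())  # 1.0
```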