If, like me, you learned Bayesian regression from Gelman and Hill’s book, *Data Analysis Using Regression and Multilevel/Hierarchical Models*, you’re going to love:

- Finkel, Jenny Rose and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In *EMNLP*.

### In a Nutshell

Finkel and Manning apply the standard hierarchical model for the parameters of a generalized linear model to structured natural language problems. Although they call it “domain adaptation”, it’s really just a standard hierarchical model. In particular, they focus on CRFs (a kind of structured logistic regression), applying them to named entity recognition and dependency parsing. Their empirical results showed a modest gain from partial pooling over complete pooling or no pooling.

### The Standard Model

In the standard non-hierarchical Bayesian regression model (including generalized linear models like logistic regression), each regression coefficient has a prior, usually in the form of a Gaussian (normal, L2, ridge) or Laplace (L1, lasso) distribution. The maximum likelihood solution corresponds to an improper prior with infinite variance.

In most natural language applications, the prior is assumed to have zero mean and a diagonal covariance matrix with shared variance, so that each coefficient has an independent normal prior with zero mean and the same variance. LingPipe allows arbitrary means and diagonal covariance matrices, as does Genkin, Madigan and Lewis’s package Bayesian Logistic Regression.
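
To make this concrete, here's a minimal NumPy sketch of the objective such a regression minimizes. The names (`neg_log_posterior`, `m`, `v`) are my own for illustration, not LingPipe's or BBR's actual API; it allows a separate prior mean and variance per coefficient, as described above.

```python
# Sketch: negative log posterior for logistic regression with an
# independent Gaussian prior on each coefficient, allowing a separate
# prior mean m[f] and variance v[f] per feature.
import numpy as np

def neg_log_posterior(beta, X, y, m, v):
    """X: (n, F) design matrix; y: (n,) labels in {0, 1};
    m, v: (F,) per-coefficient prior means and variances."""
    z = X @ beta
    # Bernoulli log likelihood, written stably:
    # log p(y | z) = y*z - log(1 + exp(z))
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    # Gaussian log prior, dropping constants that don't involve beta
    log_prior = -0.5 * np.sum((beta - m) ** 2 / v)
    return -(log_lik + log_prior)
```

With `m = 0` and a single shared `v` this is ordinary ridge (L2) logistic regression, and letting `v` go to infinity recovers maximum likelihood.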

With data of the same type (e.g. person, location and organization named-entity data in English) from multiple sources (e.g. CoNLL, MUC6, MUC7), there are two standard approaches. The first is to train individual models, one for each source, then apply each model in-source. That’s the unpooled result, where the data in one domain isn’t used to help train the models for the other domains.

The second approach is to completely pool the data into one big training set and then apply the resulting model to each domain.

### The Hierarchical Model

The hierarchical model generalizes these two approaches and allows a middle ground with partial pooling across domains. It fits a separate coefficient for each domain, with each coefficient drawn from a prior shared across domains. That is, we might have a feature f and coefficients β[1,f], β[2,f] and β[3,f] for three different domains, with the β[i,f] drawn from a shared prior (normal in Finkel and Manning’s case), say Normal(μ[f],σ[f]^{2}).

The prior mean μ[f] and variance σ[f]^{2} for feature f can now be fit in the model along with the coefficients for each domain.

In BUGS-like sampling notation, for a simple classification problem:

    DATA
      F      : number of features
      D      : number of domains
      I[d]   : number of items in domain d, d in 1:D
      x[d,i] : F-dimensional vector being classified in domain d, d in 1:D, i in 1:I[d]
      c[d,i] : discrete category of x[d,i]
      σ[f]^2 : variance of feature f, f in 1:F
      ν      : prior mean for feature means
      τ^2    : prior variance for feature means

    PARAMETERS
      μ[f]   : hierarchical mean for feature f, f in 1:F
      β[d,f] : coefficient for feature f in domain d, f in 1:F, d in 1:D

    MODEL
      μ[f]   ~ Norm(ν, τ^2),                      f in 1:F
      β[d,f] ~ Norm(μ[f], σ[f]^2),                d in 1:D, f in 1:F
      c[d,i] ~ Bern(logit^{-1}(β[d]^t x[d,i])),   d in 1:D, i in 1:I[d]
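
As a sanity check on the notation, here's a minimal NumPy sketch of the joint MAP objective this model implies, with fixed hyperparameters σ², ν and τ². The names and shapes are my own, not Finkel and Manning's code.

```python
# Joint negative log posterior (up to constants) for the BUGS-like
# model above: top-level normal on mu, per-domain normal on beta,
# and a Bernoulli-logit likelihood at the bottom.
import numpy as np

def hier_neg_log_posterior(beta, mu, Xs, ys, sigma2, nu, tau2):
    """beta: (D, F) per-domain coefficients; mu: (F,) shared means;
    Xs, ys: per-domain (n_d, F) design matrices and (n_d,) 0/1 labels."""
    obj = 0.0
    # top level: mu[f] ~ Norm(nu, tau2)
    obj += 0.5 * np.sum((mu - nu) ** 2) / tau2
    for X, y in zip(Xs, ys):
        pass
    for d, (X, y) in enumerate(zip(Xs, ys)):
        # middle level: beta[d,f] ~ Norm(mu[f], sigma2[f])
        obj += 0.5 * np.sum((beta[d] - mu) ** 2 / sigma2)
        # bottom level: Bernoulli-logit likelihood, stable form
        z = X @ beta[d]
        obj -= np.sum(y * z - np.logaddexp(0.0, z))
    return obj
```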

Finkel and Manning fix the prior variances σ[f]^{2} to a single shared value σ^{2} as a hyperparameter, which makes sense because they don’t really have enough data points to fit them in a meaningful way. This is too bad, because the variance is often the most useful part of setting hierarchical priors (see, e.g., Genkin et al.’s papers). I suspect the prior variance is going to be a very sensitive hyperparameter in this model.

Of course, μ[f] itself requires a prior, which Finkel and Manning fix via hyperparameters: mean ν=0 and variance τ^{2}. I suspect τ will also be a pretty sensitive parameter in this model.

The completely pooled result arises as σ^{2} approaches zero, so that the β[d,f] are all equal to μ[f].

The completely unpooled result arises when ν=0 and τ^{2} approaches zero, forcing μ[f]=0 for all f, so that β[d,f] ~ Normal(0,σ[f]^{2}) independently in each domain.
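
To see the continuum between these two limits numerically, it helps to swap the logistic likelihood for a Gaussian one, where the per-domain MAP estimate has a closed form: a precision-weighted average of the domain mean and the shared prior mean. This is a toy of my own, not anything from the paper.

```python
# Toy illustration of the pooling continuum with a Gaussian likelihood:
# y[d,i] ~ Norm(beta[d], s2) and beta[d] ~ Norm(mu, sigma2).
import numpy as np

def map_coef(y_d, mu, s2, sigma2):
    """Closed-form MAP of beta[d]: precision-weighted average of the
    domain sample mean and the shared prior mean mu."""
    n = len(y_d)
    ybar = np.mean(y_d)
    return (n * ybar / s2 + mu / sigma2) / (n / s2 + 1.0 / sigma2)

y_d = np.array([2.0, 2.4, 1.6])   # one domain's data, sample mean 2.0
mu = 0.0                          # shared prior mean
print(map_coef(y_d, mu, 1.0, 1e-8))  # sigma2 near 0: complete pooling, near mu
print(map_coef(y_d, mu, 1.0, 1e8))   # sigma2 huge: no pooling, near 2.0
```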

### Relation to Daumé (2007)

It turns out Hal did the same thing in his 2007 ACL paper, *Frustratingly Easy Domain Adaptation*, only with a different presentation and slightly different parameter tying. I read Hal’s paper and didn’t see the connection; Finkel and Manning thank David Vickrey for pointing out the relation.

### More Levels

Of course, as Finkel and Manning mention, it’s easy to add more hierarchical structure. For instance, with several genres, such as newspapers, television, e-mail and blogs, there could be an additional level in which genre-level priors are drawn from fixed top-level priors (thus pooling across genres), and the within-genre coefficients are drawn from their genre’s prior.

### More Predictors (Random Effects)

There’s no reason these models need to remain strictly hierarchical. They can be extended to general multilevel models by adding predictors at multiple levels that cross-cut rather than nest; for example, we could pool by year and by genre at the same time. Check out Gelman and Hill’s book for lots of examples.

### Estimation

The really cool part is how easy the estimation is. The loss function’s still convex, so all our favorite optimizers work just fine. The hierarchical priors are also easy to differentiate: because normals are in the exponential family and the domains are assumed to be exchangeable, the big product of exponentials turns into a simple sum in log loss space, where the gradient is computed (see the paper for the formula).
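
Here's a hedged NumPy sketch of those gradients in my own notation (not the paper's formula): the prior layers contribute precision-scaled differences, and the likelihood contributes the usual logistic-regression residuals.

```python
# Gradients of the joint MAP objective: Gaussian prior terms give
# (value - mean) / variance, the logistic likelihood gives X^T (p - y).
import numpy as np

def grads(beta, mu, Xs, ys, sigma2, nu, tau2):
    """beta: (D, F); mu: (F,); Xs, ys: per-domain data as before."""
    D, F = beta.shape
    g_beta = np.zeros((D, F))
    # d/dmu: pull toward nu, plus pull-back from each domain's beta
    g_mu = (mu - nu) / tau2 + np.sum((mu - beta) / sigma2, axis=0)
    for d, (X, y) in enumerate(zip(Xs, ys)):
        p = 1.0 / (1.0 + np.exp(-(X @ beta[d])))  # predicted probabilities
        # d/dbeta[d]: shrink toward mu, plus logistic residual term
        g_beta[d] = (beta[d] - mu) / sigma2 + X.T @ (p - y)
    return g_beta, g_mu
```

At β = μ = ν the prior terms vanish and only the residual term remains, which is a quick way to sanity-check the signs.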

July 21, 2009 at 1:32 pm |

It seems combining the F+M stuff with the recent stuff out of Andrew Ng’s group (also at Stanford :P) on estimating hyperparameters might lead to some pretty cool grab-and-go domain adaptation. If you’re interested in the “more levels”, a shameless self plug: http://hal3.name/docs/daume09hiermtl.pdf might be interesting (it was at UAI this year).

Oops, should have said: great post! Waiting on the next post!

July 22, 2009 at 4:42 pm |

I hadn’t seen this before, but the following is a much earlier reference to hierarchical modeling (“shrinkage” is what the non-Bayesians call estimation based on Gaussian priors):

McCallum, Rosenfeld, Mitchell and Ng. 1998. Improving text classification by shrinkage in a hierarchy of classes. In *ICML*.

It actually uses interpolation rather than a more standard hierarchical shrinkage model like Finkel and Manning used. Always interesting to see how the same ideas keep coming back in different forms.

Given their naive Bayes basis, they could’ve easily put this in a hierarchical Bayesian setting by using Dirichlet priors. Then they’d have prior concentrations to estimate instead of interpolation parameters.
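
Here's a sketch of what that suggestion amounts to (my own construction, not McCallum et al.'s actual method): with a Dirichlet prior centered on the parent class's word distribution, the posterior-mean estimate for a child class shrinks its empirical counts toward the parent, with the concentration α playing the role of the interpolation parameter.

```python
# Posterior mean of a multinomial under a Dirichlet(alpha * parent_probs)
# prior: small alpha stays near the child's MLE, large alpha shrinks
# toward the parent distribution.
import numpy as np

def shrunk_estimate(child_counts, parent_probs, alpha):
    """child_counts: (V,) word counts for the child class;
    parent_probs: (V,) parent class word distribution (sums to 1)."""
    n = child_counts.sum()
    return (child_counts + alpha * parent_probs) / (n + alpha)
```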

July 22, 2009 at 4:44 pm |

And check out Hal’s paper — it estimates the (co)variances, too! I actually saw him present this paper at Columbia, but didn’t make the connection because of all the whiz-bang models flying around.

August 4, 2009 at 7:50 am |

Thanks for the interesting link.

You say that “there’s no reason these models need to remain hierarchical”. I’m wondering how this can be done. In their model, the prior is centered around top-level parameters. What would the prior look like if there were many “parent” parameters?

August 4, 2009 at 11:21 am |

Suppose we have two levels, one level for genre (e.g. newspaper, twitter, blog, broadcast television, youtube, etc.), and a second level for topic (e.g. biology, baseball, etc.).

Instead of drawing your coefficient β for a feature from a hierarchical prior with mean α, draw it from a prior with mean α1+α2, where α1 is for level 1 (e.g. twitter) and α2 for level 2 (e.g. baseball). You can even estimate covariance of the priors with enough data (e.g. twitter posts about baseball).
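
In code, the only change is to the prior penalty: the prior mean for a coefficient vector in cell (genre g, topic t) becomes the sum of a genre effect and a topic effect. This is a sketch with hypothetical names, not anyone's actual implementation.

```python
# Gaussian prior penalty (negated log prior, constants dropped) for the
# coefficient vector of the domain in cell (genre g, topic t): shrink
# beta toward the sum of the two crossed effects.
import numpy as np

def crossed_prior_penalty(beta, alpha_genre, alpha_topic, g, t, sigma2):
    """alpha_genre, alpha_topic: dicts mapping group name to an (F,)
    effect vector; sigma2: (F,) prior variances."""
    prior_mean = alpha_genre[g] + alpha_topic[t]
    return 0.5 * np.sum((beta - prior_mean) ** 2 / sigma2)
```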

You can also add arbitrary additional hierarchical structure.

Check out section 13.5 of Gelman and Hill’s regression book for more info.

August 4, 2009 at 2:18 pm |

Thanks!