Discussion about this post

Amany Marey

Thank you for this article. How do we decide the number of attention heads? Does the function of the attention heads differ by language model?
