Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy

40 points by panabee 1 year ago | 1 comment
  • free_bip 1 year ago
    Specifically, this is Llama2, not Llama3; I was a bit disappointed by that. It also wasn't totally clear from the article: will this actually increase GPU inference speed or decrease GPU memory usage?