Even my continued fiddling with the SHA-RNN model shows there's a _lot_ to be studied and explored. I haven't published new incremental progress, but you can tie the RNN across the 4 layers to substantially decrease total params yet get nearly equivalent perplexity results.
— Smerity (@Smerity) January 28, 2020
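To make the tying idea concrete, here is a minimal sketch of sharing one recurrent module's weights across all layers of a stack, rather than allocating separate weights per layer. This is an assumption of what "tie the RNN across the 4 layers" means, not Smerity's actual implementation: the real SHA-RNN also interleaves attention and feed-forward ("boom") blocks, which are omitted here, and the `TiedStack` name, `d_model`, and the residual connection are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class TiedStack(nn.Module):
    """Hypothetical stack of n_layers recurrent blocks sharing one LSTM's weights."""

    def __init__(self, d_model: int, n_layers: int = 4):
        super().__init__()
        self.n_layers = n_layers
        # A single set of recurrent weights reused at every layer, so the
        # recurrent parameter count is ~1/n_layers of an untied stack.
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        for _ in range(self.n_layers):
            out, _ = self.rnn(x)
            # Residual connection (an assumption) so repeated applications
            # of the same weights still refine the representation.
            x = x + out
        return x

# Rough parameter comparison: 4 untied LSTMs vs. one tied LSTM reused 4 times.
untied_params = 4 * sum(p.numel() for p in nn.LSTM(1024, 1024).parameters())
tied_params = sum(p.numel() for p in TiedStack(1024, n_layers=4).parameters())
print(tied_params, untied_params)  # tied is ~1/4 of the untied count
```

The design trade-off matches the tweet's claim: depth (and compute) is preserved because the shared module is applied four times, while total parameters shrink substantially, and in Smerity's experiments perplexity stayed nearly equivalent.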