As I mentioned in my previous post, I’ve been really interested in language analysis and language generation. I’ve also been interested in working with the Lithuanian language. According to the few people I spoke to back in Lithuania, hardly anyone has used Lithuanian for machine learning training, so mine would be one of the first attempts.
My initial idea was to take headlines from various Lithuanian news sites, train a model on them and generate new headlines. It’s no secret that media headlines more often than not mislead readers about what the articles are actually about and tend to be negative, so one of my assumptions was that the newly generated text would be rather negative too.
So here I go… scraping headlines from delfi.lt, 15min.lt, alfa.lt, lrytas.lt and many others… I train the model using char-rnn, a multi-layer recurrent neural network that works character by character, and here is what I get:
Rere sokialtuli il vauliai akstiai kurinTdiru venkias Nerinojys suniaetis: Pa riedi vadainiu toliemojomas
Karketije,be skas kranedu M. A, Cemcinko jadali stramytui, kikike naipotyti prankaju jamava
Well, something like this… Just as it’s not understandable to English speakers, it doesn’t make any sense to me, a Lithuanian speaker, either.
What’s wrong?! Data, data, data… Not enough data. I go to class, I voice my complaint to my professor and my classmates, and what I hear from them is that to see even somewhat accurate results I need at least 1 million characters to train on. And what do I have..? Not nearly enough… First mini failure.
At this point I realize that gathering data (a sufficient amount of it) is going to take much longer than I expected.
I need to come up with a new plan. Since there is no way I’m going to be able to scrape that many headlines (site limitations and so on), I figured: why not pick Lithuanian poetry and train on those texts? So here I go with the updated idea: I train my model on texts from a few Lithuanian poets and generate new poetry based on that.
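A quick way to see how far a corpus is from that 1 million character mark is to count characters before training. The sketch below is only an illustration: the function name `corpus_report`, the two sample headlines and the use of 1,000,000 as the target are assumptions based on the rule of thumb I heard in class, not a hard limit.

```python
from collections import Counter

TARGET_CHARS = 1_000_000  # rule of thumb from class, not a hard requirement


def corpus_report(lines, target=TARGET_CHARS):
    """Summarise how much character-level training data a list of lines gives."""
    # Drop empty lines and join the rest the way a training file would look.
    text = "\n".join(line.strip() for line in lines if line.strip())
    counts = Counter(text)
    return {
        "chars": len(text),
        "unique_chars": len(counts),
        "fraction_of_target": len(text) / target,
    }


# Hypothetical sample; the real input would be the scraped headlines.
sample = ["Pirma antraštė", "Antra antraštė", ""]
report = corpus_report(sample)
print(report)
```

Running something like this on the scraped headlines makes the gap obvious long before any training time is spent.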
The poets I decided to work with were these:
Maironis: one of the most famous Lithuanian poets, who lived at the end of the 19th and the beginning of the 20th century. His poetry is widely studied in Lithuanian schools because of his huge contribution to the national revival under Russian oppression. He wrote a lot about nature and his poems tend to be very melancholic.
Juozas Erlickas: a still-living Lithuanian writer, poet and comedian. He writes a lot about current issues in the country and mocks Lithuanian politics and public events. He is very sarcastic and the complete opposite of Maironis.
After gathering texts from both poets I had around 500,000 characters to train on. Still not enough, but let’s see what outcome I could get. For this training I used a char-rnn implementation for TensorFlow.
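For anyone unfamiliar with character-level models: a char-rnn treats text as a sequence of character ids and learns to predict the next character. The sketch below shows only that data preparation step, not the network and not the TensorFlow implementation I actually used. The names `build_vocab` and `make_pairs`, the sample phrase and `seq_len=5` are all hypothetical choices for illustration.

```python
def build_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}


def make_pairs(text, vocab, seq_len):
    """Cut text into (input, target) id sequences.

    The target is the input shifted by one character, so at every
    position the model is asked to predict the next character.
    """
    ids = [vocab[ch] for ch in text]
    pairs = []
    # Stop early enough that every target sequence is full length.
    for start in range(0, len(ids) - seq_len, seq_len):
        x = ids[start:start + seq_len]
        y = ids[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs


text = "kur bėga Šešupė"  # short sample phrase, not the training corpus
vocab = build_vocab(text)
pairs = make_pairs(text, vocab, seq_len=5)
print(len(vocab), "distinct characters,", len(pairs), "training pairs")
```

With only a few hundred thousand characters, the number of such pairs is small, which is one way to see why the output stays rough.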
I trained it in three different ways:
- Only on Maironis’ text
- On Maironis’ and Erlickas’ texts combined in one file, continuing from the latest saved Maironis model
- Only on Erlickas’ text, continuing from a model saved after the Maironis training
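One practical detail when continuing training from a saved model on a different file: character-level implementations typically build their vocabulary from the input text, so a character that appears only in the second corpus may have no id in the first model. I’m not certain how every implementation handles this, so the sketch below just illustrates one way to check for the problem up front. The names `unseen_chars` and `shared_vocab` are hypothetical, and the two strings are toy stand-ins, not real excerpts from the poets.

```python
def unseen_chars(base_text, new_text):
    """Characters in new_text that a model trained on base_text never saw."""
    return sorted(set(new_text) - set(base_text))


def shared_vocab(*texts):
    """One character-to-id mapping covering every corpus used in any stage."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars)}


# Toy stand-ins for the two poets' texts (not real excerpts).
maironis = "saule ir marios"
erlickas = "saule, ir 2 marios!"

missing = unseen_chars(maironis, erlickas)
vocab = shared_vocab(maironis, erlickas)
print("characters new to the second stage:", missing)
```

Building the vocabulary once over all texts before the first training run is one way to keep the ids stable across the three setups.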
Some of the outputs I got:
Maironis-only training
Kuomet viska ir Vaitoji nors seka! Pabus? Juk lenkai ji atakavo!..
Jau menesi jos nuveiksiu, Tamsybes keliu gilybes
Iki savo dali Skalviai!
Uzgims jas kunigu palaimingai
Ir paskirto saule, kovos: Temu, ir krutine, ir marias!
When everything and sighed while sequence! Shall awake? For Poles it attacked!
For some months they have done, the path for depth of darkness
By his part Skalvians!
Can be born priest blissfully
And appointed the sun and the fight: Videos and chest, and the bay!
Metu tik siandiena: ragus nuo Zigmino siela, tas kai zemes-matomos baznycios
Daugel ta pries atgaitybiu:
Mokslas ir karta slepia
Mint just today: the horns of Zigmino soul, that when the earth-visible church
Many of the before atgaitybiu:
Science and once hiding
Maironis + Erlickas training
Dar jis mus ir ukai, daugel gi geidus, uzklotis visa!
Dziaugdamasis jam visai. Daugiau kas tare jisai! Graudingi Baustus atidengia sirdis sociai
Sau jums pavasario juosti norejo
Ir Tumas ant galo vainika
Lyg kas norejes sau
Yet it is us, and nebulae, many would desire, onlays all!
Welcoming him at all. Many things soever he replied! Graudingi punished uncovers heart full,
Spring yourself you wanted to turn black
Tumas and on the ends wreaths
As if anyone wanted to yourself
Dar sultis mums draugaus
Priimtini norai Juozas!
Prie baltaveides vyrai skaitys
Ant tu sirdi siela! Nekarta metais?
Ir kad nusvinta gydyt saule nuo mariu prisiekti
Once the juice we made friends
Acceptable wishes Juozas!
A man reads a paleface
On the hearts of those souls! Repeatedly year?
And that shineth treated with the sun from the bay to take the oath
Erlickas training on top of Maironis
Ant nenori, sapnu
Ten atviro ar greiciau:
Tiek klystanciais zaliais;
Smaitytutis jo krutine,
Aplanko ir baltas atbusias taku
The do not want to dream
There, open or faster:
Both the erring Green;
Smaitytutis his chest,
Folder and white Enlightenment path
„Kodel Lietuvas tare pries tikybos sakas:
Nusiragaus prityres!… Bus sapnai! —
Atsiliepe kankles grazaus ir Vaicius pat Juozas dabar gana, Ir amzius iskilme kalinta visai;
Kam kartais apimti zmones Jau zino nemaz jausmu visi!…
“Why Lithuania replied before religion at the cutting edge:
Experienced taste! … It will be a dream! –
Answer zither beautiful and Vaičiaus also Juozas now quite, and age pomp imprisoned whole;
Why sometimes include people already know a fair amount of sense at all! …
First of all, I don’t think there is ever an end to gathering training data. I need more and more. I would like to try not only char-rnn (even though it makes more sense), but also a word-rnn algorithm, try a few other data inputs (e.g. TED Talks), and think about how the outcome could be displayed, e.g. a Twitter bot or a poetry website. What user interaction could I introduce to create two-way communication between human and machine?
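To make the char-rnn versus word-rnn choice a bit more concrete, the sketch below compares the two kinds of tokens on a tiny toy line. It is an illustration only: `token_stats` is a hypothetical helper and the sample line is not from my corpus. On a small dataset, a word-level model sees most words only once, which is probably part of why the character-level approach feels like the better fit here.

```python
from collections import Counter


def token_stats(tokens):
    """Count tokens, distinct tokens, and tokens that occur exactly once."""
    counts = Counter(tokens)
    seen_once = sum(1 for c in counts.values() if c == 1)
    return {"tokens": len(tokens), "vocab": len(counts), "seen_once": seen_once}


text = "saule teka ir saule leidziasi"  # toy line, not from the corpus

char_stats = token_stats(list(text))     # character-level view
word_stats = token_stats(text.split())   # word-level view

print("chars:", char_stats)
print("words:", word_stats)
```

Running the same comparison on the full poetry file would show how many words a word-rnn would have to learn from a single example each.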