Introduction to Natural Language Processing With Python - Sessions by Pusher

Introduction to Natural Language Processing With Python

Jessica Bowden speaking at Async in May, 2016

About this talk

This talk will introduce the area of NLP (Natural Language Processing) and a basic introduction of its principles. It will use Python and some of its fundamental NLP packages, such as NLTK, to illustrate examples and topics, demonstrating how to get started with processing and analysing Natural Languages. We’ll look at what NLP can be used for, a broad overview of the sub-topics, and how to get yourself started with a demo project.


So today I'm going to be talking about natural language processing, specifically with Python, just a bit of an introduction. This is an overview of what I'll be talking about today: I'm going to give a little introduction to who I am in case you don't know me, then a bit of an introduction to what natural language processing is, in case you don't know. Then why I think you should be using Python for NLP. Some people might disagree, that's fine. Then a crash course on Python syntax, just some things that are a bit different that you might not know about. Sorry if you do. Then I'll be looking at preparing your data for building prototypes and how to load data in Python, then a little look at how to explore and analyze data, things like tokenizing. And then I'll be looking at a couple of little sentiment-based projects that you could hopefully play around with yourself, and then some more advanced things, perhaps.

Like I said, I'm Jessica. I work at Brandwatch, on the data science team. I've been there for about two years now, and that's my Twitter handle.

So, natural language processing is a really, really broad topic, and I'll be trying to cover some basic techniques today. It covers topics like machine translation, summarizing blocks of text, like something that got big such as Summly, which is a terrible name, and spam detection and sentiment analysis, which are a couple more really big fields.

I think Python's great; it's the main language I use for programming now. It's really readable, so it makes it easy to build really fast prototypes, and it's got really rich support for text analysis, strings, and lists.
There are loads of great NLP libraries available, like NLTK, spaCy, and TextBlob, and there are also some really great parsing libraries. I've also just added a couple of tools I like using, if you want to have a look in your spare time.

So now I'm going to do a little bit of a crash course in case you're not familiar with Python. Sorry if you are. The first thing to note is that Python has no braces for separating blocks of code; pretty sure they do in JavaScript. Sorry, my JavaScript is awful. Python is really dependent on whitespace for indenting and separating new lines. There are generally no semicolons, so whilst a block of code with them would still run, it's not very Pythonic and you should avoid it, because semicolons are actually used to separate multiple statements on the same line. You might use one like that, or if you're importing multiple modules on one line, but even then it's sort of avoided.

Strings are written with quotes, and you can format strings with curly braces and the format function, pretty similar to other languages, I think.

Then a couple of notes on data structures. Lists, the equivalent of arrays in JavaScript, are defined with square brackets and you iterate through them as follows. A nice thing to note is that strings in Python work a bit like lists: you can iterate through them and slice them, which is a thing you'll probably see me using quite a lot. You've also got list comprehensions, which allow you to perform an if conditional on some data and return the result as a list. So this here is just getting all the even numbers in the range of 0 to 10, and I thought I'd show the equivalent as an actual for-loop; it's a lot more verbose.

Dictionaries are really similar to JSON blobs: they're just key-value storage. They look really similar to JSON, and when you actually read in JSON in Python, it often just comes back as dictionaries anyway.
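The syntax points above can be sketched in a few lines of plain Python (the names and values here are illustrative, not from the talk's slides):

```python
# Strings, and formatting with curly braces and the format function
name = "Jessica"
greeting = "Hello, {}!".format(name)

# Strings work a bit like lists: you can iterate through and slice them
word = "language"
first_four = word[:4]  # slicing takes the first four characters

# A list comprehension: all the even numbers in the range 0 to 10
evens = [n for n in range(10) if n % 2 == 0]

# The equivalent for-loop is a lot more verbose
evens_verbose = []
for n in range(10):
    if n % 2 == 0:
        evens_verbose.append(n)
```

Both versions produce the same list; the comprehension just expresses the filter in one line.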
You can access the values for keys like this and iterate through them as follows. It's also worth noting that this is called unpacking variables; I don't think you get it in JavaScript. - [Man] You can do that now. - Okay, well, I don't know about new JavaScript.

You can also do dictionary comprehensions in Python, which is pretty cool, and you can do set comprehensions too. So here you're just iterating through the dictionary we defined before and selecting all the entries where the first letter of the key begins with J, which is pretty nice. And then, last but not least of the data structures, there are sets, which are really similar to lists but they're unordered and there are no duplicates.

On a last note, comparing values and comparing objects in Python works as follows: you compare values with double equals and objects with the keyword "is". And in case you didn't know, the null keyword is just "None" in Python. I've also included a couple of links on coding style in case any of you are interested in making your code super-Pythonic and really annoying with PEP 8.

So, yeah, just a little intro to getting started with NLP in Python. In case you need to, that's how you open and read text files from a local file, and this is how you read in online text files to process.

Now I'm going to do a little introduction to NLTK, which is a really popular NLP library in Python. It's quite old and it's not often updated now, but it's really great for educational purposes, which is why I'm introducing it here. It's got a free book included and lots of open, free data sets that you can just use, so it's great.

The first thing I want to go over is tokenizing. Tokenizing is where you split your document up into logical chunks, usually broken up by sentences.
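The dictionary features mentioned above can be sketched like this (the example dictionary is made up for illustration):

```python
ages = {"Jessica": 25, "Jack": 30, "Sam": 28}

# Accessing the value for a key
jessica_age = ages["Jessica"]

# Iterating with unpacking: each item comes back as a (key, value) pair
pairs = []
for person, age in ages.items():
    pairs.append((person, age))

# A dictionary comprehension: keep entries whose key begins with "J"
j_names = {person: age for person, age in ages.items()
           if person.startswith("J")}

# A set comprehension: like a list, but unordered and with no duplicates
initials = {person[0] for person in ages}

# Comparing values vs comparing objects
a = [1, 2]
b = [1, 2]
values_equal = a == b   # True: the values are the same
same_object = a is b    # False: two separate list objects

# Python's null keyword is None, and you compare against it with "is"
nothing = None
is_missing = nothing is None
```

The `is` keyword checks object identity, which is why `None` comparisons use it rather than `==`.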
So, if I wanted to tokenize the first line from "ALICE IN WONDERLAND", it would end up looking like this if I used the default NLTK tokenizer; it breaks the text up by looking at punctuation and spaces.

The next thing is stemmers and lemmatizers. They basically just reduce words to their normalized form, so "am" would become "be" and "cars" would become "car". That's how you use a stemmer, and this is how you use a lemmatizer, in NLTK. They look like they do pretty much the same thing, but stemmers are a lot more naïve: they don't analyze the text like a lemmatizer does, but they're a lot faster. So if you just want to chunk your text into a comparable format, or cluster similar text in some way, a stemmer may be enough; otherwise you're better off using a lemmatizer. You'll notice things like the E of ALICE has just been chopped off, but it's not a plural just because it's got an E on the end. The same with Lewis and Carroll as well; that's really crap.

- [Man] Sorry, what is lemmatizing? - Lemmatizing is the same idea as stemming, so it's still reducing a word to its normal form. This one doesn't work so well here because I haven't added in the part of speech, which we'll come to later, but it basically considers the context, and it doesn't just naïvely go through and chop off where it sees an S. - Okay, so [inaudible]? - No, just from the end. So a stemmer chops plurals, and lemmatizers consider the context.

Okay, so now I'm going to look at exploring and analyzing data. The first thing that's quite fun that you can do with NLTK is explore the frequency distribution, so we can try to find out which are the most informative tokens in our text. To do this we can just use the FreqDist class from NLTK, run it against our set of individual tokens from "ALICE IN WONDERLAND", and extract the top 25 most common tokens from the text. So that's what it looks like.
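NLTK's real tokenizers, stemmers, and lemmatizers are much more capable than this, but the distinction described above can be sketched with a pure-Python toy (the suffix rules and lemma table are invented for illustration; they are not NLTK's):

```python
import re

def tokenize(text):
    # Split on punctuation and whitespace, keeping punctuation as tokens
    return re.findall(r"\w+|[^\w\s]", text)

def naive_stem(token):
    # A stemmer is naive: it chops suffixes it recognises, so "Alice"
    # wrongly loses its "e" even though it isn't a plural or a verb form
    for suffix in ("ing", "es", "e", "s"):
        if token.lower().endswith(suffix):
            return token[:-len(suffix)]
    return token

# A lemmatizer consults a vocabulary (and context) instead of chopping
LEMMAS = {"am": "be", "is": "be", "are": "be", "cars": "car"}

def naive_lemmatize(token):
    return LEMMAS.get(token.lower(), token)

tokens = tokenize("Alice was beginning to get very tired.")
# → ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', '.']
```

The toy stemmer reproduces exactly the failure from the talk: `naive_stem("Alice")` gives `"Alic"`, while the lemma lookup correctly maps `"am"` to `"be"`.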
It's kind of not very informative, because it's kept commas and punctuation and stop words, so it's just full of rubbish, really. They've been included because they are evenly distributed throughout the text, so it makes logical sense, but it's not very useful for us. We can instead look at the opposite, the tokens that aren't frequently occurring at all. But, again, I've never heard of Brandy in "ALICE IN WONDERLAND", so it doesn't really tell me much about the text. So instead we could stay with a frequency distribution but look for longer words, which is definitely more useful: words like griffin and creatures and mushroom. They're more informative words, but perhaps not quite what we want.

The next part I want to talk about is part-of-speech tagging, also known as POS tagging. That's where, given a sentence, you extract whether each token is a verb or an adjective or just a bit of punctuation. This is how we'd tag the tokens using NLTK. Loads of different libraries have their own versions of how they represent the tags, which is really annoying; this one uses the Penn Treebank tag set. I don't know them all off the top of my head, but NNP is a proper noun, VB is obviously a verb, and then we've got prepositions, nouns, and conjunctions.

Using a frequency distribution, and from learning how we can pos_tag sentences, we can consider the frequency distribution of the types of tags throughout "ALICE IN WONDERLAND". Again, not very interesting: the most common are nouns, conjunctions, determiners, prepositions. It's kind of what you'd expect. But then from that, we could try to find more interesting words again, which was difficult earlier. This is looking more informative already.
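NLTK's FreqDist is essentially a counter over tokens, so the filter-for-longer-words trick above can be sketched without NLTK at all (the sample text here is a stand-in for the full book):

```python
from collections import Counter

text = ("the queen said off with her head and the gryphon "
        "and the mock turtle sat near the mushroom and the queen")
tokens = text.split()

# A frequency distribution over the tokens, like NLTK's FreqDist
freq = Counter(tokens)

# The most common tokens are dominated by stop words like "the" and "and"
top = freq.most_common(3)

# Filtering for longer words surfaces the more informative ones
long_words = [(word, count) for word, count in freq.most_common()
              if len(word) > 5]
```

On the real text you would tokenize the whole book first, but the shape of the result is the same: stop words at the top, and content words only once you filter.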
This goes through the most common tokens we were looking at before and extracts the proper nouns, so obviously Alice and Queen, but it's included a bunch of punctuation. You can't really know why that's happened without going through and analyzing the individual sentences, but the NLTK POS tagger is not amazing by any means, and this is on properly written text, so it would just fall apart on tweets.

So now I want to have a little look at sentiment projects, building on the building blocks we looked at before with tokenization. One of the common approaches to sentiment analysis, while it's not super clever, is rule-based, which is exactly what it sounds like: finding rules in our text to work out the polarity of the text. I've just stolen a bunch of reviews from Rotten Tomatoes, because I'm a monster, that I can analyze. Building from what we had before, I'm going to take the tokenizer and split one of our reviews, I think it's "Captain America", into tokens, then pos_tag them and return those, so that's what you can see here. And then I've just built a list of handcrafted rules. That's definitely not the way you'd do it in real life, but it's good enough for this.

So the first way we could do it is to go through and look at the review. The first review is "Very entertaining, and a far tighter production of Marvel's recent output." Go through it, and if I find one of the words in our list of positive rules, increment the score by one, or if I find a word in the negative rules, decrement it by one. A really simple approach, and it just output zero. Well, there was "entertaining", but it didn't find that because we didn't lemmatize it. So next up, we can add lemmatization, but there's an awful lot of text to go through, so I've also changed it to just look at adjectives and do the same again.
It's found "entertaining", because we had "entertain" in the ruleset, so we can just build upon it like this. Then you can improve it further, maybe by looking at words that increment the meaning of things: if it's "really great" or "very great", or "too brilliant", it might be even better, I don't know. Tweets are terrible these days. From here, we can see that it's "very entertaining", so the "very" increases the score even more. This isn't a great approach, but it's an example of roughly how a rule-based approach could work. You could take it even further and build upon it: in the same way that we added words incrementing the score, like "very", we could add modifiers for words like "not", or things that decrement it, like, "Oh, it's a little bit good."

- [Man] How could you test drive sentiment analysis like that? Would you [inaudible]? - What, for a rule-based approach? - Yeah. - You can get people to mark it up themselves, I guess, so get people to mark up a bunch of them and then you can go, "Well, if it agrees with them, I'm probably right."

Which brings me onto Naive Bayes sentiment analysis, which is a bit more sophisticated, I suppose... My words have gone. I'll just move on. Okay, so this is how you can build a super simple Naive Bayes classifier with NLTK. NLTK's Naive Bayes works by using training data, so I've gone and found a bunch of tweets which are already marked up. It will be loads of tweets that someone has hand-annotated for a very long time, saying this one's positive, this one's negative, and it's very exhausting. Then I can go through and split my data into training data and testing data. Training data is what the classifier will use, and testing data is so I can see later if it's worked as well as I hoped.

- [Man] Sorry, can you just go back? Could you explain the polarity?
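The rule-based approach described above, with intensifiers like "very", can be sketched roughly like this (the word lists and weights are handcrafted toys, just as in the talk, not a real ruleset):

```python
POSITIVE = {"entertaining", "tight", "great", "brilliant"}
NEGATIVE = {"boring", "dull", "awful"}
INTENSIFIERS = {"very", "really", "too"}

def score_review(review):
    # Crude tokenization: lowercase, strip commas, split on whitespace
    tokens = review.lower().replace(",", " ").split()
    score = 0
    for i, token in enumerate(tokens):
        # Words like "very" increment the weight of whatever follows them
        weight = 2 if i > 0 and tokens[i - 1] in INTENSIFIERS else 1
        if token in POSITIVE:
            score += weight
        elif token in NEGATIVE:
            score -= weight
    return score
```

On the talk's example, "Very entertaining, and a far tighter production" scores +2 rather than +1, because "very" doubles the weight of "entertaining". Handling "not" would need a similar look-back that flips the sign instead of scaling it.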
- Oh, yeah, the polarity is just the positive or negative. For some reason, the [inaudible] uses four instead of one. It's so strange, I don't understand it. - Just in case you get a bit [inaudible]. - Yeah, they just use four and zero, I don't know why. - [inaudible]. - Yeah, it seemed so obvious. - [inaudible] and get four back.

We've [inaudible] our data, and I've just processed it into a format that makes it easier to handle each tweet: if it's a positive tweet, I've put it in a [inaudible] with the tweet and the sentiment, in a list called either positive or negative. This is how it looks now; this is a sample of the negative and positive. It's just the tweet and the sentiment. Ridiculous, how's it ever going to learn?

Now, we break each tweet down into a bag of words, making sure each is kept with its sentiment for now but removing the really small words, because they're not going to be informative to us. This gives us an actual bag of words for all the tweets. We still know whether each was positive or negative, because we're keeping track of that from before, but now we've also got an ambiguous bag of words across all of the tweets, which lets us build this. So now we build a frequency distribution, like we did before, so we can find out the most informative features from this group of words, and we extract these features so that when we're given a document we can find if any of the features match up.

Sorry. So say we have a document like, "All rockstars are back home: while 'some of us' freshen up, 'others' watch Magic/Lakers game, then we'll celebrate Sweden Rock in Florida." We split that up, and then we can go through and find out if anything in this tweet matches with things from our training data, so we can ultimately find out how it's classified using the classifier.
And then from that training set we just built, we can build an actual classifier, and this is an output of what it thinks are the most informative features, so the features which bear the most weight for the classifier we just built. If something contains "cancer", it's a 12 to 1 likelihood that it's negative, whereas if it contains "love", it's a 10 to 1 likelihood that it's positive, because that's what it learned from the data, and rightly so. So now we've got there, we can classify some tweets we put aside at the beginning. We can extract one of the positive tweets and classify it with our new classifier, and it classifies the positive one as positive and the negative one as negative.

- [Man] Are those tweets already marked up? - Yes, so that's been classified as positive, and it was marked up by a person as positive. I haven't done a thorough investigation on all the tweets I put aside. I probably should, but I think I only put about 100 tweets aside, and that's probably not enough.

Naive Bayes is entirely dependent on how much data you throw at it, which also makes it quite hard to see how well it's doing. The more data you give to a Naive Bayes classifier, the better it will perform, but obviously then you've got the cost of having to mark up [inaudible] data. And if something doesn't classify correctly, as we'll see later, it's kind of hard to see where it's gone wrong, and you just have to throw more data at it. And if it comes across a feature it hasn't seen before, it's not going to be able to classify it, or it's going to classify it incorrectly.

So now I thought I'd do a little demo, so we can look at the sentiment that co-occurs with smileys, using the classifier we've already built. Yay, loading in tweets. This is a lot of code. I'm not going to go through and explain all of it, because that would be really boring, but basically, these are the Unicode ranges for emojis.
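A Naive Bayes classifier of the kind NLTK trains can be sketched in pure Python: count how often each word co-occurs with each label in the training data, then pick the label that makes a new document most likely. The training tweets below are invented stand-ins for a hand-annotated set:

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    # For each label, count how often each word co-occurs with it
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for words, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(words, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior, plus log likelihood per word with add-one smoothing
        # (smoothing handles features the classifier has never seen)
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["love", "this", "film"], "positive"),
    (["great", "fun"], "positive"),
    (["awful", "boring", "film"], "negative"),
    (["hate", "this"], "negative"),
]
model = train(training)
```

With so few training documents this is only a shape demonstration, which mirrors the talk's point: the classifier is entirely dependent on how much marked-up data you feed it.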
They're in two different ranges, which is why they've had to be compiled separately. So the code finds any characters in those ranges in a tweet, then goes through and classifies the tweet. If an emoji exists, it finds the emoji in our dictionary and increments its positive count by one, or increments its negative count by one if it's a negative tweet. And that's the result for all our emojis below, but it will be a bit more useful in a graph.

I don't think this is going to be super informative, but we can have a look anyway. Well, it's not bad. The crying emoji appears far more often with negative tweets than positive tweets. The happy emoji appears more or less an equal number of times with both, which makes me think that it's not a very good classifier. And the joyful emoji is appearing with negative far more than positive. So I don't think this classifier had anywhere near enough data, or maybe people are writing really weird tweets.

- [Man] [inaudible] in a way, where the emojis themselves become worse [inaudible] Bayes classifier? - No, because I don't think there are any emojis in the dataset I have. - Oh, okay. So there's no [inaudible]. - Sorry, this is just separate data that I found, which I gathered myself to try and maximize the number of emojis that I could get back, but there weren't many, and then tested it against the... I've not used the training data. I've not been nice to it. This is completely new data.

Yes, okay. Yeah, there's still a lot going on beyond NLTK. It's quite a limited library, really, and I hope that some of these demos have given you an idea of what you can go and do. If you're actually interested in going far beyond NLTK, there are a lot of interesting, better, faster libraries about at the moment. There's a Python library called spaCy, which is really cool, which has built-in named entity recognition and tokenization that is far superior to NLTK's.
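The emoji-counting step described above can be sketched like this. The exact Unicode ranges and the fixed positive/negative labels here are illustrative assumptions standing in for the talk's compiled patterns and its trained classifier:

```python
import re
from collections import defaultdict

# Emoji live in separate Unicode ranges, which is why the talk's patterns
# had to be compiled separately; here two ranges are combined in one regex
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]|[\u2600-\u27BF]"
)

def count_emoji_sentiment(labelled_tweets):
    # Maps each emoji to a [positive_count, negative_count] pair
    counts = defaultdict(lambda: [0, 0])
    for tweet, label in labelled_tweets:
        for emoji in EMOJI_PATTERN.findall(tweet):
            if label == "positive":
                counts[emoji][0] += 1
            else:
                counts[emoji][1] += 1
    return counts

tweets = [
    ("so happy today \U0001F600", "positive"),
    ("worst day ever \U0001F62D", "negative"),
    ("crying again \U0001F62D", "negative"),
]
result = count_emoji_sentiment(tweets)
```

In the real demo the labels come from the Naive Bayes classifier rather than being given, and the per-emoji counts are what get plotted on the graph.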
And you might have heard of Google's dependency parser that they came out with recently, which is open-source, I think. So there are a lot more things you can look at that are a lot more relevant, but I hope this has given you an idea of getting started, and it's not too hard at all, really.

- [Man] Can you say the name of [inaudible]? - Parsey McParsface. No, Parsey McParseface. And it's on GitHub. This is just a Jupyter Notebook, so you can run it yourself and it should be easy. Yes, okay.