Cucco is an easy-to-use text normalization library for Python that allows users to make text more consistent, and therefore makes data easier to process, by removing or replacing accent marks, white spaces, stop words, unnecessary characters and punctuation, and lots more.
The easiest way to install cucco is by using pip, as you get the latest release version. Simply run the following command:
$ pip install cucco
Another way to install cucco is by installing it directly from the code using git. To do this, clone the repository from GitHub and install it like this:
$ git clone https://github.com/davidmogar/cucco.git
$ cd cucco
$ python setup.py install
After using either of the two methods, cucco is installed and ready to use. Note that pip installs the latest release, while the git method installs whatever code is currently in the repository.
Using cucco is very simple, which makes it stand out from other normalization tools. To get started, all you need is to have an up-to-date version of cucco installed. Check out the How to Install section if you have not done this already.
The basic idea is that you only have to input the text to be processed, specify the normalizations to apply, and receive the normalized text as output. Let's see some examples.
You can normalize text within your code using the default settings. In the default mode, punctuation, extra white spaces and symbols are removed. A specified list of English stop words is also removed.
For example, to apply all normalizations to the text "Who let the cucco out?":
from cucco import Cucco
cucco = Cucco()
print(cucco.normalize('Who let the cucco out?'))
The output would be the following, since the punctuation, extra white spaces and the stop words would be removed:
cucco
To define custom normalizations, you can send a list of normalizations to apply, which will be executed in order.
For example:
from cucco import Cucco
cucco = Cucco()
normalizations = [
    'remove_extra_whitespaces',
    ('replace_punctuation', {'replacement': ' '})
]
print(cucco.normalize('Who let the cucco out?', normalizations))
In this case, the output would be the following, since only extra white spaces are removed and punctuation is replaced by a white space:
Who let the cucco out
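The "executed in order" behaviour can be pictured with a small stand-alone sketch. It is not cucco's implementation: `apply_in_order`, `REGISTRY` and the two simplified helpers are hypothetical names, and the only assumption taken from the example above is that each entry is either a name or a `(name, parameters)` tuple.

```python
import string

def remove_extra_whitespaces(text):
    # Collapse runs of whitespace and trim both ends.
    return ' '.join(text.split())

def replace_punctuation(text, replacement=''):
    # Replace every ASCII punctuation character (simplified helper).
    return ''.join(replacement if ch in string.punctuation else ch for ch in text)

# Hypothetical registry mapping normalization names to functions.
REGISTRY = {
    'remove_extra_whitespaces': remove_extra_whitespaces,
    'replace_punctuation': replace_punctuation,
}

def apply_in_order(text, normalizations):
    for entry in normalizations:
        # An entry is either a bare name or a (name, parameters) tuple.
        if isinstance(entry, str):
            name, params = entry, {}
        else:
            name, params = entry
        text = REGISTRY[name](text, **params)
    return text

steps = ['remove_extra_whitespaces', ('replace_punctuation', {'replacement': ' '})]
print(repr(apply_in_order('Who  let the cucco out?', steps)))
# 'Who let the cucco out ' (the question mark became a trailing space)
print(repr(apply_in_order('Who  let the cucco out?', list(reversed(steps)))))
# 'Who let the cucco out' (white spaces were trimmed last)
```

Because each step receives the output of the previous one, the order of the list changes the result, as the two calls show.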
Now that you have cucco installed and know the basics of how to process text, here is a list of the normalization functions that cucco provides.
This function removes accent marks in the text. You can also define accent marks to be excluded from this effect by setting up a parameter.
def remove_accent_marks(text, excluded=None):
    """
    Args:
        text: The text to be processed.
        excluded: Set of unicode characters to exclude.
    """
This function removes white spaces from the beginning and the end of the string, as well as duplicated white spaces between words.
def remove_extra_whitespaces(text):
    """
    Args:
        text: The text to be processed.
    """
Stop words in over 50 languages are supported. The language of stop words to be loaded is specified during instantiation, according to the language code received. You can specify whether to take the letter case into account or not. English is the default language.
def remove_stop_words(self, text, ignore_case=True, language=None):
    """
    Args:
        text: The text to be processed.
        ignore_case: Whether or not to ignore case.
        language: Code of the language to use (defaults to 'en').
    """
Specific characters can be removed from the input text or replaced with a specified string.
def replace_characters(self, text, characters, replacement=''):
    """
    Args:
        text: The text to be processed.
        characters: Characters that will be replaced.
        replacement: New text that will replace the custom characters.
    """
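As a rough illustration of character replacement, the following sketch swaps every listed character for the replacement string. `swap_characters` is a hypothetical name and the code is not taken from cucco.

```python
def swap_characters(text, characters, replacement=''):
    targets = set(characters)
    # Each listed character is replaced independently.
    return ''.join(replacement if ch in targets else ch for ch in text)

print(swap_characters('c_u-c_c-o', '_-'))    # cucco
print(swap_characters('who_let', '_', ' '))  # who let
```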
You can remove email addresses from the input text or replace them with a specified string.
def replace_emails(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace email addresses.
    """
You can remove emojis, emoticons, or smileys from the input text, or replace them with a specified string.
def replace_emojis(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace emojis.
    """
It's also possible to remove hyphens from the input text or replace them with a white space or a specified string.
def replace_hyphens(text, replacement=' '):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace hyphens.
    """
Punctuation can be easily removed or replaced with a string if specified.
def replace_punctuation(self, text, excluded=None, replacement=''):
    """
    Args:
        text: The text to be processed.
        excluded: Set of characters to exclude.
        replacement: New text that will replace punctuation.
    """
You can remove symbols from the input text or replace them with a specified string.
def replace_symbols(text, form='NFKD', excluded=None, replacement=''):
    """
    Args:
        text: The text to be processed.
        form: Unicode form.
        excluded: Set of unicode characters to exclude.
        replacement: New text that will replace symbols.
    """
URLs can be removed from the text or replaced with a string if specified.
def replace_urls(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace URLs.
    """
Cucco provides a command line tool that can be used directly from the terminal and is installed along with the library. This section covers the usage and features of this tool.
Although it's not strictly necessary, the command line tool lets you supply a configuration file listing all the normalizations to apply to a given text. This is an example of a configuration file:
normalizations:
  - remove_extra_whitespaces
  - replace_punctuation:
      replacement: ' '
  - remove_stop_words
The cucco configuration file uses the YAML format and expects a root element named normalizations. This element has to contain a list of all the normalizations to apply, along with their respective parameters.
To use this file, pass the option -c path_to_config.yaml to the command line tool.
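To show how the configuration file relates to the list of normalizations used in the Python examples, the sketch below starts from the data structure that a YAML parser (for example PyYAML's `yaml.safe_load`) would produce for the example file and converts it into names and `(name, parameters)` tuples. `to_normalization_list` is a hypothetical helper, not part of cucco.

```python
# What a YAML parser would return for the example configuration file.
config = {
    'normalizations': [
        'remove_extra_whitespaces',
        {'replace_punctuation': {'replacement': ' '}},
        'remove_stop_words',
    ]
}

def to_normalization_list(config):
    result = []
    for entry in config['normalizations']:
        if isinstance(entry, dict):
            # A one-key mapping: the key is the name, the value its parameters.
            for name, params in entry.items():
                result.append((name, params or {}))
        else:
            result.append(entry)
    return result

print(to_normalization_list(config))
```

The result has the same shape as the list passed to normalize in the custom normalizations example.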
This section demonstrates how to use the tool to normalize text from different types of inputs.
The most straightforward way to use cucco is to pass the text as an argument when running it. For example:
$ cucco normalize 'Who let the cucco out?'
cucco
In a similar way to the example above, it is possible to normalize text coming from another command using pipes. This example illustrates how to do it:
$ echo "Who let the cucco out?" | cucco normalize
cucco
Finally, cucco allows you to apply normalizations to all of the files in a given directory. This can be achieved in the following way:
$ cucco batch path_to_directory
2017-06-18 23:28:14,455 I Processing files in "path_to_directory"
The -r or --recursive option can be used to find files recursively.
At the moment it is not possible to apply normalizations to a single file, but this functionality will be available soon. Follow the progress here.
One of the most advanced features of the command line tool is to run it in "watch mode." In this mode, cucco will monitor a given path for changes, applying normalizations to every new or modified file.
$ cucco batch --watch path_to_directory
2017-06-18 23:24:40,633 I Initializing watcher for path "path_to_directory"
2017-06-18 23:24:40,634 I Starting watcher
2017-06-18 23:24:40,635 I Waiting for file events
The -r or --recursive option can be used to find new and modified files recursively.
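The idea behind watch mode, reacting to every new or modified file, can be sketched with a simple polling helper. This is an illustration only: cucco's watcher may work differently (for example, by subscribing to file system events), and `changed_files` is a hypothetical name.

```python
import os

def changed_files(directory, seen):
    """Return paths that are new or modified since the previous call.

    `seen` is a dict that the caller keeps between calls; it maps each
    path to the modification time observed last time.
    """
    changed = []
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        mtime = entry.stat().st_mtime_ns
        if seen.get(entry.path) != mtime:
            seen[entry.path] = mtime
            changed.append(entry.path)
    return sorted(changed)
```

A watcher built on this helper would call it in a loop, sleeping between calls, and normalize each path it returns.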
Another way to use Cucco is through its API. This section will give you all the details about its usage. To know more about the routes or how to reach them, you can check this link.
The cucco API has two kinds of endpoints, public and private; the private ones are being tested at the moment.
The difference between the two is the way they are accessed. Public endpoints have routes preceded by /v1, like /v1/accents, and are rate limited, while private endpoints are preceded by /private, like /private/accents, and require authentication.
Private endpoints require the user to be authenticated. This authentication is performed using an API key that can be obtained through this page (coming soon). Once you are registered, you will be able to get an API key from your profile. With the key in hand, query the /private/token endpoint, sending the following header:
"api_key": "[your_api_key]"
This will return a short-lived token, which you should use to query the other endpoints. Its usage is similar to the previous query, but the header name should be changed to token.
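The two-step flow (API key first, token afterwards) could be prepared with the standard library as in the sketch below. Several details are assumptions: `BASE_URL` is a placeholder host, the HTTP method is not stated in this document so the default GET is used, and the requests are only built, never sent.

```python
from urllib.request import Request

BASE_URL = 'https://api.example.invalid'  # placeholder, not the real host

def token_request(api_key):
    # Step 1: ask /private/token for a token, sending the api_key header.
    return Request(BASE_URL + '/private/token', headers={'api_key': api_key})

def private_request(route, token):
    # Step 2: query a private route, sending the token header instead.
    return Request(BASE_URL + route, headers={'token': token})

first = token_request('your_api_key')
second = private_request('/private/accents', 'token_from_first_response')
print(first.full_url, first.header_items())
print(second.full_url, second.header_items())
```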
Public endpoints are rate limited, which means that users can only make a limited number of requests in a given period of time. Once the limit is exceeded, an error message is returned for any further requests in the same time window.
HTTP headers can be checked to find out the reason for a rate limit error. The following headers are returned with every request:
The following table shows the rate limits for the different routes:
Endpoint | Rate limit |
---|---|
/api/v1/accents | 30 per hour |
/api/v1/characters | 30 per hour |
/api/v1/emails | 30 per hour |
/api/v1/emojis | 30 per hour |
/api/v1/hyphens | 30 per hour |
/api/v1/normalize | 10 per hour |
/api/v1/punctuation | 30 per hour |
/api/v1/stopwords | 10 per hour |
/api/v1/symbols | 30 per hour |
/api/v1/urls | 30 per hour |
/api/v1/whitespaces | 30 per hour |
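A client that wants to stay under these limits can space its requests evenly. The sketch below copies the limits from the table and computes the spacing; pacing requests this way is a client-side choice rather than something the API requires, and `seconds_between_requests` is a hypothetical helper.

```python
# Hourly limits copied from the table above.
RATE_LIMITS_PER_HOUR = {
    '/api/v1/accents': 30,
    '/api/v1/characters': 30,
    '/api/v1/emails': 30,
    '/api/v1/emojis': 30,
    '/api/v1/hyphens': 30,
    '/api/v1/normalize': 10,
    '/api/v1/punctuation': 30,
    '/api/v1/stopwords': 10,
    '/api/v1/symbols': 30,
    '/api/v1/urls': 30,
    '/api/v1/whitespaces': 30,
}

def seconds_between_requests(route):
    """Even spacing that keeps a client within the hourly limit."""
    return 3600 / RATE_LIMITS_PER_HOUR[route]

print(seconds_between_requests('/api/v1/accents'))    # 120.0
print(seconds_between_requests('/api/v1/normalize'))  # 360.0
```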