Abstractive Summarization with HuggingFace Pre-trained Models
Text summarization is a well-explored area in NLP. As shown in Figure 1, the field of text summarization can be split based on input document type, output type, and purpose. Regarding output type, text summarization divides into extractive and abstractive methods.
• Extractive: In extractive methods, a summarizer tries to find and combine the most significant sentences of the corpus to form a summary. There are several techniques to identify the principal sentences and measure their importance, such as Topic Representation and Indicator Representation.
• Abstractive: Abstractive Text Summarization (ATS) is the process of finding the most essential meaning of a text and rewriting it in a summary. The resulting summary is an interpretation of the source. Abstractive summarization is closer to what a human usually does: we conceive the text, compare it with our memory and related information, and then re-create its core in a brief text. That is why abstractive summarization is more challenging than the extractive method, as the model must break the source corpus down to the very tokens and regenerate the target sentences. Achieving meaningful and grammatically correct sentences in the summaries is a significant challenge that demands highly precise and sophisticated models.
In this tutorial, we use HuggingFace's transformers library in Python to perform abstractive text summarization on any text we want. The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.
We chose HuggingFace's Transformers because it provides us with thousands of pretrained models, not just for text summarization but for a wide variety of NLP tasks, such as text classification, question answering, machine translation, text generation, and more.
All the documentation for the transformers library can be found on this website: https://huggingface.co/transformers/
For more information on how transformers are built, we recommend reading the seminal paper "Attention Is All You Need".
For usage examples or fine-tuning, you can check the Hugging Face official and community notebooks through these links:
- official: https://huggingface.co/transformers/notebooks.html
- community: https://huggingface.co/transformers/v3.0.2/notebooks.html
To install transformers, you can simply run:
!pip install transformers
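Note that transformers needs a deep-learning backend, PyTorch or TensorFlow, to actually run the models, and some tokenizers (such as T5's, used later in this tutorial) may additionally need the sentencepiece package, depending on your transformers version. If your environment lacks them, a minimal setup that also covers the framework="tf" example further down would be:

!pip install torch tensorflow sentencepiece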
Then, we need to import the needed dependencies:
from transformers import pipeline
Pipeline API
The most straightforward way to use models in transformers is the pipeline API. Pipelines are high-level objects which automatically handle tokenization, run your data through a transformer model, and output the result in a structured object.
In the summarization pipeline, the default model is BART, fine-tuned on the CNN/Daily Mail news dataset.
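Since the implicit default may change between library versions, you can also pin the checkpoint explicitly, for example to facebook/bart-large-cnn, the BART checkpoint fine-tuned on CNN/Daily Mail:

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Below, we simply rely on the default.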
# Initialize the HuggingFace summarization pipeline
summarizer = pipeline("summarization")

# The article to summarize
TEXT = """
Equitable access to safe and effective vaccines is critical to ending the COVID-19 pandemic, so it is hugely encouraging to see so many vaccines proving and going into development. WHO is working tirelessly with partners to develop, manufacture and deploy safe and effective vaccines.
Safe and effective vaccines are a game-changing tool: but for the foreseeable future we must continue wearing masks, cleaning our hands, ensuring good ventilation indoors, physically distancing and avoiding crowds.
Being vaccinated does not mean that we can throw caution to the wind and put ourselves and others at risk, particularly because research is still ongoing into how much vaccines protect not only against disease but also against infection and transmission.
See WHO’s landscape of COVID-19 vaccine candidates for the latest information on vaccines in clinical and pre-clinical development, generally updated twice a week. WHO’s COVID-19 dashboard, updated daily, also features the number of vaccine doses administered globally.
But it’s not vaccines that will stop the pandemic, it’s vaccination. We must ensure fair and equitable access to vaccines, and ensure every country receives them and can roll them out to protect their people, starting with the most vulnerable.
"""

# Run the model
summarized = summarizer(TEXT, min_length=25, max_length=50)

# Print the summarized text
print(summarized)
Note that the first time you execute this, it may take a while to download the model architecture and weights, as well as the tokenizer configuration. We declared the min_length and max_length (measured in tokens) that we want the summarization output to be (this is optional).
The generated summary is:
[{'summary_text': ' WHO is working tirelessly with partners to develop, manufacture and deploy safe and effective vaccines . We must continue wearing masks, cleaning our hands, ensuring good ventilation indoors, physically distancing and avoiding crowds .'}]
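To make what the pipeline handles for us more concrete, the following is a minimal sketch of the equivalent manual steps, assuming the facebook/bart-large-cnn checkpoint and a PyTorch backend:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model that the pipeline would otherwise load for us
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Tokenize the article into model-ready tensors, truncating to the model's input limit
inputs = tokenizer(TEXT, truncation=True, return_tensors="pt")

# Generate the summary as a sequence of token ids
summary_ids = model.generate(**inputs, min_length=25, max_length=50)

# Decode the token ids back into readable text
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

The pipeline performs roughly these steps, plus some input and output bookkeeping, which is why it is the most convenient entry point for quick experiments.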
If you want to change the model and use the T5 model (e.g. t5-base), which is trained on the C4 Common Crawl web corpus, then change the statements as follows.
Note that this library provides T5 in several sizes: t5-small is a smaller version of t5-base, while t5-large is larger and more accurate than both.
# Set up the pipeline with the T5 model
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

# Run the model
summarized = summarizer(TEXT, min_length=25, max_length=50)

# Print the summarized text
print(summarized)
The second summary is as follows:
[{'summary_text': 'WHO is working tirelessly with partners to develop, manufacture and deploy safe and effective vaccines . but for the foreseeable future we must continue wearing masks, cleaning our hands, ensuring good ventilation indoors, physically distancing'}]
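One detail worth knowing about T5: it is a text-to-text model that signals its task with a text prefix, and the summarization pipeline prepends "summarize: " to the input for us. If you call the model directly, you must add the prefix yourself. A minimal sketch, assuming the same t5-base checkpoint and the TensorFlow backend used above:

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Load T5 and its tokenizer directly (the T5 tokenizer may need sentencepiece)
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")

# T5 expects a task prefix; the pipeline adds "summarize: " automatically,
# but when calling the model directly we must prepend it ourselves
inputs = tokenizer("summarize: " + TEXT, truncation=True, return_tensors="tf")

# Generate and decode the summary
summary_ids = model.generate(inputs["input_ids"], min_length=25, max_length=50)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))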
For both models, the generated summaries are fluent and capture the key points of the source text, which is quite impressive!
I hope you enjoyed the tutorial.