This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Ajay Krishnan T. K., School of Digital Sciences;
(2) V. S. Anoop, School of Digital Sciences.
3 Materials and Methods
3.1 Label Studio
Label Studio is an open-source data annotation tool (available at https://labelstud.io/) that provides a user-friendly interface for creating labeled datasets for machine learning and artificial intelligence tasks. The tool supports various annotation types, including text classification, named entity recognition (NER), object detection, and image segmentation. Users can import data from sources such as CSV files, JSON files, or databases and annotate it through a customizable interface. Label Studio also provides a collaborative environment in which multiple annotators can work on the same project, with features such as task assignment, annotation review, and inter-annotator agreement measurement. One of its key features is extensibility: a flexible architecture lets users customize the annotation interface and incorporate custom labeling functions using JavaScript and Python, so the tool can adapt to different annotation requirements and integrate with existing machine-learning workflows. Label Studio also supports active learning, in which the tool suggests samples to annotate based on a model’s uncertainty, helping to optimize the annotation process and improve model performance.
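As an illustration of the import step described above, the following minimal sketch wraps raw texts in the task structure that Label Studio accepts for JSON import (a list of objects, each with a `data` field). The helper name `to_label_studio_tasks`, the `source` field, and the three sentiment labels in the labeling configuration are assumptions made for this example and are not taken from the annotation setup used in this work.

```python
import json


def to_label_studio_tasks(texts, source="twitter"):
    """Wrap raw texts in Label Studio's JSON import structure.

    Each task is a dict with a "data" key; the "text" field name must match
    the variable referenced in the labeling configuration ($text below).
    """
    tasks = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned:  # skip texts that are empty after cleaning
            tasks.append({"data": {"text": cleaned, "source": source}})
    return tasks


# Hypothetical labeling configuration for single-choice text classification.
# The label set is an assumption for illustration only.
LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
"""

tasks = to_label_studio_tasks(
    ["Floods hit the coast\nagain.", "   ", "Solar capacity grew."]
)
print(json.dumps(tasks, indent=2))
```

The printed JSON can be saved to a file and imported into a project, and the configuration string can be pasted into the project's labeling interface settings.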
3.2 snscrape
snscrape is a Python library and command-line tool (available at https://github.com/JustAnotherArchivist/snscrape) for scraping social media content. It retrieves public data from several social media platforms, including Twitter, Instagram, and Reddit. With snscrape, you can fetch posts and related public information, searching by keyword, hashtag, username, or URL to extract the desired content. The library supports scraping both recent and historical data, which makes it possible to monitor trends and conduct research based on social media content. snscrape offers a command-line interface for searching and scraping social media data, with parameters such as the number of results, the date range, and the output format. In addition, snscrape provides a Python API for integrating scraping into your own scripts and applications. The API gives finer-grained control over the scraping process and allows the scraped data to be processed programmatically. A key advantage of snscrape is that it works with multiple platforms through a unified interface. It handles the details of each platform’s endpoints and HTML structures, so developers can extract data without learning the specifics of each platform. snscrape is intended for scraping publicly available content, and it should be used responsibly and in compliance with each platform’s terms of service and usage policies.
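The Python API described above can be sketched as follows. This is a minimal example under stated assumptions, not the collection script used in this work: the helper names (`build_query`, `tweets_to_records`, `scrape_tweets`) are hypothetical, the query relies on Twitter's `since:` and `until:` search operators, and the text attribute is read as `rawContent` with a fallback to `content` because the attribute name differs between snscrape versions. Whether the search endpoint still responds depends on the platform and on the installed snscrape version.

```python
from datetime import date
from itertools import islice


def build_query(keyword, since, until):
    """Compose a search query using Twitter's since:/until: date operators."""
    return f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"


def tweets_to_records(items, limit):
    """Convert up to `limit` scraped items into plain dictionaries."""
    records = []
    for item in islice(items, limit):  # stop after `limit` items
        # Newer snscrape versions expose `rawContent`, older ones `content`.
        text = getattr(item, "rawContent", None) or getattr(item, "content", "")
        records.append(
            {
                "id": item.id,
                "date": item.date,
                "user": item.user.username,
                "text": text,
            }
        )
    return records


def scrape_tweets(keyword, since, until, limit=100):
    """Run a search through snscrape (requires the package and network access)."""
    import snscrape.modules.twitter as sntwitter  # imported lazily

    scraper = sntwitter.TwitterSearchScraper(build_query(keyword, since, until))
    return tweets_to_records(scraper.get_items(), limit)
```

A call such as `scrape_tweets("climate change", date(2023, 1, 1), date(2023, 2, 1), limit=500)` would return a list of dictionaries that can be written to CSV or passed on for annotation.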
3.3 Newspaper3k
Newspaper3k is a Python library for web scraping (available at https://newspaper.readthedocs.io/) that extracts and parses information from online news articles. It provides a simple interface for automating the fetching and processing of articles from various online sources. With Newspaper3k, you can retrieve article metadata such as the title, authors, and publication date, along with the article text. It also supports extracting additional information such as keywords, summaries, and article images. The library locates this information in the HTML structure of each page and applies NLP techniques to derive keywords and summaries. Newspaper3k is designed to cope with the variety found across news websites, including different article formats, multi-page articles, and RSS feeds. One of its main advantages is ease of use: it abstracts away the complexities of web scraping behind a clean, intuitive API and handles the encoding and parsing issues that often arise when working with articles from different sources. Newspaper3k is widely used in content analysis, sentiment analysis, and data mining, and it offers a convenient way to gather news data for research, data analysis, and machine learning projects.
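The metadata extraction described above can be sketched with the library's `Article` class, which is downloaded and parsed before its fields are read. The helper names `article_to_record` and `fetch_article` and the choice of output fields are assumptions made for this example.

```python
def article_to_record(article):
    """Collect the main fields of a parsed article into a plain dictionary."""
    published = article.publish_date  # a datetime, or None if not detected
    return {
        "title": article.title,
        "authors": list(article.authors),
        "published": published.isoformat() if published else None,
        "text": article.text,
    }


def fetch_article(url):
    """Download and parse one article (requires newspaper3k and network access)."""
    from newspaper import Article  # imported lazily

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, date, and body text
    return article_to_record(article)
```

Calling `article.nlp()` after `parse()` additionally populates `article.keywords` and `article.summary`; this step requires the NLTK tokenizer data to be installed.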
3.4 ClimateBERT
ClimateBERT is a domain-specific variant of the BERT family of language models, trained for climate-change-related language tasks. Building on a BERT-style architecture, ClimateBERT is further pre-trained on a large corpus of climate-related documents and text sources, enabling it to capture the nuances and domain-specific knowledge relevant to climate science. This domain-specific pre-training gives ClimateBERT a strong grasp of climate-related concepts, terminology, and contextual dependencies [Iqbal et al., 2023]. By leveraging ClimateBERT, researchers and practitioners in climate change analysis can tackle various NLP tasks, such as sentiment analysis of climate-related tweets or named entity recognition on climate change articles. Integrating domain-specific knowledge into the pre-training process makes ClimateBERT a powerful tool for identifying patterns and extracting insights from climate-related text data. Its application in climate change analysis can aid decision-making, facilitate research, and enhance our understanding of the complex challenges climate change poses.
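A classification step with a ClimateBERT checkpoint might look like the following sketch, which uses the Hugging Face `transformers` pipeline. The checkpoint name `climatebert/distilroberta-base-climate-sentiment` is an assumption about what is published on the Hugging Face Hub and is not necessarily the model used in this work; the helper names `load_classifier` and `label_texts` are likewise hypothetical.

```python
def load_classifier(model_name="climatebert/distilroberta-base-climate-sentiment"):
    """Build a text-classification pipeline (requires transformers and a model download).

    The default checkpoint name is an assumption for illustration.
    """
    from transformers import pipeline  # imported lazily

    return pipeline("text-classification", model=model_name)


def label_texts(texts, classifier):
    """Apply a classifier to texts and pair each text with its prediction.

    `classifier` is any callable that takes a list of strings and returns one
    {"label": ..., "score": ...} dict per input, as the pipeline does.
    """
    texts = list(texts)
    predictions = classifier(texts)
    return [
        {"text": text, "label": pred["label"], "score": pred["score"]}
        for text, pred in zip(texts, predictions)
    ]
```

For example, `label_texts(tweets, load_classifier())` would return one labeled record per tweet. Keeping the classifier as a parameter makes it straightforward to substitute a different checkpoint or a fine-tuned model.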