Text Mining: Unlocking the World's Research Papers

Carl Malamud has made a name for himself as a leader in information liberation. A longtime public domain advocate, he has now set his sights on a project aimed at making 73 million academic journal articles available for analysis.

A new 576-terabyte storage facility is being created at Jawaharlal Nehru University (JNU) in New Delhi. Without asking publishers for permission, Malamud is working alongside Indian researchers to build a massive store of text and images from papers published over the last two centuries.


To stay on the right side of copyright law, Malamud plans to rely on text data mining software that extracts key excerpts without exposing the full text of the papers. Various groups hope to use the technology for large-scale research and other academic purposes.


Malamud's team is leaning on a 2016 Delhi High Court ruling that copyrighted materials could be reproduced for educational and research purposes. Although the legality remains somewhat unclear, the implications for text-based data mining are far-reaching.


The Future of Data Mining

The goal of text mining is to process large amounts of textual data to produce high-quality information. University academics are currently limited to using article abstracts from large databases such as PubMed. The JNU Data Depot project could open up a world of other opportunities.
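
To make the idea concrete, here is a minimal sketch of one of the simplest text mining steps: counting how many documents mention each term. The sample abstracts, the stop-word list, and the function name are all invented for illustration; this is not the JNU team's tooling, and real pipelines are considerably more involved.

```python
import re
from collections import Counter

# Hypothetical sample abstracts, invented for illustration only.
ABSTRACTS = [
    "Insulin resistance is a key feature of type 2 diabetes.",
    "Gene variants linked to insulin secretion affect diabetes risk.",
    "Plant terpenes show promise as pharmaceutical compounds.",
]

# A tiny stop-word list (an assumption); real projects use much larger ones.
STOP_WORDS = {"a", "as", "is", "of", "to", "the", "and", "in"}


def term_counts(texts):
    """Count how many documents mention each term (document frequency)."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters and digits as tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        # Each term is counted at most once per document.
        counts.update(tokens - STOP_WORDS)
    return counts


if __name__ == "__main__":
    for term, n in term_counts(ABSTRACTS).most_common(3):
        print(term, n)
```

Even a count this crude hints at which topics recur across a collection. Running the same kind of analysis over full texts rather than abstracts is the sort of opportunity the paragraph above describes.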


Mining large amounts of data is difficult due to current technological limitations. The JNU team is working to develop new tools to better extract plain text from PDFs and other forms of data files.


In a world of seemingly endless data, the implications of text data mining reach far beyond what we currently know. Most of the interest the project has generated so far comes from academia and research. However, commercial prospects are likely to arise in the near future.


Technological Implications

The JNU depot has several groups excited about the potential uses of the technology. News of the project has generated interest from early adopters in various fields. 


Gitanjali Yadav is a computational biologist who created a database of chemicals used by pharmaceutical companies and other researchers. Her group hopes the JNU Data Depot project will help expand that database by letting it mine full texts without having to visit libraries on-site.


Srinivasan Ramachandran, a bioinformatics researcher at Delhi's Institute of Genomics and Integrative Biology, created a database of genes related to type 2 diabetes. He hopes the JNU Data Depot's expanded text data mining capabilities could further his research.

Beyond Academics

Although the JNU Data Depot is aimed at assisting academics and researchers, there's no question the technology may also be used in other ways.


One potential use is risk management in the financial sector. As far back as 2002, research showed that text data mining can be a useful tool for analyzing financial reports. One academic review published by MDPI stated, "the utilization of internal data sources will be a rich source of future research with both theoretical and practical contributions."


If your company is looking to identify emerging corporate risk and opportunity, Bitvore can help. 


Our flagship product, Bitvore Insights Tracker, is a corporate intelligence research platform that uses massive amounts of unstructured data (such as SEC filings, proxy statements, etc.) and comparative/predictive analytics to help drive decision making. Using advanced AI techniques, Bitvore Insights Tracker can help corporations make decisions using data-driven analysis.

If you would like to learn more about Bitvore AI and its capabilities, check out our whitepaper, Using AI-Processed News Datasets to Perform Predictive Analytics.

