Large Data Sets Experience is Needed by Computer Science Graduates
During recent graduate interview discussions I asked about technical items mentioned on resumes. What I heard surprised me in a couple of cases. In particular, the very low volume data sets recent graduates have worked with. I was surprised that they had not used a large data set. I got the impression that some professors do not care about the volume of data that students work with when they create their applications.
In the media there is a constant discussion about a mismatch between the skills that education provides and the capabilities graduates bring to the work place. And, whether they are prepared for work. The lack of large data set use means that skills needed by employers may be missing. I will outline the skills that could be gained by working with large data sets.
Some types of data handling are just high volume. Business intelligence and analytics consume more data than 20 years ago. Handling the increasing volume is important. Research programming and data science are truly part of big data. Even if you are not doing data science, you may be preparing and handling the data sets. Some industries and organisations just have higher volumes of data. Retail is one example. Companies that used to have less volume are obtaining more data as they adapt to the big data world. We should expect the same trend to continue with organisations that have had higher data volumes in the past. They are going to have to handle a much bigger big data experience.
There are practical aspects to handling large data sets. These can lead to experience in storage management and design, data loading, query optimization, parallelization, bandwidth issues and data quality when large data sets are used. And when you take on those issues, architecture skills are needed and can be gained.
Today, the trends known as the Internet of Things, All Things Data, and Data First are forming. As a result there will be demand for graduates who are familiar with handling high volumes of data.
The responsibility for using a large data set falls to the student. Faculty staff need to encourage this though. They often set and guide the students’ goals. A number of large data sets that could be used by students are on the web. An example of one data set would be the Harvard Library Bibliographic Dataset available at http://openmetadata.lib.harvard.edu/bibdata. Another example is the City of Chicago that makes a number of datasets available for download in a wide range of standard formats at https://data.cityofchicago.org/. The advantage of public large data sets is the volume and the opportunity to assess the data quality of the data set. Public data sets can hold many records. They represent many more combinations than we can quickly generate by hand. Using even a small real world data set is a vast improvement over the likely limited number of variations in self-generated data. It may be even better than using a tool to generate data. Such data when downloaded can be manipulated and used as a base for loading.
Loading large data sets is part of being prepared. It requires the use of tools. These tools can be from loaders to full data integration tool suites. A good option for students who need to load data sets is PowerCenter Express. It was announced last year. It is free for use with up to 250,000 rows per day. It is an ideal way to experience a full enterprise data integration tool and work with significantly higher volumes.
Big Data is here and it is a growing trend. And so students need to work with larger data sets than before. It is also feasible. The tools and the data sets the students need to work with large data sets are available. Therefore, in view of the current trends, large data set use should become standard practice in computer science and related courses.