Defense Notices
All students and faculty are welcome to attend the final defense of EECS graduate students completing their M.S. or Ph.D. degrees. Defense notices for M.S./Ph.D. presentations for this year and several previous years are listed below in reverse chronological order.
Students who are nearing the completion of their M.S./Ph.D. research should schedule their final defenses through the EECS graduate office at least THREE WEEKS PRIOR to their presentation date so that there is time to complete the degree requirements check and post the presentation announcement online.
Upcoming Defense Notices
Andrew Riachi
An Investigation Into The Memory Consumption of Web Browsers and A Memory Profiling Tool Using Linux Smaps
When & Where:
Nichols Hall, Room 246 (Executive Conference Room)
Committee Members:
Prasad Kulkarni, Chair
Perry Alexander
Drew Davidson
Heechul Yun
Abstract
Web browsers are notorious for consuming large amounts of memory. Yet, they have become the dominant framework for writing GUIs because the web languages are ergonomic for programmers and have a cross-platform reach. These benefits are so enticing that even a large portion of mobile apps, which have to run on resource-constrained devices, are running a web browser under the hood. Therefore, it is important to keep the memory consumption of web browsers as low as practicable.
In this thesis, we investigate the memory consumption of web browsers, in particular as compared to applications written in native GUI frameworks. We introduce smaps-profiler, a tool to profile the overall memory consumption of Linux applications that can report memory usage other profilers simply do not measure. Using this tool, we conduct experiments which suggest that most of the extra memory usage compared to native applications could be due to the size of the web browser program itself. We discuss our experiments and findings, and conclude that even more rigorous studies are needed to profile GUI applications.
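As a rough illustration of the kind of data a smaps-based profiler works from, the sketch below sums the Pss field per mapping from text in the Linux /proc/<pid>/smaps format. It is not the smaps-profiler tool described above; the function names `pss_by_mapping` and `read_smaps` and the "[anonymous]" label are assumptions made for this example.

```python
from collections import defaultdict
import re

# Header line of a mapping: address range, permissions, offset, device,
# inode, and an optional pathname.
_HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$")
# Per-mapping field line such as "Pss:                 100 kB".
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def pss_by_mapping(smaps_text):
    """Sum the Pss field (in kB) per mapping name from smaps-formatted text."""
    totals = defaultdict(int)
    current = None
    for line in smaps_text.splitlines():
        header = _HEADER.match(line)
        if header:
            # Mappings without a pathname are grouped under a hypothetical label.
            current = header.group(1).strip() or "[anonymous]"
            continue
        field = _FIELD.match(line)
        if field and current is not None and field.group(1) == "Pss":
            totals[current] += int(field.group(2))
    return dict(totals)


def read_smaps(pid="self"):
    """Read /proc/<pid>/smaps (Linux only; other PIDs may need privileges)."""
    with open(f"/proc/{pid}/smaps") as handle:
        return handle.read()
```

On a Linux machine, `pss_by_mapping(read_smaps())` reports the proportional set size of the running Python process grouped by mapped file, which separates memory attributable to the program binary and its libraries from anonymous memory.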
Elizabeth Wyss
A New Frontier for Software Security: Diving Deep into npm
When & Where:
Eaton Hall, Room 2001B
Committee Members:
Drew Davidson, Chair
Alex Bardas
Fengjun Li
Bo Luo
J. Walker
Abstract
Open-source package managers (e.g., npm for Node.js) have become an established component of modern software development. Rather than creating applications from scratch, developers may employ modular software dependencies and frameworks--called packages--to serve as building blocks for writing larger applications. Package managers make this process easy. With a simple command line directive, developers are able to quickly fetch and install packages across vast open-source repositories. npm--the largest of such repositories--alone hosts millions of unique packages and serves billions of package downloads each week.
However, the widespread code sharing resulting from open-source package managers also presents novel security implications. Vulnerable or malicious code hiding deep within package dependency trees can be leveraged downstream to attack both software developers and the end-users of their applications. This downstream flow of software dependencies--dubbed the software supply chain--is critical to secure.
This research provides a deep dive into the npm-centric software supply chain, exploring distinctive phenomena that impact its overall security and usability. Such factors include (i) hidden code clones--which may stealthily propagate known vulnerabilities, (ii) install-time attacks enabled by unmediated installation scripts, (iii) hard-coded URLs residing in package code, (iv) the impacts of open-source development practices, (v) package compromise via malicious updates, (vi) spammers disseminating phishing links within package metadata, and (vii) abuse of cryptocurrency protocols designed to reward the creators of high-impact packages. For each facet, tooling is presented to identify and/or mitigate potential security impacts. Ultimately, it is our hope that this research fosters greater awareness, deeper understanding, and further efforts to forge a new frontier for the security of modern software supply chains.
Alfred Fontes
Optimization and Trade-Space Analysis of Pulsed Radar-Communication Waveforms using Constant Envelope Modulations
When & Where:
Nichols Hall, Room 246 (Executive Conference Room)
Committee Members:
Patrick McCormick, Chair
Shannon Blunt
Jonathan Owen
Abstract
Dual function radar communications (DFRC) is a method of co-designing a single radio frequency system to perform simultaneous radar and communications service. DFRC is ultimately a compromise between radar sensing performance and communications data throughput due to the conflicting requirements between the sensing and information-bearing signals.
A novel waveform-based DFRC approach is phase attached radar communications (PARC), where a communications signal is embedded onto a radar pulse via the phase modulation between the two signals. The PARC framework is used here in a new waveform design technique that designs the radar component of a PARC signal to match the PARC DFRC waveform expected power spectral density (PSD) to a desired spectral template. This provides better control over the PARC signal spectrum, which mitigates the issue of PARC radar performance degradation from spectral growth due to the communications signal.
The characteristics of optimized PARC waveforms are then analyzed to establish a trade-space between radar and communications performance within a PARC DFRC scenario. This is done by sampling the DFRC trade-space continuum with waveforms that contain a varying degree of communications bandwidth, from a pure radar waveform (no embedded communications) to a pure communications waveform (no radar component). Radar performance, which is degraded by range sidelobe modulation (RSM) from the communications signal randomness, is measured from the PARC signal variance across pulses; data throughput is established as the communications performance metric. Comparing the values of these two measures as a function of communications symbol rate explores the trade-offs in performance between radar and communications with optimized PARC waveforms.
Qua Nguyen
Hybrid Array and Privacy-Preserving Signaling Optimization for NextG Wireless Communications
When & Where:
Zoom Defense, please email jgrisafe@ku.edu for link.
Committee Members:
Erik Perrins, Chair
Morteza Hashemi
Zijun Yao
Taejoon Kim
KC Kong
Abstract
This PhD research tackles two critical challenges in NextG wireless networks: hybrid precoder design for wideband sub-Terahertz (sub-THz) massive multiple-input multiple-output (MIMO) communications and privacy-preserving federated learning (FL) over wireless networks.
In the first part, we propose a novel hybrid precoding framework that integrates true-time delay (TTD) devices and phase shifters (PS) to counteract the beam squint effect - a significant challenge in wideband sub-THz massive MIMO systems that leads to considerable loss in array gain. Unlike previous methods that design only the TTD values while fixing the PS values and assuming unbounded time delays, our approach jointly optimizes the TTD and PS values under a realistic time-delay constraint. We determine the minimum number of TTD devices required to achieve a target array gain using our proposed approach. Then, we extend the framework to multi-user wideband systems and formulate a hybrid array optimization problem aiming to maximize the minimum data rate across users. This problem is decomposed into two sub-problems: fair subarray allocation, solved via continuous domain relaxation, and subarray gain maximization, addressed via a phase-domain transformation.
The second part focuses on preserving privacy in FL over wireless networks. First, we design a differentially-private FL algorithm that applies time-varying noise variance perturbation. Taking advantage of existing wireless channel noise, we jointly design the differential privacy (DP) noise variances and the users' transmit power to resolve the tradeoffs between privacy and learning utility. Next, we tackle two critical challenges within FL networks: (i) privacy risks arising from model updates and (ii) reduced learning utility due to quantization heterogeneity. Prior work typically addresses only one of these challenges because maintaining learning utility under both privacy risks and quantization heterogeneity is a non-trivial task. We aim to improve the learning utility of privacy-preserving FL while allowing clusters of devices with different quantization resolutions to participate in each FL round. Specifically, we introduce a novel stochastic quantizer (SQ) that ensures a DP guarantee and minimal quantization distortion. To address quantization heterogeneity, we introduce a cluster size optimization technique combined with a linear fusion approach to enhance model aggregation accuracy. Lastly, inspired by the information-theoretic rate-distortion framework, a privacy-distortion tradeoff problem is formulated to minimize privacy loss under a given maximum allowable quantization distortion. The optimal solution to this problem is identified, revealing that the privacy loss decreases as the maximum allowable quantization distortion increases, and vice versa.
This research advances hybrid array optimization for wideband sub-THz massive MIMO and introduces novel algorithms for privacy-preserving quantized FL with diverse precision. These contributions enable high-throughput wideband MIMO communication systems and privacy-preserving AI-native designs, aligning with the performance and privacy protection demands of NextG networks.
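The array gain loss caused by beam squint, mentioned in the first part of the abstract, can be illustrated with a minimal sketch. It assumes a uniform linear array with half-wavelength spacing at the center frequency and phase shifters only (no TTD devices); the function name and parameters are hypothetical, and this is not the joint TTD and PS design proposed in the research.

```python
import cmath
import math


def normalized_array_gain(num_antennas, freq_ratio, angle_rad):
    """Normalized gain of a phase-shifter-only uniform linear array.

    Assumes half-wavelength spacing at the center frequency and phase
    shifters steered to angle_rad at the center frequency. freq_ratio is
    f / f_c for the frequency being evaluated.
    """
    total = 0j
    for n in range(num_antennas):
        # Residual phase error on antenna n when operating off-center.
        phase_error = math.pi * n * (freq_ratio - 1.0) * math.sin(angle_rad)
        total += cmath.exp(-1j * phase_error)
    return abs(total) / num_antennas
```

Under these assumptions the gain is 1 at the center frequency and drops sharply at band edges for large arrays, for example when evaluating `normalized_array_gain(256, 1.05, math.pi / 4)`. A true-time delay applies a phase proportional to frequency, which is why TTD devices can cancel this residual error where frequency-flat phase shifters cannot.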
Arin Dutta
Performance Analysis of Distributed Raman Amplification with Different Pumping Configurations
When & Where:
Nichols Hall, Room 246 (Executive Conference Room)
Committee Members:
Rongqing Hui, Chair
Morteza Hashemi
Rachel Jarvis
Alessandro Salandrino
Hui Zhao
Abstract
As internet services like high-definition videos, cloud computing, and artificial intelligence keep growing, optical networks need to keep up with the demand for more capacity. Optical amplifiers play a crucial role in offsetting fiber loss and enabling long-distance wavelength division multiplexing (WDM) transmission in high-capacity systems. Various methods have been proposed to enhance the capacity and reach of fiber communication systems, including advanced modulation formats, dense wavelength division multiplexing (DWDM) over ultra-wide bands, space-division multiplexing, and high-performance digital signal processing (DSP) technologies. To maintain higher data rates along with maximizing the spectral efficiency of multi-level modulated signals, a higher Optical Signal-to-Noise Ratio (OSNR) is necessary. Despite advancements in coherent optical communication systems, the spectral efficiency of multi-level modulated signals is ultimately constrained by fiber nonlinearity. Raman amplification is an attractive solution for wide-band amplification with low noise figures in multi-band systems.
Distributed Raman Amplification (DRA) has been deployed in recent high-capacity transmission experiments to achieve a relatively flat signal power distribution along the optical path, and it offers the unique advantage of using conventional low-loss silica fibers as the gain medium, effectively transforming passive optical fibers into active or amplifying waveguides. Also, DRA provides gain at any wavelength by selecting the appropriate pump wavelength, enabling operation in signal bands outside the Erbium doped fiber amplifier (EDFA) bands. A forward (FW) Raman pumping configuration can be adopted to further improve DRA performance, as it is more efficient in OSNR improvement because the optical noise is generated near the beginning of the fiber span and attenuated along the fiber. A dual-order FW pumping scheme helps to reduce the non-linear effect of the optical signal and improves OSNR by more uniformly distributing the Raman gain along the transmission span.
The major concern with Forward Distributed Raman Amplification (FW DRA) is the fluctuation in pump power, known as relative intensity noise (RIN), which transfers from the pump laser to both the intensity and phase of the transmitted optical signal as they propagate in the same direction. Additionally, another concern of FW DRA is the rise in signal optical power near the start of the fiber span, leading to an increase in the non-linear phase shift of the signal. These factors, including RIN transfer-induced noise and non-linear noise, contribute to the degradation of system performance in FW DRA systems at the receiver.
As the performance of DRA with backward pumping is well understood, with relatively low impact of RIN transfer, our research is focused on the FW pumping configuration and is intended to provide a comprehensive analysis of the system performance impact of dual order FW Raman pumping, including signal intensity and phase noise induced by the RINs of both the 1st and 2nd order pump lasers, as well as the impacts of linear and nonlinear noise. The efficiencies of pump RIN to signal intensity and phase noise transfer are theoretically analyzed and experimentally verified by applying a shallow intensity modulation to the pump laser to mimic the RIN. The results indicate that the efficiency of the 2nd order pump RIN to signal phase noise transfer can be more than 2 orders of magnitude higher than that from the 1st order pump. Then the performance of the dual order FW Raman configurations is compared with that of single order Raman pumping to understand trade-offs of system parameters. The nonlinear interference (NLI) noise is analyzed to study the overall OSNR improvement when employing a 2nd order Raman pump. Finally, a DWDM system with 16-QAM modulation is used as an example to investigate the benefit of DRA with dual order Raman pumping and with different pump RIN levels. We also consider a DRA system using a 1st order incoherent pump together with a 2nd order coherent pump. Although dual order FW pumping corresponds to a slight increase of linear amplified spontaneous emission (ASE) compared to using only a 1st order pump, its major advantage comes from the reduction of nonlinear interference noise in a DWDM system. Because the RIN of the 2nd order pump has a much higher impact than that of the 1st order pump, there should be a more stringent requirement on the RIN of the 2nd order pump laser when a dual order FW pumping scheme is used for DRA for efficient fiber-optic communication.
Also, the results of the system performance analysis reveal that higher baud rate systems, like those operating at 100 Gbaud, are less affected by pump laser RIN due to the low-pass characteristics of the transfer of pump RIN to signal phase noise.
Audrey Mockenhaupt
Using Dual Function Radar Communication Waveforms for Synthetic Aperture Radar Automatic Target Recognition
When & Where:
Nichols Hall, Room 246 (Executive Conference Room)
Committee Members:
Patrick McCormick, Chair
Shannon Blunt
Jon Owen
Abstract
Pending.
Rich Simeon
Delay-Doppler Channel Estimation for High-Speed Aeronautical Mobile Telemetry Applications
When & Where:
Eaton Hall, Room 2001B
Committee Members:
Erik Perrins, Chair
Shannon Blunt
Morteza Hashemi
Jim Stiles
Craig McLaughlin
Abstract
The next generation of digital communications systems aims to operate in high-Doppler environments such as high-speed trains and non-terrestrial networks that utilize satellites in low-Earth orbit. Current generation systems use Orthogonal Frequency Division Multiplexing modulation which is known to suffer from inter-carrier interference (ICI) when different channel paths have dissimilar Doppler shifts.
A new Orthogonal Time Frequency Space (OTFS) modulation (also known as Delay-Doppler modulation) is proposed as a candidate modulation for 6G networks that is resilient to ICI. To date, OTFS demodulation designs have focused on the use cases of popular urban terrestrial channel models where path delay spread is a fraction of the OTFS symbol duration. However, wireless wide-area networks that operate in the aeronautical mobile telemetry (AMT) space can have large path delay spreads due to reflections from distant geographic features. This presents problems for existing channel estimation techniques which assume a small maximum expected channel delay, since data transmission is paused to sound the channel by an amount equal to twice the maximum channel delay. The dropout in data contributes to a reduction in spectral efficiency.
Our research addresses OTFS limitations in the AMT use case. We start with an exemplary OTFS framework with parameters optimized for AMT. Following system design, we focus on two distinct areas to improve OTFS performance in the AMT environment. First, we propose a new channel estimation technique using a pilot signal superimposed over data that can measure large delay spread channels with no penalty in spectral efficiency. A successive interference cancellation algorithm is used to iteratively improve channel estimates and jointly decode data. A second aspect of our research aims to equalize in delay-Doppler space. In the delay-Doppler paradigm, the rapid channel variations seen in the time-frequency domain are transformed into a sparse quasi-stationary channel in the delay-Doppler domain. We propose to use machine learning with Gaussian Process Regression to take advantage of the sparse and stationary channel and learn the channel parameters, compensating for the effects of fractional Doppler, which simpler channel estimation techniques cannot mitigate. Both areas of research can advance the robustness of OTFS across all communications systems.
Past Defense Notices
YUFEI CHENG
Future Internet Routing Design for Massive Failures and Attacks
When & Where:
246 Nichols Hall
Committee Members:
James Sterbenz, Chair
Jiannong Cao
Victor Frost
Fengjun Li
Michael Vitevitch
Abstract
Given the high complexity and increasing traffic load of the current Internet, geographically-correlated challenges caused by large-scale disasters or malicious attacks pose a significant threat to dependable network communications. To understand their characteristics, we start our research by proposing a critical-region identification mechanism. Furthermore, the identified regions are incorporated into a new graph resilience metric, compensated Total Geographical Graph Diversity (cTGGD), which is capable of characterizing and differentiating resiliency levels for different topologies. We further propose the path geodiverse problem (PGD), which requires the calculation of a number of geographically disjoint paths, and two heuristics with less complexity than the optimal algorithm. We present two flow-diverse multi-commodity flow problems, a linear minimum-cost problem and a nonlinear delay-skew optimization problem, to study the tradeoff among cost, end-to-end delay, and traffic skew on different geodiverse paths. We further prototype and integrate the solutions from the above models into our cross-layer resilient protocol stack, ResTP--GeoDivRP. Our protocol stack is implemented in the network simulator ns-3 and emulated in the KanREN testbed. By providing multiple geodiverse paths, our protocol stack provides better path protection than Multipath TCP (MPTCP) against geographically-correlated challenges. Finally, we analyze the mechanisms attackers could utilize to maximize the attack impact and demonstrate the effectiveness of a network restoration plan.
HARSHITH POTU
Android Application for Interactive Teaching
When & Where:
250 Nichols Hall
Committee Members:
Prasad Kulkarni, Chair
Esam El-Araby
Andy Gill
Abstract
In a world with enormously growing technologies and applications, most people use smart devices. This provides a means to develop smart applications that will help students learn effectively.
In this project, we develop a smart Android application that provides a digital means of interaction between professors and students. Instead of using traditional emails for every discussion, this application helps to broadcast multiple messages to the class through a single click. Students will also be able to follow multiple professors and participate in active discussions. The application also allows users to send personal messages to other users in order to participate in an active discussion. It provides unique logins to every student and professor. It uses MongoDB as the database and the "Parse" backend as a service. The main inspiration for this project was an application called Tophat.
ABDULMALIK HUMAYED
Security Protection for Smart Cars — A CPS Perspective
When & Where:
246 Nichols Hall
Committee Members:
Bo Luo, Chair
Arvin Agah
Prasad Kulkarni
Heechul Yun
Prajna Dhar
Abstract
As passenger vehicles evolve to be “smart”, electronic components, including communication, intelligent control and entertainment, are continuously introduced to new models and concept vehicles. The new paradigm introduces new features and benefits, but also brings new security issues, which are often overlooked in the industry as well as in the research community.
Smart cars are considered cyber-physical systems (CPS) because of their integration of cyber and physical components. In recent years, various threats, vulnerabilities, and attacks have been discovered in different models of smart cars. In the worst-case scenario, external attackers may remotely obtain full control of the vehicle by exploiting an existing vulnerability.
In this research, we investigate smart cars’ security from a CPS perspective and derive a taxonomy of threats, vulnerabilities, attacks, and controls. In addition, we investigate three security solutions that would improve the security posture of automotive networks. First, as automotive networks are highly vulnerable to Denial of Service (DoS) attacks, we investigate a solution that effectively mitigates such attacks, namely ID-Hopping. In addition, because several attacks have successfully exploited the poor separation between critical and non-critical components in automotive networks, we propose to investigate the effectiveness of firewalls and Intrusion Detection Systems (IDS) to prevent and detect such exploitations. To evaluate our proposals, we built a test bench composed of five microcontrollers and a communication bus to simulate an automotive network. Simulations and experiments performed with the test bench demonstrate the effectiveness of ID-Hopping against DoS attacks.
CAITLIN McCOLLISTER
Predicting Author Traits Through Topic Modeling of Multilingual Social Media Text
When & Where:
246 Nichols Hall
Committee Members:
Bo Luo, Chair
Arvin Agah
Luke Huan
Abstract
One source of insight into the motivations of a modern human being is the text they write and post for public consumption online, in forms such as personal status updates, product reviews, or forum discussions. The task of inferring traits about an author based on their writing is often called "author profiling." One challenging aspect of author profiling in today’s world is the increasing diversity of natural languages represented on social media websites. Furthermore, the informal nature of such writing often inspires modifications to standard spelling and grammatical structure which are highly language-specific.
These are some of the dilemmas that inspired a series of so-called "shared task" competitions, in which many participants work to solve a single problem in different ways, in order to compare their methods and results. This thesis describes our submission to one author profiling shared task in which 22 teams implemented software to predict the age, gender, and certain personality traits of Twitter users based on the content of their posts to the website. We will also analyze the performance and implementation of our system compared to those of other teams, all of which were described in open-access reports.
The competition organizers provided a labeled training dataset of tweets in English, Spanish, Dutch, and Italian, and evaluated the submitted software on a similar but hidden dataset. Our approach is based on applying a topic modeling algorithm to an auxiliary, unlabeled but larger collection of tweets we collected in each language, and representing tweets from the competition dataset in terms of a vector of 100 topics. We then trained a random forest classifier based on the labeled training dataset to predict the age, gender and personality traits for authors of tweets in the test set. Our software ranked in the top half of participants in English and Italian, and the top third in Dutch.
ANIRUDH NARASIMMAN
Arcana: Private Tweets on a Public Microblog Platform
When & Where:
250 Nichols Hall
Committee Members:
Bo Luo, Chair
Luke Huan
Prasad Kulkarni
Abstract
As one of the world’s most famous online social networks (OSN), Twitter now has 320 million monthly active users. With such a large user group and abundant personal information, users increasingly realize the vulnerability of tweets and have reservations about showing certain tweets to different follower groups, such as colleagues, friends, and other followers. However, Twitter does not offer enough privacy protection or access control functions. Users can only set an account as protected, which results in only the user’s followers seeing their tweets. A protected tweet does not appear in the public domain, and third-party sites and search engines cannot access it. However, a protected account cannot distinguish between different follower groups or users who use multiple accounts. To let users restrict access to each tweet to certain follower groups, we propose a browser plug-in system that utilizes CP-ABE (Ciphertext-Policy Attribute-Based Encryption), allowing the user to select followers based on predefined attributes. Through simple installation and pre-setting, the user can encrypt and decrypt tweets conveniently and without fear of information leakage.
PRATHAP KUMAR VALSAN
Towards Achieving Predictable Memory Performance on Multi-core Based Mixed Criticality Embedded Systems
When & Where:
250 Nichols Hall
Committee Members:
Heechul Yun, Chair
Esam El-Araby
Prasad Kulkarni
Abstract
The shared resources in multi-core systems, mainly the memory subsystem (caches and DRAM), if not managed properly, would affect the predictability of real-time tasks in the presence of co-runners. In this work, we first studied the design of COTS DRAM controllers and their impact on predictability, and proposed a DRAM controller design, called MEDUSA, to provide predictable memory performance in multi-core based real-time systems. In our approach, the OS partially partitions DRAM banks into reserved banks and shared banks. The reserved banks are exclusive to each core to provide predictable timing, while the shared banks are shared by all cores to efficiently utilize the resources. MEDUSA has two separate queues for read and write requests, and it prioritizes reads over writes. In processing read requests, MEDUSA employs a two-level scheduling algorithm that prioritizes the memory requests to the reserved banks in a Round Robin fashion to provide strong timing predictability. In processing write requests, MEDUSA largely relies on FR-FCFS for high throughput. We implemented MEDUSA in a cycle-accurate full-system simulator. The results show that MEDUSA achieves up to 91% better worst-case performance for real-time tasks while achieving up to 29% throughput improvement for non-real-time tasks.
Second, we studied the contention at shared caches and its impact on predictability. We demonstrate that the prevailing cache partitioning techniques do not necessarily ensure predictable cache performance in modern COTS multi-core platforms that use non-blocking caches to exploit memory-level-parallelism (MLP). Through carefully designed experiments using three real COTS multi-core platforms (four distinct CPU architectures) and a cycle-accurate full-system simulator, we show that special hardware registers in non-blocking caches, known as Miss Status Holding Registers (MSHRs), which track the status of outstanding cache-misses, can be a significant source of contention. We propose a hardware and system software (OS) collaborative approach to efficiently eliminate MSHR contention for multi-core real-time systems. We implement the hardware extension in a cycle-accurate full-system simulator and the scheduler modification in the Linux 3.14 kernel. In a case study, we achieve up to 19% WCET reduction (average: 13%) for a set of EEMBC benchmarks compared to a baseline cache partitioning setup.
LEI SHI
Multichannel Sense-and-Avoid Radar for Small UAVs
When & Where:
2001B Eaton Hall
Committee Members:
Chris Allen, Chair
Glenn Prescott
Jim Stiles
Heechul Yun
Lisa Friis
Abstract
This dissertation investigates the feasibility of creating a multichannel sense-and-avoid radar system for small fixed-wing unmanned aerial vehicles (UAVs, also known as sUAS or drones). These aircraft are projected to have a significant impact on the U.S. economy in both the commercial and government sectors; however, their lack of situation awareness has caused the FAA to strictly limit their use. In this dissertation, a miniature, multichannel FMCW radar system was created with size, weight, and power (SWaP) small enough to allow it to be mounted onboard a sUAS and provide in-flight target detection. The primary hazards to avoid are general aviation (GA) aircraft such as a Cessna 172, which was estimated to have a radar cross section (RCS) of approximately 1 square meter. The radar system is capable of locating potential hazards in range, Doppler, and 3-dimensional space using a patent-pending 2-D FFT process and interferometry. The initial prototype system has a detection range of approximately 800 m, with 360-degree azimuth coverage and +/- 15-degree elevation coverage, and draws less than 20 W. From the radar data, target detection, tracking, and the extrapolation of target behavior in six degrees of freedom were demonstrated.
RANJITH SOMPALLI
Implementation of Invertebrate Paleontology Knowledge base using Integration of Textual Ontology & Visual Features
When & Where:
2001B Eaton Hall
Committee Members:
Bo Luo, Chair
Jerzy Grzymala-Busse
Richard Wang
Abstract
The Treatise on Invertebrate Paleontology is the most authoritative compilation of the invertebrate fossil records. The quality of studies in paleontology, in particular, depends on the accessibility of fossil data. Unfortunately, the PDF version of the Treatise currently available is just a scanned copy of the paper publications, and the content is in no way organized to facilitate search and knowledge discovery. This project builds an information retrieval based system to extract the fossil descriptions, images, and other available information from the Treatise. The project is divided into two parts. The first part deals with extracting the text and images from the Treatise, organizing the information in a structured format, storing it in a relational database, and building a search engine to browse the fossil data. Extracting text requires identifying common textual patterns, so a text parsing algorithm is developed to identify the patterns and organize the information in a structured format. Images are extracted using image processing techniques such as image segmentation and morphological operations, and are then associated with the corresponding textual descriptions. A search engine is built to efficiently browse the extracted information, and the web interface provides options to perform many useful tasks with ease. The second part of this research focuses on the implementation of a Content Based Information Retrieval System. All images from the Treatise are grayscale fossil images, and identifying matching images based on visual image features alone is a very difficult task. Hence, we employ an approach that integrates textual and visual features to identify matching images. Textual features are extracted from the descriptions of the fossils, and, using statistical and part-of-speech tagging approaches, an ontology is generated that forms attribute-property pairs describing what a region looks like in each shell.
Popular image features such as SIFT, GIST, and HOG are extracted from the fossil images. Both the textual and image features are then integrated to extract the information related to the fossil image matching the query image.
NAGABHUSHANA GARGESHWARI MAHADEVASWAMY
How Duplicates Affect the Error Rate of Data Sets During Validation
When & Where:
2001B Eaton Hall
Committee Members:
Jerzy Grzymala-Busse, Chair
Prasad Kulkarni
Bo Luo
Abstract
In data mining, duplicate data plays a huge role in deciding the set of rules. In this project, we analyze the impact of duplicates in the input data set on the rule set. The effect of duplicates is analyzed using the error rate, which is calculated by comparing the obtained rule set against the testing part of the input data. The experimental results show a decrease in the error rate as the percentage of duplicates in the input data set increases, which demonstrates that duplicate data plays a crucial role in the validation process of machine learning. The LEM2 algorithm and a rule checker application were implemented as part of the project. The LEM2 algorithm is used to induce the rule set for the given input data set, and the rule checker application is used to calculate the error rate.
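The error rate calculation described above can be sketched as follows. This is a minimal illustration, not the project's rule checker or LEM2 itself; the data layout (rules as pairs of a condition dictionary and a decision, test cases as pairs of an attribute dictionary and an expected decision) and the choice to count unmatched cases as errors are assumptions made for the example.

```python
def classify(rules, case):
    """Return the decision of the first rule whose conditions all match, else None."""
    for conditions, decision in rules:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return decision
    return None


def error_rate(rules, test_cases):
    """Fraction of test cases that are misclassified or matched by no rule."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    errors = sum(
        1 for case, expected in test_cases if classify(rules, case) != expected
    )
    return errors / len(test_cases)
```

With this layout, duplicated cases that land in both the training and testing parts are likely to be covered by an induced rule, which is one plausible reading of why the measured error rate falls as the share of duplicates grows.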
GOWTHAM GOLLA
Developing Novel Machine Learning Algorithms to Improve Sedentary Assessment for Youth Health Enhancement
When & Where:
2001B Eaton Hall
Committee Members:
Luke Huan, Chair
Jerzy Grzymala-Busse
Jordan Carlson
Abstract
Sedentary behavior of youth is an important determinant of health. However, better measures are needed to improve understanding of this relationship and the mechanisms at play, as well as to evaluate health promotion interventions. Even though wearable devices like accelerometers (e.g., activPAL) are considered the standard for assessing physical activity in research, the machine learning algorithms that we propose will allow us to re-examine existing accelerometer data to better understand the association between sedentary time and health in various populations. To achieve this, we collected two datasets: a laboratory-controlled dataset and a free-living dataset. We trained machine learning classifiers on both datasets and compared their behavior across them. The classifiers predict five postures: sit, stand, sit-stand, stand-sit, and stand/walk. We also compared a manually constructed Hidden Markov model (HMM) with an automated HMM from existing software on both datasets to better understand the algorithm and the existing software. When tested on the laboratory-controlled dataset and the free-living dataset, the manually constructed HMM achieved a higher F1-macro score.
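For readers unfamiliar with the F1-macro score used to compare the models, the sketch below computes it from scratch: the F1 score is calculated per class and the per-class scores are averaged with equal weight. The function name and the example labels are hypothetical and are not taken from the study's data.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores over all classes seen."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is no support.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, a rare posture such as a sit-stand transition influences the macro score as much as a common one such as sit, which makes the metric suited to imbalanced posture data.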