Home
The 1st Workshop on AI for Data Editing aims to bring together researchers, practitioners, and policymakers to explore innovative AI-driven solutions for the multifaceted challenges in data editing. As data grows exponentially, there is an urgent demand for advanced strategies in data preprocessing, cleaning, transformation, and quality control, as well as a better understanding of complex interdependencies in large-scale data workflows. This workshop will delve into the latest AI technologies to facilitate efficient, accurate, and human-centric data editing processes.
The workshop is held in conjunction with the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025), one of the world's premier conferences in data science, data mining, and big data analytics. Organized by ACM SIGKDD, KDD is a key platform where researchers, practitioners, and industry experts present groundbreaking advancements and explore emerging trends in data mining, machine learning, and AI. This partnership provides the workshop with a vital forum for interdisciplinary knowledge exchange, fostering collaboration among data scientists, domain experts, and policymakers.
Agenda
The workshop will follow a half-day format featuring paper presentations, poster sessions, and keynote and invited talks. We anticipate attracting a minimum of 75 and potentially up to 100 attendees. Authors of accepted papers and posters are invited to attend. In addition, the workshop is open to researchers, industry professionals, and policymakers interested in AI for Data Editing. Due to venue capacity, attendance may be limited.
August 3, 2025 (Room 604, Toronto Convention Centre)
Time | Event | Speakers |
---|---|---|
08:00-08:05 | Opening Remarks | Organizers |
08:05-08:40 | Oral Presentation 1: Towards Autonomous Data Reprogramming | Yanjie Fu (Arizona State University) |
08:40-09:15 | Keynote 1 & QA: Unsupervised Deep Learning-Based Deformable Image Registration for Cardiac Motion Estimation from Cine Cardiac MRI Images | Suzanne M. Shontz (University of Kansas) |
09:15-09:35 | Poster Session 1: 8 Long Papers | |
09:35-10:10 | Keynote 2 & QA: The Myth of Clean Data: Real-World Data Challenges in Materials Informatics | Alix Schmidt (Dow Chemical) |
10:10-10:16 | Oral Presentation 2: Automated Label Placement on Maps via LLMs | Harry Shomer and Jiejun Xu |
10:16-10:22 | Oral Presentation 3: Regression Augmentation With Data-Driven Segmentation | Shayan Alahyari et al. |
10:22-10:28 | Oral Presentation 4: Efficient & Accurate Relevance Label Generation for IR Datasets | Sean D Rosario et al. |
10:28-10:34 | Oral Presentation 5: Image Encryption Using a Hybrid Metaheuristic Algorithm | Harsh Mishra et al. |
10:34-10:40 | Oral Presentation 6: Leveraging LLMs to Create Content Corpora for Niche Domains | Franklin Zhang et al. |
10:40-10:46 | Oral Presentation 7: Image Segmentation to Track Healthy Organs in Medical Scans for Cancer Treatment | Sharan M et al. |
10:46-11:08 | Poster Session 2: 5 Poster Papers | |
11:08-11:40 | Keynote 3 & QA: Preliminary Discussion on LLM-based Feature Selection | Huan Liu (Arizona State University) |
11:40-12:10 | Panel Discussion | Speakers + Attendees |
Panel Discussion Arrangement
Panel Discussion Details
Segment | Details | Duration |
---|---|---|
Opening (Moderator) | Welcome, panelist intros, framing of “AI for Data Editing” | 3 minutes |
Theme 1: Data or models for AI | Q1: Should we build augmented data or augmented models to achieve AI? | 7 minutes |
Theme 2: AI for data engineering | Q2: Compared with classic data engineering (e.g., preprocessing, selection, transformation, generation, augmentation), what are the benefits of introducing AI for data engineering, management, and retrieval? Q3: In your domain, what are the most critical data quality issues or errors AI should learn to detect or fix? How can domain knowledge be integrated into AI systems to support such tasks? | 7 minutes |
Theme 3: Human-AI Collaboration | Q4: Data labeled, generated, or edited by humans are deemed high-quality. Can AI (e.g., classic algorithms, large language models) act as human-like data editors or labelers with minimal supervision? What are the key technical and application challenges? Q5: How do we involve human feedback to improve AI editing systems? | 6 minutes |
Theme 4: Generalization | Q6: Can we build reusable AI data engineering or data management assistants, to generalize well across domains (e.g., vision vs. text vs. tabular)? | 6 minutes |
Audience Q&A | Use pre-collected or live questions: Q7: How can we better promote the convergence between AI and data (e.g., AI for data or data for AI)? Q8: How will AI reshape the future of data-related tasks: data engineering, data management, databases, data retrieval? | 5 minutes |
Closing Round | Ask each panelist to name one research direction with great potential for a breakthrough at the intersection of AI and data. | 1 minute |
Keynote Speakers

Huan Liu
Arizona State University
huanliu@asu.edu
Bio: Dr. Huan Liu is a Regents Professor and Ira A. Fulton Professor of Computer Science and Engineering at Arizona State University. He is the recipient of the ACM SIGKDD 2022 Innovation Award. His research interests are in data mining, machine learning, feature selection, social computing, social media mining, and artificial intelligence. He is a co-author of the text Social Media Mining: An Introduction (Cambridge University Press). He is Editor in Chief of ACM TIST, and Field Chief Editor of Frontiers in Big Data and its Specialty Chief Editor of Data Mining and Management. He is a Fellow of ACM, AAAI, AAAS, and IEEE.
Speaking Topics
- Preliminary Discussion on LLM-based Feature Selection
Speaker's Notes
Large Language Models (LLMs) are powerful in some data-intensive applications. Can feature selection leverage LLMs to identify informative features within a dataset, complementing its dependence on traditional search strategies or statistical heuristics? This talk offers our preliminary study of LLM-based feature selection approaches and their robustness and efficiency, including direct prompting-based and search-based methods, and introduces an alternative, a tool-use-based approach, followed by empirical results and discussion.

Suzanne M. Shontz
University of Kansas
shontz@ku.edu
Bio: Dr. Suzanne M. Shontz is the Associate Dean for Graduate and Online Education in the School of Engineering (SoE) and a Full Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas (KU). She also holds courtesy appointments in the Department of Mechanical Engineering and the Bioengineering Program. In addition, she is the Director of the Mathematical Methods and Interdisciplinary Computing Center (MMICC) within the Information Sciences Institute at KU. Her research interests are in parallel scientific computing, with an emphasis on the development of unstructured mesh generation, numerical optimization, and machine learning methods, and their applications to image processing, computational medicine, materials, and other areas. Prof. Shontz is a Fellow of the International Meshing Roundtable and is the recipient of an NSF CAREER Award, NSF Presidential Early Career Award for Scientists and Engineers, and the James Corones Award in Leadership, Community Building, and Communication.
Speaking Topics
- Unsupervised Deep Learning-Based Deformable Image Registration for Cardiac Motion Estimation from Cine Cardiac MRI Images
Speaker's Notes
Estimated cardiac motion is used to assess the biomechanics of the cardiac chambers and is important for disease diagnosis and treatment planning. Cardiac motion can be estimated from a 4D cine cardiac MRI dataset by registering pairs of consecutive 3D images of the deforming heart. In this talk, I will present our hybrid convolutional neural network and Vision Transformer architecture for deformable image registration of 3D cine cardiac MRI images for cardiac motion estimation followed by experimental results and discussion.

Alix Schmidt
Dow Chemical
alixschmidt@gmail.com
Bio: Alix Schmidt is a Senior Research Scientist in Dow Chemical's Core R&D Information Research team in Midland, Michigan. During her 16 years in industry, Alix has held technical roles in polymer process research and development, high throughput research and combinatorial design, and chemical manufacturing analytics. She is currently a lead materials informatics researcher at Dow, with three granted patents and several pending patents on the subject of materials representation. Additionally, she leads the R&D Machine Learning Operations (MLOps) Strategy and the LLM Strategy for the successful deployment of traditional and generative AI into Dow's research, product design, and technical customer service organizations. Alix focuses on building a community around industrial materials informatics by giving guest lectures at several universities and chairing symposia at the AIChE and ACS conferences. She is also co-editor of the 2024 book The Digital Transformation of Product Formulation: Concepts, Challenges, and Applications for Accelerated Innovation.
Speaking Topics
- The Myth of Clean Data: Real-World Data Challenges in Materials Informatics
Speaker's Notes
In contrast to industries with many digital-native companies, the chemicals and materials industry is still dominated by large, established players. Far from digitally native, many older companies suffer from highly variable and siloed data infrastructure resulting from decades of mergers, acquisitions, and changes in technology. There is a prevailing assumption that if enough money is spent to integrate systems and achieve a unified data fabric, the data foundation for materials AI will be solved. In this talk, I will challenge this assumption and highlight some characteristics inherent to R&D and materials design data that pose difficulties to machine learning even in the case of "perfectly clean" data.
Panel Discussion Speakers

Hui Xiong
Hong Kong University of Science and Technology (Guangzhou), xionghui@hkust-gz.edu.cn

Xiangliang Zhang
University of Notre Dame, xzhang33@nd.edu

Alix Schmidt
Dow Chemical, alixschmidt@gmail.com

Suzanne M. Shontz
University of Kansas, shontz@ku.edu

Huan Liu
Arizona State University, huanliu@asu.edu
Topics of Interest
We encourage submissions on a broad range of topics related to AI for data editing, including but not limited to:
- Methods for automated data science
- Automated data cleaning, denoising, interpolation, refinement, and quality improvement
- Automated feature selection, generation, and feature-instance joint selection
- Automated data representation learning or reconstruction
- Automated outlier detection and removal
- New datasets in domain application areas
- in speech, vision, manufacturing, smart cities, transportation, mobile computing, sensing, medical, recommendation, personalization, science domain
- Tools and methodologies for accelerating open-source dataset preparation and iteration
- Tools that quantify and accelerate time to source and prepare high-quality data
- Tools that ensure that the data is labeled consistently, such as label consensus
- Tools that make improving data quality more systematic
- Tools that automate the creation of high-quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition
- Tools that produce consistent and low-noise data samples, or remove labeling noise or inconsistencies from existing data
- Tools for controlling what goes into the dataset and for making high-level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
- Search methods for finding suitably licensed datasets based on public resources
- Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
- Tools for timely incorporation of feedback from production systems into datasets
- Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
- Dataset importers that allow easy combination and composition of existing datasets
- Dataset exporters that make the data consumable for models and interface with model training and inference systems such as webdataset.
- System architectures and interfaces that enable composition of dataset tools such as MLCube, Docker, Airflow
- Algorithms for working with limited labeled data and improving label efficiency:
- Data selection techniques such as active learning and coreset selection for identifying the most valuable examples to label.
- Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data.
- Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data
- Algorithms for working with shifted, drifted, out-of-distribution data
- New datasets for bias evaluation and analysis
- New algorithms for fixing shifted, drifted, OOD data
- Algorithms for working with biased data
- New datasets for bias evaluation and analysis
- New algorithms for automated elimination of bias in data
- New algorithms for model training with biased data
Important Dates
Workshop Call for Papers | April 19, 2025 |
Workshop Paper Submission | |
Notification of Workshop Papers Acceptance | |
Workshop Date | August 3, 2025 |
Submission Guidelines
We invite the submission of regular research papers of up to 9 pages, including any appendix, plus unlimited references. Paper content is limited to 9 pages: an appendix, if present, counts toward that limit, and a paper without an appendix may use all 9 pages for content. Submissions must be in PDF format and formatted according to the new standard ACM sigconf template. Submitted papers will be assessed based on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via the EasyChair system.
We invite authors to submit their papers via the EasyChair submission portal: https://easychair.org/conferences?conf=ai4de
Organizing Committee
Dongjie Wang
University of Kansas
wangdongjie@ku.edu
Yanjie Fu
Arizona State University
yanjie.fu@asu.edu
Kunpeng Liu
Portland State University
kunpeng@pdx.edu
Xiangliang Zhang
University of Notre Dame
xzhang33@nd.edu
Khalid Osman
Stanford University
osmank@stanford.edu
Charu Aggarwal
IBM T.J. Watson Research Center
charu@us.ibm.com
Jian Pei
Duke University
j.pei@duke.edu
Wei Fan
University of Oxford
wei.fan@wrh.ox.ac.uk
Volunteers
Rui Liu
University of Kansas
Ph.D. Student
Tao Zhe
University of Kansas
Ph.D. Student