2023-08-06

cs.LG

cs.LG - 2023-08-06

AI-GOMS: Large AI-Driven Global Ocean Modeling System

paper_url: http://arxiv.org/abs/2308.03152
repo_url: None
paper_authors: Wei Xiong, Yanfei Xiang, Hao Wu, Shuyi Zhou, Yuze Sun, Muyuan Ma, Xiaomeng Huang
for: 这个论文的目的是为了提出一种基于人工智能的全球海洋模拟系统，以实现精确和高效的全球海洋日常预测。
methods: 该论文使用了一种基于 Fourier-based Masked Autoencoder 结构的基本海洋变量预测模型，以及一些轻量级的细化预测模型，包括地区下降、涟式解码和生物化 Coupling 模块。
results: 该论文实现了在30天预测期间，全球海洋基本变量的最佳性能，并且可以正确 simulate mesoscale eddies 在日本热带区域和海洋层次分布在赤道太平洋区域。该系统还实现了对 Earth system 模型的新脊梁下游方式，使其可以易于转移、扩展和重用。

Abstract
Ocean modeling is a powerful tool for simulating the physical, chemical, and biological processes of the ocean, which is the foundation for marine science research and operational oceanography. Modern numerical ocean modeling mainly consists of governing equations and numerical algorithms. Nonlinear instability, computational expense, low reusability efficiency and high coupling costs have gradually become the main bottlenecks for the further development of numerical ocean modeling. Recently, artificial intelligence-based modeling in scientific computing has shown revolutionary potential for digital twins and scientific simulations, but the bottlenecks of numerical ocean modeling have not been further solved. Here, we present AI-GOMS, a large AI-driven global ocean modeling system, for accurate and efficient global ocean daily prediction. AI-GOMS consists of a backbone model with the Fourier-based Masked Autoencoder structure for basic ocean variable prediction and lightweight fine-tuning models incorporating regional downscaling, wave decoding, and biochemistry coupling modules. AI-GOMS has achieved the best performance in 30 days of prediction for the global ocean basic variables with 15 depth layers at 1/4{\deg} spatial resolution. Beyond the good performance in statistical metrics, AI-GOMS realizes the simulation of mesoscale eddies in the Kuroshio region at 1/12{\deg} spatial resolution and ocean stratification in the tropical Pacific Ocean. AI-GOMS provides a new backbone-downstream paradigm for Earth system modeling, which makes the system transferable, scalable and reusable.

摘要
海洋模拟是一种强大的工具，用于模拟海洋物理、化学和生物过程，这是海洋科学研究和操作海洋学的基础。现代数值海洋模拟主要由管理方程和数值算法组成。不线性不稳定、计算成本高、 reuse效率低和对接成本高逐渐成为现代数值海洋模拟的主要瓶颈。近年来，基于人工智能的科学计算方法在数值海洋模拟中表现出革命性的潜力，但现代数值海洋模拟中的瓶颈问题尚未得到解决。在这种情况下，我们提出了AI-GOMS，一个大规模的人工智能驱动的全球海洋模拟系统，用于准确和高效地预测全球海洋日常变化。AI-GOMS包括一个基础模型，其结构基于 Fourier-based Masked Autoencoder，用于预测基本海洋变量。此外，AI-GOMS还包括一些轻量级的精度增强模型，包括地区下降、浪谱解码和生物化学相互作用模块。AI-GOMS在30天预测全球海洋基本变量的15层深度分辨率下达到了最佳性能。除了在统计指标上的良好表现外，AI-GOMS还能够模拟kuroshio区域的宏观瑞度涟潮和热带太平洋海洋的层次分布。AI-GOMS提供了一个新的基础-下游模式，用于地球系统模拟，这使得系统可转移、可扩展和可重用。

Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS Image Reconstruction

paper_url: http://arxiv.org/abs/2308.03807
repo_url: https://github.com/fanxiaohong/Nest-DGIL
paper_authors: Xiaohong Fan, Yin Yang, Ke Chen, Yujie Feng, Jianping Zhang
for: 提高图像恢复精度和速度，并减少artefacts。
methods: 基于第二代Nesterov proximal梯度优化的深度凝师增量学习框架。
results: 提出了一种可 theoretically guarantee geometric texture details的恢复方法，并且可以快速 converges。

Abstract
Proximal gradient-based optimization is one of the most common strategies for solving image inverse problems as well as easy to implement. However, these techniques often generate heavy artifacts in image reconstruction. One of the most popular refinement methods is to fine-tune the regularization parameter to alleviate such artifacts, but it may not always be sufficient or applicable due to increased computational costs. In this work, we propose a deep geometric incremental learning framework based on second Nesterov proximal gradient optimization. The proposed end-to-end network not only has the powerful learning ability for high/low frequency image features,but also can theoretically guarantee that geometric texture details will be reconstructed from preliminary linear reconstruction.Furthermore, it can avoid the risk of intermediate reconstruction results falling outside the geometric decomposition domains and achieve fast convergence. Our reconstruction framework is decomposed into four modules including general linear reconstruction, cascade geometric incremental restoration, Nesterov acceleration and post-processing. In the image restoration step,a cascade geometric incremental learning module is designed to compensate for the missing texture information from different geometric spectral decomposition domains. Inspired by overlap-tile strategy, we also develop a post-processing module to remove the block-effect in patch-wise-based natural image reconstruction. All parameters in the proposed model are learnable,an adaptive initialization technique of physical-parameters is also employed to make model flexibility and ensure converging smoothly. We compare the reconstruction performance of the proposed method with existing state-of-the-art methods to demonstrate its superiority. Our source codes are available at https://github.com/fanxiaohong/Nest-DGIL.

摘要
近似 gradient-based 优化是解析图像问题的一种最常见的策略，易于实现。然而，这些技术通常会产生重要的artefacts在图像重建中。一种最受欢迎的修正方法是微调正则化参数，以避免这些artefacts，但这可能并不总是可行或适用，因为它可能会增加计算成本。在这项工作中，我们提出了一种深度几何增量学习框架，基于第二个讷斯特洛夫 proximal gradient 优化。我们的提案的终端网络不仅有强大的学习能力，可以学习高/低频图像特征，而且可以 theoretically 保证，从初步线性重建中恢复几何纹理细节。此外，它可以避免初步重建结果落在几何分解域之外，并快速 converges。我们的重建框架分为四个模块，包括普通的线性重建、几何增量修复、讷斯特洛夫加速和后处理。在图像修复步骤中，我们设计了几何增量修复模块，以补偿不同几何 spectral decomposition 域中缺失的纹理信息。受到 overlap-tile 策略的启发，我们还开发了后处理模块，以除去 patch-wise 基于自然图像重建中的块效应。所有模型参数都是学习的，并且使用 adaptive initialization 技术，以确保模型的灵活性和平滑的 converging。我们与现有的状态态-of-the-art 方法进行比较，以证明我们的方法的优越性。我们的源代码可以在 https://github.com/fanxiaohong/Nest-DGIL 上获取。

Self-Directed Linear Classification

paper_url: http://arxiv.org/abs/2308.03142
repo_url: https://github.com/Aryia-Behroziuan/neurons
paper_authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis
for: 这个论文研究了在线分类中learner预测标签的顺序选择问题，以实现最小化总错误数。
methods: 作者使用了自适应顺序选择方法，并设计了两个主要结果：一是对于uniformly随机从单位球体上取样的$X$数据集，设计了一个高效的自适应学习者，其错误数为$O(d \log \log n)$；二是对于任意$d$-维数据集$X$，设计了一个高效的自适应学习者，可以预测$X$中99%的点标签，错误数与$n$无关。
results: 作者的研究表明，在线分类中，采用自适应顺序选择方法可以实现较低的错误率，比如worst-order和Random-order学习方法的至少$\Omega(d \log n)$错误率。

Abstract
In online classification, a learner is presented with a sequence of examples and aims to predict their labels in an online fashion so as to minimize the total number of mistakes. In the self-directed variant, the learner knows in advance the pool of examples and can adaptively choose the order in which predictions are made. Here we study the power of choosing the prediction order and establish the first strong separation between worst-order and random-order learning for the fundamental task of linear classification. Prior to our work, such a separation was known only for very restricted concept classes, e.g., one-dimensional thresholds or axis-aligned rectangles. We present two main results. If $X$ is a dataset of $n$ points drawn uniformly at random from the $d$-dimensional unit sphere, we design an efficient self-directed learner that makes $O(d \log \log(n))$ mistakes and classifies the entire dataset. If $X$ is an arbitrary $d$-dimensional dataset of size $n$, we design an efficient self-directed learner that predicts the labels of $99\%$ of the points in $X$ with mistake bound independent of $n$. In contrast, under a worst- or random-ordering, the number of mistakes must be at least $\Omega(d \log n)$, even when the points are drawn uniformly from the unit sphere and the learner only needs to predict the labels for $1\%$ of them.

摘要
在在线分类中，学习者会看到一个序列例子并尝试预测其标签，以最小化总错误数。在自适应变体中，学习者在预测之前知道例子池，并可以动态选择预测的顺序。我们研究预测顺序的选择力，并证明了线性分类的基本任务上最差顺序和随机顺序之间的首次强分离。在我们的工作之前，这种分离只知道对非常限制的概念类型，例如一维阈值或垂直排序的直方体。我们有两个主要结果。如果$X$是一个$d$-维均匀随机选取从单位球体上的$n$个点，我们设计了高效的自适应学习者，其错误数为$O(d \log \log(n))$.如果$X$是一个任意$d$-维数据集大小$n$的点，我们设计了高效的自适应学习者，可以预测$X$中$99\%$的点标签，错误 bound不виси于$n$.相比之下，在最差或随机顺序下，错误数至少为$\Omega(d \log n)$,即使点被均匀选取从单位球体，并且学习者只需要预测$1\%$的点标签。

Iterative Magnitude Pruning as a Renormalisation Group: A Study in The Context of The Lottery Ticket Hypothesis

paper_url: http://arxiv.org/abs/2308.03128
repo_url: None
paper_authors: Abu-Al Hassan
for: This paper explores the Lottery Ticket Hypothesis (LTH) in Deep Neural Networks (DNNs), which suggests that smaller, trainable subnetworks within extensive DNNs can achieve performance comparable to the full model.
methods: The paper uses Iterative Magnitude Pruning (IMP) to identify and eliminate minimal weights in DNNs, emulating stepwise learning. The authors also investigate the “universality” of winning tickets and their applicability to other similar problems.
results: The paper bridges the gap between IMP and the Renormalisation Group (RG) theory in physics, providing a more rigorous understanding of IMP and its potential applications in DNNs.

Abstract
This thesis delves into the intricate world of Deep Neural Networks (DNNs), focusing on the exciting concept of the Lottery Ticket Hypothesis (LTH). The LTH posits that within extensive DNNs, smaller, trainable subnetworks termed "winning tickets", can achieve performance comparable to the full model. A key process in LTH, Iterative Magnitude Pruning (IMP), incrementally eliminates minimal weights, emulating stepwise learning in DNNs. Once we identify these winning tickets, we further investigate their "universality". In other words, we check if a winning ticket that works well for one specific problem could also work well for other, similar problems. We also bridge the divide between the IMP and the Renormalisation Group (RG) theory in physics, promoting a more rigorous understanding of IMP.

摘要
这个论文探讨了深度神经网络（DNN）的复杂世界，特别关注了赢家票假设（LTH）。LTH认为，在广泛的DNN中，更小的、可训练的子网络（赢家票）可以达到相同的性能。我们在LTH中使用增量大小减少（IMP）来逐渐减少最小的权重，模拟了DNN中的步骤学习。一旦我们identified这些赢家票，我们进一步调查它们的“通用性”。即我们检查一个赢家票在一个特定问题上能够达到高性能是否也能够在其他相似问题上达到高性能。我们还将IMP与物理学RG理论相连接，以促进IMP的更加准确的理解。

Learning-Rate-Free Learning: Dissecting D-Adaptation and Probabilistic Line Search

paper_url: http://arxiv.org/abs/2308.03102
repo_url: None
paper_authors: Max McGuinness
for: 本研究探讨了两种最近的学习率优化方法：D-Adaptation（arXiv:2301.07733）和概率线搜（arXiv:1502.02846）。这两种方法强调缓解选择初始学习率的负担，通过距离度量和 Gaussian 过程 posterior 估计，分别实现了。
methods: 本研究使用了 D-Adaptation 方法和概率线搜方法。D-Adaptation 方法基于距离度量，可以在不同的批处理大小下选择最佳学习率。概率线搜方法则使用 Gaussian 过程 posterior 估计来估计学习率的变化范围。
results: 本研究通过对两种方法的比较，发现 D-Adaptation 方法在某些情况下可以提供更高的准确率，而概率线搜方法在其他情况下可以提供更快的收敛速率。此外，本研究还发现了这两种方法在不同的批处理大小下的表现。

Abstract
This paper explores two recent methods for learning rate optimisation in stochastic gradient descent: D-Adaptation (arXiv:2301.07733) and probabilistic line search (arXiv:1502.02846). These approaches aim to alleviate the burden of selecting an initial learning rate by incorporating distance metrics and Gaussian process posterior estimates, respectively. In this report, I provide an intuitive overview of both methods, discuss their shared design goals, and devise scope for merging the two algorithms.

摘要
这份报告探讨了两种最近的学习率优化方法：D-Adaptation（arXiv:2301.07733）和概率线搜索（arXiv:1502.02846）。这两种方法目的是减轻选择初始学习率的负担，通过距离度量和 Gaussian 过程 posterior 估计，分别提供了一种简单的概念概述、讨论这两种方法的共同设计目标，并提出将这两种算法合并的范围。

Gradient Coding through Iterative Block Leverage Score Sampling

paper_url: http://arxiv.org/abs/2308.03096
repo_url: None
paper_authors: Neophytos Charalambides, Mert Pilanci, Alfred Hero
for: 这个论文是为了解决分布式计算中的失败问题（即慢进度），通过使用采样技术和编码计算方法来加速线性回归。
methods: 论文使用了一种基于采样的方法，即采样转换后的数据，以获得一个近似的$\ell_2$子空间嵌入。此外，它还使用了一种称为“编码计算”的方法，来加速线性回归。
results: 论文得到了一些有用的结果，包括：1) 采样技术可以在分布式计算中减少计算量，同时保持Solution的质量；2) 编码计算方法可以在分布式计算中加速线性回归，并且可以与采样技术结合使用以获得更好的性能。

Abstract
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings, to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods; known as gradient coding, to accelerate linear regression in the presence of failures in distributed computational networks, \textit{i.e.} stragglers. We replicate the data across the distributed network, to attain the approximation guarantees through the induced sampling distribution. The significance and main contribution of this work, is that it unifies randomized numerical linear algebra with approximate coded computing, while attaining an induced $\ell_2$-subspace embedding through uniform sampling. The transition to uniform sampling is done without applying a random projection, as in the case of the subsampled randomized Hadamard transform. Furthermore, by incorporating this technique to coded computing, our scheme is an iterative sketching approach to approximately solving linear regression. We also propose weighting when sketching takes place through sampling with replacement, for further compression.

摘要
我们扩展了抽象分析过程中的贡献分析方法，以适应分布式设置，使其适合分布式计算环境。这后续用于 derivation of an approximate coded computing approach for first-order methods，known as gradient coding，以加速分布式计算网络中的线性回授。我们将数据复制到分布式网络中，以获得近似保证通过导入的抽象分布。本研究的重要性和主要贡献在于它将随机数学和近似coded computing融合在一起，而且通过导入均匀抽象，以获得$\ell_2$ Sobolev embedding。与传统的随机抽象方法不同的是，我们不需要随机投影，而是通过均匀抽象来实现。此外，我们还提出了在抽象过程中使用权重的思想，以进一步压缩数据。因此，本研究的主要贡献在于提供一种基于随机数学和近似coded computing的迭代快速解方法，用于解决分布式计算环境中的线性回授问题。

Control-aware echo state networks (Ca-ESN) for the suppression of extreme events

paper_url: http://arxiv.org/abs/2308.03095
repo_url: None
paper_authors: Alberto Racca, Luca Magri
for: 这篇论文是为了控制无序非线性系统中的极端事件而写的。
methods: 这篇论文使用了控制准确网络（Ca-ESN），让控制策略（如比例-积分-导数控制和模型预测控制）与ESNs融合在一起，以抑制极端事件的发生。
results: 这篇论文在实验中显示了使用Ca-ESN可以将极端事件的发生减少到传统方法的二个阶层，这开启了新的可控非线性系统的可能性。

Abstract
Extreme event are sudden large-amplitude changes in the state or observables of chaotic nonlinear systems, which characterize many scientific phenomena. Because of their violent nature, extreme events typically have adverse consequences, which call for methods to prevent the events from happening. In this work, we introduce the control-aware echo state network (Ca-ESN) to seamlessly combine ESNs and control strategies, such as proportional-integral-derivative and model predictive control, to suppress extreme events. The methodology is showcased on a chaotic-turbulent flow, in which we reduce the occurrence of extreme events with respect to traditional methods by two orders of magnitude. This works opens up new possibilities for the efficient control of nonlinear systems with neural networks.

摘要
非常事件是普遍存在于非线性对称系统中的突然大幅度变化，这些事件 caracterize 许多科学现象。由于它们的暴力性，非常事件通常会带来不良影响，需要控制方法来预防这些事件发生。在这个工作中，我们介绍了控制意识阶层网络（Ca-ESN），它可以联合 ESN 和控制策略，如比例Integral Derivative 控制和模型预测控制，以抑制非常事件。我们在湍流中使用这种方法，可以与传统方法比较，大大降低非常事件的发生频率，具体来说，降低了两个数量级。这个研究开启了非线性系统控制的新可能性。

Visualization of Extremely Sparse Contingency Table by Taxicab Correspondence Analysis: A Case Study of Textual Data

paper_url: http://arxiv.org/abs/2308.03079
repo_url: None
paper_authors: V. Choulakian, J. Allard
for: 这篇论文是用于描述一种Robust Variant of Correspondence Analysis方法，用于可见化EXTREMELY SPARSE ontingency tables。
methods: 这篇论文使用了12+1维度减少方法（t-SNE、UMAP、PHATE等）来Visualize sacred book fragments的文本数据集，该数据集包含590行x8265列。
results: 这篇论文通过使用Robust Variant of Correspondence Analysis方法，可以准确地VisualizeEXTREMELY SPARSE ontingency tables。

Abstract
We present an overview of taxicab correspondence analysis, a robust variant of correspondence analysis, for visualization of extremely sparse ontingency tables. In particular we visualize an extremely sparse textual data set of size 590 by 8265 concerning fragments of 8 sacred books recently introduced by Sah and Fokou\'e (2019) and studied quite in detail by (12 + 1) dimension reduction methods (t-SNE, UMAP, PHATE,...) by Ma, Sun and Zou (2022).

摘要
我们提供了taxicab对应分析的概述，这是对对应分析的一种鲁棒variant，用于可见化极其稀疏的对应表。特别是我们对590行8265列的一个极其稀疏的文本数据集进行可见化，这个数据集是由 Sah和Fokoué（2019）引入的8种圣书的残篇，并由(12+1)维度减少方法（t-SNE、UMAP、PHATE等）进行了详细研究。这个研究是由Ma、Sun和Zou（2022）进行的。

Study for Performance of MobileNetV1 and MobileNetV2 Based on Breast Cancer

paper_url: http://arxiv.org/abs/2308.03076
repo_url: None
paper_authors: Jiuqi Yan
for: 本实验主要目的是研究人工智能在医学领域中的应用，具体来说是比较MobileNetV1和MobileNetV2模型在分类乳腺癌病理图像方面的表现。
methods: 本实验使用了Kaggle上下载的乳腺癌病理图像集进行训练，并对数据集进行 нор化。然后，使用人工智能模型学习下载的数据集，找出图像中的特征并判断乳腺癌是否存在。
results: 实验结果显示，在处理这个数据集时，MobileNetV1表现得更好， validation accuracy和overfit问题在MobileNetV2训练中出现。这表明，在这种情况下，MobileNetV1比MobileNetV2更适合处理乳腺癌病理图像。

Abstract
Artificial intelligence is constantly evolving and can provide effective help in all aspects of people's lives. The experiment is mainly to study the use of artificial intelligence in the field of medicine. The purpose of this experiment was to compare which of MobileNetV1 and MobileNetV2 models was better at detecting histopathological images of the breast downloaded at Kaggle. When the doctor looks at the pathological image, there may be errors that lead to errors in judgment, and the observation speed is slow. Rational use of artificial intelligence can effectively reduce the error of doctor diagnosis in breast cancer judgment and speed up doctor diagnosis. The dataset was downloaded from Kaggle and then normalized. The basic principle of the experiment is to let the neural network model learn the downloaded data set. Then find the pattern and be able to judge on your own whether breast tissue is cancer. In the dataset, benign tumor pictures and malignant tumor pictures have been classified, of which 198738 are benign tumor pictures and 78, 786 are malignant tumor pictures. After calling MobileNetV1 and MobileNetV2, the dataset is trained separately, the training accuracy and validation accuracy rate are obtained, and the image is drawn. It can be observed that MobileNetV1 has better validation accuracy and overfit during MobileNetV2 training. From the experimental results, it can be seen that in the case of processing this dataset, MobileNetV1 is much better than MobileNetV2.

摘要
人工智能不断发展，可以帮助人们在各个方面更有效率。这个实验主要是研究人工智能在医学领域的应用。实验的目的是比较MobileNetV1和MobileNetV2模型在Kaggle上下载的乳腺病理图像中的表现。当医生查看病理图像时，可能存在错误，导致诊断错误和诊断速度慢。合理使用人工智能可以有效减少医生诊断乳腺癌判断中的错误和提高诊断速度。数据集来自Kaggle，然后Normalized。实验的基本原则是让神经网络模型学习下载的数据集，然后找出图像的模式，并能够独立判断乳腺癌。在数据集中，恶性肿瘤图像和癌瘤图像分别被分类，总共有198738个恶性肿瘤图像和78786个癌瘤图像。后来MobileNetV1和MobileNetV2都被调用，数据集被训练 separately，训练精度和验证精度分别获得，并将图像绘制出来。可以看到，在处理这个数据集时，MobileNetV1表现更好。

Comparative Analysis of Epileptic Seizure Prediction: Exploring Diverse Pre-Processing Techniques and Machine Learning Models

paper_url: http://arxiv.org/abs/2308.05176
repo_url: None
paper_authors: Md. Simul Hasan Talukder, Rejwan Bin Sulaiman
for: 预测癫痫症诊断
methods: 使用五种机器学习模型（Random Forest、Decision Tree、Extra Trees、Logistic Regression和Gradient Boosting）对电энцефалографи记录进行预测
results: 研究发现，Extra Trees模型在预测癫痫症中表现最佳，其准确率为99.29%，高于其他模型和先前研究的状态码。

Abstract
Epilepsy is a prevalent neurological disorder characterized by recurrent and unpredictable seizures, necessitating accurate prediction for effective management and patient care. Application of machine learning (ML) on electroencephalogram (EEG) recordings, along with its ability to provide valuable insights into brain activity during seizures, is able to make accurate and robust seizure prediction an indispensable component in relevant studies. In this research, we present a comprehensive comparative analysis of five machine learning models - Random Forest (RF), Decision Tree (DT), Extra Trees (ET), Logistic Regression (LR), and Gradient Boosting (GB) - for the prediction of epileptic seizures using EEG data. The dataset underwent meticulous preprocessing, including cleaning, normalization, outlier handling, and oversampling, ensuring data quality and facilitating accurate model training. These preprocessing techniques played a crucial role in enhancing the models' performance. The results of our analysis demonstrate the performance of each model in terms of accuracy. The LR classifier achieved an accuracy of 56.95%, while GB and DT both attained 97.17% accuracy. RT achieved a higher accuracy of 98.99%, while the ET model exhibited the best performance with an accuracy of 99.29%. Our findings reveal that the ET model outperformed not only the other models in the comparative analysis but also surpassed the state-of-the-art results from previous research. The superior performance of the ET model makes it a compelling choice for accurate and robust epileptic seizure prediction using EEG data.

摘要
“艾滋症是一种常见的神经疾病，具有不可预测的发作和重复的发作，因此需要精准的预测以便有效地管理和照顾病人。在这些研究中，我们展示了五种机器学习模型（Random Forest、Decision Tree、Extra Trees、Logistic Regression和Gradient Boosting）在艾滋症发作预测中的比较分析。这个数据集经过了精益的清洁、调整、异常处理和扩充，以确保数据质量和模型训练的准确性。这些预处理技术在提高模型表现方面扮演了关键的角色。我们的分析结果显示每个模型的精度，LR分类器获得56.95%的精度，而GB和DT均获得97.17%的精度。RT获得98.99%的精度，而ET模型则表现出99.29%的精度。我们的发现表明ET模型不仅在比较分析中表现出色，而且超越了过去研究中的州前成果。ET模型的超越表现使其成为精确和可靠的艾滋症发作预测中的首选。”

TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties

paper_url: http://arxiv.org/abs/2308.03051
repo_url: None
paper_authors: Karima Kadaoui, Samar M. Magdy, Abdul Waheed, Md Tawkat Islam Khondaker, Ahmed Oumar El-Shangiti, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
for: The paper assesses the machine translation proficiencies of large language models (LLMs) like Google Bard and OpenAI ChatGPT across ten varieties of Arabic, including Classical Arabic and several dialectal variants.
methods: The paper evaluates the performance of LLMs in machine translation tasks, using diverse Arabic varieties and a human-centric study to scrutinize the models’ ability to follow human instructions.
results: The paper finds that LLMs exhibit satisfactory performance with more prevalent Arabic dialects, but encounter challenges with certain dialects, such as Algerian and Mauritanian, which have limited public data. Additionally, the paper reveals that Bard has limited ability to align with human instructions in translation contexts.

Abstract
Large language models (LLMs) finetuned to follow human instructions have recently emerged as a breakthrough in AI. Models such as Google Bard and OpenAI ChatGPT, for example, are surprisingly powerful tools for question answering, code debugging, and dialogue generation. Despite the purported multilingual proficiency of these models, their linguistic inclusivity remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic, Modern Standard Arabic, and several nuanced dialectal variants. Furthermore, we undertake a human-centric study to scrutinize the efficacy of the most recent model, Bard, in following human instructions during translation tasks. Our exhaustive analysis indicates that LLMs may encounter challenges with certain Arabic dialects, particularly those for which minimal public data exists, such as Algerian and Mauritanian dialects. However, they exhibit satisfactory performance with more prevalent dialects, albeit occasionally trailing behind established commercial systems like Google Translate. Additionally, our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.

摘要
大型语言模型（LLM），如Google Bard和OpenAI ChatGPT，最近在人工智能领域受到了突破。这些模型在问答、代码调试和对话生成等方面表现出了惊人的能力。然而，这些模型在语言多样性方面的探索仍然不充分。为了解决这个问题，我们对Bard和ChatGPT进行了详细的评估，包括GPT-3.5和GPT-4在内的多种阿拉伯语种。我们的评估覆盖了多种阿拉伯语种，包括古典阿拉伯语、现代标准阿拉伯语以及一些细腻的地方语言变体。此外，我们进行了人类中心的研究，以评估Bard在翻译任务中遵循人类指令的能力。我们的详细分析表明，LLMs可能会在某些阿拉伯语言口语中遇到困难，特别是那些具有少量公共数据的语言，如阿尔及利亚和毛里塔尼亚口语。然而，它们在更常见的口语上表现得更好，尽管 occasionally 落后于商业系统如Google Translate。此外，我们的分析还发现了Bard在翻译任务中遵循人类指令的能力有限。总的来说，我们的发现表明，现有的LLMs仍然远离包容，它们只能部分地适应不同社区的语言和文化特点。

Weakly Supervised Multi-Task Representation Learning for Human Activity Analysis Using Wearables

paper_url: http://arxiv.org/abs/2308.03805
repo_url: None
paper_authors: Taoran Sheng, Manfred Huber
for: 这个论文主要针对的是如何使用弱监督多输出siamesenet对着活动和气象数据进行分类和识别。
methods: 该方法使用弱监督多输出siamesenet，该模型可以将数据映射到多个表示空间中，每个表示空间强调一个特定方面的数据。
results: 经过实验 validate，该模型可以同时解决多个任务，并且在许多情况下可以超越单任务监督方法的性能。此外， paper还进一步分析了模型的架构和多任务内部的效果，以及模型的可扩展性。

Abstract
Sensor data streams from wearable devices and smart environments are widely studied in areas like human activity recognition (HAR), person identification, or health monitoring. However, most of the previous works in activity and sensor stream analysis have been focusing on one aspect of the data, e.g. only recognizing the type of the activity or only identifying the person who performed the activity. We instead propose an approach that uses a weakly supervised multi-output siamese network that learns to map the data into multiple representation spaces, where each representation space focuses on one aspect of the data. The representation vectors of the data samples are positioned in the space such that the data with the same semantic meaning in that aspect are closely located to each other. Therefore, as demonstrated with a set of experiments, the trained model can provide metrics for clustering data based on multiple aspects, allowing it to address multiple tasks simultaneously and even to outperform single task supervised methods in many situations. In addition, further experiments are presented that in more detail analyze the effect of the architecture and of using multiple tasks within this framework, that investigate the scalability of the model to include additional tasks, and that demonstrate the ability of the framework to combine data for which only partial relationship information with respect to the target tasks is available.

摘要
感知数据流从智能设备和智能环境中获取，广泛研究在人动作识别（HAR）、人员识别和健康监测等领域。然而，大多数前一些研究在活动和感知流分析中都是专注一个方面的数据，例如只Recognize the type of activity或只是识别执行活动的人。我们提议一种使用弱监督多输出siamesenet，将数据映射到多个表示空间，每个表示空间都关注一个数据方面。数据样本的表示 вектор在空间中位置，使得同Semantic meaning的数据在该方面都很接近。因此，通过实验证明，训练模型可以为多个任务提供分 clustering metrics，使其能同时解决多个任务，甚至在许多情况下超越单任务超级vised方法。此外，进一步的实验还分析了architecture的影响和多个任务在这种框架中的使用情况，证明模型可以扩展到包括更多任务，并且可以将部分相关信息的数据组合成一起进行分析。

Machine learning methods for the search for L&T brown dwarfs in the data of modern sky surveys

paper_url: http://arxiv.org/abs/2308.03045
repo_url: https://github.com/iamaleksandra/ml-brown-dwarfs
paper_authors: Aleksandra Avdeeva
for: 这篇论文的目的是为了开发一种基于机器学习方法的方法，用于分类鲸鱼级和橙级恒星的不同类型。
methods: 这篇论文使用了Random Forest Classifier、XGBoost、SVM Classifier和TabNet等机器学习算法，对PanStarrs DR1、2MASS和WISE数据进行分析，以分类鲸鱼级和橙级恒星。
results: 这篇论文的结果表明，使用机器学习方法可以准确地分类鲸鱼级和橙级恒星，并且比传统的决策规则更有效和相关。

Abstract
According to various estimates, brown dwarfs (BD) should account for up to 25 percent of all objects in the Galaxy. However, few of them are discovered and well-studied, both individually and as a population. Homogeneous and complete samples of brown dwarfs are needed for these kinds of studies. Due to their weakness, spectral studies of brown dwarfs are rather laborious. For this reason, creating a significant reliable sample of brown dwarfs, confirmed by spectroscopic observations, seems unattainable at the moment. Numerous attempts have been made to search for and create a set of brown dwarfs using their colours as a decision rule applied to a vast amount of survey data. In this work, we use machine learning methods such as Random Forest Classifier, XGBoost, SVM Classifier and TabNet on PanStarrs DR1, 2MASS and WISE data to distinguish L and T brown dwarfs from objects of other spectral and luminosity classes. The explanation of the models is discussed. We also compare our models with classical decision rules, proving their efficiency and relevance.

摘要
根据不同的估计，棕矮星（BD）应该占到 галаxy中的25%以上的 объект数。然而，它们的发现和研究很少，个体和人口级别都很少。 homogeneous和完整的褐矮星样本是需要的，以便进行这些类型的研究。由于它们的弱点，褐矮星的spectral studies很困难。因此，创建一个可靠的褐矮星样本，通过spectroscopic observations确认，目前看起来不可能。许多人已经尝试过使用颜色作为决策规则，对大量的survey数据进行搜索和建立褐矮星样本。在这种工作中，我们使用机器学习方法，如Random Forest Classifier、XGBoost、SVM Classifier和TabNet，在PanStarrs DR1、2MASS和WISE数据上分类L和T褐矮星和其他 spectral和 luminosity 类型的 объек数。我们还讲述了模型的解释。我们还比较了我们的模型与经典的决策规则，证明它们的高效和 relevance。

Machine Learning for Infectious Disease Risk Prediction: A Survey

paper_url: http://arxiv.org/abs/2308.03037
repo_url: None
paper_authors: Mutong Liu, Yang Liu, Jiming Liu
for: 这篇论文主要是为了探讨机器学习如何在抑制传染疾病方面发挥作用，以帮助更好地预测传染疾病风险。
methods: 这篇论文使用了不同的机器学习模型来预测传染疾病风险，包括统计预测、数据驱动机器学习和epidemiology-inspired机器学习。
results: 论文结果表明，机器学习可以帮助量化疾病传播模式，并准确预测传染疾病风险。但是，在使用机器学习模型时，需要注意输入数据的问题、设计任务目标和评估模型性能等问题。

Abstract
Infectious diseases, either emerging or long-lasting, place numerous people at risk and bring heavy public health burdens worldwide. In the process against infectious diseases, predicting the epidemic risk by modeling the disease transmission plays an essential role in assisting with preventing and controlling disease transmission in a more effective way. In this paper, we systematically describe how machine learning can play an essential role in quantitatively characterizing disease transmission patterns and accurately predicting infectious disease risks. First, we introduce the background and motivation of using machine learning for infectious disease risk prediction. Next, we describe the development and components of various machine learning models for infectious disease risk prediction. Specifically, existing models fall into three categories: Statistical prediction, data-driven machine learning, and epidemiology-inspired machine learning. Subsequently, we discuss challenges encountered when dealing with model inputs, designing task-oriented objectives, and conducting performance evaluation. Finally, we conclude with a discussion of open questions and future directions.

摘要
免疫疾病，无论是新兴的或长期存在的，对全球公共健康带来沉重负担，数百万人受到威胁。在抗击免疫疾病的过程中，预测疾病传播风险的模型化帮助更有效地预测和控制疾病传播。在这篇文章中，我们系统地描述了机器学习如何在免疫疾病风险预测中发挥重要作用。首先，我们介绍了使用机器学习预测免疫疾病风险的背景和动机。然后，我们描述了不同类型的机器学习模型的开发和组成部分。 especifically，现有模型可以分为三类：统计预测、数据驱动机器学习和epidemiology-inspired机器学习。接着，我们讨论了与模型输入、设计任务指向 objective 和性能评估时遇到的挑战。最后，我们 conclude with 未解决的问题和未来方向。

Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining

paper_url: http://arxiv.org/abs/2308.03035
repo_url: https://github.com/xidongwu/d-auprc
paper_authors: Xidong Wu, Zhengmian Hu, Jian Pei, Heng Huang
for: 这个论文主要针对多方合作训练（分布式学习和联合学习）在巨量数据挑战中提供解决方案。
methods: 本文提出了一种新的服务器less多方合作AUPRC最大化问题，并将其改编为服务器less多方合作学习中的conditional随机优化问题。furthermore, the authors propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC, and also propose a variance reduction technique called ServerLess biAsed sTochastic gradiEnt with Momentum-based variance reduction (SLATE-M) algorithm to improve the convergence rate.
results: 本文的实验结果表明，SLATE-M算法可以在多方合作学习 Setting中实现更高的AUPRC最大化，并且与单机端Online方法的最优性较高。此外，SLATE-M算法还可以降低了通信成本，提高了计算效率。

Abstract
Multi-party collaborative training, such as distributed learning and federated learning, is used to address the big data challenges. However, traditional multi-party collaborative training algorithms were mainly designed for balanced data mining tasks and are intended to optimize accuracy (\emph{e.g.}, cross-entropy). The data distribution in many real-world applications is skewed and classifiers, which are trained to improve accuracy, perform poorly when applied to imbalanced data tasks since models could be significantly biased toward the primary class. Therefore, the Area Under Precision-Recall Curve (AUPRC) was introduced as an effective metric. Although single-machine AUPRC maximization methods have been designed, multi-party collaborative algorithm has never been studied. The change from the single-machine to the multi-party setting poses critical challenges. To address the above challenge, we study the serverless multi-party collaborative AUPRC maximization problem since serverless multi-party collaborative training can cut down the communications cost by avoiding the server node bottleneck, and reformulate it as a conditional stochastic optimization problem in a serverless multi-party collaborative learning setting and propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC. After that, we use the variance reduction technique and propose ServerLess biAsed sTochastic gradiEnt with Momentum-based variance reduction (SLATE-M) algorithm to improve the convergence rate, which matches the best theoretical convergence result reached by the single-machine online method. To the best of our knowledge, this is the first work to solve the multi-party collaborative AUPRC maximization problem.

摘要
多方合作训练，如分布式学习和联邦学习，用于解决大数据挑战。然而，传统的多方合作训练算法主要是为了均衡数据挖掘任务而设计，并且是为了提高准确率（例如，交叉熵）。然而，在多数实际应用中，数据分布是偏斜的，因此用于偏斜数据任务的分类器可能会在应用于不均衡数据任务时表现差。因此， Area Under Precision-Recall Curve（AUPRC）被引入为有效的指标。虽然单机AUPRC最大化方法已经设计，但多方合作算法尚未被研究。在这种多机到多机的转换中，存在关键挑战。为解决以上挑战，我们研究了无服务器多方合作AUPRC最大化问题，因为无服务器多方合作训练可以避免服务器瓶颈，从而减少通信成本。然后，我们将问题重新定义为无服务器多方合作学习中的Conditional Stochastic Optimization问题，并提出了一个新的ServerLess biAsed sTochastic gradiEnt（SLATE）算法，以直接优化AUPRC。之后，我们使用了偏移量降低技术，并提出了ServerLess biAsed sTochastic gradiEnt with Momentum-based variance reduction（SLATE-M）算法，以提高收敛率，与单机在线方法的最佳理论收敛率相匹配。到目前为止，这是首次解决多方合作AUPRC最大化问题的研究。

Causal Disentanglement Hidden Markov Model for Fault Diagnosis

paper_url: http://arxiv.org/abs/2308.03027
repo_url: None
paper_authors: Rihao Chang, Yongtao Ma, Weizhi Nie, Jie Nie, An-an Liu
for: 这 paper 的目的是提出一种基于 causal disentanglement hidden Markov model (CDHM) 的 fault diagnosis方法，以实现更加准确的维修预测。
methods: 该方法使用了时间序列数据，逐步分解震动信号为相关 fault 和无关 fault 因素，并使用 ELBO 优化学习 causal disentanglement Markov model。此外，该方法还采用了无监督领域适应，将学习的分解表示转移到其他工作环境中。
results: 实验结果表明，提出的方法能够在 CWRU 数据集和 IMS 数据集上提供更高的预测精度和维修效率，证明了该方法的优势。

Abstract
In modern industries, fault diagnosis has been widely applied with the goal of realizing predictive maintenance. The key issue for the fault diagnosis system is to extract representative characteristics of the fault signal and then accurately predict the fault type. In this paper, we propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism and thus, capture their characteristics to achieve a more robust representation. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. The ELBO is reformulated to optimize the learning of the causal disentanglement Markov model. Moreover, to expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments. Experiments were conducted on the CWRU dataset and IMS dataset. Relevant results validate the superiority of the proposed method.

摘要
现代产业中，FAULT诊断已广泛应用，目的是实现预测维护。FAULT诊断系统的关键问题是提取FAULT信号的代表特征，然后准确预测FAULT类型。在这篇论文中，我们提出了 causal Disentanglement Hidden Markov Model（CDHM），用于学习滤波器FAULT机理的 causality，并从而捕捉其特征，以实现更加稳定的表示。specifically，我们利用时间序列数据，逐步分离振荡信号为FAULT相关和FAULT无关的因素。ELBO被重新表述，以便优化学习 causal Disentanglement Markov model。此外，为扩展应用范围，我们采用了无监督领域适应，将学习的分离表示转移到其他工作环境中。在CWRU数据集和IMS数据集上进行了实验，结果证明了我们提出的方法的优越性。

Early Detection and Localization of Pancreatic Cancer by Label-Free Tumor Synthesis

paper_url: http://arxiv.org/abs/2308.03008
repo_url: https://github.com/mrgiovanni/synthetictumors
paper_authors: Bowen Li, Yu-Cheng Chou, Shuwen Sun, Hualin Qiao, Alan Yuille, Zongwei Zhou
for: 早期检测和诊断肝脏癌可以提高病人5年生存率从8.5%提高到20%.
methods: 我们提出了一种使用人工智能（AI）帮助放射科医生早期检测肝脏癌的方法，但是训练AI模型需要大量的标注示例，但现有CT扫描器获得早期癌病变的示例受限。
results: 我们的实验表明，使用我们提出的合成癌病变方法训练AI模型，对肝脏癌的检测率（敏感性和特异性）和每个块分割性能都达到了与实际癌病变示例相当的水平。此外，我们的方法还显示在小癌病变检测方面更高的检测率。

Abstract
Early detection and localization of pancreatic cancer can increase the 5-year survival rate for patients from 8.5% to 20%. Artificial intelligence (AI) can potentially assist radiologists in detecting pancreatic tumors at an early stage. Training AI models require a vast number of annotated examples, but the availability of CT scans obtaining early-stage tumors is constrained. This is because early-stage tumors may not cause any symptoms, which can delay detection, and the tumors are relatively small and may be almost invisible to human eyes on CT scans. To address this issue, we develop a tumor synthesis method that can synthesize enormous examples of small pancreatic tumors in the healthy pancreas without the need for manual annotation. Our experiments demonstrate that the overall detection rate of pancreatic tumors, measured by Sensitivity and Specificity, achieved by AI trained on synthetic tumors is comparable to that of real tumors. More importantly, our method shows a much higher detection rate for small tumors. We further investigate the per-voxel segmentation performance of pancreatic tumors if AI is trained on a combination of CT scans with synthetic tumors and CT scans with annotated large tumors at an advanced stage. Finally, we show that synthetic tumors improve AI generalizability in tumor detection and localization when processing CT scans from different hospitals. Overall, our proposed tumor synthesis method has immense potential to improve the early detection of pancreatic cancer, leading to better patient outcomes.

摘要
早期发现和确定胰腺癌的患者5年生存率可以从8.5%提高到20%.人工智能（AI）可能能够帮助放射学家早期发现胰腺肿瘤。训练AI模型需要巨量的标注示例，但获得早期阶段肿瘤的CT扫描数据受限。这是因为早期阶段的肿瘤可能没有任何症状，这会延迟发现，同时肿瘤也很小，可能对人类眼不可见在CT扫描中。为解决这个问题，我们开发了一种肿瘤合成方法，可以在健康胰腺中合成巨量的小胰腺肿瘤示例，无需人工标注。我们的实验表明，使用合成肿瘤来训练AI的总检测率（敏感度和特异度）和小肿瘤检测率均与实际肿瘤相当。此外，我们还发现，将合成肿瘤与已知大肿瘤的CT扫描数据组合训练AI，可以提高每个voxel的分 segmentation性能。最后，我们证明了合成肿瘤可以提高AI在不同医院CT扫描数据处理中的普适性。总之，我们提出的肿瘤合成方法具有巨大的潜力，可以提高胰腺癌的早期发现，从而提高病人生存率。

Deep Polar Codes

paper_url: http://arxiv.org/abs/2308.03004
repo_url: https://github.com/HzFu/MNet_DeepCDR
paper_authors: Geon Choi, Namyoon Lee
for: 这paper是为了提出一种新的预变扭转码，即深度扭转码。
methods: 这paper使用了一种多层扭转变换的深度编码器，以及一种低复杂度的解码算法Successive Cancellation List with Backpropagation Parity Checks (SCL-BPC)。
results: simulations表明，深度扭转码在不同的编码率下，对块错误率具有较好的性能，并且可以保持低的编码和解码复杂度。此外，这paper还证明了将深度扭转码 concatenate avec cyclic-redundancy-check codes可以实现finite block length capacity的meta-converse bound within 0.4 dB。

Abstract
In this paper, we introduce a novel class of pre-transformed polar codes, termed as deep polar codes. We first present a deep polar encoder that harnesses a series of multi-layered polar transformations with varying sizes. Our approach to encoding enables a low-complexity implementation while significantly enhancing the weight distribution of the code. Moreover, our encoding method offers flexibility in rate-profiling, embracing a wide range of code rates and blocklengths. Next, we put forth a low-complexity decoding algorithm called successive cancellation list with backpropagation parity checks (SCL-BPC). This decoding algorithm leverages the parity check equations in the reverse process of the multi-layered pre-transformed encoding for SCL decoding. Additionally, we present a low-latency decoding algorithm that employs parallel-SCL decoding by treating partially pre-transformed bit patterns as additional frozen bits. Through simulations, we demonstrate that deep polar codes outperform existing pre-transformed polar codes in terms of block error rates across various code rates under short block lengths, while maintaining low encoding and decoding complexity. Furthermore, we show that concatenating deep polar codes with cyclic-redundancy-check codes can achieve the meta-converse bound of the finite block length capacity within 0.4 dB in some instances.

摘要
在这篇论文中，我们介绍了一种新的预变扩展极码，称为深度极码。我们首先描述了一种深度极码编码器，利用多层极化转换来实现低复杂性实现，同时有效地改善极码的重量分布。此外，我们的编码方法支持范围广适的代码速率和块长度。接下来，我们提出了一种低复杂度解码算法，称为顺序取消列表归并后传递检查（SCL-BPC）。这种解码算法利用了反向的多层预变扩展编码中的严格检查方程，并且可以在低复杂度下实现高效的解码。此外，我们还提出了一种低延迟解码算法，通过并行执行SCL解码来处理部分预变扩展的位 Pattern。通过实验，我们证明了深度极码在不同代码速率下的块错误率较低，同时保持低编码和解码复杂度。此外，我们还显示了将深度极码与循环检查码 concatenate 可以实现finite block length capacity的meta-converse bound within 0.4 dB 的情况。

Spanish Pre-trained BERT Model and Evaluation Data

paper_url: http://arxiv.org/abs/2308.02976
repo_url: https://github.com/dccuchile/beto
paper_authors: José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez
for: bridging the gap of Spanish language resources for training and evaluating Spanish language models.
methods: using BERT-based language model pre-trained exclusively on Spanish data, and compiling several tasks specifically for the Spanish language in a single repository.
results: fine-tuning the pre-trained Spanish model achieves better results compared to other BERT-based models pre-trained on multilingual corpora for most tasks, and achieves a new state-of-the-art on some tasks.Here’s the format you requested:
for: bridging the gap of Spanish language resources
methods: using BERT-based language model pre-trained exclusively on Spanish data, and compiling several tasks specifically for the Spanish language in a single repository
results: fine-tuning the pre-trained Spanish model achieves better results compared to other BERT-based models pre-trained on multilingual corpora for most tasks, and achieves a new state-of-the-art on some tasks.

Abstract
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

摘要
西班牙语是全球前5大常用语言之一，但找到用于训练或评估西班牙语模型的资源并不容易。在这篇论文中，我们帮助填补这个差距，提出了基于BERT的西班牙语语言模型，并且在单个存储中集成了许多西班牙语任务。经过练练后的西班牙语模型，在大多数任务上比其他基于多语言Corpus预训练的BERT模型获得更好的结果，甚至达到了一些任务的新状态。我们将公开发布我们的模型、预训练数据和西班牙语benchmark集。

Generalized Oversampling for Learning from Imbalanced datasets and Associated Theory

paper_url: http://arxiv.org/abs/2308.02966
repo_url: None
paper_authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier
for: 这个研究旨在解决受扰数据集的问题，尤其是预测任务中的欠资料问题。
methods: 本研究提出了一个基于核密度估计的数据增强方法，称为GOLIATH算法，可以应用于预测和回归任务。这个方法包括两大家族的人工增加：基于扰动，如 Gaussian Noise，和基于插值，如 SMOTE。它还提供了这些机器学习算法的明确形式和条件密度表达，特别是SMOTE。
results: 本研究评估了GOLIATH算法在受扰数据集中的性能，并与现有的State-of-the-art技术进行比较。结果显示，GOLIATH算法在受扰数据集中具有显著的改善。

Abstract
In supervised learning, it is quite frequent to be confronted with real imbalanced datasets. This situation leads to a learning difficulty for standard algorithms. Research and solutions in imbalanced learning have mainly focused on classification tasks. Despite its importance, very few solutions exist for imbalanced regression. In this paper, we propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates which can be used in classification and regression. This general approach encompasses two large families of synthetic oversampling: those based on perturbations, such as Gaussian Noise, and those based on interpolations, such as SMOTE. It also provides an explicit form of these machine learning algorithms and an expression of their conditional densities, in particular for SMOTE. New synthetic data generators are deduced. We apply GOLIATH in imbalanced regression combining such generator procedures with a wild-bootstrap resampling technique for the target values. We evaluate the performance of the GOLIATH algorithm in imbalanced regression situations. We empirically evaluate and compare our approach and demonstrate significant improvement over existing state-of-the-art techniques.

摘要
在超级vised学习中，很常遇到实际上的不均衡数据集。这种情况会导致标准算法学习困难。研究和解决不均衡学习问题的研究主要集中在分类任务上。尽管它的重要性，实际上很少解决不均衡回归问题的解决方案存在。在这篇论文中，我们提出了一种数据扩充过程，named GOLIATH algorithm，基于kernel density estimates，可以在分类和回归中使用。这种通用的方法包括两大家族的人工增加：基于扰动，如 Gaussian Noise，和基于 interpolations，如 SMOTE。它还提供了这些机器学习算法的明确形式，特别是SMOTE的表达。新的人工数据生成器被推导出来。我们在不均衡回归中结合了这些生成器过程和野生bootstrap抽样技术来针对目标值。我们对GOLIATH算法在不均衡回归情况下的性能进行了实验性评估和比较，并证明了我们的方法在现有状态的技术上具有显著的改善。

Data Fusion for Multi-Task Learning of Building Extraction and Height Estimation

paper_url: http://arxiv.org/abs/2308.02960
repo_url: https://github.com/SaadAhmedJamal/IEEE_DFC2023
paper_authors: Saad Ahmed Jamal, Arioluwa Aribisala
for: 这篇论文是为了解决城市重建问题，使用 optic 和 radar 卫星影像进行多任务学习，实现建筑物抽出和高度估计。
methods: 本论文使用多任务学习方法，将 optic 和 radar 卫星影像融合，实现建筑物抽出和高度估计。
results: 根据设计实验结果，本论文的基准结果显著提高了建筑物抽出和高度估计的精度。

Abstract
In accordance with the urban reconstruction problem proposed by the DFC23 Track 2 Contest, this paper attempts a multitask-learning method of building extraction and height estimation using both optical and radar satellite imagery. Contrary to the initial goal of multitask learning which could potentially give a superior solution by reusing features and forming implicit constraints between multiple tasks, this paper reports the individual implementation of the building extraction and height estimation under constraints. The baseline results for the building extraction and the height estimation significantly increased after designed experiments.

摘要
根据DFC23 Track 2 contest提出的城市重建问题，本文提出了一种多任务学习方法，使用光学和雷达卫星影像进行建筑物提取和高度估计。与初始目标的多任务学习不同，本文对每个任务进行独立实现，并在限制下进行建筑物提取和高度估计。经过设计实验，基准结果显著提高。Note: "多任务学习" in Chinese is "多任务学习" (duō zhòng zhì xué xí), and "光学" and "雷达" are both adjectives, so they are not translated.

K-band: Self-supervised MRI Reconstruction via Stochastic Gradient Descent over K-space Subsets

paper_url: http://arxiv.org/abs/2308.02958
repo_url: https://github.com/mikgroup/k-band
paper_authors: Frederic Wang, Han Qi, Alfredo De Goyeneche, Reinhard Heckel, Michael Lustig, Efrat Shimron
For: This paper aims to develop a novel mathematical framework for training deep learning (DL) models using only partial, limited-resolution k-space data in high-dimensional magnetic resonance imaging (MRI).* Methods: The proposed method, called k-band, uses stochastic gradient descent (SGD) over k-space subsets, where only a small k-space portion is used in each training iteration to compute gradients. The method is compatible with different sampling strategies, and the authors demonstrate its effectiveness using k-space “bands” with limited resolution in one dimension.* Results: The authors prove analytically that their method stochastically approximates the gradients computed in a fully-supervised setup, as long as two conditions are met: (i) the limited-resolution axis is chosen randomly-uniformly for every new scan, and (ii) the loss function is weighed with a mask that facilitates accurate reconstruction of high-resolution details. Numerical experiments with raw MRI data show that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art methods trained on high-resolution data.

Abstract
Although deep learning (DL) methods are powerful for solving inverse problems, their reliance on high-quality training data is a major hurdle. This is significant in high-dimensional (dynamic/volumetric) magnetic resonance imaging (MRI), where acquisition of high-resolution fully sampled k-space data is impractical. We introduce a novel mathematical framework, dubbed k-band, that enables training DL models using only partial, limited-resolution k-space data. Specifically, we introduce training with stochastic gradient descent (SGD) over k-space subsets. In each training iteration, rather than using the fully sampled k-space for computing gradients, we use only a small k-space portion. This concept is compatible with different sampling strategies; here we demonstrate the method for k-space "bands", which have limited resolution in one dimension and can hence be acquired rapidly. We prove analytically that our method stochastically approximates the gradients computed in a fully-supervised setup, when two simple conditions are met: (i) the limited-resolution axis is chosen randomly-uniformly for every new scan, hence k-space is fully covered across the entire training set, and (ii) the loss function is weighed with a mask, derived here analytically, which facilitates accurate reconstruction of high-resolution details. Numerical experiments with raw MRI data indicate that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art (SoTA) methods trained on high-resolution data. k-band hence obtains SoTA performance, with the advantage of training using only limited-resolution data. This work hence introduces a practical, easy-to-implement, self-supervised training framework, which involves fast acquisition and self-supervised reconstruction and offers theoretical guarantees.

摘要
although deep learning (DL) methods are powerful for solving inverse problems, their reliance on high-quality training data is a major hurdle. This is significant in high-dimensional (dynamic/volumetric) magnetic resonance imaging (MRI), where acquiring high-resolution fully sampled k-space data is impractical. We introduce a novel mathematical framework, dubbed k-band, that enables training DL models using only partial, limited-resolution k-space data. Specifically, we introduce training with stochastic gradient descent (SGD) over k-space subsets. In each training iteration, rather than using the fully sampled k-space for computing gradients, we only use a small k-space portion. This concept is compatible with different sampling strategies; here we demonstrate the method for k-space "bands", which have limited resolution in one dimension and can hence be acquired rapidly. We prove analytically that our method stochastically approximates the gradients computed in a fully-supervised setup, when two simple conditions are met: (i) the limited-resolution axis is chosen randomly and uniformly for every new scan, hence k-space is fully covered across the entire training set, and (ii) the loss function is weighed with a mask, derived here analytically, which facilitates accurate reconstruction of high-resolution details. Numerical experiments with raw MRI data indicate that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art (SoTA) methods trained on high-resolution data. k-band hence obtains SoTA performance, with the advantage of training using only limited-resolution data. This work hence introduces a practical, easy-to-implement, self-supervised training framework, which involves fast acquisition and self-supervised reconstruction and offers theoretical guarantees.Here's the translation in Traditional Chinese:虽然深度学习（DL）方法有着解决反射问题的力量，但它们对高品质训练数据的依赖是一个主要的阻碍。这特别是在高维度（动态/体积）磁共振成像（MRI）中，其中高分辨率完整探测空间数据的取得是不实际的。我们提出了一个新的数学框架，名为k-band，可以使用仅有部分、有限分辨率的k-空间数据进行深度学习模型的训练。具体来说，我们引入了使用测出梯度的测出梯度运算，并在每个训练迭代中仅使用k-空间中的一小部分。这个概念可以与不同的采样策略相容，在这里我们示例了使用k-空间"带"，它们在一个维度上有限的分辨率，并可以快速地取得。我们 analytically prove了我们的方法可以随机地近似fully-supervised setup中的梯度，只要两个简单的条件是满足的：（i）在每个新的扫描中，Randomly and uniformly chooses the limited-resolution axis，使得k-space是全面覆盖整个训练集，并（ii）使用一个 analytically derived 的mask，来减轻高分辨率细节的重建。numero Experiments with raw MRI data indicate that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art (SoTA) methods trained on high-resolution data. k-band hence obtains SoTA performance, with the advantage of training using only limited-resolution data. This work hence introduces a practical, easy-to-implement, self-supervised training framework, which involves fast acquisition and self-supervised reconstruction and offers theoretical guarantees.

An Empirical Study of AI-based Smart Contract Creation

paper_url: http://arxiv.org/abs/2308.02955
repo_url: None
paper_authors: Rabimba Karanjai, Edward Li, Lei Xu, Weidong Shi
for: 本研究旨在评估大语言模型（LLM）如ChatGPT和Google Palm2在智能合约生成方面的可能性。
methods: 本研究使用了LLMs对智能合约的生成，并评估了生成代码的质量。
results: 研究发现，通过LLMs生成的合约存在安全漏洞，且代码质量和正确性受到输入参数质量的影响。但是，还有一些可能的改进方向。

Abstract
The introduction of large language models (LLMs) like ChatGPT and Google Palm2 for smart contract generation seems to be the first well-established instance of an AI pair programmer. LLMs have access to a large number of open-source smart contracts, enabling them to utilize more extensive code in Solidity than other code generation tools. Although the initial and informal assessments of LLMs for smart contract generation are promising, a systematic evaluation is needed to explore the limits and benefits of these models. The main objective of this study is to assess the quality of generated code provided by LLMs for smart contracts. We also aim to evaluate the impact of the quality and variety of input parameters fed to LLMs. To achieve this aim, we created an experimental setup for evaluating the generated code in terms of validity, correctness, and efficiency. Our study finds crucial evidence of security bugs getting introduced in the generated smart contracts as well as the overall quality and correctness of the code getting impacted. However, we also identified the areas where it can be improved. The paper also proposes several potential research directions to improve the process, quality and safety of generated smart contract codes.

摘要
大型语言模型（LLMs）如ChatGPT和Google Palm2的出现在智能合约生成中似乎是首次成功的AI对应程式设计。LLMs有 Access to a large number of open-source smart contracts， allowing them to use more extensive code in Solidity than other code generation tools. Although the initial and informal assessments of LLMs for smart contract generation are promising, a systematic evaluation is needed to explore the limits and benefits of these models. 本研究的主要目的是评估由LLMs生成的智能合约程式码的质量。我们还想评估对LLMs输入参数的影响，以及生成的程式码的安全性和正确性。为了实现这个目标，我们创建了一个实验setup来评估生成的程式码，包括验证、正确性和效率。我们的研究发现，由LLMs生成的智能合约程式码中有许多安全漏洞，并且程式码的全面性和正确性受到影响。但我们也发现了改进这些模型的可能性。本研究的结论提出了多个可能的研究方向，以改善生成的程式码质量、安全性和可靠性。

dPASP: A Comprehensive Differentiable Probabilistic Answer Set Programming Environment For Neurosymbolic Learning and Reasoning

paper_url: http://arxiv.org/abs/2308.02944
repo_url: None
paper_authors: Renato Lui Geh, Jonas Gonçalves, Igor Cataneo Silveira, Denis Deratani Mauá, Fabio Gagliardi Cozman
for: 本文描述了一种新的宣言式概率逻辑编程框架dPASP，用于不同推理和统计知识的结合。
methods: 本文使用了逻辑约束、间值概率选择和神经预测来表示不确定、矛盾、不完整和统计知识。
results: 本文介绍了一种可以执行推理和学习的实现包，包括一些示例程序，并讨论了在不同语义下的梯度下降学习。

Abstract
We present dPASP, a novel declarative probabilistic logic programming framework for differentiable neuro-symbolic reasoning. The framework allows for the specification of discrete probabilistic models with neural predicates, logic constraints and interval-valued probabilistic choices, thus supporting models that combine low-level perception (images, texts, etc), common-sense reasoning, and (vague) statistical knowledge. To support all such features, we discuss the several semantics for probabilistic logic programs that can express nondeterministic, contradictory, incomplete and/or statistical knowledge. We also discuss how gradient-based learning can be performed with neural predicates and probabilistic choices under selected semantics. We then describe an implemented package that supports inference and learning in the language, along with several example programs. The package requires minimal user knowledge of deep learning system's inner workings, while allowing end-to-end training of rather sophisticated models and loss functions.

摘要
我们介绍了dpasp，一种新的声明型概率逻辑编程框架，用于可微分神经符号逻辑推理。这个框架允许用户指定混合低级感知（图像、文本等）、通用理智、混乱统计知识的概率模型。为支持这些特点，我们讨论了几种概率逻辑程序的 semantics，可以表达非束缚、矛盾、不完整和/或统计知识。我们还讨论了如何在选定 semantics 下使用神经 predicate 和概率选择来进行梯度基于学习。然后，我们描述了一个实现的包，包括推理和学习语言中的许多示例程序。这个包需要最少的用户知识 deep learning 系统的内部工作，同时允许用户实现较复杂的模型和损失函数的整体训练。

Towards the Development of an Uncertainty Quantification Protocol for the Natural Gas Industry

paper_url: http://arxiv.org/abs/2308.02941
repo_url: None
paper_authors: Babajide Kolade
for: 这个论文的目的是为了开发一种用于评估机器学习和机理模型预测结果的不确定性评估协议。
methods: 这个论文使用了机器学习模型和机理模型来进行预测，并使用了不确定性评估协议来评估模型的可靠性。
results: 该论文通过应用不确定性评估协议来评估机器学习和机理模型的预测结果的不确定性，并提供了一些可靠性评估的方法和技术。

Abstract
Simulations using machine learning (ML) models and mechanistic models are often run to inform decision-making processes. Uncertainty estimates of simulation results are critical to the decision-making process because simulation results of specific scenarios may have wide, but unspecified, confidence bounds that may impact subsequent analyses and decisions. The objective of this work is to develop a protocol to assess uncertainties in predictions of machine learning and mechanistic simulation models. The protocol will outline an uncertainty quantification workflow that may be used to establish credible bounds of predictability on computed quantities of interest and to assess model sufficiency. The protocol identifies key sources of uncertainties in machine learning and mechanistic modeling, defines applicable methods of uncertainty propagation for these sources, and includes statistically rational estimators for output uncertainties. The work applies the protocol to test cases relevant to the gas distribution industry and presents learnings from its application. The paper concludes with a brief discussion outlining a pathway to the wider adoption of uncertainty quantification within the industry

摘要
模拟使用机器学习（ML）模型和机理模型经常用于决策过程中，以便更好地了解不同情况下的结果。模拟结果中的uncertainty estimate是决策过程中非常重要的，因为模拟结果的特定情况可能有很宽，但不具体的信任范围，这可能会影响后续分析和决策。本工作的目标是开发一个协议，用于评估机器学习和机理模型预测结果中的不确定性。协议将 outline一个不确定性评估工作流程，可以用来确定计算量表达的可靠范围，并评估模型的充分性。协议将列出机器学习和机理模型中的主要不确定性源泉，并定义这些源泉上适用的不确定性传播方法，并包括输出不确定性的统计合理估计。本工作将应用协议到与天然气分布业有关的测试 случа例，并显示其应用的教训。文章结束于对在业界更广泛采用不确定性评估的道路的简要讨论。

Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

paper_url: http://arxiv.org/abs/2308.02926
repo_url: https://github.com/Haoxiang-WasedaU/Towards-Consistency-Filtering-Free-Unsupervised-Learning-for-Dense-Retrieval
paper_authors: Haoxiang Shi, Sumio Fujita, Tetsuya Sakai
for: overcome domain transfer challenge in modern neural Information Retrieval (IR)
methods: replace consistency filter with direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods
results: TextRank-based pseudo relevance feedback outperforms other methods, and filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance.Here’s the full text in Simplified Chinese:
for: 本研究旨在解决现代神经信息检索（IR）中的域转换问题。
methods: 我们提议取代了统计过程的一致性筛选器，使用直接 Pseudo-labeling、Pseudo- relevance feedback 或无监督关键词生成方法。
results: 我们的广泛的实验证明，TextRank-based Pseudo relevance feedback 方法在其他方法之上表现更出色，并且 filtering-free 无监督学习可以不断提高训练和推理效率，同时保持检索性能。

Abstract
Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters are computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods for achieving consistent filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo relevance feedback outperforms other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance based on particular datasets.

摘要
域名转移是现代神经信息检索（IR）中的一大挑战。以前的研究使用域名特定的手动标注和生成的域名特定排序器来训练一个通用排序器，以便在目标域名中提高检索性能。然而，训练这些一致性筛选器是计算机Expensive，这会significantly reduces the model efficiency。另外，一致性筛选器经常难以识别检索目的和查询和文献库的分布。在这种研究中，我们评估了一种更有效的解决方案：取代一致性筛选器，使用直接 Pseudo-labeling、pseudo relevance feedback 或无监督关键词生成方法来实现一致性自由无监督检索。我们进行了广泛的实验评估，结果表明，使用 TextRank 基于 Pseudo relevance feedback 的方法在 average 上超过其他方法。此外，我们还分析了提议的训练和执行效率。结果表明，无监督自由学习可以不断提高训练和执行效率，同时保持检索性能。在某些情况下，它甚至可以提高基于特定数据集的性能。

An AI-Enabled Framework to Defend Ingenious MDT-based Attacks on the Emerging Zero Touch Cellular Networks

paper_url: http://arxiv.org/abs/2308.02923
repo_url: None
paper_authors: Aneeqa Ijaz, Waseem Raza, Hasan Farooq, Marvin Manalastas, Ali Imran
for: This paper aims to address the security threats in deeply automated wireless networks and IoT devices, specifically the vulnerability of MDT reports to adversarial attacks.
methods: The paper proposes a novel Malicious MDT Reports Identification framework (MRIF) using Machine Learning to detect and eliminate malicious MDT reports, and verifies its effectiveness through a use-case.
results: The paper highlights the detrimental repercussions of adversarial attacks on MDT reports on the performance of common network automation functions, and proposes a countermeasure to defend against such attacks.

Abstract
Deep automation provided by self-organizing network (SON) features and their emerging variants such as zero touch automation solutions is a key enabler for increasingly dense wireless networks and pervasive Internet of Things (IoT). To realize their objectives, most automation functionalities rely on the Minimization of Drive Test (MDT) reports. The MDT reports are used to generate inferences about network state and performance, thus dynamically change network parameters accordingly. However, the collection of MDT reports from commodity user devices, particularly low cost IoT devices, make them a vulnerable entry point to launch an adversarial attack on emerging deeply automated wireless networks. This adds a new dimension to the security threats in the IoT and cellular networks. Existing literature on IoT, SON, or zero touch automation does not address this important problem. In this paper, we investigate an impactful, first of its kind adversarial attack that can be launched by exploiting the malicious MDT reports from the compromised user equipment (UE). We highlight the detrimental repercussions of this attack on the performance of common network automation functions. We also propose a novel Malicious MDT Reports Identification framework (MRIF) as a countermeasure to detect and eliminate the malicious MDT reports using Machine Learning and verify it through a use-case. Thus, the defense mechanism can provide the resilience and robustness for zero touch automation SON engines against the adversarial MDT attacks

摘要
深层自动化由自组织网络（SON）特点和其出现的变种，如零Touch自动化解决方案，是无线网络和互联网物联网（IoT）的关键加速器。为实现这些目标，大多数自动化功能都依赖于推定测试（MDT）报告。MDT报告可以生成对网络状态和性能的推理，因此动态改变网络参数。然而，从低成本IoT设备收集MDT报告，特别是低成本IoT设备，使得它们成为发动对新型深层自动化无线网络的敌意攻击的易受到攻击的入口点。这添加了新的安全隐患到互联网和无线网络中。现有的文献中关于IoT、SON或零Touch自动化没有讨论这个重要问题。在这篇论文中，我们调查了一种新型的攻击，可以通过利用受到恶意MDT报告的用户设备（UE）进行攻击。我们强调了这种攻击对常见网络自动化功能的负面影响。我们还提出了一个新的恶意MDT报告标识框架（MRIF）作为一种对抗手段，通过机器学习来检测和消除恶意MDT报告，并通过用例验证。因此，防御机制可以为零Touch自动化SON引擎提供防御力和坚固性，抵御攻击。

Structured Low-Rank Tensors for Generalized Linear Models

paper_url: http://arxiv.org/abs/2308.02922
repo_url: None
paper_authors: Batoul Taki, Anand D. Sarwate, Waheed U. Bajwa
for: 这个论文研究了一种新的低级别矩阵模型（LSR），用于通用线性模型（GLM）问题。
methods: 该论文提出了一种块坐标枢车算法来参数估计LSR结构矩阵GLM问题。
results: 论文 derive了一个最小最大下界，用于评估矩阵GLM问题中参数估计的误差阈值。该下界与矩阵GLM问题的自然度相比，表明sample complexity可能会比vectorized GLMs更低。此外，论文还进行了数值分析，并在 synthetic 数据上进行了三种类型的回归（线性、逻辑和波尔兹）的实验，以及一些医学成像数据上的实验。

Abstract
Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vector-based methods. This work investigates a new low-rank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model -- which generalizes the well-known Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model -- is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSR-structured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tucker-structured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples.

摘要
近期研究表明，在回归问题中强制材料结构 onto 参数矩阵可以导致更可靠的参数估计和较低的样本复杂度，相比 vector-based 方法。这个工作研究了一种新的低级别张量模型，called Low Separation Rank (LSR)，在泛化线性模型 (GLM) 问题中。LSR 模型总结了已知的各种 Tucker 和 CANDECOMP/PARAFAC (CP) 模型，并是特殊情况的块张量分解 (BTD) 模型。这个工作提出了基于块坐标推导法的参数估计算法，并 derivates 一个最小最大下界对 GLM 问题中参数矩阵的估计误差。这个下界与 LSR 张量 GLM 问题中参数矩阵的内在度量相对，表明其样本复杂度可能远低于vectorized GLMs。此外，这个结果还可以特殊化到 CP 和 Tucker 结构 GLM 问题中。经过数学分析和实验 validate，这个下界与文献中已知的精细度下界相当。最后，数值实验表明LSR 张量模型在三种回归类型（线性回归、逻辑回归和波恩回归）中具有出色的效果，并在实际医疗影像数据上进行了比较。

Spectral Ranking Inferences based on General Multiway Comparisons

paper_url: http://arxiv.org/abs/2308.02918
repo_url: None
paper_authors: Jianqing Fan, Zhipeng Lou, Weichen Wang, Mengxin Yu
for: 这 paper 研究 spectral method 在对比Entities的偏好分数的估计和不确定性评估中的性能。
methods: paper 使用 spectral method 在一个非常通用和更真实的设定中，其中比较图包含可能不同大小的高阶约束，并且可能只有一个比较。这种设定在实际应用中非常普遍，因此不需要指定图的随机性和PL/BTL模型中的均匀采样假设。
results: paper 发现，在适用 BTL/PL 模型时，spectral estimator 和 Maximum Likelihood Estimator (MLE) 之间存在关系。furthermore, paper 提出了一种two-step spectral method，可以达到 MLE 的同等效率。此外，paper 还提出了一个涵盖一 Sample和两 Sample 排名推论的完整框架，可以应用于固定图和随机图设定。这是首次提出了有效的两 Sample rank testing 方法。最后，paper 通过了详细的数学实验和应用于统计期刊和电影排名等。

Abstract
This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a very general and more realistic setup in which the comparison graph consists of hyper-edges of possible heterogeneous sizes and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly-used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in the scenarios when the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings. It is noteworthy that it is the first time effective two-sample rank testing methods are proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences on statistics journals and movie rankings.

摘要

Adversarial Erasing with Pruned Elements: Towards Better Graph Lottery Ticket

paper_url: http://arxiv.org/abs/2308.02916
repo_url: https://github.com/wangyuwen0627/ace-glt
paper_authors: Yuwen Wang, Shunyu Liu, Kaixuan Chen, Tongtian Zhu, Ji Qiao, Mengjie Shi, Yuanyu Wan, Mingli Song
for:这篇论文旨在提高大输入图形深度学习网络（GNN）的计算成本，而不是降低性能。methods:论文提出了一种组合核心子图和稀疏子网络的方法，即Graph Lottery Ticket（GLT），并使用迭代磁力基于剪枝（IMP）来获得奖券。然而，现有的研究仅将奖券获得者视为终极目标，而忽略了剪枝过程中当前关系的动态变化，这限制了奖券的吸引力。results:实验结果显示，我们的ACE-GLT在多种任务中均超越了现有的搜寻GLT方法。

Abstract
Graph Lottery Ticket (GLT), a combination of core subgraph and sparse subnetwork, has been proposed to mitigate the computational cost of deep Graph Neural Networks (GNNs) on large input graphs while preserving original performance. However, the winning GLTs in exisiting studies are obtained by applying iterative magnitude-based pruning (IMP) without re-evaluating and re-considering the pruned information, which disregards the dynamic changes in the significance of edges/weights during graph/model structure pruning, and thus limits the appeal of the winning tickets. In this paper, we formulate a conjecture, i.e., existing overlooked valuable information in the pruned graph connections and model parameters which can be re-grouped into GLT to enhance the final performance. Specifically, we propose an adversarial complementary erasing (ACE) framework to explore the valuable information from the pruned components, thereby developing a more powerful GLT, referred to as the ACE-GLT. The main idea is to mine valuable information from pruned edges/weights after each round of IMP, and employ the ACE technique to refine the GLT processing. Finally, experimental results demonstrate that our ACE-GLT outperforms existing methods for searching GLT in diverse tasks. Our code will be made publicly available.

摘要
Graph Lottery Ticket（GLT），一种将核心子图和稀疏子网络组合的方法，已被提出来降低深度图神经网络（GNNs）在大输入图上的计算成本，保持原始性能。然而，现有的赢家GLT通常通过不重新评估和重新考虑被剪除的信息来获得，这会忽略图/模型结构剪除中边Edge/权重的动态变化，从而限制赢家票的吸引力。在这篇论文中，我们提出一个假设，即现有的被过look的有价信息在剪除后的图连接和模型参数中，可以重新组织成GLT，以提高最终性能。 Specifically，我们提出一种对抗补做（ACE）框架，以挖掘剪除后的有价信息，并使用ACE技术来练级GLT处理。最后，我们的ACE-GLT在多种任务中的实验结果表明，我们的ACE-GLT比现有的GLT搜索方法更高效。我们的代码将在公共网上公布。