* [PATCH 0/8] add mTHP support for anonymous shmem
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Multi-size THP (mTHP) allocation has already been supported for anonymous
pages since commit 19eaf44954df, which allows THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shared pages ignore the mTHP rules configured through the
sysfs interface and can only use PMD-mapped THP, which is not reasonable. Many
applications implement anonymous page sharing through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios, so
users expect a unified mTHP strategy for anonymous pages, including anonymous
shared pages, in order to enjoy the benefits of mTHP: for example, lower
allocation latency than PMD-mapped THP, smaller memory bloat than PMD-mapped
THP, and contiguous PTEs on ARM architectures to reduce TLB misses.

The primary strategy is similar to the anonymous mTHP support. Introduce a new
interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can take all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new additional
"inherit" option. By default all sizes are set to "never" except the PMD size,
which is set to "inherit". This ensures backward compatibility with the
top-level 'shmem_enabled' setting, while also allowing independent control of
shmem enablement for each mTHP size.
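
For example (illustrative, assuming a 4K base page size), 64K mTHP can be
enabled for anonymous shmem independently of the top-level setting with:

  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled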

I used a page fault latency tool to measure the performance of faulting in 1G
of anonymous shmem with 32 threads, on my test machine (ARM64 architecture, 32
cores, 125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, it can be observed that the patchset has minimal impact
when mTHP is not enabled (some fluctuation was observed during testing). When
64K mTHP is enabled, there is a significant improvement in page fault latency.

TODO:
 - Support mTHP for tmpfs.
 - Do not split large folios when shared memory is swapped out.
 - Swap in large folios for shared memory.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' name consistency.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add reviewed tag from Barry.
 - Update the commit message.

Baolin Wang (8):
  mm: move highest_order() and next_order() out of the THP config
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  29 ++
 include/linux/huge_mm.h                    |  35 ++-
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  43 ++-
 mm/shmem.c                                 | 335 ++++++++++++++++++---
 5 files changed, 387 insertions(+), 72 deletions(-)

-- 
2.39.3



* [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Move highest_order() and next_order() out of the CONFIG_TRANSPARENT_HUGEPAGE
conditional, so that they can be used as common helpers.
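
For reference, these helpers walk a bitmask of folio orders from the highest
set order down to the lowest, e.g. (illustrative only):

	unsigned long orders = BIT(9) | BIT(4) | BIT(2);
	int order;

	for (order = highest_order(orders); orders;
	     order = next_order(&orders, order)) {
		/* visits order 9, then 4, then 2 */
	}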

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 017cee864080..e49b56c40a11 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -106,6 +106,17 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
 
+static inline int highest_order(unsigned long orders)
+{
+	return fls_long(orders) - 1;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+	*orders &= ~BIT(prev);
+	return highest_order(*orders);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 extern unsigned long transparent_hugepage_flags;
@@ -138,17 +149,6 @@ static inline bool hugepage_flags_enabled(void)
 	       huge_anon_orders_madvise;
 }
 
-static inline int highest_order(unsigned long orders)
-{
-	return fls_long(orders) - 1;
-}
-
-static inline int next_order(unsigned long *orders, int prev)
-{
-	*orders &= ~BIT(prev);
-	return highest_order(*orders);
-}
-
 /*
  * Do the below checks:
  *   - For file vma, check if the linear page offset of vma is
-- 
2.39.3



* [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Add support for establishing large folio mappings in finish_fault(), as a
preparation for supporting multi-size THP allocation of anonymous shmem pages
in the following patches.
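
To illustrate the VMA clamping logic below (hypothetical numbers, 4K base
pages): for a 16-page (64K) folio faulted at address 0x405000 inside a VMA
spanning [0x403000, 0x40a000):

	start    = ALIGN_DOWN(0x405000, 0x10000)                = 0x400000
	end      = start + 16 * PAGE_SIZE                       = 0x410000
	addr     = max(start, vma->vm_start)                    = 0x403000
	nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT = 7

so only the 7 pages of the folio that lie within the VMA are mapped, starting
from folio page (0x403000 - 0x400000) >> PAGE_SHIFT = 3.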

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..936377220b77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
+	struct folio *folio;
 	vm_fault_t ret;
 	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
 		      !(vma->vm_flags & VM_SHARED);
+	int type, nr_pages, i;
+	unsigned long addr = vmf->address;
 
 	/* Did we COW the page? */
 	if (is_cow)
@@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
+	folio = page_folio(page);
+	nr_pages = folio_nr_pages(folio);
+
+	if (unlikely(userfaultfd_armed(vma))) {
+		nr_pages = 1;
+	} else if (nr_pages > 1) {
+		unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+		unsigned long end = start + nr_pages * PAGE_SIZE;
+
+		/* In case the folio size in page cache beyond the VMA limits. */
+		addr = max(start, vma->vm_start);
+		nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
+
+		page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
+	}
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				      vmf->address, &vmf->ptl);
+				       addr, &vmf->ptl);
 	if (!vmf->pte)
 		return VM_FAULT_NOPAGE;
 
 	/* Re-check under ptl */
-	if (likely(!vmf_pte_changed(vmf))) {
-		struct folio *folio = page_folio(page);
-		int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-
-		set_pte_range(vmf, folio, page, 1, vmf->address);
-		add_mm_counter(vma->vm_mm, type, 1);
-		ret = 0;
-	} else {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
+		update_mmu_tlb(vma, addr, vmf->pte);
+		ret = VM_FAULT_NOPAGE;
+		goto unlock;
+	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		ret = VM_FAULT_NOPAGE;
+		goto unlock;
 	}
 
+	set_pte_range(vmf, folio, page, nr_pages, addr);
+	type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
+	add_mm_counter(vma->vm_mm, type, nr_pages);
+	ret = 0;
+
+unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
 }
-- 
2.39.3



* [PATCH 3/8] mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Add a new parameter to specify the huge page order for shmem_alloc_hugefolio(),
as a preparation to support mTHP.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..e4483c4596a8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1604,14 +1604,14 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 }
 
 static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
-		struct shmem_inode_info *info, pgoff_t index)
+		struct shmem_inode_info *info, pgoff_t index, int order)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct page *page;
 
-	mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
-	page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id());
+	mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
+	page = alloc_pages_mpol(gfp, order, mpol, ilx, numa_node_id());
 	mpol_cond_put(mpol);
 
 	return page_rmappable_folio(page);
@@ -1660,7 +1660,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 				index + HPAGE_PMD_NR - 1, XA_PRESENT))
 			return ERR_PTR(-E2BIG);
 
-		folio = shmem_alloc_hugefolio(gfp, info, index);
+		folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
 		if (!folio)
 			count_vm_event(THP_FILE_FALLBACK);
 	} else {
-- 
2.39.3



* [PATCH 4/8] mm: shmem: add THP validation for PMD-mapped THP related statistics
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

In order to extend support to mTHP, check that an allocation is actually
PMD-sized before updating the PMD-mapped THP related statistics, to avoid
statistical confusion.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/shmem.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e4483c4596a8..a383ea9a89a5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1661,7 +1661,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 			return ERR_PTR(-E2BIG);
 
 		folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
-		if (!folio)
+		if (!folio && pages == HPAGE_PMD_NR)
 			count_vm_event(THP_FILE_FALLBACK);
 	} else {
 		pages = 1;
@@ -1679,7 +1679,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 		if (xa_find(&mapping->i_pages, &index,
 				index + pages - 1, XA_PRESENT)) {
 			error = -EEXIST;
-		} else if (huge) {
+		} else if (pages == HPAGE_PMD_NR) {
 			count_vm_event(THP_FILE_FALLBACK);
 			count_vm_event(THP_FILE_FALLBACK_CHARGE);
 		}
@@ -2045,7 +2045,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		folio = shmem_alloc_and_add_folio(huge_gfp,
 				inode, index, fault_mm, true);
 		if (!IS_ERR(folio)) {
-			count_vm_event(THP_FILE_ALLOC);
+			if (folio_test_pmd_mappable(folio))
+				count_vm_event(THP_FILE_ALLOC);
 			goto alloced;
 		}
 		if (PTR_ERR(folio) == -EEXIST)
-- 
2.39.3



* [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/'
directory for each mTHP size, to control whether shmem is enabled for that size.
It takes values similar to the top-level 'shmem_enabled' and can be set to:
"always", "inherit" (to inherit the top-level setting), "within_size", "advise",
"never", "deny" and "force". These values follow the same semantics as the top
level, except that 'deny' is equivalent to 'never' and 'force' is equivalent to
'always', to keep compatibility.

By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
sizes have enabled="never" in
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'.

In addition, if the top-level value is 'force', then only PMD-sized hugepages
may have enabled="inherit"; setting 'inherit' on any other size fails, and vice
versa (setting the top level to 'force' fails while a non-PMD size is set to
'inherit'). This means non-PMD-sized THP can no longer override the global huge
allocation.
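
For example (illustrative, 4K base page size):

  echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
  # the top level can now no longer be forced; this write fails with -EINVAL:
  echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled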

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
 include/linux/huge_mm.h                    | 10 +++
 mm/huge_memory.c                           | 11 +--
 mm/shmem.c                                 | 96 ++++++++++++++++++++++
 4 files changed, 138 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 076443cc10a6..a28496e15bdb 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,35 @@ deny
 force
     Force the huge option on for all - very useful for testing;
 
+Anonymous shmem can also use "multi-size THP" (mTHP) by configuring the new
+sysfs knob /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled,
+which controls mTHP allocation. Its value for each mTHP size is essentially
+consistent with the global setting, with 'inherit' added for compatibility.
+always
+    Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+    have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+    Do not allocate <size> huge pages;
+
+within_size
+    Only allocate <size> huge page if it will be fully within i_size.
+    Also respect fadvise()/madvise() hints;
+
+advise
+    Only allocate <size> huge pages if requested with fadvise()/madvise();
+
+deny
+    Has the same semantics as 'never'; for now the mTHP allocation policy is
+    only used for anonymous shmem and does not override tmpfs.
+
+force
+    Has the same semantics as 'always'; for now the mTHP allocation policy is
+    only used for anonymous shmem and does not override tmpfs.
+
 Need of application restart
 ===========================
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e49b56c40a11..dbd6b3f56210 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
 #include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf,
 				  enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 /*
  * Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
 
+struct thpsize {
+	struct kobject kobj;
+	struct list_head node;
+	int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
 enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9efb6fefc391..d3080a8843f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
 static LIST_HEAD(thpsize_list);
 
-struct thpsize {
-	struct kobject kobj;
-	struct list_head node;
-	int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
 static ssize_t thpsize_enabled_show(struct kobject *kobj,
 				    struct kobj_attribute *attr, char *buf)
 {
@@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
 
 static struct attribute *thpsize_attrs[] = {
 	&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+	&thpsize_shmem_enabled_attr.attr,
+#endif
 	NULL,
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index a383ea9a89a5..59cc26d44344 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@ struct shmem_options {
 #define SHMEM_SEEN_QUOTA 32
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_anon_shmem_orders_always __read_mostly;
+static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
+static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
+static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
+#endif
+
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -4687,6 +4695,12 @@ void __init shmem_init(void)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+	/*
+	 * Default to setting PMD-sized THP to inherit the global setting and
+	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
+	 */
+	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
 #endif
 	return;
 
@@ -4746,6 +4760,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
 		return -EINVAL;
 
+	/* Do not override huge allocation policy with non-PMD sized mTHP */
+	if (huge == SHMEM_HUGE_FORCE &&
+	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+		return -EINVAL;
+
 	shmem_huge = huge;
 	if (shmem_huge > SHMEM_HUGE_DENY)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4753,6 +4772,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 }
 
 struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_anon_shmem_orders_always))
+		output = "[always] inherit within_size advise never deny [force]";
+	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
+		output = "always [inherit] within_size advise never deny force";
+	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
+		output = "always inherit [within_size] advise never deny force";
+	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
+		output = "always inherit within_size [advise] never deny force";
+	else
+		output = "always inherit within_size advise [never] [deny] force";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_always);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "inherit")) {
+		/* Do not override huge allocation policy with non-PMD sized mTHP */
+		if (shmem_huge == SHMEM_HUGE_FORCE &&
+		    order != HPAGE_PMD_ORDER)
+			return -EINVAL;
+
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_inherit);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "within_size")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		set_bit(order, &huge_anon_shmem_orders_within_size);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "madvise")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else {
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
 
 #else /* !CONFIG_SHMEM */
-- 
2.39.3



* [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Commit 19eaf44954df added multi-size THP (mTHP) support for anonymous pages,
allowing THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shared pages ignore the mTHP rules configured through the
sysfs interface and can only use PMD-mapped THP, which is not reasonable. Users
expect the mTHP rules to apply to all anonymous pages, including anonymous
shared pages, in order to enjoy the benefits of mTHP: for example, lower
allocation latency than PMD-mapped THP, smaller memory bloat than PMD-mapped
THP, and contiguous PTEs on ARM architectures to reduce TLB misses.

The primary strategy is similar to the anonymous mTHP support. Introduce a new
interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can take all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new additional
"inherit" option. By default all sizes are set to "never" except the PMD size,
which is set to "inherit". This ensures backward compatibility with the
top-level 'shmem_enabled' setting, while also allowing independent control of
shmem enablement for each mTHP size.
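
To illustrate how the allowable orders are computed below (hypothetical
settings, 4K base pages):

	/*
	 * hugepages-64kB/shmem_enabled   = always  -> BIT(4) set in the
	 *                                             'always' orders mask
	 * hugepages-2048kB/shmem_enabled = inherit -> BIT(9) set in the
	 *                                             'inherit' orders mask
	 *
	 * global_huge == true  -> allowable orders = BIT(4) | BIT(9)
	 * global_huge == false -> allowable orders = BIT(4)
	 */

The result is then narrowed by thp_vma_suitable_orders() and the page cache
conflict check before attempting the actual allocation.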

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 177 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 150 insertions(+), 27 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 59cc26d44344..08ccea5170a1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,106 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 	return result;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
+	unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
+	unsigned long vm_flags = vma->vm_flags;
+	/*
+	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+	 * are enabled for this vma.
+	 */
+	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+	loff_t i_size;
+	int order;
+
+	if ((vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return 0;
+
+	/* If the hardware/firmware marked hugepage support disabled. */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+		return 0;
+
+	/*
+	 * Following the 'deny' semantics of the top level, force the huge
+	 * option off from all mounts.
+	 */
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return 0;
+	/*
+	 * Only allow inherit orders if the top-level value is 'force', which
+	 * means non-PMD sized THP can not override 'huge' mount option now.
+	 */
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	/* Allow mTHP that will be fully within i_size. */
+	order = highest_order(within_size_orders);
+	while (within_size_orders) {
+		index = round_up(index + 1, 1UL << order);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >> PAGE_SHIFT >= index) {
+			mask |= within_size_orders;
+			break;
+		}
+
+		order = next_order(&within_size_orders, order);
+	}
+
+	if (vm_flags & VM_HUGEPAGE)
+		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
+
+	if (global_huge)
+		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	return orders & mask;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					struct address_space *mapping, pgoff_t index,
+					unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pages;
+	int order;
+
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	if (!orders)
+		return 0;
+
+	/* Find the highest order that can add into the page cache */
+	order = highest_order(orders);
+	while (orders) {
+		pages = 1UL << order;
+		index = round_down(index, pages);
+		if (!xa_find(&mapping->i_pages, &index,
+			     index + pages - 1, XA_PRESENT))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	return 0;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					struct address_space *mapping, pgoff_t index,
+					unsigned long orders)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
 		struct shmem_inode_info *info, pgoff_t index, int order)
 {
@@ -1639,38 +1739,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
 	return (struct folio *)page;
 }
 
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
-		struct inode *inode, pgoff_t index,
-		struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+		gfp_t gfp, struct inode *inode, pgoff_t index,
+		struct mm_struct *fault_mm, bool huge, unsigned long orders)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long suitable_orders;
 	struct folio *folio;
 	long pages;
-	int error;
+	int error, order;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		huge = false;
 
-	if (huge) {
-		pages = HPAGE_PMD_NR;
-		index = round_down(index, HPAGE_PMD_NR);
+	if (huge || orders > 0) {
+		if (vma && vma_is_anon_shmem(vma) && orders) {
+			suitable_orders = anon_shmem_suitable_orders(inode, vmf,
+							mapping, index, orders);
+		} else {
+			pages = HPAGE_PMD_NR;
+			suitable_orders = BIT(HPAGE_PMD_ORDER);
+			index = round_down(index, HPAGE_PMD_NR);
 
-		/*
-		 * Check for conflict before waiting on a huge allocation.
-		 * Conflict might be that a huge page has just been allocated
-		 * and added to page cache by a racing thread, or that there
-		 * is already at least one small page in the huge extent.
-		 * Be careful to retry when appropriate, but not forever!
-		 * Elsewhere -EEXIST would be the right code, but not here.
-		 */
-		if (xa_find(&mapping->i_pages, &index,
+			/*
+			 * Check for conflict before waiting on a huge allocation.
+			 * Conflict might be that a huge page has just been allocated
+			 * and added to page cache by a racing thread, or that there
+			 * is already at least one small page in the huge extent.
+			 * Be careful to retry when appropriate, but not forever!
+			 * Elsewhere -EEXIST would be the right code, but not here.
+			 */
+			if (xa_find(&mapping->i_pages, &index,
 				index + HPAGE_PMD_NR - 1, XA_PRESENT))
-			return ERR_PTR(-E2BIG);
+				return ERR_PTR(-E2BIG);
+		}
 
-		folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
-		if (!folio && pages == HPAGE_PMD_NR)
-			count_vm_event(THP_FILE_FALLBACK);
+		order = highest_order(suitable_orders);
+		while (suitable_orders) {
+			pages = 1 << order;
+			index = round_down(index, pages);
+			folio = shmem_alloc_hugefolio(gfp, info, index, order);
+			if (folio)
+				goto allocated;
+
+			if (pages == HPAGE_PMD_NR)
+				count_vm_event(THP_FILE_FALLBACK);
+			order = next_order(&suitable_orders, order);
+		}
 	} else {
 		pages = 1;
 		folio = shmem_alloc_folio(gfp, info, index);
@@ -1678,6 +1795,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 	if (!folio)
 		return ERR_PTR(-ENOMEM);
 
+allocated:
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
@@ -1972,7 +2090,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	struct mm_struct *fault_mm;
 	struct folio *folio;
 	int error;
-	bool alloced;
+	bool alloced, huge;
+	unsigned long orders = 0;
 
 	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
 		return -EINVAL;
@@ -2044,14 +2163,18 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (shmem_is_huge(inode, index, false, fault_mm,
-			  vma ? vma->vm_flags : 0)) {
+	huge = shmem_is_huge(inode, index, false, fault_mm,
+			     vma ? vma->vm_flags : 0);
+	/* Find hugepage orders that are allowed for anonymous shmem. */
+	if (vma && vma_is_anon_shmem(vma))
+		orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
+	if (huge || orders > 0) {
 		gfp_t huge_gfp;
 
 		huge_gfp = vma_thp_gfp_mask(vma);
 		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-		folio = shmem_alloc_and_add_folio(huge_gfp,
-				inode, index, fault_mm, true);
+		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+				inode, index, fault_mm, true, orders);
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
@@ -2061,7 +2184,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 			goto repeat;
 	}
 
-	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, false, 0);
 	if (IS_ERR(folio)) {
 		error = PTR_ERR(folio);
 		if (error == -EEXIST)
@@ -2072,7 +2195,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 
 alloced:
 	alloced = true;
-	if (folio_test_pmd_mappable(folio) &&
+	if (folio_test_large(folio) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 					folio_next_index(folio) - 1) {
 		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-- 
2.39.3



* [PATCH 7/8] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Although the top-level hugepage allocation can be turned off, anonymous shmem
can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled'. Therefore,
take the enabled mTHP sizes into account and provide a suitably aligned address
in shmem_get_unmapped_area().
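
As a sketch of the alignment arithmetic (hypothetical numbers, 4K base pages,
with 64K being the highest enabled mTHP size, so hpage_size = 0x10000): for a
1M request at pgoff 0 (offset = 0), the area is re-searched with

	inflated_len = len + hpage_size - PAGE_SIZE = 0x10f000

and if that search returns inflated_addr = 0x7f1234567000:

	inflated_offset = inflated_addr & (hpage_size - 1)  /* 0x7000 */
	inflated_addr  += offset - inflated_offset          /* 0x7f1234560000 */
	/* inflated_offset > offset, so: */
	inflated_addr  += hpage_size                        /* 0x7f1234570000 */

yielding a 64K-aligned address that still lies within the inflated range.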

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 08ccea5170a1..27107afdff9e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	unsigned long inflated_len;
 	unsigned long inflated_addr;
 	unsigned long inflated_offset;
+	unsigned long hpage_size = HPAGE_PMD_SIZE;
 
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -2422,8 +2423,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return addr;
-	if (len < HPAGE_PMD_SIZE)
-		return addr;
 	if (flags & MAP_FIXED)
 		return addr;
 	/*
@@ -2437,6 +2436,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 
 	if (shmem_huge != SHMEM_HUGE_FORCE) {
 		struct super_block *sb;
+		unsigned long __maybe_unused hpage_orders;
+		int order = 0;
 
 		if (file) {
 			VM_BUG_ON(file->f_op != &shmem_file_operations);
@@ -2449,18 +2450,34 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 			if (IS_ERR(shm_mnt))
 				return addr;
 			sb = shm_mnt->mnt_sb;
+
+			/*
+			 * Find the highest mTHP order used for anonymous shmem to
+			 * provide a suitable alignment address.
+			 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			hpage_orders = READ_ONCE(huge_anon_shmem_orders_always);
+			hpage_orders |= READ_ONCE(huge_anon_shmem_orders_within_size);
+			hpage_orders |= READ_ONCE(huge_anon_shmem_orders_madvise);
+			hpage_orders |= READ_ONCE(huge_anon_shmem_orders_inherit);
+			order = highest_order(hpage_orders);
+			hpage_size = PAGE_SIZE << order;
+#endif
 		}
-		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
+		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order)
 			return addr;
 	}
 
-	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
-	if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+	if (len < hpage_size)
+		return addr;
+
+	offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
+	if (offset && offset + len < 2 * hpage_size)
 		return addr;
-	if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+	if ((addr & (hpage_size - 1)) == offset)
 		return addr;
 
-	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+	inflated_len = len + hpage_size - PAGE_SIZE;
 	if (inflated_len > TASK_SIZE)
 		return addr;
 	if (inflated_len < len)
@@ -2473,10 +2490,10 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (inflated_addr & ~PAGE_MASK)
 		return addr;
 
-	inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+	inflated_offset = inflated_addr & (hpage_size - 1);
 	inflated_addr += offset - inflated_offset;
 	if (inflated_offset > offset)
-		inflated_addr += HPAGE_PMD_SIZE;
+		inflated_addr += hpage_size;
 
 	if (inflated_addr > TASK_SIZE - len)
 		return addr;
-- 
2.39.3



* [PATCH 8/8] mm: shmem: add mTHP counters for anonymous shmem
From: Baolin Wang @ 2024-05-06  8:46 UTC
  To: akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel

Add mTHP counters for anonymous shmem.
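
Once enabled, the new per-size counters can be read from sysfs, e.g.
(illustrative, assuming a 4K base page size and the existing per-size mTHP
'stats' directory):

  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_alloc
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_fallback
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/file_fallback_charge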

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  3 +++
 mm/huge_memory.c        |  6 ++++++
 mm/shmem.c              | 18 +++++++++++++++---
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index dbd6b3f56210..c15bebb2cf53 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -281,6 +281,9 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
 	MTHP_STAT_ANON_SWPOUT,
 	MTHP_STAT_ANON_SWPOUT_FALLBACK,
+	MTHP_STAT_FILE_ALLOC,
+	MTHP_STAT_FILE_FALLBACK,
+	MTHP_STAT_FILE_FALLBACK_CHARGE,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3080a8843f2..fcda6ae604f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
 DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
 
 static struct attribute *stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
 	&anon_fault_fallback_charge_attr.attr,
 	&anon_swpout_attr.attr,
 	&anon_swpout_fallback_attr.attr,
+	&file_alloc_attr.attr,
+	&file_fallback_attr.attr,
+	&file_fallback_charge_attr.attr,
 	NULL,
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 27107afdff9e..1af2f0aa384d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1786,6 +1786,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 
 			if (pages == HPAGE_PMD_NR)
 				count_vm_event(THP_FILE_FALLBACK);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+#endif
 			order = next_order(&suitable_orders, order);
 		}
 	} else {
@@ -1805,9 +1808,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 		if (xa_find(&mapping->i_pages, &index,
 				index + pages - 1, XA_PRESENT)) {
 			error = -EEXIST;
-		} else if (pages == HPAGE_PMD_NR) {
-			count_vm_event(THP_FILE_FALLBACK);
-			count_vm_event(THP_FILE_FALLBACK_CHARGE);
+		} else if (pages > 1) {
+			if (pages == HPAGE_PMD_NR) {
+				count_vm_event(THP_FILE_FALLBACK);
+				count_vm_event(THP_FILE_FALLBACK_CHARGE);
+			}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+#endif
 		}
 		goto unlock;
 	}
@@ -2178,6 +2187,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+#endif
 			goto alloced;
 		}
 		if (PTR_ERR(folio) == -EEXIST)
-- 
2.39.3



* Re: [PATCH 0/8] add mTHP support for anonymous shmem
From: Lance Yang @ 2024-05-06 10:54 UTC
  To: baolin.wang
  Cc: 21cnbao, akpm, david, hughd, ioworker0, linux-kernel, linux-mm,
	ryan.roberts, shy828301, wangkefeng.wang, willy, ying.huang, ziy

Hey Baolin,

I found a compilation issue that failed one[1] of my configurations
after applying this series. The error message is as follows:

mm/shmem.c: In function ‘shmem_get_unmapped_area’:
././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
        _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                                            ^
././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
                         prefix ## suffix();                             \
                         ^~~~~~
././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
        _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
 #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                     ^~~~~~~~~~~~~~~~
./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
                            ^~~~~~~~~
./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
 #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
                                   ^~~~~~~~~~~~~~~
mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
        unsigned long hpage_size = HPAGE_PMD_SIZE;
                                   ^~~~~~~~~~~~~~~

It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
is undefined.

[1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)

Thanks,
Lance


* Re: [PATCH 0/8] add mTHP support for anonymous shmem
From: Baolin Wang @ 2024-05-07  1:47 UTC
  To: Lance Yang
  Cc: 21cnbao, akpm, david, hughd, linux-kernel, linux-mm,
	ryan.roberts, shy828301, wangkefeng.wang, willy, ying.huang, ziy

Hi Lance,

On 2024/5/6 18:54, Lance Yang wrote:
> Hey Baolin,
> 
> I found a compilation issue that failed one[1] of my configurations
> after applying this series. The error message is as follows:
> 
> mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
>          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>                                              ^
> ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
>                           prefix ## suffix();                             \
>                           ^~~~~~
> ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
>          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          ^~~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
>   #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>                                       ^~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
>   #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>                       ^~~~~~~~~~~~~~~~
> ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
>   #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>                              ^~~~~~~~~
> ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
>   #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
>                                     ^~~~~~~~~~~~~~~
> mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
>          unsigned long hpage_size = HPAGE_PMD_SIZE;
>                                     ^~~~~~~~~~~~~~~
> 
> It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> is undefined.
> 
> [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)

Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the
check for CONFIG_TRANSPARENT_HUGEPAGE, which avoids the build error:

diff --git a/mm/shmem.c b/mm/shmem.c
index 1af2f0aa384d..d603e36e0f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
         unsigned long inflated_len;
         unsigned long inflated_addr;
         unsigned long inflated_offset;
-       unsigned long hpage_size = HPAGE_PMD_SIZE;
+       unsigned long hpage_size;

         if (len > TASK_SIZE)
                 return -ENOMEM;
@@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
         if (uaddr == addr)
                 return addr;

+       hpage_size = HPAGE_PMD_SIZE;
         if (shmem_huge != SHMEM_HUGE_FORCE) {
                 struct super_block *sb;
                 unsigned long __maybe_unused hpage_orders;


* Re: [PATCH 0/8] add mTHP support for anonymous shmem
From: Lance Yang @ 2024-05-07  6:50 UTC
  To: Baolin Wang
  Cc: 21cnbao, akpm, david, hughd, linux-kernel, linux-mm,
	ryan.roberts, shy828301, wangkefeng.wang, willy, ying.huang, ziy

On Tue, May 7, 2024 at 9:47 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Hi Lance,
>
> On 2024/5/6 18:54, Lance Yang wrote:
> > Hey Baolin,
> >
> > I found a compilation issue that failed one[1] of my configurations
> > after applying this series. The error message is as follows:
> >
> > mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> > ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
> >          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >                                              ^
> > ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
> >                           prefix ## suffix();                             \
> >                           ^~~~~~
> > ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
> >          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >          ^~~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> >   #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> >                                       ^~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> >   #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> >                       ^~~~~~~~~~~~~~~~
> > ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
> >   #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> >                              ^~~~~~~~~
> > ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
> >   #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
> >                                     ^~~~~~~~~~~~~~~
> > mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
> >          unsigned long hpage_size = HPAGE_PMD_SIZE;
> >                                     ^~~~~~~~~~~~~~~
> >
> > It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > is undefined.
> >
> > [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)
>
> Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the
> check for CONFIG_TRANSPARENT_HUGEPAGE, which can avoid the building error:

I confirmed that the issue I reported before has disappeared after applying
this change. For the fix,

Tested-by: Lance Yang <ioworker0@gmail.com>

Thanks,
Lance

>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1af2f0aa384d..d603e36e0f4f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
>          unsigned long inflated_len;
>          unsigned long inflated_addr;
>          unsigned long inflated_offset;
> -       unsigned long hpage_size = HPAGE_PMD_SIZE;
> +       unsigned long hpage_size;
>
>          if (len > TASK_SIZE)
>                  return -ENOMEM;
> @@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
>          if (uaddr == addr)
>                  return addr;
>
> +       hpage_size = HPAGE_PMD_SIZE;
>          if (shmem_huge != SHMEM_HUGE_FORCE) {
>                  struct super_block *sb;
>                  unsigned long __maybe_unused hpage_orders;


* Re: [PATCH 0/8] add mTHP support for anonymous shmem
From: Ryan Roberts @ 2024-05-07 10:20 UTC
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 06/05/2024 09:46, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level

Didn't we agree that "force" would not be supported for now, and would return an
error when attempting to set it for a non-PMD-size hugepages-XXkB/shmem_enabled
(or indirectly through inheritance)?

> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.
> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
> 
> TODO:
>  - Support mTHP for tmpfs.
>  - Do not split the large folio when share memory swap out.
>  - Can swap in a large folio for share memory.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (8):
>   mm: move highest_order() and next_order() out of the THP config
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  29 ++
>  include/linux/huge_mm.h                    |  35 ++-
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  43 ++-
>  mm/shmem.c                                 | 335 ++++++++++++++++++---
>  5 files changed, 387 insertions(+), 72 deletions(-)
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config
  2024-05-06  8:46 ` [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config Baolin Wang
@ 2024-05-07 10:21   ` Ryan Roberts
  2024-05-08  2:13     ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-07 10:21 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 06/05/2024 09:46, Baolin Wang wrote:
> Move highest_order() and next_order() out of the CONFIG_TRANSPARENT_HUGEPAGE
> ifdef so that they can be used as common helpers.

Sorry if I haven't kept up with the discussion, but why is this needed? I
wouldn't expect a need to iterate over orders if THP is compile-time disabled
because we will never try to allocate THP?
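
For reference, the usage pattern these two helpers enable is roughly the
below: a minimal userspace sketch with local stand-ins for the kernel
helpers (the real ones use fls_long()), not code from the series:

#include <stdio.h>

/* userspace stand-in for the kernel helper; orders must be non-zero */
static int highest_order(unsigned long orders)
{
	return (int)(sizeof(orders) * 8 - 1) - __builtin_clzl(orders);
}

/* clear the previously tried order and return the next-highest one */
static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);
	return *orders ? highest_order(*orders) : -1;
}

int main(void)
{
	/* candidate folio orders 9, 4 and 2, tried largest-first */
	unsigned long orders = (1UL << 9) | (1UL << 4) | (1UL << 2);
	int order = highest_order(orders);

	while (orders) {
		printf("try order %d\n", order);	/* prints 9, 4, 2 */
		order = next_order(&orders, order);
	}
	return 0;
}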

> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  include/linux/huge_mm.h | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 017cee864080..e49b56c40a11 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -106,6 +106,17 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
>  #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
>  
> +static inline int highest_order(unsigned long orders)
> +{
> +	return fls_long(orders) - 1;
> +}
> +
> +static inline int next_order(unsigned long *orders, int prev)
> +{
> +	*orders &= ~BIT(prev);
> +	return highest_order(*orders);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  
>  extern unsigned long transparent_hugepage_flags;
> @@ -138,17 +149,6 @@ static inline bool hugepage_flags_enabled(void)
>  	       huge_anon_orders_madvise;
>  }
>  
> -static inline int highest_order(unsigned long orders)
> -{
> -	return fls_long(orders) - 1;
> -}
> -
> -static inline int next_order(unsigned long *orders, int prev)
> -{
> -	*orders &= ~BIT(prev);
> -	return highest_order(*orders);
> -}
> -
>  /*
>   * Do the below checks:
>   *   - For file vma, check if the linear page offset of vma is


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-06  8:46 ` [PATCH 2/8] mm: memory: extend finish_fault() to support large folio Baolin Wang
@ 2024-05-07 10:37   ` Ryan Roberts
  2024-05-08  3:44     ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-07 10:37 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 06/05/2024 09:46, Baolin Wang wrote:
> Add support for establishing large folio mappings in finish_fault(), as
> preparation for supporting multi-size THP allocation of anonymous shmem pages
> in the following patches.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index eea6e4984eae..936377220b77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct page *page;
> +	struct folio *folio;
>  	vm_fault_t ret;
>  	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>  		      !(vma->vm_flags & VM_SHARED);
> +	int type, nr_pages, i;
> +	unsigned long addr = vmf->address;
>  
>  	/* Did we COW the page? */
>  	if (is_cow)
> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  			return VM_FAULT_OOM;
>  	}
>  
> +	folio = page_folio(page);
> +	nr_pages = folio_nr_pages(folio);
> +
> +	if (unlikely(userfaultfd_armed(vma))) {
> +		nr_pages = 1;
> +	} else if (nr_pages > 1) {
> +		unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +		unsigned long end = start + nr_pages * PAGE_SIZE;
> +
> +		/* In case the folio size in page cache beyond the VMA limits. */
> +		addr = max(start, vma->vm_start);
> +		nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
> +
> +		page = folio_page(folio, (addr - start) >> PAGE_SHIFT);

I still don't really follow the logic in this else if block. Isn't it possible
that finish_fault() gets called with a page from a folio that isn't aligned with
vmf->address?

For example, let's say we have a file who's size is 64K and which is cached in a
single large folio in the page cache. But the file is mapped into a process at
VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
start=0 and end=64K I think?
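
(As a quick sanity check of that arithmetic, here is a throwaway userspace
calculation assuming 4K pages and a 16-page folio; not code from the series:

#include <stdio.h>

int main(void)
{
	unsigned long page = 4096, nr = 16;
	unsigned long address = 16 * 1024;	/* faulting VA           */
	unsigned long vm_start = 16 * 1024;	/* VMA covers [16K, 80K) */
	unsigned long vm_end = 80 * 1024;

	unsigned long start = address & ~(nr * page - 1);	/* ALIGN_DOWN */
	unsigned long end = start + nr * page;
	unsigned long addr = start > vm_start ? start : vm_start;
	unsigned long pages = ((end < vm_end ? end : vm_end) - addr) / page;

	printf("start=%luK end=%luK addr=%luK nr_pages=%lu folio_idx=%lu\n",
	       start / 1024, end / 1024, addr / 1024, pages,
	       (addr - start) / page);
	return 0;
}

It prints start=0K end=64K addr=16K nr_pages=12 folio_idx=4, even though
VA 16K corresponds to file offset 0, i.e. folio page 0.)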

Additionally, I think this path will end up mapping the entire folio (as long as
it fits in the VMA). But this bypasses the fault-around configuration. As I
think I mentioned against the RFC, this will inflate the RSS of the process and
can cause behavioural changes as a result. I believe the current advice is to
disable fault-around to prevent this kind of bloat when needed.

It might be that you need a special variant of finish_fault() for shmem?


> +	}
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -				      vmf->address, &vmf->ptl);
> +				       addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		return VM_FAULT_NOPAGE;
>  
>  	/* Re-check under ptl */
> -	if (likely(!vmf_pte_changed(vmf))) {
> -		struct folio *folio = page_folio(page);
> -		int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> -
> -		set_pte_range(vmf, folio, page, 1, vmf->address);
> -		add_mm_counter(vma->vm_mm, type, 1);
> -		ret = 0;
> -	} else {
> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> +	if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> +		update_mmu_tlb(vma, addr, vmf->pte);
> +		ret = VM_FAULT_NOPAGE;
> +		goto unlock;
> +	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> +		for (i = 0; i < nr_pages; i++)
> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>  		ret = VM_FAULT_NOPAGE;
> +		goto unlock;
>  	}
>  
> +	set_pte_range(vmf, folio, page, nr_pages, addr);
> +	type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> +	add_mm_counter(vma->vm_mm, type, nr_pages);
> +	ret = 0;
> +
> +unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
  2024-05-06  8:46 ` [PATCH 6/8] mm: shmem: add mTHP support " Baolin Wang
@ 2024-05-07 10:46   ` kernel test robot
  2024-05-08  6:03     ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: kernel test robot @ 2024-05-07 10:46 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: llvm, oe-kbuild-all, willy, david, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, baolin.wang,
	linux-mm, linux-kernel

Hi Baolin,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240506]
[cannot apply to linus/master v6.9-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-move-highest_order-and-next_order-out-of-the-THP-config/20240506-164838
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/adc64bf0f150bdc614c6c06fc313adeef7dbbbff.1714978902.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20240507/202405071820.2KY0UnDu-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 0ab4458df0688955620b72cc2c72a32dffad3615)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240507/202405071820.2KY0UnDu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405071820.2KY0UnDu-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/shmem.c:28:
   In file included from include/linux/ramfs.h:5:
   In file included from include/linux/fs_parser.h:11:
   In file included from include/linux/fs_context.h:14:
   In file included from include/linux/security.h:33:
   In file included from include/linux/mm.h:2253:
   include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     514 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> mm/shmem.c:1780:10: warning: variable 'folio' is used uninitialized whenever 'while' loop exits because its condition is false [-Wsometimes-uninitialized]
    1780 |                 while (suitable_orders) {
         |                        ^~~~~~~~~~~~~~~
   mm/shmem.c:1795:7: note: uninitialized use occurs here
    1795 |         if (!folio)
         |              ^~~~~
   mm/shmem.c:1780:10: note: remove the condition if it is always true
    1780 |                 while (suitable_orders) {
         |                        ^~~~~~~~~~~~~~~
         |                        1
   mm/shmem.c:1750:21: note: initialize the variable 'folio' to silence this warning
    1750 |         struct folio *folio;
         |                            ^
         |                             = NULL
   mm/shmem.c:1564:20: warning: unused function 'shmem_show_mpol' [-Wunused-function]
    1564 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
         |                    ^~~~~~~~~~~~~~~
   3 warnings generated.


vim +1780 mm/shmem.c

  1741	
  1742	static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
  1743			gfp_t gfp, struct inode *inode, pgoff_t index,
  1744			struct mm_struct *fault_mm, bool huge, unsigned long orders)
  1745	{
  1746		struct address_space *mapping = inode->i_mapping;
  1747		struct shmem_inode_info *info = SHMEM_I(inode);
  1748		struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
  1749		unsigned long suitable_orders;
  1750		struct folio *folio;
  1751		long pages;
  1752		int error, order;
  1753	
  1754		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
  1755			huge = false;
  1756	
  1757		if (huge || orders > 0) {
  1758			if (vma && vma_is_anon_shmem(vma) && orders) {
  1759				suitable_orders = anon_shmem_suitable_orders(inode, vmf,
  1760								mapping, index, orders);
  1761			} else {
  1762				pages = HPAGE_PMD_NR;
  1763				suitable_orders = BIT(HPAGE_PMD_ORDER);
  1764				index = round_down(index, HPAGE_PMD_NR);
  1765	
  1766				/*
  1767				 * Check for conflict before waiting on a huge allocation.
  1768				 * Conflict might be that a huge page has just been allocated
  1769				 * and added to page cache by a racing thread, or that there
  1770				 * is already at least one small page in the huge extent.
  1771				 * Be careful to retry when appropriate, but not forever!
  1772				 * Elsewhere -EEXIST would be the right code, but not here.
  1773				 */
  1774				if (xa_find(&mapping->i_pages, &index,
  1775					index + HPAGE_PMD_NR - 1, XA_PRESENT))
  1776					return ERR_PTR(-E2BIG);
  1777			}
  1778	
  1779			order = highest_order(suitable_orders);
> 1780			while (suitable_orders) {
  1781				pages = 1 << order;
  1782				index = round_down(index, pages);
  1783				folio = shmem_alloc_hugefolio(gfp, info, index, order);
  1784				if (folio)
  1785					goto allocated;
  1786	
  1787				if (pages == HPAGE_PMD_NR)
  1788					count_vm_event(THP_FILE_FALLBACK);
  1789				order = next_order(&suitable_orders, order);
  1790			}
  1791		} else {
  1792			pages = 1;
  1793			folio = shmem_alloc_folio(gfp, info, index);
  1794		}
  1795		if (!folio)
  1796			return ERR_PTR(-ENOMEM);
  1797	
  1798	allocated:
  1799		__folio_set_locked(folio);
  1800		__folio_set_swapbacked(folio);
  1801	
  1802		gfp &= GFP_RECLAIM_MASK;
  1803		error = mem_cgroup_charge(folio, fault_mm, gfp);
  1804		if (error) {
  1805			if (xa_find(&mapping->i_pages, &index,
  1806					index + pages - 1, XA_PRESENT)) {
  1807				error = -EEXIST;
  1808			} else if (pages == HPAGE_PMD_NR) {
  1809				count_vm_event(THP_FILE_FALLBACK);
  1810				count_vm_event(THP_FILE_FALLBACK_CHARGE);
  1811			}
  1812			goto unlock;
  1813		}
  1814	
  1815		error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
  1816		if (error)
  1817			goto unlock;
  1818	
  1819		error = shmem_inode_acct_blocks(inode, pages);
  1820		if (error) {
  1821			struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
  1822			long freed;
  1823			/*
  1824			 * Try to reclaim some space by splitting a few
  1825			 * large folios beyond i_size on the filesystem.
  1826			 */
  1827			shmem_unused_huge_shrink(sbinfo, NULL, 2);
  1828			/*
  1829			 * And do a shmem_recalc_inode() to account for freed pages:
  1830			 * except our folio is there in cache, so not quite balanced.
  1831			 */
  1832			spin_lock(&info->lock);
  1833			freed = pages + info->alloced - info->swapped -
  1834				READ_ONCE(mapping->nrpages);
  1835			if (freed > 0)
  1836				info->alloced -= freed;
  1837			spin_unlock(&info->lock);
  1838			if (freed > 0)
  1839				shmem_inode_unacct_blocks(inode, freed);
  1840			error = shmem_inode_acct_blocks(inode, pages);
  1841			if (error) {
  1842				filemap_remove_folio(folio);
  1843				goto unlock;
  1844			}
  1845		}
  1846	
  1847		shmem_recalc_inode(inode, pages, 0);
  1848		folio_add_lru(folio);
  1849		return folio;
  1850	
  1851	unlock:
  1852		folio_unlock(folio);
  1853		folio_put(folio);
  1854		return ERR_PTR(error);
  1855	}
  1856	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-06  8:46 ` [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
@ 2024-05-07 10:52   ` Ryan Roberts
  2024-05-08  4:45     ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-07 10:52 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 06/05/2024 09:46, Baolin Wang wrote:
> To support the use of mTHP with anonymous shmem, add a new sysfs interface
> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
> directory for each mTHP to control whether shmem is enabled for that mTHP,
> with a value similar to the top level 'shmem_enabled', which can be set to:
> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
> "never", "deny", "force". These values follow the same semantics as the top
> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
> to 'always' to keep compatibility.

We decided at [1] to not allow 'force' for non-PMD-sizes.

[1]
https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/

However, thinking about this a bit more, I wonder if the decision we made to
allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
for legacy back compat after all, I don't think there is any use case where
changing multiple mTHP size controls atomically is actually useful). Applying
that pattern here, it means the top level can always take "force" without any
weird error checking. And we would allow "force" on the PMD-sized control but
not on the others - again this is easy to error check.

Does this pattern make more sense? If so, is it too late to change
hugepages-xxkB/enabled interface?

> 
> By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
> sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
> 
> In addition, if the top level value is 'force', then only PMD-sized hugepages
> may have enabled="inherit"; otherwise the configuration will fail, and vice versa.
> That means we now avoid using non-PMD sized THP to override the global
> huge allocation.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
>  include/linux/huge_mm.h                    | 10 +++
>  mm/huge_memory.c                           | 11 +--
>  mm/shmem.c                                 | 96 ++++++++++++++++++++++
>  4 files changed, 138 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 076443cc10a6..a28496e15bdb 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -332,6 +332,35 @@ deny
>  force
>      Force the huge option on for all - very useful for testing;
>  
> +Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
> +to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
> +Its value for each mTHP is essentially consistent with the global setting, except
> +for the addition of 'inherit' to ensure compatibility with the global settings.
> +always
> +    Attempt to allocate <size> huge pages every time we need a new page;
> +
> +inherit
> +    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
> +    have enabled="inherit" and all other hugepage sizes have enabled="never";
> +
> +never
> +    Do not allocate <size> huge pages;
> +
> +within_size
> +    Only allocate <size> huge page if it will be fully within i_size.
> +    Also respect fadvise()/madvise() hints;
> +
> +advise
> +    Only allocate <size> huge pages if requested with fadvise()/madvise();
> +
> +deny
> +    Has the same semantics as 'never'; now the mTHP allocation policy is
> +    only used for anonymous shmem and does not override tmpfs.
> +
> +force
> +    Has the same semantics as 'always'; now the mTHP allocation policy is
> +    only used for anonymous shmem and does not override tmpfs.
> +
>  Need of application restart
>  ===========================
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e49b56c40a11..dbd6b3f56210 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,7 @@
>  #include <linux/mm_types.h>
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
> +#include <linux/kobject.h>
>  
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
>  				  struct kobj_attribute *attr, char *buf,
>  				  enum transparent_hugepage_flag flag);
>  extern struct kobj_attribute shmem_enabled_attr;
> +extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  
>  /*
>   * Mask of all large folio orders supported for anonymous THP; all orders up to
> @@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>  }
>  
> +struct thpsize {
> +	struct kobject kobj;
> +	struct list_head node;
> +	int order;
> +};
> +
> +#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> +
>  enum mthp_stat_item {
>  	MTHP_STAT_ANON_FAULT_ALLOC,
>  	MTHP_STAT_ANON_FAULT_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9efb6fefc391..d3080a8843f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
>  static DEFINE_SPINLOCK(huge_anon_orders_lock);
>  static LIST_HEAD(thpsize_list);
>  
> -struct thpsize {
> -	struct kobject kobj;
> -	struct list_head node;
> -	int order;
> -};
> -
> -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> -
>  static ssize_t thpsize_enabled_show(struct kobject *kobj,
>  				    struct kobj_attribute *attr, char *buf)
>  {
> @@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
>  
>  static struct attribute *thpsize_attrs[] = {
>  	&thpsize_enabled_attr.attr,
> +#ifdef CONFIG_SHMEM
> +	&thpsize_shmem_enabled_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a383ea9a89a5..59cc26d44344 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -131,6 +131,14 @@ struct shmem_options {
>  #define SHMEM_SEEN_QUOTA 32
>  };
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long huge_anon_shmem_orders_always __read_mostly;
> +static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
> +static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
> +static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
> +static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
> +#endif
> +
>  #ifdef CONFIG_TMPFS
>  static unsigned long shmem_default_max_blocks(void)
>  {
> @@ -4687,6 +4695,12 @@ void __init shmem_init(void)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>  	else
>  		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> +
> +	/*
> +	 * Default to setting PMD-sized THP to inherit the global setting and
> +	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
> +	 */
> +	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>  #endif
>  	return;
>  
> @@ -4746,6 +4760,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
>  		return -EINVAL;
>  
> +	/* Do not override huge allocation policy with non-PMD sized mTHP */
> +	if (huge == SHMEM_HUGE_FORCE &&
> +	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
> +		return -EINVAL;
> +
>  	shmem_huge = huge;
>  	if (shmem_huge > SHMEM_HUGE_DENY)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> @@ -4753,6 +4772,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  }
>  
>  struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
> +
> +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
> +					  struct kobj_attribute *attr, char *buf)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	const char *output;
> +
> +	if (test_bit(order, &huge_anon_shmem_orders_always))
> +		output = "[always] inherit within_size advise never deny [force]";
> +	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
> +		output = "always [inherit] within_size advise never deny force";
> +	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
> +		output = "always inherit [within_size] advise never deny force";
> +	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
> +		output = "always inherit within_size [advise] never deny force";
> +	else
> +		output = "always inherit within_size advise [never] [deny] force";
> +
> +	return sysfs_emit(buf, "%s\n", output);
> +}
> +
> +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
> +					   struct kobj_attribute *attr,
> +					   const char *buf, size_t count)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	ssize_t ret = count;
> +
> +	if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_always);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "inherit")) {
> +		/* Do not override huge allocation policy with non-PMD sized mTHP */
> +		if (shmem_huge == SHMEM_HUGE_FORCE &&
> +		    order != HPAGE_PMD_ORDER)
> +			return -EINVAL;
> +
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_inherit);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "within_size")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		set_bit(order, &huge_anon_shmem_orders_within_size);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "madvise")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +struct kobj_attribute thpsize_shmem_enabled_attr =
> +	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
>  
>  #else /* !CONFIG_SHMEM */


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config
  2024-05-07 10:21   ` Ryan Roberts
@ 2024-05-08  2:13     ` Baolin Wang
  2024-05-08  9:06       ` Ryan Roberts
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  2:13 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/7 18:21, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Move highest_order() and next_order() out of the CONFIG_TRANSPARENT_HUGEPAGE
>> ifdef so that they can be used as common helpers.
> 
> Sorry if I haven't kept up with the discussion, but why is this needed? I
> wouldn't expect a need to iterate over orders if THP is compile-time disabled
> because we will never try to allocate THP?

Because I don't want to add dummy functions in patch 6 just to avoid build
errors when CONFIG_TRANSPARENT_HUGEPAGE is not enabled. Another thought is
that the page cache can also allocate a large folio even when THP is not
enabled, so these helpers may be used there in the future (not sure though).

Anyway, I have no strong preference for this patch; the dummy functions
below would also work for me:
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c15bebb2cf53..7aa802ee2ce5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -586,6 +586,16 @@ static inline bool thp_migration_supported(void)
  {
         return false;
  }
+
+static inline int highest_order(unsigned long orders)
+{
+        return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+        return 0;
+}
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-07 10:37   ` Ryan Roberts
@ 2024-05-08  3:44     ` Baolin Wang
  2024-05-08  7:15       ` David Hildenbrand
  2024-05-08  8:53       ` Ryan Roberts
  0 siblings, 2 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  3:44 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/7 18:37, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Add support for establishing large folio mappings in finish_fault(), as
>> preparation for supporting multi-size THP allocation of anonymous shmem pages
>> in the following patches.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>   1 file changed, 33 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eea6e4984eae..936377220b77 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>   {
>>   	struct vm_area_struct *vma = vmf->vma;
>>   	struct page *page;
>> +	struct folio *folio;
>>   	vm_fault_t ret;
>>   	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>   		      !(vma->vm_flags & VM_SHARED);
>> +	int type, nr_pages, i;
>> +	unsigned long addr = vmf->address;
>>   
>>   	/* Did we COW the page? */
>>   	if (is_cow)
>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>   			return VM_FAULT_OOM;
>>   	}
>>   
>> +	folio = page_folio(page);
>> +	nr_pages = folio_nr_pages(folio);
>> +
>> +	if (unlikely(userfaultfd_armed(vma))) {
>> +		nr_pages = 1;
>> +	} else if (nr_pages > 1) {
>> +		unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> +		unsigned long end = start + nr_pages * PAGE_SIZE;
>> +
>> +		/* In case the folio size in page cache beyond the VMA limits. */
>> +		addr = max(start, vma->vm_start);
>> +		nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>> +
>> +		page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
> 
> I still don't really follow the logic in this else if block. Isn't it possible
> that finish_fault() gets called with a page from a folio that isn't aligned with
> vmf->address?
> 
> For example, let's say we have a file who's size is 64K and which is cached in a
> single large folio in the page cache. But the file is mapped into a process at
> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate

For shmem, this doesn't happen because the VA is aligned with the 
hugepage size in the shmem_get_unmapped_area() function. See patch 7.

> start=0 and end=64K I think?

Yes. Unfortunately, some file systems that support large mappings do not
perform alignment for multi-size THP (non-PMD sized, for example 64K).
I think this requires modifying
__get_unmapped_area() ---> thp_get_unmapped_area_vmflags() or
file->f_op->get_unmapped_area() to align the VA for multi-size THP in the future.

So before adding those VA alignment changes, only allow building the
large folio mapping for anonymous shmem:

diff --git a/mm/memory.c b/mm/memory.c
index 936377220b77..9e4d51826d23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
         folio = page_folio(page);
         nr_pages = folio_nr_pages(folio);

-       if (unlikely(userfaultfd_armed(vma))) {
+       if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
                 nr_pages = 1;
         } else if (nr_pages > 1) {
                unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);

> Additionally, I think this path will end up mapping the entire folio (as long as
> it fits in the VMA). But this bypasses the fault-around configuration. As I
> think I mentioned against the RFC, this will inflate the RSS of the process and
> can cause behavioural changes as a result. I believe the current advice is to
> disable fault-around to prevent this kind of bloat when needed.

With the above change, I do not think this is a problem, since users have
already opted in to mTHP for anonymous shmem.

> It might be that you need a special variant of finish_fault() for shmem?
> 
> 
>> +	}
>>   	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -				      vmf->address, &vmf->ptl);
>> +				       addr, &vmf->ptl);
>>   	if (!vmf->pte)
>>   		return VM_FAULT_NOPAGE;
>>   
>>   	/* Re-check under ptl */
>> -	if (likely(!vmf_pte_changed(vmf))) {
>> -		struct folio *folio = page_folio(page);
>> -		int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> -		set_pte_range(vmf, folio, page, 1, vmf->address);
>> -		add_mm_counter(vma->vm_mm, type, 1);
>> -		ret = 0;
>> -	} else {
>> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +	if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> +		update_mmu_tlb(vma, addr, vmf->pte);
>> +		ret = VM_FAULT_NOPAGE;
>> +		goto unlock;
>> +	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>> +		for (i = 0; i < nr_pages; i++)
>> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>   		ret = VM_FAULT_NOPAGE;
>> +		goto unlock;
>>   	}
>>   
>> +	set_pte_range(vmf, folio, page, nr_pages, addr);
>> +	type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> +	add_mm_counter(vma->vm_mm, type, nr_pages);
>> +	ret = 0;
>> +
>> +unlock:
>>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>   	return ret;
>>   }

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-07 10:52   ` Ryan Roberts
@ 2024-05-08  4:45     ` Baolin Wang
  2024-05-08  7:08       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  4:45 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/7 18:52, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>> with a value similar to the top level 'shmem_enabled', which can be set to:
>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>> "never", "deny", "force". These values follow the same semantics as the top
>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>> to 'always' to keep compatibility.
> 
> We decided at [1] to not allow 'force' for non-PMD-sizes.
> 
> [1]
> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
> 
> However, thinking about this a bit more, I wonder if the decision we made to
> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
> for legacy back compat after all, I don't think there is any use case where
> changing multiple mTHP size controls atomically is actually useful). Applying

Agree. This is also our usage of 'inherit'.

> that pattern here, it means the top level can always take "force" without any
> weird error checking. And we would allow "force" on the PMD-sized control but
> not on the others - again this is easy to error check.
> 
> Does this pattern make more sense? If so, is it too late to change
> hugepages-xxkB/enabled interface?

IMO, this sounds reasonable. Let's see what others think. David?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-07 10:20 ` Ryan Roberts
@ 2024-05-08  5:45   ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  5:45 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/7 18:20, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
> 
> Didn't we agree that "force" would not be supported for now, and would return an
> error when attempting to set it for a non-PMD-size hugepage-XXkb/shmem_enabled
> (or indirectly through inheritance)?

Yes. Sorry, I did not explain it in detail in the cover letter. Please
see patch 5, which you have already commented on.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
  2024-05-07 10:46   ` kernel test robot
@ 2024-05-08  6:03     ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  6:03 UTC (permalink / raw)
  To: kernel test robot, akpm, hughd
  Cc: llvm, oe-kbuild-all, willy, david, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, linux-mm,
	linux-kernel

Hi,

On 2024/5/7 18:46, kernel test robot wrote:
>>> mm/shmem.c:1780:10: warning: variable 'folio' is used uninitialized whenever 'while' loop exits because its condition is false [-Wsometimes-uninitialized]
>      1780 |                 while (suitable_orders) {
>           |                        ^~~~~~~~~~~~~~~
>     mm/shmem.c:1795:7: note: uninitialized use occurs here
>      1795 |         if (!folio)
>           |              ^~~~~
>     mm/shmem.c:1780:10: note: remove the condition if it is always true
>      1780 |                 while (suitable_orders) {
>           |                        ^~~~~~~~~~~~~~~
>           |                        1
>     mm/shmem.c:1750:21: note: initialize the variable 'folio' to silence this warning
>      1750 |         struct folio *folio;
>           |                            ^
>           |                             = NULL
>     mm/shmem.c:1564:20: warning: unused function 'shmem_show_mpol' [-Wunused-function]
>      1564 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)

Thanks for reporting. I will add the change below to avoid the warning:
diff --git a/mm/shmem.c b/mm/shmem.c
index d603e36e0f4f..fd2cb2e73a21 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1747,7 +1747,7 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
         struct shmem_inode_info *info = SHMEM_I(inode);
         struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
         unsigned long suitable_orders;
-       struct folio *folio;
+       struct folio *folio = NULL;
         long pages;
         int error, order;

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  4:45     ` Baolin Wang
@ 2024-05-08  7:08       ` David Hildenbrand
  2024-05-08  7:12         ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08  7:08 UTC (permalink / raw)
  To: Baolin Wang, Ryan Roberts, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 06:45, Baolin Wang wrote:
> 
> 
> On 2024/5/7 18:52, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>> "never", "deny", "force". These values follow the same semantics as the top
>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>> to 'always' to keep compatibility.
>>
>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>
>> However, thinking about this a bit more, I wonder if the decision we made to
>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>> for legacy back compat after all, I don't think there is any use case where
>> changing multiple mTHP size controls atomically is actually useful). Applying
> 
> Agree. This is also our usage of 'inherit'.
> 
>> that pattern here, it means the top level can always take "force" without any
>> weird error checking. And we would allow "force" on the PMD-sized control but
>> not on the others - again this is easy to error check.
>>
>> Does this pattern make more sense? If so, is it too late to change
>> hugepages-xxkB/enabled interface?
> 
> IMO, this sounds reasonable. Let's see what others think. David?

Likely too late, and we should try not to diverge too much from "enabled"
for "shmem_enabled".

That said, to me it's much cleaner to just have "inherit" for all
sizes and disallow invalid configurations, as discussed.

Error checking cannot be too hard unless I am missing something important :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  7:08       ` David Hildenbrand
@ 2024-05-08  7:12         ` David Hildenbrand
  2024-05-08  9:02           ` Ryan Roberts
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08  7:12 UTC (permalink / raw)
  To: Baolin Wang, Ryan Roberts, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 09:08, David Hildenbrand wrote:
> On 08.05.24 06:45, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>> to 'always' to keep compatibility.
>>>
>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>
>>> However, thinking about this a bit more, I wonder if the decision we made to
>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>> for legacy back compat after all, I don't think there is any use case where
>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>
>> Agree. This is also our usage of 'inherit'.

Missed that one: there might be use cases in the future once we start
defaulting to "inherit" for all knobs (a distro might default to that) and
default-enable THP in the global knob. Then it would be easy to disable any
THP by disabling the global knob. (I think that's the future we're heading
towards, where we'd have an "auto" mode that can be set on the global
toggle.)

But I am just making up use cases ;) I think it will be valuable and 
just doing it consistently now might be cleaner.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08  3:44     ` Baolin Wang
@ 2024-05-08  7:15       ` David Hildenbrand
  2024-05-08  9:06         ` Baolin Wang
  2024-05-08  8:53       ` Ryan Roberts
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08  7:15 UTC (permalink / raw)
  To: Baolin Wang, Ryan Roberts, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 05:44, Baolin Wang wrote:
> 
> 
> On 2024/5/7 18:37, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> Add support for establishing large folio mappings in finish_fault(), as
>>> preparation for supporting multi-size THP allocation of anonymous shmem pages
>>> in the following patches.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>    mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>    1 file changed, 33 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index eea6e4984eae..936377220b77 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>    {
>>>    	struct vm_area_struct *vma = vmf->vma;
>>>    	struct page *page;
>>> +	struct folio *folio;
>>>    	vm_fault_t ret;
>>>    	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>    		      !(vma->vm_flags & VM_SHARED);
>>> +	int type, nr_pages, i;
>>> +	unsigned long addr = vmf->address;
>>>    
>>>    	/* Did we COW the page? */
>>>    	if (is_cow)
>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>    			return VM_FAULT_OOM;
>>>    	}
>>>    
>>> +	folio = page_folio(page);
>>> +	nr_pages = folio_nr_pages(folio);
>>> +
>>> +	if (unlikely(userfaultfd_armed(vma))) {
>>> +		nr_pages = 1;
>>> +	} else if (nr_pages > 1) {
>>> +		unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> +		unsigned long end = start + nr_pages * PAGE_SIZE;
>>> +
>>> +		/* In case the folio size in page cache beyond the VMA limits. */
>>> +		addr = max(start, vma->vm_start);
>>> +		nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>> +
>>> +		page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>
>> I still don't really follow the logic in this else if block. Isn't it possible
>> that finish_fault() gets called with a page from a folio that isn't aligned with
>> vmf->address?
>>
>> For example, let's say we have a file who's size is 64K and which is cached in a
>> single large folio in the page cache. But the file is mapped into a process at
>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
> 
> For shmem, this doesn't happen because the VA is aligned with the
> hugepage size in the shmem_get_unmapped_area() function. See patch 7.

Does that cover mremap() and MAP_FIXED as well?

We should try doing this as cleanly as possible, to prepare for the 
future / corner cases.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08  3:44     ` Baolin Wang
  2024-05-08  7:15       ` David Hildenbrand
@ 2024-05-08  8:53       ` Ryan Roberts
  2024-05-08  9:31         ` Baolin Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08  8:53 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 04:44, Baolin Wang wrote:
> 
> 
> On 2024/5/7 18:37, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> Add support for establishing large folio mappings in finish_fault(), as
>>> preparation for supporting multi-size THP allocation of anonymous shmem pages
>>> in the following patches.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>   1 file changed, 33 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index eea6e4984eae..936377220b77 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>   {
>>>       struct vm_area_struct *vma = vmf->vma;
>>>       struct page *page;
>>> +    struct folio *folio;
>>>       vm_fault_t ret;
>>>       bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>                 !(vma->vm_flags & VM_SHARED);
>>> +    int type, nr_pages, i;
>>> +    unsigned long addr = vmf->address;
>>>         /* Did we COW the page? */
>>>       if (is_cow)
>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>               return VM_FAULT_OOM;
>>>       }
>>>   +    folio = page_folio(page);
>>> +    nr_pages = folio_nr_pages(folio);
>>> +
>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>> +        nr_pages = 1;
>>> +    } else if (nr_pages > 1) {
>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>> +
>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>> +        addr = max(start, vma->vm_start);
>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>> +
>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>
>> I still don't really follow the logic in this else if block. Isn't it possible
>> that finish_fault() gets called with a page from a folio that isn't aligned with
>> vmf->address?
>>
>> For example, let's say we have a file who's size is 64K and which is cached in a
>> single large folio in the page cache. But the file is mapped into a process at
>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
> 
> For shmem, this doesn't happen because the VA is aligned with the hugepage size
> in the shmem_get_unmapped_area() function. See patch 7.

Certainly agree that shmem can always make sure that it packs a vma in a way
such that its folios are naturally aligned in VA when faulting in memory. If you
mremap it, that alignment will be lost; I don't think that would be a problem
for a single process; mremap will take care of moving the ptes correctly and
this path is not involved.

But what about the case when a process mmaps a shmem region, then forks, then
the child mremaps the shmem region. Then the parent faults in a THP into the
region (nicely aligned). Then the child faults in the same offset in the region
and gets the THP that the parent allocated; that THP will be aligned in the
parent's VM space but not in the child's.
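
Something like this hypothetical userspace sketch illustrates that flow
(not a test from the series; the sleep() is a crude stand-in for proper
parent/child synchronisation, and the mremap() target is merely
page-aligned, so almost certainly not THP-aligned):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD-sized extent */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	if (fork() == 0) {
		/* Child: move its copy of the shared region to an address
		 * that is page-aligned but not THP-aligned. */
		char *area = mmap(NULL, 2 * len, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *q;

		if (area == MAP_FAILED)
			return 1;
		q = mremap(p, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
			   area + 4096);
		if (q == MAP_FAILED)
			return 1;
		sleep(1);	/* let the parent fault in the THP first */
		q[0] = 1;	/* same file offset, now VA-misaligned   */
		return 0;
	}

	p[0] = 1;		/* parent fault: folio allocated VA-aligned */
	wait(NULL);
	return 0;
}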

> 
>> start=0 and end=64K I think?
> 
> Yes. Unfortunately, some file systems that support large mappings do not perform
> alignment for multi-size THP (non-PMD sized, for example 64K). I think this
> requires modifying __get_unmapped_area() ---> thp_get_unmapped_area_vmflags()
> or file->f_op->get_unmapped_area() to align the VA for multi-size THP in the
> future.

By nature of the fact that a file mapping is shared between multiple processes
and each process can map it wherever it wants down to 1-page granularity, it's
impossible for any THP containing a part of that file to be VA-aligned in every
process it is mapped in.

> 
> So before adding those VA alignment changes, only allow building the large folio
> mapping for anonymous shmem:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 936377220b77..9e4d51826d23 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>         folio = page_folio(page);
>         nr_pages = folio_nr_pages(folio);
> 
> -       if (unlikely(userfaultfd_armed(vma))) {
> +       if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {

If the above theoretical flow for fork & mremap is valid, then I don't think
this is sufficient.

>                 nr_pages = 1;
>         } else if (nr_pages > 1) {
>                 unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> 
>> Additionally, I think this path will end up mapping the entire folio (as long as
>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>> think I mentioned against the RFC, this will inflate the RSS of the process and
>> can cause behavioural changes as a result. I believe the current advice is to
>> disable fault-around to prevent this kind of bloat when needed.
> 
> With the above change, I do not think this is a problem, since users have already
> opted in to mTHP for anonymous shmem.
> 
>> It might be that you need a special variant of finish_fault() for shmem?
>>
>>
>>> +    }
>>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> -                      vmf->address, &vmf->ptl);
>>> +                       addr, &vmf->ptl);
>>>       if (!vmf->pte)
>>>           return VM_FAULT_NOPAGE;
>>>         /* Re-check under ptl */
>>> -    if (likely(!vmf_pte_changed(vmf))) {
>>> -        struct folio *folio = page_folio(page);
>>> -        int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>> -
>>> -        set_pte_range(vmf, folio, page, 1, vmf->address);
>>> -        add_mm_counter(vma->vm_mm, type, 1);
>>> -        ret = 0;
>>> -    } else {
>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>> +    if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>> +        update_mmu_tlb(vma, addr, vmf->pte);
>>> +        ret = VM_FAULT_NOPAGE;
>>> +        goto unlock;
>>> +    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>> +        for (i = 0; i < nr_pages; i++)
>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>           ret = VM_FAULT_NOPAGE;
>>> +        goto unlock;
>>>       }
>>>   +    set_pte_range(vmf, folio, page, nr_pages, addr);
>>> +    type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>> +    add_mm_counter(vma->vm_mm, type, nr_pages);
>>> +    ret = 0;
>>> +
>>> +unlock:
>>>       pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>       return ret;
>>>   }


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  7:12         ` David Hildenbrand
@ 2024-05-08  9:02           ` Ryan Roberts
  2024-05-08  9:56             ` Baolin Wang
  2024-05-08 12:02             ` David Hildenbrand
  0 siblings, 2 replies; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08  9:02 UTC (permalink / raw)
  To: David Hildenbrand, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 08:12, David Hildenbrand wrote:
> On 08.05.24 09:08, David Hildenbrand wrote:
>> On 08.05.24 06:45, Baolin Wang wrote:
>>>
>>>
>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>> "advise",
>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>> to 'always' to keep compatibility.
>>>>
>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>
>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>> for legacy back compat after all, I don't think there is any use case where
>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>
>>> Agree. This is also our usage of 'inherit'.
> 
> Missed that one: there might be use cases in the future once we would start
> defaulting to "inherit" for all knobs (a distro might default to that) and
> default-enable THP in the global knob. Then, it would be easy to disable any THP
> by disabling the global knob. (I think that's the future we're heading to, where
> we'd have an "auto" mode that can be set on the global toggle).
> 
> But I am just making up use cases ;) I think it will be valuable and just doing
> it consistently now might be cleaner.

I agree that consistency between enabled and shmem_enabled is top priority. And
yes, I had forgotten about the glorious "auto" future. So probably continuing to
allow all sizes to select "inherit" is best.

But for shmem_enabled, that means we need the following error checking:

 - It is an error to set "force" for any size except PMD-size

 - It is an error to set "force" for the global control if any size except PMD-
   size is set to "inherit"

 - It is an error to set "inherit" for any size except PMD-size if the global
   control is set to "force".

Certainly not too difficult to code and prove to be correct, but not the nicest
UX from the user's point of view when they start seeing errors.
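
A rough sketch of those three checks, using hypothetical helper and
constant names (SHMEM_HUGE_INHERIT is made up here; this is not the
actual patch):

/* Validating a write to a per-size shmem_enabled control (rules 1, 3). */
static int check_per_size_val(bool is_pmd_size, int new_val, int global_val)
{
	if (new_val == SHMEM_HUGE_FORCE && !is_pmd_size)
		return -EINVAL;
	if (new_val == SHMEM_HUGE_INHERIT && !is_pmd_size &&
	    global_val == SHMEM_HUGE_FORCE)
		return -EINVAL;
	return 0;
}

/* Validating a write to the global control (rule 2). */
static int check_global_val(int new_val, bool any_non_pmd_inherits)
{
	if (new_val == SHMEM_HUGE_FORCE && any_non_pmd_inherits)
		return -EINVAL;
	return 0;
}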

I think we previously said this would likely be temporary, and if/when tmpfs
gets mTHP support, we could simplify and allow all sizes to be set to "force".
But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
more suited to the approach the page cache takes to transparently ramp up the
folio size as it faults more in. (Just saying there is a chance that this error
checking becomes permanent).


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config
  2024-05-08  2:13     ` Baolin Wang
@ 2024-05-08  9:06       ` Ryan Roberts
  2024-05-08  9:40         ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08  9:06 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 03:13, Baolin Wang wrote:
> 
> 
> On 2024/5/7 18:21, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> Move highest_order() and next_order() out of the CONFIG_TRANSPARENT_HUGEPAGE
>>> macro, which can be common functions to be used.
>>
>> Sorry if I haven't kept up with the discussion, but why is this needed? I
>> wouldn't expect a need to iterate over orders if THP is compile-time disabled
>> because we will never try to allocate THP?
> 
> Because I don't want to add dummy functions just to avoid build errors when
> CONFIG_TRANSPARENT_HUGEPAGE is not enabled in patch 6. Another thought is that
> the pagecache can also allocate a large folio even when THP is not enabled, so
> these helpers may be used in the future (not sure though).

OK, I'll admit I haven't looked at the later patches yet - I'd like to conclude
on the interface and mapping/alignment strategy first.

But it wasn't necessary to access these functions for the anon/private case
without CONFIG_TRANSPARENT_HUGEPAGE, so I'm wondering why it's needed for the
shmem case. I would expect that they don't need to be defined at all.

> 
> Anyway, I also have no strong perference for this patch, below dummy functions
> can also work for me:
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c15bebb2cf53..7aa802ee2ce5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -586,6 +586,16 @@ static inline bool thp_migration_supported(void)
>  {
>         return false;
>  }
> +
> +static inline int highest_order(unsigned long orders)
> +{
> +        return 0;
> +}
> +
> +static inline int next_order(unsigned long *orders, int prev)
> +{
> +        return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08  7:15       ` David Hildenbrand
@ 2024-05-08  9:06         ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  9:06 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 15:15, David Hildenbrand wrote:
> On 08.05.24 05:44, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> Add large folio mapping establishment support for finish_fault() as 
>>>> a preparation,
>>>> to support multi-size THP allocation of anonymous shmem pages in the 
>>>> following
>>>> patches.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>    1 file changed, 33 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..936377220b77 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>    {
>>>>        struct vm_area_struct *vma = vmf->vma;
>>>>        struct page *page;
>>>> +    struct folio *folio;
>>>>        vm_fault_t ret;
>>>>        bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>                  !(vma->vm_flags & VM_SHARED);
>>>> +    int type, nr_pages, i;
>>>> +    unsigned long addr = vmf->address;
>>>>        /* Did we COW the page? */
>>>>        if (is_cow)
>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>                return VM_FAULT_OOM;
>>>>        }
>>>> +    folio = page_folio(page);
>>>> +    nr_pages = folio_nr_pages(folio);
>>>> +
>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>> +        nr_pages = 1;
>>>> +    } else if (nr_pages > 1) {
>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * 
>>>> PAGE_SIZE);
>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>> +
>>>> +        /* In case the folio size in page cache beyond the VMA 
>>>> limits. */
>>>> +        addr = max(start, vma->vm_start);
>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>> +
>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>
>>> I still don't really follow the logic in this else if block. Isn't it 
>>> possible
>>> that finish_fault() gets called with a page from a folio that isn't 
>>> aligned with
>>> vmf->address?
>>>
>>> For example, let's say we have a file who's size is 64K and which is 
>>> cached in a
>>> single large folio in the page cache. But the file is mapped into a 
>>> process at
>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You 
>>> will calculate
>>
>> For shmem, this doesn't happen because the VA is aligned with the
>> hugepage size in the shmem_get_unmapped_area() function. See patch 7.
> 
> Does that cover mremap() and MAP_FIXED as well?

Good point. Thanks for pointing this out.

> We should try doing this as cleanly as possible, to prepare for the 
> future / corner cases.

Sure. Let me re-think about the algorithm.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08  8:53       ` Ryan Roberts
@ 2024-05-08  9:31         ` Baolin Wang
  2024-05-08 10:47           ` Ryan Roberts
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  9:31 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 16:53, Ryan Roberts wrote:
> On 08/05/2024 04:44, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> Add large folio mapping establishment support for finish_fault() as a
>>>> preparation,
>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>> patches.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>    1 file changed, 33 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..936377220b77 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>    {
>>>>        struct vm_area_struct *vma = vmf->vma;
>>>>        struct page *page;
>>>> +    struct folio *folio;
>>>>        vm_fault_t ret;
>>>>        bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>                  !(vma->vm_flags & VM_SHARED);
>>>> +    int type, nr_pages, i;
>>>> +    unsigned long addr = vmf->address;
>>>>          /* Did we COW the page? */
>>>>        if (is_cow)
>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>                return VM_FAULT_OOM;
>>>>        }
>>>>    +    folio = page_folio(page);
>>>> +    nr_pages = folio_nr_pages(folio);
>>>> +
>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>> +        nr_pages = 1;
>>>> +    } else if (nr_pages > 1) {
>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>> +
>>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>>> +        addr = max(start, vma->vm_start);
>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>> +
>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>
>>> I still don't really follow the logic in this else if block. Isn't it possible
>>> that finish_fault() gets called with a page from a folio that isn't aligned with
>>> vmf->address?
>>>
>>> For example, let's say we have a file who's size is 64K and which is cached in a
>>> single large folio in the page cache. But the file is mapped into a process at
>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will calculate
>>
>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>> in the shmem_get_unmapped_area() function. See patch 7.
> 
> Certainly agree that shmem can always make sure that it packs a vma in a way
> such that its folios are naturally aligned in VA when faulting in memory. If you
> mremap it, that alignment will be lost; I don't think that would be a problem

When mremapping it, it will also call shmem_get_unmapped_area() to align 
the VA, but for mremap() with the MAP_FIXED flag, as David pointed out, 
yes, this patch may not work perfectly.

> for a single process; mremap will take care of moving the ptes correctly and
> this path is not involved.
> 
> But what about the case when a process mmaps a shmem region, then forks, then
> the child mremaps the shmem region. Then the parent faults in a THP into the
> region (nicely aligned). Then the child faults in the same offset in the region
> and gets the THP that the parent allocated; that THP will be aligned in the
> parent's VM space but not in the child's.

Sorry, I did not get your point here. IIUC, the child's VA will also be 
aligned if the child's mremap() does not use MAP_FIXED, since the 
child's mremap will still call shmem_get_unmapped_area() to find an 
aligned new VA. Please correct me if I missed your point.

>>> start=0 and end=64K I think?
>>
>> Yes. Unfortunately, some file systems that support large mappings do not perform
>> alignment for multi-size THP (non-PMD sized, for example: 64K). I think this
>> requires modification to __get_unmapped_area--->thp_get_unmapped_area_vmflags()
>> or file->f_op->get_unmapped_area() to align VA for multi-size THP in future.
> 
> By nature of the fact that a file mapping is shared between multiple processes
> and each process can map it wherever it wants down to 1 page granularity, it's
> impossible for any THP containing a part of that file to be VA-aligned in every
> process it is mapped in.

Yes, so let me re-polish this patch. Thanks.

>> So before adding that VA alignment changes, only allow building the large folio
>> mapping for anonymous shmem:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 936377220b77..9e4d51826d23 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>          folio = page_folio(page);
>>          nr_pages = folio_nr_pages(folio);
>>
>> -       if (unlikely(userfaultfd_armed(vma))) {
>> +       if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
> 
> If the above theoretical flow for fork & mremap is valid, then I don't think
> this is sufficient.
> 
>>                  nr_pages = 1;
>>          } else if (nr_pages > 1) {
>>                  unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
>> PAGE_SIZE);
>>
>>> Additionally, I think this path will end up mapping the entire folio (as long as
>>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>>> think I mentioned against the RFC, this will inflate the RSS of the process and
>>> can cause behavioural changes as a result. I believe the current advice is to
>>> disable fault-around to prevent this kind of bloat when needed.
>>
>> With the above change, I do not think this is a problem, since users already
>> want to use mTHP for anonymous shmem.
>>
>>> It might be that you need a special variant of finish_fault() for shmem?
>>>
>>>
>>>> +    }
>>>>        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>>> -                      vmf->address, &vmf->ptl);
>>>> +                       addr, &vmf->ptl);
>>>>        if (!vmf->pte)
>>>>            return VM_FAULT_NOPAGE;
>>>>          /* Re-check under ptl */
>>>> -    if (likely(!vmf_pte_changed(vmf))) {
>>>> -        struct folio *folio = page_folio(page);
>>>> -        int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>> -
>>>> -        set_pte_range(vmf, folio, page, 1, vmf->address);
>>>> -        add_mm_counter(vma->vm_mm, type, 1);
>>>> -        ret = 0;
>>>> -    } else {
>>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>> +    if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>>> +        update_mmu_tlb(vma, addr, vmf->pte);
>>>> +        ret = VM_FAULT_NOPAGE;
>>>> +        goto unlock;
>>>> +    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>>> +        for (i = 0; i < nr_pages; i++)
>>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>>            ret = VM_FAULT_NOPAGE;
>>>> +        goto unlock;
>>>>        }
>>>>    +    set_pte_range(vmf, folio, page, nr_pages, addr);
>>>> +    type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>> +    add_mm_counter(vma->vm_mm, type, nr_pages);
>>>> +    ret = 0;
>>>> +
>>>> +unlock:
>>>>        pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>        return ret;
>>>>    }

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config
  2024-05-08  9:06       ` Ryan Roberts
@ 2024-05-08  9:40         ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  9:40 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 17:06, Ryan Roberts wrote:
> On 08/05/2024 03:13, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:21, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> Move highest_order() and next_order() out of the CONFIG_TRANSPARENT_HUGEPAGE
>>>> macro, which can be common functions to be used.
>>>
>>> Sorry if I haven't kept up with the discussion, but why is this needed? I
>>> wouldn't expect a need to iterate over orders if THP is compile-time disabled
>>> because we will never try to allocate THP?
>>
>> Because I don't want to add dummy functions just to avoid build errors when
>> CONFIG_TRANSPARENT_HUGEPAGE is not enabled in patch 6. Another thought is that
>> the pagecache can also allocate a large folio even when THP is not enabled, so
>> these helpers may be used in the future (not sure though).
> 
> OK, I'll admit I haven't looked at the later patches yet - I'd like to conclude
> on the interface and mapping/alignment strategy first.
> 
> But it wasn't necessary to access these functions for the anon/private case
> without CONFIG_TRANSPARENT_HUGEPAGE, so I'm wondering why it's needed for the
> shmem case. I would expect that they don't need to be defined at all.

Currently in the shmem_alloc_and_add_folio() function, the hugepage 
allocation is not guarded by '#ifdef CONFIG_TRANSPARENT_HUGEPAGE', but 
rather by 'IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)', which can lead to 
some build errors when CONFIG_TRANSPARENT_HUGEPAGE is not enabled. 
However, this is not a big issue, and I will make some adjustments to 
avoid defining dummy functions.
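
To illustrate the difference in a standalone way (CONFIG_THP and
highest_order() below are stand-ins for the real kernel symbols): an
IS_ENABLED()-style condition is still compiled and only later dead-code
eliminated, so every function it calls must be declared, while '#ifdef'd
code is discarded by the preprocessor:

#include <stdio.h>

#define IS_ENABLED(option)	(option)
#define CONFIG_THP		0  /* pretend CONFIG_TRANSPARENT_HUGEPAGE=n */

static int highest_order(unsigned long orders)
{
	return 63 - __builtin_clzl(orders);
}

int main(void)
{
	unsigned long orders = 1UL << 4;
	int order = -1;

	/* Compiled in both configurations: highest_order() must exist even
	 * though this branch is dead when CONFIG_THP is 0. */
	if (IS_ENABLED(CONFIG_THP))
		order = highest_order(orders);

#if CONFIG_THP
	/* Discarded by the preprocessor when CONFIG_THP is 0, so no
	 * definition of highest_order() would be needed for this line. */
	order = highest_order(orders);
#endif

	printf("order=%d\n", order);
	return 0;
}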

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  9:02           ` Ryan Roberts
@ 2024-05-08  9:56             ` Baolin Wang
  2024-05-08 10:48               ` Ryan Roberts
  2024-05-08 12:02             ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2024-05-08  9:56 UTC (permalink / raw)
  To: Ryan Roberts, David Hildenbrand, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 17:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
> 
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
> 
> But for shmem_enabled, that means we need the following error checking:
> 
>   - It is an error to set "force" for any size except PMD-size
> 
>   - It is an error to set "force" for the global control if any size except PMD-
>     size is set to "inherit"
> 
>   - It is an error to set "inherit" for any size except PMD-size if the global
>     control is set to "force".
> 
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
> 
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).

The strategy for tmpfs supporting mTHP will require more discussion and 
evaluation in the future. However, regardless of the strategy (explicit 
mTHP control or page cache control), I think it would be possible to use 
'force' to override previous strategies for testing purposes. This 
appears to be permissible according to the explanation in the current 
documentation: "force the huge option on for all - very useful for 
testing". So it seems this would not be permanent?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08  9:31         ` Baolin Wang
@ 2024-05-08 10:47           ` Ryan Roberts
  2024-05-09  1:10             ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 10:47 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 10:31, Baolin Wang wrote:
> 
> 
> On 2024/5/8 16:53, Ryan Roberts wrote:
>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>
>>>
>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>> Add large folio mapping establishment support for finish_fault() as a
>>>>> preparation,
>>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>>> patches.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>>    mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>    1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index eea6e4984eae..936377220b77 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>    {
>>>>>        struct vm_area_struct *vma = vmf->vma;
>>>>>        struct page *page;
>>>>> +    struct folio *folio;
>>>>>        vm_fault_t ret;
>>>>>        bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>                  !(vma->vm_flags & VM_SHARED);
>>>>> +    int type, nr_pages, i;
>>>>> +    unsigned long addr = vmf->address;
>>>>>          /* Did we COW the page? */
>>>>>        if (is_cow)
>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>                return VM_FAULT_OOM;
>>>>>        }
>>>>>    +    folio = page_folio(page);
>>>>> +    nr_pages = folio_nr_pages(folio);
>>>>> +
>>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>>> +        nr_pages = 1;
>>>>> +    } else if (nr_pages > 1) {
>>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>> +
>>>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>>>> +        addr = max(start, vma->vm_start);
>>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>> +
>>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>
>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>> with
>>>> vmf->address?
>>>>
>>>> For example, let's say we have a file who's size is 64K and which is cached
>>>> in a
>>>> single large folio in the page cache. But the file is mapped into a process at
>>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
>>>> calculate
>>>
>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>> in the shmem_get_unmapped_area() function. See patch 7.
>>
>> Certainly agree that shmem can always make sure that it packs a vma in a way
>> such that its folios are naturally aligned in VA when faulting in memory. If you
>> mremap it, that alignment will be lost; I don't think that would be a problem
> 
> When mremapping it, it will also call shmem_get_unmapped_area() to align the
> VA, but for mremap() with the MAP_FIXED flag, as David pointed out, yes, this
> patch may not work perfectly.

Assuming this works similarly to anon mTHP, remapping to an arbitrary address
shouldn't be a problem within a single process; the previously allocated folios
will now be unaligned, but they will be correctly mapped so it doesn't break
anything. And new faults will allocate folios so that they are as large as
allowed by the sysfs interface AND which do not overlap with any non-none pte
AND which are naturally aligned. It's when you start sharing with other
processes that the fun and games start...
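
A simplified sketch of that allocation rule (hypothetical names; the
real logic lives in the fault path): take the largest enabled order
whose naturally aligned range fits in the VMA and covers only none ptes:

#include <stdbool.h>

#define PAGE_SIZE	4096UL

static int pick_order(unsigned long addr, unsigned long vm_start,
		      unsigned long vm_end, unsigned long enabled_orders,
		      bool (*range_is_none)(unsigned long start,
					    unsigned long end))
{
	if (!enabled_orders)
		return 0;

	for (int order = 63 - __builtin_clzl(enabled_orders); order > 0;
	     order--) {
		unsigned long size, start;

		if (!(enabled_orders & (1UL << order)))
			continue;
		size = (1UL << order) * PAGE_SIZE;
		start = addr & ~(size - 1);	/* natural alignment */
		if (start < vm_start || start + size > vm_end)
			continue;		/* exceeds the VMA limits */
		if (!range_is_none(start, start + size))
			continue;		/* overlaps mapped ptes */
		return order;
	}
	return 0;				/* fall back to a single page */
}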

> 
>> for a single process; mremap will take care of moving the ptes correctly and
>> this path is not involved.
>>
>> But what about the case when a process mmaps a shmem region, then forks, then
>> the child mremaps the shmem region. Then the parent faults in a THP into the
>> region (nicely aligned). Then the child faults in the same offset in the region
>> and gets the THP that the parent allocated; that THP will be aligned in the
>> parent's VM space but not in the child's.
> 
> Sorry, I did not get your point here. IIUC, the child's VA will also be aligned
> if the child's mremap() does not use MAP_FIXED, since the child's mremap will
> still call shmem_get_unmapped_area() to find an aligned new VA.

In general, you shouldn't be relying on the vma bounds being aligned to a THP
boundary.

> Please correct me if I missed your point.

(I'm not 100% sure this is definitely how it works, but it seems the only sane
way to me):

Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:

  mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)

Then it forks a child, and the child moves the mapping to VA=68K:

  mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)

Then the parent writes to address 64K (offset 0 in the shared region); this will
fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
at 64K-80K in the parent.

Then the child reads address 68K (offset 0 in the shared region); this will
fault and cause the previously allocated 16K folio to be looked up and it must
be mapped in the child between 68K-84K. This is not naturally aligned in the child.

For the child, your code will incorrectly calculate start/end as 64K-80K.
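
To make the arithmetic concrete, here is a sketch that plugs the values
from this example into the patch's calculation (PAGE_SIZE assumed 4K):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long vm_start = 68 << 10, vm_end = 84 << 10; /* child VMA */
	unsigned long address = 68 << 10;	/* fault at region offset 0 */
	unsigned long nr_pages = 4;		/* 16K folio */
	unsigned long start = ALIGN_DOWN(address, nr_pages * PAGE_SIZE);
	unsigned long end = start + nr_pages * PAGE_SIZE;
	unsigned long addr = start > vm_start ? start : vm_start;
	unsigned long nr = ((end < vm_end ? end : vm_end) - addr) / PAGE_SIZE;
	unsigned long first = (addr - start) / PAGE_SIZE;

	/* Prints start=64K end=80K addr=68K nr=3 first=1: VA 68K holds file
	 * offset 0, yet folio page 1 would be mapped there. */
	printf("start=%luK end=%luK addr=%luK nr=%lu first=%lu\n",
	       start >> 10, end >> 10, addr >> 10, nr, first);
	return 0;
}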

> 
>>>> start=0 and end=64K I think?
>>>
>>> Yes. Unfortunately, some file systems that support large mappings do not perform
>>> alignment for multi-size THP (non-PMD sized, for example: 64K). I think this
>>> requires modification to __get_unmapped_area--->thp_get_unmapped_area_vmflags()
>>> or file->f_op->get_unmapped_area() to align VA for multi-size THP in future.
>>
>> By nature of the fact that a file mapping is shared between multiple processes
>> and each process can map it where ever it wants down to 1 page granularity, its
>> impossible for any THP containing a part of that file to be VA-aligned in every
>> process it is mapped in.
> 
> Yes, so let me re-polish this patch. Thanks.
> 
>>> So before adding that VA alignment changes, only allow building the large folio
>>> mapping for anonymous shmem:
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 936377220b77..9e4d51826d23 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4786,7 +4786,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>          folio = page_folio(page);
>>>          nr_pages = folio_nr_pages(folio);
>>>
>>> -       if (unlikely(userfaultfd_armed(vma))) {
>>> +       if (unlikely(userfaultfd_armed(vma)) || !vma_is_anon_shmem(vma)) {
>>
>> If the above theoretical flow for fork & mremap is valid, then I don't think
>> this is sufficient.
>>
>>>                  nr_pages = 1;
>>>          } else if (nr_pages > 1) {
>>>                  unsigned long start = ALIGN_DOWN(vmf->address, nr_pages *
>>> PAGE_SIZE);
>>>
>>>> Additionally, I think this path will end up mapping the entire folio (as
>>>> long as
>>>> it fits in the VMA). But this bypasses the fault-around configuration. As I
>>>> think I mentioned against the RFC, this will inflate the RSS of the process and
>>>> can cause behavioural changes as a result. I believe the current advice is to
>>>> disable fault-around to prevent this kind of bloat when needed.
>>>
>>> With the above change, I do not think this is a problem, since users already
>>> want to use mTHP for anonymous shmem.
>>>
>>>> It might be that you need a special variant of finish_fault() for shmem?
>>>>
>>>>
>>>>> +    }
>>>>>        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>>>> -                      vmf->address, &vmf->ptl);
>>>>> +                       addr, &vmf->ptl);
>>>>>        if (!vmf->pte)
>>>>>            return VM_FAULT_NOPAGE;
>>>>>          /* Re-check under ptl */
>>>>> -    if (likely(!vmf_pte_changed(vmf))) {
>>>>> -        struct folio *folio = page_folio(page);
>>>>> -        int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>>> -
>>>>> -        set_pte_range(vmf, folio, page, 1, vmf->address);
>>>>> -        add_mm_counter(vma->vm_mm, type, 1);
>>>>> -        ret = 0;
>>>>> -    } else {
>>>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>>> +    if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>>>>> +        update_mmu_tlb(vma, addr, vmf->pte);
>>>>> +        ret = VM_FAULT_NOPAGE;
>>>>> +        goto unlock;
>>>>> +    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>>>> +        for (i = 0; i < nr_pages; i++)
>>>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>>>            ret = VM_FAULT_NOPAGE;
>>>>> +        goto unlock;
>>>>>        }
>>>>>    +    set_pte_range(vmf, folio, page, nr_pages, addr);
>>>>> +    type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>>>>> +    add_mm_counter(vma->vm_mm, type, nr_pages);
>>>>> +    ret = 0;
>>>>> +
>>>>> +unlock:
>>>>>        pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>        return ret;
>>>>>    }


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  9:56             ` Baolin Wang
@ 2024-05-08 10:48               ` Ryan Roberts
  0 siblings, 0 replies; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 10:48 UTC (permalink / raw)
  To: Baolin Wang, David Hildenbrand, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 10:56, Baolin Wang wrote:
> 
> 
> On 2024/5/8 17:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>> one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>> just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>>   - It is an error to set "force" for any size except PMD-size
>>
>>   - It is an error to set "force" for the global control if any size except PMD-
>>     size is set to "inherit"
>>
>>   - It is an error to set "inherit" for any size except PMD-size if the global
>>     control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
> 
> The strategy for tmpfs supporting mTHP will require more discussion and
> evaluation in the future. However, regardless of the strategy (explicit mTHP
> control or page cache control), I think it would be possible to use 'force' to
> override previous strategies for testing purposes. This appears to be
> permissible according to the explanation in the current documentation: "force
> the huge option on for all - very useful for testing". So it seems this would
> not be permanent?

Yeah ok, makes sense to me.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
       [not found] ` <CGME20240508113934eucas1p13a3972f3f9955365f40155e084a7c7d5@eucas1p1.samsung.com>
@ 2024-05-08 11:39   ` Daniel Gomez
  2024-05-08 11:58     ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Daniel Gomez @ 2024-05-08 11:39 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, linux-mm,
	linux-kernel

On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.

I'm trying to understand the adoption of mTHP and how it fits into the adoption
of (large) folios that the kernel is moving towards. Can you, or anyone involved
here, explain this? How much do they overlap, and can we benefit from having
both? Is there any argument against the adoption of large folios here that I
might have missed?

> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
> 
> TODO:
>  - Support mTHP for tmpfs.
>  - Do not split the large folio when share memory swap out.
>  - Can swap in a large folio for share memory.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (8):
>   mm: move highest_order() and next_order() out of the THP config
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  29 ++
>  include/linux/huge_mm.h                    |  35 ++-
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  43 ++-
>  mm/shmem.c                                 | 335 ++++++++++++++++++---
>  5 files changed, 387 insertions(+), 72 deletions(-)
> 
> -- 
> 2.39.3
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 11:39   ` Daniel Gomez
@ 2024-05-08 11:58     ` David Hildenbrand
  2024-05-08 14:28       ` Daniel Gomez
  2024-05-08 19:23       ` Luis Chamberlain
  0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 11:58 UTC (permalink / raw)
  To: Daniel Gomez, Baolin Wang
  Cc: akpm, hughd, willy, ioworker0, wangkefeng.wang, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 13:39, Daniel Gomez wrote:
> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option. By default all sizes will be set to "never"
>> except PMD size, which is set to "inherit". This ensures backward compatibility
>> with the shmem enabled of the top level, meanwhile also allows independent
>> control of shmem enabled for each mTHP.
> 
> I'm trying to understand the adoption of mTHP and how it fits into the adoption
> of (large) folios that the kernel is moving towards. Can you, or anyone involved
> here, explain this? How much do they overlap, and can we benefit from having
> both? Is there any argument against the adoption of large folios here that I
> might have missed?

mTHP are implemented using large folios, just like traditional PMD-sized 
THP are. (you really should explore the history of mTHP and how it all 
works internally)

The biggest challenge with memory that cannot be evicted and reclaimed 
under memory pressure (in contrast to your ordinary files in the 
pagecache) is memory waste, well, and the placement of large chunks of 
memory in general during page faults.

In the worst case (no swap), you allocate a large chunk of memory once 
and it will stick around until freed: no reclaim of that memory.

That's the reason why THP for anonymous memory and SHMEM have toggles to 
manually enable and configure them, in contrast to the pagecache. The 
same was done for mTHP for anonymous memory, and now (anon) shmem follows.

There are plans to, at some point, have it all working automatically, 
but a lot of what that requires for anonymous memory (and similarly for 
shmem) is still missing and unclear.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08  9:02           ` Ryan Roberts
  2024-05-08  9:56             ` Baolin Wang
@ 2024-05-08 12:02             ` David Hildenbrand
  2024-05-08 12:10               ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 12:02 UTC (permalink / raw)
  To: Ryan Roberts, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 11:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
> 
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
> 
> But for shmem_enabled, that means we need the following error checking:
> 
>   - It is an error to set "force" for any size except PMD-size
> 
>   - It is an error to set "force" for the global control if any size except PMD-
>     size is set to "inherit"
> 
>   - It is an error to set "inherit" for any size except PMD-size if the global
>     control is set to "force".
> 
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
> 
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).

Note that with shmem you're inherently facing the same memory waste 
issues etc as you would with anonymous memory. (sometimes even worse, if 
you're running shmem that's configured to be unswappable!).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:02             ` David Hildenbrand
@ 2024-05-08 12:10               ` David Hildenbrand
  2024-05-08 12:43                 ` Ryan Roberts
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 12:10 UTC (permalink / raw)
  To: Ryan Roberts, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 14:02, David Hildenbrand wrote:
> On 08.05.24 11:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>>    - It is an error to set "force" for any size except PMD-size
>>
>>    - It is an error to set "force" for the global control if any size except PMD-
>>      size is set to "inherit"
>>
>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>      control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
> 
> Note that with shmem you're inherently facing the same memory waste
> issues etc as you would with anonymous memory. (sometimes even worse, if
> you're running shmem that's configured to be unswappable!).

Also noting that memory waste is not really a problem when a write to a 
shmem file allocates a large folio that stays within the boundaries of 
that write; issues only pop up if you end up over-allocating, especially 
during page faults, where you have not much of a clue about what to do 
(single address, no real range provided).

There is the other issue of wasting large chunks of contiguous memory 
on stuff that barely benefits from it. With memory that may never get 
evicted, there is no automatic "handing back" of that memory to the 
system to be used by something else. With ordinary files, that's a bit 
different. But I did not look closer into that issue yet; it's one of 
the reasons MADV_HUGEPAGE was added IIRC.
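
For reference, a minimal userspace sketch of the MADV_HUGEPAGE opt-in
mentioned above (the size is illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Ask the kernel to prefer huge pages for this range only. */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	p[0] = 1;	/* the fault path may now allocate a THP here */
	return 0;
}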

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:10               ` David Hildenbrand
@ 2024-05-08 12:43                 ` Ryan Roberts
  2024-05-08 12:44                   ` Ryan Roberts
  2024-05-08 12:45                   ` David Hildenbrand
  0 siblings, 2 replies; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 12:43 UTC (permalink / raw)
  To: David Hildenbrand, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 13:10, David Hildenbrand wrote:
> On 08.05.24 14:02, David Hildenbrand wrote:
>> On 08.05.24 11:02, Ryan Roberts wrote:
>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>> "advise",
>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>> equivalent
>>>>>>>> to 'always' to keep compatibility.
>>>>>>>
>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>
>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>> one.
>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>> just
>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>> Applying
>>>>>>
>>>>>> Agree. This is also our usage of 'inherit'.
>>>>
>>>> Missed that one: there might be use cases in the future once we would start
>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>> THP
>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>> where
>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>
>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>> it consistently now might be cleaner.
>>>
>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>> all sizes to select "inherit" is best.
>>>
>>> But for shmem_enabled, that means we need the following error checking:
>>>
>>>    - It is an error to set "force" for any size except PMD-size
>>>
>>>    - It is an error to set "force" for the global control if any size except
>>> PMD-
>>>      size is set to "inherit"
>>>
>>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>>      control is set to "force".
>>>
>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>> UX from the user's point of view when they start seeing errors.
>>>
>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>> more suited to the approach the page cache takes to transparently ramp up the
>>> folio size as it faults more in. (Just saying there is a chance that this error
>>> checking becomes permanent).
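
Roughly, I'd imagine the checks looking something like the sketch below; 
all of these names are made up for illustration, they are not what the 
patch actually adds:

enum thp_mode { MODE_NEVER, MODE_INHERIT, MODE_ADVISE, MODE_FORCE };

extern enum thp_mode global_mode;       /* hypothetical: global knob */
extern bool any_non_pmd_inherit(void);  /* hypothetical: is any non-PMD
                                         * size set to "inherit"? */

static int check_per_size_write(int order, enum thp_mode new_mode)
{
        if (new_mode == MODE_FORCE && order != PMD_ORDER)
                return -EINVAL;                         /* rule 1 */
        if (new_mode == MODE_INHERIT && order != PMD_ORDER &&
            global_mode == MODE_FORCE)
                return -EINVAL;                         /* rule 3 */
        return 0;
}

static int check_global_write(enum thp_mode new_mode)
{
        if (new_mode == MODE_FORCE && any_non_pmd_inherit())
                return -EINVAL;                         /* rule 2 */
        return 0;
}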
>>
>> Note that with shmem you're inherently facing the same memory waste
>> issues etc as you would with anonymous memory. (sometimes even worse, if
>> you're running shmem that's configured to be unswappable!).
> 
> Also noting that memory waste is not really a problem when a write to a shmem
> file allocates a large folio that stays within the boundaries of that write;
> issues only pop up if you end up over-allocating, especially during page
> faults, where you have little clue about what to do (a single address, no
> real range provided).
> 
> There is the other issue of wasting large chunks of contiguous memory on
> stuff that barely benefits from it. With memory that maybe never gets
> evicted, there is no automatic "handing back" of that memory to the system
> to be used by something else. With ordinary files, that's a bit different.
> But I have not looked closer into that issue yet; it's one of the reasons
> MADV_HUGEPAGE was added, IIRC.

OK, understood. Although with tmpfs you're not going to mmap it and then
randomly extend the file through page faults - mmap doesn't permit that, I
don't think? So presumably the user must explicitly set the size of the file
first? Are you suggesting there are a lot of use cases where a large tmpfs
file is created, mmapped and then only accessed sparsely?
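
For reference, the pattern I have in mind - the file has to be sized 
before it can be faulted through the mapping; a minimal sketch with 
memfd (which should behave like a tmpfs file), error handling omitted:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1ul << 30;                 /* 1 GiB */
        int fd = memfd_create("shmem-demo", 0);

        ftruncate(fd, len);                     /* size set up front */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        p[0] = 1;       /* fine: within i_size */
        /* touching pages beyond EOF would SIGBUS, not grow the file */
        return 0;
}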



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:43                 ` Ryan Roberts
@ 2024-05-08 12:44                   ` Ryan Roberts
  2024-05-08 12:45                   ` David Hildenbrand
  1 sibling, 0 replies; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 12:44 UTC (permalink / raw)
  To: David Hildenbrand, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 13:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>>    - It is an error to set "force" for any size except PMD-size
>>>>
>>>>    - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>>      size is set to "inherit"
>>>>
>>>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>>>      control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within the boundaries of that write;
>> issues only pop up if you end up over-allocating, especially during page
>> faults, where you have little clue about what to do (a single address, no
>> real range provided).
>>
>> There is the other issue of wasting large chunks of contiguous memory on
>> stuff that barely benefits from it. With memory that maybe never gets
>> evicted, there is no automatic "handing back" of that memory to the system
>> to be used by something else. With ordinary files, that's a bit different.
>> But I have not looked closer into that issue yet; it's one of the reasons
>> MADV_HUGEPAGE was added, IIRC.
> 
> OK, understood. Although with tmpfs you're not going to mmap it and then
> randomly extend the file through page faults - mmap doesn't permit that, I
> don't think? So presumably the user must explicitly set the size of the file
> first? Are you suggesting there are a lot of use cases where a large tmpfs
> file is created, mmapped and then only accessed sparsely?

I know that's often the case for anon memory, but I'm not sure if you would
expect the same pattern with an explicit file?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:43                 ` Ryan Roberts
  2024-05-08 12:44                   ` Ryan Roberts
@ 2024-05-08 12:45                   ` David Hildenbrand
  2024-05-08 12:54                     ` Ryan Roberts
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 12:45 UTC (permalink / raw)
  To: Ryan Roberts, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 14:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>>     - It is an error to set "force" for any size except PMD-size
>>>>
>>>>     - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>>       size is set to "inherit"
>>>>
>>>>     - It is an error to set "inherit" for any size except PMD-size if the global
>>>>       control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within the boundaries of that write;
>> issues only pop up if you end up over-allocating, especially during page
>> faults, where you have little clue about what to do (a single address, no
>> real range provided).
>>
>> There is the other issue of wasting large chunks of contiguous memory on
>> stuff that barely benefits from it. With memory that maybe never gets
>> evicted, there is no automatic "handing back" of that memory to the system
>> to be used by something else. With ordinary files, that's a bit different.
>> But I have not looked closer into that issue yet; it's one of the reasons
>> MADV_HUGEPAGE was added, IIRC.
> 
> OK, understood. Although with tmpfs you're not going to mmap it and then
> randomly extend the file through page faults - mmap doesn't permit that, I
> don't think? So presumably the user must explicitly set the size of the file
> first? Are you suggesting there are a lot of use cases where a large tmpfs
> file is created, mmapped and then only accessed sparsely?

I don't know about "a lot of use cases", but for VMs that's certainly 
how it's used.
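
Typical shape, error handling omitted: create and size the whole "guest 
RAM" up front, then let page faults populate only what the guest 
actually touches:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 8ul << 30;                 /* "guest RAM": 8 GiB */
        int fd = memfd_create("vm-ram", 0);

        ftruncate(fd, len);                     /* fully sized, but sparse */
        char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        ram[0] = 1;             /* only the pages the guest touches ... */
        ram[len / 2] = 1;       /* ... actually get backed by memory */
        return 0;
}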

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:45                   ` David Hildenbrand
@ 2024-05-08 12:54                     ` Ryan Roberts
  2024-05-08 13:07                       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 12:54 UTC (permalink / raw)
  To: David Hildenbrand, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 13:45, David Hildenbrand wrote:
> On 08.05.24 14:43, Ryan Roberts wrote:
>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>> interface
>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>> mTHP,
>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>> set to:
>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>> "advise",
>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>> the top
>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>> equivalent
>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>
>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>
>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>> made to
>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>> one.
>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>> just
>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>> where
>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>> Applying
>>>>>>>>
>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>
>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>> THP
>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>> where
>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>
>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>> doing
>>>>>> it consistently now might be cleaner.
>>>>>
>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>> And
>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>> all sizes to select "inherit" is best.
>>>>>
>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>
>>>>>     - It is an error to set "force" for any size except PMD-size
>>>>>
>>>>>     - It is an error to set "force" for the global control if any size except
>>>>> PMD-
>>>>>       size is set to "inherit"
>>>>>
>>>>>     - It is an error to set "inherit" for any size except PMD-size if the
>>>>> global
>>>>>       control is set to "force".
>>>>>
>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>> nicest
>>>>> UX from the user's point of view when they start seeing errors.
>>>>>
>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>> error
>>>>> checking becomes permanent).
>>>>
>>>> Note that with shmem you're inherently facing the same memory waste
>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>> you're running shmem that's configured to be unswappable!).
>>>
>>> Also noting that memory waste is not really a problem when a write to a shmem
>>> file allocates a large folio that stays within the boundaries of that write;
>>> issues only pop up if you end up over-allocating, especially during page
>>> faults, where you have little clue about what to do (a single address, no
>>> real range provided).
>>>
>>> There is the other issue of wasting large chunks of contiguous memory on
>>> stuff that barely benefits from it. With memory that maybe never gets
>>> evicted, there is no automatic "handing back" of that memory to the system
>>> to be used by something else. With ordinary files, that's a bit different.
>>> But I have not looked closer into that issue yet; it's one of the reasons
>>> MADV_HUGEPAGE was added, IIRC.
>>
>> OK, understood. Although with tmpfs you're not going to mmap it and then
>> randomly extend the file through page faults - mmap doesn't permit that, I
>> don't think? So presumably the user must explicitly set the size of the file
>> first? Are you suggesting there are a lot of use cases where a large tmpfs
>> file is created, mmapped and then only accessed sparsely?
> 
> I don't know about "a lot of use cases", but for VMs that's certainly how it's
> used.

Gotcha, thanks. And out of curiosity, what's the benefit of using tmpfs rather
than private (or shared) anonymous memory for VMs?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 12:54                     ` Ryan Roberts
@ 2024-05-08 13:07                       ` David Hildenbrand
  2024-05-08 13:44                         ` Ryan Roberts
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 13:07 UTC (permalink / raw)
  To: Ryan Roberts, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08.05.24 14:54, Ryan Roberts wrote:
> On 08/05/2024 13:45, David Hildenbrand wrote:
>> On 08.05.24 14:43, Ryan Roberts wrote:
>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>> interface
>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>> mTHP,
>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>> set to:
>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>> "advise",
>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>> the top
>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>> equivalent
>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>
>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>>
>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>> made to
>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>>> one.
>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>>> just
>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>> where
>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>> Applying
>>>>>>>>>
>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>
>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>>> THP
>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>> where
>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>
>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>> doing
>>>>>>> it consistently now might be cleaner.
>>>>>>
>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>> And
>>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>>> all sizes to select "inherit" is best.
>>>>>>
>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>
>>>>>>      - It is an error to set "force" for any size except PMD-size
>>>>>>
>>>>>>      - It is an error to set "force" for the global control if any size except
>>>>>> PMD-
>>>>>>        size is set to "inherit"
>>>>>>
>>>>>>      - It is an error to set "inherit" for any size except PMD-size if the
>>>>>> global
>>>>>>        control is set to "force".
>>>>>>
>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>> nicest
>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>
>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>> error
>>>>>> checking becomes permanent).
>>>>>
>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>> you're running shmem that's configured to be unswappable!).
>>>>
>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>> file allocates a large folio that stays within the boundaries of that write;
>>>> issues only pop up if you end up over-allocating, especially during page
>>>> faults, where you have little clue about what to do (a single address, no
>>>> real range provided).
>>>>
>>>> There is the other issue of wasting large chunks of contiguous memory on
>>>> stuff that barely benefits from it. With memory that maybe never gets
>>>> evicted, there is no automatic "handing back" of that memory to the system
>>>> to be used by something else. With ordinary files, that's a bit different.
>>>> But I have not looked closer into that issue yet; it's one of the reasons
>>>> MADV_HUGEPAGE was added, IIRC.
>>>
>>> OK, understood. Although with tmpfs you're not going to mmap it and then
>>> randomly extend the file through page faults - mmap doesn't permit that, I
>>> don't think? So presumably the user must explicitly set the size of the file
>>> first? Are you suggesting there are a lot of use cases where a large tmpfs
>>> file is created, mmapped and then only accessed sparsely?
>>
>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>> used.
> 

There are more details around that and the sparsity (memory ballooning, 
virtio-mem, free page reporting), but they might be a distraction here :) 
I'll note that shmem+THP is known to be problematic with memory 
ballooning.

> Gotcha, thanks. And out of curiosity, what's the benefit of using tmpfs rather
> than private (or shared) anonymous memory for VMs?

The primary use case I know of is sharing VM memory with other processes 
(usually not child processes): DPDK/SPDK and other vhost-user variants 
(such as virtiofs) mmap() all guest memory to access it directly (some 
sort of multi-process hypervisors). They either use real-file-based 
shmem or memfd (essentially the same without a named file) for that.

Then, there is live-hypervisor upgrade, whereby you start a second 
hypervisor process that will take over. People use shmem for that, so 
you can minimize downtime by migrating guest memory simply by mmap'ing 
the shmem file into the new hypervisor.

Shared anonymous memory is basically never used (I only know of one 
corner case in QEMU).

I would assume that there are also DBs making use of rather sparse 
shmem? But I'm no expert on that.
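
As a toy sketch of the sharing part (a real vhost-user setup passes the 
fd to an unrelated process over a unix socket via SCM_RIGHTS; fork() is 
only used here for brevity):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        int fd = memfd_create("guest-mem", 0);

        ftruncate(fd, len);
        char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        if (fork() == 0) {
                /* "the other hypervisor process": same fd, same memory */
                char *peer = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
                peer[0] = 'X';
                _exit(0);
        }
        wait(NULL);
        return mem[0] == 'X' ? 0 : 1;   /* we observe the peer's write */
}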

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-05-08 13:07                       ` David Hildenbrand
@ 2024-05-08 13:44                         ` Ryan Roberts
  0 siblings, 0 replies; 52+ messages in thread
From: Ryan Roberts @ 2024-05-08 13:44 UTC (permalink / raw)
  To: David Hildenbrand, Baolin Wang, akpm, hughd
  Cc: willy, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel

On 08/05/2024 14:07, David Hildenbrand wrote:
> On 08.05.24 14:54, Ryan Roberts wrote:
>> On 08/05/2024 13:45, David Hildenbrand wrote:
>>> On 08.05.24 14:43, Ryan Roberts wrote:
>>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>>> interface
>>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>>> mTHP,
>>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>>> set to:
>>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>>> "advise",
>>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>>> the top
>>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>>> equivalent
>>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>>
>>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>>>
>>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>>> made to
>>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the
>>>>>>>>>>> wrong
>>>>>>>>>>> one.
>>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit
>>>>>>>>>>> (this is
>>>>>>>>>>> just
>>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>>> where
>>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>>> Applying
>>>>>>>>>>
>>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>>
>>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable
>>>>>>>> any
>>>>>>>> THP
>>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>>> where
>>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>>
>>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>>> doing
>>>>>>>> it consistently now might be cleaner.
>>>>>>>
>>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>>> And
>>>>>>> yes, I had forgotten about the glorious "auto" future. So probably
>>>>>>> continuing
>>>>>>> all sizes to select "inherit" is best.
>>>>>>>
>>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>>
>>>>>>>      - It is an error to set "force" for any size except PMD-size
>>>>>>>
>>>>>>>      - It is an error to set "force" for the global control if any size
>>>>>>> except
>>>>>>> PMD-
>>>>>>>        size is set to "inherit"
>>>>>>>
>>>>>>>      - It is an error to set "inherit" for any size except PMD-size if the
>>>>>>> global
>>>>>>>        control is set to "force".
>>>>>>>
>>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>>> nicest
>>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>>
>>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to
>>>>>>> "force".
>>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it
>>>>>>> would be
>>>>>>> more suited to the approach the page cache takes to transparently ramp up
>>>>>>> the
>>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>>> error
>>>>>>> checking becomes permanent).
>>>>>>
>>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>>> you're running shmem that's configured to be unswappable!).
>>>>>
>>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>>> file allocates a large folio that stays within the boundaries of that write;
>>>>> issues only pop up if you end up over-allocating, especially during page
>>>>> faults, where you have little clue about what to do (a single address, no
>>>>> real range provided).
>>>>>
>>>>> There is the other issue of wasting large chunks of contiguous memory on
>>>>> stuff that barely benefits from it. With memory that maybe never gets
>>>>> evicted, there is no automatic "handing back" of that memory to the system
>>>>> to be used by something else. With ordinary files, that's a bit different.
>>>>> But I have not looked closer into that issue yet; it's one of the reasons
>>>>> MADV_HUGEPAGE was added, IIRC.
>>>>
>>>> OK, understood. Although with tmpfs you're not going to mmap it and then
>>>> randomly extend the file through page faults - mmap doesn't permit that, I
>>>> don't think? So presumably the user must explicitly set the size of the file
>>>> first? Are you suggesting there are a lot of use cases where a large tmpfs
>>>> file is created, mmapped and then only accessed sparsely?
>>>
>>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>>> used.
>>
> 
> There are more details around that and the sparsity (memory ballooning,
> virtio-mem, free page reporting), but it might distract here :) I'll note that
> shmem+THP is known to be problematic with memory ballooning.
> 
>> Gotcha, thanks. And out of curiosity, what's the benefit of using tmpfs rather
>> than private (or shared) anonymous memory for VMs?
> 
> The primary use case I know of is sharing VM memory with other processes
> (usually not child processes): DPDK/SPDK and other vhost-user variants (such as
> virtiofs) mmap() all guest memory to access it directly (some sort of
> multi-process hypervisors). They either use real-file-based shmem or memfd
> (essentially the same without a named file) for that.
> 
> Then, there is live-hypervisor upgrade, whereby you start a second hypervisor
> process that will take over. People use shmem for that, so you can minimize
> downtime by migrating guest memory simply by mmap'ing the shmem file into the
> new hypervisor.
> 
> Shared anonymous memory is basically never used (I only know of one corner case
> in QEMU).
> 
> I would assume that there are also DBs making use of rather sparse shmem? But no
> expert on that.
> 

That all makes sense - thanks for the lesson!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 11:58     ` David Hildenbrand
@ 2024-05-08 14:28       ` Daniel Gomez
  2024-05-08 17:03         ` David Hildenbrand
  2024-05-09  3:08         ` Baolin Wang
  2024-05-08 19:23       ` Luis Chamberlain
  1 sibling, 2 replies; 52+ messages in thread
From: Daniel Gomez @ 2024-05-08 14:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Baolin Wang, akpm, hughd, willy, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, linux-mm,
	linux-kernel

On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > 
> > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > configured through the sysfs interface, and can only use the PMD-mapped
> > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > 
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> > 
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
> 
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are. (you really should explore the history of mTHP and how it all works
> internally)

I'll look into the code in more depth. By any chance, are any of you going to
be at LSFMM this year? I have this session [1] scheduled for Wednesday, and it
would be nice to get your feedback on it, and to hear whether you see this
working together with mTHP/THP.

[1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/

> 
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
> 
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.

I can see that path being triggered by some fstests but only for THP (where we
can actually reclaim memory).

> 
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
> 
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.

Thanks.

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 14:28       ` Daniel Gomez
@ 2024-05-08 17:03         ` David Hildenbrand
  2024-05-09 19:18           ` Daniel Gomez
  2024-05-09  3:08         ` Baolin Wang
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-08 17:03 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Baolin Wang, akpm, hughd, willy, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, linux-mm,
	linux-kernel

On 08.05.24 16:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
> 
> I'll look into the code in more depth. By any chance, are any of you going to
> be at LSFMM this year? I have this session [1] scheduled for Wednesday, and it
> would be nice to get your feedback on it, and to hear whether you see this
> working together with mTHP/THP.
>

I'll be around and will attend that session! But note that I am still 
scratching my head about what to do with "ordinary" shmem, especially 
because of the weird way shmem behaves in contrast to real files 
(below). Some input from Hugh might be very helpful.

Example: you write() to a shmem file and populate a 2M THP. Then nobody 
touches that file for a long time. There are certainly other mmap() 
users that could benefit more from that THP ... and without swap, that 
THP will be trapped there, possibly for a long time (unless I am missing 
an important piece of shmem THP design :) )? Sure, if we only have THPs 
it's nice; that's just not the reality, unfortunately. IIRC, that's one 
of the reasons why THP for shmem can be enabled/disabled. But again, 
still scratching my head ...
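
(Something as small as the sketch below can pin such a THP, assuming a 
tmpfs mount with huge pages enabled and no swap; the mount point is made 
up and error handling is omitted:)

#include <fcntl.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        static char buf[2ul << 20];     /* one PMD-sized write: 2 MiB */

        /* hypothetical mount point; needs CAP_SYS_ADMIN */
        mount("tmpfs", "/mnt/thp", "tmpfs", 0, "huge=always");

        int fd = open("/mnt/thp/file", O_CREAT | O_RDWR, 0600);
        write(fd, buf, sizeof(buf));    /* can allocate a 2M THP ... */
        close(fd);
        /* ... which then sticks around until the file is truncated or
         * deleted: without swap, nothing will reclaim it. */
        return 0;
}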


Note that this patch set only tackles anonymous shmem 
(MAP_SHARED|MAP_ANON), which is in 99.999% of all cases only accessed 
via page tables (memory allocated during page faults). I think there are 
ways to grab the fd (/proc/self/fd), but IIRC only corner cases 
read/write it that way.

So in that sense, anonymous shmem (this patch set) behaves mostly like 
ordinary anonymous memory, and likely there is not much overlap with the 
"allocate large folios during read/write/fallocate" work as in [1]. 
Swap might have an overlap.
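
(For reference, the anonymous shmem flavor this series targets is 
literally just the following - no fd, no name, populated purely via page 
faults:)

#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        char *mem = mmap(NULL, 1ul << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (fork() == 0) {
                mem[0] = 'X';           /* allocated by this page fault */
                _exit(0);
        }
        wait(NULL);
        return mem[0] == 'X' ? 0 : 1;   /* shared with the child */
}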


The real confusion begins when we have ordinary shmem: some users never 
mmap it and only read/write, some users never read/write it and only 
mmap it, and some (less common?) users do both.

And shmem really is special: it looks like "just another file", but 
memory-consumption- and reclaim-wise it behaves just like anonymous 
memory. It might be swappable ("usually very limited backing disk space 
available") or it might not.

In a subthread here we are discussing what to do with that special 
"shmem_enabled = force" mode ... and it's all complicated, I think.

> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> 
>>
>> The biggest challenge with memory that cannot be evicted on memory pressure
>> to be reclaimed (in contrast to your ordinary files in the pagecache) is
>> memory waste, well, and placement of large chunks of memory in general,
>> during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
> 
> I can see that path being triggered by some fstests but only for THP (where we
> can actually reclaim memory).

Is that when we punch-hole a partial THP and split it? I'd be interested 
in what that test does.
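
(I'd guess the pattern is roughly the sketch below - punching a sub-THP 
hole only frees pages if the large folio gets split first; sizes picked 
arbitrarily, error handling omitted:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2ul << 20;                 /* one PMD-size worth */
        int fd = memfd_create("thp-split", 0);

        ftruncate(fd, len);
        fallocate(fd, 0, 0, len);       /* populate; a THP if enabled */

        /* partial punch: the THP must be split so the 64K can be freed */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 64 * 1024);
        return 0;
}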



-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 11:58     ` David Hildenbrand
  2024-05-08 14:28       ` Daniel Gomez
@ 2024-05-08 19:23       ` Luis Chamberlain
  2024-05-09 17:48         ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Luis Chamberlain @ 2024-05-08 19:23 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox, Christoph Lameter,
	Christoph Hellwig, Dave Chinner
  Cc: Daniel Gomez, Baolin Wang, akpm, hughd, ioworker0,
	wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts, shy828301,
	ziy, linux-mm, linux-kernel, Linux FS Devel

On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> > 
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
> 
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are.
> 
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
> 
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.
> 
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
> 
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.

Whereas the use of large folios in filesystems is already automatic, so
long as the filesystem supports it. We already do this in the readahead
and write paths for iomap: we opportunistically use large folios if we
can, otherwise we fall back to smaller folios.

So an approach recommended by Matthew was to use the readahead and write
paths, just as iomap does, to determine the size of the folio to use [0].
The use of large folios would then be automatic and not require any
knobs at all.
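
Conceptually the opportunistic part is just "try big, fall back"; a
sketch of the idea, not the actual iomap/readahead code:

#include <linux/gfp.h>

/* Allocate the largest folio we can for this spot in the file, falling
 * back to smaller orders under memory pressure, order-0 as last resort. */
static struct folio *alloc_biggest_folio(unsigned int max_order)
{
        unsigned int order;

        for (order = max_order; order > 0; order--) {
                struct folio *folio;

                folio = folio_alloc(GFP_KERNEL | __GFP_NORETRY, order);
                if (folio)
                        return folio;
        }
        return folio_alloc(GFP_KERNEL, 0);
}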

The mTHP approach would grow "THP" use in filesystems through the one
and only filesystem that uses THP. Meanwhile, the use of large folios is
already automatic with the approach taken by iomap.

We're at a crux, and it raises the question of whether we should
continue to chug on with tmpfs being special and doing things
differently, extending the old THP interface with mTHP, or whether it
should just use large folios, following the same approach as iomap.

From my perspective, the more shared code the better, and the more
shared paths the better. There is a chance to help test swap with large
folios instead of splitting the folios for swap, and that could be done
first with tmpfs. I have not evaluated the difference in testing, or how
we could get the most out of shared code if we take the mTHP approach or
the iomap approach for tmpfs; that should be considered.

Are there other things to consider? Does this require some dialog at
LSFMM?

[0] https://lore.kernel.org/all/ZHD9zmIeNXICDaRJ@casper.infradead.org/

  Luis

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
  2024-05-08 10:47           ` Ryan Roberts
@ 2024-05-09  1:10             ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-09  1:10 UTC (permalink / raw)
  To: Ryan Roberts, akpm, hughd
  Cc: willy, david, ioworker0, wangkefeng.wang, ying.huang, 21cnbao,
	shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 18:47, Ryan Roberts wrote:
> On 08/05/2024 10:31, Baolin Wang wrote:
>>
>>
>> On 2024/5/8 16:53, Ryan Roberts wrote:
>>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> Add large folio mapping establishment support for finish_fault() as a
>>>>>> preparation,
>>>>>> to support multi-size THP allocation of anonymous shmem pages in the following
>>>>>> patches.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>>     mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>>     1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index eea6e4984eae..936377220b77 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>     {
>>>>>>         struct vm_area_struct *vma = vmf->vma;
>>>>>>         struct page *page;
>>>>>> +    struct folio *folio;
>>>>>>         vm_fault_t ret;
>>>>>>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>>                   !(vma->vm_flags & VM_SHARED);
>>>>>> +    int type, nr_pages, i;
>>>>>> +    unsigned long addr = vmf->address;
>>>>>>           /* Did we COW the page? */
>>>>>>         if (is_cow)
>>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>                 return VM_FAULT_OOM;
>>>>>>         }
>>>>>>     +    folio = page_folio(page);
>>>>>> +    nr_pages = folio_nr_pages(folio);
>>>>>> +
>>>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>>>> +        nr_pages = 1;
>>>>>> +    } else if (nr_pages > 1) {
>>>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>>> +
>>>>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>>>>> +        addr = max(start, vma->vm_start);
>>>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>>
>>>>> I still don't really follow the logic in this else if block. Isn't it possible
>>>>> that finish_fault() gets called with a page from a folio that isn't aligned
>>>>> with
>>>>> vmf->address?
>>>>>
>>>>> For example, let's say we have a file who's size is 64K and which is cached
>>>>> in a
>>>>> single large folio in the page cache. But the file is mapped into a process at
>>>>> VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
>>>>> calculate
>>>>
>>>> For shmem, this doesn't happen because the VA is aligned with the hugepage size
>>>> in the shmem_get_unmapped_area() function. See patch 7.
>>>
>>> Certainly agree that shmem can always make sure that it packs a vma in a way
>>> such that its folios are naturally aligned in VA when faulting in memory. If you
>>> mremap it, that alignment will be lost; I don't think that would be a problem
>>
>> When mremap it, it will also call shmem_get_unmapped_area() to align the VA, but
>> for mremap() with MAP_FIXED flag as David pointed out, yes, this patch may be
>> not work perfectly.
> 
> Assuming this works similarly to anon mTHP, remapping to an arbitrary address
> shouldn't be a problem within a single process; the previously allocated folios
> will now be unaligned, but they will be correctly mapped so it doesn't break
> anything. And new faults will allocate folios so that they are as large as
> allowed by the sysfs interface AND which do not overlap with any non-none pte
> AND which are naturally aligned. It's when you start sharing with other
> processes that the fun and games start...
> 
>>
>>> for a single process; mremap will take care of moving the ptes correctly and
>>> this path is not involved.
>>>
>>> But what about the case when a process mmaps a shmem region, then forks, then
>>> the child mremaps the shmem region. Then the parent faults in a THP into the
>>> region (nicely aligned). Then the child faults in the same offset in the region
>>> and gets the THP that the parent allocated; that THP will be aligned in the
>>> parent's VM space but not in the child's.
>>
>> Sorry, I did not get your point here. IIUC, the child's VA will also be aligned
>> if the child mremap is not set MAP_FIXED, since the child's mremap will still
>> call shmem_get_unmapped_area() to find an aligned new VA.
> 
> In general, you shouldn't be relying on the vma bounds being aligned to a THP
> boundary.
> 
>> Please correct me if I missed your point.
> 
> (I'm not 100% sure this is definitely how it works, but seems the only sane way
> to me):
> 
> Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:
> 
>    mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
> 
> Then it forks a child, and the child moves the mapping to VA=68K:
> 
>    mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
> 
> Then the parent writes to address 64K (offset 0 in the shared region); this will
> fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
> at 64K-80K in the parent.
> 
> Then the child reads address 68K (offset 0 in the shared region); this will
> fault and cause the previously allocated 16K folio to be looked up and it must
> be mapped in the child between 68K-84K. This is not naturally aligned in the child.
> 
> For the child, your code will incorrectly calculate start/end as 64K-80K.

OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it 
will not be aligned in the child in this case. Thanks for the explanation.
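
(For the record, the problematic sequence is easy to reproduce; a sketch 
of Ryan's example, error handling omitted:)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 16 * 1024;
        /* parent: 16K of shared anon at VA 64K (naturally aligned) */
        char *p = mmap((void *)0x10000, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        if (fork() == 0) {
                /* child: move it to VA 68K - no longer 16K-aligned */
                mremap(p, len, len, MREMAP_FIXED | MREMAP_MAYMOVE,
                       (void *)0x11000);
                _exit(0);
        }
        p[0] = 1;       /* parent-side fault: may allocate an aligned 16K
                         * folio that the child then maps unaligned */
        return 0;
}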

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 14:28       ` Daniel Gomez
  2024-05-08 17:03         ` David Hildenbrand
@ 2024-05-09  3:08         ` Baolin Wang
  1 sibling, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2024-05-09  3:08 UTC (permalink / raw)
  To: Daniel Gomez, David Hildenbrand
  Cc: akpm, hughd, willy, ioworker0, wangkefeng.wang, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, linux-mm, linux-kernel



On 2024/5/8 22:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
> 
> I'll check the code in more depth. By any chance, are any of you going to be at
> LSFMM this year? I have this session [1] scheduled for Wednesday, and it would
> be nice to get your feedback on it and to hear whether you see this working
> together with mTHP/THP.
> 
> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/

Great. I'm also interested in tmpfs support for large folios (or mTHP), 
so please CC me if you plan to send a new version.

As David mentioned, this patchset is mainly about adding mTHP support 
for anonymous shmem, and I think some of the work on swap support for 
large folios could be shared.


* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 19:23       ` Luis Chamberlain
@ 2024-05-09 17:48         ` David Hildenbrand
  2024-05-10 18:53           ` Luis Chamberlain
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2024-05-09 17:48 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox, Christoph Lameter,
	Christoph Hellwig, Dave Chinner
  Cc: Daniel Gomez, Baolin Wang, akpm, hughd, ioworker0,
	wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts, shy828301,
	ziy, linux-mm, linux-kernel, Linux FS Devel

On 08.05.24 21:23, Luis Chamberlain wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are.
>>
>> The biggest challenge with memory that cannot be evicted and reclaimed under
>> memory pressure (in contrast to ordinary files in the pagecache) is memory
>> waste, and the placement of large chunks of memory in general during page
>> faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
>>
>> That's the reason why THP for anonymous memory and SHMEM have toggles to
>> manually enable and configure them, in contrast to the pagecache. The same
>> was done for mTHP for anonymous memory, and now (anon) shmem follows.
>>
>> There are plans to have it all working automatically at some point, but a
>> lot of what that requires for anonymous memory (and similarly for shmem) is
>> still missing and unclear.
> 
> Whereas the use of large folios for filesystems is already automatic,
> so long as the filesystem supports it. We already do this in the readahead
> and write paths for iomap: we opportunistically use large folios if we can,
> otherwise we use smaller folios.
> 
> So the approach recommended by Matthew was to use the readahead and write
> paths, just as iomap does, to determine the size of the folio to use [0].
> The use of large folios would also be automatic and not require any
> knobs at all.

Yes, I remember discussing that with Willy at some point, including why 
shmem is unfortunately a bit more "special", because you might not even 
have a disk backend ("swap") at all where you could easily reclaim memory.

In the extreme form, you can consider SHMEM as memory that might always be 
mlocked, even without the user requiring special mlock limits ...

> 
> The mTHP approach would grow the "THP" use in filesystems through the
> only filesystem that uses THP. Meanwhile, the use of large folios is already
> automatic with the approach taken by iomap.

Yes, it's an extension of the existing shmem_enabled (which -- I'm afraid -- 
was added for good reasons).

> 
> We're at a crux where it does beg the question if we should continue to
> chug on with tmpfs being special and doing things differently extending
> the old THP interface with mTHP, or if it should just use large folios
> using the same approach as iomap did.

I'm afraid shmem will remain to some degree special. Fortunately it's 
not alone, hugetlbfs is even more special ;)

> 
> From my perspective the more shared code the better, and the more shared
> paths the better. There is a chance to help test swap with large folios
> instead of splitting the folios for swap, and that could be done
> first with tmpfs. I have not evaluated the difference in testing, or how
> we could get the most shared code out of a mTHP approach versus the
> iomap approach for tmpfs; that should be considered.

I don't have a clear picture yet of what might be best for ordinary 
shmem (IOW, not MAP_SHARED|MAP_ANONYMOUS), and I'm afraid there is no 
easy answer.

As long as we don't end up wasting memory, it's not obviously bad. But 
some things might be tricky (see my example about large folios getting 
stranded in shmem and never being reclaimed and reused for better 
purposes).

I'll note that mTHP really is just (supposed to be) a user interface to 
enable the various folio sizes (well, and to expose better per-size 
stats), not more.

From that point of view, it's just a filter. Enable all, and you get 
the same behavior as you likely would in the pagecache mode.

From a shared-code and testing point of view, there really wouldn't be 
a lot of differences. Again, essentially just a filter.
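
To make the "filter" concrete: the per-size knobs end up as a bitmask of
enabled orders, and the fault path simply walks it largest-first. A
self-contained sketch, modelled on the highest_order()/next_order() helpers
that patch 1 exposes (the sysfs plumbing and the actual allocation fallback
are simplified away here):

#include <stdio.h>

#define BIT(n) (1UL << (n))

/* Userspace mirrors of the kernel helpers from patch 1: pick the
 * highest enabled order, then clear it and move to the next one. */
static int highest_order(unsigned long orders)
{
	return orders ? 63 - __builtin_clzl(orders) : -1;
}

static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~BIT(prev);
	return highest_order(*orders);
}

int main(void)
{
	/* Suppose sysfs enabled 2M (order 9), 64K (order 4) and 16K
	 * (order 2) on a 4K-page system: try them largest-first. */
	unsigned long orders = BIT(9) | BIT(4) | BIT(2);
	int order;

	for (order = highest_order(orders); orders;
	     order = next_order(&orders, order))
		printf("try to allocate an order-%d folio (%lu KB)\n",
		       order, 4UL << order);
	return 0;
}

Enable every order and the walk degenerates to "largest folio that fits",
which is the pagecache-mode behavior.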


> 
> Are there other things to consider? Does this require some dialog at
> LSFMM?

As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss. 
I'm also not a SHMEM expert, so I'm hoping at some point we'd get 
feedback from Hugh.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-08 17:03         ` David Hildenbrand
@ 2024-05-09 19:18           ` Daniel Gomez
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Gomez @ 2024-05-09 19:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Baolin Wang, akpm, hughd, willy, ioworker0, wangkefeng.wang,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, linux-mm,
	linux-kernel

On Wed, May 08, 2024 at 07:03:57PM +0200, David Hildenbrand wrote:
> On 08.05.24 16:28, Daniel Gomez wrote:
> > On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> > > On 08.05.24 13:39, Daniel Gomez wrote:
> > > > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > > > 
> > > > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > > > configured through the sysfs interface, and can only use the PMD-mapped
> > > > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > > > 
> > > > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > > > which can have all the same values as the top-level
> > > > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > > > additional "inherit" option. By default all sizes will be set to "never"
> > > > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > > > with the shmem enabled of the top level, meanwhile also allows independent
> > > > > control of shmem enabled for each mTHP.
> > > > 
> > > > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > > > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > > > here, explain this? How much do they overlap, and can we benefit from having
> > > > both? Is there any argument against the adoption of large folios here that I
> > > > might have missed?
> > > 
> > > mTHP are implemented using large folios, just like traditional PMD-sized THP
> > > are. (you really should explore the history of mTHP and how it all works
> > > internally)
> > 
> > I'll check the code in more depth. By any chance, are any of you going to be at
> > LSFMM this year? I have this session [1] scheduled for Wednesday, and it would
> > be nice to get your feedback on it and to hear whether you see this working
> > together with mTHP/THP.
> > 
> 
> I'll be around and will attend that session! But note that I am still
> scratching my head about what to do with "ordinary" shmem, especially because
> of the weird way shmem behaves in contrast to real files (below). Some input
> from Hugh might be very helpful.

I'm looking forward to meeting you there and getting your feedback!

> 
> Example: you write() to a shmem file and populate a 2M THP. Then, nobody
> touches that file for a long time. There are certainly other mmap() users
> that could benefit more from that THP ... and without swap, that THP will
> be trapped there, possibly for a long time (unless I am missing an important
> piece of shmem THP design :) )? Sure, if we only have THPs it's nice;
> that's just not the reality, unfortunately. IIRC, that's one of the reasons
> why THP for shmem can be enabled/disabled. But again, still scratching my
> head ...
> 
> 
> Note that this patch set only tackles anonymous shmem (MAP_SHARED|MAP_ANON),
> which is in 99.999% of all cases only accessed via page tables (memory
> allocated during page faults). I think there are ways to grab the fd
> (/proc/self/fd), but IIRC only corner cases read/write that.
> 
> So in that sense, anonymous shmem (this patch set) behaves mostly like
> ordinary anonymous memory, and likely there is not much overlap with other
> "allocate large folios during read/write/fallocate" work as in [1]. Swap
> might have an overlap.
> 
> 
> The real confusion begins when we have ordinary shmem: some users never mmap
> it and only read/write, some users never read/write it and only mmap it and
> some (less common?) users do both.
> 
> And shmem really is special: it looks like "just another file", but
> memory-consumption and reclaim wise it behaves just like anonymous memory.
> It might be swappable ("usually very limited backing disk space available")
> or it might not.
> 
> In a subthread here we are discussing what to do with that special
> "shmem_enabled = force" mode ... and it's all complicated I think.
> 
> > [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> > 
> > > 
> > > The biggest challenge with memory that cannot be evicted and reclaimed
> > > under memory pressure (in contrast to ordinary files in the pagecache)
> > > is memory waste, and the placement of large chunks of memory in general
> > > during page faults.
> > > 
> > > In the worst case (no swap), you allocate a large chunk of memory once and
> > > it will stick around until freed: no reclaim of that memory.
> > 
> > I can see that path being triggered by some fstests but only for THP (where we
> > can actually reclaim memory).
> 
> Is that when we punch-hole a partial THP and split it? I'd be interested in
> what that test does.

The reclaim path I'm referring to is triggered when we reach max capacity
(-ENOSPC) in shmem_alloc_and_add_folio(). We reclaim space by splitting large
folios (regardless of whether they are dirty or uptodate).

One of the tests that hits this path is generic/100 (with the huge option
enabled):
- First, it creates a directory structure in $TEMP_DIR (/tmp). The directory
size is around 26M.
- Then, it tars it up into $TEMP_DIR/temp.tar.
- Finally, it untars the archive into $TEST_DIR (/media/test, which is the
huge tmpfs mount dir). What happens in generic/100 under the huge=always case
is that you fill up the dedicated space very quickly (this is 1G in xfstests
for tmpfs) and then you start reclaiming.
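
For anyone who wants to poke at this path without running xfstests, the
trigger can be approximated with a few lines of C; the mount point and size
below are assumptions matching the xfstests tmpfs setup (something like
mount -t tmpfs -o size=1g,huge=always tmpfs /media/test), not anything the
test itself ships:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];	/* write 1M at a time */
	int fd = open("/media/test/fill", O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	for (;;) {
		ssize_t n = write(fd, buf, sizeof(buf));

		if (n < 0 && errno == ENOSPC) {
			/* From here on, further shmem allocations have to
			 * split large folios to reclaim space. */
			puts("hit -ENOSPC on the huge tmpfs");
			break;
		}
		if (n <= 0)
			break;
	}
	close(fd);
	return 0;
}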

> 
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


* Re: [PATCH 0/8] add mTHP support for anonymous shmem
  2024-05-09 17:48         ` David Hildenbrand
@ 2024-05-10 18:53           ` Luis Chamberlain
  0 siblings, 0 replies; 52+ messages in thread
From: Luis Chamberlain @ 2024-05-10 18:53 UTC (permalink / raw)
  To: David Hildenbrand, Hugh Dickins
  Cc: Matthew Wilcox, Christoph Lameter, Christoph Hellwig,
	Dave Chinner, Daniel Gomez, Baolin Wang, akpm, hughd, ioworker0,
	wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts, shy828301,
	ziy, linux-mm, linux-kernel, Linux FS Devel

On Thu, May 09, 2024 at 07:48:46PM +0200, David Hildenbrand wrote:
> On 08.05.24 21:23, Luis Chamberlain wrote:
> > From my perspective the more shared code the better, and the more shared
> > paths the better. There is a chance to help test swap with large folios
> > instead of splitting the folios for swap, and that could be done
> > first with tmpfs. I have not evaluated the difference in testing, or how
> > we could get the most shared code out of a mTHP approach versus the
> > iomap approach for tmpfs; that should be considered.
> 
> I don't have a clear picture yet of what might be best for ordinary shmem
> (IOW, not MAP_SHARED|MAP_ANONYMOUS), and I'm afraid there is no easy answer.

OK, so it sounds like the different options need to be thought out and
reviewed.

> As long as we don't end up wasting memory, it's not obviously bad.

Sure.

> But some
> things might be tricky (see my example about large folios getting stranded
> in shmem and never being reclaimed and reused for better purposes).

Where is that stated BTW? Could that be resolved?

> I'll note that mTHP really is just (supposed to be) a user interface to
> enable the various folio sizes (well, and to expose better per-size stats),
> not more.

Sure, but given that filesystems using large folios don't have silly APIs for
selecting which large folio sizes to enable, it just seems odd for tmpfs to
take a different approach.
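
To illustrate how little API there is on the filesystem side, the whole
opt-in is typically a single call at inode setup time. A minimal sketch
(mapping_set_large_folios() is the real pagemap helper; the surrounding
example_fs_init_inode() function is a placeholder, not code from any
particular filesystem):

#include <linux/pagemap.h>

/* Illustrative inode-setup hook: once the address_space is marked
 * large-folio capable, the readahead and write paths pick folio
 * sizes automatically, with no per-size knobs involved. */
static void example_fs_init_inode(struct inode *inode)
{
	mapping_set_large_folios(inode->i_mapping);
}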

> From that point of view, it's just a filter. Enable all, and you get the
> same behavior as you likely would in the pagecache mode.

Which begs the question: *why* have an API just to constrain usage to certain
large folio sizes, diverging from what filesystems are doing with large
folios?

> > Are there other things to consider? Does this require some dialog at
> > LSFMM?
> 
> As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss. I'm
> also not a SHMEM expert, so I'm hoping at some point we'd get feedback from
> Hugh.

Hugh, will you be at LSFMM?

  Luis


end of thread

Thread overview: 52+ messages
2024-05-06  8:46 [PATCH 0/8] add mTHP support for anonymous shmem Baolin Wang
2024-05-06  8:46 ` [PATCH 1/8] mm: move highest_order() and next_order() out of the THP config Baolin Wang
2024-05-07 10:21   ` Ryan Roberts
2024-05-08  2:13     ` Baolin Wang
2024-05-08  9:06       ` Ryan Roberts
2024-05-08  9:40         ` Baolin Wang
2024-05-06  8:46 ` [PATCH 2/8] mm: memory: extend finish_fault() to support large folio Baolin Wang
2024-05-07 10:37   ` Ryan Roberts
2024-05-08  3:44     ` Baolin Wang
2024-05-08  7:15       ` David Hildenbrand
2024-05-08  9:06         ` Baolin Wang
2024-05-08  8:53       ` Ryan Roberts
2024-05-08  9:31         ` Baolin Wang
2024-05-08 10:47           ` Ryan Roberts
2024-05-09  1:10             ` Baolin Wang
2024-05-06  8:46 ` [PATCH 3/8] mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio() Baolin Wang
2024-05-06  8:46 ` [PATCH 4/8] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
2024-05-06  8:46 ` [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
2024-05-07 10:52   ` Ryan Roberts
2024-05-08  4:45     ` Baolin Wang
2024-05-08  7:08       ` David Hildenbrand
2024-05-08  7:12         ` David Hildenbrand
2024-05-08  9:02           ` Ryan Roberts
2024-05-08  9:56             ` Baolin Wang
2024-05-08 10:48               ` Ryan Roberts
2024-05-08 12:02             ` David Hildenbrand
2024-05-08 12:10               ` David Hildenbrand
2024-05-08 12:43                 ` Ryan Roberts
2024-05-08 12:44                   ` Ryan Roberts
2024-05-08 12:45                   ` David Hildenbrand
2024-05-08 12:54                     ` Ryan Roberts
2024-05-08 13:07                       ` David Hildenbrand
2024-05-08 13:44                         ` Ryan Roberts
2024-05-06  8:46 ` [PATCH 6/8] mm: shmem: add mTHP support " Baolin Wang
2024-05-07 10:46   ` kernel test robot
2024-05-08  6:03     ` Baolin Wang
2024-05-06  8:46 ` [PATCH 7/8] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
2024-05-06  8:46 ` [PATCH 8/8] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-05-06 10:54 ` [PATCH 0/8] add mTHP support " Lance Yang
2024-05-07  1:47   ` Baolin Wang
2024-05-07  6:50     ` Lance Yang
2024-05-07 10:20 ` Ryan Roberts
2024-05-08  5:45   ` Baolin Wang
     [not found] ` <CGME20240508113934eucas1p13a3972f3f9955365f40155e084a7c7d5@eucas1p1.samsung.com>
2024-05-08 11:39   ` Daniel Gomez
2024-05-08 11:58     ` David Hildenbrand
2024-05-08 14:28       ` Daniel Gomez
2024-05-08 17:03         ` David Hildenbrand
2024-05-09 19:18           ` Daniel Gomez
2024-05-09  3:08         ` Baolin Wang
2024-05-08 19:23       ` Luis Chamberlain
2024-05-09 17:48         ` David Hildenbrand
2024-05-10 18:53           ` Luis Chamberlain
